Memory Transpose PL Kernel - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

The memory transpose PL kernel writes input samples arriving from the front-end transforms in row-major order and then delivers them to the back-end transforms by reading in column-major order. This reordering must be sustained via streaming over five I/O streams at full speed to achieve the overall 2.0 Gsps throughput target. The key challenge for this kernel is that we must partition the $256 \times 256$ data cube by a factor of 5 and read/write 10 samples per PL clock cycle (where we assume the PL is clocked at 312.5 MHz, or 4 times slower than the AI Engine array).

We illustrate the design concept used by the memory transpose PL kernel with a smaller $16 \times 16$ example shown in the figure below. Note how we have zero-padded the 2D array with four extra rows at the bottom and four extra columns on the right. The resulting $20 \times 20$ data cube is evenly divisible by the 5 I/O streams we wish to use concurrently. The entire data cube is partitioned into 5 separate banks, each containing 80 samples and identified by a unique color in the figure below. Note how we can write the memory from left to right into consecutive rows at the same time with no bank contention: each write occurs into a different color. Similarly, we can read the memory from top to bottom from consecutive columns at the same time with no bank contention. This scheme allows us to establish sustained bandwidth at 2 Gsps to feed the five instances in each of the front-end and back-end AI Engine subgraphs. The memory transpose PL kernel uses this same concept for its larger $256 \times 256$ size. Also note that because we must read and write 10 samples per PL cycle in total, each memory must be dual-ported.

Figure 5: Bank partitioning of the zero-padded $20 \times 20$ example

We implement the memory transpose PL kernel using HLS at 312.5 MHz. The resource utilization and timing from out-of-context synthesis and place-and-route are given in the figure below.

Figure 6: Resource utilization and timing from out-of-context synthesis and place-and-route