The memory transpose PL kernel writes the input samples from the front-end transforms into memory in row-major order, then reads them back in column-major order for delivery to the back-end transforms. It must sustain this over five I/O streams running at full speed to meet the overall 2.0 GSPS throughput target. The key challenge for this kernel is partitioning the \(256 \times 256\) data cube by a factor of five while reading and writing 10 samples per PL clock cycle (the PL is clocked at 312.5 MHz, four times slower than the AI Engine array).
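The arithmetic behind these numbers can be checked with a short sketch. It is illustrative only: the function and constant names are hypothetical, the figure of two samples per stream per cycle is simply 10 samples divided over five streams, and rounding the padded dimension up to the next multiple of five is an assumption drawn from the way the \(16 \times 16\) example is padded to \(20 \times 20\).

```python
import math

NUM_STREAMS = 5                    # I/O streams used concurrently (from the text)
SAMPLES_PER_STREAM_PER_CYCLE = 2   # 10 samples per PL cycle spread over 5 streams
PL_CLOCK_HZ = 312.5e6              # PL clock (from the text)


def padded_dim(n, streams=NUM_STREAMS):
    """Round n up to the next multiple of `streams` (assumed padding rule)."""
    return math.ceil(n / streams) * streams


def samples_per_bank(n, streams=NUM_STREAMS):
    """Samples stored in each bank once the padded cube is split evenly."""
    p = padded_dim(n, streams)
    return (p * p) // streams


# Peak sample rate the PL side can move if every stream is busy every cycle.
peak_rate_sps = NUM_STREAMS * SAMPLES_PER_STREAM_PER_CYCLE * PL_CLOCK_HZ

if __name__ == "__main__":
    for n in (16, 256):
        print(n, padded_dim(n), samples_per_bank(n))
    print(f"peak rate: {peak_rate_sps / 1e9} GSPS")
```

Under these assumptions, 10 samples per cycle at 312.5 MHz gives a peak of 3.125 GSPS, which leaves headroom above the 2.0 GSPS target.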
The following figure shows the design concept used by the memory transpose PL kernel, using a smaller \(16 \times 16\) example. Note how the 2D array is zero-padded with four extra rows at the bottom and four extra columns on the right. The resulting \(20 \times 20\) data cube is divisible by the five I/O streams used concurrently. The entire data cube is partitioned into five separate banks, each containing 80 samples. Each bank is identified with a unique color in the following figure. Note how the design can write the memory from left to right into five consecutive rows at the same time with no bank contention, because each write lands in a different color. Similarly, the design can read the memory from top to bottom from five consecutive columns at the same time with no bank contention. This scheme sustains the 2 GSPS bandwidth needed to feed the five instances in each of the front-end and back-end AI Engine subgraphs. The memory transpose PL kernel uses the same concept for its larger \(256 \times 256\) size. Also note that the design must read/write 10 samples per PL cycle across only five banks, so each bank serves two accesses per cycle and each memory must be dual-ported.
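The contention-free property can be checked with a small model. The text does not state the exact bank-assignment rule used in the figure, so the sketch below assumes a diagonal skew, bank = (row + column) mod 5, which is one mapping that has the described properties. All names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 5  # one bank per I/O stream


def bank_of(row, col):
    """Assumed diagonal-skew mapping; the kernel's actual rule may differ."""
    return (row + col) % NUM_BANKS


def write_banks(first_row, col):
    """Banks hit when five streams write five consecutive rows at one column."""
    return [bank_of(first_row + k, col) for k in range(NUM_BANKS)]


def read_banks(row, first_col):
    """Banks hit when five streams read five consecutive columns at one row."""
    return [bank_of(row, first_col + k) for k in range(NUM_BANKS)]


def is_conflict_free(dim):
    """True if every group of five row writes or column reads hits five banks."""
    for group in range(0, dim, NUM_BANKS):
        for pos in range(dim):
            if len(set(write_banks(group, pos))) != NUM_BANKS:
                return False
            if len(set(read_banks(pos, group))) != NUM_BANKS:
                return False
    return True


def bank_sizes(dim):
    """Number of samples of a dim x dim cube that land in each bank."""
    counts = Counter(bank_of(r, c) for r in range(dim) for c in range(dim))
    return [counts[b] for b in range(NUM_BANKS)]


def accesses_per_bank(first_row, first_col, samples_per_stream=2):
    """Per-bank access count when each stream moves two adjacent samples."""
    hits = Counter(
        bank_of(first_row + k, first_col + j)
        for k in range(NUM_BANKS)
        for j in range(samples_per_stream)
    )
    return [hits[b] for b in range(NUM_BANKS)]


if __name__ == "__main__":
    print(is_conflict_free(20), bank_sizes(20))
    print(accesses_per_bank(0, 0))
```

For the \(20 \times 20\) example this model places 80 samples in each bank. With two samples per stream per cycle, every bank is accessed exactly twice per cycle, which matches the dual-port requirement.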
The memory transpose PL kernel is implemented in HLS and targets a 312.5 MHz clock. The following figure gives the resource utilization and timing from out-of-context synthesis and place-and-route.