INPUT PERMUTE Kernel - 2025.1 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2025-08-25
Version
2025.1 English

This PL kernel is implemented in HLS @ 312.5 MHz (SSR=4). Samples arriving on one 128-bit stream are written into a ping/pong buffer in 4X duplicate fashion. The duplication is required because the kernel must read or write 4 samples per cycle. The input permutation $P_i$ is stored in a LUT (again with 4X duplication) so the samples may be read back in the required permuted order. The latency of the design is 1008/4 = 252 cycles due to the ping/pong nature of the design. A single output stream delivers consecutive 7-pt transforms to the AI Engine array.
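The ping/pong behavior described above can be sketched with a small behavioral model. This is a minimal Python sketch for illustration only, not the HLS source: the class name `PingPongPermute`, the method `process_frame`, and the 8-entry reversal LUT are hypothetical, and the way the four copies are indexed per lane is an assumption about how the 4X duplication could be used.

```python
SSR = 4  # samples handled per clock cycle, as stated in the text


class PingPongPermute:
    """Toy model: store one frame while reading the previous frame permuted."""

    def __init__(self, lut):
        self.lut = list(lut)          # read addresses, one per output sample
        self.n = len(self.lut)
        assert self.n % SSR == 0, "frame length must be a multiple of SSR"
        # Two banks (ping and pong); each bank holds SSR identical copies
        # of the frame so SSR independent addresses can be read per cycle.
        self.banks = [[[0] * self.n for _ in range(SSR)] for _ in range(2)]
        self.wr_bank = 0

    def process_frame(self, frame):
        """Write `frame` to one bank, return the other bank's frame permuted."""
        wr = self.banks[self.wr_bank]
        rd = self.banks[1 - self.wr_bank]
        out = []
        for cycle in range(self.n // SSR):
            base = cycle * SSR
            for lane in range(SSR):
                # Write side: every copy receives the incoming sample.
                for copy in wr:
                    copy[base + lane] = frame[base + lane]
                # Read side (assumed): lane k reads its own copy at the
                # address supplied by the LUT.
                out.append(rd[lane][self.lut[base + lane]])
        self.wr_bank = 1 - self.wr_bank  # swap ping and pong
        return out


# Hypothetical 8-sample frame with a simple reversal as the "permutation".
model = PingPongPermute([7, 6, 5, 4, 3, 2, 1, 0])
first = model.process_frame(list(range(8)))         # banks start zeroed
second = model.process_frame(list(range(10, 18)))   # frame 0, permuted
```

In this model each frame appears at the output one frame period after it was written, which corresponds to the 1008/4-cycle latency quoted above when the frame length is 1008 and SSR is 4.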

The figure below shows the input permutation required by the PFA-1008 design. The permutation ordering may be considered as a 3D mapping with \(R=7\) rows, \(C=9\) columns, and a depth of \(D=16\). The \(R\) dimension is given horizontally, the \(C\) dimension vertically, and the \(D\) dimension is identified by the “Tile-N” labels in the figure. The required permutation may be computed as $P=\mathrm{mod}(C \times D \times r + R \times D \times c + R \times C \times d,\,1008)$, where \(r\), \(c\), and \(d\) are the indices along the \(R\), \(C\), and \(D\) dimensions (\(0 \le r < R\), \(0 \le c < C\), \(0 \le d < D\)). Note that this pattern cannot be generated automatically using the DMA buffer descriptors of the AI Engine Memory Tiles since it contains the “modulo 1008” operation, which is not supported by the hardware. For this reason, we map the I/O permutation kernels to the PL.

figure2
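The permutation formula can be checked numerically. The following is a minimal Python sketch, assuming lowercase `r`, `c`, and `d` are running indices along the \(R\), \(C\), and \(D\) dimensions, and assuming `r` varies fastest, then `c`, then `d`. The enumeration order and the function name `input_permutation` are illustrative assumptions, not taken from the kernel source.

```python
R, C, D = 7, 9, 16   # rows, columns, depth from the text
N = R * C * D        # 1008 points


def input_permutation():
    """Return the permutation as a flat list.

    Assumed enumeration order: r fastest, then c, then d.
    """
    perm = []
    for d in range(D):
        for c in range(C):
            for r in range(R):
                perm.append((C * D * r + R * D * c + R * C * d) % N)
    return perm


P = input_permutation()

# Every address 0..1007 should appear exactly once.
is_bijection = sorted(P) == list(range(N))
```

Because 7, 9, and 16 are pairwise coprime, the mapping visits each of the 1008 addresses exactly once, so `is_bijection` evaluates to `True`. The first few entries under the assumed ordering step by \(C \times D = 144\).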