FFT-7 Kernel - 2025.1 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2025-08-25
Version
2025.1 English

This AI Engine kernel implements the DFT-7 using the 1x4x8 aie::mmul() API which takes 2 to 3 cycles per operation. The complete DFT-7 compute may be partitioned over two compute tiles, one to compute the green portion and a second to compute the blue portion indentified in the figure below. The API computes a [1x4] x [4x8] matrix multiply, and we must must pad the DFT-7 with an extra row and column of zeros. The entire transform must be computed over 7 cycles (SSR=1). The actual API code computes 8 transforms in less than 56 cycles.

The algorithm is vectorized using a pair of 32-lane registers in each AIE-ML core. A set of 8 vector reads fully populates these registers with data from eight consecutive transforms. Once populated, a set of aie::shuffle_up() API operations are used to position the data in the proper lanes for computes performed by the aie::mmul() API routine.

The outputs from eight consecutive transforms are then packed together into seven 8-lane vectors using a third tile to perform this output combining. Here we use aie::shuffle_up() and aie::shuffle_down() API’s to perform this packing.

figure3