FFT-9 Kernel - 2025.1 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2025-08-25
Version
2025.1 English

This AI Engine kernel implements the DFT-9 using the 1x4x8 aie::mmul() API which takes 2 to 3 cycles per operation. The complete DFT-9 compute may be partitioned over three compute tiles, one to compute the pink portion, a second to compute the orange portion, and a third to compute the green portion indentified in the figure below. The API computes a [1x4] x [4x8] matrix multiply, and we must must pad the DFT-9 with an extra rows and columns of zeros. The entire transform must be computed over 9 cycles (SSR=1). The actual API code computes 8 transforms in less than 72 cycles.

The algorithm is vectorized using three 32-lane registers in each AIE-ML core. A set of ten vector reads fully populates these registers with data from eight consecutive transforms. Once populated, a set of aie::shuffle_up() API operations are used to position the data in the proper lanes for computes performed by the aie::mmul() API routine.

The outputs from eight consecutive transforms are then packed together into nine 8-lane vectors using a fourth tile to perform this output combining. Here we use aie::shuffle_up() and aie::shuffle_down() API’s to perform this packing.

figure5