FFT-9 Kernel

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2026-03-27
Version: 2025.2 English

This AI Engine kernel implements the DFT-9 using the 1x4x8 aie::mmul() API, which takes two to three cycles per operation. The API computes a [1x4] x [4x8] matrix multiply, so the DFT-9 matrix must be padded with extra rows and columns of zeros. The complete DFT-9 compute may be partitioned over three compute tiles: one to compute the pink portion, a second to compute the orange portion, and a third to compute the green portion identified in the following figure. The entire transform must be computed within nine cycles (SSR=1), which allows 72 cycles for eight transforms. The actual API code computes eight transforms in fewer than 72 cycles.
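The zero padding described above can be illustrated with a scalar model in plain C++. This is a minimal sketch, not the tutorial's kernel code: it does not call aie::mmul(), and it assumes the 9x9 twiddle matrix is padded to 12x16 (the next multiples of 4 and 8), which the text does not state explicitly. The names `make_padded_dft9`, `block_mac_1x4x8`, `dft9_tiled`, and `dft9_direct` are hypothetical helpers introduced for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>

using cfloat = std::complex<float>;

constexpr std::size_t kPoints = 9;  // DFT-9
constexpr std::size_t kTileK = 4;   // inner dimension of one [1x4] x [4x8] block
constexpr std::size_t kTileN = 8;   // output width of one block
constexpr std::size_t kPadK = 12;   // assumed: 9 rows padded to a multiple of 4
constexpr std::size_t kPadN = 16;   // assumed: 9 columns padded to a multiple of 8

using PaddedMatrix = std::array<std::array<cfloat, kPadN>, kPadK>;

// Twiddle matrix W[n][k] = exp(-2*pi*i*n*k/9); padding entries stay zero.
PaddedMatrix make_padded_dft9() {
    PaddedMatrix w{};
    const float two_pi = 6.2831853071795865f;
    for (std::size_t n = 0; n < kPoints; ++n) {
        for (std::size_t k = 0; k < kPoints; ++k) {
            const float angle = -two_pi * static_cast<float>((n * k) % kPoints) /
                                static_cast<float>(kPoints);
            w[n][k] = std::polar(1.0f, angle);
        }
    }
    return w;
}

// Scalar model of one block: y[8 lanes] += x[4 lanes] * W[4x8 block (kb, nb)].
void block_mac_1x4x8(const std::array<cfloat, kPadK>& x, const PaddedMatrix& w,
                     std::size_t kb, std::size_t nb, std::array<cfloat, kPadN>& y) {
    assert(kb < kPadK / kTileK && nb < kPadN / kTileN);
    for (std::size_t j = 0; j < kTileN; ++j) {
        for (std::size_t i = 0; i < kTileK; ++i) {
            y[nb * kTileN + j] += x[kb * kTileK + i] * w[kb * kTileK + i][nb * kTileN + j];
        }
    }
}

// DFT-9 as a padded [1x12] x [12x16] product built from 1x4x8 blocks.
std::array<cfloat, kPoints> dft9_tiled(const std::array<cfloat, kPoints>& in) {
    static const PaddedMatrix w = make_padded_dft9();
    std::array<cfloat, kPadK> x{};  // input zero padded from 9 to 12 samples
    for (std::size_t i = 0; i < kPoints; ++i) x[i] = in[i];
    std::array<cfloat, kPadN> y{};  // 16 accumulator lanes, 9 of them meaningful
    for (std::size_t nb = 0; nb < kPadN / kTileN; ++nb) {
        for (std::size_t kb = 0; kb < kPadK / kTileK; ++kb) {
            block_mac_1x4x8(x, w, kb, nb, y);
        }
    }
    std::array<cfloat, kPoints> out{};
    for (std::size_t k = 0; k < kPoints; ++k) out[k] = y[k];  // drop padded lanes
    return out;
}

// Reference: direct O(N^2) DFT-9 in double precision.
std::array<cfloat, kPoints> dft9_direct(const std::array<cfloat, kPoints>& in) {
    std::array<cfloat, kPoints> out{};
    const double two_pi = 6.283185307179586;
    for (std::size_t k = 0; k < kPoints; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t n = 0; n < kPoints; ++n) {
            const double angle = -two_pi * static_cast<double>(n * k) / 9.0;
            acc += std::complex<double>(in[n]) * std::polar(1.0, angle);
        }
        out[k] = static_cast<cfloat>(acc);
    }
    return out;
}
```

Under this assumed padding, one transform decomposes into 3 x 2 = 6 block operations. How those blocks map onto the pink, orange, and green portions is defined by the figure, not by this sketch.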

The algorithm is vectorized using three 32-lane registers in each AIE-ML core. A set of ten vector reads fully populates these registers with data from eight consecutive transforms. Once the registers are populated, a set of aie::shuffle_up() API operations positions the data in the proper lanes for the computations performed by the aie::mmul() API routine.

A fourth tile then combines the outputs from eight consecutive transforms, packing them into nine 8-lane vectors. The aie::shuffle_up() and aie::shuffle_down() APIs are used to perform this packing.
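The index mapping behind this packing step can be sketched in scalar C++. This is only a model under stated assumptions: it assumes each transform arrives as a 16-lane padded result with nine valid lanes, and that the packed stream is in contiguous order (all nine outputs of transform 0, then transform 1, and so on). Neither detail is confirmed by the text, and the real kernel performs the rearrangement on vector registers with aie::shuffle_up() and aie::shuffle_down() rather than with scalar loops. The `Sample` type and `pack_outputs` function are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumTransforms = 8;  // transforms processed per batch
constexpr std::size_t kOutPoints = 9;      // valid outputs per transform
constexpr std::size_t kOutPadded = 16;     // assumed padded width per transform
constexpr std::size_t kOutLanes = 8;       // lanes per packed output vector

using Sample = std::int32_t;  // stand-in for one complex output sample
using PaddedOutputs = std::array<std::array<Sample, kOutPadded>, kNumTransforms>;
using PackedOutputs = std::array<std::array<Sample, kOutLanes>, kOutPoints>;

// Drop the padding lanes and pack 8 x 9 = 72 valid samples into nine 8-lane vectors.
PackedOutputs pack_outputs(const PaddedOutputs& padded) {
    PackedOutputs packed{};
    std::size_t flat = 0;  // running index into the contiguous 72-sample stream
    for (std::size_t t = 0; t < kNumTransforms; ++t) {
        for (std::size_t k = 0; k < kOutPoints; ++k) {
            packed[flat / kOutLanes][flat % kOutLanes] = padded[t][k];
            ++flat;
        }
    }
    return packed;
}
```

Because 9 is not a multiple of 8, each transform's outputs straddle a vector boundary at a different offset, which is consistent with needing both upward and downward lane shifts to assemble the packed vectors.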

figure5