A second means of improving the throughput of FFT designs lies in leveraging the pipelining afforded by the AI Engine array. The computation can be partitioned across multiple tiles in a row, with each tile passing its intermediate results to the next through the shared memory buffer between them. This form of “pipelining” raises throughput by spending additional tile resources.
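The throughput benefit of such a pipeline can be illustrated with a simple model. This is a minimal sketch, not a measurement: the per-stage cycle counts are hypothetical placeholders, the function names are made up for illustration, and the model ignores any overhead for handing buffers from one tile to the next. It assumes a single tile must run every stage back to back, while a pipelined chain accepts a new transform as soon as its slowest stage is free.

```python
def single_tile_throughput(stage_cycles):
    """Transforms per cycle when one tile runs all stages sequentially."""
    return 1.0 / sum(stage_cycles)


def pipelined_throughput(stage_cycles):
    """Transforms per cycle when each stage runs on its own tile.

    In steady state the slowest stage sets the rate of the whole chain.
    """
    return 1.0 / max(stage_cycles)


# Hypothetical cycle counts for three stages (placeholders, not measured data).
stage_cycles = [40, 40, 30]

speedup = pipelined_throughput(stage_cycles) / single_tile_throughput(stage_cycles)
print(f"modelled speedup: {speedup:.2f}x")
```

Under this model the gain approaches the number of tiles only when the stages are well balanced, which is why the way the transform is split across tiles matters.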
The Vitis DSP Library design fft32_dsplib_split demonstrates this approach. The TP_CASC_LEN parameter specifies how many tiles in a row are used to perform the transform. When set to TP_CASC_LEN=3, the computation is split over three tiles. This choice suits the FFT-32 case because the transform is implemented as two Radix-4 stages followed by a third Radix-2 stage (4 × 4 × 2 = 32), so each stage is assigned to its own tile.
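The stage decomposition described above can be sketched in a few lines. This is an illustrative model only: `radix_stages` and `split_stages` are hypothetical helper names, and the even, in-order split of stages across tiles is an assumption made for this example, not the Vitis DSP Library's documented allocation policy.

```python
def radix_stages(point_size):
    """Factor a power-of-two FFT size into Radix-4 stages plus at most one Radix-2 stage."""
    if point_size < 2 or point_size & (point_size - 1):
        raise ValueError("point_size must be a power of two >= 2")
    log2n = point_size.bit_length() - 1
    # Each Radix-4 stage consumes two bits of log2(N); a leftover bit needs one Radix-2 stage.
    return [4] * (log2n // 2) + [2] * (log2n % 2)


def split_stages(stages, casc_len):
    """Assign consecutive stages to casc_len tiles as evenly as possible (assumed policy)."""
    if not 1 <= casc_len <= len(stages):
        raise ValueError("casc_len must be between 1 and the number of stages")
    base, extra = divmod(len(stages), casc_len)
    tiles, start = [], 0
    for tile in range(casc_len):
        count = base + (1 if tile < extra else 0)
        tiles.append(stages[start:start + count])
        start += count
    return tiles


stages = radix_stages(32)        # two Radix-4 stages, then one Radix-2 stage
tiles = split_stages(stages, 3)  # one stage per tile when the cascade length is 3
print(stages, tiles)
```

With a point size of 32 and a cascade length of 3, the sketch places exactly one stage on each tile, matching the arrangement described for fft32_dsplib_split.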
The figure below shows the AI Engine graph for this three-tile FFT design from the Vitis DSP Library. A shared memory buffer is placed between each pair of tiles in the chain. Because the design uses cint16 data and the transform is only 32 points, intermediate results can be passed between tiles through the conventional 32-bit memory interfaces without degradation in bit fidelity. In other cases where this is not possible, such as larger transforms or 32-bit I/O, the Vitis DSP Library automatically elects to pass intermediate results over the 384-bit cascade accumulator streaming connections between tiles.