The Vitis DSP Library design fft32_dsplib_ssr demonstrates one final optimization technique, which provides the largest increase in throughput by mapping the transform computation over a 2D grid of tiles. This combines the advantages of the “pipelined” approach shown earlier with the additional parallelism gained by using multiple rows of the AI Engine array. The 2D approach can also bring in more data at higher bandwidth by using multiple parallel PLIO streams.
The template parameter TP_PARALLEL_POWER scales the FFT design across a 2D array of tiles. It also supports transform sizes beyond the capabilities of a single tile. For an overall transform of point size N, this parameter splits the computation across T = 2^TP_PARALLEL_POWER tiles, each computing a transform of point size N/T. These “front-end” tiles are called “subframe processors.” The design is fed by 2T stream ports. To synthesize the final result, these T tiles are combined with another log2(T) × T tiles that implement Radix-2 stages. This is demonstrated in the following figure for three different values of TP_PARALLEL_POWER.
Each rectangle in the previous figure represents a combination of three kernels implemented in a single AI Engine tile. The orange rectangles represent a “stream-to-window” kernel that collects the dual-stream inputs into an input window buffer. The black rectangles represent a “window-to-stream” kernel that distributes the computed samples to the dual-stream outputs, which can then be distributed to their proper destinations over the stream routing network (according to the required Stockham addressing). The following table summarizes the number of AI Engine tiles required to implement the FFT IP as a function of the TP_PARALLEL_POWER parameter.
| TP_PARALLEL_POWER | # of AI Engine Tiles |
|---|---|
| 0 | 1 |
| 1 | 4 |
| 2 | 12 |
| 3 | 32 |
| 4 | 80 |
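
The tile counts in the table follow from the description above: T subframe processors plus log2(T) × T Radix-2 tiles, which gives T × (1 + TP_PARALLEL_POWER) tiles in total. The following is a minimal sketch of that arithmetic. The helper names `subframe_tiles`, `total_tiles`, and `stream_ports` are hypothetical and are not part of the Vitis DSP Library API.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only (not Vitis DSP Library functions).

// T = 2^TP_PARALLEL_POWER front-end subframe processors.
constexpr unsigned subframe_tiles(unsigned parallel_power) {
    return 1u << parallel_power;
}

// T subframe processors plus log2(T) * T Radix-2 tiles, i.e. T * (1 + log2(T)).
constexpr unsigned total_tiles(unsigned parallel_power) {
    return subframe_tiles(parallel_power) * (1u + parallel_power);
}

// The design is fed by 2T stream ports.
constexpr unsigned stream_ports(unsigned parallel_power) {
    return 2u * subframe_tiles(parallel_power);
}

// Compile-time checks against the rows of the table.
static_assert(total_tiles(0) == 1,  "TP_PARALLEL_POWER = 0");
static_assert(total_tiles(1) == 4,  "TP_PARALLEL_POWER = 1");
static_assert(total_tiles(2) == 12, "TP_PARALLEL_POWER = 2");
static_assert(total_tiles(3) == 32, "TP_PARALLEL_POWER = 3");
static_assert(total_tiles(4) == 80, "TP_PARALLEL_POWER = 4");
```

For example, `total_tiles(1)` evaluates to 4 tiles fed by `stream_ports(1)` = 4 streams, which is the configuration used for the FFT-32 design.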
The Vitis DSP Library FFT IP has one scaling limitation. When scaling with TP_PARALLEL_POWER, the transform size of the front-end subframe processors must be a power of two between 16 and 4096 inclusive. Transform sizes outside of this range are not supported. Therefore, the FFT-32 can only be scaled with TP_PARALLEL_POWER = 1, which gives the subframe processors a transform size of 32/2 = 16, the smallest supported size. Any larger value of TP_PARALLEL_POWER would drop the subframe size below 16. The following figure shows the graph for this case.
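
As an aside before the figure, the size constraint can be sketched as a small compile-time check. This is an illustrative sketch only. The names `subframe_point_size` and `is_supported_scaling` are hypothetical, the check encodes just the 16 to 4096 range stated above, and it does not model any other limit the library may place on TP_PARALLEL_POWER.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only (not Vitis DSP Library functions).

constexpr bool is_power_of_two(unsigned n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Point size handled by each subframe processor: N / 2^TP_PARALLEL_POWER.
// Returns 0 for shift amounts that would be out of range for unsigned.
constexpr unsigned subframe_point_size(unsigned point_size, unsigned parallel_power) {
    return parallel_power < 32 ? (point_size >> parallel_power) : 0u;
}

// Assumption: only the subframe size range from the text (16..4096) is checked.
constexpr bool is_supported_scaling(unsigned point_size, unsigned parallel_power) {
    return is_power_of_two(point_size)
        && subframe_point_size(point_size, parallel_power) >= 16
        && subframe_point_size(point_size, parallel_power) <= 4096;
}

// FFT-32: TP_PARALLEL_POWER = 1 gives 16-point subframes; 2 would give 8-point ones.
static_assert(is_supported_scaling(32, 1),  "32 / 2 = 16 is in range");
static_assert(!is_supported_scaling(32, 2), "32 / 4 = 8 is below the minimum");
```

Under the same assumption, a 65536-point transform would need TP_PARALLEL_POWER = 4 to bring the subframe size down to 4096.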