The AIE API has a particular set of stage-based functions to compute the Fast Fourier Transform through the Stockham variant of the Cooley-Tukey algorithm. As explained in the FFT background document, such variant is a not in place algorithm, thus generally requires a temporary buffer to save intermediate calculations. However, it has the big advantage of being self-sorting and efficiently vectorizable.
To understand the FFT API usage it is important to consider that the FFT algorithms rely on the concept of decimation, for which an N point Discrete Fourier Transform can be divided into the sum of multiple smaller transforms, multiplied by complex rotation parameters called Twiddle Factors. The number of the resulting transforms depends linearly on the radix parameter: a radix-2 FFT stage will thus divide the original transform into two transforms, a radix-3 into three, and so on. Usually, the decimation process is repeated recursively until the discrete transform to be performed has a number of points equal to the radix of the last stage.
Fig. 2: Diagram of a radix-4 Stage Decimation.
The FFT staged API functions are C++ templatized functions. The API used for this work is the radix-4 stage function, whose declaration is the following:
void aie::fft_dit_r4_stage <unsigned Vectorization,
typename Input , typename Output , typename Twiddle>
( const Input *__restrict x,
const Twiddle *__restrict tw0,
const Twiddle *__restrict tw1,
const Twiddle *__restrict tw2,
unsigned n_points,
unsigned shift_tw,
unsigned shift,
bool inv,
Output *__restrict out)
The template parameters are:
The input datatype;
The output datatype;
The twiddles datatype;
The stage vectorization, that is the step between the indexes of the samples to be summed and rotated.
$$\text{vec}(\text{stage}_ 0)=\frac{N_ {\text{points}}}{\text{radix}_ {\text{stage}_ 0}}, \quad \text{vec}(\text{stage}_ m)=\frac{\text{vec}(\text{stage}_ {m-1})}{\text{radix}_ {\text{stage}_ m}}$$
Fig. 3: Example of vectorization of the three stages of an 8 points radix-2 DIT FFT.
From the figure above, note that the vectorization follows the equations above, where the stride in index (shown at the left of the image) between samples is 4-2-1.The function’s arguments are:
The pointers to the twiddle tables, that are in number equal to the used stage radix minus one, thus three pointers for a radix-4 stage implementation, as observable from figure 2;
The input data memory pointer;
The output data memory pointer;
The total number of points of the signal;
The twiddle shift parameter to handle the fixed-point graularity of the twiddles;
The stage shift parameter to handle the fixed-point graularity of the FFT stage;
The inverse flag, that sets the function to compute the FFT or its inverse function.
Note moreover that having an odd number of stages eliminates the need for a temporary buffer, because the output ping-pong buffers can be used as the auxiliary memory to compute the not in-place algorithm. This happens because every alternated sequence of two elements terminates on the second element only if the number of repetitions is odd. For example, if the three repetitions are done the sequence will be {in-out, out-in, in-out}, whereas with four repetitions it is {in-out, out-in, in-out, out-in}, thus a third auxiliary element would be needed to terminate the sequence on the “out” element.