Before coding the kernel, choose the API function best suited to the computation. Since \(1024=4^5\), a radix-4-only implementation comprising five stages, and therefore five API calls, is a suitable choice for this design. Compared with radix-2, this approach requires fewer API calls, which reduces program execution control overhead, and fewer computations, because more of the complex multiplications are trivial. Another important consideration is memory: computing one FFT takes four times the memory needed to store the 1024 CINT16 samples, that is 16 kilobytes in total, plus the memory reserved for the twiddle tables. The factor of four arises because ping-pong buffers are needed at both the input and the output to avoid creating backpressure. Since an AIE-ML tile has 64 kilobytes of local memory, multiple signals can be batched together and computed in the same tile. This design batches two signals per kernel.
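The stage count and the buffer budget described above can be checked with a few lines of host-side C++. This is only a sketch of the arithmetic, not kernel code: the constant and function names are hypothetical, a CINT16 sample is assumed to occupy 4 bytes (16-bit real plus 16-bit imaginary), and the twiddle tables are left out because their size is not stated here.

```cpp
#include <cstddef>

// Figures taken from the text (names are illustrative only).
constexpr std::size_t kPoints         = 1024;       // FFT size
constexpr std::size_t kBytesPerSample = 4;          // assumed: cint16 = 2 x int16
constexpr std::size_t kTileMemBytes   = 64 * 1024;  // AIE-ML tile local memory
constexpr std::size_t kBatch          = 2;          // signals batched per kernel

// Number of radix-`radix` stages for an n-point FFT (n must be a power of radix).
constexpr unsigned stages(std::size_t n, std::size_t radix) {
    unsigned s = 0;
    while (n > 1) { n /= radix; ++s; }
    return s;
}

// Bytes taken by the I/O buffers: ping + pong at the input and ping + pong
// at the output, hence the factor of four. Twiddle tables are NOT included.
constexpr std::size_t bufferBytes(std::size_t batch) {
    return 4 * batch * kPoints * kBytesPerSample;
}

static_assert(stages(kPoints, 4) == 5,  "radix-4: five stages, five API calls");
static_assert(stages(kPoints, 2) == 10, "radix-2 would need ten stages");
static_assert(bufferBytes(1) == 16 * 1024, "one signal: 16 KB of buffers");
static_assert(bufferBytes(kBatch) == 32 * 1024, "two signals: 32 KB of buffers");
static_assert(bufferBytes(kBatch) < kTileMemBytes, "leaves room for twiddles");
```

With a batch of two, the buffers take 32 kilobytes, which leaves the other half of the tile memory for the twiddle tables and anything else the kernel needs.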
With these design considerations in mind, the next step is to code the AI Engine kernels. The kernels are organized into the following files:
One header file containing all the twiddle factor tables, defined as macros with the #define preprocessor directive.
One highly parametric header file where the fft1k_kernel class is defined.
One .cpp file where the kernel class run method is defined with the AIE API functions.
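The three-file layout can be pictured with the host-compilable mock below, collapsed into a single listing for brevity. Everything except the class name fft1k_kernel is an assumption made for illustration: the file names, the template parameters, the `cint16_sketch` stand-in type, the placeholder twiddle values, and the pointer-based `run` signature are all hypothetical. The real kernel would use the AI Engine types and call the AIE API radix-4 stage functions where the placeholder loop sits.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the AI Engine cint16 type so the sketch compiles on a host.
struct cint16_sketch { std::int16_t real; std::int16_t imag; };

// --- File 1 (hypothetical name: twiddles.h): tables defined as macros ---
// Placeholder values only: the four Q15 points of exp(-j*2*pi*k/4).
#define TWID_SKETCH { {32767, 0}, {0, -32768}, {-32768, 0}, {0, 32767} }

// --- File 2 (hypothetical name: fft1k_kernel.h): parametric class ---
template <unsigned POINTS, unsigned BATCH>
class fft1k_kernel {
public:
    // Processes BATCH signals of POINTS samples each per invocation.
    void run(const cint16_sketch* in, cint16_sketch* out);
    static constexpr unsigned samples_per_call() { return POINTS * BATCH; }
};

// --- File 3 (hypothetical name: fft1k_kernel.cpp): run method ---
static const cint16_sketch tw_sketch[4] = TWID_SKETCH;

template <unsigned POINTS, unsigned BATCH>
void fft1k_kernel<POINTS, BATCH>::run(const cint16_sketch* in,
                                      cint16_sketch* out) {
    (void)tw_sketch;  // the real kernel would pass the tables to each stage
    // Placeholder body: copies input to output. The five radix-4 stage
    // calls of the actual design would replace this loop.
    for (unsigned i = 0; i < POINTS * BATCH; ++i) out[i] = in[i];
}

// Explicit instantiation for the configuration described in the text.
template class fft1k_kernel<1024, 2>;
```

Keeping the tables in their own header lets the class header stay generic in its parameters, while the .cpp file is the only place that depends on the stage-level API calls.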