The first FFT design presented here for the Versal AI Engine is implemented using the AI Engine API. The API is a portable programming interface for AI Engine accelerators, implemented as a C++ header-only library. It provides data types and computational operations that are translated into efficient low-level intrinsics and VLIW assembly code for the AI Engine vector processor, and its high-level abstractions improve code readability and simplify code maintenance.
This first design is a Radix-2 DIT 32-pt FFT using the Stockham approach. The design implements the complete transform on a single AI Engine tile. The code adopts the C++ kernel class programming style, which pairs naturally with the API. The following code block shows the source code for fft32_r2_kernel.h, which defines the C++ kernel class.
#pragma once

#include <adf.h>
#include <aie_api/aie.hpp>

#include "fft32_r2_twiddle.h"

using namespace adf;

class fft32_r2_kernel {
public:
  typedef cint16 TT_DATA;
  typedef cint16 TT_TWID;
  static constexpr unsigned N = 32;
  static constexpr unsigned SHIFT_TW = 15;
  static constexpr unsigned SHIFT_DT = 15;
  static constexpr bool     INVERSE = false;
  static constexpr unsigned REPEAT = 128;
  static constexpr unsigned WIN_SIZE = N * REPEAT;

private:
  // Temporary Buffers:
  alignas(aie::vector_decl_align) TT_DATA tbuff[N];

  // Twiddle factors:
  alignas(aie::vector_decl_align) static constexpr TT_TWID tw1[ 1] = TWID1;
  alignas(aie::vector_decl_align) static constexpr TT_TWID tw2[ 2] = TWID2;
  alignas(aie::vector_decl_align) static constexpr TT_TWID tw4[ 4] = TWID4;
  alignas(aie::vector_decl_align) static constexpr TT_TWID tw8[ 8] = TWID8;
  alignas(aie::vector_decl_align) static constexpr TT_TWID tw16[16] = TWID16;

public:
  // Constructor:
  fft32_r2_kernel(void);

  // Run:
  void run( input_buffer<TT_DATA,extents<WIN_SIZE> >& __restrict sig_i,
            output_buffer<TT_DATA,extents<WIN_SIZE> >& __restrict sig_o );

  static void registerKernelClass( void )
  {
    REGISTER_FUNCTION( fft32_r2_kernel::run );
  }
};
Some critical aspects of the previous code are highlighted in the following:
- The data types for this design are defined as TT_DATA and TT_TWID. This FFT uses 16-bit fixed-point data types, with “cint16” for the data path and “cint16” twiddle factors. The design declares a data buffer tbuff to provide the additional storage required by the Stockham approach.
- The run() function defines the kernel interface. This kernel is designed to use buffer-based I/O. The buffer size is set using the WIN_SIZE parameter. Note the window size depends on the transform size N and the REPEAT parameter used to establish “batch-processing” to increase throughput, as outlined in detail below.
- The INVERSE parameter controls whether the kernel computes an IFFT (true) or an FFT (false).
- The twiddle factor tables are hard-coded to values supplied in the external include file fft32_r2_twiddle.h (a sketch of how such values might be generated follows this list). Note how all data tables are aligned to 128-bit boundaries, as required to support vector memory access in the AI Engine architecture. The AI Engine API provides aie::vector_decl_align as a future-proof means to supply the correct alignment (should this requirement change in future architectures).
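The contents of fft32_r2_twiddle.h are not reproduced here. As a hedged illustration only, the following host-side generator shows one way the TWID1 through TWID16 macros could be produced, assuming they hold the standard radix-2 DIT twiddles W_M^k = exp(-j2πk/M) quantized to cint16 with a 2^SHIFT_TW = 2^15 scale. The macro format and exact rounding are assumptions, not taken from the original design files.

// Hypothetical host-side generator (not part of the AI Engine kernel) that
// prints cint16 twiddle pairs in the form the TWIDn macros are assumed to
// take: W_M^k = exp(-j*2*pi*k/M), scaled by 2^15 and clamped to int16 range.
#include <cmath>
#include <cstdio>

static const double PI = 3.14159265358979323846;

static int quantize( double v )
{
  long q = std::lround( v * 32768.0 );   // 2^15 scaling (SHIFT_TW = 15)
  if (q >  32767) q =  32767;            // clamp to int16 range
  if (q < -32768) q = -32768;
  return (int)q;
}

static void print_table( const char* name, unsigned M )
{
  // A radix-2 DIT stage combining blocks of length M uses M/2 twiddles:
  std::printf("#define %s {", name);
  for (unsigned k = 0; k < M/2; k++) {
    int re = quantize( std::cos(-2.0 * PI * k / M) );
    int im = quantize( std::sin(-2.0 * PI * k / M) );
    std::printf("{%d,%d}%s", re, im, (k+1 < M/2) ? "," : "");
  }
  std::printf("}\n");
}

int main( void )
{
  print_table("TWID1",   2);   //  1 twiddle,  used by fft_dit_r2_stage<16>
  print_table("TWID2",   4);   //  2 twiddles, used by fft_dit_r2_stage< 8>
  print_table("TWID4",   8);   //  4 twiddles, used by fft_dit_r2_stage< 4>
  print_table("TWID8",  16);   //  8 twiddles, used by fft_dit_r2_stage< 2>
  print_table("TWID16", 32);   // 16 twiddles, used by fft_dit_r2_stage< 1>
  return 0;
}

Note that the table sizes 1, 2, 4, 8, and 16 match the declarations of tw1 through tw16 in the kernel header, which is why each stage of the 32-pt transform needs its own table.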
The following code block shows the source code for fft32_r2_kernel.cpp, which contains the kernel implementation.
#include <adf.h>
#include <aie_api/aie.hpp>

#include "fft32_r2_kernel.h"

fft32_r2_kernel::fft32_r2_kernel( void )
{
  aie::set_rounding(aie::rounding_mode::positive_inf);
  aie::set_saturation(aie::saturation_mode::saturate);
}

void fft32_r2_kernel::run( input_buffer<TT_DATA,extents<WIN_SIZE> >& __restrict sig_i,
                           output_buffer<TT_DATA,extents<WIN_SIZE> >& __restrict sig_o )
{
  // Set pointers to windows:
  TT_DATA* ibuff = sig_i.data();
  TT_DATA* obuff = sig_o.data();

  // Perform FFT:
  for (int rr=0; rr < REPEAT; rr++)
    chess_prepare_for_pipelining
    chess_loop_range(REPEAT,)
  {
    aie::fft_dit_r2_stage<16>(ibuff, tw1,  N, SHIFT_TW, SHIFT_DT, INVERSE, tbuff);
    aie::fft_dit_r2_stage< 8>(tbuff, tw2,  N, SHIFT_TW, SHIFT_DT, INVERSE, ibuff);
    aie::fft_dit_r2_stage< 4>(ibuff, tw4,  N, SHIFT_TW, SHIFT_DT, INVERSE, tbuff);
    aie::fft_dit_r2_stage< 2>(tbuff, tw8,  N, SHIFT_TW, SHIFT_DT, INVERSE, ibuff);
    aie::fft_dit_r2_stage< 1>(ibuff, tw16, N, SHIFT_TW, SHIFT_DT, INVERSE, obuff);
    ibuff += N;
    obuff += N;
  }
}
Some critical aspects of the previous code are highlighted in the following:
- AI Engine rounding and saturation modes are set in the kernel constructor using the AI Engine API.
- The kernel signature is provided by the run() function. As mentioned earlier, the kernel uses buffer-based I/O; pointers to the input and output buffers are obtained here.
- The kernel contains a single “for-loop”. Each iteration of this loop computes a single 32-pt transform. The REPEAT parameter controls how many transforms are computed per kernel invocation. Sufficient input data must be supplied for exactly this many transforms.
- The fft_dit_r2_stage() function implements each Stockham DIT stage. The design requires five such stages to implement a 32-pt transform. This routine is described in further detail below.
- The code demonstrates the “ping-pong” nature of the Stockham algorithm outlined earlier. Here, the input buffer (pointed to by ibuff) forms one side, and tbuff forms the other side of this ping-pong buffer. Each buffer serves alternately as input or output as the stage computations progress (a plain C++ sketch of this data flow follows this list).
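To make the ping-pong data flow concrete, the following is a minimal floating-point reference model of a radix-2 DIT Stockham FFT, offered as a host-side illustration only and not as the AI Engine implementation. Each stage reads from one buffer and writes to the other, then the two are swapped; for N = 32 the twiddle counts per stage are 1, 2, 4, 8, and 16, which lines up with the tw1 through tw16 tables, while the number of butterflies sharing each twiddle (16, 8, 4, 2, 1) lines up with the vectorization template arguments used above.

#include <complex>
#include <utility>
#include <vector>

using cplx = std::complex<double>;

// Radix-2 DIT Stockham FFT of length N = 2^t (host-side reference model).
// 'x' and 'y' act as the ping-pong pair, analogous to ibuff/tbuff in the kernel.
void stockham_r2_dit_ref( std::vector<cplx> x, std::vector<cplx>& y )
{
  const unsigned N  = (unsigned)x.size();
  const double   PI = 3.14159265358979323846;
  y.assign(N, cplx(0.0, 0.0));
  cplx* a = x.data();                       // current stage input  ("ping")
  cplx* b = y.data();                       // current stage output ("pong")
  for (unsigned L = 2; L <= N; L *= 2) {    // one pass per radix-2 stage
    const unsigned r = N / L;               // butterflies sharing each twiddle
    for (unsigned j = 0; j < L/2; j++) {
      const cplx w = std::polar(1.0, -2.0 * PI * j / L);   // W_L^j
      for (unsigned k = 0; k < r; k++) {
        const cplx u = a[2*j*r + k];
        const cplx v = w * a[2*j*r + r + k];
        b[j*r + k]         = u + v;         // upper butterfly output
        b[(j + L/2)*r + k] = u - v;         // lower butterfly output
      }
    }
    std::swap(a, b);                        // ping-pong for the next stage
  }
  if (a != y.data()) y.assign(a, a + N);    // make sure the result is in 'y'
}

For N = 32 there are five stages, so after an odd number of swaps the result naturally lands in the second buffer, just as the kernel's fifth and final stage writes directly to obuff.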
The fft_dit_r2_stage() function provided by the API performs Radix-2 stages for DIT Stockham FFTs. These are documented fully in [3]. The routine is templatized and parameterized for flexibility, supporting various data types and transform sizes. The AIE API also supports Radix-3, Radix-4, and Radix-5 stages, depending on data type. See [3] for the currently supported routines.
The following figure shows the AI Engine graph for this single-tile FFT design. The PLIO connections are made using streams, but these are translated to kernel interface buffers. In this case, the REPEAT parameter was set to unity, so the buffer size matches the transform size.
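The graph source itself is not reproduced here. The following is a minimal sketch, assuming a recent ADF graph API, of how the single-tile graph in the figure could be expressed. The graph class name, PLIO names, data file names, and runtime ratio are illustrative placeholders rather than values taken from the original design.

#pragma once

#include <adf.h>
#include "fft32_r2_kernel.h"

using namespace adf;

class fft32_r2_graph : public graph {
public:
  kernel      k_fft;
  input_plio  sig_i;
  output_plio sig_o;

  fft32_r2_graph( void )
  {
    // Instantiate the kernel from its C++ kernel class:
    k_fft = kernel::create_object<fft32_r2_kernel>();
    source(k_fft) = "fft32_r2_kernel.cpp";
    runtime<ratio>(k_fft) = 0.9;

    // PLIO streams at the PL boundary; the tools convert these to the
    // kernel's buffer interfaces, whose size follows WIN_SIZE:
    sig_i = input_plio::create("PLIO_sig_i", plio_64_bits, "data/sig_i.txt");
    sig_o = output_plio::create("PLIO_sig_o", plio_64_bits, "data/sig_o.txt");

    connect(sig_i.out[0], k_fft.in[0]);
    connect(k_fft.out[0], sig_o.in[0]);
  }
};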