The alternative to custom-coded FFT designs using the AI Engine API is simply to instantiate the high-performance IP available in the Vitis DSP library. These IP cores deliver high performance that scales from a single AI Engine tile to dozens of tiles. The IP is realized as a set of C++ template classes whose template parameters can be set according to the system-level performance and resource utilization required by the application.
This section considers a “drop-in” replacement for the Radix-2 FFT-32 designed earlier using the AIE API. The transform from the Vitis DSP library differs from the earlier design in a few minor ways:
* The Vitis DSP library relies heavily on Radix-4 transforms to optimize QoR. Consequently, to implement an FFT-32, the design is built from two Radix-4 stages and a final Radix-2 stage. This improves throughput because Radix-4 stages vectorize more efficiently on the AI Engine architecture, and it improves latency because there are fewer stages overall to compute (see the worked numbers after this list).
* The FFT designs in the Vitis DSP library use either `cfloat` or `cint32` internally in all cases, even for I/O data delivered as `cint16`. This is done so that bit growth need not be managed explicitly from stage to stage, which would impact throughput negatively. It also helps make the IP scalable.
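To make these two points concrete, here is a quick back-of-the-envelope check (our own illustration, not taken from the library documentation). The factorization determines the stage count:

$$
32 = 4 \times 4 \times 2 \quad\Rightarrow\quad 2 \text{ Radix-4 stages} + 1 \text{ Radix-2 stage} = 3 \text{ stages, versus } \log_2 32 = 5 \text{ Radix-2 stages.}
$$

On the bit-growth side, the magnitude of a length-$N$ DFT output can exceed that of its input by up to a factor of $N$, i.e. up to $\log_2 32 = 5$ extra bits on top of the 16-bit `cint16` input. The wider `cint32` internal type absorbs this growth without per-stage shifting.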
The following code block shows the source code for instantiating the single-tile FFT-32 design from the Vitis DSP library (fft32_dsplib). The list below highlights some key aspects of the design:
* The file must include the proper header file `fft_ifft_dit_1ch_graph.hpp` to access the FFT graph, and must identify the namespace `xf::dsp::aie::fft::dit_1ch`.
* The template parameters for the IP are configured using a number of `static constexpr` definitions. The transform size is configured to 32 points, I/O is defined as 16-bit fixed-point (`cint16`), a forward FFT is selected, and `TP_API=0` selects a buffer-based interface. Two key optimization parameters, `TP_PARALLEL_POWER` and `TP_CASC_LEN`, are left at their defaults, leading to a simple single-tile design.
* The IP graph is instantiated as `dut_fft`.
* The `fft_i` and `fft_o` members define arrays of PLIO I/O ports. Only a single port is required for this initial design, but you will see later how additional ports can be added to increase the throughput of the design.
```cpp
#pragma once

#include <adf.h>
#include <vector>
#include <fft_ifft_dit_1ch_graph.hpp>

using namespace adf;
using namespace xf::dsp::aie::fft::dit_1ch;

template<int ORIGIN_X, int ORIGIN_Y>
class fft32_dsplib_graph : public graph {
public:
  typedef cint16 TT_TYPE;                        // I/O data type
  typedef cint16 TT_TWIDDLE;                     // Twiddle factor type

  static constexpr int REPEAT = 1;               // # of transforms per I/O buffer
  static constexpr int TP_POINT_SIZE = 32;       // Transform size
  static constexpr int TP_FFT_NIFFT = 1;         // 1 = forward FFT, 0 = inverse FFT
  static constexpr int TP_SHIFT = 0;             // Output scaling (right shift)
  static constexpr int TP_CASC_LEN = 1;          // # of tiles in the cascade
  static constexpr int TP_DYN_PT_SIZE = 0;       // Static point size
  static constexpr int TP_WINDOW_SIZE = TP_POINT_SIZE * REPEAT;
  static constexpr int TP_API = 0;               // 0 = buffer I/O, 1 = stream I/O
  static constexpr int TP_PARALLEL_POWER = 0;    // log2(# of parallel subframe FFTs)
  static constexpr int Nports = (TP_API == 1) ? (1 << (TP_PARALLEL_POWER + 1)) : 1;

  std::array<input_plio, Nports>  fft_i;
  std::array<output_plio, Nports> fft_o;

  // FFT IP from the Vitis DSP library:
  fft_ifft_dit_1ch_graph<TT_TYPE, TT_TWIDDLE, TP_POINT_SIZE, TP_FFT_NIFFT, TP_SHIFT,
                         TP_CASC_LEN, TP_DYN_PT_SIZE, TP_WINDOW_SIZE, TP_API,
                         TP_PARALLEL_POWER> dut_fft;

  fft32_dsplib_graph(void)
  {
    for (int ii = 0; ii < Nports; ii++) {
      std::string fname_i = "data/sig" + std::to_string(ii) + "_i.txt";
      std::string fname_o = "data/sig" + std::to_string(ii) + "_o.txt";
      fft_i[ii] = input_plio::create("PLIO_i_" + std::to_string(ii), plio_64_bits, fname_i);
      fft_o[ii] = output_plio::create("PLIO_o_" + std::to_string(ii), plio_64_bits, fname_o);
      connect<stream>( fft_i[ii].out[0], dut_fft.in[ii] );
      connect<stream>( dut_fft.out[ii], fft_o[ii].in[0] );
    }
  }
};
```
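For completeness, a minimal top-level application that could drive this graph in the AI Engine simulator might look like the following. This is a sketch under the usual ADF conventions: the header name `fft32_dsplib.h`, the instance name `mygraph`, and the single-iteration run count are our own choices, not part of the original source.

```cpp
#include "fft32_dsplib.h"   // hypothetical header containing fft32_dsplib_graph

// Place the graph; the ORIGIN_X/ORIGIN_Y template parameters are available
// for location constraints and are simply set to zero here.
fft32_dsplib_graph<0,0> mygraph;

int main(void)
{
  mygraph.init();    // load the graph onto the AI Engine array
  mygraph.run(1);    // execute one iteration (one buffer of REPEAT transforms)
  mygraph.end();     // wait for completion and clean up
  return 0;
}
```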
The AI Engine graph for fft32_dsplib is shown below. The AI Engine graphs for fft32_r2 and fft32_dsplib share clear similarities. In the latter, the additional data buffers required by the Stockham approach are made explicit at the graph level, whereas in the former they were kept internal to the kernel. The REPEAT parameter was set to unity, so the buffer size matches the transform size.
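If higher efficiency per kernel invocation is desired, the REPEAT parameter can be raised to batch several transforms into each I/O buffer, amortizing the fixed overhead of each kernel call. A sketch of the change in the graph class above (our own illustration; the value 4 is arbitrary):

```cpp
static constexpr int REPEAT = 4;                               // four back-to-back FFT-32 frames
static constexpr int TP_WINDOW_SIZE = TP_POINT_SIZE * REPEAT;  // 128 samples per I/O buffer
```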