Single-Tile DSPlib Design - 2025.1 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2025-08-25
Version
2025.1 English

An alternative to custom-coding FFT designs with the AI Engine API is to instantiate the high-performance IP available in the Vitis DSP library. These IP cores scale from a single AI Engine tile to dozens of tiles. The IP is implemented as a set of C++ template classes whose template parameters are set to match the system-level performance and resource utilization targets of the application.

This section considers a “drop-in” replacement for the Radix-2 FFT-32 designed earlier using the AIE API. The transform from the Vitis DSP library differs from the earlier design in a few minor ways:

  • The Vitis DSP library relies heavily on Radix-4 transforms to optimize QoR. Consequently, the FFT-32 is built from two Radix-4 stages and a final Radix-2 stage (4 × 4 × 2 = 32). This improves throughput because Radix-4 stages vectorize more efficiently on the AI Engine architecture. It also improves latency because there are three stages to compute instead of the five that a pure Radix-2 FFT-32 requires.

  • The FFT designs in the Vitis DSP library always use either cfloat or cint32 internally, even when the I/O data is delivered as cint16. With the wider internal type, bit growth need not be managed explicitly from stage to stage, which would otherwise reduce throughput. The wider type also helps make the IP scalable.
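
The two points above can be checked with a little compile-time arithmetic. The following is a minimal sketch, assuming C++14 or later; the helper names (log2_int, StagePlan, plan_stages, radix2_only_stages, worst_case_growth_bits) are hypothetical and are not part of the Vitis DSP library. It counts the stages obtained when Radix-4 is used wherever possible, and evaluates the standard worst-case bound for an unscaled N-point DFT, |X[k]| <= N * max|x[n]|, which corresponds to log2(N) bits of magnitude growth.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only; none of these names come from
// the Vitis DSP library.

// Integer log2 for power-of-two sizes (for example, 32 -> 5).
constexpr int log2_int(int n) {
    int bits = 0;
    while (n > 1) {
        n >>= 1;
        ++bits;
    }
    return bits;
}

// Number of Radix-4 and Radix-2 stages for a power-of-two transform size.
struct StagePlan {
    int radix4_stages;
    int radix2_stages;
    constexpr int total() const { return radix4_stages + radix2_stages; }
};

// Use a Radix-4 stage for every pair of address bits; one leftover bit
// (odd log2) needs a single Radix-2 stage.
constexpr StagePlan plan_stages(int point_size) {
    return StagePlan{log2_int(point_size) / 2, log2_int(point_size) % 2};
}

// A pure Radix-2 design needs one stage per address bit.
constexpr int radix2_only_stages(int point_size) {
    return log2_int(point_size);
}

// Worst-case magnitude growth of an unscaled N-point DFT, in bits:
// |X[k]| <= N * max|x[n]|, so the bound grows by log2(N) bits.
constexpr int worst_case_growth_bits(int point_size) {
    return log2_int(point_size);
}

// FFT-32: two Radix-4 stages plus one Radix-2 stage, versus five Radix-2 stages.
static_assert(plan_stages(32).radix4_stages == 2, "expected two Radix-4 stages");
static_assert(plan_stages(32).radix2_stages == 1, "expected one Radix-2 stage");
static_assert(plan_stages(32).total() == 3, "expected three stages in total");
static_assert(radix2_only_stages(32) == 5, "pure Radix-2 needs five stages");

// 16-bit samples plus 5 bits of growth no longer fit in 16 bits,
// but fit comfortably in a 32-bit component.
static_assert(16 + worst_case_growth_bits(32) > 16, "growth overflows 16 bits");
static_assert(16 + worst_case_growth_bits(32) <= 32, "growth fits in 32 bits");
```

The same helpers can be evaluated at run time, for example `assert(plan_stages(1024).total() == 5);`. The bit-growth figure is a bound on magnitude only; it is offered as one reading of why a wider internal type removes the need for per-stage scaling, not as a description of the library's internals.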

The following code block shows the source code that instantiates the single-tile FFT-32 design from the Vitis DSP library (fft32_dsplib). Key aspects of the design are:

  • The file must include the proper header file fft_ifft_dit_1ch_graph.hpp to access the FFT graph, and must identify the namespace xf::dsp::aie::fft::dit_1ch.

  • The template parameters for the IP are configured using a number of static constexpr definitions. The transform size is set to 32 points (TP_POINT_SIZE=32), the I/O data type is 16-bit fixed-point complex (cint16), an FFT rather than an IFFT is selected (TP_FFT_NIFFT=1), and TP_API=0 selects a buffer-based interface. Two key optimization parameters, TP_PARALLEL_POWER and TP_CASC_LEN, are kept at their default values (0 and 1 in the listing), leading to a simple single-tile design.

  • The IP graph is instantiated as dut_fft.

  • The fft_i and fft_o members define arrays of PLIO input and output ports. Only a single port in each direction is required for this initial design, but you will see later how additional ports can be added to increase the throughput of the design.

#pragma once
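#include <adf.h>
#include <array>    // std::array for the PLIO port arrays
#include <string>   // std::string and std::to_string for file and port names
#include <vector>
#include <fft_ifft_dit_1ch_graph.hpp>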
using namespace adf;
using namespace xf::dsp::aie::fft::dit_1ch;

template<int ORIGIN_X, int ORIGIN_Y>
class fft32_dsplib_graph : public graph {
public:
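  typedef cint16                              TT_TYPE;      // I/O data type: 16-bit fixed-point complex
  typedef cint16                              TT_TWIDDLE;   // twiddle-factor type
  static constexpr int  REPEAT                = 1;          // transforms per buffer
  static constexpr int  TP_POINT_SIZE         = 32;         // transform size in points
  static constexpr int  TP_FFT_NIFFT          = 1;          // 1 selects the FFT (not the IFFT)
  static constexpr int  TP_SHIFT              = 0;          // no scaling shift applied
  static constexpr int  TP_CASC_LEN           = 1;          // cascade length of 1 (single tile)
  static constexpr int  TP_DYN_PT_SIZE        = 0;          // fixed point size
  static constexpr int  TP_WINDOW_SIZE        = TP_POINT_SIZE * REPEAT; // buffer size in samples
  static constexpr int  TP_API                = 0;          // 0 selects the buffer-based interface
  static constexpr int  TP_PARALLEL_POWER     = 0;          // no additional parallelism
  // Number of PLIO ports in each direction; evaluates to 1 for this design.
  static constexpr int  Nports = (TP_API == 1 ) ? (1 << (TP_PARALLEL_POWER+1)) : 1;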

  std::array<input_plio,Nports>  fft_i;
  std::array<output_plio,Nports> fft_o;

  fft_ifft_dit_1ch_graph<TT_TYPE,TT_TWIDDLE,TP_POINT_SIZE,TP_FFT_NIFFT,TP_SHIFT,
                         TP_CASC_LEN,TP_DYN_PT_SIZE,TP_WINDOW_SIZE,TP_API,
                         TP_PARALLEL_POWER> dut_fft;
  fft32_dsplib_graph(void)
  {
    for (int ii=0; ii < Nports; ii++) {
      std::string fname_i = "data/sig" + std::to_string(ii) + "_i.txt";
      std::string fname_o = "data/sig" + std::to_string(ii) + "_o.txt";
      fft_i[ii] =  input_plio::create("PLIO_i_"+std::to_string(ii),plio_64_bits,fname_i);
      fft_o[ii] = output_plio::create("PLIO_o_"+std::to_string(ii),plio_64_bits,fname_o);
      connect<stream>( fft_i[ii].out[0], dut_fft.in[ii]  );
      connect<stream>( dut_fft.out[ii],  fft_o[ii].in[0] );
    }
  }
};
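
The derived constants in the listing above can be evaluated in isolation to see what a given parameter choice produces. The following is a minimal sketch that copies the Nports and TP_WINDOW_SIZE expressions from the listing into free functions; the function and constant names (num_ports, window_size, kBytesPerCint16, kPlioBits) are hypothetical, and the sketch only evaluates the expressions. It does not claim which parameter combinations the library accepts.

```cpp
#include <cassert>

// Hypothetical names for illustration; the expressions themselves are copied
// from the fft32_dsplib_graph listing.

// Same expression as Nports in the listing.
constexpr int num_ports(int tp_api, int tp_parallel_power) {
    return (tp_api == 1) ? (1 << (tp_parallel_power + 1)) : 1;
}

// Same expression as TP_WINDOW_SIZE in the listing (result is in samples).
constexpr int window_size(int point_size, int repeat) {
    return point_size * repeat;
}

// A cint16 sample holds a 16-bit real part and a 16-bit imaginary part.
constexpr int kBytesPerCint16 = 4;

// The listing creates its ports with plio_64_bits.
constexpr int kPlioBits = 64;

// Settings used by the single-tile design: TP_API=0, TP_PARALLEL_POWER=0.
static_assert(num_ports(0, 0) == 1, "one PLIO port in each direction");
static_assert(window_size(32, 1) == 32, "buffer holds one 32-point transform");
static_assert(window_size(32, 1) * kBytesPerCint16 == 128,
              "32 cint16 samples occupy 128 bytes");

// A 64-bit PLIO word carries two cint16 samples.
static_assert(kPlioBits / (8 * kBytesPerCint16) == 2, "two samples per word");

// What the same expression yields when TP_API is 1.
static_assert(num_ports(1, 0) == 2, "expression gives two ports");
static_assert(num_ports(1, 1) == 4, "expression gives four ports");
```

With REPEAT greater than 1, window_size grows proportionally, for example `window_size(32, 4)` gives 128 samples per buffer.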

The AI Engine graph for fft32_dsplib is shown below. It resembles the graph for fft32_r2, with one visible difference: in fft32_dsplib, the additional data buffers required by the Stockham approach are explicit at the graph level, whereas in fft32_r2 they were kept internal. The REPEAT parameter is set to 1, so the buffer size matches the transform size.

Figure: AI Engine graph for fft32_dsplib