Single Tile AI Engine API Design - 2025.1 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2025-08-25
Version: 2025.1 English

The first FFT design presented here for the Versal AI Engine is implemented using the AI Engine API. The API is a portable programming interface for AI Engine accelerators, delivered as a header-only C++ library whose data types and computational operations are translated into efficient low-level intrinsics and VLIW assembly code for the AI Engine vector processor. The API provides high-level abstractions that improve code readability and simplify code maintenance.
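
To give a feel for the programming style, the short fragment below scales a block of cint16 samples using an AIE API vector load, multiply, and store. This is a minimal illustrative sketch and is not part of the FFT design; the function name and arguments are hypothetical.

#include <adf.h>
#include <aie_api/aie.hpp>

// Hypothetical example of AIE API usage (not part of the FFT design):
// scale a block of cint16 samples by a complex gain, 8 lanes at a time.
void scale_block( const cint16* __restrict in, cint16* __restrict out,
                  unsigned num, cint16 gain )
{
  for (unsigned ii = 0; ii < num; ii += 8) {
    aie::vector<cint16,8> data = aie::load_v<8>(in + ii);   // vector load of 8 samples
    auto acc = aie::mul(data, gain);                        // vector x scalar multiply into an accumulator
    aie::store_v(out + ii, acc.to_vector<cint16>(15));      // shift-round back to cint16 and store
  }
}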

This first design is a Radix-2 DIT 32-pt FFT using the Stockham approach. The design implements the complete transform computation using a single AI Engine tile. The code adopts the C++ Kernel Class programming style that works nicely with the API. The following code block shows the source code for fft32_r2_kernel.h that defines the C++ kernel class.

#pragma once
#include <adf.h>
#include <aie_api/aie.hpp>
#include "fft32_r2_twiddle.h"
using namespace adf;
class fft32_r2_kernel {
public:
  typedef cint16 TT_DATA;
  typedef cint16 TT_TWID;
  static constexpr unsigned N = 32;
  static constexpr unsigned SHIFT_TW = 15;
  static constexpr unsigned SHIFT_DT = 15;
  static constexpr bool     INVERSE  = false;
  static constexpr unsigned REPEAT   = 128;
  static constexpr unsigned WIN_SIZE = N * REPEAT;
private:
  // Temporary Buffers:
  alignas(aie::vector_decl_align) TT_DATA tbuff[N];
  // Twiddle factors:
  alignas(aie::vector_decl_align) static constexpr TT_TWID    tw1[ 1] = TWID1;
  alignas(aie::vector_decl_align) static constexpr TT_TWID    tw2[ 2] = TWID2;
  alignas(aie::vector_decl_align) static constexpr TT_TWID    tw4[ 4] = TWID4;
  alignas(aie::vector_decl_align) static constexpr TT_TWID    tw8[ 8] = TWID8;
  alignas(aie::vector_decl_align) static constexpr TT_TWID   tw16[16] = TWID16;
public:
  // Constructor:
  fft32_r2_kernel(void);
  // Run:
  void run(  input_buffer<TT_DATA,extents<WIN_SIZE> >& __restrict sig_i,
            output_buffer<TT_DATA,extents<WIN_SIZE> >& __restrict sig_o );
  static void registerKernelClass( void )
  {
    REGISTER_FUNCTION( fft32_r2_kernel::run );
  }
};

Some critical aspects of the previous code are highlighted in the following:

  • The data types for this design are defined as TT_DATA and TT_TWID. This FFT uses the 16-bit fixed-point complex type cint16 for both the data path and the twiddle factors. The design declares a data buffer tbuff to provide the additional storage required by the Stockham approach.

  • The run() function defines the kernel interface. This kernel is designed to use buffer-based I/O. The buffer size is set using the WIN_SIZE parameter. Note that the window size depends on the transform size N and the REPEAT parameter, which establishes the “batch processing” used to increase throughput, as described in more detail below.

  • The INVERSE parameter controls whether the kernel computes an FFT (false) or an IFFT (true).

  • The twiddle factor tables are hard-coded to values supplied in the external include file fft32_r2_twiddle.h; a hypothetical sketch of that header is shown after this list. Note how all data tables are aligned to 128-bit boundaries as required to support vector memory access in the AI Engine architecture. The AI Engine API provides aie::vector_decl_align as a future-proof means to supply the correct alignment (should this requirement change in future architectures).
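
The twiddle header itself is not listed in this section. The sketch below shows one plausible layout, assuming each TWIDx macro expands to x Q15-scaled samples of exp(-j*2*pi*k/(2*x)); the actual values and layout in fft32_r2_twiddle.h may differ. With SHIFT_TW set to 15, the stage routine treats these values as carrying 15 fractional bits.

// fft32_r2_twiddle.h -- hypothetical layout, illustrative values only.
// Each TWIDx macro initializes a table of x twiddles, where entry k holds
// the Q15-scaled value of exp(-j*2*pi*k/(2*x)).
#pragma once
#define TWID1  {{ 32767,      0 }}
#define TWID2  {{ 32767,      0 }, {      0, -32768 }}
#define TWID4  {{ 32767,      0 }, {  23170, -23170 }, {      0, -32768 }, { -23170, -23170 }}
// ... TWID8 and TWID16 continue the same pattern with 8 and 16 entries.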

The following code block shows the source code for fft32_r2_kernel.cpp that contains its implementation.

#include <adf.h>
#include <aie_api/aie.hpp>

#include "fft32_r2_kernel.h"

fft32_r2_kernel::fft32_r2_kernel( void )
{
  aie::set_rounding(aie::rounding_mode::positive_inf);
  aie::set_saturation(aie::saturation_mode::saturate);
}

void fft32_r2_kernel::run(  input_buffer<TT_DATA,extents<WIN_SIZE> >& __restrict sig_i,
                           output_buffer<TT_DATA,extents<WIN_SIZE> >& __restrict sig_o )
{
  // Set pointers to windows:
  TT_DATA* ibuff = sig_i.data();
  TT_DATA* obuff = sig_o.data();
  // Perform FFT:
  for (int rr=0; rr < REPEAT; rr++)
    chess_prepare_for_pipelining
    chess_loop_range(REPEAT,)
  {
    aie::fft_dit_r2_stage<16>(ibuff, tw1,  N, SHIFT_TW, SHIFT_DT, INVERSE, tbuff);
    aie::fft_dit_r2_stage< 8>(tbuff, tw2,  N, SHIFT_TW, SHIFT_DT, INVERSE, ibuff);
    aie::fft_dit_r2_stage< 4>(ibuff, tw4,  N, SHIFT_TW, SHIFT_DT, INVERSE, tbuff);
    aie::fft_dit_r2_stage< 2>(tbuff, tw8,  N, SHIFT_TW, SHIFT_DT, INVERSE, ibuff);
    aie::fft_dit_r2_stage< 1>(ibuff, tw16, N, SHIFT_TW, SHIFT_DT, INVERSE, obuff);
    ibuff += N;
    obuff += N;
  }
}

Some critical aspects of the previous code are highlighted in the following:

  • AI Engine rounding and saturation modes are set in the kernel constructor using the AI Engine API.

  • The kernel signature is provided by the run() function. As mentioned earlier, the kernel uses buffer-based I/O. Pointers to the input and output buffers are extracted here using the data() member function of each buffer port.

  • The kernel contains a single “for-loop”. Each iteration of this loop computes a single 32-pt transform. The REPEAT parameter controls how many transforms are computed per kernel invocation. Sufficient input data must be supplied for exactly this many transforms.

  • The fft_dit_r2_stage() function implements each Stockham DIT stage. The design requires five such stages to implement a 32-pt transform. This routine is described in further detail below.

  • The code demonstrates the “ping-pong” nature of the Stockham algorithm outlined earlier. Here, the input buffer (pointed to by ibuff) forms one side, and the tbuff forms the other side of this ping-pong buffer. Each buffer serves alternately as input or output as the stage computations progress.

The fft_dit_r2_stage() routine provided by the API performs radix-2 stages for DIT Stockham FFTs. These are documented fully in [3]. The routine is templatized and parameterized for flexibility to support various data types and transform sizes. The AIE API also supports Radix-3, Radix-4, and Radix-5 stages depending on data types. See [3] for the currently supported routines.
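
As an orientation aid, a paraphrased declaration of the radix-2 stage routine is shown below, with the role of each parameter noted in comments. This is a sketch inferred from the calls made in the kernel above; the header aie_api/fft.hpp and [3] remain the authoritative references for the exact signature and the supported type combinations.

namespace aie {
  // Paraphrased declaration (see [3] for the authoritative version):
  //   Vectorization -- butterfly span of this stage: N/2 for the first stage,
  //                    halving with each subsequent stage (16, 8, 4, 2, 1 for N = 32)
  //   x, y          -- input and output buffers holding n samples
  //   tw            -- twiddle table for this stage
  //   n             -- transform size
  //   shift_tw      -- number of fractional bits in the twiddle factors
  //   shift         -- downshift applied when writing the stage output
  //   inverse       -- false for a forward FFT, true for an inverse FFT
  template <unsigned Vectorization, typename Input, typename Output, typename Twiddle>
  void fft_dit_r2_stage( const Input* x, const Twiddle* tw, unsigned n,
                         unsigned shift_tw, unsigned shift, bool inverse,
                         Output* __restrict y );
}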

The following figure shows the AI Engine graph for this single-tile FFT design. The PLIO connections are made using streams, but these are converted to buffers at the kernel interface. In this case, the REPEAT parameter was set to unity, so the buffer size matches the transform size.

[Figure: AI Engine graph of the single-tile 32-pt FFT design]
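
The graph source itself is not reproduced in this section. The sketch below shows roughly what it could look like, assuming a kernel-class graph with one input and one output PLIO; the file names, PLIO width, and runtime ratio are illustrative assumptions rather than values taken from the tutorial repository.

#include <adf.h>
#include "fft32_r2_kernel.h"

using namespace adf;

// Hypothetical single-tile FFT graph (names and PLIO settings are illustrative):
class fft32_r2_graph : public graph {
public:
  kernel       k_fft;
  input_plio   plio_i;
  output_plio  plio_o;

  fft32_r2_graph( void )
  {
    k_fft  = kernel::create_object<fft32_r2_kernel>();
    plio_i = input_plio::create("PLIO_i", plio_64_bits, "data/sig_i.txt");
    plio_o = output_plio::create("PLIO_o", plio_64_bits, "data/sig_o.txt");

    source(k_fft)         = "fft32_r2_kernel.cpp";
    runtime<ratio>(k_fft) = 0.9;

    // The stream-based PLIOs feed the kernel's buffer interfaces:
    connect( plio_i.out[0], k_fft.in[0] );
    connect( k_fft.out[0],  plio_o.in[0] );
  }
};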