Version: Vitis 2024.1
Introduction
The Vitis DSP Library currently implements the Fast Fourier Transform (FFT) using a Stockham algorithm. This algorithm computes the transform using a pair of buffers rather than the more popular in-place Cooley-Tukey approach, and maps more efficiently to the vector/SIMD architecture of the AI Engine. This approach yields excellent results, particularly for $N < 4096$ where the processing and storage requirements may be serviced by a single tile. In higher performance cases where sampling rates exceed 1 Gsps (i.e., the SSR > 1 regime), this Stockham approach continues to be feasible but leads to an excessive use of tile resources. A more efficient approach is needed for these high-throughput use cases.
A 1D FFT may be implemented on the AI Engine array using a 2D FFT algorithm with higher efficiency overall. This alternative “divide & conquer” approach provides a better solution on the AI Engine array since it is less reliant on “butterfly routing” and we can break large $N$ point sizes into much smaller factors of size $\sqrt{N}$. This results in a significant reduction in AI Engine tile memory and the use of fewer compute tiles overall.
The 2D architecture is not new to AI Engine. XAPP1356 first explored its use several years ago to achieve 3.7 Gsps with 10 AI Engine tiles. In this tutorial, we explore using the 2D architecture to implement large point transforms in the SSR > 1 regime: specifically a 64K-pt IFFT @ 2 Gsps. In this scenario, the design requires both AI Engine tile resources for compute and programmable logic (PL) resources for sample reordering as shown in detail below.
Matlab Model
A Matlab model of the 64K-pt IFFT implemented as a $256 \times 256$ 2D architecture is shown below. The model & algorithm consist of the following steps:
Load the input data in column-major order into a $256 \times 256$ matrix.
Take 256-pt transforms along each row.
Perform a point-wise multiplication of the data matrix with a “twiddle matrix” (see code below for its detailed definition).
Take 256-pt transforms along each column.
Extract the output data in row-major order.
function [ifft_o] = ifft64k_256x256_model( ifft_i )
   N = numel(ifft_i);
   Nhalf = sqrt(N);
   assert( N == 65536 );
   % Need to use DIT when computing transforms up front:
   twid = complex(zeros(Nhalf,Nhalf),zeros(Nhalf,Nhalf));
   for rr = 0 : Nhalf-1
     for cc = 0 : Nhalf-1
       twid(1+rr,1+cc) = exp(+1i*2*pi*rr*cc/N);
     end
   end
   % First compute 256-pt transforms along rows:
   data0 = reshape(ifft_i,Nhalf,Nhalf);
   data1 = Nhalf*ifft(data0,Nhalf,2);
   % Apply pointwise twiddle matrix:
   % --> Note: first row and first column are multiplication by unity
   data2 = data1 .* twid;
   % Second compute 256-pt transforms along columns:
   data3 = Nhalf*ifft(data2,Nhalf,1);
   ifft_o = reshape(data3.',1,[]);
end
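As a quick sanity check, the model can be compared against MATLAB's built-in ifft; a minimal sketch is given below (note the model computes the unnormalized inverse DFT, so the built-in result is scaled by $N$):

% Sanity check of the 2D model against MATLAB's built-in ifft (unnormalized):
x     = complex(randn(1,65536), randn(1,65536));
y_ref = 65536 * ifft(x);                      % unnormalized inverse DFT of the full 64K vector
y_mod = ifft64k_256x256_model(x);
fprintf('max abs error = %g\n', max(abs(y_mod - y_ref)));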
Design Overview
This section provides an overview of the 64K-pt IFFT @ 2 Gsps using both AI Engine and PL resources.
Design Approach
We build this design using a combination of AI Engine tiles and PL resources as follows. Based on the five algorithm steps identified earlier, we map the row and column 256-pt transforms onto the AI Engine array. The goal here is to minimize compute resources by using the fewest tiles possible while still meeting the 2 Gsps throughput requirement. The AI Engine array also performs the point-wise twiddle multiplications in the third step. For a streaming design such as this, the act of performing “row” and “column” transforms translates into a requirement for a “memory transpose” operation. Between the “row” and “column” transforms, we must stream the samples into a storage buffer in row-major order and then extract the samples in column-major order. This is done over a number of parallel streams, where the number of streams is chosen to meet the overall throughput requirement. Based on these concepts, the design consists of a “front-end” AI Engine subgraph performing “row” transforms, a “back-end” AI Engine subgraph performing “column” transforms, and a “memory transpose” operation located in the PL in between.
IFFT-256 Prototyping
Based on this system architecture concept, it only remains to determine how many AI Engine resources are required to implement the (i) “row” transforms, (ii) “point-wise twiddle” multiplications, and (iii) “column” transforms. We do some early prototyping of the AI Engine subgraphs to establish the number of instances required. We identify two separate subgraphs to consider:
One “front-end” subgraph performing a “row” IFFT-256 followed by a “pointwise-twiddle” multiplication of the samples on that row, followed by zero-insertion.
One “back-end” subgraph performing a “column” IFFT-256 followed by zero-insertion.
The zero-insertion allows the design of the “memory transpose” in PL to be simplified as is outlined in detail below.
The throughput of prototypes of these two subgraphs identifies the number of instances of each required to achieve our overall throughput target of 2 Gsps. The figure below shows traces in Vitis Analyzer for the front-end subgraph. This design is hand-coded using the AIE API, combining all three functions into a single-tile design with a small memory footprint. Its throughput is 592 ns or ~430 Msps. Based on a target throughput of 2 Gsps, we need five instances of this subgraph in the overall design.
The figure below shows traces in Vitis Analyzer for the back-end subgraph. This design is also hand-coded using the AIE API and combines the IFFT-256 with zero-padding into a second single-tile design with a small memory footprint. Its throughput is 422.4 ns or ~600 Msps. Based on a target throughput of 2 Gsps, we need four instances of this subgraph in the overall design.
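For reference, these instance counts follow directly from the measured single-tile rates and the 2 Gsps target: $\lceil 2000 / 430 \rceil = 5$ front-end instances and $\lceil 2000 / 600 \rceil = 4$ back-end instances (rates in Msps).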
Based on these prototyping efforts, we arrive at the final design architecture shown in the diagram below. The design uses five instances each of the front-end and back-end subgraphs. We use five instances of the back-end subgraph even though only four are required because this simplifies the overall design architecture. These instances are time-shared over all transform operations required by the 2D algorithm. However, we require only 256 transforms in each “row” and “column” dimension, and this number is not divisible by 5. Consequently, we zero-pad the 2D data cube by appending four rows at the bottom and four columns at the right to create a $260 \times 260$ data cube. It then follows that each AI Engine tile instance performs 52 transforms, for both the “front-end” and “back-end” subgraphs. This also means the design supports five I/O streams into and out of each subgraph, and the same applies to the “memory transpose” operation in the PL. An important side effect of this zero-padding is that it simplifies the construction of that PL design: it may then be implemented using a 5-bank memory architecture, as outlined in detail below.
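The per-instance workload then follows from the padded dimension: $(256 + 4)/5 = 260/5 = 52$ row (or column) transforms per subgraph instance.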
Front-End IFFT-256 AI Engine Kernel
The front-end IFFT-256 kernel performs three functions: a 256-pt IFFT, a point-wise twiddle multiplication, and zero-padding with four samples. The IFFT is built using a Stockham algorithm hand-coded using the AIE API. The point-wise twiddle multiplication is computed using the vectorized aie::mul() API. The twiddle factors themselves are generated using the non-linear hardware accelerator connected to the scalar unit and vectorized by multiplying each scalar twiddle by an 8-sample vector containing consecutive twiddle phase steps. Finally, the zero-padding is performed by simply writing a null vector to the output buffer.
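The following MATLAB sketch illustrates this vectorization scheme for a single row of the twiddle matrix. The row index and the 8-lane chunking here are illustrative assumptions only; the actual kernel computes the scalar phasors with the non-linear accelerator and forms the products using the AIE API.

% Illustrative sketch: build one row of the twiddle matrix in 8-sample chunks.
N    = 65536;
rr   = 37;                               % example row index (illustrative)
step = exp(+1i*2*pi*rr*(0:7)/N);         % 8 consecutive phase steps for this row
twrow = zeros(1,256);
for cc0 = 0:8:248
  base = exp(+1i*2*pi*rr*cc0/N);         % one scalar twiddle per 8-sample chunk
  twrow(1+cc0 : cc0+8) = base * step;    % scalar times 8-sample vector of phase steps
end
% twrow now equals exp(+1i*2*pi*rr*(0:255)/N) to machine precision.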
The figure below shows the AI Engine graph for the front-end IFFT-256 kernel. The design uses I/O buffers for simplicity and a scratchpad to act as a temporary buffer for the transform computation. Twiddle factors are stored in arrays as part of the system memory. A single tile is required. The full design uses five instances of this graph overall.
Memory Transpose PL Kernel
The memory transpose PL kernel writes input samples arriving from the front-end transforms in row-major order and then delivers samples to the back-end transforms by reading in column-major order. This must be sustained via streaming over five I/O streams at full speed to achieve the 2.0 Gsps throughput target overall. The key challenge for this kernel is that we need to partition the $256 \times 256$ data cube by a factor of 5 and read/write 10 samples per PL clock cycle (where we assume the PL is clocked at 312.5 MHz, or 4 times slower than the AI Engine array).
We illustrate the design concept used by the memory transpose PL kernel using a smaller $16 \times 16$ example shown in the figure below. Note how we have zero-padded the 2D array with four extra rows at the bottom and four extra columns on the right. This new $20 \times 20$ data cube is divisible by the 5 I/O streams we wish to use concurrently. The entire data cube is partitioned into 5 separate banks, each containing 80 samples. Each bank is identified with a unique color in the figure below. Note how we can write the memory from left to right into consecutive rows at the same time with no bank contention: each write targets a different color. Similarly, we can read the memory from top to bottom from consecutive columns at the same time with no bank contention. This scheme allows us to establish the sustained bandwidth @ 2 Gsps needed to feed the five instances each of the front-end and back-end AI Engine subgraphs. This same concept is used by the Memory Transpose PL kernel but for its larger $256 \times 256$ size. Also note that we must read/write 10 samples per PL cycle, so each memory bank must be dual-ported.
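One bank mapping with exactly this property is the cyclic assignment $\text{bank} = (\text{row} + \text{col}) \bmod 5$. The MATLAB sketch below checks it on the $20 \times 20$ example; this mapping is an illustrative assumption, not necessarily the exact one used by the HLS kernel.

% Cyclic (diagonal) bank assignment for the zero-padded 20 x 20 example:
R = 20;  C = 20;  NB = 5;
bank = mod( (0:R-1).' + (0:C-1), NB );           % bank(1+r,1+c) = mod(r+c,5)
% Each of the 5 banks holds 400/5 = 80 samples:
assert( all( sum( bank(:) == (0:NB-1) ) == 80 ) );
% Writing 5 consecutive rows at the same column hits 5 distinct banks:
assert( numel(unique( bank(1:NB, 7) )) == NB );
% Reading 5 consecutive columns at the same row also hits 5 distinct banks:
assert( numel(unique( bank(7, 1:NB) )) == NB );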
We implement the Memory Transpose PL kernel using HLS @ 312.5 MHz. The resource utilization and timing from out-of-context synthesis & place-and-route are given in the figure below.
Back-End IFFT-256 AI Engine Kernel
The design of the back-end IFFT-256 is a simplified version of the front-end design. It implements only the transform and the zero-padding; there is no point-wise twiddle multiplication. The graph looks nearly identical to the front-end design shown earlier. The figure below shows the physical array view of the back-end design. It also uses I/O buffers and requires a memory footprint that spills over to the neighboring tile.
Design Resources
The figure below summarizes the AI Engine and PL resources required to implement the design in the VC1902 device on the VCK190 eval board. The design uses 10 AI Engine tiles for compute and 17 tiles for buffering. The PL design includes the resources required to implement the DMA Source, Memory Transpose, and DMA Sink kernels. This design illustrates how Versal AI Engine and PL resources may be combined to create a high-performance, tightly coupled custom data path tailored directly to the algorithm of interest.