Prime Factor FFT-1008 on AIE-ML - 2024.1 English

Vitis Tutorials: AI Engine

Document ID: XD100
Release Date: 2024-10-30
Version: 2024.1 English



AI Engine Development

See Vitis™ Development Environment on xilinx.com
See Vitis™ AI Development Environment on xilinx.com

Prime Factor FFT-1008 on AIE-ML

Version: Vitis 2024.1

Table of Contents

  1. Introduction

  2. Matlab Models

  3. Design Overview

  4. Design Resources

  5. Build and Run Design

References

Support

License

Introduction

The Prime Factor Algorithm (PFA) [1] is a Fast Fourier Transform (FFT) algorithm [2] discovered by Good & Thomas before the more popular Cooley-Tukey algorithm, and it has some interesting properties. The PFA is another “divide and conquer” approach for computing a Discrete Fourier Transform (DFT) of size $N = N_1 \cdot N_2$ as a two-dimensional DFT of size $N_1 \times N_2$, as long as $N_1$ and $N_2$ are relatively prime (i.e., share no common factor other than 1). The smaller transforms of size $N_1$ and $N_2$ may be computed by some other technique, for example the Winograd FFT Algorithm, or the PFA technique may be applied recursively to both $N_1$ and $N_2$. It turns out that Versal AI Engines compute DFTs of small dimension $N < 32$ very efficiently using direct vector-matrix multiplication. Consequently, the PFA approach, using small DFTs on the individual prime factors, provides an efficient way to compute the FFT on Versal AI Engines.

A second advantage of the PFA approach is that, unlike the popular Cooley-Tukey FFT, no extra multiplications by “twiddle factors” need be performed between stages. This falls out of the DFT factorization when $N_1$ and $N_2$ share no common factors. It provides a computational advantage over the more traditional Cooley-Tukey formulation, but the PFA incurs a drawback: a complicated re-indexing, or permutation, of its I/O samples is required. For Versal devices with both AI Engines and Programmable Logic (PL), however, this drawback is solved easily by leveraging the PL to implement these permutations as part of a custom data flow tailored to the PFA signal flow graph.
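For reference, one standard way to write the two-factor index mapping (generic notation, not taken from the tutorial's models) is:

$$
n = (N_2 n_1 + N_1 n_2) \bmod N, \qquad
k = (N_2 q_2 k_1 + N_1 q_1 k_2) \bmod N,
$$

where $q_1 = N_1^{-1} \bmod N_2$ and $q_2 = N_2^{-1} \bmod N_1$. With these maps the cross terms in the product $nk$ are multiples of $N$ and drop out, leaving

$$
X[k(k_1,k_2)] = \sum_{n_2=0}^{N_2-1} \left( \sum_{n_1=0}^{N_1-1} x[n(n_1,n_2)]\, W_{N_1}^{n_1 k_1} \right) W_{N_2}^{n_2 k_2},
\qquad W_M = e^{-j2\pi/M},
$$

which is exactly a 2D DFT of size $N_1 \times N_2$ with no inter-stage twiddle factors; the only cost is the permuted input/output indexing.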

An earlier tutorial implemented a PFA-1008 transform on the AI Engine architecture in the VC1902 device. This tutorial maps the PFA-1008 transform to the AIE-ML architecture in the VE2802 device. Once again, the short-length DFT-7, DFT-9, and DFT-16 transforms are mapped to AI Engine tiles using vector-matrix DFTs, but this time on the AIE-ML architecture. The intermediate “memory transpose” operations mapped earlier to the programmable logic (PL) are instead mapped here to the Memory Tiles contained in the AIE-ML array. This simplifies the data flow and keeps most of the graph inside the array. The input and output permutation blocks remain implemented in the PL as RTL obtained using Vitis High-Level Synthesis (HLS) from untimed C++ models. These cannot be mapped to Memory Tiles because they require a type of modulo addressing not supported by the Memory Tile buffer descriptors (BDs).

Matlab Models

This tutorial relies on the same Matlab models as the original tutorial. These models have been replicated in the matlab folder of this repository. They cover both the signal processing functions and the I/O permutation and matrix transpose addressing operations, and all remain identical to the originals.

Design Overview

The figure below shows a block diagram of a 3D PFA-1008 hardware design implemented in Versal using AI Engines and PL. The design targets a 1 Gsps throughput (SSR=1). AI Engines implement the three DFT kernels, specifically DFT-7, DFT-9 and DFT-16, using a vector-matrix multiplication approach. The design implements the matrix transpose kernels in the AI Engine array using Memory Tiles, and maps the I/O permutation blocks to PL using Vitis HLS.

figure1

Some details on each kernel design are given in the sections below.

INPUT PERMUTE Kernel

This PL kernel is implemented in HLS @ 312.5 MHz (SSR=4). Samples arriving on one 128-bit stream are written into a ping/pong buffer in 4X duplicated fashion. This is required since we must read or write four samples per cycle. The input permutation $P_i$ is stored in a LUT (again with 4X duplication) so the samples may be read back in the required permuted order. The latency of the design is 1008/4 cycles due to its ping/pong nature. A single output stream delivers consecutive 7-pt transforms to the AI Engine array.
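The sketch below illustrates the 4X-duplicated buffering and LUT-driven read-back just described. It is illustrative only: the function name, interface, and loop structure are assumptions rather than the tutorial's HLS source, and the ping/pong double-buffering is omitted for brevity, so a single 1008-sample frame is written and then read back in permuted order.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

void input_permute(hls::stream<ap_uint<128> >& s_in,
                   hls::stream<ap_uint<128> >& s_out,
                   const ap_uint<10> perm[4][1008])   // permutation LUT, 4X duplicated
{
    // Four duplicate copies of the frame (one per output lane) so that four
    // arbitrarily permuted cint16 samples can be fetched every cycle.
    ap_uint<128> buff[4][252];
#pragma HLS array_partition variable=buff dim=1 complete

WRITE:
    for (int i = 0; i < 252; i++) {
#pragma HLS pipeline II=1
        ap_uint<128> w = s_in.read();
        for (int c = 0; c < 4; c++)
            buff[c][i] = w;                            // 4X duplication on write
    }

READ:
    for (int i = 0; i < 252; i++) {
#pragma HLS pipeline II=1
        ap_uint<128> w;
        for (int c = 0; c < 4; c++) {
            ap_uint<10> p  = perm[c][4*i + c];         // permuted sample index
            int         lo = 32 * int(p & 3);          // sample position within 128-bit word
            w.range(32*c + 31, 32*c) = buff[c][p >> 2].range(lo + 31, lo);
        }
        s_out.write(w);
    }
}
```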

The figure below shows the input permutation required by the PFA-1008 design. The permutation ordering may be considered as a 3D mapping with $R=7$ rows, $C=9$ columns, and a depth of $D=16$. The $R$ dimension is drawn horizontally, the $C$ dimension vertically, and the $D$ dimension is identified by the “Tile-N” labels in the figure. The required permutation may be computed as $P = \mathrm{mod}(C \cdot D \cdot r + R \cdot D \cdot c + R \cdot C \cdot d,\,1008)$, where $r$, $c$, and $d$ index the row, column, and depth dimensions. Note that this pattern cannot be generated automatically using the DMA buffer descriptors of the AI Engine Memory Tiles since it contains the “modulo 1008” operation, which is not supported by the hardware. For this reason, the I/O permutation kernels are mapped to the PL.

figure2
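For reference, the permutation table could be generated on the host from the formula above roughly as follows. The function name and the loop nesting order (depth outermost, rows innermost so that consecutive 7-point transforms appear back-to-back) are assumptions for illustration.

```cpp
#include <array>

// Hypothetical helper: builds the PFA-1008 input permutation from the
// 3D index map P = (C*D*r + R*D*c + R*C*d) mod 1008 described above.
std::array<int, 1008> make_input_perm()
{
    constexpr int R = 7, C = 9, D = 16;          // rows, columns, depth
    std::array<int, 1008> P{};
    int i = 0;
    for (int d = 0; d < D; d++)                  // depth ("Tile-N" in the figure)
        for (int c = 0; c < C; c++)              // columns
            for (int r = 0; r < R; r++)          // rows: one 7-point transform per pass
                P[i++] = (C*D*r + R*D*c + R*C*d) % (R*C*D);
    return P;
}
```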

FFT-7 Kernel

This AI Engine kernel implements the DFT-7 using the 1x4x8 aie::mmul() API, which takes 2 to 3 cycles per operation. The complete DFT-7 compute may be partitioned over two compute tiles, one computing the green portion and a second computing the blue portion identified in the figure below. The API computes a [1x4] x [4x8] matrix multiply, so the DFT-7 must be padded with an extra row and column of zeros. Each transform must be computed within 7 cycles to sustain SSR=1; the actual API code computes 8 transforms in less than 56 cycles.

The algorithm is vectorized using a pair of 32-lane registers in each AIE-ML core. A set of eight vector reads fully populates these registers with data from eight consecutive transforms. Once populated, a set of aie::shuffle_up() API operations positions the data in the proper lanes for the computations performed by the aie::mmul() API routine.

The outputs from eight consecutive transforms are then packed together into seven 8-lane vectors using a third tile that performs this output combining. Here, the aie::shuffle_up() and aie::shuffle_down() APIs are used to perform the packing.

figure3
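To make the compute pattern concrete, a minimal sketch of one such [1x4] x [4x8] product is shown below, assuming cint16 data and coefficients. The function and buffer names are illustrative only, not the tutorial's kernel source.

```cpp
#include <aie_api/aie.hpp>

// One [1x4] x [4x8] tile of the zero-padded DFT matrix product.
using MMUL = aie::mmul<1, 4, 8, cint16, cint16>;

void dft_tile(const cint16* __restrict x,      // 4 input samples (one padded row)
              const cint16* __restrict coef,   // [4x8] slice of the DFT matrix
              cint16*       __restrict y)      // 8 partial outputs
{
    aie::vector<cint16, 4>  va = aie::load_v<4>(x);
    aie::vector<cint16, 32> vb = aie::load_v<32>(coef);

    MMUL m;
    m.mul(va, vb);                             // one 1x4x8 multiply

    aie::store_v(y, m.to_vector<cint16>(15));  // shift-round back to cint16
}
```

In the full kernel, further products are accumulated with mac() to cover the remaining rows of the padded DFT matrix, split across the two compute tiles as shown in the figure.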

TRANSPOSE-0 Kernel

This AIE kernel implements the matrix transpose operation required to feed the proper 9-point input samples to the DFT-9 along the second dimension of the 3D cube. In AIE-ML technology, this matrix transpose may be implemented completely within the array using a Memory Tile, eliminating the need to exit the array and perform the operation in the PL.

Buffer descriptors control the sample ordering employed by the Memory Tile on both input and output. A “write BD” controls the sample ordering on input to the Memory Tile. The write BD is configured using the ADF graph programming model shown below. The 3D pattern required here has dimensions $\{7,9,16\}$. This is configured with a buffer_dimension of $\{8,16,16\}$ since we must ensure alignment to the 32-bit boundaries of the Memory Tile. The write address pattern is linear in dimensions 7, 9, and 16, so the tile_traversal is configured in this order, with address wrapping occurring at dimensions $0,1,2$.

tiling_parameters write_bd = {
      .buffer_dimension = {8,16,16},   // {7,9,16} data padded for Memory Tile alignment
      .tiling_dimension = {1,1,1},
      .offset = {0,0,0},
      .tile_traversal = {{.dimension=0, .stride=1, .wrap=7},     // innermost: R = 7
                         {.dimension=1, .stride=1, .wrap=9},     // then:      C = 9
                         {.dimension=2, .stride=1, .wrap=16}} }; // outermost: D = 16

The read BD is configured using a similar ADF graph programming model. In this case, the 3D pattern required has dimensions $\{9,16,7\}$ since data is sent first along the second dimension $N_2=9$, electing to process $N_3=16$ second and $N_1=7$ last. This is configured with a tile_traversal along dimensions $1,2,0$, with wrapping applied as before.
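A sketch of the corresponding read BD, written in the same style as the write BD above and assuming the wrap values simply carry over from the write pattern, is shown here; consult the design source for the exact settings.

```cpp
tiling_parameters read_bd = {
      .buffer_dimension = {8,16,16},
      .tiling_dimension = {1,1,1},
      .offset = {0,0,0},
      .tile_traversal = {{.dimension=1, .stride=1, .wrap=9},     // N2 = 9 first
                         {.dimension=2, .stride=1, .wrap=16},    // N3 = 16 second
                         {.dimension=0, .stride=1, .wrap=7}} };  // N1 = 7 last
```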

Finally, a repetition_count of 8 is specified because both the write BD and the read BD must be repeated eight times to match the number of transforms computed per kernel invocation by each DFT-7, DFT-9, and DFT-16 AIE compute kernel. The num_buffers is set to 2 because a ping/pong buffer arrangement is required here to support a fully streaming data flow.
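Putting these pieces together, the Memory Tile configuration takes roughly the following shape in the ADF graph. The buffer name is illustrative and the exact graph syntax should be checked against the design source.

```cpp
// Sketch of the TRANSPOSE-0 shared buffer configuration (names illustrative).
shared_buffer<cint16> trans0 = shared_buffer<cint16>::create({8,16,16}, 1, 1);

num_buffers(trans0) = 2;                         // ping/pong for streaming operation
repetition_count(trans0.in[0])  = 8;             // 8 transforms per kernel invocation
repetition_count(trans0.out[0]) = 8;
write_access(trans0.in[0])  = tiling(write_bd);  // linear {7,9,16} write pattern above
read_access(trans0.out[0])  = tiling(read_bd);   // permuted {9,16,7} read pattern above
```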