Stockham Fast Fourier Transform - Stockham Fast Fourier Transform - 2025.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2026-03-27
Version
2025.2 English

The Cooley-Tukey FFT factorizations have one significant drawback: the need for bit reversal operations to establish natural sample order at either the input or output. Traditional hardware solutions manage these bit-reversal permutations cheaply. However, they become expensive when implementing FFTs on coarse-grained CPU or GPU architectures due to incoherent memory access.

The Stockham FFT targets vector computers with a specific factorization that achieves maximal vector parallelism. Graphics processors (GPUs) have widely adopted the Stockham FFT as their method of choice [4]. The Versal AI Engine and its SIMD vector pipeline architecture benefit equally from this approach.

The Cooley-Tukey FFT implements all computations using the “in-place” approach identified earlier but requires costly bit-reversal operations. In contrast, the Stockham FFT expends additional storage to achieve the natural ordering of I/O samples.

The following figure shows a signal flow graph for a Stockham Radix-2 DIF 8-point transform. Its structure is like earlier Cooley/Tukey formulations. The number of stages and butterflies is identical. Still, the I/O addressing of each butterfly differs. The computation is “out-of-place” and requires an extra storage buffer. The addressing permutation changes at each stage but resolves itself such that both the input and output of the signal flow graph maintain a natural ordering.

figure

Unique variants of the Stockham algorithm for vector computing leverage additional tricks to improve performance, like refactoring the algorithm to replace “power-of-two” memory strides with unit strides in the access of twiddle factors and by precomputing twiddle factors in separate contiguous memory.

The following figure shows a signal flow graph for the Stockham Radix-2 DIT 8-point transform. Its structure resembles the earlier one but uses “out-of-place” butterflies. The permutation order is reversed (compared to the Stockham DIF version) to achieve a natural I/O ordering.

figure

In practice, the Stockham FFT often uses two buffers, X and Y, in a ping-pong fashion. Consider this in the context of the previous figures: data enters buffer X in the first stage, and the stage computes butterfly outputs and stores them in buffer Y. The second stage then computes outputs from inputs in Y and stores them back into X. Finally, the third stage computes outputs from inputs in X and stores them back into Y.

The following code block provides a MATLAB implementation of the Stockham DIT FFT algorithm. Like the Cooley/Tukey formulation in FFT Code, the algorithm consists of a triple nested “for-loop” with the outer loop running over the \(S=\log_2(N)\) stages of butterflies. The code clearly shows its “out-of-place” character: input addresses are separated by one stride, whereas output addresses are separated by “Nhalf” strides, and the stride changes on each stage. The code also clearly shows the “ping/pong” nature of the Stockham approach. The twiddle factor is applied before the sum and difference computations. A following section revisits this when discussing Versal AI Engine hardware support for FFT.

function [fft_o] = fft_stockham_dit(fft_i)
   N = numel(fft_i);
   Nstage = log2(N);
   buff_i = fft_i;
   buff_t = zeros(size(buff_i));
   for stage = Nstage : -1 : 1
     Nhalf = N / 2^stage;
     twid = exp(-1i*2*pi*[0:Nhalf-1]/(2*Nhalf));
     stride = 2^(stage-1);
     for p = 0 : Nhalf-1
       for q = 0 : stride-1
         % Note: address is not "in-place"
         idx0_i = q + stride*(2*p+0);
         idx1_i = q + stride*(2*p+1);
         idx0_o = q + stride*(p+0    );
         idx1_o = q + stride*(p+Nhalf);
         a = buff_i(1+idx0_i);
         b = buff_i(1+idx1_i) * twid(1+p);
         buff_t(1+idx0_o) = a + b;
         buff_t(1+idx1_o) = a - b;
       end
     end
     buff_i = buff_t;
   end
   fft_o = buff_i;
end

The following code block provides a MATLAB implementation of the Stockham DIF FFT algorithm. Like the Cooley/Tukey formulation in FFT Code, the algorithm consists of a triple nested “for-loop” with the outer loop running over the \(S=\log_2(N)\) stages of butterflies. Comments like those made for the DIT version earlier can be made here. The butterfly computation occurs “out-of-place” in this DIF variant. The twiddle factors are applied after the butterfly computation (as compared to before the butterfly in the DIT case).

function [fft_o] = fft_stockham_dif(fft_i)
   N = numel(fft_i);
   Nstage = log2(N);
   buff_i = fft_i;
   buff_t = zeros(size(buff_i));
   for stage = 1 : Nstage
     Nhalf = N / 2^stage;
     twid = exp(-1i*2*pi*[0:Nhalf-1]/(2*Nhalf));
     stride = 2^(stage-1);
     for p = 0 : Nhalf-1
       for q = 0 : stride-1
         % Note: address is not "in-place"
         idx0_i = q + stride*(p+0    );
         idx1_i = q + stride*(p+Nhalf);
         idx0_o = q + stride*(2*p+0);
         idx1_o = q + stride*(2*p+1);
         a = buff_i(1+idx0_i);
         b = buff_i(1+idx1_i);
         buff_t(1+idx0_o) =  a + b;
         buff_t(1+idx1_o) = (a - b) * twid(1+p);
       end
     end
     buff_i = buff_t;
   end
   fft_o = buff_i;
end

Copyright © 2020–2026 Advanced Micro Devices, Inc.

Terms and Conditions