The Cooley-Tukey FFT factorizations have one significant drawback: the need for bit reversal operations to establish natural sample order at either the input or output. Traditional hardware solutions manage these bit-reversal permutations cheaply. However, they become expensive when implementing FFTs on coarse-grained CPU or GPU architectures due to incoherent memory access.
The Stockham FFT targets vector computers with a specific factorization that achieves maximal vector parallelism. Graphics processors (GPUs) have widely adopted the Stockham FFT as their method of choice [4]. The Versal AI Engine and its SIMD vector pipeline architecture benefit equally from this approach.
The Cooley-Tukey FFT implements all computations using the “in-place” approach identified earlier but requires costly bit-reversal operations. In contrast, the Stockham FFT expends additional storage to achieve the natural ordering of I/O samples.
The following figure shows a signal flow graph for a Stockham Radix-2 DIF 8-point transform. Its structure is like earlier Cooley/Tukey formulations. The number of stages and butterflies is identical. Still, the I/O addressing of each butterfly differs. The computation is “out-of-place” and requires an extra storage buffer. The addressing permutation changes at each stage but resolves itself such that both the input and output of the signal flow graph maintain a natural ordering.
Unique variants of the Stockham algorithm for vector computing leverage additional tricks to improve performance, like refactoring the algorithm to replace “power-of-two” memory strides with unit strides in the access of twiddle factors and by precomputing twiddle factors in separate contiguous memory.
The following figure shows a signal flow graph for the Stockham Radix-2 DIT 8-point transform. Its structure resembles the earlier one but uses “out-of-place” butterflies. The permutation order is reversed (compared to the Stockham DIF version) to achieve a natural I/O ordering.
In practice, the Stockham FFT often uses two buffers, X and Y, in a ping-pong fashion. Consider this in the context of the previous figures: data enters buffer X in the first stage, and the stage computes butterfly outputs and stores them in buffer Y. The second stage then computes outputs from inputs in Y and stores them back into X. Finally, the third stage computes outputs from inputs in X and stores them back into Y.
The following code block provides a MATLAB implementation of the Stockham DIT FFT algorithm. Like the Cooley/Tukey formulation in FFT Code, the algorithm consists of a triple nested “for-loop” with the outer loop running over the \(S=\log_2(N)\) stages of butterflies. The code clearly shows its “out-of-place” character: input addresses are separated by one stride, whereas output addresses are separated by “Nhalf” strides, and the stride changes on each stage. The code also clearly shows the “ping/pong” nature of the Stockham approach. The twiddle factor is applied before the sum and difference computations. A following section revisits this when discussing Versal AI Engine hardware support for FFT.
function [fft_o] = fft_stockham_dit(fft_i)
N = numel(fft_i);
Nstage = log2(N);
buff_i = fft_i;
buff_t = zeros(size(buff_i));
for stage = Nstage : -1 : 1
Nhalf = N / 2^stage;
twid = exp(-1i*2*pi*[0:Nhalf-1]/(2*Nhalf));
stride = 2^(stage-1);
for p = 0 : Nhalf-1
for q = 0 : stride-1
% Note: address is not "in-place"
idx0_i = q + stride*(2*p+0);
idx1_i = q + stride*(2*p+1);
idx0_o = q + stride*(p+0 );
idx1_o = q + stride*(p+Nhalf);
a = buff_i(1+idx0_i);
b = buff_i(1+idx1_i) * twid(1+p);
buff_t(1+idx0_o) = a + b;
buff_t(1+idx1_o) = a - b;
end
end
buff_i = buff_t;
end
fft_o = buff_i;
end
The following code block provides a MATLAB implementation of the Stockham DIF FFT algorithm. Like the Cooley/Tukey formulation in FFT Code, the algorithm consists of a triple nested “for-loop” with the outer loop running over the \(S=\log_2(N)\) stages of butterflies. Comments like those made for the DIT version earlier can be made here. The butterfly computation occurs “out-of-place” in this DIF variant. The twiddle factors are applied after the butterfly computation (as compared to before the butterfly in the DIT case).
function [fft_o] = fft_stockham_dif(fft_i)
N = numel(fft_i);
Nstage = log2(N);
buff_i = fft_i;
buff_t = zeros(size(buff_i));
for stage = 1 : Nstage
Nhalf = N / 2^stage;
twid = exp(-1i*2*pi*[0:Nhalf-1]/(2*Nhalf));
stride = 2^(stage-1);
for p = 0 : Nhalf-1
for q = 0 : stride-1
% Note: address is not "in-place"
idx0_i = q + stride*(p+0 );
idx1_i = q + stride*(p+Nhalf);
idx0_o = q + stride*(2*p+0);
idx1_o = q + stride*(2*p+1);
a = buff_i(1+idx0_i);
b = buff_i(1+idx1_i);
buff_t(1+idx0_o) = a + b;
buff_t(1+idx1_o) = (a - b) * twid(1+p);
end
end
buff_i = buff_t;
end
fft_o = buff_i;
end
Copyright © 2020–2026 Advanced Micro Devices, Inc.