One significant drawback of the Cooley-Tukey FFT factorizations is the need for the bit reversal operations required to establish natural sample order at either the input or output. These bit-reversal permutations can be managed cheaply in traditional hardware (FPGA fabric) solutions but are expensive when implementing FFTs on coarse-grained CPU or GPU architectures due to incoherent memory access. Stockham FFT was developed to target vector computers whose specific factorization attempts to achieve maximal vector parallelism. For this reason, the Stockham FFT has undergone broad adoption as the method of choice for graphics processors (GPUs) [4]. This situation applies equally well to the Versal AI Engine and its SIMD vector pipeline architecture.
Unlike the Cooley-Tukey FFT, which implements all computations using the “in-place” approach identified earlier but requires costly bit-reversal operations, the Stockham FFT expends additional storage to achieve the natural ordering of I/O samples.
The following figure shows a signal flow graph for a Stockham Radix-2 DIF 8-point transform. Its structure is like earlier Cooley/Tukey formulations. The number of stages and butterflies is identical. Still, the I/O addressing of each butterfly differs – the computation is “out-of-place” and requires an extra storage buffer. The addressing permutation changes at each stage but resolves itself such that a natural ordering is maintained at both the input and output of the signal flow graph.
Unique variants of the Stockham algorithm for vector computing leverage additional tricks to improve performance, like refactoring the algorithm to replace “power-of-two” memory strides with unit strides in the access of twiddle factors and by precomputing twiddle factors in separate contiguous memory.
The following figure shows a signal flow graph for the Stockham Radix-2 DIT 8-point transform. Once again, its structure is like the earlier one but uses “out-of-place” butterflies with a reversed (as compared to the Stockham DIF version) permutation order to achieve a natural I/O ordering.
In practice, the Stockham FFT is often used using two buffers, X and Y, in a ping-pong fashion. This can be imagined in the context of the previous figures, wherein data is input into buffer X in the first stage, and the butterfly outputs are computed and stored in buffer Y. The output of the second stage is then computed from inputs in Y and stored back into X. Finally, the third stage outputs are computed from inputs in X and stored back into Y.
The following code block provides a MATLAB implementation of the Stockham DIT FFT algorithm. Like the Cooley/Tukey formulation in FFT Code, the algorithm consists of a triple nested “for-loop” with the outer loop running over the $S=\log_2(N)$ stages of butterflies. The code clearly shows its “out-of-place” character – input addresses are separated by one stride, whereas output addresses are separated by “Nhalf” strides, and the stride changes on each stage. The code also clearly shows the “ping/pong” nature of the Stockham approach. The twiddle factor is applied before the sum and difference computations. This fact will be seen again below regarding Versal AI Engine hardware support for FFT.
function [fft_o] = fft_stockham_dit(fft_i)
N = numel(fft_i);
Nstage = log2(N);
buff_i = fft_i;
buff_t = zeros(size(buff_i));
for stage = Nstage : -1 : 1
Nhalf = N / 2^stage;
twid = exp(-1i*2*pi*[0:Nhalf-1]/(2*Nhalf));
stride = 2^(stage-1);
for p = 0 : Nhalf-1
for q = 0 : stride-1
% Note: address is not "in-place"
idx0_i = q + stride*(2*p+0);
idx1_i = q + stride*(2*p+1);
idx0_o = q + stride*(p+0 );
idx1_o = q + stride*(p+Nhalf);
a = buff_i(1+idx0_i);
b = buff_i(1+idx1_i) * twid(1+p);
buff_t(1+idx0_o) = a + b;
buff_t(1+idx1_o) = a - b;
end
end
buff_i = buff_t;
end
fft_o = buff_i;
end
The following code block provides a MATLAB implementation of the Stockham DIF FFT algorithm. Like the Cooley/Tukey formulation in FFT Code, the algorithm consists of a triple nested “for-loop” with the outer loop running over the $S=\log_2(N)$ stages of butterflies. Comments like those made for the DIT version earlier can be made here. The butterfly computation occurs “out-of-place” in this DIF variant. The twiddle factors are applied after the butterfly computation (as compared to before the butterfly in the DIT case).
function [fft_o] = fft_stockham_dif(fft_i)
N = numel(fft_i);
Nstage = log2(N);
buff_i = fft_i;
buff_t = zeros(size(buff_i));
for stage = 1 : Nstage
Nhalf = N / 2^stage;
twid = exp(-1i*2*pi*[0:Nhalf-1]/(2*Nhalf));
stride = 2^(stage-1);
for p = 0 : Nhalf-1
for q = 0 : stride-1
% Note: address is not "in-place"
idx0_i = q + stride*(p+0 );
idx1_i = q + stride*(p+Nhalf);
idx0_o = q + stride*(2*p+0);
idx1_o = q + stride*(2*p+1);
a = buff_i(1+idx0_i);
b = buff_i(1+idx1_i);
buff_t(1+idx0_o) = a + b;
buff_t(1+idx1_o) = (a - b) * twid(1+p);
end
end
buff_i = buff_t;
end
fft_o = buff_i;
end
Copyright © 2020–2025 Advanced Micro Devices, Inc.