Adapting the Cooley-Tuckey algorithm to vector computers with local memory (such as the AI Engine) requires three key concepts: in-place computation, constant geometry, and self-sorting.
In-place computation: The memory space occupied by the data remains constant throughout all algorithm steps. There is no buffer memory overhead for data computation.
Constant geometry: The data indexes remain unchanged from stage to stage. This increases parallelization for SIMD machines by avoiding index book-keeping.
Self-sorting: The output data address is not bit-reversed, meaning the data appears in order. This can spare computation because you do not need a bit-reverse stride permutation matrix to order the data.
Research has not yet found an FFT algorithm that possesses all three properties.
The AI Engine APIs employ the Stockham FFT variant as a solution. This variant sacrifices in-place computation to gain self-sorting.
Fig. 2: Butterfly diagram of an 8-point, radix-2 Stockham DIT FFT
The preceding figure shows how re-indexing the stage’s computational nodes creates the Stockham variant of the FFT by adding a given increasing offset to the indexes. The following example figure shows this more explicitly.
Fig. 3: Butterfly diagram of an 8-point, radix-2 Stockham DIF FFT with explicit re-indexing
The Stockham variant sacrifices in-place computation to gain self-sorting and efficient vectorization.
These features make this algorithm suitable for performing the FFT on SIMD machines such as the AI Engine ML.