The figure below shows the processing performed by Stage 3. Here, four rounds of butterflies locally reorder 16-tuples of consecutive samples. As with the previous three stages, SIMD instructions perform a total of 8 parallel comparisons per cycle. Notice how the last two rounds of butterfly processing are identical to the last two rounds of Stage 2.
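To make the four rounds concrete, here is a minimal scalar sketch of Stage 3 in plain C++ (no AIE API). It assumes the standard bitonic-merge layout: a first round of mirrored butterflies (lane i against lane 15 - i) followed by three rounds with compare distances 4, 2, and 1. The names `compare_exchange` and `stage3_scalar_model` are hypothetical and do not appear in the tutorial sources; the distances for rounds 2 to 4 are an assumption based on the usual sorting network, not taken from the figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Compare-exchange: afterwards lo holds the smaller value, hi the larger.
inline void compare_exchange(float& lo, float& hi) {
    if (hi < lo) std::swap(lo, hi);
}

// Scalar model of Stage 3 (hypothetical name), applied in place to every
// 16-tuple of consecutive samples in the buffer.
void stage3_scalar_model(std::vector<float>& data) {
    constexpr std::size_t kTuple = 16;
    assert(data.size() % kTuple == 0);  // buffer must hold whole 16-tuples

    for (std::size_t base = 0; base < data.size(); base += kTuple) {
        float* t = &data[base];

        // Round 1: mirrored butterflies, lane i against lane 15 - i.
        for (std::size_t i = 0; i < kTuple / 2; ++i)
            compare_exchange(t[i], t[kTuple - 1 - i]);

        // Rounds 2-4 (assumed distances 4, 2, 1): lane i against lane
        // i + stride whenever the stride bit of i is clear. Each round
        // performs 8 compare-exchanges per 16-tuple.
        for (std::size_t stride = kTuple / 4; stride >= 1; stride /= 2)
            for (std::size_t i = 0; i < kTuple; ++i)
                if ((i & stride) == 0)
                    compare_exchange(t[i], t[i + stride]);
    }
}
```

Under these assumptions, a 16-tuple whose two 8-sample halves are each already sorted in ascending order (the condition the earlier stages are expected to establish) comes out fully sorted.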
The code block below implements the first round of Stage 3 using the AIE API. In this case, the “bottom” set of butterfly inputs must be reversed to perform the required sample comparisons; the aie::reverse() API is used for this purpose. After the sample comparison, a second reversal restores sample placement before the result is stored back to the 16-lane register. This round is simpler than the previous cases because no I/O permutations are required during sample extraction. Profiling reveals this function requires 27 cycles per invocation.
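void __attribute__((noinline)) bitonic_fp16::stage3a( aie::vector<float,16>& vec )
{
  // Top half: lanes 0-7. Bottom half: lanes 8-15, reversed so lane i pairs with lane 15-i.
  aie::vector<float,8> v_top = vec.extract<8>(0);
  aie::vector<float,8> v_bot = aie::reverse(vec.extract<8>(1));
  // Eight compare-exchanges performed as one lane-wise max and one lane-wise min.
  aie::vector<float,8> v_mx = aie::max(v_top,v_bot);
  aie::vector<float,8> v_mn = aie::min(v_top,v_bot);
  // Reverse the maxima again so each one lands back in its original bottom-half lane.
  vec = aie::concat(v_mn,aie::reverse(v_mx));
}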
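The lane bookkeeping in this round can be checked off-target with a scalar mirror of the same steps. The sketch below uses `std::array` in place of `aie::vector`; the helpers `extract8`, `reverse8`, and `stage3a_scalar_model` are hypothetical stand-ins written for illustration, and they assume that `extract<8>(idx)` returns lanes 8*idx through 8*idx + 7 and that `aie::reverse()` reverses lane order.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

using Vec16 = std::array<float, 16>;
using Vec8  = std::array<float, 8>;

// Stand-in for vec.extract<8>(idx): copies lanes [8*idx, 8*idx + 8).
Vec8 extract8(const Vec16& v, std::size_t idx) {
    Vec8 out{};
    std::copy_n(v.begin() + 8 * idx, 8, out.begin());
    return out;
}

// Stand-in for aie::reverse(): lane i of the result is lane 7 - i of the input.
Vec8 reverse8(Vec8 v) {
    std::reverse(v.begin(), v.end());
    return v;
}

// Scalar mirror of the first round of Stage 3 (hypothetical name).
void stage3a_scalar_model(Vec16& vec) {
    const Vec8 v_top = extract8(vec, 0);
    const Vec8 v_bot = reverse8(extract8(vec, 1));  // v_bot[i] == vec[15 - i]

    Vec8 v_mx{};
    Vec8 v_mn{};
    for (std::size_t i = 0; i < 8; ++i) {
        v_mx[i] = std::max(v_top[i], v_bot[i]);
        v_mn[i] = std::min(v_top[i], v_bot[i]);
    }

    // Minima stay in lanes 0-7; maxima are reversed back into lanes 8-15,
    // so the maximum of the pair (i, 15 - i) ends up in lane 15 - i.
    const Vec8 v_mx_rev = reverse8(v_mx);
    std::copy(v_mn.begin(), v_mn.end(), vec.begin());
    std::copy(v_mx_rev.begin(), v_mx_rev.end(), vec.begin() + 8);
}
```

For example, starting from the lanes 1, 3, 5, 7, 9, 11, 13, 15, 0, 2, 4, 6, 8, 10, 12, 14, the model leaves 1, 3, 5, 7, 6, 4, 2, 0 in the lower half and 15, 13, 11, 9, 8, 10, 12, 14 in the upper half: every lower-half sample is now no larger than any upper-half sample, which is what the remaining rounds rely on.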