The figure below shows the processing performed by Stage 2. Here, three rounds of butterflies perform local reordering of 8-tuples of consecutive samples. As with the previous two stages, SIMD instructions perform a total of 8 parallel comparisons per cycle. Notice again how the third round of butterfly processing is identical to the last round from both Stage 0 and Stage 1.
The code block below implements the first round of Stage 2 using the AIE API. It uses the same 16-lane vector register along with the aie::max() and aie::min() routines for sample comparison, and the fpshuffle16() intrinsic to perform I/O sample extraction for the “top” and “bottom” samples of each butterfly. Notice how the code here is identical to that used for the first round of Stage 1 except for the I/O sample extraction permutations. This difference arises only because the “top” and “bottom” butterfly samples are located in different positions within the 16-lane vector register. The code for the second round of Stage 2 (not shown here) exhibits exactly the same structure with yet another distinct set of permutations. Profiling reveals that both of these functions require 27 cycles per invocation.
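void __attribute__((noinline)) bitonic_fp16::stage2a( aie::vector<float,16>& vec )
{
    // Input permutations: gather the "top" samples into lanes 0-7 and the "bottom" samples into lanes 8-15
    static constexpr unsigned BFLY_STAGE2a_TOP_I = 0xBA983210;
    static constexpr unsigned BFLY_STAGE2a_BOT_I = 0xCDEF4567;
    // Output permutations: return the min/max results to their sample positions
    static constexpr unsigned BFLY_STAGE2a_TOP_O = 0x89AB3210;
    static constexpr unsigned BFLY_STAGE2a_BOT_O = 0xCDEF7654;
    aie::vector<float,8> v_mx;
    aie::vector<float,8> v_mn;
    // Extract "top" and "bottom" butterfly inputs
    vec = fpshuffle16(vec,0,BFLY_STAGE2a_TOP_I,BFLY_STAGE2a_BOT_I);
    // Perform 8 parallel comparisons
    v_mx = aie::max(vec.extract<8>(0),vec.extract<8>(1));
    v_mn = aie::min(vec.extract<8>(0),vec.extract<8>(1));
    // Place minima in lanes 0-7 and maxima in lanes 8-15, then restore sample order
    vec = aie::concat(v_mn,v_mx);
    vec = fpshuffle16(vec,0,BFLY_STAGE2a_TOP_O,BFLY_STAGE2a_BOT_O);
}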
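To make the four permutation constants easier to read, the following host-side sketch models stage2a() in plain scalar C++. It is not part of the original design and does not use the AIE API. The names Vec16, shuffle16_model() and stage2a_model() are hypothetical, and the shuffle model assumes that fpshuffle16() with a start of 0 selects, for each output lane, the input lane named by the corresponding hexadecimal digit of the offset words (least significant digit first, the first word covering lanes 0-7 and the second covering lanes 8-15).

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

using Vec16 = std::array<float, 16>;

// Hypothetical scalar stand-in for fpshuffle16(vec, 0, offs_lo, offs_hi).
// Assumption: hex digit i of offs_lo names the source lane for output lane i,
// and hex digit i of offs_hi names the source lane for output lane 8 + i.
Vec16 shuffle16_model(const Vec16& in, std::uint32_t offs_lo, std::uint32_t offs_hi)
{
    Vec16 out{};
    for (int i = 0; i < 8; ++i) {
        out[i]     = in[(offs_lo >> (4 * i)) & 0xF];
        out[8 + i] = in[(offs_hi >> (4 * i)) & 0xF];
    }
    return out;
}

// Scalar model of the first round of Stage 2, using the same constants
// as the listing above.
Vec16 stage2a_model(const Vec16& in)
{
    // Gather "top" samples into lanes 0-7 and "bottom" samples into lanes 8-15.
    // Under the assumed semantics the butterflies pair samples
    // (0,7) (1,6) (2,5) (3,4) and (8,15) (9,14) (10,13) (11,12).
    const Vec16 v = shuffle16_model(in, 0xBA983210u, 0xCDEF4567u);

    // Eight compare-exchange operations: minima to lanes 0-7, maxima to lanes 8-15.
    Vec16 mn_mx{};
    for (int i = 0; i < 8; ++i) {
        mn_mx[i]     = std::min(v[i], v[8 + i]);
        mn_mx[8 + i] = std::max(v[i], v[8 + i]);
    }

    // Scatter each minimum back to the "top" position of its butterfly
    // and each maximum back to the "bottom" position.
    return shuffle16_model(mn_mx, 0x89AB3210u, 0xCDEF7654u);
}
```

Under these assumptions, passing the samples 7, 6, 5, 4, 3, 2, 1, 0 followed by 8 through 15 to stage2a_model() yields 0 through 15 in order: each sample in the first 8-tuple is swapped with its mirror partner, while the second 8-tuple is left unchanged because every "top" sample there is already the smaller of its pair.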