The figure below shows the processing performed by Stage 0. Here a single round of butterflies performs local reordering of pairs of consecutive samples. A total of 8 parallel comparisons must be performed per stage. With float data types, this may be done with the fpmax() and fpmin() intrinsics or with AIE API calls, as shown below.
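As a point of reference for the vectorized versions, the Stage 0 behavior described above can be sketched as a scalar loop on the host. This is a minimal model, assuming 16 float samples held in a std::array; the name stage0_reference is hypothetical and does not appear in the tutorial source.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Scalar reference for Stage 0 (hypothetical helper, host-side only):
// one compare-exchange per pair of consecutive samples, which gives
// 8 butterflies for 16 samples.
inline void stage0_reference(std::array<float, 16>& v)
{
    for (std::size_t i = 0; i < v.size(); i += 2) {
        const float lo = std::min(v[i], v[i + 1]);
        const float hi = std::max(v[i], v[i + 1]);
        v[i]     = lo;  // "top" (even) lane receives the smaller sample
        v[i + 1] = hi;  // "bottom" (odd) lane receives the larger sample
    }
}
```

A model like this is mainly useful as a golden reference when checking the output of the vectorized kernels in simulation.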
The code block below implements Stage 0 using intrinsics. The full complement of 16 input samples is stored in a 16-lane vector register. The fpmax() and fpmin() intrinsics provide the core sorting functionality, each performing 8 parallel comparisons in SIMD fashion in a single cycle. The fpshuffle16() intrinsics perform input and output data shuffling so that all eight “top” samples of each butterfly are moved to a single 8-lane vector register, and similarly for the “bottom” samples of each butterfly. After the maximum and minimum samples are identified, they are stored back to the 16-lane vector with the smaller value of each butterfly in the top position and the larger value in the bottom position. Profiling with aiesimulator shows this intrinsic code requires 27 cycles per invocation.
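void __attribute__((noinline)) bitonic_fp16::stage0_intrinsic( aie::vector<float,16>& vec )
{
  // Input shuffle patterns: gather the "top" (even) and "bottom" (odd) samples
  static constexpr unsigned BFLY_STAGE0_TOP_I = 0xECA86420;
  static constexpr unsigned BFLY_STAGE0_BOT_I = 0xFDB97531;
  // Output shuffle patterns: re-interleave the minimum and maximum samples
  static constexpr unsigned BFLY_STAGE0_TOP_O = 0xB3A29180;
  static constexpr unsigned BFLY_STAGE0_BOT_O = 0xF7E6D5C4;
  // Move all "top" samples to the lower 8 lanes and all "bottom" samples to the upper 8 lanes
  vec = fpshuffle16(vec,0,BFLY_STAGE0_TOP_I,BFLY_STAGE0_BOT_I);
  aie::vector<float,8> v_top = vec.extract<8>(0);
  aie::vector<float,8> v_bot = vec.extract<8>(1);
  // Eight parallel comparisons each
  aie::vector<float,8> v_mx = fpmax(v_top,v_bot);
  aie::vector<float,8> v_mn = fpmin(v_top,v_bot);
  // Store back: minimum to the top lane and maximum to the bottom lane of each butterfly
  vec = aie::concat(v_mn,v_mx);
  vec = fpshuffle16(vec,0,BFLY_STAGE0_TOP_O,BFLY_STAGE0_BOT_O);
}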
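The four shuffle constants are easier to follow when decoded lane by lane. The sketch below is a host-side model, not the intrinsic itself: it assumes each 4-bit nibble of a pattern selects the source lane for one output lane, lowest nibble first, with the first pattern driving output lanes 0 to 7 and the second driving lanes 8 to 15. The name shuffle16_model is hypothetical, and the start-offset argument of fpshuffle16() is taken to be 0 as in the code above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of a 16-lane shuffle (not an AIE intrinsic).
// Assumption: nibble k of `lo` names the source lane for output lane k,
// and nibble k of `hi` names the source lane for output lane k + 8.
inline std::array<float, 16> shuffle16_model(const std::array<float, 16>& in,
                                             std::uint32_t lo, std::uint32_t hi)
{
    std::array<float, 16> out{};
    for (std::size_t lane = 0; lane < 8; ++lane) {
        out[lane]     = in[(lo >> (4 * lane)) & 0xFu];
        out[lane + 8] = in[(hi >> (4 * lane)) & 0xFu];
    }
    return out;
}
```

Under this reading, 0xECA86420 and 0xFDB97531 place the even lanes in the lower half and the odd lanes in the upper half, while 0xB3A29180 and 0xF7E6D5C4 interleave the two halves again, so applying the input patterns followed by the output patterns returns the original lane order.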
The code below implements Stage 0 using the AIE API. The full complement of 16 input samples is stored in a 16-lane vector register. Here, the aie::filter_even() API pulls out the top butterfly samples by selecting the even-numbered lanes. The aie::filter_odd() API pulls out the bottom butterfly samples by selecting the odd-numbered lanes. The aie::max() and aie::min() APIs identify the largest and smallest samples for each butterfly. Finally, the aie::interleave_zip() API collects the two 8-lane inputs into a 16-lane output vector, assigning even lanes from the first vector and odd lanes from the second vector. This code is functionally equivalent to the intrinsic version above. Profiling reveals it requires 28 cycles per invocation.
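void __attribute__((noinline)) bitonic_fp16::stage0_api( aie::vector<float,16>& vec )
{
  // Even lanes hold the "top" samples, odd lanes hold the "bottom" samples
  aie::vector<float,8> v_top = aie::filter_even(vec);
  aie::vector<float,8> v_bot = aie::filter_odd(vec);
  // Eight parallel comparisons each
  aie::vector<float,8> v_mx = aie::max(v_top,v_bot);
  aie::vector<float,8> v_mn = aie::min(v_top,v_bot);
  // Interleave minimum (even lanes) and maximum (odd lanes); the result comes back as two 8-lane halves
  std::tie(v_mn,v_mx) = aie::interleave_zip(v_mn,v_mx,1);
  vec = aie::concat(v_mn,v_mx);
}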
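The data flow of the AIE API version (split into even and odd lanes, compare, then interleave) can also be mirrored in plain C++ for checking on a host. This is only a sketch: the loops imitate the behavior described above for the filter, min/max, and interleave steps and do not call the AIE API, and stage0_deinterleave_model is a hypothetical name.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar model of the API data flow (host-side only).
inline void stage0_deinterleave_model(std::array<float, 16>& vec)
{
    std::array<float, 8> top{};
    std::array<float, 8> bot{};
    // Split step: even lanes are the "top" samples, odd lanes the "bottom" samples.
    for (std::size_t i = 0; i < 8; ++i) {
        top[i] = vec[2 * i];
        bot[i] = vec[2 * i + 1];
    }
    // Compare step followed by the interleave step:
    // even output lanes take the minimum, odd output lanes take the maximum.
    for (std::size_t i = 0; i < 8; ++i) {
        vec[2 * i]     = std::min(top[i], bot[i]);
        vec[2 * i + 1] = std::max(top[i], bot[i]);
    }
}
```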