The kernel source code in the .cpp file is straightforward: it implements only the kernel class constructor and the run method. After including the kernel header files and the ADF and AIE API libraries, the constructor is implemented. The only operations performed in the constructor are setting the rounding and saturation modes for the kernel. Here, saturation uses the full representable numeric range of the data type, and rounding is to the nearest value, with ties (results whose fractional part is exactly \(0.5\)) rounded toward positive infinity.
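fft1k_kernel::fft1k_kernel(void)
{
    // Round to nearest; exact ties go toward positive infinity
    aie::set_rounding(aie::rounding_mode::positive_inf);
    // Saturate to the full range of the data type
    aie::set_saturation(aie::saturation_mode::saturate);
}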
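To make the effect of these two modes concrete, the following is a minimal host-side sketch in plain C++, not AI Engine code. The helper name `round_shift_saturate_i16` is hypothetical, and the 16-bit output width is an assumption chosen for illustration (the real width depends on TT_DATA). It models a right shift that rounds to nearest with ties toward positive infinity, followed by saturation to the full signed range.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical scalar model (not part of the AIE API): shift an accumulator
// right by `shift` bits, round to nearest with ties toward +infinity, then
// saturate to the full int16 range [-32768, 32767].
inline std::int16_t round_shift_saturate_i16(std::int64_t acc, int shift)
{
    // Adding half of the discarded weight before an arithmetic right shift
    // rounds to nearest; exact ties land on the larger value because the
    // shift itself rounds toward negative infinity.
    // Note: >> on negative values is arithmetic on mainstream compilers and
    // is guaranteed to be so from C++20 onward.
    const std::int64_t rounded =
        (shift > 0) ? ((acc + (std::int64_t{1} << (shift - 1))) >> shift) : acc;

    // Clamp instead of wrapping around when the value does not fit.
    return static_cast<std::int16_t>(
        std::clamp<std::int64_t>(rounded, INT16_MIN, INT16_MAX));
}
```

With this model, an accumulator of 3 shifted by one bit (1.5) becomes 2, while -3 shifted by one bit (-1.5) becomes -1, and any value above 32767 is clamped to 32767.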
In the run function, the data is read from the input buffer and written to the output buffer linearly. The data() method of each buffer returns a pointer to the memory reserved for its data. With these pointers in hand, a loop iterates over the batched signal instances, with the iteration count given by the REPEAT parameter defined in the kernel header file. Inside the loop, the staged FFT API is called once per stage on the pointed-to data, and the buffer pointers are then advanced to the next instance in the batch. Note that the following code block uses two Chess compiler directives: one that prepares the loop for software pipelining, and one that tells the compiler the loop iterates at least REPEAT times.
void fft1k_kernel::run( input_buffer<TT_DATA,extents<BUF_SIZE>> & __restrict din,
                        output_buffer<TT_DATA,extents<BUF_SIZE>> & __restrict dout )
{
    TT_DATA* ibuff = din.data();
    TT_DATA* obuff = dout.data();
    for (int i=0; i < REPEAT; i++)
        chess_prepare_for_pipelining
        chess_loop_range(REPEAT,)
    {
        TT_DATA *__restrict in_data = ibuff;
        TT_DATA *__restrict out_data = obuff;
        aie::fft_dit_r4_stage<256>(in_data, tw0_1, tw0_0, tw0_2, N, SHIFT_TW, SHIFT_DT, INVERSE, out_data);
        aie::fft_dit_r4_stage<64>(out_data, tw1_1, tw1_0, tw1_2, N, SHIFT_TW, SHIFT_DT, INVERSE, in_data);
        aie::fft_dit_r4_stage<16>(in_data, tw2_1, tw2_0, tw2_2, N, SHIFT_TW, SHIFT_DT, INVERSE, out_data);
        aie::fft_dit_r4_stage<4>(out_data, tw3_1, tw3_0, tw3_2, N, SHIFT_TW, SHIFT_DT, INVERSE, in_data);
        aie::fft_dit_r4_stage<1>(in_data, tw4_1, tw4_0, tw4_2, N, SHIFT_TW, SHIFT_DT, INVERSE, out_data);
        ibuff += N;
        obuff += N;
    }
}
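The structure of the listing above can be checked with a small compile-time sketch in plain C++. It is an illustration rather than kernel code. The value N = 1024 is an assumption inferred from the kernel name, REPEAT = 2 is the batch size used in this implementation, and the helper names `stage_vectorization`, `writes_to_output`, and `instance_offset` are hypothetical. The sketch reproduces the per-stage template arguments (256, 64, 16, 4, 1), the alternation between the two buffers, and the pointer advance of N elements per batch instance.

```cpp
#include <cstddef>

// Assumed values: N is inferred from the kernel name (fft1k) and is not
// stated explicitly in the text; REPEAT is the batch size per kernel call.
constexpr std::size_t N      = 1024; // points per FFT (assumption)
constexpr std::size_t REPEAT = 2;    // FFT instances per kernel call
constexpr std::size_t RADIX  = 4;
constexpr std::size_t STAGES = 5;    // 4^5 == 1024

// Template argument used by stage s in the listing: N / RADIX^(s+1).
constexpr std::size_t stage_vectorization(std::size_t s)
{
    std::size_t v = N;
    for (std::size_t i = 0; i <= s; ++i) v /= RADIX;
    return v;
}

// Even stages write to the output buffer; odd stages write back to the
// input buffer, so the two buffers are used in ping-pong fashion.
constexpr bool writes_to_output(std::size_t s) { return s % 2 == 0; }

// Element offset of batch instance i, matching `ibuff += N; obuff += N;`.
constexpr std::size_t instance_offset(std::size_t i) { return i * N; }

static_assert(stage_vectorization(0) == 256, "first stage uses <256>");
static_assert(stage_vectorization(STAGES - 1) == 1, "last stage uses <1>");
static_assert(writes_to_output(STAGES - 1),
              "an odd stage count leaves the final result in the output buffer");
```

Because the number of stages is odd, the last stage writes to the output buffer, which is why no final copy is needed in the listing.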
For this implementation, the REPEAT parameter is set to 2, so two FFTs are batched together in one kernel. As a first approach, the kernel must therefore be replicated 64 times to perform all the FFTs in parallel.
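The replication count follows from a simple ceiling division, sketched below in plain C++. The helper name `kernels_needed` is hypothetical, and the total of 128 transforms is an inference from 64 kernels times REPEAT = 2; that total is not stated explicitly here.

```cpp
// Hypothetical helper: number of kernel instances needed to run `total_ffts`
// transforms in parallel when each kernel batches `repeat` of them.
constexpr unsigned kernels_needed(unsigned total_ffts, unsigned repeat)
{
    return (total_ffts + repeat - 1) / repeat; // ceiling division
}

// Assumption: 128 transforms in total (inferred from 64 kernels x REPEAT = 2).
static_assert(kernels_needed(128, 2) == 64, "64 replicas for 128 FFTs");
```

Under the same assumption, raising the batch size to 4 would halve the number of kernel instances to 32, with each kernel doing twice the work.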