Inspecting the design files in the farrow_optimize1/aie folder, you can observe that the number of vector registers needed every cycle (v_buff, f_coeff, del, y3-y0, z2, z1) exceeds the total supported by the chip, as specified in the Versal Adaptive SoC AI Engine Architecture Manual (AM009). This leads to “vector register spillage”, where the processor must spend additional cycles saving intermediate results from vector registers to stack memory (and restoring them later) to manage the limited vector register resource. Refactoring the code to use fewer registers can eliminate this overhead.
Also, according to the AI Engine Fixed-point Vector Unit Multiplication and Upshift Paths, also specified in the Versal Adaptive SoC AI Engine Architecture Manual (AM009) and shown below, multiplying a vector register by an accumulator register is not supported.
Therefore, the intermediate output z2 shown in Figure 2 needs to pass through the Shift-Round-Saturate (SRS) path so that it is converted from an accumulator register into a vector register before it is used in the next aie::mac() instruction (the same applies to the intermediate output z1).
This restriction presents a challenge to the compiler because it limits pipelined scheduling opportunities.
For the reasons above, breaking the single for loop into multiple smaller loops is expected to improve performance.
To accomplish this, intermediate results must be stored in scratch-pad tile memory before they are read as input by each subsequent for loop. Memory for these intermediate outputs is reserved in farrow_kernel.h, for example: alignas(32) TT_SIG y3[BUFFER_SIZE];
That memory is accessed through a vector iterator defined in farrow_kernel.cpp, for example: auto p_y3 = aie::begin_restrict_vector<8>(y3);
The use of _restrict is intended to allow more aggressive compiler optimization by explicitly stating that pointer aliasing causes no memory dependency. For more information, see AI Engine Kernel and Graph Programming Guide (UG1079) - Restrict Keyword.
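The loop-splitting idea described above can be sketched in plain scalar C++ (not the AIE API). This is a minimal illustration under stated assumptions: the Horner recurrence z2 = y3*del + y2, z1 = z2*del + y1, out = z1*del + y0 is assumed from the Farrow structure, sample_t is a stand-in for TT_SIG, the BUFFER_SIZE value is hypothetical, shifts and rounding are omitted, and __restrict is a compiler extension (GCC, Clang, MSVC) used here in place of the AIE restrict iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using sample_t = std::int32_t;           // stand-in for TT_SIG (assumption)
constexpr std::size_t BUFFER_SIZE = 16;  // hypothetical size for this sketch

// Fused form: one loop keeps y3..y0, z2, and z1 live at the same time.
inline void horner_fused(const sample_t* y3, const sample_t* y2,
                         const sample_t* y1, const sample_t* y0,
                         const sample_t* del, sample_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const sample_t z2 = y3[i] * del[i] + y2[i];
        const sample_t z1 = z2 * del[i] + y1[i];
        out[i] = z1 * del[i] + y0[i];
    }
}

// Split form: three short loops that exchange results through scratch buffers.
// __restrict promises the compiler that the pointers do not alias.
inline void horner_split(const sample_t* __restrict y3, const sample_t* __restrict y2,
                         const sample_t* __restrict y1, const sample_t* __restrict y0,
                         const sample_t* __restrict del, sample_t* __restrict out,
                         std::size_t n) {
    assert(n <= BUFFER_SIZE);
    alignas(32) sample_t z2[BUFFER_SIZE];  // scratch storage for intermediate z2
    alignas(32) sample_t z1[BUFFER_SIZE];  // scratch storage for intermediate z1
    for (std::size_t i = 0; i < n; ++i) z2[i] = y3[i] * del[i] + y2[i];
    for (std::size_t i = 0; i < n; ++i) z1[i] = z2[i] * del[i] + y1[i];
    for (std::size_t i = 0; i < n; ++i) out[i] = z1[i] * del[i] + y0[i];
}

// Runs both forms on small made-up data and reports whether they agree.
inline bool split_matches_fused() {
    sample_t y3[BUFFER_SIZE], y2[BUFFER_SIZE], y1[BUFFER_SIZE], y0[BUFFER_SIZE];
    sample_t del[BUFFER_SIZE], a[BUFFER_SIZE], b[BUFFER_SIZE];
    for (std::size_t i = 0; i < BUFFER_SIZE; ++i) {
        const sample_t k = static_cast<sample_t>(i);
        y3[i] = k - 5; y2[i] = 2 * k + 1; y1[i] = 7 - k; y0[i] = k;
        del[i] = (k % 4) - 1;
    }
    horner_fused(y3, y2, y1, y0, del, a, BUFFER_SIZE);
    horner_split(y3, y2, y1, y0, del, b, BUFFER_SIZE);
    for (std::size_t i = 0; i < BUFFER_SIZE; ++i)
        if (a[i] != b[i]) return false;
    return true;
}
```

Each split loop touches only a few arrays, so fewer values need to be live at once. The trade-off is the extra scratch memory for z2 and z1.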
Finally, replace:
acc_x = aie::mul(*p_y3++,del);
*p_z2++ = aie::add(acc_x.to_vector<TT_SIG>(DNSHIFT),*p_y2++);
with:
acc_x = aie::mac(aie::from_vector<TT_ACC>(*p_y2++,DNSHIFT), *p_y3++,del);
*p_z2++ = acc_x.to_vector<TT_SIG>(DNSHIFT);
While these are functionally equivalent, the second code snippet allows for better pipelining and scheduling opportunities.
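The equivalence of the two snippets can be checked with a scalar model in plain C++ (not the AIE API). This is a sketch under assumptions: srs() models the shift-round path as a round-half-up right shift, saturation is ignored, and the shift value of 14 used in the sweep is a hypothetical stand-in for DNSHIFT.

```cpp
#include <cassert>
#include <cstdint>

using acc_t = std::int64_t;  // stand-in for an accumulator lane (assumption)
using sig_t = std::int16_t;  // stand-in for one TT_SIG lane (assumption)

// Scalar model of the shift-round step: round-half-up right shift.
// Floor division is used so negative values behave like an arithmetic shift.
inline std::int32_t srs(acc_t acc, int shift) {
    assert(shift > 0 && shift < 31);
    const acc_t scale = acc_t{1} << shift;
    const acc_t biased = acc + scale / 2;
    acc_t q = biased / scale;
    if (biased % scale < 0) --q;  // C++ division truncates toward zero
    return static_cast<std::int32_t>(q);
}

// First form: multiply, shift down, then add y2 in the vector domain.
inline std::int32_t z2_mul_then_add(sig_t y3, sig_t y2, sig_t del, int shift) {
    const acc_t acc = acc_t{y3} * del;
    return srs(acc, shift) + y2;
}

// Second form: preload the accumulator with upshifted y2, accumulate the
// product, then shift down once.
inline std::int32_t z2_mac(sig_t y3, sig_t y2, sig_t del, int shift) {
    acc_t acc = acc_t{y2} * (acc_t{1} << shift);  // models the from_vector upshift
    acc += acc_t{y3} * del;
    return srs(acc, shift);
}

// Sweeps a grid of inputs and reports whether both forms always agree.
inline bool forms_agree(int shift) {
    for (int y3 = -3000; y3 <= 3000; y3 += 371)
        for (int y2 = -3000; y2 <= 3000; y2 += 293)
            for (int del = -16000; del <= 16000; del += 1999) {
                const sig_t a = static_cast<sig_t>(y3);
                const sig_t b = static_cast<sig_t>(y2);
                const sig_t d = static_cast<sig_t>(del);
                if (z2_mul_then_add(a, b, d, shift) != z2_mac(a, b, d, shift))
                    return false;
            }
    return true;
}
```

The two forms agree in this model because the upshifted y2 is an exact multiple of 2^shift, so it passes through the rounding shift unchanged. Results that would saturate are outside the scope of this sketch.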
Once those changes are implemented into the design files in the farrow_optimization2/aie folder, you can repeat the previously mentioned steps to characterize the design.
After running make all, the console should display:
*** LOOP_II *** Tile: 25_0 minII: 16 beforeII: 29 afterII: 16 Line: 62 File: farrow_kernel.cpp
*** LOOP_II *** Tile: 25_0 minII: 3 beforeII: 16 afterII: 3 Line: 94 File: farrow_kernel.cpp
*** LOOP_II *** Tile: 25_0 minII: 3 beforeII: 16 afterII: 3 Line: 110 File: farrow_kernel.cpp
*** LOOP_II *** Tile: 25_0 minII: 3 beforeII: 16 afterII: 3 Line: 126 File: farrow_kernel.cpp
Raw Throughput = 768.1 MSPS
Max error LSB = 1
Because you have four for loops, make get_II generates four II numbers, one for each loop. These loops run consecutively, so the total II is the sum of all four: 16 + 3 + 3 + 3 = 25, which exceeds 16. To meet your budget of 16 cycles, you need to split the loops across two tiles, with the first tile containing the first loop and the second tile containing the three remaining loops.
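The budget arithmetic above can be written out as a small plain C++ sketch. The greedy packing helper is a hypothetical illustration of the reasoning, not something the tools do for you. The II values and the 16-cycle budget are taken from the console output above.

```cpp
#include <cassert>
#include <vector>

// Loops inside one kernel run back to back, so their IIs add up.
inline int total_ii(const std::vector<int>& loop_ii) {
    int sum = 0;
    for (int ii : loop_ii) sum += ii;
    return sum;
}

// Greedily packs consecutive loops into tiles so that the summed II of each
// tile stays within the budget. Returns the summed II per tile.
inline std::vector<int> pack_into_tiles(const std::vector<int>& loop_ii, int budget) {
    std::vector<int> tile_ii;
    for (int ii : loop_ii) {
        assert(ii <= budget);  // a single loop over budget cannot be placed
        if (tile_ii.empty() || tile_ii.back() + ii > budget)
            tile_ii.push_back(ii);  // start a new tile
        else
            tile_ii.back() += ii;   // loop fits in the current tile
    }
    return tile_ii;
}
```

For the reported values {16, 3, 3, 3} and a budget of 16, the first loop fills one tile on its own and the three remaining loops share a second tile with a combined II of 9.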
Launch Vitis Analyzer by running vitis_analyzer aiesimulator_output/default.aierun_summary. The current implementation produces the array view shown below. Notice the increased size of the sysmem, which accommodates the scratch-pad memory reserved for intermediate kernel results.
Figure 9 - Farrow Filter Optimize2 Implementation Array View