Second Farrow Optimization

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2026-03-27
Version
2025.2 English

Inspect the design files in farrow_optimize1/aie to evaluate vector register usage per cycle. The required registers (v_buff, f_coeff, del, y3-y0, z2, z1) exceed the total supported by the chip, as specified in the Versal Adaptive SoC AI Engine Architecture Manual (AM009). This leads to vector register spilling: the processor must spend additional cycles saving intermediate compute results from vector registers to stack memory (and restoring them later) to manage the limited vector register resource. Refactoring the code to use fewer registers can remove this overhead.

figure7

Given the AI Engine Fixed-point vector unit multiplication and upshift paths specified in the Versal Adaptive SoC AI Engine Architecture Manual (AM009), the multiplication of vector and accumulator registers is not supported.

As a result, intermediate output z2 shown in Figure 2 must pass through the shift-round-saturate (SRS) path, which converts it from an accumulator register into a vector register, before it is used in the next aie::mac() instruction. The same applies to intermediate output z1.

This restriction presents a challenge to the compiler, limiting pipeline scheduling opportunities.
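As a rough illustration of what the SRS step does to each lane, the following scalar C++ sketch shifts a wide accumulator value right, rounds it, and saturates it to 16 bits. It is a host-side model, not AI Engine API code: the function name srs16, the round-half-up mode, and the 16-bit output width are assumptions made for the example (on the hardware, rounding and saturation modes are configurable).

```cpp
#include <cstdint>

// Scalar model of a shift-round-saturate (SRS) step.
// Assumptions (not taken from the tutorial sources): round-half-up
// rounding and a signed 16-bit result lane.
// Note: right-shifting a negative value is assumed to be an arithmetic
// shift, which holds on mainstream compilers and is guaranteed from C++20.
inline int16_t srs16(int64_t acc, unsigned shift) {
    int64_t v = acc;
    if (shift > 0) {
        // Add half of the output LSB before shifting to round to nearest.
        v = (acc + (int64_t{1} << (shift - 1))) >> shift;
    }
    // Saturate instead of wrapping when the value does not fit in 16 bits.
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return static_cast<int16_t>(v);
}
```

For instance, with a hypothetical shift of 14, an accumulator value of 3 * 8192 (1.5 in units of the output LSB) rounds up to 2, and a value far outside the 16-bit range clamps to 32767 or -32768.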

figure8

For the reasons described above, you can expect improved performance by breaking the single for loop into multiple smaller ones.

To accomplish this, store intermediate compute results in scratch pad tile memory before reading them as input for each subsequent for loop. Reserve memory for intermediate outputs as shown in farrow_kernel.h, for example alignas(32) TT_SIG y3[BUFFER_SIZE];. Access the reserved memory using a vector iterator defined in farrow_kernel.cpp, for example auto p_y3 = aie::begin_restrict_vector<8>(y3);.

Use __restrict to enable more aggressive compiler optimizations by asserting that no memory dependency arises from pointer aliasing. For more information, see AI Engine Kernel and Graph Programming Guide (UG1079) - Restrict Keyword.
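The loop-splitting idea can be sketched in plain host-side C++, without the AI Engine API. The example below compares a single fused loop against three short loops that hand results to each other through aligned scratch buffers accessed via __restrict pointers. The names kN, kShift, mac_shift, horner_stage, farrow_fused, and farrow_split are hypothetical, and the z1 and output recurrences are assumed by analogy with the z2 computation; this is a sketch of the structure, not the tutorial's kernel.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kN = 16;   // hypothetical samples per call
constexpr unsigned kShift = 14;  // hypothetical stand-in for DNSHIFT

// Scratch buffers for intermediate results, aligned like the kernel's.
alignas(32) static int16_t z2_buf[kN];
alignas(32) static int16_t z1_buf[kN];

// addend + (a * b >> kShift), computed in a wide accumulator.
// No rounding or saturation here, to keep the sketch short.
inline int16_t mac_shift(int16_t addend, int16_t a, int16_t b) {
    const int64_t acc = static_cast<int64_t>(addend) * (int64_t{1} << kShift)
                      + static_cast<int64_t>(a) * b;
    return static_cast<int16_t>(acc >> kShift);
}

// Before: one loop keeps z2 and z1 live alongside every input.
void farrow_fused(const int16_t* y3, const int16_t* y2, const int16_t* y1,
                  const int16_t* y0, const int16_t* del, int16_t* out) {
    for (std::size_t i = 0; i < kN; ++i) {
        const int16_t z2 = mac_shift(y2[i], y3[i], del[i]);
        const int16_t z1 = mac_shift(y1[i], z2, del[i]);
        out[i] = mac_shift(y0[i], z1, del[i]);
    }
}

// One short stage; __restrict promises that out does not alias the inputs.
void horner_stage(const int16_t* __restrict hi, const int16_t* __restrict lo,
                  const int16_t* __restrict del, int16_t* __restrict out) {
    for (std::size_t i = 0; i < kN; ++i) {
        out[i] = mac_shift(lo[i], hi[i], del[i]);
    }
}

// After: three small loops connected through scratch memory.
void farrow_split(const int16_t* y3, const int16_t* y2, const int16_t* y1,
                  const int16_t* y0, const int16_t* del, int16_t* out) {
    horner_stage(y3, y2, del, z2_buf);      // z2 = y2 + y3 * del
    horner_stage(z2_buf, y1, del, z1_buf);  // z1 = y1 + z2 * del
    horner_stage(z1_buf, y0, del, out);     // out = y0 + z1 * del
}
```

Both variants produce the same samples. The split version trades extra memory traffic for loops that each keep only a few values live at once, which is the property the text relies on to avoid register spilling.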

Finally, replace the following lines:

    acc_x = aie::mul(*p_y3++,del);
    *p_z2++ = aie::add(acc_x.to_vector<TT_SIG>(DNSHIFT),*p_y2++);

With the optimized version:

    acc_x = aie::mac(aie::from_vector<TT_ACC>(*p_y2++,DNSHIFT), *p_y3++,del);
    *p_z2++ = acc_x.to_vector<TT_SIG>(DNSHIFT);

The two snippets are functionally equivalent, but the second enables better pipelining and scheduling opportunities.
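A scalar model can make the equivalence of the two forms concrete. In the sketch below, the first function multiplies, downshifts, and then adds, while the second preloads the accumulator with the upshifted addend and applies a single downshift after the multiply-accumulate. The value of DNSHIFT and the function names are hypothetical, the shift truncates rather than rounds, and saturation is ignored.

```cpp
#include <cstdint>

constexpr unsigned DNSHIFT = 14;  // hypothetical value for illustration

// Form 1: multiply, downshift the product, then add y2 as a plain value.
inline int32_t mul_then_add(int16_t y3, int16_t del, int16_t y2) {
    const int64_t acc = static_cast<int64_t>(y3) * del;
    return static_cast<int32_t>(acc >> DNSHIFT) + y2;
}

// Form 2: start the accumulator at y2 * 2^DNSHIFT, accumulate y3 * del,
// then downshift once.
inline int32_t mac_then_shift(int16_t y3, int16_t del, int16_t y2) {
    const int64_t acc = static_cast<int64_t>(y2) * (int64_t{1} << DNSHIFT)
                      + static_cast<int64_t>(y3) * del;
    return static_cast<int32_t>(acc >> DNSHIFT);
}
```

Because the upshifted addend is an exact multiple of 2^DNSHIFT, it passes through the downshift unchanged, so both forms return the same value in this model. The second form needs one conversion from accumulator to vector instead of a conversion followed by a separate vector add.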

Implement these changes in the design files in farrow_optimization2/aie. Repeat the previous characterization steps to measure the updated design’s performance. After running make all, the console displays:

*** LOOP_II *** Tile: 25_0	minII: 16	beforeII: 29	afterII: 16	Line: 62	File: farrow_kernel.cpp
*** LOOP_II *** Tile: 25_0	minII: 3	beforeII: 16	afterII: 3	Line: 94	File: farrow_kernel.cpp
*** LOOP_II *** Tile: 25_0	minII: 3	beforeII: 16	afterII: 3	Line: 110	File: farrow_kernel.cpp
*** LOOP_II *** Tile: 25_0	minII: 3	beforeII: 16	afterII: 3	Line: 126	File: farrow_kernel.cpp
Raw Throughput = 768.3 MSPS
Max error LSB = 1

Because there are four for loops, make get_II reports four II numbers, one for each loop. These loops run consecutively, so the total II is their sum: 16 + 3 + 3 + 3 = 25, which exceeds 16. To meet your budget of 16 cycles, you need to split the loops across two tiles, with the first tile containing the first loop and the second tile containing the three remaining loops.
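The budget check can be written out as compile-time arithmetic. The sketch below uses the four afterII values from the console output and the 16-cycle budget from the text; the constant and function names are hypothetical, and it assumes, as the text states, that loops on the same tile run back to back so their II values add.

```cpp
// II values reported for the four loops in the console output.
constexpr int kLoopII[4] = {16, 3, 3, 3};
constexpr int kBudget = 16;  // cycle budget stated in the text

// Sum of kLoopII over the half-open index range [first, last).
constexpr int sum_ii(int first, int last) {
    return first == last ? 0 : kLoopII[first] + sum_ii(first + 1, last);
}

constexpr int kSingleTile = sum_ii(0, 4);  // all four loops on one tile
constexpr int kTileA = sum_ii(0, 1);       // first loop only
constexpr int kTileB = sum_ii(1, 4);       // remaining three loops

static_assert(kSingleTile > kBudget, "one tile exceeds the budget");
static_assert(kTileA <= kBudget && kTileB <= kBudget,
              "each tile of the two-tile split fits the budget");
```

Under this split the first tile uses the full 16 cycles and the second uses 9, so both stay within the budget.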

Launch vitis_analyzer with vitis_analyzer aiesimulator_output/default.aierun_summary. The current implementation produces the array view shown in the following figure. Notice the increased system memory (sysmem) size, which accommodates the scratch pad memory reserved for intermediate kernel results.

figure9

Figure 9 - Farrow Filter Optimize2 Implementation Array View