The final version of the implementation splits the four for loops into two kernels, as previously discussed. The final optimization in this version concerns the storage of the intermediate results z2 and z1 shown in Figure 2.
Because the loops in farrow_kernel2.cpp run sequentially, and a memory bank supports one read and one write in the same clock cycle, both results can be stored in the same memory bank. z2 and z1 are kept apart within the shared bank by using different pointer addresses.
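The shared-bank idea can be sketched in plain host C++. This is a minimal illustration, not the code in farrow_kernel2.cpp: a single contiguous array stands in for one memory bank, and the names Bank, kSamples, kZ2Offset, kZ1Offset, compute_z2, and compute_z1 are hypothetical. The arithmetic inside the loops is a placeholder, not the Farrow filter math.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sizes and offsets, chosen only for illustration.
constexpr std::size_t kSamples  = 8;         // samples per intermediate result
constexpr std::size_t kZ2Offset = 0;         // z2 occupies the first region
constexpr std::size_t kZ1Offset = kSamples;  // z1 occupies the second region

// The two regions must not overlap inside the shared storage.
static_assert(kZ2Offset + kSamples <= kZ1Offset, "z2 and z1 regions overlap");

// One contiguous array stands in for a single memory bank.
using Bank  = std::array<std::int32_t, 2 * kSamples>;
using Block = std::array<std::int32_t, kSamples>;

// First loop: writes z2 into its region of the bank (placeholder math).
void compute_z2(Bank& bank, const Block& x) {
    std::int32_t* z2 = bank.data() + kZ2Offset;
    for (std::size_t i = 0; i < kSamples; ++i) {
        z2[i] = 2 * x[i];
    }
}

// Second loop: runs after the first, reads z2 and writes z1.
// Each iteration performs one read and one write on the same bank,
// at different addresses.
void compute_z1(Bank& bank, const Block& x) {
    const std::int32_t* z2 = bank.data() + kZ2Offset;
    std::int32_t*       z1 = bank.data() + kZ1Offset;
    for (std::size_t i = 0; i < kSamples; ++i) {
        z1[i] = z2[i] + x[i];
    }
}
```

Calling compute_z2 and then compute_z1 on the same Bank leaves both results side by side in one allocation. On the AI Engine the equivalent effect depends on how the kernel's buffers are placed, which this host-side sketch does not model.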
These changes are implemented in the farrow_final/aie design files. Repeat the earlier characterization steps to evaluate performance. Run make all and confirm that the console displays the expected output:
*** LOOP_II *** Tile: 24_1 minII: 3 beforeII: 16 afterII: 3 Line: 50 File: farrow_kernel2.cpp
*** LOOP_II *** Tile: 24_1 minII: 3 beforeII: 16 afterII: 3 Line: 66 File: farrow_kernel2.cpp
*** LOOP_II *** Tile: 24_1 minII: 3 beforeII: 16 afterII: 3 Line: 82 File: farrow_kernel2.cpp
*** LOOP_II *** Tile: 25_0 minII: 16 beforeII: 29 afterII: 16 Line: 53 File: farrow_kernel1.cpp
Raw Throughput = 1150.0 MSPS
Max error LSB = 1
Launch vitis_analyzer with vitis_analyzer Work/farrow_app.aiecompile_summary. The current implementation generates the summary view shown in the following figure. The final design uses two compute tiles, and five tiles in total when buffers are taken into account.
Figure 10 - Farrow Filter Final Implementation Summary View
Launch vitis_analyzer with vitis_analyzer aiesimulator_output/default.aierun_summary. The current implementation generates the views shown in the following figures. Observe the new ping-pong buffers for the intermediate outputs that connect the two kernels.
Figure 11 - Farrow Filter Final Implementation Graph View
Figure 12 - Farrow Filter Final Implementation Array View
Figure 13 - Farrow Filter Final Implementation Trace View
Steady-state throughput is 1024 samples / 913e-9 s = 1122 Msps.
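The throughput arithmetic can be checked with a short helper. This is a sketch under one assumption: the 1024 samples are measured over an interval of 913 ns, which is the interval consistent with a result of about 1122 Msps. The function name throughput_msps is hypothetical.

```cpp
#include <cassert>

// Returns throughput in Msps for a given sample count and elapsed time in ns.
// samples / ns gives Gsps, so multiply by 1e3 to express the result in Msps.
constexpr double throughput_msps(double samples, double elapsed_ns) {
    return samples / elapsed_ns * 1.0e3;
}
```

With 1024 samples and 913 ns, the helper returns roughly 1121.6, which rounds to the 1122 Msps quoted above and sits slightly below the raw throughput reported on the console.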