Initial Farrow Design - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

Navigate to farrow_initial and inspect farrow_kernel.cpp. You will see a version of the implementation coded primarily to produce functionally correct output, without any effort spent on optimizing throughput performance.

Once inside farrow_initial, you can perform x86 functional simulation and compare against the golden output generated by the MATLAB model by running the following commands:

$ make x86compile
$ make x86sim
$ make check_sim_output_x86

The first command compiles the graph code for simulation on an x86 processor, the second command runs the simulation, and the final command invokes MATLAB to compare the simulator output against golden test vectors.

Alternatively, you can issue make x86all. The console should output Max error LSB = 1.

To understand the performance of your initial implementation, you can perform AI Engine emulation using the SystemC simulator by entering the sequence of commands below. In the context of AI Engine processors, the Initiation Interval (II) is defined as how often (in cycles) a new iteration of a loop can start.

For example, if a new iteration of the loop can start every II=16 cycles, and each loop iteration produces 16 samples, that means the processor is producing the equivalent of one sample per clock (excluding processor overhead).
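As an illustrative sketch (the helper names here are hypothetical, not part of the tutorial flow), the relationship between II, samples per iteration, and clock rate can be written as:

```python
def samples_per_cycle(ii_cycles, samples_per_iteration):
    """Output samples produced per clock cycle, ignoring processor overhead."""
    return samples_per_iteration / ii_cycles

def throughput_sps(clock_hz, ii_cycles, samples_per_iteration):
    """Ideal output throughput in samples per second."""
    return clock_hz * samples_per_cycle(ii_cycles, samples_per_iteration)
```

With II=16 and 16 samples produced per iteration, the kernel sustains exactly one sample per clock, so throughput equals the clock rate.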

Assuming your AI Engine clock is 1.25 GHz, your throughput can potentially reach 1.25 Gsps, excluding any processor overhead. Output throughput is defined as the number of samples your kernel produces per second. Run the following commands:

$ make compile
$ make sim
$ make get_II
$ make check_sim_output_aie

The first command compiles the graph code for the SystemC simulator, the second command runs the simulation, the third command calls a Python script to extract the Initiation Interval from the compiled design, and the final command invokes MATLAB to compare the simulation output against the golden test vectors and compute raw throughput.

Alternatively, you can issue make all. The console should output:

*** LOOP_II *** Tile: 24_0	minII: 43	beforeII: 123	afterII: 123	Line: 77	File: farrow_kernel.cpp
Raw Throughput = 204.7 MSPS
Max error LSB = 1
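For reference, the II extraction performed by make get_II amounts to parsing the *** LOOP_II *** line shown above. The sketch below is a hypothetical stand-in for the actual script shipped with the tutorial:

```python
import re

# Hypothetical parser for the "*** LOOP_II ***" report line; the real
# script invoked by `make get_II` may differ in its details.
LOOP_II_RE = re.compile(
    r"\*\*\* LOOP_II \*\*\*\s+Tile:\s*(\S+)\s+minII:\s*(\d+)\s+"
    r"beforeII:\s*(\d+)\s+afterII:\s*(\d+)\s+Line:\s*(\d+)\s+File:\s*(\S+)"
)

def parse_loop_ii(line):
    """Return a dict of II fields from one report line, or None if no match."""
    m = LOOP_II_RE.search(line)
    if m is None:
        return None
    tile, min_ii, before_ii, after_ii, src_line, src_file = m.groups()
    return {
        "tile": tile,
        "minII": int(min_ii),
        "beforeII": int(before_ii),
        "afterII": int(after_ii),
        "line": int(src_line),
        "file": src_file,
    }
```

The afterII field is the value that matters here: it is the initiation interval the compiler actually achieved after software pipelining.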

Launch vitis_analyzer aiesimulator_output/default.aierun_summary. The current implementation generates the graph and array views shown below.

figure5

Figure 5 - Farrow Filter Initial Implementation Graph View

figure6

Figure 6 - Farrow Filter Initial Implementation Array View

Because every loop iteration produces 16 samples, you need II=16 to achieve the desired throughput of one sample per clock. This first design achieved II=123, so this version of the implementation clearly cannot reach the desired throughput. You can get a rough estimate of the expected throughput from the ratio of the target II to the achieved II.

In this case, 16/123 × 1.25 GHz ≈ 163 Msps. This is broadly consistent with the reported Raw Throughput, which is measured across all graph iterations. A more accurate throughput measurement can be made by measuring the steady state achieved in the final graph iteration. In vitis_analyzer, select the trace view and set markers to measure the throughput of this final iteration as shown below. Because each graph iteration processes 1024 samples, throughput = 1024 / 6.398 µs ≈ 160 Msps.

figure7

Figure 7 - Farrow Filter Initial Implementation Trace View
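The arithmetic behind these two throughput figures can be double-checked in a few lines (the constant names below are illustrative only, not part of the tutorial):

```python
# Back-of-envelope check of the throughput numbers discussed above.
CLOCK_HZ = 1.25e9          # assumed AI Engine clock
SAMPLES_PER_ITER = 16      # samples produced per loop iteration
ACHIEVED_II = 123          # afterII reported by the compiler

# Estimated throughput from the II ratio, in Msps
est_msps = SAMPLES_PER_ITER / ACHIEVED_II * CLOCK_HZ / 1e6

# Steady-state measurement from the trace markers:
# 1024 samples produced in 6.398 us, in Msps
trace_msps = 1024 / 6.398
```

Both numbers land well short of the 1250 Msps target, which motivates the optimization steps that follow in the tutorial.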