Profiling Data Generation
In the simulation framework, the AI Engine simulator can generate a profiling report for the complete application. To generate this report, pass the --profile flag:
aiesimulator --pkg-dir=Work --profile
Text files and XML files are generated in the directory aiesimulator_output. Two types of files are generated for the tile located in column C and row R. The *_funct file reports the number of calls and the number of cycles for each function. The *_instr file is a report that goes down to the assembly code. To visualize the report, use the Vitis Analyzer:
vitis_analyzer aiesimulator_output/default.aierun_summary
- Summary
  - Reports the total cycle count, total instruction count, and program size in memory.
- Function Reports
  - Shows several key indicators, function by function, in a table and graphs:
    - Number of calls
    - Total function time (cycles and %)
    - Total function + descendant time (cycles and %)
    - Min/Avg/Max function time (cycles)
    - Min/Avg/Max function + descendant time (cycles)
    - Program counter Low/High
- Profile Details
  - Shows the assembly code, function by function, with additional per-instruction details. The columns are described in the following table.
Column Name | Content |
---|---|
PC | Program counter |
Instruction | Up to 16 bytes for each line |
Assembly | Assembly code mnemonic with the full 7-way instruction word |
Exe-count | Number of times this line has been executed by the processor |
Cycles | Number of cycles required |
User Count | |
Wait States | Number of wait states; some instructions encounter memory conflicts that result in wait states |
Relative cycle use within function | Shown as ‘*’ lines where the relative length visually shows the relative cycle use of this instruction within the function |
Relative cycle use within simulation | Shown as ‘*’ lines where the relative length visually shows the relative cycle use of this instruction within the simulation (including main() and all functions) |
Relative wait-state use within function | Shown as ‘W’ lines where the relative length visually shows the relative cycles used by wait states during this instruction within the function |
Relative wait-state use within simulation | Shown as ‘*’ lines where the relative length visually shows the relative cycles used by wait states during this instruction within the simulation (including main() and all functions) |
Performance Debug with Profiling Data
Performance improvement of a design should start with the function that consumes the most cycles. Once that function is optimized, move on to the functions that account for progressively smaller proportions of the cycles. The total function time graph helps you select these functions.
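As a rough illustration of this prioritization, the sketch below ranks functions by total cycles and accumulates their share of the run. It does not parse the simulator's report files; the function names and cycle counts are made-up sample values, and rank_by_cycles is a hypothetical helper, not part of any AMD tool.

```python
def rank_by_cycles(func_cycles):
    """Return (name, cycles, percent, cumulative_percent) rows, hottest first.

    func_cycles maps a function name to its total cycle count, as you
    might copy it by hand from the Function Reports table.
    """
    total = sum(func_cycles.values())
    if total == 0:
        return []
    rows = []
    cumulative = 0.0
    ordered = sorted(func_cycles.items(), key=lambda kv: kv[1], reverse=True)
    for name, cycles in ordered:
        share = 100.0 * cycles / total
        cumulative += share
        rows.append((name, cycles, round(share, 1), round(cumulative, 1)))
    return rows


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from a real profile.
    sample = {"main": 300, "fir_kernel": 4000, "scale": 700}
    for name, cycles, pct, cum in rank_by_cycles(sample):
        print(f"{name:12s} {cycles:6d} cycles {pct:5.1f}% (cumulative {cum:5.1f}%)")
```

Working down the list until the cumulative share is high enough gives a simple stopping rule for the optimization effort.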
For the optimization itself, you can use pragmas (chess_prepare_for_pipelining, chess_loop_range) for the usual loop pipelining and unrolling. The Profile Details tab provides insight into wait states. Even if your inner loop is perfectly optimized, you may lose some cycles to wait states. These wait states occur when there are conflicts in resource access:
- Two reads, or one read and one write, target the same memory bank, issued either by the local AI Engine or by two neighboring AI Engines.
- The local AI Engine tries to access a bank (either read or write) while a memory DMA is accessing it for a data transfer.
Here is an example of profile details.
In this screenshot, the inner loop is first located using ls and le, which are the loop-start and loop-end PCs. This inner loop is shown within the blue rectangle. The identification looks correct because the Exe-count is much higher on these lines than on the others.
In the second step, an instruction is identified that consistently takes two clock cycles instead of one because of wait states:
- Exe-count = 56
- Cycles = 112
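The arithmetic behind this observation can be sketched as follows. The helper wait_state_summary is hypothetical, and it assumes the instruction nominally takes one cycle per execution, as the text states for this case.

```python
def wait_state_summary(exe_count, cycles, nominal_cycles=1):
    """Return (average cycles per execution, cycles lost to stalls).

    nominal_cycles is the assumed cost of one execution without any
    wait state (one cycle in the example discussed here).
    """
    if exe_count <= 0:
        raise ValueError("exe_count must be positive")
    average = cycles / exe_count
    lost = cycles - exe_count * nominal_cycles
    return average, lost


if __name__ == "__main__":
    # Values from the Exe-count and Cycles columns of the example.
    average, lost = wait_state_summary(56, 112)
    print(f"average cycles per execution: {average}")
    print(f"cycles lost to wait states:   {lost}")
```

With Exe-count = 56 and Cycles = 112, the average is 2.0 cycles per execution, so 56 cycles are spent beyond the nominal cost.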
The VMUL instruction takes data in register yd and coefficients in wc0. A load takes seven clock cycles before the data is actually available in the register. The third step is therefore to analyze the code seven clock cycles before the first iteration (3) or, within the loop, in the previous iteration (3’). In both cases there are two loads that target the same bank (all the reads are from the same window in the source code).