Profiling Data Generation
In the simulation framework, the AI Engine simulator can generate a profiling report for the complete application. The
--profile flag generates this report.
aiesimulator --pkg-dir=Work --profile
The simulator generates text and XML files in the aiesimulator_output directory. For the tile located in column C and
row R, it generates two file types. The *_funct file reports the number of calls and the number of cycles for each
function. The *_instr file is a report that goes down to the assembly code. To visualize the report, use the Analysis
View of the Vitis Unified IDE.
vitis -a aiesimulator_output/default.aierun_summary
The Profile tab opens the Profile report, which shows a menu of sections displaying the following information:
- Summary
- Reports the total cycle count, total instruction count, and program size in memory.
- Function Reports
- Shows several key indicators of the functions in the graphs.
- Number of calls
- Reports the number of times the function executes.
- Total function time (cycles and %)
- Reports the function execution time (in cycles and as a percent). This is the time required to execute the code within a function, exclusive of any calls to its descendants.
- Total function + descendant time (cycles and %)
- Reports the total time required to execute the code within a function and in any function it calls, directly or indirectly. (Descendant functions are functions called by the function whose profile information is being reported.)
- Min/Avg/Max function time (cycles)
- Reports the minimum/average/maximum function execution time (in cycles).
- Min/Avg/Max function + descendant time (cycles)
- Reports the minimum/average/maximum function execution time, as well as the execution time of the descendant functions. (Descendant functions are functions called by the function whose profile information is being reported.)
- Program counter Low/High
- Reports the lowest and highest program counter value for a specific function.
- Profile Details
- Shows the assembly code, function by function, with additional details for each instruction. The columns are as shown in the following table.
| Column Name | Content |
|---|---|
| PC | Program counter |
| Instruction | Up to 16 bytes for each line |
| Assembly | Assembly code mnemonic with the full 7-way instruction word |
| Exe-count | Number of times this line has been executed by the processor |
| Cycles | Number of cycles required |
| Wait States | Number of wait states incurred by this instruction, for example because of memory conflicts |
| Relative cycle use within function | Shown as ‘*’ lines whose relative length visually shows the relative cycle use of this instruction within the function |
| Relative cycle use within simulation | Shown as ‘*’ lines whose relative length visually shows the relative cycle use of this instruction within the simulation (including main() and all functions) |
| Relative wait-state use within function | Shown as ‘W’ lines where the relative length visually shows the relative cycles used by wait-states during this instruction within the function |
| Relative wait state use within simulation | Shown as ‘*’ lines where the relative length visually shows the relative cycles used by wait states during this instruction within the simulation (including main() and all functions) |
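The difference between total function time and total function + descendant time in the Function Reports can be illustrated with a small sketch. The call tree, function names, and cycle counts below are hypothetical and are not taken from a real report. The sketch also assumes that percentages are taken relative to the cycles of the root function, which may differ from how the tool normalizes them.

```python
# Hypothetical call tree: the function names and cycle counts are invented
# for illustration and do not come from a real profile report.
CALL_TREE = {
    "main": {"self_cycles": 120, "calls": ["fir_kernel"]},
    "fir_kernel": {"self_cycles": 900, "calls": ["load_coeffs"]},
    "load_coeffs": {"self_cycles": 60, "calls": []},
}


def inclusive_cycles(name, tree):
    """Cycles spent in `name` plus every function it calls, directly or indirectly."""
    node = tree[name]
    return node["self_cycles"] + sum(inclusive_cycles(c, tree) for c in node["calls"])


def as_percent(cycles, total_cycles):
    """Express a cycle count as a percentage of a reference total."""
    return 100.0 * cycles / total_cycles


if __name__ == "__main__":
    # Assumption: "main" is the root, so its inclusive time is the reference total.
    total = inclusive_cycles("main", CALL_TREE)
    for name, node in CALL_TREE.items():
        self_c = node["self_cycles"]
        incl_c = inclusive_cycles(name, CALL_TREE)
        print(
            f"{name:12s} function time = {self_c:5d} ({as_percent(self_c, total):5.1f}%)  "
            f"function + descendants = {incl_c:5d} ({as_percent(incl_c, total):5.1f}%)"
        )
```

In this made-up example, fir_kernel has a function time of 900 cycles but a function + descendant time of 960 cycles, because the 60 cycles spent in load_coeffs are attributed to it as well.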
Performance Debug with Profiling Data
To improve the performance of a design, first optimize the function that takes the most cycles, then move on to the functions that take progressively smaller shares of the cycles. The total function time graph helps you select these functions.
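As a rough sketch of this prioritization, the snippet below ranks functions by their total function time and shows the cumulative share covered as you move down the list. The function names and cycle counts are hypothetical; in practice the values would be read from the Function Reports.

```python
def rank_by_function_time(self_cycles):
    """Return (name, cycles, percent, cumulative_percent) tuples, largest first."""
    total = sum(self_cycles.values())
    ranked = sorted(self_cycles.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for name, cycles in ranked:
        running += cycles
        rows.append((name, cycles, 100.0 * cycles / total, 100.0 * running / total))
    return rows


# Hypothetical total function time (cycles) per function, not from a real run.
SELF_CYCLES = {"main": 140, "fir_kernel": 900, "load_coeffs": 60, "scale": 100}

if __name__ == "__main__":
    for name, cycles, pct, cum in rank_by_function_time(SELF_CYCLES):
        print(f"{name:12s} {cycles:5d} cycles  {pct:5.1f}%  cumulative {cum:5.1f}%")
```

With these invented numbers, fir_kernel alone accounts for 75% of the cycles, so it would be the first candidate for optimization.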
For the optimization itself, you can use pragmas such as chess_prepare_for_pipelining and chess_loop_range for the usual loop pipelining and unrolling. The Profile Details tab provides insight into wait states. Even if your inner loop is perfectly optimized, you can lose cycles to wait states. Wait states occur when there are conflicts in resource access, such as the following:
- Two reads, or one read and one write, within the same memory bank, either from the local AI Engine or from two adjacent AI Engines.
- The local AI Engine tries to access a bank (either read or write) while a memory DMA is accessing it for a data transfer.
The following figure shows an example of profile details.
In the preceding figure, the first step uses the loop-start (ls) and loop-end (le) to locate the inner loop, which is marked by the blue rectangle. This is consistent with the Exe-count, which is much higher on these lines than on the others.
The second step locates an instruction that consistently takes two clock cycles instead of one because of wait states:
- Exe-count = 56
- Cycles = 112
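The check in this second step amounts to dividing Cycles by Exe-count for each line of the Profile Details. The sketch below assumes, as in the example above, that an instruction without wait states takes one cycle per execution. Only the row with Exe-count 56 and Cycles 112 comes from the text; the other row and the PC values are hypothetical.

```python
def cycles_per_execution(exe_count, cycles):
    """Average number of cycles one execution of an instruction line takes."""
    if exe_count == 0:
        return 0.0
    return cycles / exe_count


def lines_with_stalls(rows, expected=1.0):
    """Return rows whose average cycles per execution exceed `expected`."""
    return [
        row for row in rows
        if cycles_per_execution(row["exe_count"], row["cycles"]) > expected
    ]


# Row at PC 0x120 uses the Exe-count and Cycles values from the text.
# The PC values and the row at 0x110 are hypothetical.
ROWS = [
    {"pc": 0x110, "exe_count": 56, "cycles": 56},
    {"pc": 0x120, "exe_count": 56, "cycles": 112},
]

if __name__ == "__main__":
    for row in lines_with_stalls(ROWS):
        ratio = cycles_per_execution(row["exe_count"], row["cycles"])
        print(f"PC {row['pc']:#x}: {ratio:.1f} cycles per execution")
```

A ratio of 2.0 for the flagged line corresponds to one wait state on each execution of that instruction.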
The VMUL instruction takes data in register yd and coefficients in wc0. A load takes seven clock cycles before the data is actually available in the register.
The third step is to analyze the code seven clock cycles before the first iteration (3) or within the loop on the previous iteration (3’). In each case, there are two loads from the same bank (all the reads are from the same window in the source code).