AI Engine Simulation-Based Profiling - 2023.2 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID
UG1076
Release Date
2023-12-04
Version
2023.2 English

Profiling Data Generation

In the simulation framework, the AI Engine simulator can generate a profiling report for the complete application. This report is generated using the --profile flag.

aiesimulator --pkg-dir=Work --profile

Text files and XML files are generated in the aiesimulator_output directory. Two types of files are generated for each tile, identified by its column C and row R. The *_funct file reports the number of calls and the number of cycles for each function. The *_instr file reports the same information at the level of the assembly code. To visualize the report, use the Analysis View of the Vitis unified IDE.

vitis -a aiesimulator_output/default.aierun_summary

The Profile tab opens the Profile report, which is organized into the following sections.

Summary
Reports the total cycle count, total instruction count, and program size in memory.
Function Reports
Shows several key indicators of the functions in the graphs.
Number of calls
Reports the number of times the function is executed.
Total function time (cycles and %)
Reports the function execution time (in cycles and as a percent). This is the time required to execute the code within a function, exclusive of any calls to its descendants.
Total function + descendant time (cycles and %)
Reports the time required to execute the code within a function plus the time spent in its descendant functions, that is, all the functions it calls directly or indirectly (in cycles and as a percent).
Min/Avg/Max function time (cycles)
Reports the minimum, average, and maximum function execution time (in cycles).
Min/Avg/Max function + descendant time (cycles)
Reports the minimum, average, and maximum execution time of the function together with its descendant functions, that is, the functions it calls directly or indirectly (in cycles).
Program counter Low/High
Reports the lowest and highest program counter value for a specific function.
Profile Details
Shows the assembly code, function by function, with per-instruction details. The columns are described in the following table.
Table 1. Profile Details Column Description
Column Name | Content
PC | Program counter
Instruction | Up to 16 bytes for each line
Assembly | Assembly code mnemonic with the full 7-way instruction word
Exe-count | Number of times this line has been executed by the processor
Cycles | Number of cycles required
User Count |
Wait States | Number of wait states caused by memory conflicts, which some instructions can incur
Relative cycle use within function | Shown as ‘*’ lines where the relative length visually shows the relative cycle use of this instruction within the function
Relative cycle use within simulation | Shown as ‘*’ lines where the relative length visually shows the relative cycle use of this instruction within the simulation (including main() and all functions)
Relative wait-state use within function | Shown as ‘W’ lines where the relative length visually shows the relative cycles used by wait states during this instruction within the function
Relative wait-state use within simulation | Shown as ‘*’ lines where the relative length visually shows the relative cycles used by wait states during this instruction within the simulation (including main() and all functions)
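
The difference between "Total function time" and "Total function + descendant time" in the Function Reports section can be illustrated with a small host-side sketch. The CallNode structure and the helper names below are hypothetical and exist only for illustration; they are not a format or API produced by aiesimulator.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical call-tree node, used only to illustrate the two metrics.
struct CallNode {
    std::string name;
    std::uint64_t self_cycles;      // "Total function time": cycles in the function body only
    std::vector<CallNode> callees;  // functions called directly by this function
};

// "Total function + descendant time": the function's own cycles plus the
// cycles of every function it calls, directly or indirectly.
std::uint64_t inclusive_cycles(const CallNode& node) {
    std::uint64_t total = node.self_cycles;
    for (const CallNode& callee : node.callees) {
        total += inclusive_cycles(callee);
    }
    return total;
}

// Expresses a cycle count as a percentage of a reference cycle count.
double percent_of(std::uint64_t cycles, std::uint64_t reference_cycles) {
    assert(reference_cycles > 0);
    return 100.0 * static_cast<double>(cycles) /
           static_cast<double>(reference_cycles);
}
```

A function with a small self_cycles value but a large inclusive_cycles value spends most of its time in its callees, so the callees are the better optimization targets.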

Performance Debug with Profiling Data

Start improving the performance of a design by optimizing the function that consumes the most cycles. After that, move on to the functions that consume progressively smaller proportions of the cycles. The total function time graph helps you select these functions.
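
As a minimal sketch of this prioritization, the following host-side code ranks functions by their total function time and counts how many of the top functions account for a given share of the cycles. The FunctionProfile structure and both helper functions are hypothetical; the cycle counts would have to be copied by hand from the Function Reports section.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record holding one row of the function report.
struct FunctionProfile {
    std::string name;
    std::uint64_t total_function_cycles;  // excludes descendants
};

// Returns the functions ordered from most to fewest cycles.
std::vector<FunctionProfile> rank_by_cycles(std::vector<FunctionProfile> functions) {
    std::stable_sort(functions.begin(), functions.end(),
                     [](const FunctionProfile& a, const FunctionProfile& b) {
                         return a.total_function_cycles > b.total_function_cycles;
                     });
    return functions;
}

// Given a ranked list, returns how many leading functions are needed to
// account for at least `fraction` of all cycles in the list.
std::size_t count_to_cover(const std::vector<FunctionProfile>& ranked, double fraction) {
    assert(fraction > 0.0 && fraction <= 1.0);
    std::uint64_t total = 0;
    for (const FunctionProfile& f : ranked) {
        total += f.total_function_cycles;
    }
    const double target = fraction * static_cast<double>(total);
    std::uint64_t running = 0;
    std::size_t count = 0;
    for (const FunctionProfile& f : ranked) {
        if (static_cast<double>(running) >= target) {
            break;
        }
        running += f.total_function_cycles;
        ++count;
    }
    return count;
}
```

If a handful of functions already cover most of the cycles, optimizing the remaining ones is unlikely to change the overall cycle count much.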

For the optimization itself, you can use pragmas (chess_prepare_for_pipelining, chess_loop_range) for the usual pipelining and unrolling of loops. The Profile Details tab provides insight into wait states. Even if your inner loop is perfectly optimized, you may lose some cycles due to wait states. These wait states occur when there are conflicts in resource access:

  • Two reads, or one read and one write, target the same memory bank, whether they come from the local AI Engine or from two neighboring AI Engines.
  • The local AI Engine tries to access a bank (either read or write) while a memory DMA is accessing it for some data transfer.
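
The loop pragmas mentioned above are written between the loop header and the loop body. The following is a minimal sketch, not code from this guide: the scaled_sum function is hypothetical, and the fallback macros assume that the AI Engine compiler defines __chess__, so that the same source also builds with an ordinary host compiler where the pragmas have no effect.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the Chess compiler defines __chess__. On any other compiler
// the pragmas are defined away so that this sketch still compiles.
#if !defined(__chess__)
#define chess_prepare_for_pipelining
#define chess_loop_range(...)
#endif

// Hypothetical kernel-style loop: multiply each sample by a gain and accumulate.
std::int32_t scaled_sum(const std::int16_t* samples, unsigned count, std::int16_t gain) {
    // The loop-range hint below promises at least 8 iterations.
    assert(count >= 8);
    std::int32_t acc = 0;
    for (unsigned i = 0; i < count; ++i)
        chess_prepare_for_pipelining  // ask the compiler to software-pipeline the loop
        chess_loop_range(8, )         // minimum trip count of 8, no stated maximum
    {
        acc += samples[i] * gain;
    }
    return acc;
}
```

After rebuilding with such hints, the Cycles and Wait States columns of the Profile Details section show whether the loop improved or whether memory conflicts still dominate.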

The following figure shows an example of the profile details.

Figure 1. Profile Details