After running the system, whether in simulation, hardware emulation, or in hardware, a run_summary report is generated when the application has been properly configured.
During simulation of the AI Engine graph, the AI Engine simulator or hardware emulation captures performance and activity metrics and writes the report to the output directory, ./aiesimulator_output or ./sim/behav_waveform/xsim, respectively. The generated summary is called default.aierun_summary.
The run_summary can be viewed in the Vitis IDE. The summary contains a collection of reports that capture the performance profile of the AI Engine application as it runs. For example, to open the AI Engine simulator run summary, use the following command:
vitis -a ./aiesimulator_output/default.aierun_summary
The Vitis IDE opens displaying the Summary page of the report. The tool lists the different reports that are available in the summary. For a complete understanding of the Analysis view, see Working with the Analysis View (Vitis Analyzer) in the Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393).
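The open step above can be wrapped in a small guard so that the IDE is only launched when a summary was actually generated. This is a minimal sketch, not part of the tool flow: the helper name find_run_summary is hypothetical, and the only assumption about the output directory is that the summary file ends in .aierun_summary, as stated above.

```shell
# Hypothetical helper: print the path of the first *.aierun_summary file
# found directly inside the given directory; return non-zero if none exists.
find_run_summary() {
    dir="$1"
    for f in "$dir"/*.aierun_summary; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Usage (assumes the default AI Engine simulator output directory):
#   summary=$(find_run_summary ./aiesimulator_output) && vitis -a "$summary"
```

If no summary is found, the helper returns a non-zero status and the vitis command in the usage line is not run.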
default.aierun_summary also contains some of the same reports as <GRAPH_TB_FILE_NAME>.aiecompile_summary. These reports are Graph and Array. To see those reports, go to Viewing Compilation Results in the Analysis View of the Vitis Unified IDE.
Report Summary
This is the top level of the report. It lists the details of the run, such as the date, tool version, and the command line used to launch the simulator.
Profile Summary
When the aiesimulator --profile option is specified, the simulator collects profiling data on the AI Engine graph and kernels. It presents a high-level view of the AI Engine graphs and the kernels mapped to processors, with tables and graphic presentation of metric data.
The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine processor/DMA memory stalls, deadlock, interference, critical paths, and maximum contention. This is useful for system-level performance tuning and debug. System performance is presented in terms of latency (number of cycles taken to execute the system) and throughput (data/time taken). When system performance is sub-optimal, examine and control (through constraints) mapping and buffer packing, stream and packet switch allocation, interaction with neighboring processors, and external interfaces. An example of the raw Profile Summary report is shown.
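The latency and throughput definitions above can be illustrated with a short calculation. This is only a sketch: the byte count, cycle count, and clock frequency are made-up values, the helper names are hypothetical, and converting cycles to time through a clock frequency is an assumption that is not stated in the report itself.

```shell
# Throughput in MB/s (bytes per microsecond): data divided by time taken,
# where time = cycles / clock frequency in MHz. All inputs are hypothetical.
throughput_mbps() {
    awk -v bytes="$1" -v cycles="$2" -v mhz="$3" \
        'BEGIN { printf "%.2f\n", bytes * mhz / cycles }'
}

# Latency in microseconds for a given cycle count and clock frequency in MHz.
latency_us() {
    awk -v cycles="$1" -v mhz="$2" 'BEGIN { printf "%.3f\n", cycles / mhz }'
}

throughput_mbps 4096 2048 1000   # prints 2000.00
latency_us 2048 1000             # prints 2.048
```

With these example numbers, 4096 bytes processed in 2048 cycles at 1000 MHz correspond to a latency of 2.048 microseconds and a throughput of 2000 MB/s.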
Specific tables can be used to see profile information specific to the kernels. This is shown as a chart with a table showing what is running on the tiles. The following is an example chart.
In this view, the chart shows the Total Function Time, which is the total number of cycles the function used while running the graph. The y-axis shows the ID of the function, which can be looked up in the ID column of the table below the chart. This information can help determine where time is being spent and can guide optimization or debug. The table lists the following:
- ID of the function profiled
- The function name
- The number of times the function was executed
- The total time taken in cycles to execute the function
- The total function execution time as a percent of the total execution time of the graph
- The total time taken in cycles to execute the function and the functions (descendants) called from within it
- The total time as a percent to execute the function and the functions (descendants) called from within it
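The two percentage columns in the list above follow directly from the cycle columns. As a rough illustration, assuming the percentage is taken against the total execution time of the graph in cycles, and using made-up cycle counts and a hypothetical helper name:

```shell
# Percentage of the total graph execution time, given a cycle count for a
# function (or a function plus its descendants) and the total graph cycles.
pct_of_total() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

# Hypothetical function: 250 cycles on its own, 400 cycles including
# descendants, in a graph that ran for 1000 cycles in total.
pct_of_total 250 1000    # prints 25.00
pct_of_total 400 1000    # prints 40.00
```

A large gap between the two percentages for the same function suggests that most of the time is spent in the functions it calls rather than in its own body.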
Trace Report
Issues such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using traditional interactive debug techniques. Event trace provides a systematic way of collecting system level traces for the program events, providing direct support for generation, collection, and streaming of hardware events as a trace. The following image shows the Trace report open in the Vitis IDE.
- _main: Core main function. This is different from the function used in the top-level file.
- _main_init: Kernel init function that runs once per graph execution.
- _cxa_finalize: Calls destructors of global C++ objects.
- _fini: This section holds executable instructions that terminate the process. When a program exits normally, the system runs the code in this section.
aiesimulator --pkg-dir=./Work --online -wdb -text
Features of the trace report include the following.
- Each tile is reported. Within each tile the report includes core, DMA, locks, and I/O if there are PL blocks in the graph.
- There is a separate timeline for each kernel mapped to a core. It shows when the kernel is executing (blue) or stalled (red) due to memory conflicts or waiting for stream data.
- By using lock IDs in the core, DMA, and locks sections you can identify how cores and DMAs interact with one another by acquiring and releasing locks.
- The lock section shows the activities of the locks in the tile, both the allocation and release for read and write lock requests. A particular lock can be allocated by nearby tiles. Thus, this section does not necessarily match the core lock requests of the core shown in the left pane of the image.
- If a lock is not released, a red bar extends through the end of simulation time.
- Clicking the left or right arrows takes you to the start and end of a state, respectively.
- The data view shows the data flowing through the stream switch network, with slave entry points and master exit points at each hop. This is most useful for finding routing delays and network congestion effects with packet switching, where one packet might be delayed behind another packet that shares the same stream channel.