Use the following command to view the profile results obtained using the XRT flow:
vitis -a xrt.run_summary
To launch the Vitis IDE and view the profiling information from the XSDB flow, use the following command:
vitis -a aie_trace_profile.run_summary
Performance Annotation in Graph View
Depending on what you profile in the hardware, you can view the performance data in specific tables in the graph view. Hover the mouse over the corresponding object to display the performance.
The following example shows how output throughput is measured in hardware and where the graph view displays the results.
- Add the content to xrt.ini:
  [Debug]
  aie_profile=true
  [AIE_profile_settings]
  tile_based_interface_tile_metrics=25:25:output_throughputs
- Run the application to get the profiling result. View it in the Vitis IDE.
The Interface Channels table contains the performance data. Hover over the profiled port to display its performance data, as shown in the following figure.
Example of heat_map Core Metrics and conflicts Memory Metrics
The following image shows the design's active time, stall time, cumulative
instruction count, and vector instruction count as part of the heat_map metrics.
It also shows the memory conflict time and the cumulative memory error time of the
conflicts metrics for ten tiles of an example design.
Consider the AI Engine located at (15,0). During the active utilization time (5.120 ms), it performs 5120000 vector instructions, which represents 87% of the active time. This is excellent performance and indicates a well-optimized core.
Example of Stalls Core Metrics and dma_locks Memory Metrics
The following figure shows the design's memory stall time, stream stall time,
cascade stall time, and lock stall time as part of stalls metrics. The figure also shows cumulative DMA activity time and
cumulative DMA locks count of dma_locks metrics for
ten tiles of an example design.
On core (24,2), the DMA was active for 70.645 ms (77.8 million instructions), but stalled 298 times.
Example of execution Core Metrics and conflicts Memory Metrics
The following image shows the design's cumulative instruction count, vector
instruction count, load instruction count, and store instruction count as part of
the execution metrics. It also shows the memory conflict time and the cumulative
memory error time of the conflicts metrics for ten tiles of an example design.
Observe that core (15,1) experiences minor memory conflicts, which you should identify. Because the occurrence is very small, it could be due to interference from a DMA or from another kernel's memory access.
Example of read_throughputs and write_throughputs AI Engine Metrics and dma_stalls_s2mm and dma_stalls_mm2s AI Engine Memory Metrics
The following image shows the design's stream and cascade read and write
instruction counts as part of the read_throughputs and write_throughputs metrics.
It also shows the s2mm and mm2s stall times for channel 0 and channel 1 as part of
the dma_stalls_s2mm and dma_stalls_mm2s metrics for ten tiles of an example design.
In the preceding figure, the AI Engine kernels perform a cascade write and a stream write more than 45% of the time. This is necessary to keep the AI Engine active because the stream throughput is much less than the memory throughput.
Example of heat_map Core Metrics and dma_locks Memory Metrics
The following figure shows the design's active time, stall time, cumulative
instruction count, and vector instruction count as part of the heat_map metrics.
The figure also shows the cumulative DMA activity time and the cumulative DMA locks
count of the dma_locks metrics for ten tiles of an example design.
Together, the cumulative DMA activity time and the cumulative DMA locks count let you see whether there is any discrepancy between the number of lock acquisitions and the amount of data transferred through the DMAs. You can use the relative lock counts to interpret the relative number of iterations of each core.
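The relative comparison of lock counts can be sketched in a few lines of Python. This is only an illustration: the helper name relative_lock_counts, the tile coordinates, and the counts are hypothetical values that you would replace with numbers read from the dma_locks table. Nothing here is part of the Vitis tools.

```python
def relative_lock_counts(lock_counts):
    """Scale each tile's cumulative DMA lock count by the smallest nonzero count.

    lock_counts maps a (column, row) tile to its cumulative DMA locks count.
    A ratio of 2.0 suggests roughly twice as many lock acquisitions as the
    least active tile, which hints at twice as many iterations.
    """
    nonzero = [count for count in lock_counts.values() if count > 0]
    if not nonzero:
        # No lock activity was recorded on any tile.
        return {tile: 0.0 for tile in lock_counts}
    base = min(nonzero)
    return {tile: count / base for tile, count in lock_counts.items()}


# Hypothetical counts, not taken from the figure.
example_counts = {(24, 0): 600, (24, 1): 300, (24, 2): 300}
ratios = relative_lock_counts(example_counts)
for tile, ratio in sorted(ratios.items()):
    print(tile, ratio)
```

A tile whose ratio differs from what the graph's iteration counts predict is a candidate for closer inspection.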
Example of input_throughputs Interface Metrics
The following figure shows the design's input throughput at the PLIO level.
This is part of the input_throughputs:0 metric in
an 8 x 8 cascaded tiles design.
In this graph, the channel 0 throughput for all input PLIOs is approximately 95%, which is close to the achievable maximum. After this profiling step, verify that the AI Engines are not starving for data.
Report Consolidation in the Vitis IDE
During the profiling stage, not all metrics can be collected at the same
time. You can run the design in hardware multiple times, rebooting the board for
each run and using a different profile metric set in xrt.ini. Typically, for AI Engine
interface throughput profiling, a single channel (the same for all PLIOs) can be
profiled during runtime. Multiple channel profiling necessitates multiple runs.
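One way to prepare several runs is to generate a separate xrt.ini per metric set. The following Python sketch is a minimal illustration under stated assumptions. The first setting value is the one from the output throughput example earlier in this section. The second value, the run names, and the helper make_xrt_ini are hypothetical, so check the exact channel syntax against the XRT documentation for your release.

```python
def make_xrt_ini(interface_metric):
    """Return xrt.ini text that enables AIE profiling with one interface metric setting."""
    lines = [
        "[Debug]",
        "aie_profile=true",
        "",
        "[AIE_profile_settings]",
        f"tile_based_interface_tile_metrics={interface_metric}",
    ]
    return "\n".join(lines) + "\n"


# One entry per hardware run.
runs = {
    "run_output": "25:25:output_throughputs",
    # Assumed syntax: a trailing channel number, as in input_throughputs:0.
    "run_input_ch0": "25:25:input_throughputs:0",
}

ini_by_run = {name: make_xrt_ini(metric) for name, metric in runs.items()}
for name, text in ini_by_run.items():
    print(f"# xrt.ini for {name}")
    print(text)
```

Each generated text would be saved as the xrt.ini for its own run, so that every run produces its own xrt.run_summary.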
In the Vitis IDE, you can consolidate multiple
reports from different runs of the same design. This enables you to display, for
example, the throughput of multiple interface channels. After running vitis -a with the xrt.run_summary of a specific run of the design, you can open other xrt.run_summary reports by clicking the
+ toolbar button in the main toolbar and a window
toolbar, as shown below.
After consolidating the profiling data for input PLIOs channels 0 and 4, and output PLIOs channel 0, the Vitis IDE displays the following table.