AI Engine profiling uses performance counters at all levels of the device:
- Runtime event performance counters for the AI Engine modules
- Runtime memory counters for memory modules and memory tiles
- Runtime interface counters for AI Engine-PL interface tiles
These performance counters can be configured to track a variety of events in the AI Engine, the memory module and the interface tile. Various features such as error-correction code (ECC) scrubbing, event trace and profiling can use these performance counters. Performance counters count occurrences of a given event in a profile configuration. The profile feature offers several different configurations of these performance counters that can be dynamically applied at runtime to collect various profiling statistics.
No changes are required in PS host code when using performance counters. These counters can be configured, read and collected at runtime while the design is executing in hardware. The following table lists the number of performance counters that are available at different configurations.
Different metric sets exist for each part of the array. The following metrics apply to the AI Engine modules:
| Metric Name | Description |
|---|---|
| heat_map | Reports the time during which the AI Engine was active, stalled, or executing vector instructions. |
| stalls | Reports the time the AI Engine is not active due to memory access, stream access, cascade access, or lock acquisition. |
| execution | Reports the time spent by the AI Engine on vector instructions and load/store instructions, and the cumulative instruction time. |
| floating_point | Reports time spent on floating-point exceptions |
| aie_trace | Reports the amount of trace data produced by the AI Engine module and by the memory module, along with the back-pressure on each of these trace streams. |
| write_throughputs | Reports the time the AI Engine spends executing write operations on the stream and cascade interfaces, along with the write throughput on these interfaces. |
| read_throughput | Reports the time the AI Engine spends executing read operations on the stream and cascade interfaces, along with the read throughput on these interfaces. |
| stream_put_get | Reports time spent on executing cascade and stream operations |
The following metrics apply to the memory modules:
| Metric Name | Description |
|---|---|
| conflicts | Reports time spent on memory conflicts and ECC errors |
| dma_locks | Reports time spent on stalled locks on both channels |
| dma_stalls_s2mm | Reports the time spent by each S2MM channel on stalls due to lock acquisition |
| dma_stalls_mm2s | Reports the time spent by each MM2S channel on stalls due to lock acquisition |
| s2mm_throughputs | Reports the number of BD packets and the throughput of each S2MM channel. In AI Engine-ML the back-pressure time is also available. |
| mm2s_throughputs | Reports the number of BD packets and the throughput of each MM2S channel. In AI Engine-ML the back-pressure time is also available. |
The following metrics apply to the memory tiles:
| Metric Name | Description |
|---|---|
| s2mm_channels | Reports Transfer and Stalled time, and the number of AXI4-Stream packets and BD packets transferred over the memory tile input channel. |
| s2mm_channels_details | Reports Transfer, Backpressure, lock stall, and stream starvation time on input streams. |
| mm2s_channels | Reports Transfer and Stalled time, and the number of AXI4-Stream packets and BD packets transferred over the memory tile output channel. |
| mm2s_channels_details | Reports Transfer, Backpressure, lock stall, and stream starvation time on output streams. |
| memory_stats | Reports Group Errors on Memory |
| s2mm_throughputs | Reports Transfer, Starvation, Backpressure, lock stall time along with S2MM Channel Throughput. |
| mm2s_throughputs | Reports Transfer, Starvation, Backpressure, lock stall time along with MM2S Channel Throughput. |
| conflict_statsN | Reports the number of conflicts on 4 consecutive memory banks, starting at bank 4N (N = 0, 1, 2, 3). |
The following metrics apply to the interface tiles:
| Metric Name | Description |
|---|---|
| input_throughputs | Reports Transfer, Stalled, Idle time as well as throughput |
| output_throughputs | Reports Transfer, Stalled, Idle time as well as throughput |
| input_stalls | Reports Stall and Idle time for channel 0. For AI Engine-ML, it reports Backpressure and Starvation time for channels 0 and 1. |
| output_stalls | Reports Stall and Idle time for channel 0. For AI Engine-ML, it reports Backpressure and Lock Stall time for channels 0 and 1. |
| packets | Reports number of packets (input/output) |
| start_to_bytes_transferred | Total clock cycles to transfer byte count for specified graph/port |
| interface_tile_latency | Total latency in clock cycles between graph1:port1 and graph2:port2 |
For more details on these metrics, see the chapters on Profiling the AI Engine, Memory Module and Interface Tile in the AI Engine Tools and Flows User Guide (UG1076).
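Two of the metrics above involve a little arithmetic when interpreting the results: start_to_bytes_transferred reports a cycle count for a known byte count, and conflict_statsN covers four banks starting at bank 4N. The following is a minimal Python sketch of both calculations. The function names, the sample values, and the 1.25 GHz clock are assumptions made for illustration; use the clock frequency that actually applies to your design.

```python
def throughput_bytes_per_second(byte_count: int, cycles: int, clock_hz: float) -> float:
    """Average throughput for byte_count bytes moved in `cycles` clock cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    elapsed_s = cycles / clock_hz  # cycle count converted to seconds
    return byte_count / elapsed_s


def conflict_stats_banks(n: int) -> list:
    """Banks covered by conflict_statsN: four consecutive banks starting at 4N."""
    if n not in (0, 1, 2, 3):
        raise ValueError("N must be 0, 1, 2 or 3")
    return list(range(4 * n, 4 * n + 4))


# Hypothetical sample: 4096 bytes in 2048 cycles at an assumed 1.25 GHz clock.
rate = throughput_bytes_per_second(4096, 2048, 1.25e9)
print(f"{rate / 1e9:.2f} GB/s")   # 2 bytes per cycle at 1.25 GHz
print(conflict_stats_banks(1))    # banks covered by conflict_stats1
```

With these sample values the transfer averages 2 bytes per cycle, and conflict_stats1 maps to banks 4 through 7.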
Launch AI Engine Profiling
There are two ways to launch AI Engine profiling in hardware:
- XRT flow
- XSDB flow
XRT Flow
To use the XRT flow, create an xrt.ini file in the same directory as the PS host application. The file needs one line that enables AI Engine profiling, followed by lines that specify the settings of the metrics to collect.
The xrt.ini file is as follows:
[Debug]
#
# Profile Counters
#
aie_profile = true
[AIE_profile_settings]
# Sample interval (in usec)
interval_us = 100
# All tiles
tile_based_aie_metrics = all:heat_map
tile_based_aie_memory_metrics = all:conflicts
tile_based_interface_tile_metrics = all:s2mm_throughputs:0
where:
- [Debug] - Specifies the debug section for XRT. The section name is case sensitive.
- aie_profile - Enables profile configuration.
- [AIE_profile_settings] - Specifies the profile settings for XRT.
- interval_us - Sets the profile data collection interval in microseconds.
- tile_based_aie_metrics - Configures the metric to be applied to the AI Engine on a tile basis.
- tile_based_aie_memory_metrics - Configures the memory metric to be applied on a tile basis.
- tile_based_interface_tile_metrics - Configures the interface metric to be applied on a tile basis.
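When several profiling runs use different metric sets, it can be convenient to generate xrt.ini from a script instead of editing it by hand. The following is a minimal Python sketch using the standard configparser module. The helper name build_xrt_ini and its parameters are hypothetical; the section names, keys, and default values are taken from the listing above.

```python
import configparser
import io


def build_xrt_ini(interval_us=100,
                  aie_metrics="all:heat_map",
                  memory_metrics="all:conflicts",
                  interface_metrics="all:s2mm_throughputs:0"):
    """Return the text of an xrt.ini that enables AI Engine profiling."""
    config = configparser.ConfigParser()
    # Keep option names exactly as written instead of lower-casing them.
    config.optionxform = str
    # Section names are written as given; [Debug] is case sensitive.
    config["Debug"] = {"aie_profile": "true"}
    config["AIE_profile_settings"] = {
        "interval_us": str(interval_us),
        "tile_based_aie_metrics": aie_metrics,
        "tile_based_aie_memory_metrics": memory_metrics,
        "tile_based_interface_tile_metrics": interface_metrics,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Print the generated file; save this text as xrt.ini next to the host application.
print(build_xrt_ini(aie_metrics="all:stalls"))
```

The sketch only formats the file. It does not check that a metric name is valid for the selected tile type.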
There are several ways to select the tiles to profile, either by tile coordinates or by graph.
For more details, see the chapters on Profiling the AI Engine in Hardware, Profiling Flow and XRT Flow in the AI Engine Tools and Flows User Guide (UG1076).
XSDB Flow
When the application runs, the profile data is captured in counters that can be retrieved by the debugging and profiling IP. To capture and evaluate this data, you must connect to the hardware device using xsdb. This command is typically used to program the device and debug applications. Connect your system to the hardware platform or device over JTAG, launch the xsdb command in a command shell, and run the following sequence of commands:
xsdb% connect
xsdb% ta 1
xsdb% source $::env(XILINX_VITIS)/scripts/vitis/util/aie_profile.tcl
xsdb% aieprofile start -graphs myGraph -work-dir ./Work \
-graph-based-aie-metrics "dut:kernel1:heat_map" \
-tile-based-aie-metrics "all:stalls" \
-graph-based-aie-memory-metrics "dut:all:write_throughputs" \
-tile-based-aie-memory-metrics "{4,1}:{6,2}:conflicts; {8,3}:dma_locks" \
-tile-based-interface-tile-metrics "2:10:input_throughputs:3" \
-interval 20 -samples 100
where:
- connect - Launches the hw_server and connects xsdb to the device.
- source $::env(XILINX_VITIS)/scripts/vitis/util/aie_profile.tcl - Sources the Tcl trace command to set up the xsdb environment.
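The tile-based options in the command above pack several selections into one string, for example "{4,1}:{6,2}:conflicts; {8,3}:dma_locks". The following Python sketch splits such a string into individual selections, which can help when checking a command line before running it. It assumes that a pair of {column,row} coordinates denotes a range from the first tile to the second and that a single coordinate denotes one tile. This reading of the syntax is an assumption, the names TileSelection and parse_tile_metrics are hypothetical, and forms such as all:stalls or 2:10:input_throughputs:3 are deliberately not handled.

```python
import re
from typing import NamedTuple


class TileSelection(NamedTuple):
    start: tuple   # (column, row) of the first tile
    end: tuple     # (column, row) of the last tile; same as start for one tile
    metric: str


# One {column,row} coordinate, tolerating spaces around the numbers.
_TILE = r"\{\s*(\d+)\s*,\s*(\d+)\s*\}"
# Either "{c,r}:metric" or "{c,r}:{c,r}:metric".
_ENTRY = re.compile(rf"^{_TILE}(?::{_TILE})?:(\w+)$")


def parse_tile_metrics(spec: str) -> list:
    """Split a ';'-separated tile metric string into TileSelection items."""
    selections = []
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unsupported entry: {entry!r}")
        c0, r0, c1, r1, metric = match.groups()
        start = (int(c0), int(r0))
        end = (int(c1), int(r1)) if c1 is not None else start
        selections.append(TileSelection(start, end, metric))
    return selections


for sel in parse_tile_metrics("{4,1}:{6,2}:conflicts; {8,3}:dma_locks"):
    print(sel.start, sel.end, sel.metric)
```

Raising an error on unrecognized entries, instead of skipping them, keeps a mistyped selection from going unnoticed.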
For more details, see the chapters on Profiling the AI Engine in Hardware, Profiling Flow and XSDB Flow in the AI Engine Tools and Flows User Guide (UG1076).