Hardware Emulation and Hardware Run Profiling - 2025.2 English - UG1701

Embedded Design Development Using Vitis User Guide (UG1701)

Document ID: UG1701
Release Date: 2025-11-20
Version: 2025.2 English

AI Engine profiling uses performance counters at all levels of the device:

  • Runtime event performance counters for the AI Engine modules
  • Runtime memory counters for memory modules and memory tiles
  • Runtime interface counters for AI Engine-PL interface tiles

These performance counters can be configured to track a variety of events in the AI Engine, the memory module and the interface tile. Various features such as error-correction code (ECC) scrubbing, event trace and profiling can use these performance counters. Performance counters count occurrences of a given event in a profile configuration. The profile feature offers several different configurations of these performance counters that can be dynamically applied at runtime to collect various profiling statistics.

No changes are required in PS host code when using performance counters. These counters can be configured, read and collected at runtime while the design is executing in hardware. The following table lists the number of performance counters that are available at different configurations.

The following tables list the metrics available for each part of the array:

Table 1. AI Engine Metrics
Metric Name        Description
heat_map           Reports the time during which the AI Engine was active, stalled, or executing vector instructions.
stalls             Reports the time the AI Engine is not active due to memory access, stream access, cascade access, or lock acquisition.
execution          Reports the time spent by the AI Engine on vector instructions and load/store instructions, and the cumulative instruction time.
floating_point     Reports the time spent on floating-point exceptions.
aie_trace          Reports the amount of data for trace, back-pressure, memory module, and memory module back-pressure produced by the AI Engine.
write_throughputs  Reports the time spent by the AI Engine executing write operations on the stream and cascade interfaces, and the write throughput on these interfaces.
read_throughput    Reports the time spent by the AI Engine executing read operations on the stream and cascade interfaces, and the read throughput on these interfaces.
stream_put_get     Reports the time spent executing cascade and stream operations.

Table 2. Memory Module Metrics
Metric Name       Description
conflicts         Reports the time spent on memory conflicts and ECC errors.
dma_locks         Reports the time spent on stalled locks on both channels.
dma_stalls_s2mm   Reports the time spent by each S2MM channel on stalls due to lock acquisition.
dma_stalls_mm2s   Reports the time spent by each MM2S channel on stalls due to lock acquisition.
s2mm_throughputs  Reports the number of BD packets and the throughput of each S2MM channel. In AI Engine-ML, the back-pressure time is also available.
mm2s_throughputs  Reports the number of BD packets and the throughput of each MM2S channel. In AI Engine-ML, the back-pressure time is also available.

Table 3. Memory Tile Metrics
Metric Name            Description
s2mm_channels          Reports the transfer and stalled time, and the number of AXI4-Stream packets and BD packets transferred over the memory tile input channel.
s2mm_channels_details  Reports the transfer, back-pressure, lock stall, and stream starvation time on input streams.
mm2s_channels          Reports the transfer and stalled time, and the number of AXI4-Stream packets and BD packets transferred over the memory tile output channel.
mm2s_channels_details  Reports the transfer, back-pressure, lock stall, and stream starvation time on output streams.
memory_stats           Reports group errors on memory.
s2mm_throughputs       Reports the transfer, starvation, back-pressure, and lock stall time, along with the S2MM channel throughput.
mm2s_throughputs       Reports the transfer, starvation, back-pressure, and lock stall time, along with the MM2S channel throughput.
conflict_statsN        Reports the number of conflicts on four consecutive memory banks, starting at bank 4N, where N = 0, 1, 2, or 3.

Table 4. Interface Tile Metrics
Metric Name                 Description
input_throughputs           Reports the transfer, stalled, and idle time, as well as the throughput.
output_throughputs          Reports the transfer, stalled, and idle time, as well as the throughput.
input_stalls                Reports the stall and idle time for channel 0. For AI Engine-ML, reports the back-pressure and starvation time for channels 0 and 1.
output_stalls               Reports the stall and idle time for channel 0. For AI Engine-ML, reports the back-pressure and lock stall time for channels 0 and 1.
packets                     Reports the number of packets (input/output).
start_to_bytes_transferred  Reports the total clock cycles to transfer the byte count for the specified graph/port.
interface_tile_latency      Reports the total latency in clock cycles between graph1:port1 and graph2:port2.

For more details on these metrics, see the chapters on Profiling the AI Engine, Memory Module and Interface Tile in AI Engine Tools and Flows User Guide (UG1076).

Launch AI Engine Profiling

There are two ways to launch AI Engine profiling in hardware:

  • XRT flow
  • XSDB flow

XRT Flow

To use the XRT flow, create the xrt.ini file in the same directory as the PS host application. Add a line that enables AI Engine profiling, followed by lines that specify the settings of the metrics to be collected.

An example xrt.ini file is as follows:
[Debug]
#
# Profile Counters
#
aie_profile = true

[AIE_profile_settings]
# Sample interval (in usec)
interval_us = 100
#   All tiles
tile_based_aie_metrics = all:heat_map
tile_based_aie_memory_metrics = all:conflicts
tile_based_interface_tile_metrics = all:s2mm_throughputs:0

where:

[Debug]
Specifies the debug section for XRT. The section name is case sensitive.
aie_profile
Enables profile configuration.
[AIE_profile_settings]
Specifies the profile settings section for XRT.
interval_us
Sets the profile data collection interval in microseconds.
tile_based_aie_metrics
Configures the metric to be applied to the AI Engine modules on a per-tile basis.
tile_based_aie_memory_metrics
Configures the memory metric to be applied on a per-tile basis.
tile_based_interface_tile_metrics
Configures the interface tile metric to be applied on a per-tile basis.
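
If the xrt.ini file is generated by a build or test script rather than written by hand, the settings above could be emitted with a few lines of Python. The following is a minimal sketch, not part of the Vitis or XRT tooling: the function name write_xrt_ini and its parameters are hypothetical, and the section names, keys, and values are copied from the example in this section.

```python
import configparser
from pathlib import Path


def write_xrt_ini(host_app_dir, interval_us=100):
    """Write an xrt.ini that enables AI Engine profiling (hypothetical helper).

    host_app_dir is assumed to be the directory that holds the PS host
    application, because the xrt.ini file has to sit next to it.
    """
    config = configparser.ConfigParser()
    # Keep option names exactly as written instead of lowercasing them.
    config.optionxform = str

    # Section and key names are taken from the example xrt.ini in this section.
    config["Debug"] = {"aie_profile": "true"}
    config["AIE_profile_settings"] = {
        "interval_us": str(interval_us),
        "tile_based_aie_metrics": "all:heat_map",
        "tile_based_aie_memory_metrics": "all:conflicts",
        "tile_based_interface_tile_metrics": "all:s2mm_throughputs:0",
    }

    ini_path = Path(host_app_dir) / "xrt.ini"
    with ini_path.open("w") as ini_file:
        config.write(ini_file)
    return ini_path


if __name__ == "__main__":
    # Writes ./xrt.ini; run this from the host application directory.
    print(write_xrt_ini("."))
```

Section names are written with the same capitalization as in the example, since the [Debug] section name is described as case sensitive.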

The tiles to profile can be selected in several ways, either by tile location or by graph.
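
To make the tile-based selection strings easier to read, the sketch below splits a setting such as "{4,1}:{6,2}:conflicts; {8,3}:dma_locks" into its parts. It is an illustration only, not XRT's own parser: the names TileMetric and parse_tile_metrics are hypothetical, and the interpretation is an assumption inferred from the examples in this section (each brace pair is read as {column,row}, two brace pairs as the corners of a tile range, and "all" as every tile). It covers only the AI Engine and memory module forms, not the interface tile form with a trailing channel number. See UG1076 for the authoritative syntax.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

Tile = Tuple[int, int]  # assumed to be (column, row)


class TileMetric(NamedTuple):
    first_tile: Optional[Tile]  # None means "all"
    last_tile: Optional[Tile]   # equals first_tile for a single tile
    metric: str


_TILE = re.compile(r"\{\s*(\d+)\s*,\s*(\d+)\s*\}")


def _parse_tile(text: str) -> Tile:
    match = _TILE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a tile selector: {text!r}")
    return int(match.group(1)), int(match.group(2))


def parse_tile_metrics(setting: str) -> List[TileMetric]:
    """Split a tile-based metric setting into (first tile, last tile, metric)."""
    entries = []
    # Entries are separated by ';', fields inside an entry by ':'.
    for part in setting.split(";"):
        part = part.strip()
        if not part:
            continue
        *selectors, metric = [field.strip() for field in part.split(":")]
        if selectors == ["all"]:
            entries.append(TileMetric(None, None, metric))
        elif len(selectors) == 1:
            tile = _parse_tile(selectors[0])
            entries.append(TileMetric(tile, tile, metric))
        elif len(selectors) == 2:
            first, last = (_parse_tile(s) for s in selectors)
            entries.append(TileMetric(first, last, metric))
        else:
            raise ValueError(f"unsupported entry: {part!r}")
    return entries


if __name__ == "__main__":
    for entry in parse_tile_metrics("{4,1}:{6,2}:conflicts; {8,3}:dma_locks"):
        print(entry)
```

A helper like this can be used to check a setting for malformed entries before it is written to xrt.ini or passed to the profiling command.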

For more details, see the chapters on Profiling the AI Engine in Hardware, Profiling Flow and XRT Flow in the AI Engine Tools and Flows User Guide (UG1076).

XSDB Flow

When running the application, the profile data is captured in counters that can be retrieved by the debugging and profiling IP. To capture and evaluate this data, you must connect to the hardware device using xsdb. This command is typically used to program the device and debug applications. Connect your system to the hardware platform or device over JTAG, launch the xsdb command in a command shell, and run the following sequence of commands:

xsdb% connect
xsdb% ta 1
xsdb% source $::env(XILINX_VITIS)/scripts/vitis/util/aie_profile.tcl
xsdb% aieprofile start -graphs myGraph -work-dir ./Work \
      -graph-based-aie-metrics "dut:kernel1:heat_map" \
      -tile-based-aie-metrics "all:stalls" \
      -graph-based-aie-memory-metrics "dut:all:write_throughputs" \
      -tile-based-aie-memory-metrics "{4,1}:{6,2}:conflicts; {8,3}:dma_locks" \
      -tile-based-interface-tile-metrics "2:10:input_throughputs:3" \
      -interval 20  -samples 100

where:

connect
Launches the hw_server and connects xsdb to the device.
source $::env(XILINX_VITIS)/scripts/vitis/util/aie_profile.tcl
Sources the Tcl profile commands to set up the xsdb environment.

For more details, see the chapters on Profiling the AI Engine in Hardware, Profiling Flow and XRT Flow in the AI Engine Tools and Flows User Guide (UG1076).