Measuring Performance with AI Engine Run Time Event APIs - 2024.1 English

Versal Adaptive SoC System Integration and Validation Methodology Guide (UG1388)


After the graph is compiled using the Vitis tools or aiecompiler, each AI Engine array interface (or shim tile) can be monitored to count specific events. A few profiling events count valid AXI4-Stream data transactions within the AI Engine array interface. When the APIs are called, the PS issues a sequence of AXI4 memory-mapped (AXI4-MM) commands that configure the AI Engine array interface to count the selected events. The event counters in the AI Engine array interface provide a way to measure system performance without adding hardware to the design.

Note: Each AI Engine array interface has only two performance counters, but there are fourteen 64-bit streams in each AI Engine array interface. Therefore, only two AI Engine-PL interfaces can be monitored at a time using these profiling APIs.

The following example uses the io_stream_start_to_bytes_transferred_cycles event API to measure the throughput of the graph. This API uses two performance counters to track both the bytes transferred and the cycles taken. It reports the total number of cycles (active, stall, and idle) taken to transfer the specified amount of data through the graph. This API can be used on both input and output streams.

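event::handle handle = event::start_profiling(plio_out,
    event::io_stream_start_to_bytes_transferred_cycles, 256*sizeof(int32));
// Run the graph here so that data flows through plio_out before the counter is read.
long long cycle_count = event::read_profiling(handle);
event::stop_profiling(handle); // release the performance counters
// Bytes per second; the 1e-9 factor assumes a 1 GHz AI Engine clock (1 ns per cycle).
double throughput = (double)256 * sizeof(int32) / (cycle_count * 1e-9);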
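The conversion from cycle count to throughput can be factored out so that the clock frequency is explicit instead of fixed at 1 GHz. The following is a minimal sketch, not part of the ADF API: the helper name throughput_bytes_per_sec and its aie_clock_hz parameter are hypothetical, and the caller is assumed to know the AI Engine clock frequency of the target device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the ADF API): converts a cycle count,
// such as the value returned by event::read_profiling, into bytes per second.
// aie_clock_hz is an assumption supplied by the caller; it must match the
// AI Engine clock frequency of the device being measured.
double throughput_bytes_per_sec(std::size_t bytes,
                                long long cycle_count,
                                double aie_clock_hz) {
    // A zero or negative cycle count means no measurement was taken.
    assert(cycle_count > 0 && aie_clock_hz > 0.0);
    // Elapsed time in seconds for the measured transfer.
    double seconds = static_cast<double>(cycle_count) / aie_clock_hz;
    return static_cast<double>(bytes) / seconds;
}
```

For example, 1024 bytes (256 int32 values) transferred in 512 cycles at 1 GHz corresponds to roughly 2e9 bytes per second.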

You can use an alternative event API when the number of bytes being transferred is unknown. The following example uses the io_stream_running_event_count event API to measure the throughput of the graph. The streams run for a specific interval of time, and the number of stream active events is captured.

using namespace adf;
event::handle handle_0;
PLIO duc_plio[2] = {*duc_in0, *duc_out0};
int d = 0; // index of the PLIO being profiled
while(d < NUM_DUC_SLAVES) {
    long long throughput_out_min = 990000000; // initial value to some high number
    long long throughput_out_max = 0;
    int iter=0;
    while(iter < 5) {
        long long count_start, count_end;
        long long throughput;
        handle_0 = event::start_profiling(duc_plio[d], event::io_stream_running_event_count);
        count_start = event::read_profiling(handle_0);
        //precision of usleep is dependent on linux system call
        usleep(1000000); //1s
        count_end = event::read_profiling(handle_0);
        event::stop_profiling(handle_0); // release the performance counter before the next iteration
        if (count_end >= count_start) throughput = (count_end-count_start);
        else throughput = (count_end-count_start+0x100000000); //roll over correction for 32b performance counter
        if (throughput<throughput_out_min) throughput_out_min = throughput;
        if (throughput>throughput_out_max) throughput_out_max = throughput;
        iter++;
    }
    printf("[throughput] %d\tMin:%lld\tMax:%lld\tRange:%lld\n", d, throughput_out_min, throughput_out_max, throughput_out_max-throughput_out_min );
    d++;
}
printf("[main] Performance measurements Done ... \n");
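The rollover correction in the example above can also be expressed with unsigned 32-bit arithmetic, which wraps modulo 2^32 by definition. The following is a minimal sketch under two assumptions: the counter is 32 bits wide, as the comment in the example states, and it rolls over at most once between the two reads. The helper name counter_delta32 is hypothetical and not part of the ADF API.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the ADF API): returns the number of
// events counted between two samples of a free-running 32-bit counter.
// Assumes the counter rolled over at most once between the samples.
std::uint32_t counter_delta32(std::uint32_t count_start,
                              std::uint32_t count_end) {
    // Unsigned subtraction wraps modulo 2^32, so a rollover between the
    // two samples is corrected without an explicit branch.
    return count_end - count_start;
}
```

With this helper, the two values returned by event::read_profiling would be narrowed to 32 bits before the subtraction, for example with static_cast<std::uint32_t>. This replaces the explicit comparison and the addition of 0x100000000.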

For more information, see the AI Engine Tools and Flows User Guide (UG1076).