Profiling Graph Bandwidth - 2024.1 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID: UG1076
Release Date: 2024-06-27
Version: 2024.1 English

The event::io_total_stream_running_to_idle_cycles profiling event tracks the running and stall events that occur on the profiled AI Engine - PL interface. It counts the cycles in which the interface is active (data is flowing through) and the cycles in which the interface is stalled. It does not count the cycles in which the interface is idle.

After event::start_profiling() is called, the performance counter waits for data to start flowing, and it pauses whenever the stream is idle. A paused counter resumes when data flow resumes. After event::stop_profiling() is called, the performance counter is cleared and released. This API reports how well the AI Engine and PL kernels utilize the available bandwidth. It is not a measure of best-case utilization of the port bandwidth.
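The counting behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware implementation: the StreamState enum, the per-cycle trace, and the count_running_to_idle_cycles function are hypothetical names introduced for illustration, and the model assumes that counting starts on a running cycle and pauses on an idle cycle.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-cycle stream states, used only for this illustration.
enum class StreamState { Idle, Running, Stalled };

// Toy model of a running-to-idle cycle counter (an assumption, not the
// actual hardware logic): counting starts on a Running cycle, continues
// through Stalled cycles, and pauses on an Idle cycle.
std::uint64_t count_running_to_idle_cycles(const std::vector<StreamState>& trace) {
    std::uint64_t count = 0;
    bool counting = false;
    for (StreamState s : trace) {
        if (s == StreamState::Running) {
            counting = true;   // data is flowing: start or resume counting
        } else if (s == StreamState::Idle) {
            counting = false;  // stream is idle: pause the counter
        }
        if (counting) {
            ++count;           // Running and Stalled cycles are counted
        }
    }
    assert(count <= trace.size());  // the counter never exceeds the trace length
    return count;
}
```

In this model, a trace of idle, running, running, stalled, running, idle, idle, running contributes five counted cycles: the three idle cycles are skipped, and the stalled cycle between running cycles is included.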

Profile Graph Bandwidth Using the Input Port

The bandwidth of the graph can be defined as the fraction of the time (a value from 0 to 1) that the graph can accept data.

Example code to measure the graph bandwidth through the graph input port is as follows:
const int WINDOW_SIZE_in_bytes=8192; // bytes transferred per graph iteration
int iterations=999;
// Start counting running and stalled cycles on the graph input port
event::handle handle = event::start_profiling(gr_pl.in, event::io_total_stream_running_to_idle_cycles);
if(handle==event::invalid_handle){
    printf("ERROR:Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
auto mm2s_run = mm2s(nullptr, OUTPUT_SIZE_MM2S); // start the PL kernel mm2s
gr_pl.run(iterations);
gr_pl.wait(); // wait for all iterations to complete
long long cycle_count = event::read_profiling(handle); // running + stalled cycles
// Ideal running cycles (4 bytes per cycle) divided by the measured cycles
double bandwidth = (double) (WINDOW_SIZE_in_bytes*iterations/4) / cycle_count;
event::stop_profiling(handle);//Performance counter is released and cleared

Here, the total running cycles are calculated from the number of bytes transferred. At four bytes per cycle, the total running cycles are WINDOW_SIZE_in_bytes*iterations/4. The total running and stalled cycles are read from the performance counter by event::read_profiling(). The bandwidth is the first value divided by the second.

If the profiled bandwidth is 1, the graph is running faster than the PL kernel mm2s, and the input port has not been stalled.

If the profiled bandwidth is less than 1, the PL kernel mm2s can send data faster than the graph or the PL kernel s2mm can receive it. You might need to evaluate whether the bandwidth drop is caused by the graph or by the PL kernel s2mm.
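The arithmetic behind the profiled bandwidth can be separated from the profiling calls and checked on its own. The following is a minimal sketch: BandwidthReport and make_bandwidth_report are hypothetical names introduced here, not part of the event API, and the default of four bytes per cycle is taken from the example above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one profiling measurement (illustration only).
struct BandwidthReport {
    std::int64_t running_cycles;  // cycles needed if data moved every cycle
    std::int64_t stalled_cycles;  // measured cycles beyond the running cycles
    double bandwidth;             // running_cycles / cycle_count
};

// cycle_count is the value returned by event::read_profiling(), that is,
// the total running and stalled cycles. bytes_per_cycle defaults to 4,
// as in the example above.
BandwidthReport make_bandwidth_report(std::int64_t window_size_in_bytes,
                                      std::int64_t iterations,
                                      std::int64_t cycle_count,
                                      std::int64_t bytes_per_cycle = 4) {
    assert(bytes_per_cycle > 0);
    const std::int64_t running =
        window_size_in_bytes * iterations / bytes_per_cycle;
    if (cycle_count <= 0) {
        // No cycles were counted, so no ratio can be formed.
        return {running, 0, 0.0};
    }
    return {running, cycle_count - running,
            static_cast<double>(running) / static_cast<double>(cycle_count)};
}
```

With WINDOW_SIZE_in_bytes=8192 and iterations=999, the running cycles are 8192*999/4 = 2045952. A counter reading of 2045952 therefore gives a bandwidth of 1, and a reading of 4091904 gives a bandwidth of 0.5 with 2045952 stalled cycles.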

Profile Graph Bandwidth Using the Output Port

The bandwidth of the graph can be defined as the fraction of the time (a value from 0 to 1) that the graph can send data.

If the profiled bandwidth is 1, the graph is not blocked by the PL kernel s2mm.

If the profiled bandwidth is less than 1, the graph is blocked for some fraction of the time by s2mm due to back pressure.

Example code to profile the graph bandwidth through the graph output port is as follows:
const int WINDOW_SIZE_in_bytes=8192; // bytes transferred per graph iteration
int iterations=999;
// Start counting running and stalled cycles on the graph output port
event::handle handle = event::start_profiling(gr_pl.dataout, event::io_total_stream_running_to_idle_cycles);
if(handle==event::invalid_handle){
    printf("ERROR:Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
gr_pl.run(iterations);
gr_pl.wait(); // wait for all iterations to complete
long long cycle_count = event::read_profiling(handle); // running + stalled cycles
// Ideal running cycles (4 bytes per cycle) divided by the measured cycles
double bandwidth = (double) (WINDOW_SIZE_in_bytes*iterations/4) / cycle_count;
event::stop_profiling(handle);//Performance counter is released and cleared