Profiling Graph Latency - 2023.2 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID: UG1076
Release Date: 2023-12-04
Version: 2023.2 English

The event::io_stream_start_difference_cycles enumeration can be used to measure the latency between two PLIO or GMIO ports. After event::start_profiling() is called, two performance counters start incrementing each cycle, each waiting for one of two independent nets to receive its first data. When the first data passes through a net, the corresponding performance counter stops. The value read back by event::read_profiling() is the difference between the two performance counter values.

After event::stop_profiling(), the performance counter is cleared and released.
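
The counter behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: it does not use the ADF event API, the names StartDifferenceModel and simulate_start_difference are hypothetical, and the sign convention (second net minus first net) is taken from the interpretation given at the end of this section.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the two-counter scheme. NOT the ADF event API; every name
// here is hypothetical and exists only for illustration.
struct StartDifferenceModel {
    std::int64_t counter_a = 0;  // cycles counted until first data on net A
    std::int64_t counter_b = 0;  // cycles counted until first data on net B
    bool a_stopped = false;
    bool b_stopped = false;

    // Advance one clock cycle. first_data_a / first_data_b are true in the
    // cycle in which the corresponding net sees its first data.
    void tick(bool first_data_a, bool first_data_b) {
        if (first_data_a) a_stopped = true;
        if (first_data_b) b_stopped = true;
        if (!a_stopped) ++counter_a;  // a counter runs until its net has data
        if (!b_stopped) ++counter_b;
    }

    // Difference between the two counters (second net minus first net).
    std::int64_t read() const { return counter_b - counter_a; }
};

// Run the model, given the cycle in which each net receives its first data.
std::int64_t simulate_start_difference(std::int64_t first_cycle_a,
                                       std::int64_t first_cycle_b) {
    assert(first_cycle_a >= 0 && first_cycle_b >= 0);
    StartDifferenceModel model;
    const std::int64_t last =
        first_cycle_a > first_cycle_b ? first_cycle_a : first_cycle_b;
    for (std::int64_t cycle = 0; cycle <= last; ++cycle) {
        model.tick(cycle == first_cycle_a, cycle == first_cycle_b);
    }
    return model.read();
}
```

In this model, simulate_start_difference(10, 25) yields 15, and swapping the arguments yields -15. Because both counters start together, the time at which profiling started cancels out of the difference.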

Profile Graph Latency

Graph latency can be defined as the time from receiving the first input data to producing the first output data. It does not depend on the number of iterations the graph runs. The following examples show how to profile graph latency using the event API in the AI Engine simulation flow and in the hardware and hardware emulation flows.

Note: For this measurement, event::start_profiling() takes two different PLIO ports as parameters.
In AI Engine simulation:
event::handle handle = event::start_profiling(gr_pl.in, gr_pl.dataout, event::io_stream_start_difference_cycles);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
} 
gr_pl.run(iterations); //Data transfer starts after graph.run()
gr_pl.wait();
long long cycle_count = event::read_profiling(handle);
printf("Latency cycles = %lld\n", cycle_count);
event::stop_profiling(handle);//Performance counter is released and cleared
In hardware and hardware emulation flows:
auto s2mm_run = s2mm(out_bo, nullptr, OUTPUT_SIZE);
event::handle handle = event::start_profiling(gr_pl.in, gr_pl.dataout, event::io_stream_start_difference_cycles);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
} 
gr_pl.run(iterations);
auto mm2s_run = mm2s(nullptr, OUTPUT_SIZE_MM2S); //input data transfer starts
s2mm_run.wait();//make sure both ports have data transferred
long long cycle_count = event::read_profiling(handle);
printf("Latency cycles = %lld\n", cycle_count);
event::stop_profiling(handle);//Performance counter is released and cleared
Note: Input data transfer starts immediately after the PL kernel mm2s starts. To keep any overhead introduced by graph.run() out of the measured graph latency, start the PL kernel mm2s after both event::start_profiling() and graph.run() in the profiling code.
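
The examples above report latency as a cycle count, while graph latency is defined as a time. The following is a minimal sketch of that conversion. The function names and the clock_mhz parameter are assumptions for illustration: the event API does not report a clock frequency, so you must supply the frequency of the clock that drives the performance counters in your design.

```cpp
#include <cassert>
#include <cstdio>

// Convert a cycle count into nanoseconds.
// clock_mhz is an ASSUMED input: the frequency (in MHz) of the clock that
// drives the performance counters. It is not provided by the event API.
double cycles_to_ns(long long cycles, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return static_cast<double>(cycles) * 1000.0 / clock_mhz;
}

// Print a cycle count with its time equivalent.
// A long long argument requires the %lld conversion specifier.
void print_latency(long long cycles, double clock_mhz) {
    std::printf("Latency cycles = %lld (%.1f ns at %.1f MHz)\n",
                cycles, cycles_to_ns(cycles, clock_mhz), clock_mhz);
}
```

For example, with a hypothetical 1250 MHz clock, cycles_to_ns(1250, 1250.0) returns 1000.0. The value passed as cycles would be the result of event::read_profiling(handle).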

Profile Latency Difference Between Two Ports

This method is not limited to profiling latency between the input port and the output port of the same graph. It can profile latency between any two ports. For example, it can profile the latency between two output ports that share a common input port.

AI Engine Simulation

Example code in this simulation flow is as follows:
event::handle handle = event::start_profiling(gr_pl.dataout, gr_pl.dataout2, event::io_stream_start_difference_cycles);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
} 
gr_pl.run(iterations);
gr_pl.wait();
long long cycle_count = event::read_profiling(handle);
printf("Latency cycles = %lld\n", cycle_count);
event::stop_profiling(handle);//Performance counter is released and cleared

Hardware Emulation and Hardware

Example code in these flows is as follows:

auto s2mm_run = s2mm(out_bo, nullptr, OUTPUT_SIZE);
event::handle handle = event::start_profiling(gr_pl.dataout, gr_pl.dataout2, event::io_stream_start_difference_cycles);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
} 
gr_pl.run(iterations);
auto mm2s_run = mm2s(nullptr, OUTPUT_SIZE_MM2S);
s2mm_run.wait();//make sure both ports have data transferred
long long cycle_count = event::read_profiling(handle);
printf("Latency cycles = %lld\n", cycle_count);
event::stop_profiling(handle);//Performance counter is released and cleared

In the value returned by event::read_profiling(), a positive number indicates that data arrives at gr_pl.dataout2 later than at gr_pl.dataout, while a negative number indicates that data arrives at gr_pl.dataout2 earlier than at gr_pl.dataout.
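
The sign convention can be captured in a small reporting helper. This is a sketch only: describe_arrival_order is a hypothetical function that takes the signed cycle difference and the two port names as plain strings, and it does not call the ADF API.

```cpp
#include <string>

// Describe which port saw its first data earlier, given the signed cycle
// difference read back from profiling. Hypothetical helper for illustration;
// first_port and second_port follow the argument order of start_profiling.
std::string describe_arrival_order(long long diff,
                                   const std::string& first_port,
                                   const std::string& second_port) {
    if (diff > 0) {
        // Positive: the second port received its first data later.
        return second_port + " received data " + std::to_string(diff) +
               " cycles later than " + first_port;
    }
    if (diff < 0) {
        // Negative: the second port received its first data earlier.
        return second_port + " received data " + std::to_string(-diff) +
               " cycles earlier than " + first_port;
    }
    return first_port + " and " + second_port +
           " received data in the same cycle";
}
```

For example, calling describe_arrival_order(cycle_count, "gr_pl.dataout", "gr_pl.dataout2") with a cycle_count of -4 reports that gr_pl.dataout2 received data 4 cycles earlier than gr_pl.dataout.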