Profiling Graph Throughput - 2023.2 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID: UG1076
Release Date: 2023-12-04
Version: 2023.2 English

Graph throughput can be defined as the average number of bytes produced (or consumed) per second. The event::io_stream_start_to_bytes_transferred_cycles enumeration can be used to record the number of cycles taken to transfer a certain amount of data.
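
As a minimal sketch of this definition, throughput follows from the byte count, the recorded cycle count, and the frequency of the clock that drives the counter. The helper name throughput_bytes_per_second and its clock_hz parameter are illustrative assumptions and are not part of the ADF API; substitute the clock frequency of your own design.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an ADF API call): converts a recorded cycle count
// into throughput in bytes per second.
//   bytes    - amount of data transferred
//   cycles   - cycle count, for example the value from event::read_profiling()
//   clock_hz - assumed frequency of the clock driving the counter
double throughput_bytes_per_second(std::uint64_t bytes,
                                   std::uint64_t cycles,
                                   double clock_hz) {
    assert(cycles > 0 && clock_hz > 0.0);
    // elapsed seconds = cycles / clock_hz, so bytes / seconds = bytes * clock_hz / cycles
    return static_cast<double>(bytes) * clock_hz / static_cast<double>(cycles);
}
```

For example, 8192 bytes transferred in 1024 cycles at an assumed 1 GHz clock corresponds to 8e9 bytes per second.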

After event::start_profiling() is called, two performance counters, performance counter 0 and performance counter 1, work together. Performance counter 0 starts counting cycles when it receives the first data. Performance counter 1 increments as data is received. When performance counter 1 reaches the amount of data specified in event::start_profiling(), it generates an event that tells performance counter 0 to stop. The value returned by event::read_profiling() is the value of performance counter 0. Once performance counter 0 has stopped, that value represents the number of cycles taken to transfer the data.
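
The interaction between the two counters can be illustrated with a small software model. This is only a sketch: the real counters are hardware resources in the AI Engine - PL interface tile, and the type StreamProfilerModel, its members, and the choice to count the cycle in which the first data arrives are assumptions made for illustration. The model also ignores any data that arrives after the target has been reached.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative software model of the two cooperating counters.
// Not hardware-accurate and not part of the ADF API.
struct StreamProfilerModel {
    std::uint64_t target_bytes;        // amount of data passed to start_profiling
    std::uint64_t counter0_cycles = 0; // models performance counter 0 (cycles)
    std::uint64_t counter1_bytes = 0;  // models performance counter 1 (data)
    bool started = false;              // set when the first data is seen
    bool stopped = false;              // set when the target amount is reached

    explicit StreamProfilerModel(std::uint64_t target) : target_bytes(target) {
        assert(target > 0);
    }

    // Advance the model by one clock cycle in which bytes_this_cycle bytes
    // cross the interface (0 means the stream is idle or stalled).
    void tick(std::uint64_t bytes_this_cycle) {
        if (stopped) return;                       // counter 0 already stopped
        if (!started && bytes_this_cycle > 0) started = true;
        if (!started) return;                      // nothing counted before first data
        ++counter0_cycles;                         // counter 0 counts every cycle, stalls included
        counter1_bytes += bytes_this_cycle;        // counter 1 counts data
        if (counter1_bytes >= target_bytes) stopped = true; // counter 1 stops counter 0
    }
};
```

In this model, idle cycles before the first data are not counted, but stall cycles between the first and last data are, which is why the final cycle count reflects the achieved rather than the peak transfer rate.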

Performance counter 0 does not stop until the amount of data specified in event::start_profiling() has been transferred. This technique is useful when you want to profile the time taken to transfer a known amount of data. However, if additional data is transferred after that point, performance counter 0 continues counting.
Warning: For any method of profiling graph throughput, the overhead of graph latency and of graph and kernel API calls may not be negligible if the number of iterations is too small. Run a large number of iterations to minimize the impact of this overhead.
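
To see why the iteration count matters, consider a simple model in which the measured cycle count is a fixed overhead plus a steady-state cost per iteration. The function relative_throughput_error and all numbers used with it are hypothetical; real overheads depend on the design and must be measured.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: measured cycles = overhead + iterations * cycles_per_iteration.
// Returns the fraction by which the measured throughput falls short of the
// steady-state throughput (0.0 means no error, 0.5 means 50% too low).
double relative_throughput_error(std::uint64_t iterations,
                                 std::uint64_t cycles_per_iteration,
                                 std::uint64_t overhead_cycles) {
    assert(iterations > 0 && cycles_per_iteration > 0);
    double ideal = static_cast<double>(iterations) *
                   static_cast<double>(cycles_per_iteration);
    double measured = ideal + static_cast<double>(overhead_cycles);
    // Throughput is inversely proportional to cycles for a fixed byte count,
    // so measured / steady-state throughput equals ideal / measured cycles.
    return 1.0 - ideal / measured;
}
```

With an assumed 2048 cycles per iteration and 2048 cycles of overhead, this model gives a 50% shortfall for a single iteration and about 0.1% for 999 iterations.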

Profile Graph Throughput Using the Graph Output

The following example shows how to profile graph throughput using graph output:
auto s2mm_run = s2mm(out_bo, nullptr, OUTPUT_SIZE);
const int WINDOW_SIZE_in_bytes=8192;
int iterations=999;
//Third parameter is the amount of data to be transferred (in bytes).
event::handle handle = event::start_profiling(gr_pl.dataout, event::io_stream_start_to_bytes_transferred_cycles, WINDOW_SIZE_in_bytes*iterations);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters are available in an AIE-PL interface tile\n");
    return 1;
} 
gr_pl.run(iterations);
s2mm_run.wait();//performance counter 0 stops, assuming s2mm is able to receive all data
long long cycle_count = event::read_profiling(handle);
//The 1e-9 factor assumes a 1 ns cycle (1 GHz clock); scale it to the actual clock period.
double throughput = (double)WINDOW_SIZE_in_bytes*iterations / (cycle_count * 1e-9); //bytes per second
event::stop_profiling(handle);//Performance counter is released and cleared

Note that in the code above, the host waits for s2mm to complete, which ensures that all data has been transferred through the PLIO.

When using the API in the AI Engine simulation flow, you can use graph.wait() instead. Note that after graph.wait() returns, additional cycles are still required to transfer data from the window buffer to the PLIO. One solution is to use a large enough number of iterations that this overhead becomes negligible. Another is to call graph.wait(<NUM_CYCLES>) with a cycle count long enough to ensure that all data has been transferred through the PLIO.

Profile Graph Throughput Using the Graph Input

The stream from the PLIO to the kernel and the DMA for the input buffers are ready to receive data right after configuration. Once the PL kernel mm2s is started, the input net can begin receiving data even before graph::run. One way to profile a PLIO input is therefore to start the PL kernel only after event::start_profiling(). The following example shows how to profile graph throughput using graph input:
const int WINDOW_SIZE_in_bytes=8192;
int iterations=999;
//Third parameter is the amount of data to be transferred (in bytes).
event::handle handle = event::start_profiling(gr_pl.in, event::io_stream_start_to_bytes_transferred_cycles, WINDOW_SIZE_in_bytes*iterations);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters are available in an AIE-PL interface tile\n");
    return 1;
} 
gr_pl.run(iterations);
auto mm2s_run = mm2s(nullptr, OUTPUT_SIZE_MM2S);//After start profiling, send data from mm2s
gr_pl.wait();//performance counter 0 stops, assuming s2mm is able to receive all data
long long cycle_count = event::read_profiling(handle);
//The 1e-9 factor assumes a 1 ns cycle (1 GHz clock); scale it to the actual clock period.
double throughput = (double)WINDOW_SIZE_in_bytes*iterations / (cycle_count * 1e-9); //bytes per second
event::stop_profiling(handle);//Performance counter is released and cleared

If the amount of data transferred is not known, it is still possible to estimate graph throughput with this method. For example, if the PL kernels are free-running and the AI Engine - PL interface column used by the graph output has run out of performance counters, you can still profile the graph throughput through the graph input:
const int WINDOW_SIZE_in_bytes=8192;
int iterations=999;
//Third parameter is the amount of data to be transferred (in bytes).
event::handle handle = event::start_profiling(gr_pl.in, event::io_stream_start_to_bytes_transferred_cycles, WINDOW_SIZE_in_bytes*iterations);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters are available in an AIE-PL interface tile\n");
    return 1;
} 
gr_pl.run(iterations);
gr_pl.wait();//performance counter 0 does not stop
//Read the performance counter value immediately.
//This assumes the overhead is negligible when the number of iterations is large enough.
long long cycle_count = event::read_profiling(handle);
//The 1e-9 factor assumes a 1 ns cycle (1 GHz clock); scale it to the actual clock period.
double throughput = (double)WINDOW_SIZE_in_bytes*iterations / (cycle_count * 1e-9); //bytes per second
event::stop_profiling(handle);//Performance counter is released and cleared