Profiling Kernel Code - 2025.2 English - UG1079

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2025-11-26
Version: 2025.2 English
Every AI Engine has a 64-bit cycle counter. The AI Engine API class aie::tile provides the cycles() method to read the counter value. For example:
aie::tile tile=aie::tile::current(); //get the tile of the kernel
unsigned long long time=tile.cycles(); //read the cycle counter of the tile
The counter runs continuously, and there is no limit on how many times you can read it. The counter value read by the kernel can be written to memory or streamed out for further analysis. For example, to profile the latency of the loop below, read the counter immediately before the code being profiled and again immediately after it has run:
aie::tile tile=aie::tile::current();
unsigned long long time1=tile.cycles(); //first time

for(...){...}

unsigned long long time2=tile.cycles(); //second time
long long time=time2-time1; //elapsed cycles between the two reads
writeincr(out,time);

The kernel writes the difference between the second and first counter values to the output, so the host application can read the latency of the loop, in cycles, directly from the output data.
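
The value written out is a cycle count, not a time. The following is a minimal host-side sketch for converting it to nanoseconds. The helper name cycles_to_ns and its clock_hz parameter are hypothetical and are not part of the AI Engine API; the AI Engine clock frequency depends on the device and design, so the caller must supply it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the AI Engine API).
// Converts a cycle count to nanoseconds for a caller-supplied clock
// frequency in Hz. The frequency is an assumption of the caller; it is
// device- and design-specific.
double cycles_to_ns(std::uint64_t cycles, double clock_hz) {
    assert(clock_hz > 0.0);  // a zero or negative frequency is meaningless
    return static_cast<double>(cycles) * 1e9 / clock_hz;
}
```

For example, with an illustrative (made-up) frequency of 1.25 GHz, cycles_to_ns(1250, 1.25e9) corresponds to 1000 ns.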

You can also compare the counter values read back from different kernel executions or loop iterations to calculate latency. For example, the following code measures the latency of certain operations on an asynchronous buffer:

aie::tile tile=aie::tile::current();
for(...){//outer loop
  unsigned long long time=tile.cycles(); //read counter value
  writeincr(out,time);
  win_in.acquire();
  for(...){...} //inner loop
  win_in.release();
}

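Each outer-loop iteration writes one counter value. The combined latency of acquiring and releasing the asynchronous buffer, plus the inner loop execution time, can then be calculated by subtracting the counter value written in one iteration from the value written in the next iteration.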
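
The subtraction of consecutive counter values can be done on the host after the values are read back. The following is a minimal sketch, assuming the values have already been collected into a std::vector in the order they were written. The function name iteration_latencies is hypothetical and is not part of the AI Engine API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side helper (not part of the AI Engine API).
// Given counter values captured at the start of successive outer-loop
// iterations, returns the cycle count between each consecutive pair.
std::vector<std::uint64_t> iteration_latencies(
        const std::vector<std::uint64_t>& stamps) {
    std::vector<std::uint64_t> deltas;
    if (stamps.size() < 2) {
        return deltas;  // at least two values are needed for one latency
    }
    deltas.reserve(stamps.size() - 1);
    for (std::size_t i = 1; i < stamps.size(); ++i) {
        // Unsigned subtraction is modular, so the delta stays correct
        // even if the 64-bit counter wraps once between two reads.
        deltas.push_back(stamps[i] - stamps[i - 1]);
    }
    return deltas;
}
```

N counter values yield N-1 latencies. With the kernel code above, the last iteration is not measured unless one more counter value is written after the outer loop ends.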

You can also write the counter value into data memory. The value can be read back by printf in simulation, or by host code in hardware. If no other code uses the written value, write it through a volatile-qualified pointer so that compiler optimizations do not eliminate the store. For example:
static unsigned long long cycle_num[2];
aie::tile tile=aie::tile::current();
volatile unsigned long long *p_cycle=cycle_num;
*p_cycle=tile.cycles();//cycle_num[0]

for(...){...}

*(p_cycle+1)=tile.cycles();//cycle_num[1]
printf("cycles=%llu\n",cycle_num[1]-cycle_num[0]); //%llu matches unsigned long long