Every AI Engine has a 64-bit cycle counter. The AI Engine API class aie::tile provides the method cycles() to read this counter value. See the following example:
aie::tile tile=aie::tile::current(); //get the tile of the kernel
unsigned long long time=tile.cycles(); //cycle counter of the tile
The counter runs continuously, and there is no limit on how many times it can be read. The counter value read by the kernel can be written to memory or streamed out for further analysis. For example, to profile the latency of the loop below, the counter is read immediately before the code being profiled and again after that code has run:
aie::tile tile=aie::tile::current();
unsigned long long time1=tile.cycles(); //first time
for(...){...}
unsigned long long time2=tile.cycles(); //second time
long long time=time2-time1;
writeincr(out,time);
The loop latency, computed in the kernel as the second counter value minus the first, is written to the output, where the host application can examine it.
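On the host side, a cycle count is often easier to interpret once it is converted to time. The following is a minimal sketch in plain C++, not part of the AI Engine API: the helper names elapsed_cycles and cycles_to_ns are hypothetical, and the clock frequency is an assumption that must be supplied for your own platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side helper: elapsed cycles between two counter readings.
// Unsigned 64-bit subtraction is performed modulo 2^64, so the result is
// still correct if the counter wrapped once between the two readings.
std::uint64_t elapsed_cycles(std::uint64_t start, std::uint64_t end) {
    return end - start;
}

// Hypothetical host-side helper: convert a cycle count to nanoseconds.
// clock_hz is the AI Engine clock frequency of the target platform
// (an assumption provided by the caller, not read from the device).
double cycles_to_ns(std::uint64_t cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return static_cast<double>(cycles) * 1e9 / clock_hz;
}
```

For instance, with a hypothetical 1 GHz clock, cycles_to_ns(1000, 1.0e9) returns 1000.0.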
You can also compare counter values read back from different kernel executions or loop iterations to calculate latency. For example, the following code measures the latency of certain operations on an asynchronous buffer:
aie::tile tile=aie::tile::current();
for(...){ //outer loop
  unsigned long long time=tile.cycles(); //read counter value
  writeincr(out,time);
  win_in.acquire();
  for(...){...} //inner loop
  win_in.release();
}
The combined latency of acquiring and releasing the asynchronous buffer, plus the inner loop execution time, can then be calculated as the difference between the counter values written in two consecutive outer loop iterations.
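Once the streamed counter values have been collected on the host, the per-iteration latency is the difference between neighboring values. The following is a minimal post-processing sketch in plain C++; the function name iteration_latencies is hypothetical, and it assumes the values are already available in a std::vector in the order the kernel wrote them.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side helper: given the counter values streamed out by the
// kernel (one per outer loop iteration, in order), return the cycle count
// between each pair of consecutive readings.
std::vector<std::uint64_t> iteration_latencies(
        const std::vector<std::uint64_t>& stamps) {
    std::vector<std::uint64_t> deltas;
    if (stamps.size() < 2) {
        return deltas;  // at least two readings are needed for one latency
    }
    deltas.reserve(stamps.size() - 1);
    for (std::size_t i = 1; i < stamps.size(); ++i) {
        // Unsigned subtraction is modulo 2^64, so a single wrap is tolerated.
        deltas.push_back(stamps[i] - stamps[i - 1]);
    }
    return deltas;
}
```

N readings yield N-1 latencies. Each latency also includes the cost of the counter read and the stream write at the top of the outer loop.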
You can also write the counter value into data memory. The value can be read back by printf in simulation, or by host code in hardware. If no other code uses the written value, apply the volatile qualifier to ensure that the counter value is actually stored. This qualifier prevents compiler optimizations from eliminating the store. For example:
static unsigned long long cycle_num[2];
aie::tile tile=aie::tile::current();
volatile unsigned long long *p_cycle=cycle_num;
*p_cycle=tile.cycles();//cycle_num[0]
for(...){...}
*(p_cycle+1)=tile.cycles();//cycle_num[1]
printf("cycles=%llu\n",cycle_num[1]-cycle_num[0]);