Every AI Engine has a 64-bit cycle counter. The AI Engine API class aie::tile provides the method cycles() to read the counter value. For example:
aie::tile tile=aie::tile::current(); //get the tile the kernel is running on
unsigned long long time=tile.cycles(); //read the tile's cycle counter
The counter runs continuously, and there is no limit on how many times it can be read. The value read by the kernel can be written to memory or streamed out for further analysis. For example, to profile the latency of the code below, the counter is read immediately before the code being profiled and again after it has run:
aie::tile tile=aie::tile::current();
unsigned long long time1=tile.cycles(); //first time
for(...){...}
unsigned long long time2=tile.cycles(); //second time
long long time=time2-time1;
writeincr(out,time);
The host application can then examine the latency of the loop in the kernel, which is the second counter value minus the first.
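On the host side, a cycle count is often easier to interpret once converted to time. The following is a minimal sketch in plain C++, assuming the caller supplies the AI Engine clock frequency (the actual frequency depends on the device and design and is not stated here). The helper names elapsed_cycles and cycles_to_ns are hypothetical and not part of the AI Engine API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: elapsed cycles between two counter readings.
// Unsigned 64-bit subtraction is modulo 2^64, so the result stays correct
// even if the counter wrapped once between the two reads.
inline std::uint64_t elapsed_cycles(std::uint64_t start, std::uint64_t end) {
    return end - start;
}

// Hypothetical helper: convert a cycle count to nanoseconds.
// clock_hz is an assumption supplied by the caller, not a documented value.
inline double cycles_to_ns(std::uint64_t cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return static_cast<double>(cycles) * 1e9 / clock_hz;
}
```

With an assumed 1 GHz clock, cycles_to_ns(elapsed_cycles(time1, time2), 1e9) gives the same number as the cycle difference, expressed in nanoseconds.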
Latency can also be calculated by comparing counter values read in different executions of the kernel, or in different iterations of a loop. For example, the following code measures the latency of certain operations on an asynchronous buffer:
aie::tile tile=aie::tile::current();
for(...){ //outer loop
    unsigned long long time=tile.cycles(); //read counter value
    writeincr(out,time);
    win_in.acquire();
    for(...){...} //inner loop
    win_in.release();
}
The latency of acquiring and releasing the asynchronous buffer, plus the execution time of the inner loop, can then be calculated by subtracting the counter value written in one outer-loop iteration from the value written in the next.
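The subtraction between consecutive iterations can be sketched in plain C++ on the host. This sketch assumes the streamed counter values have already been collected into a std::vector in arrival order. The function name iteration_latencies is hypothetical and not part of any AMD API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side helper: given the counter values streamed out at the
// start of each outer-loop iteration, return the cycle count between
// consecutive iterations. Each interval covers the buffer acquire, the inner
// loop, the buffer release, and the counter read and stream write themselves.
inline std::vector<std::uint64_t>
iteration_latencies(const std::vector<std::uint64_t>& stamps) {
    std::vector<std::uint64_t> deltas;
    if (stamps.size() < 2) {
        return deltas;  // at least two readings are needed to form a difference
    }
    deltas.reserve(stamps.size() - 1);
    for (std::size_t i = 1; i < stamps.size(); ++i) {
        deltas.push_back(stamps[i] - stamps[i - 1]);
    }
    assert(deltas.size() == stamps.size() - 1);
    return deltas;
}
```

Comparing the resulting per-iteration values against each other is one way to spot iterations in which the acquire had to wait longer than usual.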
The counter value can also be written to data memory. The value can be read back with printf in simulation, or by host code in hardware. If the stored value is not used by any other code, the volatile qualifier can be used to force the counter value to be stored. This qualifier ensures that compiler optimizations do not eliminate the variable. For example:
static unsigned long long cycle_num[2];
aie::tile tile=aie::tile::current();
volatile unsigned long long *p_cycle=cycle_num;
*p_cycle=tile.cycles();//cycle_num[0]
for(...){...}
*(p_cycle+1)=tile.cycles();//cycle_num[1]
printf("cycles=%llu\n",cycle_num[1]-cycle_num[0]);
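The store-through-volatile pattern can be tried on a host compiler before moving to the AI Engine. The plain C++ sketch below replaces tile.cycles() with a hypothetical stand-in, fake_cycles(), which only increments a software counter. It is not the AI Engine API and does not measure real time. It only illustrates how both stores go through the volatile pointer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for aie::tile::cycles(): a software counter that
// advances by 10 on every read. Not part of the AI Engine API.
std::uint64_t fake_cycles() {
    static std::uint64_t counter = 0;
    counter += 10;
    return counter;
}

static std::uint64_t cycle_num[2];

// Stores two counter readings around a workload through a volatile pointer,
// then returns the difference read back through the same pointer.
std::uint64_t measure(int iterations) {
    assert(iterations >= 0);
    volatile std::uint64_t* p_cycle = cycle_num;
    p_cycle[0] = fake_cycles();            // first reading -> cycle_num[0]
    for (int i = 0; i < iterations; ++i) {
        (void)fake_cycles();               // workload: each step advances the counter by 10
    }
    p_cycle[1] = fake_cycles();            // second reading -> cycle_num[1]
    return p_cycle[1] - p_cycle[0];
}
```

Because every access through p_cycle is volatile, the compiler must perform both stores and both loads even though nothing else in the program uses cycle_num.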