Profiling using AI Engine Cycles Received from AI Engine Kernels - 2023.2 English

Vitis Tutorials: AI Engine (XD100)

In this design, each AI Engine kernel outputs its cycle counter at the end of each iteration. Each iteration produces 256 int32 values followed by a 64-bit (unsigned long long) AI Engine cycle counter value. Multiple AI Engine kernels can start at different cycles even though they are enabled by the same graph::run, so the first and the last cycle of every profiled kernel are recorded. The system throughput for all the kernels is then calculated from the earliest start cycle and the latest end cycle.

Note: The calculated number differs somewhat from the actual performance because some data transfers take place before and after the recorded cycles.

The code to get AI Engine cycles and calculate the system throughput is as follows:

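   unsigned long long start[NUM];
   unsigned long long end[NUM];
   unsigned long long very_beginning=0xFFFFFFFFFFFFFFFF;
   unsigned long long the_last=0;
    for(int i=0;i<NUM;i++){
        // First and last cycle counter values written by kernel i
        start[i]=*(unsigned long long*)&doutArray[i][256];
        end[i]=*(unsigned long long*)&doutArray[i][BLOCK_SIZE_out_Bytes/sizeof(int)-2];
        // Track the earliest start and the latest end across all kernels
        if(start[i]<very_beginning) very_beginning=start[i];
        if(end[i]>the_last) the_last=end[i];
    }
    // Cycles are converted at 0.8 ns per cycle (a 1.25 GHz clock); bytes/ns*1000 gives M Bytes/s
    std::cout<<"Throughput (by AIE kernel cycles in="<<NUM<<",out="<<NUM<<") ="<<(double)(BLOCK_SIZE_in_Bytes+BLOCK_SIZE_out_Bytes)*NUM/((double)(the_last-very_beginning)*0.8)*1000<<"M Bytes/s"<<std::endl;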
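The same calculation can be sketched as standalone helper functions, which may make the arithmetic easier to check off-target. This is a minimal sketch, not the tutorial's host code: the names read_cycles, CycleWindow, find_window, and throughput_mbytes_per_s are hypothetical, the word offsets are passed in by the caller, and std::memcpy is used in place of the pointer cast above to sidestep alignment and aliasing concerns. The default of 0.8 ns per cycle is taken from the formula above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Hypothetical helper: read a 64-bit cycle counter stored at 32-bit word
// offset `word` of an output buffer. memcpy replaces the pointer cast.
inline std::uint64_t read_cycles(const std::int32_t* buf, std::size_t word) {
    std::uint64_t cycles = 0;
    std::memcpy(&cycles, buf + word, sizeof(cycles));
    return cycles;
}

// Earliest start cycle and latest end cycle over all profiled kernels.
struct CycleWindow {
    std::uint64_t first_start;
    std::uint64_t last_end;
};

// Hypothetical helper: scan every output buffer and keep the minimum start
// counter and the maximum end counter.
inline CycleWindow find_window(const std::vector<const std::int32_t*>& outputs,
                               std::size_t start_word, std::size_t end_word) {
    CycleWindow w{std::numeric_limits<std::uint64_t>::max(), 0};
    for (const std::int32_t* buf : outputs) {
        w.first_start = std::min(w.first_start, read_cycles(buf, start_word));
        w.last_end = std::max(w.last_end, read_cycles(buf, end_word));
    }
    return w;
}

// Bytes per nanosecond times 1000 gives M Bytes/s.
// ns_per_cycle defaults to 0.8, the factor used in the formula above.
inline double throughput_mbytes_per_s(double total_bytes, CycleWindow w,
                                      double ns_per_cycle = 0.8) {
    assert(w.last_end > w.first_start);  // guard against an empty window
    const double elapsed_ns =
        static_cast<double>(w.last_end - w.first_start) * ns_per_cycle;
    return total_bytes / elapsed_ns * 1000.0;
}
```

As a usage example, two buffers whose counters span cycles 1000 to 3500 give a window of 2500 cycles, or 2000 ns at 0.8 ns per cycle. Passing a total of 8000 bytes then yields about 4000 M Bytes/s.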

The code is guarded by the macro __AIE_CYCLES__. To use this method of profiling, define __AIE_CYCLES__ for the g++ cross compiler in sw/Makefile:

CXXFLAGS += -std=c++17 -D__AIE_CYCLES__ ......

The commands to build and run in hardware are the same as previously shown. The output in hardware is similar to the following:

Throughput (by AIE kernel cycles in=4,out=4) =10561.8M Bytes/s