Performance Profiling Methods - 2022.2 English

Vitis Tutorials: AI Engine (XD100)

Document ID: XD100
Release Date: 2022-12-01
Version: 2022.2 English

This example introduces several methods for profiling the design. The code to be profiled is in aie/graph.cpp:

//Profile starts here
for(int i=0;i<num;i++){
  // Issue non-blocking DDR-to-AIE and AIE-to-DDR transfers on each GMIO port
  gr.gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
  gr.gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
}
for(int i=0;i<num;i++){
  // Block until each output transfer has completed
  gr.gmioOut[i].wait();
}
//Profile ends here
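
The dinArray and doutArray buffers must be allocated with GMIO::malloc before the transfers are issued so that the memory is reachable over the NoC (the "GMIO::malloc completed" message in the output below comes from this step). The following is a minimal sketch of that setup, reusing the buffer and size names from the loop above; the iteration count ITERATION is a hypothetical name for illustration:

// Sketch of the assumed setup in aie/graph.cpp (not verbatim tutorial code)
int32* dinArray[32];
int32* doutArray[32];
for(int i=0;i<num;i++){
  dinArray[i]=(int32*)GMIO::malloc(BLOCK_SIZE_in_Bytes);   // DDR memory reachable by GMIO
  doutArray[i]=(int32*)GMIO::malloc(BLOCK_SIZE_out_Bytes);
}
gr.init();
gr.run(ITERATION); // kernels are enabled by graph::run before the transfers start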

Note: This tutorial assumes that the AI Engine runs at 1 GHz.

  1. Profile by C++ class API

The C++ class API code is common to Linux systems across the various platforms. The Timer class is defined as follows:

#include <chrono>

class Timer {
  std::chrono::high_resolution_clock::time_point mTimeStart;
public:
  Timer() { reset(); }
  // Return elapsed microseconds since construction or the last reset()
  long long stop() {
    std::chrono::high_resolution_clock::time_point timeEnd =
        std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(timeEnd - mTimeStart).count();
  }
  void reset() { mTimeStart = std::chrono::high_resolution_clock::now(); }
};

The code to start profiling is as follows:

Timer timer;

The code to end profiling and calculate performance is as follows:

double timer_stop=timer.stop(); // elapsed microseconds
// Total bytes moved in both directions per second, reported in megabytes (1024*1024 bytes)
double throughput=(BLOCK_SIZE_in_Bytes+BLOCK_SIZE_out_Bytes)*num/timer_stop*1000000/1024/1024;
std::cout<<"Throughput (by timer GMIO in num="<<num<<",out num="<<num<<"):\t"<<throughput<<"M Bytes/s"<<std::endl;
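
Putting the pieces together, the timer is constructed immediately before the non-blocking transfers are issued and stopped after the last wait() returns, so only the transfer window is measured. A minimal sketch of the complete measurement, reusing the graph and buffers from above:

Timer timer; // timing starts on construction
for(int i=0;i<num;i++){
  gr.gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
  gr.gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
}
for(int i=0;i<num;i++){
  gr.gmioOut[i].wait();
}
double elapsed_us=timer.stop(); // elapsed microseconds for all transfers
double throughput=(BLOCK_SIZE_in_Bytes+BLOCK_SIZE_out_Bytes)*num/elapsed_us*1000000/1024/1024;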

The code is guarded by the macro __TIMER__. To use this profiling method, define __TIMER__ for the g++ cross compiler in sw/Makefile:

CXXFLAGS += -std=c++14 -D__TIMER__ -I$(XILINX_HLS)/include/ -I${SYSROOT}/usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=${SYSROOT} -I${XILINX_VITIS}/aietools/include ${HOST_INC}
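
In the host code, the guard takes the usual preprocessor form; the following sketch shows the pattern (the tutorial source wraps the timer calls the same way):

#ifdef __TIMER__
Timer timer; // profiling is compiled in only when __TIMER__ is defined
#endif
// ... issue GMIO transfers and wait for completion ...
#ifdef __TIMER__
double timer_stop=timer.stop();
// ... compute and print throughput as shown above ...
#endif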

To run it in hardware, use the following make command to build the hardware image:

make package TARGET=hw

After packaging completes, run the following commands at the Linux prompt after booting Linux from the SD card:

export XILINX_XRT=/usr
cd /run/media/mmcblk0p1
./host.exe a.xclbin

The output in hardware is as follows:

GMIO::malloc completed
Throughput (by timer GMIO in num=1,out num=1):	5076.64M Bytes/s
Throughput (by timer GMIO in num=2,out num=2):	8335.5M Bytes/s
Throughput (by timer GMIO in num=4,out num=4):	9543.97M Bytes/s
Throughput (by timer GMIO in num=8,out num=8):	9717.18M Bytes/s
Throughput (by timer GMIO in num=16,out num=16):	10154.1M Bytes/s
Throughput (by timer GMIO in num=32,out num=32):	10246.8M Bytes/s
AIE GMIO PASSED!
GMIO::free completed
PASS!
  2. Profile by AI Engine cycles obtained from AI Engine kernels

In this design, each AI Engine kernel outputs its cycle counter value at the end of every iteration. Each iteration thus produces 256 int32 data samples plus a long long AI Engine cycle counter value. The very first and the very last cycle values across all profiled AI Engine kernels are recorded, because the kernels start at different cycles even though they are enabled by the same graph::run. From these two values, the system throughput across all kernels can be calculated.

Note that there is some gap between the actual performance and the calculated number, because some data transfer has already taken place before the recorded starting cycle. This overhead is negligible when the total iteration count is high; it is 512 in this example.
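
On the kernel side, the cycle stamp can be read with the AIE API's tile cycle counter and appended to the output after the 256 data samples of each iteration. The following is a minimal sketch assuming a window-based int32 kernel; the kernel name is hypothetical and the tutorial's actual kernel code may differ:

// Hedged sketch of the kernel-side cycle stamping (not the tutorial's exact code)
#include <adf.h>
#include <aie_api/aie.hpp>

void my_kernel(input_window<int32>* in, output_window<int32>* out){ // hypothetical name
  // ... compute and write 256 int32 results for this iteration ...
  unsigned long long time=aie::tile::current().cycles(); // 64-bit tile cycle counter
  window_writeincr(out,(int32)(time&0xffffffff)); // low word first (little endian),
  window_writeincr(out,(int32)(time>>32));        // so the host can read it back as a long long
}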

The code to read the AI Engine cycles and calculate the system throughput is as follows. Because the AI Engine runs at 1 GHz, one cycle corresponds to one nanosecond, so dividing total bytes by elapsed cycles and multiplying by 1000 yields MB/s:

long long start[32];
long long end[32];
long long very_beginning=0x0FFFFFFFFFFFFFFF;
long long the_last=0;
for(int i=0;i<num;i++){
  // The first cycle stamp follows the first 256 int32 samples of each output block;
  // the last stamp occupies the final two int32 words of the block
  start[i]=*(long long*)&doutArray[i][256];
  end[i]=*(long long*)&doutArray[i][BLOCK_SIZE_out_Bytes/sizeof(int)-2];
  if(start[i]<very_beginning){
    very_beginning=start[i];
  }
  if(end[i]>the_last){
    the_last=end[i];
  }
}
std::cout<<"Throughput (by AIE kernel cycles in="<<num<<",out="<<num<<") ="
  <<(double)(BLOCK_SIZE_in_Bytes+BLOCK_SIZE_out_Bytes)*num/(double)(the_last-very_beginning)*1000
  <<"M Bytes/s"<<std::endl;

The code is guarded by the macro __AIE_CYCLES__. To use this profiling method, define __AIE_CYCLES__ for the g++ cross compiler in sw/Makefile:

CXXFLAGS += -std=c++14 -D__AIE_CYCLES__ -I$(XILINX_HLS)/include/ -I${SYSROOT}/usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=${SYSROOT} -I${XILINX_VITIS}/aietools/include ${HOST_INC}

The commands to build and run in hardware are the same as previously shown. The output in hardware is as follows:

GMIO::malloc completed
Throughput (by AIE kernel cycles in=1,out=1) =6545.68M Bytes/s
Throughput (by AIE kernel cycles in=2,out=2) =10267.4M Bytes/s
Throughput (by AIE kernel cycles in=4,out=4) =11036.8M Bytes/s
Throughput (by AIE kernel cycles in=8,out=8) =10807.9M Bytes/s
Throughput (by AIE kernel cycles in=16,out=16) =10958.2M Bytes/s
Throughput (by AIE kernel cycles in=32,out=32) =10892.8M Bytes/s
AIE GMIO PASSED!
GMIO::free completed
PASS!
  3. Profile by event API

The AI Engine has hardware performance counters that can be configured to count hardware events for measuring performance metrics. The API used in this example profiles graph throughput on a specific GMIO port. Profiling multiple GMIO ports with the event API may cause conflicts, because a performance counter is shared between GMIO ports that access the same AI Engine-PL interface column. Thus, only one GMIO output is profiled to demonstrate this methodology.

The code to start profiling is as follows:

std::cout<<"total input/output num="<<num<<std::endl;
event::handle handle[32];
for(int i=0;i<1;i++){
  handle[i] = event::start_profiling(gr.gmioOut[i], event::io_stream_start_to_bytes_transferred_cycles, BLOCK_SIZE_out_Bytes);
}

The code to end profiling and calculate performance is as follows:

long long cycle_count[32];
for(int i=0;i<1;i++){
  cycle_count[i] = event::read_profiling(handle[i]); // cycles from stream start until all bytes transferred
  event::stop_profiling(handle[i]); // release the shared performance counter
}
for(int i=0;i<1;i++){
  double bandwidth = (double)BLOCK_SIZE_out_Bytes / (double)cycle_count[i] * 1000; // MB/s, assuming a 1 GHz AI Engine clock
  std::cout<<"Throughput (by event API) gmioOut["<<i<<"] bandwidth="<<bandwidth<<"M Bytes/s"<<std::endl;
}

In this example, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event until BLOCK_SIZE_out_Bytes bytes have been transferred, assuming that the stream stops right after the specified number of bytes is transferred.
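
As a worked example of the conversion (the byte count is an illustrative assumption, derived from 512 iterations of 256 int32 samples plus one 64-bit cycle stamp, i.e., 528,384 bytes): if the counter reads roughly 159,000 cycles, the transfer took about 159 µs at 1 GHz, giving 528,384 / 159,000 × 1000 ≈ 3323 MB/s, consistent with the num=1 result below.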

For detailed usage of the event API, refer to the Versal ACAP AI Engine Programming Environment User Guide (UG1076).

The code is guarded by the macro __USE_EVENT_PROFILE__. To use this profiling method, define __USE_EVENT_PROFILE__ for the g++ cross compiler in sw/Makefile:

CXXFLAGS += -std=c++14 -D__USE_EVENT_PROFILE__ -I$(XILINX_HLS)/include/ -I${SDKTARGETSYSROOT}/usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=${SDKTARGETSYSROOT} -I${XILINX_VITIS}/aietools/include ${HOST_INC}

The commands to build and run in hardware are the same as previously shown. The output in hardware is as follows:

GMIO::malloc completed
total input/output num=1
Throughput (by event API) gmioOut[0] bandwidth=3321.1M Bytes/s
total input/output num=2
Throughput (by event API) gmioOut[0] bandwidth=2737.72M Bytes/s
total input/output num=4
Throughput (by event API) gmioOut[0] bandwidth=1608.08M Bytes/s
total input/output num=8
Throughput (by event API) gmioOut[0] bandwidth=1044.15M Bytes/s
total input/output num=16
Throughput (by event API) gmioOut[0] bandwidth=972.539M Bytes/s
total input/output num=32
Throughput (by event API) gmioOut[0] bandwidth=989.08M Bytes/s
AIE GMIO PASSED!
GMIO::free completed
PASS!

Note that as the number of GMIO ports in use increases, the throughput measured on each individual GMIO port drops, indicating that the total system throughput is limited by NoC and DDR memory bandwidth.