Scalar Golden Reference - 2022.1 English

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID: UG1079
Release Date: 2022-05-25
Version: 2022.1 English

The AI Engine contains a scalar processor that can be used to implement scalar math operations, non-linear functions, and so on. It is sometimes helpful to have a golden scalar reference version of the code. Note, however, that the scalar version usually takes much longer to run, both in simulation and in hardware.

The following example code shows a scalar version of a 32-tap filter:
static cint16 eq_coef[32]={{1,2},{3,4},...};

//keep margin data between different executions of graph 
static cint16 delay_line[32];

__attribute__((noinline)) void fir_32tap_scalar(input_stream<cint16> * sig_in,
      output_stream<cint16> * sig_out){
  //For profiling only 
  unsigned cycle_num[2];
  aie::tile tile=aie::tile::current();
  cycle_num[0]=tile.cycles();//cycle counter of the AI Engine tile

  for(int i=0;i<SAMPLES;i++){
    cint64 sum={0,0};//larger data to mimic accumulator
    for(int j=0;j<32;j++){
      //auto integer promotion to prevent overflow
      sum.real+=delay_line[j].real*eq_coef[j].real-delay_line[j].imag*eq_coef[j].imag;
      sum.imag+=delay_line[j].real*eq_coef[j].imag+delay_line[j].imag*eq_coef[j].real;
    }
    sum=sum>>SHIFT;
    //produce one sample per loop iteration
    writeincr(sig_out,{(int16)sum.real,(int16)sum.imag});

    for(int j=0;j<32;j++){
      if(j==31){
        delay_line[j]=readincr(sig_in);
      }else{
        delay_line[j]=delay_line[j+1];
      }
    }
  }
  
  //For profiling only 
  cycle_num[1]=tile.cycles();//cycle counter of the AI Engine tile
  //cycle_num is unsigned, so print it with %u
  printf("start=%u,end=%u,total=%u\n",cycle_num[0],cycle_num[1],cycle_num[1]-cycle_num[0]);
}
void fir_32tap_scalar_init()
{
  //initialize data
  for (int i=0;i<32;i++){
    int tmp=get_ss(0);
    delay_line[i]=*(cint16*)&tmp;
  }
}
Note:
  • The function fir_32tap_scalar_init is used as the initialization function for the kernel; it is called only once, after graph.run().
  • Rounding and saturation modes are not supported in the scalar processor. They can be implemented with standard C operations, such as shifts.
  • The tile counter is used to profile the main loop of the code.
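
As an illustration of the second note, the following is a minimal sketch of how rounding and saturation might be written with standard C++ operations. The helper name round_shift_sat16 is hypothetical and is not part of the AI Engine API. The sketch assumes round-half-up behavior and an int16 output range, and it assumes the accumulator is far enough from the int64 limits that adding the rounding constant cannot overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an AI Engine API function): shift an accumulator
// right with round-half-up, then saturate the result to the int16 range.
inline int16_t round_shift_sat16(int64_t acc, unsigned shift) {
    assert(shift < 63);  // keep the rounding constant representable
    if (shift > 0) {
        acc += int64_t{1} << (shift - 1);  // add half an LSB of the result
    }
    // Arithmetic right shift for negative values (guaranteed since C++20,
    // implementation-defined but near-universal in earlier standards).
    acc >>= shift;
    if (acc > INT16_MAX) return INT16_MAX;  // saturate high
    if (acc < INT16_MIN) return INT16_MIN;  // saturate low
    return static_cast<int16_t>(acc);
}
```

In the scalar kernel above, a helper like this could stand in for the plain sum>>SHIFT followed by the (int16) cast, applied to the real and imaginary parts separately, when the implementation being checked is expected to round and saturate rather than truncate.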

From the profiling result, you can see that each sample takes around 3050 cycles.
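
A per-sample figure like this is presumably obtained by dividing the printed total by the number of samples processed in the loop. The following sketch shows that arithmetic; the helper name cycles_per_sample is hypothetical, and the 32-bit counter width is an assumption based on the unsigned variables used in the listing above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: average cycles per sample from two counter readings.
// Assumes the readings were stored in 32-bit unsigned variables.
inline uint32_t cycles_per_sample(uint32_t start, uint32_t end,
                                  uint32_t samples) {
    assert(samples > 0);
    // Unsigned subtraction is modular, so the difference is still correct
    // if the stored counter value wrapped around once between readings.
    uint32_t total = end - start;
    return total / samples;  // integer average, fraction discarded
}
```

Because the second counter reading is taken before the printf call, the cost of printing is not included in the measured total.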

For more information about graph construction and the different profiling techniques, see AI Engine Simulation-Based Performance Analysis and Performance Analysis of AI Engine Graph Application on Hardware in the Versal ACAP AI Engine Programming Environment User Guide (UG1076).
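
When the scalar kernel serves as a golden reference, it can also be convenient to have a portable model of the same loop that runs off-target. The following is a minimal sketch under stated assumptions: Cplx16, kTaps, fir_32tap_model, and first_mismatch are hypothetical names, plain containers replace the input and output streams, and the products are widened to 64 bits so that the sum of two int16 products cannot overflow.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the AI Engine cint16 type.
struct Cplx16 {
    int16_t real;
    int16_t imag;
};

constexpr std::size_t kTaps = 32;

// Portable model of the scalar loop: for each input sample, emit one output
// computed from the current delay line, then shift the new sample in.
inline std::vector<Cplx16> fir_32tap_model(
        const std::vector<Cplx16>& in,
        const std::array<Cplx16, kTaps>& coef,
        std::array<Cplx16, kTaps>& delay,  // updated in place, like delay_line
        unsigned shift) {
    assert(shift < 64);
    std::vector<Cplx16> out;
    out.reserve(in.size());
    for (const Cplx16& sample : in) {
        int64_t re = 0;
        int64_t im = 0;
        for (std::size_t j = 0; j < kTaps; ++j) {
            const int64_t dr = delay[j].real, di = delay[j].imag;
            const int64_t cr = coef[j].real, ci = coef[j].imag;
            re += dr * cr - di * ci;  // real part of complex product
            im += dr * ci + di * cr;  // imaginary part of complex product
        }
        // Truncating shift and narrowing cast, matching the kernel listing.
        out.push_back({static_cast<int16_t>(re >> shift),
                       static_cast<int16_t>(im >> shift)});
        // Shift the delay line by one and append the newest sample.
        for (std::size_t j = 0; j + 1 < kTaps; ++j) {
            delay[j] = delay[j + 1];
        }
        delay[kTaps - 1] = sample;
    }
    return out;
}

// Index of the first differing sample, or -1 if the sequences match.
inline long first_mismatch(const std::vector<Cplx16>& a,
                           const std::vector<Cplx16>& b) {
    if (a.size() != b.size()) return 0;  // treat a length mismatch as index 0
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i].real != b[i].real || a[i].imag != b[i].imag) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

The output captured from an optimized kernel can then be compared with the model output using first_mismatch, provided both start from the same delay-line contents and use the same shift value.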