Introduction to Scalar and Vector Programming - 2024.1 English

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Document ID: UG1603
Release Date: 2024-06-06
Version: 2024.1 English

This section provides an overview of the key elements of kernel programming for the scalar and vector processing elements. The details of each element, and the optimizations it needs, are covered in the following sections.

The following example uses only the scalar engine. A for loop iterates through buffers of 512 int16 elements. Each loop iteration multiplies an int16 element a by an int16 element b, stores the int32 result in c, and writes it to the output buffer.

Iterators are used to read from the input buffers and write to the output buffer. For details on the iterators, see Iterators.

void scalar_mul(input_buffer<int16>& __restrict data1, 
    input_buffer<int16>& __restrict data2, 
    output_buffer<int32>& __restrict out){ 
  // iterators to access inputs "data1" and "data2"
  auto inIter1=aie::begin(data1); 
  auto inIter2=aie::begin(data2);
  // iterator to access output "out"
  auto outIter=aie::begin(out); 

  for(int i=0;i<512;i++) { 
    // read data from buffer and point to next entry
    int16 a=*inIter1++; 
    int16 b=*inIter2++; 
    int32 c=a*b; 
    // write result to buffer and point to next entry
    *outIter++=c;    
  }
}
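
The product of two int16 values can exceed the int16 range, which is consistent with the kernel writing int32 results. The following is a minimal host-side sketch in plain C++. It does not use the AI Engine API, and the name widening_mul is hypothetical, introduced here only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the AI Engine API: multiplies two
// 16-bit values and keeps the full 32-bit product, in the same spirit
// as `int32 c=a*b;` in the scalar kernel above.
constexpr std::int32_t widening_mul(std::int16_t a, std::int16_t b) {
    // Widen both operands explicitly before multiplying so the
    // intent (a 32-bit result) is visible in the code.
    return static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b);
}
```

For example, widening_mul(300, 300) is 90000, which already does not fit in int16 (maximum 32767). The largest possible magnitude, (-32768) * (-32768) = 1073741824, still fits in int32.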

The following example is a vectorized version of the same kernel, executed on the vector processor.

void vect_mul(  input_buffer<int16>& __restrict data1, 
     input_buffer<int16>& __restrict data2,
     output_buffer<int32>& __restrict out){ 
  // iterator to access a vector (a collection of elements) 
  // in the buffer "data1"
  auto inIter1=aie::begin_vector<16>(data1);
  // iterator to access a vector (a collection of elements) 
  // in the buffer "data2"
  auto inIter2=aie::begin_vector<16>(data2);
  // iterator to access a vector (a collection of elements) 
  // in the buffer "out"
  auto outIter=aie::begin_vector<16>(out);

  for(int i=0;i<512/16;i++)
  chess_prepare_for_pipelining { 
     // read 16 elements from the buffer and point to the next entry
     auto va=*inIter1++;
     auto vb=*inIter2++;

     //element-by-element multiplication, with results 
     // in an accumulator register
     auto vt=aie::mul(va,vb); 

     // move data from accumulator register to vector register 
     // with a shift of zero, and transfer to output buffer; 
     // increment the iterator to point to the next entry
     *outIter++=vt.to_vector<int32>(0);
  }
}

The iterators used in this vectorized version are vector iterators that read one aie::vector<int16,16> vector at a time. A for loop iterates through buffers of 512 int16 elements, sixteen samples at a time, for a total of 512/16 = 32 iterations. Each loop iteration multiplies the sixteen int16 elements of va by the sixteen int16 elements of vb, element by element, and stores the result in vt. The output of aie::mul is an accumulator vector. The to_vector function converts it to a value of type aie::vector<int32,16>, which is written to the output buffer. Details on the data types supported by the AI Engine-ML are covered in the following sections.
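
To make the lane-by-lane behavior concrete, the following is a minimal host-side model in plain C++. It is a sketch only: it does not use the AI Engine API, it runs as ordinary sequential code, and the names VecI16, VecI32, mul_lanes, and vect_mul_model are hypothetical. It assumes that a shift of zero in to_vector leaves each 32-bit product unchanged.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;     // matches begin_vector<16> in the kernel
constexpr std::size_t kSamples = 512;  // buffer size used in the examples

// Hypothetical stand-ins for one 16-lane input vector and one result vector.
using VecI16 = std::array<std::int16_t, kLanes>;
using VecI32 = std::array<std::int32_t, kLanes>;

// Element-by-element multiply of two 16-lane vectors. On the device the
// lanes are computed together; here a loop models the same result.
VecI32 mul_lanes(const VecI16& va, const VecI16& vb) {
    VecI32 vt{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        vt[lane] = static_cast<std::int32_t>(va[lane]) * vb[lane];
    }
    return vt;
}

// Walks the buffers one vector at a time and returns the number of
// loop iterations that were needed.
std::size_t vect_mul_model(const std::int16_t* data1,
                           const std::int16_t* data2,
                           std::int32_t* out,
                           std::size_t n) {
    assert(n % kLanes == 0);  // the model only handles whole vectors
    std::size_t iterations = 0;
    for (std::size_t base = 0; base < n; base += kLanes) {
        VecI16 va{}, vb{};
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            va[lane] = data1[base + lane];  // models *inIter1++
            vb[lane] = data2[base + lane];  // models *inIter2++
        }
        const VecI32 vt = mul_lanes(va, vb);
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            out[base + lane] = vt[lane];    // models *outIter++ = ...
        }
        ++iterations;
    }
    return iterations;
}
```

Called with n = kSamples, the model performs 32 iterations and produces the same values as a scalar loop over all 512 elements.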

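The __restrict keyword used on the input and output parameters of the functions tells the compiler that the data accessed through each parameter does not overlap with the data accessed through the others. With this independence stated explicitly, the compiler can optimize more aggressively, for example by reordering loads and stores.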
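
The same qualifier can be tried on a host compiler. The following sketch is plain C++, not an AI Engine kernel, and the function name mul_restrict is hypothetical. It relies on __restrict being accepted as an extension by GCC, Clang, and MSVC; it is not part of standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side function. Each pointer is marked __restrict,
// which promises the compiler that the three arrays do not overlap.
// The caller is responsible for keeping that promise; if the arrays
// do overlap, the result is not guaranteed.
void mul_restrict(const std::int16_t* __restrict a,
                  const std::int16_t* __restrict b,
                  std::int32_t* __restrict out,
                  std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        // Without the qualifier, the compiler would have to assume that
        // writing out[i] could change a later a[j] or b[j].
        out[i] = static_cast<std::int32_t>(a[i]) * b[i];
    }
}
```

The qualifier does not change the computed values for non-overlapping arrays. It only widens what the compiler is allowed to assume when scheduling the loop.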

chess_prepare_for_pipelining is a compiler pragma that explicitly directs the kernel compiler to software-pipeline the loop. The directive might cause code to be inlined, which can affect program memory size.

The scalar version of this example function needs around 1040 cycles of execution time, while the vectorized and optimized version needs only around 84 cycles. This is a speedup of more than 10x (roughly 12x) for the vectorized version of the kernel.
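
The arithmetic behind these figures can be checked with a few lines of plain C++. This is only a sketch: the cycle counts are the approximate values quoted above, and the names kScalarCycles, kVectorCycles, speedup, and cycles_per_sample are hypothetical.

```cpp
// Approximate cycle counts quoted in the text for 512 samples.
constexpr double kScalarCycles = 1040.0;
constexpr double kVectorCycles = 84.0;
constexpr double kSamples = 512.0;

// Ratio of scalar to vectorized execution time (about 12.4).
constexpr double speedup() { return kScalarCycles / kVectorCycles; }

// Average cost of one sample: about 2.03 cycles for the scalar
// version and about 0.16 cycles for the vectorized version.
constexpr double cycles_per_sample(double cycles) { return cycles / kSamples; }

static_assert(speedup() > 10.0, "consistent with the 'more than 10x' statement");
```

Seen per sample, the scalar loop spends about two cycles on each element, while the vectorized loop averages well under one cycle per element because each iteration handles sixteen elements.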

The sections that follow describe in detail the data types that can be used, the registers available, and the kinds of optimizations that can be achieved on the AI Engine-ML using concepts such as software pipelining in loops and keywords such as __restrict.