This section provides an overview of the key elements of kernel programming for the scalar and vector processing elements. The details of each element, and the optimizations it needs, are covered in the following sections.
The following example uses only the scalar engine. It demonstrates a for loop iterating through buffers of 512 int16 elements. Each loop iteration multiplies one int16 element a by one int16 element b, stores the int32 result in c, and writes it to an output buffer.
Iterators are used to read from the input buffers and write to the output buffer. For details on the iterators, see Iterators.
void scalar_mul(input_buffer<int16>& __restrict data1,
                input_buffer<int16>& __restrict data2,
                output_buffer<int32>& __restrict out) {
    // iterators to access inputs "data1" and "data2"
    auto inIter1 = aie::begin(data1);
    auto inIter2 = aie::begin(data2);
    // iterator to access output "out"
    auto outIter = aie::begin(out);
    for (int i = 0; i < 512; i++) {
        // read data from buffer and point to next entry
        int16 a = *inIter1++;
        int16 b = *inIter2++;
        int32 c = a * b;
        // write result to buffer and point to next entry
        *outIter++ = c;
    }
}
The following example is a vectorized version of the same kernel, which runs on the vector processor.
void vect_mul(input_buffer<int16>& __restrict data1,
              input_buffer<int16>& __restrict data2,
              output_buffer<int32>& __restrict out) {
    // iterator to access a vector (a collection of elements)
    // in the buffer "data1"
    auto inIter1 = aie::begin_vector<16>(data1);
    // iterator to access a vector (a collection of elements)
    // in the buffer "data2"
    auto inIter2 = aie::begin_vector<16>(data2);
    // iterator to access a vector (a collection of elements)
    // in the buffer "out"
    auto outIter = aie::begin_vector<16>(out);
    for (int i = 0; i < 512 / 16; i++)
    chess_prepare_for_pipelining {
        // read 16 elements from each buffer and point to the next entry
        auto va = *inIter1++;
        auto vb = *inIter2++;
        // element-by-element multiplication, with results
        // in an accumulator register
        auto vt = aie::mul(va, vb);
        // move data from accumulator register to vector register
        // with a shift of zero, and transfer to output buffer;
        // increment the iterator to point to the next entry
        *outIter++ = vt.to_vector<int32>(0);
    }
}
The input buffers are read using special vector iterators that produce aie::vector<int16,16> values. A for loop iterates through buffers of 512 int16 elements, sixteen samples at a time, for 32 iterations in total. va and vb are two vectors of sixteen int16 elements, loaded using the vector iterators. Each loop iteration multiplies va by vb element by element, storing the result in vt and writing it to an output buffer. The output of aie::mul is an accumulator vector. The to_vector function converts it, here with a shift of zero, to a value of type aie::vector<int32,16> that is written to the output buffer. Details on the data types supported by the AIE-ML and AIE-ML v2 architectures are covered in the following sections.
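For readers without the AI Engine toolchain at hand, the lane-wise behavior described above can be modeled in portable C++. The following is a minimal sketch and not AIE API code: the names kSamples, kLanes, and vect_mul_model are hypothetical, std::array stands in for the buffers, and the inner loop stands in for a single element-by-element vector multiply. It says nothing about cycle counts or register usage on the device.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical constants mirroring the example: 512 samples, 16 lanes per vector.
constexpr std::size_t kSamples = 512;
constexpr std::size_t kLanes = 16;

// Portable model of the vectorized kernel (an illustration, not AIE API code).
void vect_mul_model(const std::array<std::int16_t, kSamples>& data1,
                    const std::array<std::int16_t, kSamples>& data2,
                    std::array<std::int32_t, kSamples>& out) {
    // 512 / 16 = 32 outer iterations, one per 16-element vector.
    for (std::size_t i = 0; i < kSamples / kLanes; ++i) {
        // The inner loop stands in for one element-by-element vector multiply.
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            const std::size_t idx = i * kLanes + lane;
            // Widen to 32 bits so the int16 x int16 product is held exactly.
            out[idx] = static_cast<std::int32_t>(data1[idx]) *
                       static_cast<std::int32_t>(data2[idx]);
        }
    }
}
```

A model like this can serve as a golden reference when checking kernel output in simulation, since each output element should equal the exact 32-bit product of the corresponding inputs.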
The __restrict keyword used on the input and output parameters of the functions allows for more aggressive compiler optimization by explicitly stating that the data they refer to is independent, that is, the buffers do not overlap.
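The aliasing promise can be illustrated outside the AI Engine toolchain with raw pointers. This is a minimal sketch under stated assumptions: mul_no_alias is a hypothetical helper, and __restrict is used here as the compiler extension accepted by GCC, Clang, and MSVC rather than as anything specific to the kernel compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: multiplies n int16 pairs into int32 results.
// __restrict promises that a, b, and out never overlap, so the compiler
// is free to load several inputs ahead of earlier stores to out.
// Passing overlapping buffers breaks that promise (undefined behavior).
void mul_no_alias(const std::int16_t* __restrict a,
                  const std::int16_t* __restrict b,
                  std::int32_t* __restrict out,
                  std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // Widen before multiplying so the product is held in 32 bits.
        out[i] = static_cast<std::int32_t>(a[i]) * b[i];
    }
}
```

Without the qualifier, the compiler has to assume that a store to out might change a later element of a or b, which forces loads and stores to stay in program order and limits how far loop iterations can be overlapped.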
chess_prepare_for_pipelining is a compiler pragma that explicitly directs the kernel compiler to software-pipeline the loop. Applying the directive might introduce code inlining, and can therefore affect program memory size.
The scalar version of this example function needs around 1040 cycles of execution time, while the vectorized and optimized version needs only around 84 cycles. The vectorized version of the kernel is therefore roughly 12x faster in terms of execution time.
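The arithmetic behind the comparison can be written out as a short sketch. The two cycle counts are the approximate figures quoted above; the per-iteration averages are derived here for illustration only, assume that all cycles are spread evenly over the loop iterations, and are not measurements. All names are hypothetical.

```cpp
#include <cassert>

// Approximate cycle counts quoted in the text.
constexpr double kScalarCycles = 1040.0;
constexpr double kVectorCycles = 84.0;
// Loop shape from the examples: 512 samples, 16 lanes per vector.
constexpr int kNumSamples = 512;
constexpr int kVectorLanes = 16;

// Overall speedup of the vectorized kernel over the scalar kernel.
constexpr double speedup() { return kScalarCycles / kVectorCycles; }

// Average cycles per loop iteration, with any setup overhead folded in.
constexpr double scalar_cycles_per_iteration() {
    return kScalarCycles / kNumSamples;  // 512 iterations
}
constexpr double vector_cycles_per_iteration() {
    return kVectorCycles / (kNumSamples / kVectorLanes);  // 32 iterations
}
```

With these figures the speedup is about 12.4x. The scalar loop averages about 2 cycles per element, while the vector loop averages about 2.6 cycles per 16-element iteration.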
The sections that follow describe in detail the data types that can be used, the registers available, and the kinds of optimizations that can be achieved on the AI Engine using concepts like software pipelining in loops and keywords such as __restrict.