This section provides an overview of the key elements of kernel programming for scalar and vector processing elements. The details of each element and the optimizations it needs are covered in the following sections.
The following example uses only the scalar engine. It demonstrates a for loop iterating through buffers of 512 int16 elements. Each loop iteration performs a single multiply of an int16 element a with an int16 element b, storing the result in c and writing it to an output buffer.
Iterators are used to read from the input buffers and write to the output buffer. For details on the iterators, see Iterators.
void scalar_mul(input_buffer<int16>& __restrict data1,
                input_buffer<int16>& __restrict data2,
                output_buffer<int32>& __restrict out){
    // iterators to access inputs "data1" and "data2"
    auto inIter1=aie::begin(data1);
    auto inIter2=aie::begin(data2);
    // iterator to access output "out"
    auto outIter=aie::begin(out);
    for(int i=0;i<512;i++) {
        // read data from buffer and point to next entry
        int16 a=*inIter1++;
        int16 b=*inIter2++;
        int32 c=a*b;
        // write result to buffer and point to next entry
        *outIter++=c;
    }
}
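To check the kernel's output on a host machine, a plain C++ reference model can be handy. The following is a minimal sketch, not part of the AI Engine API: the name scalar_mul_ref and the use of std::vector are assumptions made for illustration. It mirrors the loop above and shows why the output type is int32: a product such as 300 * 300 = 90000 does not fit in an int16.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference model (not an AI Engine API function):
// multiplies two int16 sequences element by element into int32 results.
std::vector<std::int32_t> scalar_mul_ref(const std::vector<std::int16_t>& a,
                                         const std::vector<std::int16_t>& b) {
    // Process only as many elements as both inputs provide.
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    std::vector<std::int32_t> c(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Widen both operands so the full 32-bit product is kept.
        c[i] = static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return c;
}
```

Calling scalar_mul_ref with the same 512 input samples that are fed to the kernel gives a sequence that can be compared element by element with the kernel's output buffer.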
The following example is a vectorized version of the same kernel, executed on the vector processor.
void vect_mul(input_buffer<int16>& __restrict data1,
              input_buffer<int16>& __restrict data2,
              output_buffer<int32>& __restrict out){
    // iterator to access a vector (a collection of elements)
    // in the buffer "data1"
    auto inIter1=aie::begin_vector<16>(data1);
    // iterator to access a vector (a collection of elements)
    // in the buffer "data2"
    auto inIter2=aie::begin_vector<16>(data2);
    // iterator to access a vector (a collection of elements)
    // in the buffer "out"
    auto outIter=aie::begin_vector<16>(out);
    for(int i=0;i<512/16;i++)
    chess_prepare_for_pipelining {
        // read 16 elements from the buffer and point to the next entry
        auto va=*inIter1++;
        auto vb=*inIter2++;
        // element-by-element multiplication, with results
        // in an accumulator register
        auto vt=aie::mul(va,vb);
        // move data from accumulator register to vector register
        // with a shift of zero, and transfer to output buffer;
        // increment the iterator to point to the next entry
        *outIter++=vt.to_vector<int32>(0);
    }
}
The iterators used in this vectorized version are vector iterators that read one aie::vector<int16,16> at a time. A for loop iterates through buffers of 512 int16 elements, sixteen samples at a time, for 512/16 = 32 iterations. Each loop iteration multiplies the sixteen int16 elements of va with the sixteen int16 elements of vb, storing the result in vt and writing it to an output buffer. The output of aie::mul is an accumulator vector. The to_vector function converts it (here with a shift of zero) to a value of type aie::vector<int32,16>, which is written to the output buffer. Details on the data types supported by the AI Engine-ML are covered in the following sections.
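The lane-wise behavior described above can be modeled in portable C++ for host-side checking. This is only a sketch under stated assumptions: Vec16, Acc16, mul_lanes, and vect_mul_model are hypothetical names, std::array stands in for the vector and accumulator registers, and nothing here uses or reproduces the AI Engine API. It processes the input in blocks of sixteen, as the vectorized loop does.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 16;  // sixteen int16 lanes per block

// Hypothetical stand-ins for a 16-lane vector and its 32-bit accumulator.
using Vec16 = std::array<std::int16_t, kLanes>;
using Acc16 = std::array<std::int32_t, kLanes>;

// Element-by-element multiply of two 16-lane blocks (models one iteration).
Acc16 mul_lanes(const Vec16& va, const Vec16& vb) {
    Acc16 acc{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        acc[lane] = static_cast<std::int32_t>(va[lane]) *
                    static_cast<std::int32_t>(vb[lane]);
    }
    return acc;
}

// Walks both inputs one 16-element block at a time and stores the products.
std::vector<std::int32_t> vect_mul_model(const std::vector<std::int16_t>& a,
                                         const std::vector<std::int16_t>& b) {
    assert(a.size() == b.size());
    assert(a.size() % kLanes == 0);  // whole blocks only
    std::vector<std::int32_t> out(a.size());
    for (std::size_t i = 0; i < a.size() / kLanes; ++i) {
        const std::size_t base = i * kLanes;
        Vec16 va{};
        Vec16 vb{};
        std::copy_n(a.data() + base, kLanes, va.begin());  // load block of a
        std::copy_n(b.data() + base, kLanes, vb.begin());  // load block of b
        const Acc16 vt = mul_lanes(va, vb);
        std::copy(vt.begin(), vt.end(), out.data() + base);  // store block
    }
    return out;
}
```

For 512-element inputs the outer loop runs 32 times, matching the 512/16 bound in the kernel. The model only captures the arithmetic and the block structure, not the timing of the vector processor.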
The __restrict keyword used on the input and output parameters of the functions allows for more aggressive compiler optimization by explicitly stating that the data they refer to is independent. chess_prepare_for_pipelining is a compiler pragma that explicitly directs the kernel compiler to pipeline the loop. The directive might cause code to be inlined and can therefore affect program memory size.
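The effect of __restrict can be illustrated outside the AI Engine toolchain. The sketch below is an assumption-laden illustration rather than kernel code: mul_no_alias is a hypothetical helper, and it relies on __restrict being accepted as a compiler extension for pointers in C++ (as GCC, Clang, and MSVC do). The chess_prepare_for_pipelining directive is specific to the kernel compiler and is not modeled here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the caller promises that a, b, and out never overlap.
// With that promise, the compiler does not have to assume that a store to
// out[i] could change a later load from a[] or b[], so it is free to reorder
// or batch loads and stores across iterations.
void mul_no_alias(const std::int16_t* __restrict a,
                  const std::int16_t* __restrict b,
                  std::int32_t* __restrict out,
                  std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<std::int32_t>(a[i]) *
                 static_cast<std::int32_t>(b[i]);
    }
}
```

The qualifier is a promise, not a check: passing overlapping buffers to such a function results in undefined behavior, so it should be used only when the buffers are known to be distinct.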
The scalar version of this example function needs around 1040 cycles of execution time, while the vectorized and optimized version needs only around 84 cycles. That is roughly a 12x speedup in execution time (1040 / 84 ≈ 12.4) for the vectorized version of the kernel.
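The cycle counts above can be turned into a few derived figures. The following sketch only restates the arithmetic. The constant names are hypothetical, and because the cycle counts are approximate, the derived values are approximate too.

```cpp
// Approximate cycle counts quoted in the text.
constexpr double kScalarCycles = 1040.0;
constexpr double kVectorCycles = 84.0;

// Loop shape taken from the two kernels: 512 elements, 16 per vector.
constexpr int kElements = 512;
constexpr int kLanesPerVector = 16;
constexpr int kVectorIterations = kElements / kLanesPerVector;  // 32

// Overall speedup of the vectorized kernel: 1040 / 84, about 12.4.
constexpr double kSpeedup = kScalarCycles / kVectorCycles;

// Average cost per scalar loop iteration: 1040 / 512 = 2.03125 cycles.
constexpr double kScalarCyclesPerElement = kScalarCycles / kElements;

// Average cost per vector loop iteration: 84 / 32 = 2.625 cycles,
// which includes any loop setup overhead spread over the iterations.
constexpr double kVectorCyclesPerIteration = kVectorCycles / kVectorIterations;

static_assert(kVectorIterations == 32, "512 elements in blocks of 16");
static_assert(kSpeedup > 12.0 && kSpeedup < 12.5, "about a 12x speedup");
```

Read this way, each vector iteration costs about as many cycles as one scalar iteration while producing sixteen results instead of one.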
The sections that follow describe in detail the various data types that can be used, the registers available, and the kinds of optimizations that can be achieved on the AI Engine-ML using concepts like software pipelining in loops and keywords like __restrict.