This section provides an overview of the key elements of kernel programming for the scalar and vector processing units. Each element and the associated optimization techniques are covered in detail in the sections that follow.
The following example uses only the scalar engine. It demonstrates a for loop iterating through 512 int32 elements. Each iteration multiplies one int32 value a by one int32 value b, stores the result in c, and writes it to the output buffer. The scalar_mul kernel operates on two input buffers of data, input_buffer<int32>, and produces an output buffer of data, output_buffer<int32>. Buffers are accessed through scalar and vector iterators. For additional details on the buffer APIs, see Streaming Data API.
void scalar_mul(input_buffer<int32> &data1,
                input_buffer<int32> &data2,
                output_buffer<int32> &out) {
  auto pin1 = aie::begin(data1);  // scalar iterators over the buffers
  auto pin2 = aie::begin(data2);
  auto pout = aie::begin(out);
  for (int i = 0; i < 512; i++) {
    int32 a = *pin1++;
    int32 b = *pin2++;
    int32 c = a * b;              // one scalar multiply per iteration
    *pout++ = c;
  }
}
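The kernel's element-wise behavior can be checked on the host with an ordinary C++ reference model. The sketch below is not AIE code; `scalar_mul_ref` is a hypothetical helper that mirrors the multiply loop on `std::vector` inputs so results from the device can be compared against it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference model (not AIE API code): performs the same
// element-wise int32 multiply as the scalar_mul kernel.
std::vector<int32_t> scalar_mul_ref(const std::vector<int32_t> &a,
                                    const std::vector<int32_t> &b) {
  std::vector<int32_t> c(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    c[i] = a[i] * b[i];  // one multiply per element, as in the kernel loop
  return c;
}
```

A reference model like this is useful in a test bench that compares the output buffer produced by the AI Engine against known-good host results.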
The following example is a vectorized version for the same kernel.
void vect_mul(input_buffer<int32> & __restrict data1,
              input_buffer<int32> & __restrict data2,
              output_buffer<int32> & __restrict out) {
  auto pin1 = aie::begin_vector<8>(data1);  // iterate 8 int32 lanes at a time
  auto pin2 = aie::begin_vector<8>(data2);
  auto pout = aie::begin_vector<8>(out);
  for (int i = 0; i < 64; i++)              // 512 elements / 8 lanes
    chess_prepare_for_pipelining
  {
    aie::vector<int32,8> va = *pin1++;
    aie::vector<int32,8> vb = *pin2++;
    aie::accum<acc80,8> vt = mul(va, vb);   // 8 multiplies into 80-bit lanes
    aie::vector<int32,8> vc = srs(vt, 0);   // shift-round-saturate to int32
    *pout++ = vc;
  }
}
Note the data types vector<int32,8> and accum<acc80,8> used in the previous kernel code. The buffer API begin_vector<8> returns an iterator that steps over vectors of eight int32 values, which are stored in the variables va and vb. These two vector variables are passed to the intrinsic function mul, which outputs vt, an accum<acc80,8> data type. The accum<acc80,8> type is reduced by the shift-round-saturate function srs, which returns a vector<int32,8> value in the variable vc that is then written to the output buffer. Additional details on the data types supported by the AI Engine are covered in the following sections.
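The effect of shift-round-saturate can be illustrated with a plain C++ model of one accumulator lane. This is a sketch, not the actual srs intrinsic: `srs_model` is a hypothetical name, int64 stands in for the 80-bit accumulator lane (one int32×int32 product fits in 64 bits), and round-half-up rounding is assumed for illustration.

```cpp
#include <algorithm>
#include <climits>
#include <cstdint>

// Illustrative model of shift-round-saturate on one accumulator lane
// (assumption: round-half-up; the real hardware lane is 80 bits wide).
int32_t srs_model(int64_t acc, unsigned shift) {
  if (shift > 0)
    acc = (acc + (int64_t{1} << (shift - 1))) >> shift;  // round, then shift
  acc = std::clamp<int64_t>(acc, INT32_MIN, INT32_MAX);  // saturate to int32
  return static_cast<int32_t>(acc);
}
```

With a shift of 0, as in the vect_mul kernel, the function only saturates: a product that fits in int32 passes through unchanged, while an overflowing one clamps to INT32_MAX or INT32_MIN instead of wrapping.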
The __restrict keyword used on the input and output parameters of the vect_mul function allows more aggressive compiler optimization by explicitly stating that the buffers do not alias each other. chess_prepare_for_pipelining is a compiler pragma that directs the kernel compiler to produce an optimized pipeline for the loop.
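The aliasing issue __restrict addresses can be sketched with plain pointer-based loops. __restrict is a compiler extension (accepted by GCC and Clang, and by the AI Engine compiler); both functions below compute the same result, but only the second promises the compiler that the three pointers never overlap.

```cpp
// Without restrict: the compiler must assume out may alias a or b, so
// each store to out[i] can invalidate previously loaded input values.
void mul_norestrict(const int *a, const int *b, int *out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = a[i] * b[i];
}

// With restrict: no aliasing is assumed, so loads can be hoisted and the
// loop pipelined or vectorized more aggressively.
void mul_restrict(const int *__restrict a, const int *__restrict b,
                  int *__restrict out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = a[i] * b[i];
}
```

Note that __restrict is a promise made by the programmer: if the buffers do overlap at run time, the behavior is undefined, so it should be applied only when independence is guaranteed, as it is for separate kernel I/O buffers.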
The scalar version of this example takes 1055 cycles, while the vectorized version takes only 99 cycles, a speedup of more than 10x. Vector processing by itself provides 8x the throughput for int32 multiplication, but it has a higher latency and would not reach 8x throughput overall; with the loop optimizations applied, however, the kernel achieves close to 10x. The sections that follow describe in detail the data types that can be used, the registers available, and the kinds of optimizations that can be achieved on the AI Engine using concepts such as software pipelining in loops and keywords such as __restrict.