This section provides an overview of the key elements of kernel programming for the scalar and vector processing elements. The details of each element and the related optimization techniques are covered in the following sections.
The following example uses only the scalar engine. It demonstrates a for loop iterating through 512 int32 elements. Each loop iteration performs a single multiply of int32 a and int32 b, stores the result in c, and writes it to an output window.
The scalar_mul kernel operates on two input windows of data (input_window_int32) and produces an output window of data (output_window_int32). The APIs window_readincr and window_writeincr are used to read from and write to the circular buffers outside the kernel. For additional details on the window APIs, see Window and Streaming Data API in the AI Engine Tools and Flows User Guide (UG1076).
void scalar_mul(input_window_int32* data1,
                input_window_int32* data2,
                output_window_int32* out){
  for(int i=0;i<512;i++)
  {
    int32 a=window_readincr(data1);
    int32 b=window_readincr(data2);
    int32 c=a*b;
    window_writeincr(out,c);
  }
}
The following example is a vectorized version of the same kernel.
void vect_mul(input_window_int32* __restrict data1,
              input_window_int32* __restrict data2,
              output_window_int32* __restrict out){
  for(int i=0;i<64;i++)
    chess_prepare_for_pipelining
  {
    v8int32 va=window_readincr_v8(data1); // read 8 int32 values from each input window
    v8int32 vb=window_readincr_v8(data2);
    v8acc80 vt=mul(va,vb);                // 8 parallel multiplies into 80-bit accumulator lanes
    v8int32 vc=srs(vt,0);                 // shift-round-saturate back to 8 int32 values
    window_writeincr(out,vc);             // write 8 results and advance the output window
  }
}
Note the data types v8int32 and v8acc80 used in the previous kernel code. The window API window_readincr_v8 returns a vector of eight int32 values, which are stored in the vector variables va and vb. These two vector variables are passed to the intrinsic function mul, which produces vt of type v8acc80, that is, eight 80-bit accumulator lanes holding the full-precision products. The shift-round-saturate intrinsic srs reduces the v8acc80 value to a v8int32 (here with a shift of zero), producing the variable vc, which is then written to the output window. Additional details on the data types supported by the AI Engine are covered in the following sections.
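The second argument of srs selects how far each accumulator lane is shifted right before rounding and saturating to int32. The following sketch is a hypothetical variant of the kernel above (the function name vect_mul_scaled and the shift value of 6 are illustrative assumptions, not part of the original example) that scales each product down before it is written out:
void vect_mul_scaled(input_window_int32* __restrict data1,
                     input_window_int32* __restrict data2,
                     output_window_int32* __restrict out){
  for(int i=0;i<64;i++)
    chess_prepare_for_pipelining
  {
    v8int32 va=window_readincr_v8(data1);
    v8int32 vb=window_readincr_v8(data2);
    v8acc80 vt=mul(va,vb);   // full-precision products in 80-bit accumulator lanes
    v8int32 vc=srs(vt,6);    // shift right by 6, round, and saturate to int32
    window_writeincr(out,vc);
  }
}
A nonzero shift like this is the usual way to bring fixed-point products back into the output range without overflowing.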
The __restrict keyword used on the input and output parameters of the vect_mul function allows more aggressive compiler optimization by explicitly stating that the data accessed through each pointer is independent of the other pointers. chess_prepare_for_pipelining is a compiler pragma that directs the kernel compiler to build an optimized software pipeline for the loop.
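These annotations are not limited to vector code. As a minimal sketch (the function name scalar_mul_pipelined is hypothetical), the same pointer-independence and pipelining hints can be applied to the scalar kernel shown earlier:
void scalar_mul_pipelined(input_window_int32* __restrict data1,
                          input_window_int32* __restrict data2,
                          output_window_int32* __restrict out){
  for(int i=0;i<512;i++)
    chess_prepare_for_pipelining
  {
    int32 a=window_readincr(data1); // __restrict promises these reads do not alias the writes to out
    int32 b=window_readincr(data2);
    window_writeincr(out,a*b);      // the pragma lets the compiler overlap iterations of this loop
  }
}
Even so, the large gains come from vectorization; the annotations mainly remove obstacles that would otherwise prevent the compiler from scheduling the loop tightly.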
The scalar version of this example function takes 1055 cycles, while the vectorized version takes only 99 cycles, a speedup of more than 10x. Vector processing by itself provides 8x the throughput for int32 multiplication, but the vector datapath has higher latency, so without further optimization the kernel would not reach an 8x speedup overall. With the loop optimizations applied, however, it achieves the better-than-10x speedup observed here. The sections that follow describe in detail the various data types that can be used, the registers available, and the kinds of optimizations that can be achieved on the AI Engine using concepts like software pipelining in loops and keywords like __restrict.