Introduction to Scalar and Vector Programming - 2025.2 English - UG1079

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2025-11-26
Version: 2025.2 English

This section provides an overview of the key elements of kernel programming for the scalar and vector processing units. The following sections describe each element in detail and cover optimization techniques.

The following example uses only the scalar engine. A for loop iterates through 512 int32 elements. Each iteration multiplies int32 a by int32 b, stores the result in c, and writes it to an output buffer. The scalar_mul kernel reads two input buffers of type input_buffer<int32> and produces one output buffer of type output_buffer<int32>.

You can access buffers through scalar and vector iterators. For additional details on the buffer APIs, see Streaming Data API.

void scalar_mul(input_buffer<int32> & data1,
                input_buffer<int32> & data2,
                output_buffer<int32> & out){
    auto pin1 = aie::begin(data1);
    auto pin2 = aie::begin(data2);
    auto pout = aie::begin(out);
    for(int i=0;i<512;i++)
    {
        int32 a=*pin1++;
        int32 b=*pin2++;
        int32 c=a*b;
        *pout++ = c;
    }
}

The following example is a vectorized version of the same kernel.

void vect_mul(input_buffer<int32> & __restrict data1,
              input_buffer<int32> & __restrict data2,
              output_buffer<int32> & __restrict out){
    auto pin1 = aie::begin_vector<8>(data1);
    auto pin2 = aie::begin_vector<8>(data2);
    auto pout = aie::begin_vector<8>(out);
    for(int i=0;i<64;i++)
    chess_prepare_for_pipelining
    {
        aie::vector<int32,8> va=*pin1++;
        aie::vector<int32,8> vb=*pin2++;
        aie::accum<acc80,8> vt=mul(va,vb);
        aie::vector<int32,8> vc=srs(vt,0);
        *pout++ = vc;
    }
}

The previous kernel code uses the data types vector<int32,8> and accum<acc80,8>. The buffer API begin_vector<8> returns an iterator over vectors of 8 int32 values; dereferencing it yields the vectors stored in va and vb. These two vector variables are passed to the intrinsic function mul, which returns vt, an accum<acc80,8>. The shift-round-saturate function srs then reduces the accum<acc80,8> value to a vector<int32,8>, vc, which is written to the output buffer. You can find additional details about supported AI Engine data types in the following sections.
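To make the data flow through mul and srs concrete, the following is a minimal host-side sketch in portable C++, not AI Engine code. The names VecModel, AccModel, mul_model, and srs_model are hypothetical stand-ins introduced here for illustration. The sketch assumes that a 64-bit lane is an adequate model of an acc80 lane for int32 products, that rounding adds half of the discarded range before shifting, and that saturation is enabled; on hardware, the rounding and saturation behavior depends on the configured modes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <limits>

constexpr int kLanes = 8;  // matches the 8-lane int32 vectors in vect_mul

// Hypothetical stand-ins for aie::vector<int32,8> and aie::accum<acc80,8>.
// A 64-bit lane holds any int32 x int32 product exactly.
using VecModel = std::array<std::int32_t, kLanes>;
using AccModel = std::array<std::int64_t, kLanes>;

// Lane-wise multiply into the wider accumulator type (models mul).
AccModel mul_model(const VecModel& a, const VecModel& b) {
    AccModel acc{};
    for (int i = 0; i < kLanes; ++i) {
        acc[i] = static_cast<std::int64_t>(a[i]) * static_cast<std::int64_t>(b[i]);
    }
    return acc;
}

// Shift, round, and saturate each lane back to int32 (models srs).
// Assumption: rounding adds half of the discarded range before the shift,
// and the right shift of a negative value is arithmetic (true on mainstream
// compilers, guaranteed from C++20).
VecModel srs_model(const AccModel& acc, int shift) {
    assert(shift >= 0 && shift < 32);
    constexpr std::int64_t lo = std::numeric_limits<std::int32_t>::min();
    constexpr std::int64_t hi = std::numeric_limits<std::int32_t>::max();
    VecModel out{};
    for (int i = 0; i < kLanes; ++i) {
        std::int64_t v = acc[i];
        if (shift > 0) {
            v = (v + (std::int64_t{1} << (shift - 1))) >> shift;
        }
        if (v < lo) v = lo;  // saturate instead of wrapping
        if (v > hi) v = hi;
        out[i] = static_cast<std::int32_t>(v);
    }
    return out;
}
```

With a shift of 0, as in vect_mul, the model passes through products that fit in int32 and clamps those that do not. A call such as srs_model(mul_model(a, b), 0) mirrors the two statements that produce vt and vc in the kernel.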

The __restrict keyword on the input and output parameters of the vect_mul function allows more aggressive compiler optimization by explicitly stating that the data they refer to is independent.
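The effect of __restrict can be illustrated outside the AI Engine toolchain with plain pointers. The following is a sketch only: mul_arrays is a hypothetical helper, not part of the AI Engine API, and it relies on __restrict being accepted as an extension by the host compiler (GCC, Clang, and MSVC accept it).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of the AI Engine API.
// __restrict promises the compiler that, inside this function, the three
// arrays do not overlap. Breaking that promise is undefined behavior.
void mul_arrays(const std::int32_t* __restrict in1,
                const std::int32_t* __restrict in2,
                std::int32_t* __restrict out,
                std::size_t n) {
    assert(in1 != nullptr && in2 != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        // Without the promise, the compiler must assume that a store to
        // out[i] could change a later in1[j] or in2[j], which limits how
        // freely it can reorder or overlap loads and stores.
        out[i] = in1[i] * in2[i];
    }
}
```

The qualifier does not change the result for non-overlapping arrays; it only removes an aliasing constraint, so the caller is responsible for passing buffers that really are independent.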

chess_prepare_for_pipelining is a compiler pragma that directs the kernel compiler to generate an optimized software pipeline for the loop.

The scalar version of this example function takes 1055 cycles, while the vectorized version takes only 99 cycles, a speedup of more than 10x. Vector processing alone provides 8x the throughput for int32 multiplication, but because of its higher latency it does not deliver the full 8x on its own. With the loop optimizations applied, however, the overall speedup reaches roughly 10x. The following sections describe the data types and available registers in detail. They also explain AI Engine optimizations, including software pipelining in loops and keywords such as __restrict.
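The cycle counts above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this section (512 elements, 8 lanes, 1055 and 99 cycles); the constant and function names are hypothetical and exist only for this illustration. The per-element and per-iteration averages include any loop setup overhead, which this section does not break out separately.

```cpp
#include <cassert>

// Figures quoted in this section.
constexpr int kNumElements  = 512;
constexpr int kVectorLanes  = 8;
constexpr int kScalarCycles = 1055;
constexpr int kVectorCycles = 99;

// Trip count of the vectorized loop: 512 elements in groups of 8.
constexpr int kVectorIterations = kNumElements / kVectorLanes;
static_assert(kVectorIterations == 64, "matches the i<64 loop bound in vect_mul");

double ratio(int numerator, int denominator) {
    assert(denominator != 0);
    return static_cast<double>(numerator) / denominator;
}

// 1055 / 99, about 10.66
double overall_speedup() { return ratio(kScalarCycles, kVectorCycles); }

// 1055 / 512, about 2.06 cycles per element in the scalar loop
double scalar_cycles_per_element() { return ratio(kScalarCycles, kNumElements); }

// 99 / 64, about 1.55 cycles per 8-element iteration in the vector loop
double vector_cycles_per_iteration() { return ratio(kVectorCycles, kVectorIterations); }
```

Read this way, the scalar loop averages about two cycles per element, while the pipelined vector loop averages about one and a half cycles per group of eight elements, which is consistent with an overall speedup above the 8x lane count.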