The accumulation registers are 384 bits wide and can be viewed as eight vector lanes of 48 bits each. The idea is to hold 32-bit multiplication results and accumulate over them without overflow. The 16 guard bits allow up to 2^16 (65,536) accumulations. The output of fixed-point vector MAC and MUL intrinsic functions is stored in the accumulator registers. The following table shows the set of accumulator registers and how smaller registers are combined to form larger registers.
384-bit | 768-bit |
---|---|
aml0 | bm0 |
amh0 | |
aml1 | bm1 |
amh1 | |
aml2 | bm2 |
amh2 | |
aml3 | bm3 |
amh3 | |
The 384-bit accumulator registers are prefixed with the letters 'am'. Each low/high pair (for example, aml0 and amh0) is aliased to form a 768-bit register that is prefixed with 'bm' (bm0).
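The guard-bit arithmetic can be checked with a small host-side sketch. It is plain C++, not AI Engine code: `int64_t` stands in for one 48-bit lane, and the helper names `fits_in_acc48` and `worst_case_sum` are hypothetical, introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Signed range of one 48-bit accumulator lane.
const int64_t ACC48_MAX = (int64_t{1} << 47) - 1;
const int64_t ACC48_MIN = -(int64_t{1} << 47);

// Hypothetical helper: true if `value` is representable in a 48-bit lane.
inline bool fits_in_acc48(int64_t value) {
    return value >= ACC48_MIN && value <= ACC48_MAX;
}

// Hypothetical helper: the sum obtained by accumulating the same 32-bit
// result `count` times, computed in 64 bits so the model cannot overflow.
inline int64_t worst_case_sum(int32_t result, int64_t count) {
    assert(count >= 0);
    return static_cast<int64_t>(result) * count;
}
```

Under this model, accumulating the most negative 32-bit value 2^16 times lands exactly on the lower bound of the 48-bit range, and one more accumulation no longer fits, which is where the 2^16 figure comes from.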
The shift-round-saturate srs() intrinsic is used to move a value from an accumulator register to a vector register with any required shifting and rounding.
v8int32 res = srs(acc, 8); // shift right 8 bits, from accumulator register to vector register
The upshift ups() intrinsic is used to move a value from a vector register to an accumulator register with upshifting:
v8acc48 acc = ups(v, 8); //shift left 8 bits, from vector register to accumulator register
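To make the relationship between the two shifts concrete, here is a scalar, host-side sketch of a single lane. It is not the intrinsic implementation: `ups_model` and `srs_trunc_model` are hypothetical names, `int64_t` stands in for a 48-bit lane, and the downshift is assumed to simply drop the low bits with no rounding or saturation.

```cpp
#include <cassert>
#include <cstdint>

// Model of an upshift: widen a 32-bit lane and scale it by 2^shift.
inline int64_t ups_model(int32_t v, int shift) {
    assert(shift >= 0 && shift <= 16);  // 32 + 16 bits still fit in 48
    return static_cast<int64_t>(v) * (int64_t{1} << shift);
}

// Model of a downshift that drops the low `shift` bits
// (floor division by 2^shift); rounding and saturation are not modeled.
inline int32_t srs_trunc_model(int64_t acc, int shift) {
    assert(shift >= 0 && shift < 48);
    const int64_t d = int64_t{1} << shift;
    int64_t q = acc / d;
    if (acc % d != 0 && acc < 0) --q;   // C++ division rounds toward zero
    assert(q >= INT32_MIN && q <= INT32_MAX);
    return static_cast<int32_t>(q);
}
```

In this model, shifting up and then down by the same amount returns the original value, while anything accumulated into the low bits in between is discarded by the downshift.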
The set_rnd() and set_sat() intrinsics are used to set the rounding and saturation mode of the accumulation result, while the clr_rnd() and clr_sat() intrinsics are used to clear the rounding and saturation mode, that is, to truncate the accumulation result.
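The effect of these modes can be illustrated with a scalar model of one lane. This is a sketch under stated assumptions, not the hardware definition: the names `RoundMode` and `srs_model` are hypothetical, only two rounding behaviors are modeled (drop the low bits, or add half an LSB first), and clearing saturation is assumed to keep just the low 32 bits of the shifted value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

enum class RoundMode { Floor, HalfUp };  // hypothetical names for this model

inline int32_t srs_model(int64_t acc, int shift, RoundMode rnd, bool saturate) {
    assert(shift >= 0 && shift < 48);
    // Rounding: add half an LSB of the result before dropping the low bits.
    if (rnd == RoundMode::HalfUp && shift > 0) acc += int64_t{1} << (shift - 1);
    // Floor division by 2^shift (C++ division rounds toward zero).
    const int64_t d = int64_t{1} << shift;
    int64_t q = acc / d;
    if (acc % d != 0 && acc < 0) --q;
    // Saturation: clamp to the 32-bit range instead of wrapping.
    if (saturate) {
        if (q > INT32_MAX) q = INT32_MAX;
        if (q < INT32_MIN) q = INT32_MIN;
    }
    // Keep the low 32 bits; a no-op when q is already in range.
    const uint32_t low = static_cast<uint32_t>(q);
    int32_t out;
    std::memcpy(&out, &low, sizeof out);
    return out;
}
```

For example, with a shift of 8 the value 968 becomes 3 when the low bits are dropped and 4 when rounded, and a value of 2^31 with no shift becomes the maximum 32-bit value when saturated but wraps to the minimum when truncated.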
Note that the shifting, rounding, and saturation modes take effect only when an operation goes through the shift-round-saturate data path. Some intrinsics use only the vector pre-adder, where no shifting, rounding, or saturation mode can be configured. Such operations are adds, subs, abs, vector compares, and vector selections/shuffles. To perform a subtraction with shifting, rounding, or saturation mode configuration, it is possible to use MAC intrinsics instead. The following code performs subtraction between va and vb with mul instead of sub intrinsics.
v16cint16 va, vb;
int32 zbuff[8]={1,1,1,1,1,1,1,1};
v8int32 coeff=*(v8int32*)zbuff;
v8acc48 acc = mul8_antisym(va, 0, 0x76543210, vb, 0, false, coeff, 0 , 0x76543210);
v8int32 res = srs(acc,0);
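The reason for routing a subtraction through the multiplier can be sketched on the host with scalars. The model below assumes the antisymmetric multiply computes (a - b) * coeff at full precision in a wide accumulator, and that the move back to 32 bits saturates; `mul_antisym_model`, `saturate_to_int32`, and `sub_via_mul` are hypothetical names for this illustration, not AI Engine intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Model of one lane of an antisymmetric multiply: (a - b) * coeff,
// held in 64 bits as a stand-in for the wide accumulator lane.
inline int64_t mul_antisym_model(int32_t a, int32_t b, int32_t coeff) {
    const int64_t diff = static_cast<int64_t>(a) - static_cast<int64_t>(b);
    return diff * coeff;
}

// Model of a zero-shift move to a 32-bit lane with saturation enabled.
inline int32_t saturate_to_int32(int64_t acc) {
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return static_cast<int32_t>(acc);
}

// Subtraction expressed as a multiply by a unity coefficient.
inline int32_t sub_via_mul(int32_t a, int32_t b) {
    return saturate_to_int32(mul_antisym_model(a, b, 1));
}
```

With a coefficient of 1 the accumulator holds the exact difference, so a result that would overflow a plain 32-bit subtraction is clamped on the way out rather than wrapping.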
Floating-point intrinsic functions do not have separate accumulation registers and instead return their results in a vector register. The following streaming data APIs can be used to read and write floating-point accumulator data from or to the cascade stream.
v8float readincr_v8(input_stream_accfloat * str);
v4cfloat readincr_v4(input_stream_caccfloat * str);
void writeincr_v8(output_stream_accfloat* str, v8float value);
void writeincr_v4(output_stream_caccfloat* str, v4cfloat value);
For more information about the window and streaming data APIs, refer to Window and Streaming Data API in the AI Engine Documentation flow of the Vitis Unified Software Platform Documentation (UG1416).
The data size in memory is aligned to the next power of 2 (here 64 bits for acc48), so it is best to use sizeof to determine the position of the elements. The following code is an example that prints the contents of an accumulator vector register.
v8acc48 vacc; // cascade value
const int SIZE_ACC48=sizeof(v8acc48)/8; // bytes occupied in memory by each acc48 lane
for(int i=0;i<8;i++){ // eight lanes
    uint8 *p=(uint8*)&vacc+SIZE_ACC48*i; // start of lane i; unsigned bytes so %02x prints exactly two digits
    printf("acc value[%d]=0x",i);
    for(int j=5;j>=0;j--){ // print the six bytes of each acc48, most significant byte first
        printf("%02x",*(p+j));
    }
    printf("\n");
}
The output is as follows.
acc value[0]=0x000000000000
acc value[1]=0x000000000001
acc value[2]=0x000000000002
acc value[3]=0x000000000003
acc value[4]=0x000000000004
acc value[5]=0x000000000005
acc value[6]=0x000000000006
acc value[7]=0x000000000007
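The same 12-digit formatting can be reproduced on the host as a sketch, which is handy for comparing simulator output against a reference model. It assumes each 48-bit lane occupies the low six bytes of a 64-bit slot, as described above; `format_acc48` is a hypothetical helper, and bytes are extracted by shifting so the result does not depend on host byte order. Each byte is handled as an unsigned value, because printing a signed 8-bit value with `%02x` sign-extends any byte of 0x80 or above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format the low 48 bits of a 64-bit storage slot
// as "0x" followed by 12 hex digits, most significant byte first.
inline std::string format_acc48(int64_t slot) {
    const uint64_t bits = static_cast<uint64_t>(slot);
    char buf[16];  // "0x" + 12 hex digits + terminating NUL = 15 bytes
    int pos = std::snprintf(buf, sizeof buf, "0x");
    for (int j = 5; j >= 0; --j) {
        const unsigned byte = static_cast<unsigned>((bits >> (8 * j)) & 0xFFu);
        pos += std::snprintf(buf + pos, sizeof buf - pos, "%02x", byte);
    }
    assert(pos == 14);
    return std::string(buf);
}
```

A negative lane value such as -1 is shown as twelve `f` digits, since only the 48 bits belonging to the lane are formatted and the upper padding bytes of the slot are ignored.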