Floating-Point Operations - 2024.1 English

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Document ID: UG1603
Release Date: 2024-06-06
Version: 2024.1 English

The scalar unit of the AI Engine-ML device has no floating-point hardware support. The AI Engine API supports floating-point operations through emulation using bfloat16. The emulation precision used for these operations is selected with the flag --aie.float-accuracy safe|fast|low:

  • safe: Accuracy is slightly better than FP32.
  • fast: Improved performance with accuracy similar to FP32.
  • low: Best performance with accuracy better than FP16 and bfloat16.
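
The following example shows scalar conversions between float and bfloat16, conversions to and from fixed-point with aie::to_fixed and aie::to_float, and the emulated square root, inverse, and inverse square root operations:
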
float f=2.0f;
bfloat16 bf=(bfloat16)f; // float to bfloat16
float f2=(float)bf;      // bfloat16 to float

int32 ia=aie::to_fixed(f);      // float to int32
int16 ib=aie::to_fixed(bf);     // bfloat16 to int16
float f3=aie::to_float(ib);     // int16 to float
bfloat16 bf2=aie::to_float(ib); // int16 to bfloat16

float sqrt_f=aie::sqrt(f);            // square root
bfloat16 inv_bf=aie::inv(bf);         // reciprocal
bfloat16 invsqrt_bf=aie::invsqrt(bf); // reciprocal square root

The AI Engine API also supports conversion from complex integer to complex floating-point with aie::to_float:

cint16 c16;
cint32 c32;
cfloat cf=aie::to_float(c16,0);//shift=0
cfloat cf2=aie::to_float(c32,2);//shift=2

aie::vector<cint16,16> vc16;
aie::vector<cint32,16> vc32;
auto vcf=aie::to_float(vc16,0);//shift=0
auto vcf2=aie::to_float(vc32,2);//shift=2

The AI Engine API supports conversion between floating-point accumulator registers and bfloat16 vector registers. It also supports floating-point and bfloat16 to integer conversion:

aie::accum<accfloat,16> vf;
aie::vector<bfloat16,16> vbf=vf.to_vector<bfloat16>(); // accumulator to bfloat16 vector
aie::accum<accfloat,16> vf2;
vf2.from_vector(vbf,0);                                // bfloat16 vector to accumulator, shift=0

aie::vector<float,16> vf3;
aie::vector<int32,16> vi=aie::to_fixed<int32>(vf3,0);  // float vector to int32 vector
auto vi2=aie::to_fixed<int32>(vbf,0);                  // bfloat16 vector to int32 vector
Note: The currently active rounding mode is used during the integer conversion.
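
As a minimal sketch (assuming the aie::set_rounding and aie::rounding_mode facilities of the AIE API are available in this toolchain version), the rounding mode can be selected before performing the conversion:

aie::set_rounding(aie::rounding_mode::conv_even);       // example choice: round to nearest even
aie::vector<float,16> vf4;
aie::vector<int32,16> vi3=aie::to_fixed<int32>(vf4,0);  // conversion uses the selected rounding mode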

The AI Engine API supports multiple lanes of bfloat16 multiplication and accumulation, and it supports single-precision floating-point multiplication and accumulation through emulation. The unit reuses the vector register files and permute network of the fixed-point data path. In general, only one vector instruction, either fixed-point or floating-point, can be issued per cycle.
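
As a brief sketch of the emulated single-precision path (the variable names below are placeholders, not part of the original example), the same aie::mac call accepts float vectors with an accfloat accumulator:

aie::vector<float,16> vfa,vfb;                           // single-precision inputs
aie::accum<accfloat,16> facc=aie::zeros<accfloat,16>();  // floating-point accumulator
facc=aie::mac(facc,vfa,vfb);                             // emulated FP32 multiply-accumulate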

Floating-point MACs have a latency of two cycles; therefore, using two accumulators in a ping-pong manner improves performance.

// data1 and data2 are bfloat16 input buffers; out is the output buffer.
auto ita=aie::begin_vector<16>(data1);
auto itb=aie::begin_vector<16>(data2);
auto ito=aie::begin(out);

aie::accum<accfloat,16> acc1=aie::zeros<accfloat,16>();
aie::accum<accfloat,16> acc2=aie::zeros<accfloat,16>();
aie::vector<bfloat16,16> va,vb;
for(int i=0;i<32;i++) chess_prepare_for_pipelining {
  // Alternate between the two accumulators so that consecutive MACs belong
  // to independent dependency chains.
  va=*ita++;
  vb=*itb++;
  acc1=aie::mac(acc1,va,vb);
  va=*ita++;
  vb=*itb++;
  acc2=aie::mac(acc2,va,vb);
}
auto acc=aie::add(acc1,acc2);                       // combine the two partial accumulators
auto sum=aie::reduce_add(acc.to_vector<float>(0));  // horizontal sum of the 16 lanes
*ito=(bfloat16)sum;                                 // store the scalar result as bfloat16
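
Because the two dependency chains alternate, back-to-back aie::mac operations target different accumulators, which hides the two-cycle MAC latency inside the loop; the partial sums are combined with aie::add and reduced to a scalar only once, after the loop.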