The scalar unit of an AIE-ML / AIE-ML v2 device has no floating-point hardware support. The AI Engine API supports floating-point operations through emulation using bfloat16. The precision used for the emulated operations is set by the flag --aie.float-accuracy safe|fast|low:

- safe: Accuracy similar to FP32.
- fast: (default) Improved performance with slightly lower accuracy than FP32.
- low: Best performance, with accuracy still better than FP16 and bfloat16.
float f = 2.0f;
bfloat16 bf = (bfloat16)f;               // float -> bfloat16
float f2 = (float)bf;                    // bfloat16 -> float
int32 ia = aie::to_fixed(f);             // float -> fixed-point
int16 ib = aie::to_fixed(bf);            // bfloat16 -> fixed-point
float f3 = aie::to_float(ib);            // fixed-point -> float
bfloat16 bf2 = aie::to_float(ib);        // fixed-point -> bfloat16
float sqrt_f = aie::sqrt(f);             // square root
bfloat16 inv_bf = aie::inv(bf);          // reciprocal
bfloat16 invsqrt_bf = aie::invsqrt(bf);  // reciprocal square root
The AI Engine API also supports conversion from complex integer to complex floating-point with aie::to_float:

cint16 c16;
cint32 c32;
cfloat cf = aie::to_float(c16, 0);   // shift = 0
cfloat cf2 = aie::to_float(c32, 2);  // shift = 2
aie::vector<cint16,16> vc16;
aie::vector<cint32,16> vc32;
auto vcf = aie::to_float(vc16, 0);   // shift = 0
auto vcf2 = aie::to_float(vc32, 2);  // shift = 2
The AI Engine API supports conversion between floating-point accumulator registers and bfloat16 vector registers, as well as floating-point and bfloat16 to integer conversion:

aie::accum<accfloat,16> vf;
aie::vector<bfloat16,16> vbf = vf.to_vector<bfloat16>(); // accumulator -> bfloat16 vector
aie::accum<accfloat,16> vf2;
vf2.from_vector(vbf, 0);                                 // bfloat16 vector -> accumulator
aie::vector<float,16> vf3;
aie::vector<int32,16> vi = aie::to_fixed<int32>(vf3, 0); // float vector -> int32 vector
auto vi2 = aie::to_fixed<int32>(vbf, 0);                 // bfloat16 vector -> int32 vector
The AI Engine API supports multiple lanes of bfloat16 multiplication and accumulation, and supports single-precision floating-point multiplication and accumulation through emulation. The floating-point unit reuses the vector register files and permute network of the fixed-point data path, so in general only one vector instruction per cycle can be issued, whether fixed-point or floating-point. Floating-point MACs have a latency of two cycles; using two accumulators in a ping-pong manner therefore helps performance:
auto ita = aie::begin_vector<16>(data1);
auto itb = aie::begin_vector<16>(data2);
auto ito = aie::begin(out);

// Two independent accumulators hide the two-cycle MAC latency.
aie::accum<accfloat,16> acc1 = aie::zeros<accfloat,16>();
aie::accum<accfloat,16> acc2 = aie::zeros<accfloat,16>();
aie::vector<bfloat16,16> va, vb;

for (int i = 0; i < 32; i++) chess_prepare_for_pipelining {
    va = *ita++;
    vb = *itb++;
    acc1 = aie::mac(acc1, va, vb); // even chunk -> acc1
    va = *ita++;
    vb = *itb++;
    acc2 = aie::mac(acc2, va, vb); // odd chunk -> acc2
}

auto acc = aie::add(acc1, acc2);                     // combine the two partial sums
auto sum = aie::reduce_add(acc.to_vector<float>(0)); // horizontal reduction
*ito = (bfloat16)sum;