The scalar unit floating-point hardware support includes square root,
inverse square root, inverse, absolute value, minimum, and maximum. It supports other
floating-point operations through emulation. The softfloat
library must be linked in for test benches and kernel code using
emulation. For math library functions, the single precision float version must be used
(for example, use expf()
instead of exp()
).
The AI Engine vector unit provides eight lanes of single-precision floating-point multiplication and accumulation. The unit reuses the vector register files and permute network of the fixed-point data path. In general, only one vector instruction per cycle can be performed in fixed-point or floating-point.
Floating-point MACs have a latency of two-cycles, thus, using two accumulators in a ping-pong manner helps performance by allowing the compiler to schedule a MAC on each clock cycle.
acc0 = fpmac( acc0, abuff, 1, 0x0, bbuff, 0, 0x76543210 );
acc1 = fpmac( acc1, abuff, 9, 0x0, bbuff, 0, 0x76543210 );
There are no divide scalar or vector intrinsic functions at this time. However, vector division can be implemented via an inverse and multiply as shown in the following example.
invpi = upd_elem(invpi, 0, inv(pi));
acc = fpmul(concat(acc, undef_v8float()), 0, 0x76543210, invpi, 0, 0);
A similar implementation can be done for the vectors sqrt
, invsqrt
, and sincos
.