In Part 2a, we examined the generated assembler code and found a NOP (no operation) between the VFPMAC (vector floating-point multiply accumulate) mnemonics. This NOP is unavoidable because a floating-point accumulation requires two cycles (see the figure "Pipeline Diagram of AI Engine Fixed-point Vector Unit Multiplication and Upshift Paths" in AM009).
We can split the matrix-vector multiplication into two separate multiply accumulate operations to perform a floating-point accumulation on each cycle.
Note: Use the multiply accumulate API to scale each matrix column by the corresponding vector element, rather than multiplying each matrix row by the column vector.
Thus, splitting the vector additions into even and odd parts allows us to perform two independent multiply accumulate operations, one over the even-indexed columns and one over the odd-indexed columns.
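The even/odd split can be sketched in plain, scalar C++ as follows. This is only an illustration of the data dependencies, not the AI Engine kernel itself: the sizes `kRows` and `kCols`, the `Column` alias, and the function `matvec_even_odd` are hypothetical names chosen for this sketch, and the AI Engine API vector types are deliberately left out. Each matrix column is scaled by its vector element and added to one of two accumulators, so consecutive accumulations never depend on each other.

```cpp
#include <array>
#include <cstddef>

// Hypothetical sizes, chosen for illustration only.
constexpr std::size_t kRows = 8;
constexpr std::size_t kCols = 12;
static_assert(kCols % 2 == 0, "this sketch assumes an even number of columns");

using Column = std::array<float, kRows>;

// Computes y = sum_j x[j] * cols[j] with one accumulator for the even-indexed
// columns and one for the odd-indexed columns.
Column matvec_even_odd(const std::array<Column, kCols>& cols,
                       const std::array<float, kCols>& x) {
    Column acc_even{};  // zero-initialized
    Column acc_odd{};   // zero-initialized
    for (std::size_t j = 0; j < kCols; j += 2) {
        for (std::size_t i = 0; i < kRows; ++i) {
            // The two updates are independent of each other, so one does not
            // have to wait for the other to complete.
            acc_even[i] += x[j] * cols[j][i];
            acc_odd[i] += x[j + 1] * cols[j + 1][i];
        }
    }
    // A single final addition combines the two partial sums.
    Column y{};
    for (std::size_t i = 0; i < kRows; ++i) {
        y[i] = acc_even[i] + acc_odd[i];
    }
    return y;
}
```

Because the partial sums are combined in a different order than in a single running sum, the floating-point result may differ from the unsplit version in the last bits.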
Also, the AI Engine has two load units, which lets the even and odd columns be loaded in parallel. The Julia program aie_iir_2b.jl splits the matrix into even and odd columns and generates two header files.
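As a rough picture of what splitting the matrix into even and odd columns means, the following C++ sketch de-interleaves a column-major matrix into two arrays. It does not reproduce the output of aie_iir_2b.jl, whose header layout is not shown here; `SplitColumns` and `split_columns` are hypothetical names used only for this illustration.

```cpp
#include <array>
#include <cstddef>

// Holds the even-indexed and odd-indexed columns of a matrix separately.
template <std::size_t Rows, std::size_t Cols>
struct SplitColumns {
    std::array<std::array<float, Rows>, Cols / 2> even;
    std::array<std::array<float, Rows>, Cols / 2> odd;
};

// De-interleaves a matrix stored as an array of columns:
// columns 0, 2, 4, ... go to `even`; columns 1, 3, 5, ... go to `odd`.
template <std::size_t Rows, std::size_t Cols>
SplitColumns<Rows, Cols> split_columns(
    const std::array<std::array<float, Rows>, Cols>& cols) {
    static_assert(Cols % 2 == 0, "this sketch assumes an even number of columns");
    SplitColumns<Rows, Cols> out{};
    for (std::size_t j = 0; j < Cols / 2; ++j) {
        out.even[j] = cols[2 * j];
        out.odd[j] = cols[2 * j + 1];
    }
    return out;
}
```

Keeping the two halves in separate arrays means each one can be read through its own sequence of loads.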
We start by using the AI Engine APIs.