Making efficient use of vector read and write operations, MUL/MAC operations, and scalar instructions requires some planning. Using its Very Long Instruction Word (VLIW) architecture, the AI Engine core can issue two 256-bit reads, one 256-bit write, one DSP MUL/MAC operation, and one scalar instruction in a single clock cycle. The core also has dedicated registers that manage the inner-loop counter, so the inner loop can run as a zero-overhead loop. For these operations to be scheduled properly, first consider the inner loop in terms of data flow. Because this case is bound by the 256-bit writes, only a 256-bit vector of input data needs to be refreshed. To leave room for the incoming data in the vector register, the input must be loaded into the 256-bit portion ahead of the 256-bit portion used for the output. This also avoids read and write accesses to the same address of the vector register.
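
The scheduling argument above can be sketched as a small resource model in plain C++. This is only an illustration, not AI Engine code or the compiler's scheduler: `SlotLimits`, `LoopBody`, `min_cycles_per_iteration`, and `refill_chunk` are hypothetical names, the per-cycle limits are the ones quoted in the paragraph, and the split of the vector register into `num_chunks` 256-bit portions is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>

// Per-cycle issue limits quoted in the text (hypothetical struct for illustration).
struct SlotLimits {
    int loads_per_cycle  = 2;  // two 256-bit reads
    int stores_per_cycle = 1;  // one 256-bit write
    int macs_per_cycle   = 1;  // one MUL/MAC operation
};

// Number of operations of each kind in one inner-loop iteration.
struct LoopBody {
    int loads;
    int stores;
    int macs;
};

constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Lower bound on cycles per iteration: the most heavily used slot decides.
int min_cycles_per_iteration(const LoopBody& body,
                             const SlotLimits& lim = SlotLimits{}) {
    return std::max({ceil_div(body.loads,  lim.loads_per_cycle),
                     ceil_div(body.stores, lim.stores_per_cycle),
                     ceil_div(body.macs,   lim.macs_per_cycle)});
}

// Assumption: the vector register is treated as num_chunks 256-bit portions.
// New input goes into the portion ahead of the one used for the output,
// so the same portion is never read and written at the same time.
int refill_chunk(int output_chunk, int num_chunks) {
    assert(num_chunks > 1 && output_chunk >= 0 && output_chunk < num_chunks);
    return (output_chunk + 1) % num_chunks;
}
```

Under this model, a loop body with two loads, two stores, and one MAC, `LoopBody{2, 2, 1}`, needs at least two cycles per iteration because the single write slot is the bottleneck, which is the write-bound situation the paragraph describes. With four portions, `refill_chunk(3, 4)` wraps around to portion 0.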