You can implement this corner-turning data flow directly with the local DMA hardware in each AI Engine tile. Program the input stream DMA to write the local input buffer row-wise using a tiling parameter, as shown in the following sections. The AI Engine then reads this input buffer column-wise when performing channel-by-channel computation and stores its results column-wise in the output buffer.
Program the output stream DMA to read the output buffer row-wise, which restores the TDM ordering of the output stream. Corner-turning at both the input and output buffers of the mixer consumes no core compute resources because the local tile DMA hardware generates the buffer addresses. Corner-turning therefore becomes part of the natural data flow.
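The access pattern described above can be checked with a small host-side model. The following Python sketch is illustrative only: it is not AI Engine code and does not use the tiling parameter API. The buffer sizes, the function names, and the per-channel gain that stands in for the mixer computation are all assumptions made for this example. It assumes the TDM stream carries one sample per channel in each time slot, so that a row of the buffer holds one time slot and a column holds one channel.

```python
# Host-side model of the corner-turning data flow (illustrative only).
# All names and sizes here are hypothetical; this is not the AI Engine API.

NUM_CHANNELS = 4  # assumed number of TDM channels (buffer columns)
NUM_SAMPLES = 3   # assumed samples per channel per buffer (buffer rows)


def make_tdm_stream(num_channels, num_samples):
    """Build a TDM stream: for each time slot, one sample per channel.

    Each sample is encoded as 100 * channel + time so it is easy to trace.
    """
    return [100 * ch + t
            for t in range(num_samples)
            for ch in range(num_channels)]


def dma_write_row_wise(stream, rows, cols):
    """Model the input stream DMA: fill a rows x cols buffer in arrival order."""
    assert len(stream) == rows * cols
    return [list(stream[r * cols:(r + 1) * cols]) for r in range(rows)]


def kernel_process_column_wise(in_buf, gains):
    """Model the kernel: read one channel (column) at a time, write columns."""
    rows, cols = len(in_buf), len(in_buf[0])
    out_buf = [[0] * cols for _ in range(rows)]
    for ch in range(cols):
        # Column-wise read gathers the time series of a single channel.
        column = [in_buf[t][ch] for t in range(rows)]
        # Stand-in for the per-channel mixer computation (assumption).
        result = [gains[ch] * x for x in column]
        # Column-wise store into the output buffer.
        for t in range(rows):
            out_buf[t][ch] = result[t]
    return out_buf


def dma_read_row_wise(buf):
    """Model the output stream DMA: read the buffer row by row."""
    return [x for row in buf for x in row]


tdm_in = make_tdm_stream(NUM_CHANNELS, NUM_SAMPLES)
in_buf = dma_write_row_wise(tdm_in, NUM_SAMPLES, NUM_CHANNELS)
gains = [1, 2, 3, 4]  # hypothetical per-channel factors
out_buf = kernel_process_column_wise(in_buf, gains)
tdm_out = dma_read_row_wise(out_buf)

print(tdm_in)
print(tdm_out)
```

In this model the only reordering happens in the index arithmetic of the two DMA helper functions, which mirrors the point made above: the kernel sees each channel as a contiguous column, and the output stream leaves in the same channel-interleaved order in which the input arrived.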