This PL kernel implements the matrix transpose required to feed the proper 9-point input samples to the DFT-9 on the second dimension of the 3D cube. The design is built with HLS and runs at 312.5 MHz with SSR=8. The input 7-pt transforms arrive over two streams as outlined above. Samples are written linearly into a ping-pong buffer and read back in a strided order (seven points apart), producing output 9-pt transforms alternately on each of the two output streams. The design has a latency of 1008/8+2 = 128 cycles.
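The linear-write, strided-read addressing can be illustrated with a small software model. This is a minimal sketch, not the HLS kernel source. It assumes the 1008 samples form a 7 x 9 x 16 cube stored with the 7-pt dimension fastest, so that the nine inputs of one DFT-9 sit seven addresses apart within a 63-sample plane. The names `strided_read_order` and `transpose_model` are hypothetical, and the ping-pong buffering, SSR=8 parallelism, and two-stream interleaving are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed cube dimensions: 1008 = 7 * 9 * 16 (the 16 is inferred, not stated).
constexpr std::size_t N7 = 7, N9 = 9, N16 = 16;
constexpr std::size_t NPTS = N7 * N9 * N16;

// Hypothetical helper: read addresses that gather each 9-pt transform.
// Assumed layout: addr = a + 7*b + 63*c, with a in [0,7), b in [0,9), c in [0,16).
std::vector<std::size_t> strided_read_order() {
    std::vector<std::size_t> order;
    order.reserve(NPTS);
    for (std::size_t c = 0; c < N16; ++c)          // one 63-sample plane at a time
        for (std::size_t a = 0; a < N7; ++a)       // one 9-pt transform per value of a
            for (std::size_t b = 0; b < N9; ++b)   // nine samples, seven apart
                order.push_back(a + N7 * b + N7 * N9 * c);
    return order;
}

// Hypothetical model: 'in' holds the samples as written linearly into the
// buffer; the result holds them in the order they would be read back.
template <typename T>
std::vector<T> transpose_model(const std::vector<T>& in) {
    assert(in.size() == NPTS);
    std::vector<T> out;
    out.reserve(NPTS);
    for (std::size_t addr : strided_read_order())
        out.push_back(in[addr]);
    return out;
}
```

Under these assumptions, the first nine read addresses are 0, 7, 14, up to 56, and the next transform starts at address 1. Each address is read exactly once per 1008-sample frame, which is why one buffer of the ping-pong pair can be refilled while the other is drained.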