This PL kernel implements the matrix transpose needed to feed the correct 16-point input samples to the DFT-16 on the third dimension of the 3D cube. The design is implemented in HLS at 312.5 MHz (SSR=8). The 9-point transforms arrive on alternating streams. Samples are written into a ping/pong buffer and read back in transposed (stride-63) order. The kernel produces 4 samples at a time, alternating between two output streams. The design has a latency of 1008/8 + 1 = 127 cycles.
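The stride-63 read order and the latency figure can be illustrated with a small software model. This is only a sketch in plain C++, not the HLS source: it assumes the 1008 samples are written to the buffer in linear order and that each group of 16 consecutive outputs forms one DFT-16 input. The names `transposed_read_addr` and `transpose_stride63` are hypothetical, and the model ignores the ping/pong buffering and the two-stream interleaving.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sizes taken from the text: 1008 samples per transform, read stride 63,
// 16-point DFT on the third dimension, SSR = 8 samples per clock.
constexpr std::size_t kNumSamples = 1008;
constexpr std::size_t kStride     = 63;
constexpr std::size_t kDftSize    = 16;
constexpr std::size_t kSsr        = 8;
static_assert(kStride * kDftSize == kNumSamples,
              "63 x 16 must cover the whole buffer");

// Latency quoted in the text: 1008/8 + 1 cycles.
constexpr std::size_t kLatencyCycles = kNumSamples / kSsr + 1;

// Hypothetical helper: buffer address of the k-th output sample.
// Assumption: samples were written linearly, so the 16 inputs of one
// DFT-16 sit 63 addresses apart.
constexpr std::size_t transposed_read_addr(std::size_t k) {
    return (k / kDftSize) + kStride * (k % kDftSize);
}

// Hypothetical software model of one buffer pass: linear write order in,
// transposed (stride-63) read order out.
template <typename T>
std::vector<T> transpose_stride63(const std::vector<T>& buf) {
    assert(buf.size() == kNumSamples);
    std::vector<T> out;
    out.reserve(kNumSamples);
    for (std::size_t k = 0; k < kNumSamples; ++k) {
        out.push_back(buf[transposed_read_addr(k)]);
    }
    return out;
}
```

Under these assumptions the first 16 reads touch addresses 0, 63, 126, and so on up to 945, and the next group starts at address 1. Writing one buffer while reading the other is what lets a ping/pong scheme sustain this access pattern without stalling the input.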