The cyclic shift block performs no computation. It applies a memoryless permutation to each input M-vector, with no buffering between inputs. Each input vector is cyclically shifted by an amount that varies according to an eight-stage FSM in this design. This block is a poor fit for the AI Engine array: its stream routing is more restrictive than PL for introducing permutations, and there is no compute required to warrant placing it there. The function is a natural fit for a “PL Data Mover” and can be implemented easily using Vitis HLS.
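
The behavior described above can be sketched in plain C++ as a reference model. This is only an illustration, not the design's actual HLS source. The vector length `M = 8`, the contents of `SHIFT_TABLE`, the rotation direction, and the names `Vec` and `CyclicShift` are all assumptions made for the example. A real Vitis HLS kernel would read and write streams and add the appropriate pragmas.

```cpp
#include <array>
#include <cstddef>

// Hypothetical vector length; the real M comes from the design.
constexpr std::size_t M = 8;
// Eight FSM states, as stated in the text.
constexpr std::size_t NUM_STATES = 8;

// Hypothetical shift amount for each FSM state (placeholder values).
constexpr std::array<std::size_t, NUM_STATES> SHIFT_TABLE = {0, 1, 2, 3, 4, 5, 6, 7};

using Vec = std::array<int, M>;

// Memoryless cyclic shift: the only state kept between inputs is the
// FSM index. No input samples are buffered across calls.
struct CyclicShift {
    std::size_t state = 0;

    Vec apply(const Vec& in) {
        const std::size_t shift = SHIFT_TABLE[state] % M;
        Vec out{};
        // Rotate by `shift` positions (direction is an assumption):
        // input element i lands at output position (i + shift) mod M.
        for (std::size_t i = 0; i < M; ++i) {
            out[(i + shift) % M] = in[i];
        }
        // Advance the FSM once per input vector, wrapping after eight.
        state = (state + 1) % NUM_STATES;
        return out;
    }
};
```

Each call to `apply` consumes one M-vector and produces one permuted M-vector, so the model needs no storage beyond the FSM index. That is why the function maps naturally onto a simple PL data mover.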