The cyclic shift block performs no computation. It applies a memoryless permutation, a “cyclic shift”, to each input M-vector, with no buffering between inputs. In this design, the shift amount varies according to an eight-stage FSM. This block does not fit well in the AI Engine array: AI Engine stream routing is more restrictive than PL routing for permutations, and the block involves no computation that would justify AI Engine placement. The function is a natural fit for a “PL Data Mover”, which you can implement using Vitis HLS.
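The behavior described above can be sketched in plain C++ as a rough illustration. This is not the design's actual HLS source. The vector length `M = 8`, the contents of `SHIFT_TABLE`, the rotation direction, and the names `CyclicShifter` and `process` are all assumptions made for this example. A real Vitis HLS kernel would also add stream interfaces and pragmas, which are omitted here.

```cpp
#include <array>
#include <cstddef>

// Assumed values for illustration only: the text does not specify M,
// the shift schedule, or the rotation direction.
constexpr std::size_t M = 8;           // hypothetical vector length
constexpr std::size_t NUM_STATES = 8;  // eight FSM stages, per the text
constexpr std::array<std::size_t, NUM_STATES> SHIFT_TABLE = {0, 1, 2, 3, 4, 5, 6, 7};

using Vec = std::array<int, M>;

// Hypothetical model of the block. The only state kept between calls is
// the FSM index; no input samples are buffered from one vector to the next.
struct CyclicShifter {
    std::size_t state = 0;

    Vec process(const Vec& in) {
        // Look up the shift amount for the current FSM stage.
        const std::size_t shift = SHIFT_TABLE[state] % M;
        Vec out{};
        // Pure permutation: element i moves to position (i + shift) mod M.
        for (std::size_t i = 0; i < M; ++i) {
            out[(i + shift) % M] = in[i];
        }
        // Advance the FSM, wrapping after the eighth stage.
        state = (state + 1) % NUM_STATES;
        return out;
    }
};
```

Because the loop only moves data and does no arithmetic on the samples, it would likely map to simple multiplexing logic in the PL, which is consistent with the placement argument above.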