The AI Engine shuffle intrinsic function
selects data from a single input data buffer according to the start and offset
parameters. This allows flexible permutations of the input vector values without
having to rearrange the values. xbuff is the input data buffer,
xstart is the starting position in the xbuff data buffer
that applies to all lanes, and xoffsets holds
the per-lane position offsets added to xstart. The shuffle intrinsic function is
available in 8, 16, and 32 lane variants (shuffle8,
shuffle16, and shuffle32). The main data permute
(controlled by xoffsets) operates at 32-bit granularity, and
xsquare allows a further mini permute at 16-bit granularity after
the main permute. Thus, the 8-bit and 16-bit vector intrinsic functions take an
additional square parameter for more complex permutations.
For example, a shuffle16 intrinsic has the
following function prototype.
v16int32 shuffle16 ( v16int32 xbuff,
int xstart,
unsigned int xoffsets,
unsigned int xoffsets_hi
)
The data permute operates at 32-bit granularity. When the data size is 32 bits or 64 bits, the start and offsets are relative to the full data width (32 bits or 64 bits). The lane selection follows the regular lane selection scheme.
f: result [lane number] = xbuff [(xstart + xoffsets [lane number]) Mod input_samples]
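To make the lane selection rule concrete, the following is a minimal scalar sketch in plain C++. It is not the AI Engine intrinsic. The name shuffle16_model is hypothetical, and the sketch assumes that lane 0 reads the lowest 4 bits of xoffsets, that lanes 8 to 15 read xoffsets_hi, and that input_samples is 16 for a v16int32 buffer.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference model of the lane selection rule
// (an illustration only, not the AI Engine intrinsic itself).
// Assumption: lanes 0-7 read their 4-bit offset from xoffsets and
// lanes 8-15 from xoffsets_hi, lowest nibble first.
std::array<int32_t, 16> shuffle16_model(const std::array<int32_t, 16>& xbuff,
                                        int xstart,
                                        uint32_t xoffsets,
                                        uint32_t xoffsets_hi) {
    constexpr unsigned kInputSamples = 16;  // elements in the input buffer
    std::array<int32_t, 16> result{};
    for (unsigned lane = 0; lane < 16; ++lane) {
        // Pick the offset word that covers this lane, then its 4-bit field.
        const uint32_t word = (lane < 8) ? xoffsets : xoffsets_hi;
        const unsigned offset = (word >> (4 * (lane % 8))) & 0xF;
        // Selected input index: (xstart + offset) mod input_samples.
        const unsigned index =
            (static_cast<unsigned>(xstart) + offset) % kInputSamples;
        result[lane] = xbuff[index];
    }
    return result;
}
```

Under these assumptions, offsets of 0x76543210 and 0xFEDCBA98 with xstart = 0 leave the buffer unchanged, and a non-zero xstart rotates the selection with wraparound.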
The following example shows how shuffle works on the v16int32 vector. xoffsets
and xoffsets_hi each hold 4 bits per lane, so each parameter covers eight lanes. This
example moves the even and odd elements of the buffer into the lower and upper halves of
the buffer, respectively.
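As an illustration of how offset words for such an even/odd split could be built, the sketch below packs eight 4-bit lane offsets into one 32-bit word. The helper names (pack_offsets, even_offsets, odd_offsets) are hypothetical, and the sketch assumes lane 0 occupies the lowest 4 bits of each word.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the intrinsic API): pack eight 4-bit
// lane offsets into one 32-bit word, with lane 0 in the lowest nibble
// (assumed ordering).
uint32_t pack_offsets(const std::array<unsigned, 8>& lane_offsets) {
    uint32_t word = 0;
    for (unsigned lane = 0; lane < 8; ++lane) {
        assert(lane_offsets[lane] < 16);  // each offset must fit in 4 bits
        word |= static_cast<uint32_t>(lane_offsets[lane]) << (4 * lane);
    }
    return word;
}

// Output lanes 0-7 take the even input elements (candidate xoffsets value).
uint32_t even_offsets() { return pack_offsets({0, 2, 4, 6, 8, 10, 12, 14}); }

// Output lanes 8-15 take the odd input elements (candidate xoffsets_hi value).
uint32_t odd_offsets() { return pack_offsets({1, 3, 5, 7, 9, 11, 13, 15}); }
```

With this nibble ordering, even_offsets() evaluates to 0xECA86420 and odd_offsets() to 0xFDB97531, which would presumably be passed as xoffsets and xoffsets_hi with xstart set to 0.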
When the data permute is on 16-bit data, the intrinsic function includes
another parameter, xsquare, which gives the flexibility to
perform data selection within each 4 x 16-bit block of data. The xoffsets hex values come in pairs. The first hex value is an
absolute 32-bit offset and picks up 2 x 16-bit values (index, index+1). The second
hex value is an offset relative to the first value + 1 (also in 32-bit units) and picks up
another 2 x 16-bit values. For example, 0x00 selects index 0, 1, and
index 2, 3. 0x24 selects index 8, 9 (first value 4), and index 14,
15 (second value 2, giving 32-bit offset 4 + 2 + 1 = 7). Following is a shuffle example on the v32int16
vector.
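The pair decoding for 16-bit data can be sketched in plain C++ as follows. The helper name decode_offset_pair is hypothetical. The sketch takes the low hex digit as the first value, which is the reading that reproduces the 0x00 and 0x24 examples, and it deliberately leaves out xstart, buffer wraparound, and the xsquare mini permute.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration: decode one pair of offset hex digits
// into the four 16-bit element indices it selects, before any xsquare permute.
// Assumption: the low hex digit is the "first" value of the pair.
std::array<unsigned, 4> decode_offset_pair(uint8_t pair) {
    const unsigned first = pair & 0xF;          // absolute 32-bit offset
    const unsigned second = (pair >> 4) & 0xF;  // relative to first + 1
    const unsigned second_abs = first + second + 1;
    // Each 32-bit offset covers two adjacent 16-bit values.
    return {2 * first, 2 * first + 1, 2 * second_abs, 2 * second_abs + 1};
}
```

For instance, decode_offset_pair(0x24) yields indexes 8, 9, 14, and 15, matching the example in the text.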