TRANSPOSE-0 Kernel - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

This AIE kernel implements the matrix transpose operation required to feed the proper 9-point input samples to the DFT-9 on the second dimension of the 3D cube. For AIE-ML technology, this matrix transpose may be implemented completely within the array using the Memory tile, eliminating the need to exit the array to perform the operation in the PL.

Buffer descriptors (BDs) control the sample ordering employed by the Memory tile on both input and output. A “write BD” controls the sample ordering on input to the Memory tile. The write BD is configured using the ADF graph programming model as shown below. The 3D pattern required here has dimensions \(\{7,9,16\}\). It is configured with a buffer_dimension of \(\{8,16,16\}\) since we must ensure alignment to the 32-bit boundaries of the Memory tile. The write address pattern is linear across the dimensions of size 7, 9, and 16, so the tile_traversal visits dimensions \(0,1,2\) in that order, with address wrapping occurring after 7, 9, and 16 steps respectively.

tiling_parameters write_bd = {
      .buffer_dimension = {8,16,16},
      .tiling_dimension = {1,1,1},
      .offset = {0,0,0},
      .tile_traversal = {{.dimension=0, .stride=1, .wrap=7},
                         {.dimension=1, .stride=1, .wrap=9},
                         {.dimension=2, .stride=1, .wrap=16}} };
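To make the write ordering concrete, the following is a minimal host-side sketch that enumerates the sample offsets such a write BD would generate. It is a model only, not ADF code: the TraversalStep struct and the helper functions are hypothetical names introduced here, and the sketch assumes that dimension 0 of the buffer is contiguous in memory and that the first tile_traversal entry is the innermost loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one tile_traversal entry (not the ADF type).
struct TraversalStep {
    std::size_t dimension;  // buffer dimension walked by this loop
    std::size_t stride;     // step size, in elements of that dimension
    std::size_t wrap;       // number of steps before the loop restarts
};

// Linear offset (in samples) of coordinate c, assuming dimension 0 is
// contiguous in memory.
std::size_t linear_offset(const std::array<std::size_t, 3>& c,
                          const std::array<std::size_t, 3>& dims) {
    return c[0] + dims[0] * (c[1] + dims[1] * c[2]);
}

// Enumerate offsets in traversal order; steps[0] is assumed innermost.
std::vector<std::size_t> traversal_offsets(
        const std::array<std::size_t, 3>& dims,
        const std::array<TraversalStep, 3>& steps) {
    std::vector<std::size_t> out;
    std::array<std::size_t, 3> c{};
    for (std::size_t k = 0; k < steps[2].wrap; ++k)
        for (std::size_t j = 0; j < steps[1].wrap; ++j)
            for (std::size_t i = 0; i < steps[0].wrap; ++i) {
                c[steps[0].dimension] = i * steps[0].stride;
                c[steps[1].dimension] = j * steps[1].stride;
                c[steps[2].dimension] = k * steps[2].stride;
                // Every generated coordinate must stay inside the buffer.
                assert(c[0] < dims[0] && c[1] < dims[1] && c[2] < dims[2]);
                out.push_back(linear_offset(c, dims));
            }
    return out;
}

// Offsets for the write BD in the text: buffer {8,16,16}, wraps 7, 9, 16.
std::vector<std::size_t> write_bd_offsets() {
    const std::array<std::size_t, 3> dims = {8, 16, 16};
    const std::array<TraversalStep, 3> steps = {
        {{0, 1, 7}, {1, 1, 9}, {2, 1, 16}}};
    return traversal_offsets(dims, steps);
}
```

Under these assumptions the eighth incoming sample lands at offset 8 rather than 7, because each row of 7 samples is padded to 8 locations, and only 7 x 9 x 16 = 1008 of the 8 x 16 x 16 = 2048 buffer locations are ever written.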

The read BD is configured using a similar tiling_parameters structure. In this case, the 3D pattern required has dimensions \(\{9,16,7\}\) since we send data first along the second dimension \(N_2=9\), then elect to process \(N_3=16\) second and \(N_1=7\) last. This is configured with a tile_traversal along dimensions \(1,2,0\), with wrapping applied after 9, 16, and 7 steps respectively.
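The effect of pairing the two traversal orders can be checked with a small host-side model. The sketch below is not ADF code and is not taken from the tutorial sources: the function names are hypothetical, dimension 0 is assumed to be contiguous in memory, and the first tile_traversal entry is assumed to be the innermost loop. The read order is reconstructed from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Padded Memory tile buffer from the text: buffer_dimension = {8,16,16}.
constexpr std::size_t D0 = 8, D1 = 16, D2 = 16;
// Logical cube from the text: N1 = 7, N2 = 9, N3 = 16.
constexpr std::size_t N1 = 7, N2 = 9, N3 = 16;

// Offset of coordinate (i0, i1, i2), assuming dimension 0 is contiguous.
constexpr std::size_t offset(std::size_t i0, std::size_t i1, std::size_t i2) {
    return i0 + D0 * (i1 + D1 * i2);
}

// Hypothetical model of the Memory tile transpose.
// Write traversal: dimension 0 (wrap 7), then 1 (wrap 9), then 2 (wrap 16).
// Read traversal reconstructed from the text:
//   dimension 1 (wrap 9), then 2 (wrap 16), then 0 (wrap 7).
std::vector<int> transpose_via_memtile(const std::vector<int>& in) {
    assert(in.size() == N1 * N2 * N3);
    std::vector<int> buffer(D0 * D1 * D2, 0);

    // Write side: the innermost loop runs along dimension 0.
    std::size_t n = 0;
    for (std::size_t i2 = 0; i2 < N3; ++i2)
        for (std::size_t i1 = 0; i1 < N2; ++i1)
            for (std::size_t i0 = 0; i0 < N1; ++i0)
                buffer[offset(i0, i1, i2)] = in[n++];

    // Read side: the innermost loop runs along dimension 1, so each group
    // of 9 consecutive output samples is one DFT-9 input vector.
    std::vector<int> out;
    out.reserve(in.size());
    for (std::size_t i0 = 0; i0 < N1; ++i0)
        for (std::size_t i2 = 0; i2 < N3; ++i2)
            for (std::size_t i1 = 0; i1 < N2; ++i1)
                out.push_back(buffer[offset(i0, i1, i2)]);
    return out;
}

// Convenience helper: transpose the ramp 0, 1, ..., 1007.
std::vector<int> transposed_ramp() {
    std::vector<int> in(N1 * N2 * N3);
    std::iota(in.begin(), in.end(), 0);
    return transpose_via_memtile(in);
}
```

With a ramp input, consecutive output samples differ by 7 in input index, which is the stride between neighbouring points along the \(N_2=9\) dimension in the write order. The padded buffer locations are never read, so the padding does not appear in the output stream.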

Finally, a repetition_count of 8 is specified because both the write BD and the read BD must be repeated four times each to match the number of transforms computed per kernel invocation by each DFT-7, DFT-9 and DFT-16 AIE compute kernel. The num_buffers is set to 2 because a ping/pong buffer arrangement is required here to support a full streaming data flow model.

figure4