2D Matrix Transpose Using Tiling Parameters - 2024.2 English

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Document ID: UG1603
Release Date: 2024-11-28
Version: 2024.2 English

Tiling parameters can be used to perform a matrix transpose. A matrix transpose creates a new matrix whose rows are the columns of the original matrix.

Because the shared buffer DMAs can only generate addresses for 32-bit aligned data, the transpose is exact only for data types that are 32 bits wide or wider (int32, uint32, cint16, float, cint32, cfloat).

For 16-bit, 8-bit, and 4-bit data, the DMA produces only a partial transpose, and further steps are needed to complete it.

For 16-bit data, matrix elements must be taken in 2x1 groups (two contiguous elements along the column dimension) when performing the transpose. For 8-bit data, the elements must be taken in 4x1 groups. For data widths below 32 bits, the matrix transpose can therefore be a two-step process: the tiling parameters applied to the memory tile data produce a partial transpose, and the kernel then completes the transpose from the partial results.
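The two-step process can be sketched on the host as a plain C++ model. This is not AI Engine code: `partial_transpose` and `finish_transpose` are hypothetical helper names, and the loop nest assumes that the first `tile_traversal` entry is the innermost loop and that the buffer is stored row-major with dimension 0 as the column dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1 (model of the memory-tile DMA read, hypothetical helper):
// walk down the rows first, then advance by one group of `group`
// contiguous column elements. `in` is a row-major nrows x ncols matrix.
template <typename T>
std::vector<T> partial_transpose(const std::vector<T>& in, std::size_t nrows,
                                 std::size_t ncols, std::size_t group)
{
    assert(group > 0 && ncols % group == 0 && in.size() == nrows * ncols);
    std::vector<T> out;
    out.reserve(in.size());
    for (std::size_t c0 = 0; c0 < ncols; c0 += group)   // outer: dimension 0, stride = group
        for (std::size_t r = 0; r < nrows; ++r)         // inner: dimension 1, stride = 1
            for (std::size_t k = 0; k < group; ++k)     // elements inside one tile
                out.push_back(in[r * ncols + c0 + k]);
    return out;
}

// Step 2 (model of the kernel-side completion, hypothetical helper):
// each run of nrows * group elements is an nrows x group sub-matrix;
// transposing every sub-matrix yields the full ncols x nrows transpose.
template <typename T>
std::vector<T> finish_transpose(const std::vector<T>& partial, std::size_t nrows,
                                std::size_t ncols, std::size_t group)
{
    assert(group > 0 && ncols % group == 0 && partial.size() == nrows * ncols);
    std::vector<T> out(partial.size());
    const std::size_t block = nrows * group;
    for (std::size_t b = 0; b < ncols / group; ++b)
        for (std::size_t r = 0; r < nrows; ++r)
            for (std::size_t k = 0; k < group; ++k)
                out[b * block + k * nrows + r] = partial[b * block + r * group + k];
    return out;
}
```

With `group` set to 1 the first step alone is already a full transpose, which matches the 32-bit case. With `group` set to 2 (16-bit data) or 4 (8-bit data), the second step is needed to reorder the elements inside each group of columns.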

In the following code, the kernels are just passthrough functions that copy the input data into the output without any modification. The transpose is handled by the DMA through the read access tiling parameter setting:
class Transpose : public adf::graph
{
public:
    adf::kernel k32,k16,k8;

    adf::shared_buffer<int32> mtx32;
    adf::shared_buffer<int16> mtx16;
    adf::shared_buffer<int8> mtx8;

    adf::input_plio plin32,plin16,plin8;
    adf::output_plio plout32,plout16,plout8;

    Transpose()
    {

        // Kernel Creation
        k32 = adf::kernel::create(Passthrough<int32,Nrows,Ncolumns>);
        k16 = adf::kernel::create(Passthrough<int16,Nrows,Ncolumns>);
        k8 = adf::kernel::create(Passthrough<int8,Nrows,Ncolumns>);


        adf::source(k32) = "Passthrough.cpp";
        adf::runtime<ratio>(k32) = 0.9;
        adf::source(k16) = "Passthrough.cpp";
        adf::runtime<ratio>(k16) = 0.9;
        adf::source(k8) = "Passthrough.cpp";
        adf::runtime<ratio>(k8) = 0.9;

        // Shared Buffers
        mtx32 = adf::shared_buffer<int32>::create({Ncolumns,Nrows},1,1);
        write_access(mtx32.in[0]) =
            adf::tiling({.buffer_dimension={Ncolumns,Nrows},
                  .tiling_dimension={Ncolumns,Nrows},.offset={0,0}});
        read_access(mtx32.out[0]) =
            adf::tiling({.buffer_dimension={Ncolumns,Nrows},
                  .tiling_dimension={1,1},.offset={0,0}, 
                  .tile_traversal = {{.dimension=1,.stride=1,.wrap=Nrows},
                        {.dimension=0, .stride=1, .wrap=Ncolumns}}});

        mtx16 = adf::shared_buffer<int16>::create({Ncolumns,Nrows},1,1);
        write_access(mtx16.in[0]) =
            adf::tiling({.buffer_dimension={Ncolumns,Nrows},
                  .tiling_dimension={Ncolumns,Nrows},.offset={0,0}});
        read_access(mtx16.out[0]) =
            adf::tiling({.buffer_dimension={Ncolumns,Nrows},
                  .tiling_dimension={2,1},.offset={0,0}, 
                  .tile_traversal = {{.dimension=1,.stride=1,.wrap=Nrows},
                        {.dimension=0, .stride=2, .wrap=Ncolumns/2}}});

        mtx8  = adf::shared_buffer<int8 >::create({Ncolumns,Nrows},1,1);
        write_access(mtx8.in[0]) =
            adf::tiling({.buffer_dimension={Ncolumns,Nrows},
                  .tiling_dimension={Ncolumns,Nrows},.offset={0,0}});
        read_access(mtx8.out[0]) =
            adf::tiling({.buffer_dimension={Ncolumns,Nrows},
                  .tiling_dimension={4,1},.offset={0,0}, 
                  .tile_traversal = {{.dimension=1,.stride=1,.wrap=Nrows},
                        {.dimension=0, .stride=4, .wrap=Ncolumns/4}}});


        // PLIO Creation
        plin32 = adf::input_plio::create("input64_32", adf::plio_64_bits, 
                  "data/Input_32.csv", 625);
        plout32 = adf::output_plio::create("output64_32", adf::plio_64_bits, 
                  "data/Output_32.csv", 625);
        plin16 = adf::input_plio::create("input64_16", adf::plio_64_bits, 
                  "data/Input_16.csv", 625);
        plout16 = adf::output_plio::create("output64_16", adf::plio_64_bits, 
                  "data/Output_16.csv", 625);
        plin8 = adf::input_plio::create("input64_8", adf::plio_64_bits, 
                  "data/Input_8.csv", 625);
        plout8 = adf::output_plio::create("output64_8", adf::plio_64_bits, 
                  "data/Output_8.csv", 625);

        // Connections

        adf::connect (plin32.out[0],mtx32.in[0]);
        adf::connect(mtx32.out[0],k32.in[0]);
        adf::connect(k32.out[0],plout32.in[0]);

        adf::connect (plin16.out[0],mtx16.in[0]);
        adf::connect(mtx16.out[0],k16.in[0]);
        adf::connect(k16.out[0],plout16.in[0]);

        adf::connect (plin8.out[0],mtx8.in[0]);
        adf::connect(mtx8.out[0],k8.in[0]);
        adf::connect(k8.out[0],plout8.in[0]);
    };
};
In the 16-bit case, the data are taken in 2x1 groups (.tiling_dimension={2,1}) so that each tile is 32 bits wide. If the tile size is not a multiple of 32 bits, the AI Engine compiler generates an error message such as the following:
Failed in converting tiling parameters to buffer descriptors for port T.mtx16.out[0]: In dimension 0, stepsize 8 and element size 16 (in bytes) cannot be adjusted to 32 bit word address. In dimension 1, stepsize 1 and element size 16 (in bytes) cannot be adjusted to 32 bit word address.

A step size of 1 with a 16-bit data width generates 16-bit aligned addresses, which are not supported.
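The 32-bit rule can be checked before compiling the graph. The following is a minimal sketch with hypothetical helper names (`tile_is_word_aligned`, `min_elems_dim0`) that are not part of the ADF API. It only encodes the rule stated above, that the tile width along dimension 0 must be a multiple of 32 bits, and does not reproduce every check the compiler performs. The 8x1 grouping for 4-bit data is inferred from that rule rather than stated in this section.

```cpp
#include <cstddef>

// Hypothetical helpers, not part of the ADF API.
constexpr std::size_t kWordBits = 32;  // DMA address granularity

// True when a tile of `elems_dim0` contiguous elements, each `elem_bits`
// wide, covers a whole number of 32-bit words.
constexpr bool tile_is_word_aligned(std::size_t elem_bits, std::size_t elems_dim0)
{
    return (elem_bits * elems_dim0) % kWordBits == 0;
}

// Smallest dimension-0 tile size (in elements) that satisfies the rule,
// assuming elem_bits divides 32 or is itself a multiple of 32.
constexpr std::size_t min_elems_dim0(std::size_t elem_bits)
{
    return elem_bits >= kWordBits ? 1 : kWordBits / elem_bits;
}

static_assert(tile_is_word_aligned(32, 1), "int32: 1x1 tiles are word aligned");
static_assert(tile_is_word_aligned(16, 2), "int16: 2x1 tiles are word aligned");
static_assert(!tile_is_word_aligned(16, 1), "int16: 1x1 tiles are rejected");
static_assert(min_elems_dim0(8) == 4, "int8: elements are taken 4x1");
```

A check of this kind could guard the `.tiling_dimension` and `.stride` values of a templated graph so that an unsupported combination fails with a clear message in the graph source.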

The following figure provides a visualization of the effect of this transpose function with 32-bit, 16-bit and 8-bit data:
Figure 1. 32-bit, 16-bit and 8-bit Results For An 8x8 Matrix

In this example, each matrix is rearranged differently depending on the data bit width. The color coding shows how the data are shuffled. The 32-bit data undergo a true transpose. The 16-bit and 8-bit data end up with a partial transpose: groups of elements that are contiguous along the column dimension of a row (two elements for 16-bit data, four for 8-bit data) remain contiguous in the output. This behavior is useful when multiple matrices are interleaved in the column dimension. For example, with 16-bit data you could have two matrices (or 4, 6, 8, …) where element (0,0) of the first matrix is followed by element (0,0) of the second matrix, then element (0,1) of each matrix, and so on.

If the combined bit width of the corresponding elements from all interleaved matrices is a multiple of 32 bits, then all the matrices can be transposed at once using tiling parameters.
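The interleaved case can be illustrated with a host-side C++ model of two 16-bit matrices. This is a sketch, not AI Engine code: the helper names `interleave`, `transpose`, and `read_2x1_tiles` are hypothetical, and the read loop assumes the same traversal order as the 16-bit read access in the graph above (rows first, then 2x1 groups of columns).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Mat16 = std::vector<std::int16_t>;  // row-major storage

// Interleave two equally sized row-major matrices along the column
// dimension: A(r,c) is immediately followed by B(r,c).
Mat16 interleave(const Mat16& a, const Mat16& b)
{
    assert(a.size() == b.size());
    Mat16 out;
    out.reserve(2 * a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out.push_back(a[i]);
        out.push_back(b[i]);
    }
    return out;
}

// Reference transpose of a row-major nrows x ncols matrix.
Mat16 transpose(const Mat16& m, std::size_t nrows, std::size_t ncols)
{
    assert(m.size() == nrows * ncols);
    Mat16 out(m.size());
    for (std::size_t r = 0; r < nrows; ++r)
        for (std::size_t c = 0; c < ncols; ++c)
            out[c * nrows + r] = m[r * ncols + c];
    return out;
}

// Model of the DMA read with 2x1 tiles on the interleaved buffer, which
// has nrows rows and 2 * ncols 16-bit columns. Each tile holds one
// (A, B) pair, that is, 32 bits.
Mat16 read_2x1_tiles(const Mat16& in, std::size_t nrows, std::size_t ncols)
{
    assert(in.size() == 2 * nrows * ncols);
    Mat16 out;
    out.reserve(in.size());
    for (std::size_t c = 0; c < ncols; ++c)        // outer: pairs of columns
        for (std::size_t r = 0; r < nrows; ++r) {  // inner: rows
            out.push_back(in[2 * (r * ncols + c)]);      // element from A
            out.push_back(in[2 * (r * ncols + c) + 1]);  // element from B
        }
    return out;
}
```

Under these assumptions the output of `read_2x1_tiles` is the interleaving of the two transposed matrices, so no further kernel step is needed to transpose either matrix.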