Tiling parameters can be used to handle a matrix transpose. A matrix transpose creates a new matrix wherein the rows correspond to the columns of the original matrix.
Because the shared buffer DMAs can only generate addresses for 32-bit aligned data, transpose results are exact for data widths of 32 bits or more (int32, uint32, cint16, float, cint32, cfloat).
For 16-bit, 8-bit, and 4-bit data, the transpose result is only partial and needs further steps.
For 16-bit data, matrix elements must be taken in 2x1 groups (two contiguous elements of a row) when performing the transpose. For 8-bit data, the elements must be taken in 4x1 groups. For data widths lower than 32 bits, the matrix transpose can therefore be a two-step process: the first step is a partial transpose achieved using the tiling parameters on the memory tile data, and the second step is the final transpose of the partial results in the kernel.
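The read order produced by such a tiling can be sketched on the host in plain C++. This is only an illustrative model, not ADF API: the function name `model_tiled_read` and its parameters are hypothetical, and it assumes that dimension 0 is the contiguous (column) dimension and that the traversal walks dimension 1 first, then dimension 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model (not part of the ADF API).
// `in` is a row-major matrix: nrows rows of ncols contiguous elements.
// `tile_w` is the number of contiguous elements per tile in dimension 0
// (1 for 32-bit data, 2 for 16-bit data, 4 for 8-bit data).
template <typename T>
std::vector<T> model_tiled_read(const std::vector<T>& in,
                                std::size_t nrows,
                                std::size_t ncols,
                                std::size_t tile_w)
{
    assert(in.size() == nrows * ncols);
    assert(tile_w > 0 && ncols % tile_w == 0);

    std::vector<T> out;
    out.reserve(in.size());
    for (std::size_t c = 0; c < ncols; c += tile_w)    // outer: dimension 0, stride tile_w
        for (std::size_t r = 0; r < nrows; ++r)        // inner: dimension 1, stride 1
            for (std::size_t k = 0; k < tile_w; ++k)   // elements inside one tile
                out.push_back(in[r * ncols + c + k]);
    return out;
}
```

For a 2x4 matrix holding the values 0 to 7, a tile width of 1 yields 0, 4, 1, 5, 2, 6, 3, 7 (a full transpose), while a tile width of 2 yields 0, 1, 4, 5, 2, 3, 6, 7 (a partial transpose in which each pair stays contiguous).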
class Transpose : public adf::graph
{
public:
adf::kernel k32,k16,k8;
adf::shared_buffer<int32> mtx32;
adf::shared_buffer<int16> mtx16;
adf::shared_buffer<int8> mtx8;
adf::input_plio plin32,plin16,plin8;
adf::output_plio plout32,plout16,plout8;
Transpose()
{
// Kernel Creation
k32 = adf::kernel::create(Passthrough<int32,Nrows,Ncolumns>);
k16 = adf::kernel::create(Passthrough<int16,Nrows,Ncolumns>);
k8 = adf::kernel::create(Passthrough<int8,Nrows,Ncolumns>);
adf::source(k32) = "Passthrough.cpp";
adf::runtime<adf::ratio>(k32) = 0.9;
adf::source(k16) = "Passthrough.cpp";
adf::runtime<adf::ratio>(k16) = 0.9;
adf::source(k8) = "Passthrough.cpp";
adf::runtime<adf::ratio>(k8) = 0.9;
// Shared Buffers
mtx32 = adf::shared_buffer<int32>::create({Ncolumns,Nrows},1,1);
write_access(mtx32.in[0]) =
adf::tiling({.buffer_dimension={Ncolumns,Nrows},
.tiling_dimension={Ncolumns,Nrows},.offset={0,0}});
read_access(mtx32.out[0]) =
adf::tiling({.buffer_dimension={Ncolumns,Nrows},
.tiling_dimension={1,1},.offset={0,0},
.tile_traversal = {{.dimension=1,.stride=1,.wrap=Nrows},
{.dimension=0, .stride=1, .wrap=Ncolumns}}});
mtx16 = adf::shared_buffer<int16>::create({Ncolumns,Nrows},1,1);
write_access(mtx16.in[0]) =
adf::tiling({.buffer_dimension={Ncolumns,Nrows},.tiling_dimension={Ncolumns,Nrows},.offset={0,0}});
read_access(mtx16.out[0]) =
adf::tiling({.buffer_dimension={Ncolumns,Nrows},
.tiling_dimension={2,1},.offset={0,0},
.tile_traversal = {{.dimension=1,.stride=1,.wrap=Nrows},
{.dimension=0, .stride=2, .wrap=Ncolumns/2}}});
mtx8 = adf::shared_buffer<int8 >::create({Ncolumns,Nrows},1,1);
write_access(mtx8.in[0]) =
adf::tiling({.buffer_dimension={Ncolumns,Nrows},
.tiling_dimension={Ncolumns,Nrows},.offset={0,0}});
read_access(mtx8.out[0]) =
adf::tiling({.buffer_dimension={Ncolumns,Nrows},
.tiling_dimension={4,1},.offset={0,0},
.tile_traversal = {{.dimension=1,.stride=1,.wrap=Nrows},
{.dimension=0, .stride=4, .wrap=Ncolumns/4}}});
// PLIO Creation
plin32 = adf::input_plio::create("input64_32", adf::plio_64_bits,
"data/Input_32.csv", 625);
plout32 = adf::output_plio::create("output64_32", adf::plio_64_bits,
"data/Output_32.csv", 625);
plin16 = adf::input_plio::create("input64_16", adf::plio_64_bits,
"data/Input_16.csv", 625);
plout16 = adf::output_plio::create("output64_16", adf::plio_64_bits,
"data/Output_16.csv", 625);
plin8 = adf::input_plio::create("input64_8", adf::plio_64_bits,
"data/Input_8.csv", 625);
plout8 = adf::output_plio::create("output64_8", adf::plio_64_bits,
"data/Output_8.csv", 625);
// Connections
adf::connect (plin32.out[0],mtx32.in[0]);
adf::connect(mtx32.out[0],k32.in[0]);
adf::connect(k32.out[0],plout32.in[0]);
adf::connect (plin16.out[0],mtx16.in[0]);
adf::connect(mtx16.out[0],k16.in[0]);
adf::connect(k16.out[0],plout16.in[0]);
adf::connect (plin8.out[0],mtx8.in[0]);
adf::connect(mtx8.out[0],k8.in[0]);
adf::connect(k8.out[0],plout8.in[0]);
};
};
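The Passthrough kernels in this graph forward the data unchanged, so the 16-bit and 8-bit outputs remain partially transposed. A minimal sketch of the second step follows, written in plain C++ rather than with the AI Engine API. The helper `finish_transpose` is hypothetical, and it assumes the partial result arrives as groups of `tile_w` contiguous elements ordered by column group first, then by row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the ADF or AI Engine API).
// `partial` is assumed to be laid out as [ncols / tile_w][nrows][tile_w],
// which is the order produced by reading tile_w x 1 tiles down each column group.
// The result is the transposed matrix in row-major order: ncols rows of nrows elements.
template <typename T>
std::vector<T> finish_transpose(const std::vector<T>& partial,
                                std::size_t nrows,
                                std::size_t ncols,
                                std::size_t tile_w)
{
    assert(partial.size() == nrows * ncols);
    assert(tile_w > 0 && ncols % tile_w == 0);

    std::vector<T> out(partial.size());
    for (std::size_t g = 0; g < ncols / tile_w; ++g)       // column group
        for (std::size_t r = 0; r < nrows; ++r)            // original row
            for (std::size_t k = 0; k < tile_w; ++k) {     // position inside the tile
                const std::size_t c = g * tile_w + k;      // original column
                out[c * nrows + r] = partial[(g * nrows + r) * tile_w + k];
            }
    return out;
}
```

In a real kernel this reordering would more likely be done with vector shuffle operations; the scalar loops are only meant to make the index mapping explicit.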
For 16-bit data, the read tiling takes two elements in dimension 0 (.tiling_dimension={2,1}) so that the size of each tiling is 32 bits. If the size of the tiling is not a multiple of 32 bits, the AI Engine compiler generates an error message such as:
Failed in converting tiling parameters to buffer descriptors for port T.mtx16.out[0]: In dimension 0, stepsize 8 and element size 16 (in bytes) cannot be adjusted to 32 bit word address. In dimension 1, stepsize 1 and element size 16 (in bytes) cannot be adjusted to 32 bit word address.
A step size of 1 with a 16-bit data width generates 16-bit aligned addresses, which are not supported.
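The rule can be checked on the host before compiling a graph. The sketch below is a simplified stand-in for the compiler's check, not a reproduction of it: the helper name `tile_is_word_aligned` is hypothetical, and it only tests whether the contiguous part of a tile fills whole 32-bit words.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the ADF API).
// Returns true when `tile_w` contiguous elements of `element_bits` bits each
// occupy a whole number of 32-bit words.
inline bool tile_is_word_aligned(std::size_t tile_w, std::size_t element_bits)
{
    assert(tile_w > 0 && element_bits > 0);
    return (tile_w * element_bits) % 32 == 0;
}
```

With this helper, a 2x1 tile of 16-bit data and a 4x1 tile of 8-bit data both pass, while a 1x1 tile of 16-bit data does not.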
In this example, each matrix is transposed differently depending on the data bit width. The color coding helps show how the data is shuffled around. For 32-bit data, the result is a true transpose. 16-bit and 8-bit data end up with a partial transpose in which the groups of elements taken together on each row (contiguous in the column dimension) stay contiguous in the column dimension. This is useful if you have multiple matrices that are interleaved in the column dimension. For example, for 16-bit data you could have two matrices (or 4, 6, 8, …) where element (0,0) of the first matrix is followed by element (0,0) of the second matrix, and likewise for element (0,1), and so on.
If the bit width of the concatenation of each element coming from all matrices is a multiple of 32 bits, then they can all be transposed at once using tiling parameters.
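The interleaved case can be illustrated with a small host-side model. It is a sketch under stated assumptions rather than ADF code: the function `transpose_interleaved` is hypothetical, the data type is fixed to 16 bits, and the matrices are assumed to be interleaved element by element along the column dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model (not part of the ADF API).
// `in` holds `count` matrices of nrows x ncols 16-bit elements, interleaved so that
// element (r, c) of matrix m sits at index (r * ncols + c) * count + m.
// Reading count x 1 tiles down each column transposes every matrix at once
// and keeps the same interleaving in the output.
std::vector<std::int16_t> transpose_interleaved(const std::vector<std::int16_t>& in,
                                                std::size_t nrows,
                                                std::size_t ncols,
                                                std::size_t count)
{
    assert(in.size() == nrows * ncols * count);
    assert(count > 0 && (count * 16) % 32 == 0);   // each tile must fill whole 32-bit words

    std::vector<std::int16_t> out;
    out.reserve(in.size());
    for (std::size_t c = 0; c < ncols; ++c)            // outer: one group of `count` elements per column
        for (std::size_t r = 0; r < nrows; ++r)        // inner: walk down the rows
            for (std::size_t m = 0; m < count; ++m)    // the interleaved elements stay together
                out.push_back(in[(r * ncols + c) * count + m]);
    return out;
}
```

For two 2x2 matrices A = [[1,2],[3,4]] and B = [[10,20],[30,40]] stored as 1, 10, 2, 20, 3, 30, 4, 40, the result is 1, 10, 3, 30, 2, 20, 4, 40, which is the transpose of A interleaved with the transpose of B.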