c16b x 16b | c16b x c16b | c32b x c16b | c32b x c32b |
---|---|---|---|
2x4x8 | 1x4x8 | 1x2x4 | 1x2x8 |
4x4x4 | 1x2x8 | | |
2x2x8 | | | |
1x4x8 | | | |
2x4x8 | | | |

In the example developed in this tutorial, the three matrices A, B, and C are all 64x64 with 8-bit data:
$$A_{64 \times 64} \cdot B_{64 \times 64} = C_{64 \times 64}$$
The mode `4x16x8` will be used, so we need to decompose matrix A into 4x16 sub-matrices and matrix B into 16x8 sub-matrices in order to compute C using 4x8 sub-results:
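The decomposition can be checked with a small host-side sketch. This is plain Python, not AI Engine code, and the function names (`blocked_matmul`, `block_counts`) are hypothetical helpers introduced only for illustration. It assumes the 64x64 sizes and the 4x16x8 mode stated above, and computes C one 4x8 block at a time by accumulating products of 4x16 blocks of A with 16x8 blocks of B.

```python
# Block sizes of the 4x16x8 mode: A blocks are M x K, B blocks are K x N,
# and C blocks are M x N.
M, K, N = 4, 16, 8
SIZE = 64  # all three matrices are SIZE x SIZE in this tutorial


def naive_matmul(a, b):
    """Reference product of two SIZE x SIZE matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(SIZE)) for j in range(SIZE)]
            for i in range(SIZE)]


def blocked_matmul(a, b):
    """Compute C one M x N block at a time, accumulating over K-wide blocks."""
    c = [[0] * SIZE for _ in range(SIZE)]
    for bi in range(SIZE // M):          # block row of C (and of A)
        for bj in range(SIZE // N):      # block column of C (and of B)
            for bk in range(SIZE // K):  # blocks along the shared dimension
                for i in range(M):
                    for j in range(N):
                        acc = 0
                        for k in range(K):
                            acc += a[bi * M + i][bk * K + k] * b[bk * K + k][bj * N + j]
                        c[bi * M + i][bj * N + j] += acc
    return c


def block_counts():
    """Number of (block rows, block columns) in A, B and C."""
    return {
        "A": (SIZE // M, SIZE // K),
        "B": (SIZE // K, SIZE // N),
        "C": (SIZE // M, SIZE // N),
    }
```

With these sizes, A splits into 16x4 blocks, B into 4x8 blocks, and C into 16x8 blocks, so each block of C accumulates four block products.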
To use these matrix multiplication modes, one sub-matrix must be held in one register and the other sub-matrix in another register. Unfortunately, an AI Engine-ML memory read returns 256 contiguous bits, so when the matrices are stored row by row, multiple reads are needed to gather a sub-matrix of the right size. A solution is to rearrange the data so that each sub-matrix occupies contiguous memory addresses. The ADF graph API provides a convenient way to perform this kind of data reordering.
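To make the cost concrete, the following sketch counts how many 256-bit words one 4x16 block of 8-bit data touches in each layout. It is an illustrative model only: it assumes reads are aligned to 32-byte boundaries and that the matrix starts at address 0, and the helper names are hypothetical.

```python
READ_BYTES = 256 // 8   # one 256-bit read returns 32 bytes of 8-bit data
SIZE = 64               # matrix width in elements (1 byte each)
M, K = 4, 16            # shape of one A block in the 4x16x8 mode


def words_touched(addresses):
    """Set of 32-byte-aligned words covering the given byte addresses."""
    return {addr // READ_BYTES for addr in addresses}


def row_major_block_addresses(block_row, block_col):
    """Byte addresses of one M x K block when the matrix is stored row by row."""
    return [(block_row * M + i) * SIZE + block_col * K + k
            for i in range(M) for k in range(K)]


def tiled_block_addresses(block_row, block_col):
    """Byte addresses of the same block once each block is stored contiguously."""
    block_index = block_row * (SIZE // K) + block_col
    start = block_index * M * K
    return list(range(start, start + M * K))
```

Under these assumptions, the row-major block spans four separate words, each only half used, whereas the reordered block fits exactly in two consecutive words.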
Let's first look at the architecture chosen for this small matrix multiplication application:
Multiple A and B matrices are stored in DDR and copied into a memory tile using ping-pong buffering. These matrices are then copied again into AI Engine-ML memory, also using ping-pong buffering. The kernel operates on the two stored matrices to compute the output matrix C, which is then copied to a memory tile and from there to DDR. Data reordering can be done either between DDR and the memory tile or between the memory tile and AI Engine-ML memory. This tutorial uses the latter.
The goal of the reordering is to place the sub-matrices needed by the block-based matrix multiplication at adjacent addresses. Because the resulting matrix C is computed block row by block row, the sub-blocks of matrix A are stored row by row and those of matrix B column by column. Computing the first block row of C requires reading the first block row of A eight times and reading the full matrix B, block column by block column.
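The access order described above can be sketched as a list of block offsets. The code below is a plain Python model with hypothetical helper names. It assumes the 4x16x8 mode on 64x64 matrices, with A blocks laid out block row by block row and B blocks laid out block column by block column.

```python
SIZE = 64
M, K, N = 4, 16, 8
A_BLOCK_COLS = SIZE // K   # 4 blocks in each block row of A
B_BLOCK_COLS = SIZE // N   # 8 block columns in B
B_BLOCK_ROWS = SIZE // K   # 4 blocks in each block column of B


def a_block_offset(block_row, block_col):
    """Element offset of an A block when blocks are stored block row by block row."""
    return (block_row * A_BLOCK_COLS + block_col) * M * K


def b_block_offset(block_row, block_col):
    """Element offset of a B block when blocks are stored block column by block column."""
    return (block_col * B_BLOCK_ROWS + block_row) * K * N


def schedule_for_c_block_row(c_block_row):
    """(A offset, B offset) pairs read, in order, to compute one block row of C."""
    reads = []
    for c_block_col in range(B_BLOCK_COLS):   # one C block per block column of B
        for k in range(A_BLOCK_COLS):         # walk the shared dimension
            reads.append((a_block_offset(c_block_row, k),
                          b_block_offset(k, c_block_col)))
    return reads
```

For the first block row of C, the A offsets cycle through the same four blocks eight times, while the B offsets increase linearly through the whole matrix, which is what makes this storage order convenient for sequential reads.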
First, each block must be extracted using the memory tile DMA and stored in AI Engine-ML memory. The tiling has to occur when reading from the memory tile because it is currently not possible to specify a read or write access pattern on the AI Engine-ML memory.
The first block, at the top left of the picture, is extracted first and stored row by row in AI Engine-ML memory. The second block, starting with the column vector (8, 72, 136, 200), is then extracted from the memory tile and stored in AI Engine-ML memory as well. Finally, we obtain the following rearrangement of the data:
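The same rearrangement can be reproduced on the host as a sanity check. The sketch below is plain Python, not the ADF tiling API, and `tile_row_major` is a hypothetical helper. The example uses 4x8 tiles on a 64-element-wide matrix whose entries equal their row-major index. That tile shape is an assumption inferred from the column vector (8, 72, 136, 200) quoted above.

```python
def tile_row_major(data, rows, cols, tile_rows, tile_cols):
    """Reorder a row-major matrix so that each tile is contiguous.

    Tiles are visited left to right, then top to bottom, and each tile is
    written out row by row.
    """
    out = []
    for tile_r in range(0, rows, tile_rows):
        for tile_c in range(0, cols, tile_cols):
            for r in range(tile_r, tile_r + tile_rows):
                start = r * cols + tile_c
                out.extend(data[start:start + tile_cols])
    return out


# Example: entries are their own row-major index, so the output shows
# directly where each source element ends up.
data = list(range(64 * 64))
tiled = tile_row_major(data, 64, 64, 4, 8)
second_tile_first_column = tiled[32:64:8]
```

Here the first 32 output elements hold the first tile, and the second tile starts at offset 32, so `second_tile_first_column` lists the source indices that begin each row of the second tile.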