## AI Engine-ML matrix multiplication Instruction Set

The *AI Engine-ML* has specific hardware instructions for matrix multiplication. Depending on the bit width of the operands, various matrix sizes are supported. In the following tables, the notation `MxKxN` means that a matrix multiplication with a first operand of size M rows x K columns and a second operand of size K rows x N columns is supported.
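For instance, in the 4x16x8 mode the operand shapes compose as follows (a plain-Python shape check for illustration only, not AI Engine code):

```python
# MxKxN = 4x16x8: A is 4x16, B is 16x8, and the result C is 4x8.
M, K, N = 4, 16, 8
A = [[1] * K for _ in range(M)]   # M rows x K columns
B = [[1] * N for _ in range(K)]   # K rows x N columns
C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
     for i in range(M)]
assert len(C) == M and len(C[0]) == N   # C has M rows x N columns
```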

**Matrix Multiplication modes for real types**

| 8b x 4b | 8b x 8b | 16b x 8b | 8b x 16b | 16b x 16b | 32b x 16b | 16b x 32b | 32b x 32b | bfloat16 x bfloat16 |
|---|---|---|---|---|---|---|---|---|
| 4x16x8 | 4x8x4 | 4x4x4 | 4x4x8 | 4x4x4 | 2x4x8 | 2x4x8 | 4x2x4 | 4x8x4 |
| 8x16x8 | 4x16x4 | 8x4x4 | 4x4x4 | 2x4x8 | 4x4x4 | 4x4x4 | 4x2x4 | |
| 4x32x8 | 8x8x4 | 4x8x4 | 4x4x8 | 4x2x4 | 8x2x4 | | | |
| 2x8x8 | 4x4x8 | 4x2x8 | | | | | | |
| 4x8x8 | | | | | | | | |
| 2x16x8 | | | | | | | | |
| 4x16x8 | | | | | | | | |

**Matrix Multiplication modes for complex types**

| c16b x 16b | c16b x c16b | c32b x c16b | c32b x c32b |
|---|---|---|---|
| 2x4x8 | 1x4x8 | 1x2x4 | 1x2x8 |
| 4x4x4 | 1x2x8 | | |
| 2x2x8 | | | |
| 1x4x8 | | | |
| 2x4x8 | | | |

## IO or Compute bound?

Supporting a matrix multiply of a given size is one thing; verifying that the 2 operand loads, the result store, and the compute are equally well optimized is another.

A complete table of the matrix multiply efficiency, including matrix loads and vector compute, can be seen in the Performance Table.

### Example 1

For example, let's take the first entry of the table, which is 8b x 4b with a matrix size of 4x16x8:

- The sub-matrix **A** is of size 4x16 on 8 bits, which is 512 bits: 2 clock cycles are necessary to load it.
- The sub-matrix **B** is of size 16x8 on 4 bits, which is 512 bits: 2 clock cycles are necessary to load it.
- The sub-matrix **C** is of size 4x8 on 16 or 32 bits, which is 512 or 1024 bits: 2 or 4 clock cycles are necessary to store it.
- Finally, 512 MACs must be performed for this matrix, which can be done in 1 clock cycle.

The overall efficiency is 50% (result on 16 bits) or 25% (result on 32 bits): 2 or 4 clock cycles for load/store against 1 clock cycle for the compute.

### Tutorial Example

In this tutorial, the matrix sizes are the same, but the input data type is `int8` for both the **A** and **B** matrices, and the output data type can be either `int16` or `int32`.

- The sub-matrix **A** is of size 4x16 on 8 bits, which is 512 bits: 2 clock cycles are necessary to load it.
- The sub-matrix **B** is of size 16x8 on 8 bits, which is 1024 bits: 4 clock cycles are necessary to load it.
- The sub-matrix **C** is of size 4x8 on 16 or 32 bits, which is 512 or 1024 bits: 2 or 4 clock cycles are necessary to store it, once every 4 sub-matrix multiplication-accumulations.
- Finally, 512 MACs must be performed for this matrix, which can be done in 2 clock cycles (256 int8 x int8 multiplication-accumulations can be performed each cycle).

The overall maximum efficiency is 50%: the limitation comes from the load of the **B** sub-matrix.

A simple way to balance load/compute/store operations is to load 2 sub-matrices **A** and 1 sub-matrix **B** to perform 2 multiplication-accumulations for each **B**.
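This rebalancing can be checked with the same style of cycle budget (a simplification that ignores the amortized **C** store and assumes the two operand loads use separate load units):

```python
# Per-unit cycle costs for the tutorial case (int8 x int8, 4x16x8 mode),
# taken from the analysis above.
A_LOAD = 2   # cycles to load one A sub-matrix (512 bits)
B_LOAD = 4   # cycles to load one B sub-matrix (1024 bits)
MMUL = 2     # cycles per sub-matrix multiplication-accumulation

def efficiency(a_tiles_per_b):
    """Compute efficiency when `a_tiles_per_b` A tiles share one B tile."""
    load_a = a_tiles_per_b * A_LOAD    # load unit 1
    load_b = B_LOAD                    # load unit 2
    compute = a_tiles_per_b * MMUL     # one MMUL per A tile
    return compute / max(load_a, load_b, compute)

print(efficiency(1))   # 0.5 -> the B load starves the vector unit
print(efficiency(2))   # 1.0 -> all units are busy for 4 cycles: balanced
```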

## Code analysis

In this new version of the kernel, we want to load 2 **A** sub-matrices while we load a single **B** sub-matrix. The 2 **A** sub-matrices must belong to the same tile column so that they are multiplied by the same **B** sub-matrix.

The simplest way is to take 2 **A** tiles, one directly above the other, and multiply them by the same **B** sub-matrix. On the **C** side, the 2 tiles that are computed will also sit one directly above the other.

In order to avoid too many pointer manipulations, the **A** tiles will be read 2 by 2 from the Memory Tile so that they are stored right next to each other in the AI Engine-ML memory. **B** tiles will be read as in the previous, basic solution. Similarly to **A**, the **C** tiles will be stored side by side in the AI Engine-ML memory and reorganized when they are copied back into the Memory Tile.
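As an illustration of this layout (the 4x2 tile grid here is hypothetical; only the ordering matters), the order in which **A** sub-tiles land in memory can be sketched as:

```python
# Sketch: order in which A sub-tiles arrive in AI Engine-ML memory when the
# DMA reads them in "super tiles" of 2 vertically adjacent sub-matrices.

def super_tile_order(tile_rows, tile_cols, super_h=2):
    """Yield (row, col) tile coordinates in super-tile (2-tall) order."""
    order = []
    for row in range(0, tile_rows, super_h):     # step over super-tile rows
        for col in range(tile_cols):             # walk the tile columns
            for r in range(row, row + super_h):  # the 2 stacked tiles are contiguous
                order.append((r, col))
    return order

# With a hypothetical 4x2 tile grid, the two stacked tiles of each column
# arrive back to back:
print(super_tile_order(4, 2))
# [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (3, 0), (2, 1), (3, 1)]
```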

This approach offloads the pointer manipulation to the DMA programming, freeing up some scalar processor cycles.

The next 2 animated GIFs show how the **A** matrix is read from the Memory Tile and how the **C** matrix is written back to it. You can see that I chose **super tiles** consisting of 2 sub-matrices, one above the other: