**Version: Vitis 2024.1**

## Introduction

Xilinx introduced the Versal™ AI Edge series, designed to enable AI innovation from the edge to the endpoint. This series is mainly based on the AI Engine-ML, which delivers 4X the machine learning compute of the previous AI Engine architecture and integrates new accelerator RAM with an enhanced memory hierarchy for evolving AI algorithms.

IMPORTANT: Before beginning the tutorial, make sure you have installed the Vitis 2024.1 software. The Vitis release includes all the embedded base platforms, including the VEK280 base platform used in this tutorial. Also make sure you have downloaded the Common Images for Embedded Vitis Platforms from this link: https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/embedded-platforms/2023-2.html

The ‘common image’ package contains a prebuilt Linux kernel and root file system that can be used with the Versal board for embedded design development using Vitis. Before starting this tutorial, run the following steps:

1. Go to the directory where you have unzipped the Versal Common Image package.
2. In a Bash shell, run the `/Common Images Dir/xilinx-versal-common-v2024.1/environment-setup-cortexa72-cortexa53-xilinx-linux` script. This script sets up the `SDKTARGETSYSROOT` and `CXX` variables. If the script is not present, you must run `/Common Images Dir/xilinx-versal-common-v2024.1/sdk.sh`.
3. Set up your `ROOTFS` and `IMAGE` variables to point to the `rootfs.ext4` and `Image` files located in the `/Common Images Dir/xilinx-versal-common-v2024.1` directory.
4. Set up your `PLATFORM_REPO_PATHS` environment variable to `$XILINX_VITIS/lin64/Vitis/2024.1/base_platforms`.

This tutorial targets the VEK280 board with the 2024.1 release.

Data generation for this tutorial requires Python 3. The following packages are required:

- math
- sys
- numpy
- random

## Objectives

After completing this tutorial, you will be able to:

- Understand the differences between the AI Engine and AI Engine-ML architectures.
- Declare and use shared buffers (memory tiles).
- Declare and use external buffers (external memory).
- Program buffer descriptors using tiling parameters.

This tutorial is based on matrix multiplication, which is a common algorithm in machine learning applications.

## Prerequisite knowledge

To follow this tutorial, you need to understand the architecture of the *AI Engine-ML* as well as the art of buffer descriptor programming:

- A short introduction to the **AI Engine-ML** architecture is available here.
- The various memory levels contain DMAs used to transfer data to and from memory or Programmable Logic. These DMAs use Buffer Descriptors (BDs) that contain the parameters of these transfers. The best way to program these BDs is to use *Tiling Parameters*, which are introduced here.

## Matrix Multiplication

Matrix multiplication is a very common algorithm that can be found in numerous standard applications. The basic equation is:

```
$$ C = A \cdot B $$
$$ c_{ij} = \sum_{k=0}^{K-1} a_{ik} \, b_{kj}, \qquad 0 \le i < M, \quad 0 \le j < N $$
```

Natural storage for a matrix is row major: all columns of row 0 are stored sequentially in memory, then row 1, and so on up to the last row of the matrix. In the following image, the index in each box shows the increasing address:
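With the storage order described above (all elements of row 0 first, then row 1, and so on), element (i, j) of a matrix with `n_cols` columns sits at address `i * n_cols + j`. The following is a minimal Python sketch of the basic equation on flat lists stored this way. The helper names `flat_index` and `matmul_flat` are illustrative only and are not part of the tutorial sources.

```python
def flat_index(row, col, n_cols):
    """Address of element (row, col) when rows are stored one after another."""
    return row * n_cols + col


def matmul_flat(a, b, m, k, n):
    """Reference C = A.B on flat lists: A is m x k, B is k x n, C is m x n."""
    c = [0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0
            for kk in range(k):
                acc += a[flat_index(i, kk, k)] * b[flat_index(kk, j, n)]
            c[flat_index(i, j, n)] = acc
    return c


# Small 2x3 by 3x2 example
a = [1, 2, 3,
     4, 5, 6]
b = [7, 8,
     9, 10,
     11, 12]
print(matmul_flat(a, b, 2, 3, 2))  # [58, 64, 139, 154]
```

The same indexing applies to the 64x64 matrices of this tutorial: moving down one row advances the address by 64.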

## Taking advantage of *AI Engine-ML* architecture

The *AI Engine-ML* has specific hardware instructions for matrix multiplications. Depending on the bitwidth of the operands, various matrix sizes are supported. In the following table, the notation `MxKxN` means that matrix multiplication with a first operand of size M rows x K columns and a second operand of size K rows x N columns is supported.

### Matrix Multiplication modes for real types

| 8b x 4b | 8b x 8b | 16b x 8b | 8b x 16b | 16b x 16b | 32b x 16b | 16b x 32b | 32b x 32b | bfloat16 x bfloat16 |
|---|---|---|---|---|---|---|---|---|
| 4x16x8 | 4x8x4 | 4x4x4 | 4x4x8 | 4x4x4 | 2x4x8 | 2x4x8 | 4x2x4 | 4x8x4 |
| 8x16x8 | 4x16x4 | 8x4x4 | 4x4x4 | 2x4x8 | 4x4x4 | 4x4x4 | 4x2x4 | |
| 4x32x8 | 8x8x4 | 4x8x4 | | 4x4x8 | 4x2x4 | | 8x2x4 | |
| | 2x8x8 | 4x4x8 | | 4x2x8 | | | | |
| | 4x8x8 | | | | | | | |
| | 2x16x8 | | | | | | | |
| | 4x16x8 | | | | | | | |

### Matrix Multiplication modes for complex types

| c16b x 16b | c16b x c16b | c32b x c16b | c32b x c32b |
|---|---|---|---|
| 2x4x8 | 1x4x8 | 1x2x4 | 1x2x8 |
| 4x4x4 | 1x2x8 | | |
| | 2x2x8 | | |
| | 1x4x8 | | |
| | 2x4x8 | | |

In the example developed in this tutorial the 3 matrices A, B and C are all 64x64 with 8-bit data:

```
$$A_{64 \times 64} \cdot B_{64 \times 64} = C_{64 \times 64}$$
```

The mode `4x16x8` will be used, so we need to decompose matrix **A** into `4x16` sub-matrices and matrix **B** into `16x8` sub-matrices in order to compute **C** using `4x8` sub-results:

To use these matrix multiplication modes, one sub-matrix must be stored in a register and the other sub-matrix in another register. Unfortunately, when an AI Engine-ML reads memory, it reads 256 contiguous bits. Because the rows of a sub-matrix are not adjacent in a matrix stored row by row, multiple reads would be necessary to read a sub-matrix of the right size. A solution is to re-arrange the data so that each sub-matrix occupies contiguous memory addresses. The *adf* graph API provides a very handy way to do such data reordering.
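The cost of leaving the data in place can be estimated with a small Python sketch. It counts how many 256-bit (32-byte) reads are needed to fetch one `4x16` sub-matrix of 8-bit data from a 64-column matrix, first in its original layout and then after re-arrangement into contiguous addresses. This is a simplified model: it assumes every read is aligned on a 32-byte boundary, and the helper names are hypothetical.

```python
READ_BYTES = 32  # one 256-bit memory read (alignment is an assumption of this model)


def tile_addresses(row0, col0, tile_rows, tile_cols, n_cols):
    """Byte addresses of a tile of 8-bit elements in a matrix stored row after row."""
    return [(row0 + r) * n_cols + (col0 + c)
            for r in range(tile_rows)
            for c in range(tile_cols)]


def reads_needed(addresses, read_bytes=READ_BYTES):
    """Number of distinct aligned chunks of read_bytes bytes covering the addresses."""
    return len({addr // read_bytes for addr in addresses})


in_place = tile_addresses(0, 0, 4, 16, 64)   # 4x16 tile inside the 64x64 matrix
rearranged = list(range(4 * 16))             # same 64 bytes, stored contiguously

print(reads_needed(in_place), reads_needed(rearranged))  # 4 2
```

In this model the in-place tile needs one read per row, and half of each read is discarded, whereas the re-arranged tile is fetched with two fully used reads.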

Let’s first look at the architecture chosen for this small matrix multiplication application:

Multiple **A** and **B** matrices are stored in DDR and copied into a memory tile using ping-pong buffering. These matrices are then copied again to AI Engine-ML memory, also using ping-pong buffering. The kernel operates on the two stored matrices to compute the output **C** matrix. This matrix is then copied to a memory tile and then to DDR. Data reordering can be done either between DDR and the memory tile, or between the memory tile and AI Engine-ML memory. This tutorial uses the latter.

The goal of the reordering is to place the sub-matrices needed by the block-based matrix multiplication at adjacent addresses. As the resulting matrix **C** is computed block row by block row, the sub-blocks of matrix **A** are stored row by row and those of matrix **B** column by column. Computing the first block row of **C** requires reading the first block row of **A** 8 times and the full matrix **B**, block column by block column.
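The read counts quoted above can be checked with a short Python sketch that enumerates the block-level schedule for the default sizes (64x64 matrices, `4x16x8` mode). The loop nesting used here (block column of **C** outside, accumulation over K inside) is an assumption made for illustration, not the kernel's actual code, and `block_schedule` is a hypothetical helper.

```python
from collections import Counter


def block_schedule(size_m, size_k, size_n, sub_m, sub_k, sub_n, block_rows=None):
    """Yield (i, j, k): C block (i, j) accumulates A block (i, k) times B block (k, j)."""
    rows = size_m // sub_m if block_rows is None else block_rows
    for i in range(rows):                      # block row of C
        for j in range(size_n // sub_n):       # block column of C
            for k in range(size_k // sub_k):   # accumulation along K
                yield i, j, k


a_reads = Counter()
b_reads = Counter()
# Only the first block row of C (block_rows=1)
for i, j, k in block_schedule(64, 64, 64, 4, 16, 8, block_rows=1):
    a_reads[(i, k)] += 1
    b_reads[(k, j)] += 1

print(len(a_reads), set(a_reads.values()))  # 4 {8}
print(len(b_reads), set(b_reads.values()))  # 32 {1}
```

The four **A** blocks of the first block row are each read 8 times (once per block column of **C**), while each of the 32 blocks of **B** is read exactly once.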

First, the blocks must be extracted using the memory tile DMA and stored in the AI Engine-ML memory. The tiling has to occur when reading from the memory tile because it is currently impossible to provide a read or a write access pattern to the AI Engine-ML memory.

The first block, at the top left of the picture, is extracted first and stored row by row in the AI Engine-ML memory. The second block, starting with the column vector **(8, 72, 136, 200)**, is then extracted from the memory tile and stored in the AI Engine-ML memory. Finally, we obtain the following re-arrangement of the data:
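The extraction described above can be modelled in a few lines of Python. The sketch lists the source addresses in the order they land in the re-arranged buffer. The dimensions used in the example (a 64-column matrix cut into blocks of 4 rows by 8 columns) are inferred from the column vector (8, 72, 136, 200) quoted above and are an assumption about the picture. `tile_order` is a hypothetical helper.

```python
def tile_order(n_rows, n_cols, tile_rows, tile_cols):
    """Source addresses in the order they are written to the re-arranged buffer.

    Blocks are visited left to right, then top to bottom. Inside a block,
    elements are stored row by row.
    """
    order = []
    for tr in range(0, n_rows, tile_rows):
        for tc in range(0, n_cols, tile_cols):
            for r in range(tile_rows):
                for c in range(tile_cols):
                    order.append((tr + r) * n_cols + (tc + c))
    return order


order = tile_order(8, 64, 4, 8)
block_size = 4 * 8
second_block = order[block_size:2 * block_size]

# First column of the second block: one element every 8 (the block width)
print(second_block[0::8])  # [8, 72, 136, 200]
```

Each block occupies `block_size` consecutive addresses in the re-arranged buffer, which is what allows the kernel to load it with contiguous reads.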

## AI Engine-ML code analysis

This tutorial has been built to allow the user to easily change the matrix and sub-matrix sizes. Matrix **A** being of size **(M,K)** and matrix **B** of size **(K,N)**, the resulting matrix **C** has size **(M,N)**. The `Makefile` sets the default value of these sizes (`sizeM, sizeK, sizeN`) to 64. The sizes of the sub-matrices used by the AIE API are also defined (`subM, subK, subN`). All these values can be overridden on the `make` command line.

This part focuses on a straightforward implementation of the matrix multiplication, which is selected by the macro `OPTIMIZED_SOURCE = 0`. The `make` command is invoked using `make OPT=0 ...`, which is actually the default.

```
# Default values for A, B, C matrix sizes
# A:MxK B:KxN C:MxN
sizeM ?= 64
sizeK ?= 64
sizeN ?= 64
# Default for A, B and C sub matrices
# 4x16x8
subM ?= 4
subK ?= 16
subN ?= 8
#Default Number of iterations
NIterations ?= 16
```

The `system_settings.h` header file defines all the sizes that will be used internally by the kernel:

```
// Multiply 2 matrices (MxK) x (KxN)
#define A_ROWS sizeM
#define A_COLS sizeK
#define B_ROWS A_COLS
#define B_COLS sizeN
#define C_ROWS A_ROWS
#define C_COLS B_COLS
// Non Sparse Tiling: 4x16x8
#define ATILES_ROWS_NS subM
#define ATILES_COLS_NS subK
#define BTILES_ROWS_NS ATILES_COLS_NS
#define BTILES_COLS_NS subN
#define CTILES_ROWS_NS ATILES_ROWS_NS
#define CTILES_COLS_NS BTILES_COLS_NS
```

As explained in the previous section, the matrices are transferred from DDR to the memory tile without any change, and then from the memory tile to *AI Engine-ML* memory with a reordering of the data that makes them easier for the kernel to read.

Even though the write access pattern to the memory tile on the input side, as well as the read access pattern on the output side, is just linear contiguous addressing, it needs to be specified in the graph. All these tiling parameters are defined in the file `tiling_parameters.h`. Let’s look at these parameters for the input matrix **A**:

```
adf::tiling_parameters WriteAns_pattern = {
    .buffer_dimension={A_COLS,A_ROWS},
    .tiling_dimension={A_COLS,1},
    .offset={0,0},
    .tile_traversal={
        {.dimension=1, .stride=1, .wrap=A_ROWS}
    }
};
adf::tiling_parameters ReadAns_pattern = {
    .buffer_dimension={A_COLS,A_ROWS},
    .tiling_dimension={ATILES_COLS_NS,ATILES_ROWS_NS},
    .offset={0,0},
    .tile_traversal={
        {.dimension=0, .stride=ATILES_COLS_NS, .wrap=A_COLS/ATILES_COLS_NS},
        {.dimension=1, .stride=ATILES_ROWS_NS, .wrap=A_ROWS/ATILES_ROWS_NS}
    }
};
```

The matrix is a 2D data set: dimension 0 is the number of columns and dimension 1 is the number of rows. When writing to the memory tile, data is stored row by row in memory, with dimension 0 (the column index) varying fastest. The read access of matrix **A** is completely different: the data is read block by block, each block being a sub-matrix of the API matrix multiplication, and the blocks are traversed along dimension 0 first, then dimension 1. For matrix **B**, it is the same except that the blocks are traversed along dimension 1 first, then dimension 0. Matrix **C** is written block by block, traversing dimension 0 first, then dimension 1. The following animated GIF shows the order in which the various **A**, **B** and **C** blocks are read and written to memory.
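To make the two access patterns for matrix **A** concrete, here is a simplified 2D Python model that turns such parameters into a sequence of element addresses. It is only a sketch under stated assumptions: the first `tile_traversal` entry is treated as the innermost tile loop, elements inside a tile are visited with dimension 0 fastest, and padding or out-of-bounds handling is ignored. The function `traversal_addresses` is hypothetical and is not part of the adf API.

```python
def traversal_addresses(buffer_dim, tiling_dim, offset, tile_traversal):
    """Element addresses visited by a simplified 2D tiling pattern.

    buffer_dim, tiling_dim and offset are (dim0, dim1) pairs, dim0 varying
    fastest in memory. tile_traversal is a list of (dimension, stride, wrap)
    tuples; entry 0 is assumed to be the innermost tile loop.
    """
    def tile_origins(level, origin):
        # Recurse from the outermost traversal entry down to entry 0
        if level < 0:
            yield tuple(origin)
            return
        dim, stride, wrap = tile_traversal[level]
        for step in range(wrap):
            nxt = list(origin)
            nxt[dim] += step * stride
            yield from tile_origins(level - 1, nxt)

    addresses = []
    for o0, o1 in tile_origins(len(tile_traversal) - 1, list(offset)):
        for y in range(tiling_dim[1]):
            for x in range(tiling_dim[0]):
                addresses.append((o1 + y) * buffer_dim[0] + (o0 + x))
    return addresses


# Default sizes: A is 64x64, sub-matrices are 4 rows x 16 columns
write_a = traversal_addresses((64, 64), (64, 1), (0, 0), [(1, 1, 64)])
read_a = traversal_addresses((64, 64), (16, 4), (0, 0),
                             [(0, 16, 4), (1, 4, 16)])

print(write_a[:4], read_a[14:18])  # [0, 1, 2, 3] [14, 15, 64, 65]
```

In this model the write pattern is plain linear addressing, while the read pattern jumps to the next matrix row after 16 elements and delivers each `4x16` block as 64 consecutive elements.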