Versal GeMM Implementation Using Vitis Acceleration Library and DSP58 Tutorial
Version: Vitis 2024.1
Introduction
Versal™ adaptive SoCs combine programmable logic (PL), processing system (PS), and AI Engines with leading-edge memory and interfacing technologies to deliver powerful heterogeneous acceleration for any application. The hardware and software are targeted for programming and optimization by data scientists and software and hardware developers. A host of tools, software, libraries, IP, middleware, and frameworks enable Versal adaptive SoCs to support all industry-standard design flows.
This tutorial performs two implementations of a system-level design: one with AI Engine, and the other with RTL using the DSP Engines. In each implementation, the tutorial takes you through the hardware emulation and hardware flow in the context of a complete Versal adaptive SoC system design.
A Makefile is provided for each implementation. It can be used to build the design for the cint16 datatype, for matrix dimensions 32x32x32 (Matrix A, B, and C dimensions of 32x32), 64x64x64, 128x128x128, 256x256x256, 512x512x512, and 1024x1024x1024, and for different targets (hw_emu and hw).
The design documentation describes the hardware and software design details, including the methodology and functional partitioning of each implementation, along with the compilation, execution, and measurement steps and the resulting observations.
Objectives
After completing the tutorial, you should be able to:
Develop a system-level GeMM design by identifying an algorithm and deploying it on AI Engines or PL and DSP Engines.
Build a complete system design by going through the following steps in the Vitis flow:
Create the AI Engine Adaptive Data Flow API (ADF) graph (see the graph sketch after this list).
Compile the A72 host application and the PL kernels.
Use the Vitis compiler (V++) to link the AI Engine and HLS kernels with the platform.
Package the design.
Run the design through the hardware emulation flow, which uses a mixed SystemC/RTL cycle-accurate, QEMU-based simulator, and through the hardware flow.
Understand the graph control APIs for the AI Engine implementation and the host APIs for controlling the HLS/PL kernels (see the host-side sketch after this list).
Understand the methodological differences between a design created using AI Engines and a design created using PL and DSP Engines.
Understand metrics including utilization, performance/throughput, and power across GeMM design instances of different matrix dimensions.
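As a concrete reference for the graph-creation and graph-control steps above, the following is a minimal, single-kernel ADF graph sketch with its init/run/end control calls. It is not the tutorial's 32-core overlay; the kernel name, source path, and PLIO data files are placeholders for illustration only.

```cpp
// gemm_graph.h -- minimal single-kernel ADF graph sketch (placeholder names).
#include <adf.h>
using namespace adf;

// Hypothetical AI Engine kernel; its body would live under design/aie_src.
void gemm_kernel(input_stream<cint16>* matA,
                 input_stream<cint16>* matB,
                 output_stream<cint16>* matC);

class GemmGraph : public graph {
public:
    kernel      k;
    input_plio  a_in, b_in;
    output_plio c_out;

    GemmGraph() {
        k = kernel::create(gemm_kernel);
        source(k)         = "aie_src/gemm_kernel.cc";   // placeholder path
        runtime<ratio>(k) = 0.9;

        a_in  = input_plio::create("matA_in",  plio_64_bits, "aiesim_data/matA.txt");
        b_in  = input_plio::create("matB_in",  plio_64_bits, "aiesim_data/matB.txt");
        c_out = output_plio::create("matC_out", plio_64_bits, "aiesim_data/matC.txt");

        connect<>(a_in.out[0], k.in[0]);
        connect<>(b_in.out[0], k.in[1]);
        connect<>(k.out[0],    c_out.in[0]);
    }
};

GemmGraph gr;

#if defined(__AIESIM__) || defined(__X86SIM__)
int main(void) {
    gr.init();   // configure and load the graph
    gr.run(1);   // run one graph iteration
    gr.end();    // wait for completion and release resources
    return 0;
}
#endif
```

On the A72, the host application typically drives the PL data movers and the AI Engine graph through the XRT native C++ API. A hedged sketch, assuming a PL kernel named dma_hls and a graph instance named gr (both placeholders, not the tutorial's actual names):

```cpp
// host_sketch.cpp -- XRT native API sketch; kernel/graph names and buffer
// sizes are placeholders, not the tutorial's actual host application.
#include <cstdint>
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>
#include <experimental/xrt_graph.h>

int main(int argc, char* argv[]) {
    xrt::device device(0);
    auto uuid = device.load_xclbin(argv[1]);            // e.g. ./host.exe a.xclbin

    // Control a PL/HLS data-mover kernel.
    xrt::kernel dma(device, uuid, "dma_hls");            // placeholder kernel name
    xrt::bo in_buf(device, 32 * 32 * sizeof(int32_t), dma.group_id(0));
    in_buf.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    auto dma_run = dma(in_buf, 32 * 32);                 // start the data mover

    // Control the AI Engine graph.
    xrt::graph g(device, uuid, "gr");                    // placeholder graph name
    g.run(1);                                            // one graph iteration
    g.wait();

    dma_run.wait();                                      // wait for the PL kernel
    return 0;
}
```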
Design Overview
AIE
In this design, the multiplication of two square matrices (MatA and MatB) is performed using a 32-AI Engine core overlay. MatA is divided into 8x4 blocks and MatB into 4x8 blocks. MatA is fed in one 1x4 block at a time over 4 input streams, and MatB is fed in over 32 input streams, one per 4x8 block. The output matrix MatC is divided into 8x8 blocks and is produced one 1x8 block at a time over 8 output streams. A 32-core overlay is chosen so that the overlay stays the same across all matrix dimensions, from 32x32x32 through 1024x1024x1024, while keeping performance high.
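As a functional reference, the following plain C++ sketch shows the matrix product that the 32-core overlay computes for the cint16 datatype. It is a behavioral model only; the blocking, stream ordering, and accumulator handling in the actual AIE kernels may differ, and cint16 is modeled here with a local struct.

```cpp
// gemm_ref.cpp -- behavioral reference for C = A x B with complex int16 data.
#include <cstdint>
#include <vector>

struct cint16 { int16_t re; int16_t im; };   // stand-in for the AIE cint16 type
struct cint32 { int32_t re; int32_t im; };   // wider accumulator

// C[M x N] = A[M x K] * B[K x N]; all matrices are stored row-major.
void gemm_ref(const std::vector<cint16>& A, const std::vector<cint16>& B,
              std::vector<cint32>& C, int M, int K, int N)
{
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            cint32 acc{0, 0};
            for (int k = 0; k < K; ++k) {
                const cint16& a = A[m * K + k];
                const cint16& b = B[k * N + n];
                acc.re += int32_t(a.re) * b.re - int32_t(a.im) * b.im;  // complex MAC
                acc.im += int32_t(a.re) * b.im + int32_t(a.im) * b.re;
            }
            C[m * N + n] = acc;
        }
    }
}
```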
DSP
In this design, matrix multiplication is implemented using a systolic array of 1024 DSP58 engines, organized as 32 DSP58 cascade chains of 32 DSP58s each. Matrix-matrix multiplication is decomposed into matrix-vector multiplications: one Matrix B column vector is multiplied by each row of Matrix A. This is achieved by broadcasting the Matrix B column vector to the DSPs at the same position in each cascade chain, while all 1K elements of Matrix A are read and each element drives the A port of one DSP58. One cascade chain implements one row-vector by column-vector multiplication, and the operation completes in 32 clock cycles.
Thus, a 32x32 matrix is the basic matrix multiplication unit. Larger matrices are broken down into 32x32 submatrices, and each 32x32 submatrix of Matrix A is multiplied with the corresponding submatrices of Matrix B. For larger matrix multiplications, partial sums need to be stored, read back, added to the new values, and stored back.
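The 32x32 tiling with partial-sum accumulation can be summarized with the plain C++ sketch below. It is a behavioral model, not the RTL: tile_mm_32x32() stands in for what the DSP58 array produces for one 32x32 tile product, and real int16 data is used for simplicity even though the tutorial's design operates on cint16.

```cpp
// gemm_tiled.cpp -- behavioral model of the 32x32 block decomposition.
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int TILE = 32;

// One tile product: C_tile += A_tile * B_tile, with the "+=" mirroring the
// partial sums that are stored, read back, added, and stored back in hardware.
void tile_mm_32x32(const int16_t* A, const int16_t* B, int32_t* C,
                   int lda, int ldb, int ldc)
{
    for (int i = 0; i < TILE; ++i)
        for (int j = 0; j < TILE; ++j) {
            int32_t acc = 0;
            for (int k = 0; k < TILE; ++k)
                acc += int32_t(A[i * lda + k]) * B[k * ldb + j];
            C[i * ldc + j] += acc;
        }
}

// N x N GeMM built from 32x32 tiles (N must be a multiple of 32).
void gemm_tiled(const std::vector<int16_t>& A, const std::vector<int16_t>& B,
                std::vector<int32_t>& C, int N)
{
    std::fill(C.begin(), C.end(), 0);
    for (int bi = 0; bi < N; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE)
            for (int bk = 0; bk < N; bk += TILE)
                tile_mm_32x32(&A[bi * N + bk], &B[bk * N + bj],
                              &C[bi * N + bj], N, N, N);
}
```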
Directory Structure
GeMM_AIEvsDSP
|__AIE......................contains AI Engine implementation
| |Makefile....................with recipes for each step of the design compilation
| |images......................contains images used for AI Engine Design documentation
| |description.json............required for internal regression
| |multi_params.json...........required for internal regression
| |build.......................created and contains subfolders from design build
| |design......................contains source and include files
| | |aie_src....................contains all the aie source files and aiesimulator input files
| | | |aiesim_data.................contains all the files for the aiesimulator input
| | |pl_src.....................contains all the data mover source files
| | |host_app_src...............contains host application source files
| | |system_configs.............contains all system configuration files
| | |profiling_configs..........contains xrt.ini file
| | |exec_files.................contains hw_emu launch script
| | |vivado_metrics_scripts.....contains script for reporting utilization and power from Vivado
|__DSP......................contains DSP implementation targeting DSP Engines
| |Makefile....................with recipes for each step of the design compilation
| |images......................contains images used for DSP Design documentation
| |description.json............required for XOAH
| |multi_params.json...........required for XOAH
| |build.......................created and contains subfolders from design build
| |design......................contains source and include files
| | |pl_src.....................contains all GeMM and data mover source files
| | |host_app_src...............contains host application source files
| | |system_configs.............contains all system configuration files
| | |profiling_configs..........contains xrt.ini file
| | |exec_files.................contains hw_emu launch script
| | |vivado_metrics_scripts.....contains script for reporting utilization and power from Vivado