Vitis Tutorials: AI Engine

Document ID: XD100
Release Date: 2024-06-19
Version: 2024.1 English

Table of Contents

Building the Design

Hardware Design Details

Software Design Details

Performance Details

Building the Design

Design Build

In this section, you learn to build and run the Matrix Multiplication design using the DSP58 engines in a Versal device. You compile the design and integrate it into a larger system design (including the PS host application).

The Makefile used to build the design takes two user inputs from the command line: TARGET (hw/hw_emu) and GEMM_SIZE (32, 64, 128, 256, 512, or 1024).

Based on these inputs, the design flow generates a new directory (called build/). Underneath it, a subdirectory named gemm_GEMM_SIZExGEMM_SIZExGEMM_SIZE is created. For example, if GEMM_SIZE is given as 64, a subdirectory named gemm_64x64x64 is created under the build directory. Within that subdirectory, hw_emu/ and/or hw/ subfolders are created. These folders contain a host application executable and the builds targeted to hardware emulation (hw_emu/) or a hardware run on a VCK190 board (hw/), respectively.
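
For example, a hardware emulation build with GEMM_SIZE=64 produces a layout along these lines (a sketch; the exact contents depend on which make steps you run):

build/
└── gemm_64x64x64/
    ├── hw_emu/
    │   ├── gemm_dsp_xrt.elf    (host application)
    │   └── package/            (emulation package and launch scripts)
    └── hw/                     (created for TARGET=hw builds)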

Make Steps

To run the following make steps (for example, make kernels, make xsa, make application, and make package), you must be in the gemm_dsp58/ folder. The following options can be specified in the make steps. Instructions for how to apply them are provided later in this section.

TARGET: This option can be set to hw or hw_emu to build the design in the hardware or hardware emulation flow. The default is hw_emu.

GEMM_SIZE: This option can be set to 32, 64, 128, 256, 512, or 1024. The default is 64.
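
For example, to build the PL kernel for hardware with 128x128x128 matrices:

make kernels TARGET=hw GEMM_SIZE=128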

The Makefile uses the following directory references:

## Relative directory
RELATIVE_PROJECT_DIR := ./
PROJECT_REPO := $(shell readlink -f $(RELATIVE_PROJECT_DIR))
DESIGN_REPO  := $(PROJECT_REPO)/design
PL_SRC_REPO  := $(DESIGN_REPO)/pl_src
CONSTRAINTS_REPO  := $(PL_SRC_REPO)/constraints
HOST_APP_SRC := $(DESIGN_REPO)/host_app_src
SYSTEM_CONFIGS_REPO    := $(DESIGN_REPO)/system_configs
VIVADO_METRICS_SCRIPTS_REPO := $(DESIGN_REPO)/vivado_metrics_scripts

BASE_BLD_DIR := $(PROJECT_REPO)/build_$(PL_FREQ)
GEMM_BLD_DIR     := $(BASE_BLD_DIR)/gemm_$(MAT_DIMS)
BUILD_TARGET_DIR := $(GEMM_BLD_DIR)/$(TARGET)

VIVADO_REPORTS_REPO := $(PROJECT_REPO)/vivado_reports_dir
BLD_VIVADO_REPORTS_DIR := $(VIVADO_REPORTS_REPO)/gemm_$(MAT_DIMS)

EMBEDDED_PACKAGE_OUT := $(BUILD_TARGET_DIR)/package
EMBEDDED_EXEC_SCRIPT := run_script.sh
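
Note: the MAT_DIMS variable referenced above is not defined in this excerpt; presumably it is derived from GEMM_SIZE, along these lines (a sketch, not the tutorial's exact Makefile):

# Hypothetical derivation of MAT_DIMS from GEMM_SIZE, so that the build
# subdirectory gemm_$(MAT_DIMS) expands to, for example, gemm_64x64x64.
GEMM_SIZE ?= 64
MAT_DIMS  := $(GEMM_SIZE)x$(GEMM_SIZE)x$(GEMM_SIZE)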

Build the Entire Design with a Single Command

If you are already familiar with Vitis kernel compilation flows, you can build the entire design with one command:

make run (default TARGET=hw_emu, GEMM_SIZE=64) 

or

make run TARGET=hw (Target is hardware, GEMM_SIZE=64)

This command runs the make kernels, make xsa, make application, make package, and make run_emu steps, targeting hardware emulation or a hardware run (on a VCK190 board) depending on the TARGET you specify. The same settings also apply to the individual make steps listed below.

The generated files are placed under an individual directory: $(BUILD_TARGET_DIR)/. Each make step to build the design is specified in the following sections. These sections also detail the options used and the location of input and output files in each case.

See this page for a detailed description of all Vitis compiler switches. The following table provides a summary of the switches used.

Switch Description
--target | -t [hw|hw_emu]: Specifies the build target.
--platform | -f <arg>: Specifies the name of a supported acceleration platform as specified by the $PLATFORM_REPO_PATHS environment variable, or the full path to the platform XPFM file.
--save-temps | -s: Directs the v++ command to save intermediate files/directories created during the compilation and link process. Use the --temp_dir option to specify a location to write the intermediate files to.
--temp_dir <arg>: Allows you to manage the location where the tool writes temporary files created during the build process. The temporary results are written by the Vitis compiler and then removed, unless the --save-temps option is also specified.
--verbose: Displays verbose/debug information.
--compile | -c: Required for compilation to generate XO files from kernel source files.
--kernel <arg> | -k <arg>: Compiles only the specified kernel from the input file. Only one -k option is allowed per v++ command.
-D | --define <Macro Name>=<value>: Defines macros for the compiler.
--output | -o <arg>: Specifies the name of the output file generated by the v++ command. The kernel output should be XO.

The following RTL files are used in this design:

${PL_SRC_REPO}/rtl/BDELAY.vhd
${PL_SRC_REPO}/rtl/FIXGEMM.vhd
${PL_SRC_REPO}/rtl/SDELAY.vhd
${PL_SRC_REPO}/rtl/sfixed_pkg.vhd
${PL_SRC_REPO}/rtl/cfixed_pkg.vhd
${PL_SRC_REPO}/rtl/DSP_GW.vhd
${PL_SRC_REPO}/rtl/FIXGEMM_WRAPPER.vhd
${PL_SRC_REPO}/rtl/control_logic.sv
${PL_SRC_REPO}/rtl/gemm_top.sv
${PL_SRC_REPO}/rtl/ps_slave.sv
${PL_SRC_REPO}/rtl/DSP_data_controller.sv
${PL_SRC_REPO}/rtl/op_uram.sv
${PL_SRC_REPO}/rtl/row_uram.sv
${PL_SRC_REPO}/rtl/col_uram.sv
${PL_SRC_REPO}/rtl/gemm_large_ocm.sv
${PL_SRC_REPO}/rtl/partial_sum_bram.sv
${PL_SRC_REPO}/rtl/synchronizer.sv

$(CONSTRAINTS_REPO)/gemm_dsp58.tcl provides constraints for synthesis and implementation.

The following is the output XO file:

$(PROJECT_REPO)/build/gemm_GEMM_SIZExGEMM_SIZExGEMM_SIZE/gemm_large_ocm.xo

make kernels: Generates the PL Kernels

This step uses the RTL files and mem_init_files specified above to generate the PL kernel (gemm_large_ocm.xo).

make xsa: Using the Vitis Tools to Link PL Kernels with the Platform
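
This step links the PL kernel XO with the target platform to produce the XCLBIN. The exact command lives in the Makefile; a representative v++ link invocation might look like the following sketch (the config file name here is an assumption; the platform and output names follow the package step below):

cd $(BUILD_TARGET_DIR); \
v++ -l -t hw_emu --save-temps --temp_dir $(BUILD_TARGET_DIR)/_x \
   -f xilinx_vck190_base_202410_1 \
   --config $(SYSTEM_CONFIGS_REPO)/gemm.cfg \
   -o $(BUILD_TARGET_DIR)/gemm.hw_emu.xclbin \
   $(BUILD_TARGET_DIR)/gemm_large_ocm.xo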

make application: Compile the Host Application

You can compile the host application by following the typical cross-compilation flow for the Cortex-A72 processor. To build the application, run the following command:

make application 

or

cd $(BUILD_TARGET_DIR); \
aarch64-xilinx-linux-g++ -mcpu=cortex-a72.cortex-a53 -march=armv8-a+crc -fstack-protector-strong \
   -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=$(SDKTARGETSYSROOT) -O -c \
   -std=c++14 -D__linux__ \
   -DM_LARGE=$(GEMM_SIZE) -DN_LARGE=$(GEMM_SIZE) -DL_LARGE=$(GEMM_SIZE) \
   -I$(SDKTARGETSYSROOT)/usr/include/xrt -I$(SDKTARGETSYSROOT)/usr/include -I$(SDKTARGETSYSROOT)/usr/lib -I$(HOST_APP_SRC)/$(MAT_DIMS) \
$(HOST_APP_SRC)/main.cpp -o $(BUILD_TARGET_DIR)/gemm_top_app.o \
   -L$(SDKTARGETSYSROOT)/lib -lxrt_coreutil

aarch64-xilinx-linux-g++  -mcpu=cortex-a72.cortex-a53 -march=armv8-a+crc -fstack-protector-strong \
   -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=$(SDKTARGETSYSROOT) \
   $(BUILD_TARGET_DIR)/gemm_top_app.o -L$(SDKTARGETSYSROOT)/usr/lib -lxrt_coreutil \
   -o $(BUILD_TARGET_DIR)/gemm_dsp_xrt.elf

See this page for XRT documentation. See this page for details of host application programming.

Switch Description
-O: Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function. With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
-D__linux__: Defines the __linux__ macro (see -D below).
-DXAIE_DEBUG: Enables debug interface capabilities where certain core status, event status, or stack trace can be dumped out.
-D<Pre-processor Macro String>=<value>: Passes pre-processor macro definitions to the cross-compiler.
-I <dir>: Adds the directory <dir> to the list of directories to be searched for header files.
-o <file>: Places output in <file>. This applies regardless of the output being produced, whether it is an executable file, an object file, an assembler file, or preprocessed C code.
--sysroot=<dir>: Uses <dir> as the logical root directory for headers and libraries. For example, if the compiler normally searches for headers in /usr/include and libraries in /usr/lib, it instead searches <dir>/usr/include and <dir>/usr/lib. This is set automatically by the env_setup.sh script.
-l<library>: Searches the library named <library> when linking. This tutorial requires the xrt_coreutil library.
-L <dir>: Adds directory <dir> to the list of directories to be searched for -l.

The following is a description of the input sources compiled by the cross-compiler command.

Input Sources Description
$(HOST_APP_SRC)/main.cpp: Source application file for the gemm_dsp_xrt.elf that runs on the A72 processor.
$(HOST_APP_SRC)/matrix_A_data.h, matrix_B_data.h: Matrix A and B data used for the matrix multiplication.
$(HOST_APP_SRC)/output_data.h: Golden data against which the DUT output is compared.
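
The structure of main.cpp is not reproduced on this page. As an illustration only, a minimal XRT host flow for a design like this might look like the following sketch (the kernel name, argument handling, and result check here are assumptions, not the tutorial's actual code):

// Minimal XRT host-flow sketch (illustrative only; not the tutorial's main.cpp).
// Assumes the xclbin path is passed as argv[1] and the PL kernel is named
// "gemm_large_ocm" (both are assumptions for illustration).
#include <iostream>
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <xclbin>\n";
        return 1;
    }
    xrt::device device(0);                    // open the first Versal device
    auto uuid = device.load_xclbin(argv[1]);  // program the PL with the xclbin
    auto gemm = xrt::kernel(device, uuid, "gemm_large_ocm"); // look up the kernel

    auto run = gemm(/* kernel arguments would go here */);
    run.wait();                               // block until the kernel finishes

    // A real host application would read back the results and compare them
    // against the golden data in output_data.h, printing TEST PASSED on a match.
    std::cout << "TEST PASSED\n";
    return 0;
}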

The following is a description of the output objects that result from executing the cross-compiler command with the above inputs and options.

Output Objects Description
$(BUILD_TARGET_DIR)/gemm_dsp_xrt.elf: The executable that runs on the A72 processor.

make package: Packaging the Design

With the kernel outputs and the platform in place, you can now generate the programmable device image (PDI) and a package to be used on an SD card. The PDI contains all the executables, bitstreams, and configurations of the device. The packaged SD card directory contains everything needed to boot Linux, the generated applications, and the XCLBIN.

The command to run this step is as follows (default TARGET=hw_emu):

make package

or

cp $(PROJECT_REPO)/run_script.sh $(BUILD_TARGET_DIR)/
cd $(BUILD_TARGET_DIR); \
v++ -p -t hw --save-temps --temp_dir $(BUILD_TARGET_DIR)/_x -f xilinx_vck190_base_202410_1 \
   --package.rootfs $(XLNX_VERSAL)/rootfs.ext4 --package.kernel_image $(XLNX_VERSAL)/Image --package.boot_mode=sd \
   --package.out_dir $(BUILD_TARGET_DIR)/package --package.image_format=ext4 --package.sd_file $(BUILD_TARGET_DIR)/gemm_dsp_xrt.elf \
   $(BUILD_TARGET_DIR)/gemm.hw.xclbin

If XRT_ROOT is set, the following Vitis compiler flag is also set:

   --package.sd_dir $(XRT_ROOT)

See this page for more details about packaging the system.

Switch Description
--target | -t [hw|hw_emu]: Specifies the build target.
--package | -p: Packages the final product at the end of the Vitis compile and link build process.
--package.rootfs <arg>: Where <arg> specifies the absolute or relative path to a processed Linux root file system file. The platform RootFS file is available for download from xilinx.com. Refer to the Vitis Software Platform Installation for more information.
--package.kernel_image <arg>: Where <arg> specifies the absolute or relative path to a Linux kernel image file. Overrides the existing image available in the platform. The platform image file is available for download from xilinx.com. Refer to the Vitis Software Platform Installation for more information.
--package.boot_mode <arg>: Where <arg> specifies the boot mode used for running the application in emulation or on hardware.
--package.image_format <arg>: Where <arg> specifies the output image file format (ext4 or fat32). ext4 is the Linux file system format, and fat32 is the Windows file system format.
--package.sd_file <arg>: Where <arg> specifies an ELF or other data file to package into the sd_card directory/image. This option can be used repeatedly to specify multiple files to add to the sd_card.

Input Sources Description
$(XRT_ROOT): The PS host application needs the XRT headers in this folder to execute. Set in env_setup.sh.
$(XLNX_VERSAL)/rootfs.ext4: The root file system file for PetaLinux.
$(XLNX_VERSAL)/Image: The pre-built PetaLinux image the processor boots from.
$(BUILD_TARGET_DIR)/gemm_dsp_xrt.elf: The PS host application executable created in the make application step.
$(BUILD_TARGET_DIR)/gemm.hw_emu.xclbin: The XCLBIN file created in the make xsa step.

The output of the v++ package step is the package directory containing the contents needed to run hardware emulation.

Output Objects Description
$(BUILD_TARGET_DIR)/package: The hardware emulation package that contains the boot file, the hardware emulation launch script, the PLM and PMC boot files, the PMC and QEMU command argument specification files, and the Vivado simulation folder.

make run_emu: Running Hardware Emulation

After packaging, everything is set to run hardware emulation. To run emulation, use the following command (default TARGET=hw_emu):

make run_emu 

or

To launch hardware emulation manually, go to the package directory:

cd $(BUILD_TARGET_DIR)/package

and run:

./launch_hw_emu.sh  (or ./launch_hw_emu.sh -g for the waveform viewer)

When hardware emulation is launched, you see the QEMU simulator load. Wait for the autoboot countdown to go to zero. After a few minutes, the root Linux prompt comes up:

root@versal-rootfs-common-2024.1:~#

After the root prompt comes up, run the following commands to run the design:

cd /mnt
export XILINX_XRT=/usr
./gemm_dsp_xrt.elf a.xclbin

The gemm_dsp_xrt.elf executes. After a few minutes, you should see TEST PASSED on the console. When it appears, use the following keyboard sequence to exit the QEMU instance:

# To exit the QEMU simulation:
Press Ctrl+A, release the keys, and then press x.

To run with waveform, do the following:

cd $(BUILD_TARGET_DIR)/package
./launch_hw_emu.sh -g

The XSIM Waveform Viewer is launched. Drag and drop the signals into the viewer and click Play to start the emulation. Go back to the terminal and wait for the Linux prompt to come up. In the XSIM Waveform Viewer, you see the signals you added change over the execution of the design. When the run is done, click the pause button and close the window to end the emulation. Note: a data-integrity mismatch can occur in hardware emulation due to a software issue; the design works correctly in a hardware run.

TARGET=hw: Running on Hardware

To run the design on hardware, rerun the following make steps with TARGET=hw and any other applicable options (see the preceding make steps).

make kernels TARGET=hw 
make xsa TARGET=hw 
make application TARGET=hw
make package TARGET=hw 

These commands create a $(BUILD_TARGET_DIR) folder with the kernels, xsa, and package for a hardware run.

Run the following step to set up the execution file, generated images, and base images ($(BUILD_TARGET_DIR)/package/sd_card and $(BUILD_TARGET_DIR)/package/sd_card.img).

make run_emu TARGET=hw 

Follow steps 1 through 9 to run the gemm_dsp_xrt.elf executable on your VCK190 board.

Step 1. Ensure your board is powered off.

Step 2. Use an SD card writer (such as balenaEtcher) to flash the sd_card.img file to an SD card.

Step 3. Plug the flashed SD card into the top slot of the VCK190 board.

Step 4. Set the switch SW1 to Mode[3:0] = 1110 (OFF OFF OFF ON).

Step 5. Connect your computer to the VCK190 board using the USB cable included with the board.

Step 6. Open a TeraTerm terminal and select the correct COM port. Set the port settings to the following:

Port: <COMMXX>
Speed: 115200
Data: 8 bit
Parity: none
Stop Bits: 1 bit
Flow control: none
Transmit delay: 0 msec/char 0 msec/line

Step 7. Power on the board.

Step 8. Wait until you see the root@versal-rootfs-common-2024_1 Linux command prompt. Press Enter a few times to get past any xinit errors.

Step 9. Run the following commands in the TeraTerm terminal:

mount /dev/mmcblk0p1 /mnt
cd /mnt
export XILINX_XRT=/usr

./gemm_dsp_xrt.elf a.xclbin

Hardware Design Details

Matrix Multiplication using DSP58 Implementation Architecture

In this design, matrix multiplication is implemented using a 32x32 systolic array of DSP58s; that is, there are 32 DSP58 cascade chains, each containing 32 DSP58s. A 32x32 matrix multiplication is therefore the basic operation, and larger matrices are broken down into submatrices of size 32x32.

The basic 32x32 multiplication is performed as follows:

Matrix A row data moves upward along the DSP A-port cascade chain. For the first 32 clocks, data is only shifted into the DSP chains. After 32 clocks, row 0 of matrix A is populated in the first DSP cascade chain, row 1 in the next cascade chain, and so on, as shown in the following diagram.

Image of Matrix A data movement

Calculating First Row of Output Matrix

After the matrix A elements are shifted into the cascade chains, the last row of matrix B is driven, one element per clock, into the bottom-most DSP of the first cascade chain, as shown in the following diagram.

Image of Matrix B data movement

The first row of the output matrix is calculated as follows:

The bottom-most DSP calculates A[0,31] * B[31,0] and sends the result to the DSP above it through the PCOUT cascade port. On the 2nd clock, that DSP starts receiving B[30,0], B[30,1], ..., B[30,31] (that is, row 30 of matrix B). So, on the 2nd clock, the 2nd DSP calculates A[0,30] * B[30,0] + PCOUT = A[0,30] * B[30,0] + A[0,31] * B[31,0] and sends it up to the 3rd DSP. The 3rd DSP starts receiving row 29 of matrix B on the 3rd clock, computes the 3rd MAC operation, and sends the result up to the 4th DSP. Thus, after the 32nd clock, the top DSP has generated the row 0, column 0 element of the output matrix.
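
In other words, the top of the chain accumulates the full dot product for that element, gathered from the highest index downward:

C[0,0] = A[0,31]*B[31,0] + A[0,30]*B[30,0] + ... + A[0,0]*B[0,0]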

On the 2nd clock, the bottom DSP receives B[31,1] and calculates A[0,31] * B[31,1], which is the beginning of the MAC operation for the row 0, column 1 element of the output matrix. The row 0, column 1 calculation traverses upward in a similar way, and on the 33rd clock, the top DSP generates the row 0, column 1 element of the output matrix.

Similarly, over the next 30 clocks (that is, clocks 34 to 63), the top DSP of the first cascade chain generates the other 30 elements of row 0 of the output matrix.

The other rows of the output matrix are calculated as follows:

Row 31 of matrix B (B[31,0], B[31,1], ..., B[31,31]) is shifted to the next DSP chain every clock, so driving the matrix A rows into subsequent DSP chains also starts with a one-clock delay. The bottom DSP of the 2nd cascade chain therefore starts on the 2nd clock and computes A[1,31] * B[31,0], which is the beginning of the MAC operation for the row 1, column 0 element of the output matrix. The 2nd cascade chain is thus delayed one clock with respect to the first cascade chain and generates its 32 outputs from clock 33 to 64; these outputs are row 1 of the output matrix. Each subsequent cascade chain is delayed one clock with respect to the previous chain, so the last cascade chain generates the row 31 outputs on clocks 63 to 94.

32x32 Matrix Multiplication Latency

For the first 32 clocks, row 0 of matrix A is loaded into the first cascade chain. Over the next 32 clocks, the first cascade chain calculates the first row of the output matrix, and over the following 32 clocks, the remaining rows of the output matrix are generated. However, after 64 clocks, the first DSP cascade chain can begin receiving the first row of data for the next 32x32 matrix.
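
Putting the stagger together, the per-chain activity for one 32x32 multiplication looks roughly like this:

clocks  1..32 : row 0 of matrix A shifts into chain 0 (each later chain starts one clock later)
clocks 32..63 : chain 0 emits row 0 of the output, one element per clock
clocks 33..64 : chain 1 emits row 1 of the output
      ...
clocks 63..94 : chain 31 emits row 31; total latency is about 96 clocks, but chain 0 is free again after clock 64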

Larger matrices are broken down into smaller 32x32 matrices. For example, 1Kx1Kx1K matrices are represented as follows, where each box is a 32x32 matrix:

Image of GEMM DSP Implementation Submatrices

The output matrix is:

Image of GEMM DSP Implementation Output Matrix

Data Flow for larger matrices

Matrix A00 is first multiplied with matrix B00, which is the basic 32x32 matrix multiplication. Over the first 96 clocks, each DSP chain produces 32 outputs, so a total of 1K outputs are generated; these are the partial sums for the final output. These partial sums are written to 64 partial-sum BRAMs. After 64 clocks, the first cascade chain is done with the A00 x B00 submatrix and starts on A00 x B01 to calculate partial sums for the next column of the output matrix. Likewise, over the next 32 clocks, the other DSP cascade chains also complete the A00 x B00 multiplication and move on to the A00 x B01 submatrix multiplication. In this way, matrix A00 is multiplied with matrices B00, B01, B02, ..., B0,31.

This completes the A00 submatrix multiplications. Next, the A01 submatrix of matrix A is read and multiplied with submatrices of matrix B. The new partial sums are added to the previously generated partial sums and stored back. The design keeps moving along the first row of matrix A in this way, multiplying each submatrix with submatrices of matrix B. This continues for 32 iterations; in the 32nd iteration, data is written to the output BRAM instead of the partial-sum BRAM. This completes the computation of the first row of the output matrix.

Then the design moves to the next row of matrix A, and all of these steps are repeated. After 32 such iterations, the 1Kx1Kx1K matrix multiplication is complete.
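
Functionally, this blocking scheme is the standard blocked matrix multiplication. The following C++ reference model (a sketch of the arithmetic only, not the RTL or its scheduling; the design itself uses fixed-point data) mirrors the loop structure described above:

// Blocked GEMM reference model: C = A * B with 32x32 tiles.
// The tiles named A00, B01, ... in the text correspond to the
// (bi, bk) tiles of A and (bk, bj) tiles of B here.
// A, B, and C are N*N row-major; C must be zero-initialized.
#include <vector>

constexpr int BLK = 32;    // systolic array dimension
constexpr int N   = 1024;  // GEMM_SIZE (the 1K example from the text)

void blocked_gemm(const std::vector<float>& A,
                  const std::vector<float>& B,
                  std::vector<float>& C) {
    for (int bi = 0; bi < N; bi += BLK)          // tile row of A (outer loop)
        for (int bk = 0; bk < N; bk += BLK)      // move along the row of A
            for (int bj = 0; bj < N; bj += BLK)  // B tiles: B00, B01, ..., B0,31
                for (int i = bi; i < bi + BLK; ++i)
                    for (int j = bj; j < bj + BLK; ++j) {
                        float acc = C[i * N + j]; // running partial sum (partial-sum BRAM in hardware)
                        for (int k = bk; k < bk + BLK; ++k)
                            acc += A[i * N + k] * B[k * N + j];
                        C[i * N + j] = acc;       // final value on the last bk iteration (output BRAM)
                    }
}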

Matrix Calculation Latency for large matrices

A 32x32 matrix calculation requires 96 clocks. However, the first cascade chain in the DSP58 array finishes its computation after 64 clocks and can start receiving data for the next submatrix. Thus, for 32 clocks, the previous and the new submatrix calculations overlap. The total number of clocks required for a large matrix multiplication is therefore 64 * (number of submatrix multiplications) + 32.

In this design, the DSP clock operates at 750 MHz (1.33 ns period).
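
As a worked example of the formula above: a 1Kx1Kx1K multiplication decomposes into (1024/32)^3 = 32,768 submatrix multiplications, so it requires roughly 64 * 32,768 + 32 = 2,097,184 clocks, or about 2.8 ms at 750 MHz.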

The following figure shows a block diagram of the design.

Image of GEMM DSP Implementation Architecture

PL Kernel Details