Design Build
In this section, you learn to build and run the Matrix Multiplication design using the DSP58 engines in a Versal device. You will compile the design and integrate it into a larger system design (including the PS host application).
The Makefile used to build the design takes two user inputs from the command line: TARGET (hw or hw_emu) and GEMM_SIZE (32, 64, 128, 256, 512, or 1024).
Based on these inputs, the design flow generates a new directory called build/. Underneath it, a subdirectory named gemm_GEMM_SIZExGEMM_SIZExGEMM_SIZE is created; for example, if GEMM_SIZE is 64, a subdirectory named gemm_64x64x64 is created under the build directory. Within that subdirectory, hw_emu/ and/or hw/ subfolders are created. These folders contain a host application executable and the builds targeted to hw or hw_emu, respectively. The hw_emu/ subfolder contains the build for hardware emulation. The hw/ subfolder contains the build for a hardware run on a VCK190 board.
Make Steps
To run the following make steps (for example, make kernels, make xsa, make application, and make package), you must be in the gemm_dsp58/ folder. The following options can be specified in the make steps. Instructions for how to apply them are provided later in this section.
TARGET: This option can be set to hw or hw_emu to build the design in the hardware or hardware emulation flow. The default is hw_emu.
GEMM_SIZE: This option can be set to 32, 64, 128, 256, 512, or 1024. The default is 64.
The Makefile uses the following directory references:
## Relative directory
RELATIVE_PROJECT_DIR := ./
PROJECT_REPO := $(shell readlink -f $(RELATIVE_PROJECT_DIR))
DESIGN_REPO := $(PROJECT_REPO)/design
PL_SRC_REPO := $(DESIGN_REPO)/pl_src
CONSTRAINTS_REPO := $(PL_SRC_REPO)/constraints
HOST_APP_SRC := $(DESIGN_REPO)/host_app_src
SYSTEM_CONFIGS_REPO := $(DESIGN_REPO)/system_configs
VIVADO_METRICS_SCRIPTS_REPO := $(DESIGN_REPO)/vivado_metrics_scripts
BASE_BLD_DIR := $(PROJECT_REPO)/build_$(PL_FREQ)
GEMM_BLD_DIR := $(BASE_BLD_DIR)/gemm_$(MAT_DIMS)
BUILD_TARGET_DIR := $(GEMM_BLD_DIR)/$(TARGET)
VIVADO_REPORTS_REPO := $(PROJECT_REPO)/vivado_reports_dir
BLD_VIVADO_REPORTS_DIR := $(VIVADO_REPORTS_REPO)/gemm_$(MAT_DIMS)
EMBEDDED_PACKAGE_OUT := $(BUILD_TARGET_DIR)/package
EMBEDDED_EXEC_SCRIPT := run_script.sh
Build the Entire Design with a Single Command
If you are already familiar with Vitis kernel compilation flows, you can build the entire design with one command:
make run (default TARGET=hw_emu, GEMM_SIZE=64)
or
make run TARGET=hw (Target is hardware, GEMM_SIZE=64)
This command runs the make kernels, make xsa, make application, make package, and make run_emu steps for hardware emulation or for a run on hardware (the VCK190 board), depending on the TARGET you specify. The settings also apply to the individual make steps listed below.
The generated files are placed under an individual directory: $(BUILD_TARGET_DIR)/. Each make step to build the design is specified in the following sections. These sections also detail the options used and the location of input and output files in each case.
See this page for a detailed description of all Vitis compiler switches. The following table provides a summary of the switches used.
Switch | Description |
---|---|
--target | -t [hw|hw_emu] | Specifies the build target. |
--platform | -f | Specifies the name of a supported acceleration platform as specified by the $PLATFORM_REPO_PATHS environment variable or the full path to the platform XPFM file. |
--save-temps | -s | Directs the Vitis compiler command to save intermediate files/directories created during the compilation and link process. Use the --temp_dir option to specify a location to write the intermediate files to. |
--temp_dir | This allows you to manage the location where the tool writes temporary files created during the build process. The temporary results are written by the Vitis compiler, and then removed, unless the --save-temps option is also specified. |
--verbose | Display verbose/debug information. |
--compile | -c | Required for compilation to generate XO files from kernel source files. |
--kernel \<arg>|-k \<arg> | Compile only the specified kernel from the input file. Only one -k option is allowed per Vitis compiler command. |
-D | --define \<Macro Name>=\<value> | Defines Macros for the compiler. |
--output | -o | Specifies the name of the output file generated by the V++ command. The kernel output should be XO. |
The following RTL files are used in this design:
${PL_SRC_REPO}/rtl/BDELAY.vhd
${PL_SRC_REPO}/rtl/FIXGEMM.vhd
${PL_SRC_REPO}/rtl/SDELAY.vhd
${PL_SRC_REPO}/rtl/sfixed_pkg.vhd
${PL_SRC_REPO}/rtl/cfixed_pkg.vhd
${PL_SRC_REPO}/rtl/DSP_GW.vhd
${PL_SRC_REPO}/rtl/FIXGEMM_WRAPPER.vhd
${PL_SRC_REPO}/rtl/control_logic.sv
${PL_SRC_REPO}/rtl/gemm_top.sv
${PL_SRC_REPO}/rtl/ps_slave.sv
${PL_SRC_REPO}/rtl/DSP_data_controller.sv
${PL_SRC_REPO}/rtl/op_uram.sv
${PL_SRC_REPO}/rtl/row_uram.sv
${PL_SRC_REPO}/rtl/col_uram.sv
${PL_SRC_REPO}/rtl/gemm_large_ocm.sv
${PL_SRC_REPO}/rtl/partial_sum_bram.sv
${PL_SRC_REPO}/rtl/synchronizer.sv
$(CONSTRAINTS_REPO)/gemm_dsp58.tcl provides constraints for synthesis and implementation.
The following is the output XO file:
$(PROJECT_REPO)/build/gemm_GEMM_SIZExGEMM_SIZExGEMM_SIZE/gemm_large_ocm.xo
make kernels: Generates the PL Kernels
This step uses the RTL and memory initialization files specified above to generate the PL kernel (gemm_large_ocm.xo).
make xsa: Using the Vitis Tools to Link PL Kernels with the Platform
After the kernel is generated, you can use the Vitis compiler to link it with the platform to generate an XSA file.
The Vitis tools allow you to integrate the kernels into an existing extensible platform. From a software developer's perspective, this is an automated step: the platform is provided by the hardware designer, or you can use one of the many extensible base platforms provided by Xilinx, and the Vitis tools build the hardware design and integrate the kernels into it.
The command to run this step is as follows:
make xsa TARGET=<hw/hw_emu> GEMM_SIZE=<64,128,256,512,1024>
The expanded command is as follows:
cd $(BUILD_TARGET_DIR); \
v++ -l --platform xilinx_vck190_base_202410_1 --save-temps --temp_dir $(BUILD_TARGET_DIR)/_x \
--verbose -g --clock.freqHz 500000000:gemm_large_ocm_0 --clock.defaultTolerance 0.001 \
--config $(SYSTEM_CONFIGS_REPO)/gemm.cfg --vivado.prop fileset.sim_1.xsim.simulate.log_all_signals=true \
--vivado.prop run.synth_1.{STEPS.SYNTH_DESIGN.ARGS.CONTROL_SET_OPT_THRESHOLD}={16} \
--vivado.prop run.synth_1.{STEPS.SYNTH_DESIGN.ARGS.KEEP_EQUIVALENT_REGISTERS}={true} \
--xp vivado_prop:run.impl_1.STEPS.PLACE_DESIGN.TCL.PRE=$(CONSTRAINTS_REPO)/gemm_dsp58.tcl \
-t hw_emu -o $(BUILD_TARGET_DIR)/gemm.hw_emu.xclbin $(PROJECT_REPO)/build/gemm_GEMM_SIZExGEMM_SIZExGEMM_SIZE/gemm_large_ocm.xo
See this page for a detailed description of Vitis linking options. The following table provides a summary of the switches used.
Switch | Description |
---|---|
--platform | -f | Specifies the name of a supported acceleration platform as specified by the $PLATFORM_REPO_PATHS environment variable or the full path to the platform XPFM file. |
--save-temps | -s | Directs the V++ command to save intermediate files/directories created during the compilation and link process. Use the --temp_dir option to specify a location to write the intermediate files to. |
--temp_dir | This allows you to manage the location where the tool writes temporary files created during the build process. The temporary results are written by the Vitis compiler, and then removed, unless the --save-temps option is also specified. |
--verbose | Display verbose/debug information. |
--output | -o | Specifies the name of the output file generated by the V++ command. In this design, the output of the link step is the XCLBIN file. |
--vivado.prop \<arg> | Specifies properties for the Vivado Design Suite to be used during synthesis and implementation of the FPGA binary (xclbin). See this page for detailed Vivado options. |
--profile.data \<arg> | Enables monitoring of data ports through the monitor IPs. This option needs to be specified during linking. See this page for detailed profiling options. |
--profile.trace_memory \<FIFO>:\<size>|\<MEMORY>[\<n>] | When building the hardware target (-t=hw), use this option to specify the type and amount of memory to use for capturing trace data. See this page for detailed profiling options. |
--config | Specifies a configuration file containing V++ switches. |
The information to tell the linker how to connect the PL kernels together is described in a configuration file, system_configs/gemm.cfg. The file describes the overall connection scheme of the system.
[connectivity]
nk=gemm_large_ocm:1:gemm_large_ocm_0
[clock]
id=0:gemm_large_ocm_0.S_AXI_ACLK
[advanced]
## Disable Profiling in hw_emu so that it is faster...
param=hw_emu.enableProfiling=false
## Export the xsa of the design..
param=compiler.addOutputTypes=hw_export
param=compiler.worstNegativeSlack=-1.0
[vivado]
prop=run.synth_1.STRATEGY=Flow_PerfOptimized_high
prop=run.impl_1.STEPS.OPT_DESIGN.is_enabled=true
prop=run.impl_1.STEPS.OPT_DESIGN.ARGS.DIRECTIVE=Explore
prop=run.impl_1.STEPS.PLACE_DESIGN.ARGS.DIRECTIVE=ExtraTimingOpt
prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.is_enabled=true
prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.ARGS.DIRECTIVE=AggressiveExplore
#prop=run.impl_1.STEPS.ROUTE_DESIGN.ARGS.MORE OPTIONS=-tns_cleanup
prop=run.impl_1.STEPS.ROUTE_DESIGN.ARGS.DIRECTIVE=AggressiveExplore
See this page for a detailed description of the Vitis compiler configuration file. A summary of the configuration options used is provided in the following table.
Switch | Comment |
---|---|
--connectivity.nk | Number of kernels. gemm_large_ocm:1:gemm_large_ocm_0 means that the Vitis compiler should instantiate one gemm_large_ocm kernel and name the instance gemm_large_ocm_0 . |
param=hw_emu.enableProfiling=false | Disables profiling during hardware emulation for a faster run time. |
param=compiler.addOutputTypes=hw_export | This option tells the Vitis compiler that besides creating an XCLBIN file, it also outputs an XSA file which is needed to create a post-Vivado fixed platform for Vitis software development. |
param=compiler.worstNegativeSlack=-1.0 | Sets the tolerated worst negative slack (WNS) for the build to -1.0 ns. |
prop=run.synth_1.STRATEGY=Flow_PerfOptimized_high | Sets the synthesis strategy. |
prop=run.impl_1.STEPS.OPT_DESIGN.is_enabled=true | Enables the opt_design step. |
prop=run.impl_1.STEPS.OPT_DESIGN.ARGS.DIRECTIVE=Explore | Sets the directive for the opt_design step. |
prop=run.impl_1.STEPS.PLACE_DESIGN.ARGS.DIRECTIVE=ExtraTimingOpt | Sets the directive for the place_design step. |
prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.is_enabled=true | Enables the phys_opt_design step. |
prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.ARGS.DIRECTIVE=AggressiveExplore | Sets the directive for the phys_opt_design step. |
prop=run.impl_1.STEPS.ROUTE_DESIGN.ARGS.DIRECTIVE=AggressiveExplore | Sets the directive for the route_design step. |
The Vitis™ compiler calls the Vivado™ IP integrator under the hood to build the design. The platform and kernels are input to the Vivado Design Suite, which produces a simulation XSA or an XSA after running place and route on the design. The point at which the XSA is produced from Vivado depends on the -target
option set on the Vitis compiler command line.
You can now view the Vivado project, which is located in the $(BUILD_TARGET_DIR)/_x/link/vivado/vpl/prj
directory. You have now generated the XCLBIN file, $(BUILD_TARGET_DIR)/gemm.hw_emu.xclbin, that is used to execute your design on the platform.
make application: Compile the Host Application
You can compile the host application by following the typical cross-compilation flow for the Cortex-A72 processor. To build the application, run the following command:
make application
or
cd $(BUILD_TARGET_DIR); \
aarch64-xilinx-linux-g++ -mcpu=cortex-a72.cortex-a53 -march=armv8-a+crc -fstack-protector-strong \
-D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=$(SDKTARGETSYSROOT) -O -c \
-std=c++14 -D__linux__ \
-DM_LARGE=$(GEMM_SIZE) -DN_LARGE=$(GEMM_SIZE) -DL_LARGE=$(GEMM_SIZE) \
-I$(SDKTARGETSYSROOT)/usr/include/xrt -I$(SDKTARGETSYSROOT)/usr/include -I$(SDKTARGETSYSROOT)/usr/lib -I$(HOST_APP_SRC)/$(MAT_DIMS) \
$(HOST_APP_SRC)/main.cpp -o $(BUILD_TARGET_DIR)/gemm_top_app.o \
-L$(SDKTARGETSYSROOT)/lib -lxrt_coreutil
aarch64-xilinx-linux-g++ -mcpu=cortex-a72.cortex-a53 -march=armv8-a+crc -fstack-protector-strong \
-D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=$(SDKTARGETSYSROOT) \
$(BUILD_TARGET_DIR)/gemm_top_app.o -L$(SDKTARGETSYSROOT)/usr/lib -lxrt_coreutil \
-o $(BUILD_TARGET_DIR)/gemm_dsp_xrt.elf
See this page for XRT documentation. See this page for details of host application programming.
Switch | Description |
---|---|
-O | Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function. With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time. |
-D__linux__ | |
-DXAIE_DEBUG | Enable debug interface capabilities where certain core status, event status, or stack trace can be dumped out. |
-D\<Pre-processor Macro String>=\<value> | Pass Pre-processor Macro definitions to the cross-compiler. |
-I \<dir> | Add the directory dir to the list of directories to be searched for header files. |
-o \<file> | Place output in file <file> . This applies regardless of the output being produced, whether it be an executable file, an object file, an assembler file or preprocessed C code. |
--sysroot=\<dir> | Use dir as the logical root directory for headers and libraries. For example, if the compiler would normally search for headers in /usr/include and libraries in /usr/lib , it will instead search dir/usr/include and dir/usr/lib . This is automatically set by the env_setup.sh script |
-l\<library> | Search the library named \<library> when linking. This tutorial requires the xrt_coreutil library. |
-L \<dir> | Add directory <dir> to the list of directories to be searched for -l. |
The following is a description of the input sources compiled by the cross-compiler command.
Inputs Sources | Description |
---|---|
$(HOST_APP_SRC)/main.cpp | Source application file for the gemm_dsp_xrt.elf that will run on an A72 processor. |
$(HOST_APP_SRC)/matrix_A_data.h, matrix_B_data.h | Matrix A and B Data to be used for matrix multiplication. |
$(HOST_APP_SRC)/output_data.h | Golden data to which DUT output will be compared. |
The following is a description of the output objects that result from executing the cross-compiler command with the above inputs and options.
Output Objects | Description |
---|---|
$(BUILD_TARGET_DIR)/gemm_dsp_xrt.elf | The executable that will run on an A72 processor. |
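The tutorial's main.cpp is not reproduced here; the following is only a minimal sketch, assuming the XRT native C++ API, of how a host application can drive an RTL kernel instance such as gemm_large_ocm_0 (the instance name from gemm.cfg). The register offsets and the start/done handshake are hypothetical placeholders and do not reflect the actual register map implemented in ps_slave.sv.

```cpp
// Minimal XRT host sketch (illustration only). The register offsets below are
// hypothetical placeholders; the real register map is defined by ps_slave.sv.
#include <cstdint>
#include <iostream>

#include "xrt/xrt_device.h"
#include "experimental/xrt_ip.h"

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cout << "Usage: ./gemm_dsp_xrt.elf <xclbin>" << std::endl;
        return 1;
    }

    // Open the first device and load the XCLBIN passed on the command line.
    xrt::device device(0);
    auto uuid = device.load_xclbin(argv[1]);

    // Get a handle to the AXI4-Lite interface of the RTL kernel instance
    // named in gemm.cfg (nk=gemm_large_ocm:1:gemm_large_ocm_0).
    auto gemm = xrt::ip(device, uuid, "gemm_large_ocm_0");

    // Hypothetical control sequence: write a start bit, then poll a done bit.
    constexpr uint32_t CTRL_OFFSET   = 0x0;  // placeholder offset
    constexpr uint32_t STATUS_OFFSET = 0x4;  // placeholder offset
    gemm.write_register(CTRL_OFFSET, 0x1);
    while ((gemm.read_register(STATUS_OFFSET) & 0x1) == 0) {
        // Poll until the kernel reports completion.
    }

    // The tutorial's main.cpp reads back the result, compares it against the
    // golden data in output_data.h, and prints TEST PASSED on success.
    std::cout << "Done" << std::endl;
    return 0;
}
```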
make package: Packaging the Design
With the Kernel outputs created, as well as the new platform, you can now generate the programmable device image (PDI) and a package to be used on an SD card. The PDI contains all the executables, bitstreams, and configurations of the device. The packaged SD card directory contains everything to boot Linux, the generated applications, and the XCLBIN.
The command to run this step is as follows (default TARGET=hw_emu
):
make package
or
cp $(PROJECT_REPO)/run_script.sh $(BUILD_TARGET_DIR)/
cd $(BUILD_TARGET_DIR); \
v++ -p -t hw --save-temps --temp_dir $(BUILD_TARGET_DIR)/_x -f xilinx_vck190_base_202410_1 \
--package.rootfs $(XLNX_VERSAL)/rootfs.ext4 --package.kernel_image $(XLNX_VERSAL)/Image --package.boot_mode=sd \
--package.out_dir $(BUILD_TARGET_DIR)/package --package.image_format=ext4 --package.sd_file $(BUILD_TARGET_DIR)/gemm_dsp_xrt.elf \
$(BUILD_TARGET_DIR)/gemm.hw.xclbin
If the XRT_ROOT
is set, the following Vitis compiler flags are also set:
--package.sd_dir $(XRT_ROOT)
See this page for more details about packaging the system.
Switch | Description |
---|---|
--target | -t [hw|hw_emu] | Specifies the build target. |
--package | -p | Packages the final product at the end of the Vitis compile and link build process. |
--package.rootfs \<arg> | Where \<arg> specifies the absolute or relative path to a processed Linux root file system file. The platform RootFS file is available for download from xilinx.com. Refer to the Vitis Software Platform Installation for more information. |
--package.kernel_image \<arg> | Where \<arg> specifies the absolute or relative path to a Linux kernel image file. Overrides the existing image available in the platform. The platform image file is available for download from xilinx.com. Refer to the Vitis Software Platform Installation for more information. |
--package.boot_mode \<arg> | Where \<arg> specifies the boot mode (sd in this design) used to run the application on hardware or in emulation. |
--package.image_format | Where \<arg> specifies \<ext4|fat32> output image file format. ext4 is the Linux file system and fat32 is the Windows file system. |
--package.sd_file | Where \<arg> specifies an ELF or other data file to package into the sd_card directory/image. This option can be used repeatedly to specify multiple files to add to the sd_card . |
Inputs Sources | Description |
---|---|
$(XRT_ROOT) | The PS host application needs the XRT headers in this folder to execute. Set in the env_setup.sh . |
$(XLNX_VERSAL)/rootfs.ext4 | The root filesystem file for PetaLinux. |
$(XLNX_VERSAL)/Image | The pre-built PetaLinux image the processor boots from. |
$(BUILD_TARGET_DIR)/gemm_dsp_xrt.elf | The PS host application executable created in the make application step. |
$(BUILD_TARGET_DIR)/gemm.hw_emu.xclbin | The XCLBIN file created in the make xsa step. |
The output of the V++ Package step is the package directory that contains the contents to run hardware emulation.
Output Objects | Description |
---|---|
$(BUILD_TARGET_DIR)/package | The hardware emulation package that contains the boot file, hardware emulation launch script, the PLM and PMC boot files, the PMC and QEMU command argument specification files, and the Vivado simulation folder. |
make run_emu: Running Hardware Emulation
After packaging, everything is set to run hardware emulation. To run emulation, use the following command (default TARGET=hw_emu
):
make run_emu
or
For hardware emulation, go to $(BUILD_TARGET_DIR)/package and run:
./launch_hw_emu.sh
or, for the waveform viewer:
./launch_hw_emu.sh -g
When hardware emulation is launched, you see the QEMU simulator load. Wait for the autoboot countdown to go to zero. After a few minutes, the root Linux prompt comes up:
root@versal-rootfs-common-2024.1:~#
After the root prompt comes up, run the following commands to run the design:
cd /mnt
export XILINX_XRT=/usr
./gemm_dsp_xrt.elf a.xclbin
The gemm_dsp_xrt.elf
executes. After a few minutes, you should see the output with TEST PASSED
on the console. When this is shown, run the following keyboard command to exit the QEMU instance:
To exit the QEMU simulation, press Ctrl+A, release the keys, and then press x.
To run with waveform, do the following:
cd $(BUILD_TARGET_DIR)/package
./launch_hw_emu.sh -g
The XSIM Waveform Viewer is launched. Drag and drop the signals into the viewer and click Play to start the emulation. Go back to the terminal and wait for the Linux prompt to show up. In the XSIM Waveform Viewer, you will see the signals you added to the waveform change over the execution of the design. When this is done, press the pause button and close the window to end the emulation. Note: A data integrity mismatch can be seen in hardware emulation due to a software issue; the design works correctly in the hardware run.
TARGET=hw: Running on Hardware
To run the design on hardware, rerun the following make steps with TARGET=hw and any other applicable options (see the make steps above).
make kernels TARGET=hw
make xsa TARGET=hw
make application TARGET=hw
make package TARGET=hw
These commands create a $(BUILD_TARGET_DIR) folder with the kernels, XSA, and package for a hardware run.
Run the following step to set up the execution file, generated images, and base images ($(BUILD_TARGET_DIR)/package/sd_card and $(BUILD_TARGET_DIR)/package/sd_card.img).
make run_emu TARGET=hw
These commands create a build/hw
folder with the kernels, XCLBIN, and package
for a hardware run. Follow steps 1-9 to run the gemm_dsp_xrt.elf
executable on your VCK190 board.
Step 1. Ensure your board is powered off.
Step 2. Use an SD card writer (such as balenaEtcher) to flash the sd_card.img
file to an SD card.
Step 3. Plug the flashed SD card into the top slot of the VCK190 board.
Step 4. Set the switch (SW1 Mode\[3:0\]=1110 = OFF OFF OFF ON
).
Step 5. Connect your computer to the VCK190 board using the USB cable included with the board.
Step 6. Open a TeraTerm terminal and select the correct COM port. Set the port settings to the following:
Port: <COMMXX>
Speed: 115200
Data: 8 bit
Parity: none
Stop Bits: 1 bit
Flow control: none
Transmit delay: 0 msec/char 0 msec/line
Step 7. Power on the board.
Step 8. Wait until you see the root@versal-rootfs-common-2024_1
Linux command prompt. Press enter a few times to get past any xinit
errors.
Step 9. Run the following commands in the TeraTerm terminal:
mount /dev/mmcblk0p1 /mnt
cd /mnt
export XILINX_XRT=/usr
./gemm_dsp_xrt.elf a.xclbin
Hardware Design Details
Matrix Multiplication using DSP58 Implementation Architecture
In this design, matrix multiplication is implemented using a 32x32 systolic array of DSP58s; that is, there are 32 DSP58 cascade chains, each containing 32 DSP58s. A 32x32 matrix is therefore the basic matrix multiplication size, and larger matrices are broken down into submatrices of size 32x32.
The basic 32x32 multiplication is performed as follows:
Matrix A row data moves upward along the DSP A-port cascade chain. For the first 32 clocks, data is only shifted into the DSP chains. After 32 clocks, row 0 of matrix A is populated in the first DSP cascade chain, row 1 in the next cascade chain, and so on, as shown in the diagram below.
Calculating First Row of Output Matrix
After the matrix A elements are shifted into the cascade chains, the last row of matrix B is driven clock by clock into the bottom-most DSP of the first cascade chain, as shown in the diagram below.
The first row of the output matrix is calculated as follows:
The bottom-most DSP calculates A[0,31] * B[31,0] and sends the result to the DSP above it via the PCOUT cascade port. On the second clock, that DSP starts receiving B[30,0], B[30,1], ..., B[30,31] (that is, row 30 of matrix B). So on the second clock, the second DSP calculates A[0,30] * B[30,0] + PCOUT = A[0,30] * B[30,0] + A[0,31] * B[31,0] and sends it up to the third DSP. The third DSP starts receiving row 29 of matrix B on the third clock, computes the third MAC operation, and sends the result up to the fourth DSP. Thus, after the 32nd clock, the top DSP has generated the row 0, column 0 element of the output matrix.
On the second clock, the bottom DSP receives B[31,1] and calculates A[0,31] * B[31,1], which is the beginning of the MAC operation for the row 0, column 1 element of the output matrix. The row 0, column 1 calculation traverses upward in a similar way, and on the 33rd clock the top DSP generates the row 0, column 1 element of the output matrix.
Similarly, over the next 30 clocks (that is, clocks 34 to 63), the top DSP of the first cascade chain generates the remaining 30 elements of row 0 of the output matrix.
The other rows of the output matrix are calculated as follows:
The elements B[31,0], B[31,1], ..., B[31,31] (that is, row 31 of matrix B) are shifted to the next DSP chain every clock. Driving of the matrix A rows to the subsequent DSP chains therefore also starts with a one-clock delay. The bottom DSP of the second cascade chain starts on the second clock and computes A[1,31] * B[31,0], which is the beginning of the MAC operation for the row 1, column 0 element of the output matrix. The second cascade chain is thus one clock delayed with respect to the first cascade chain and generates its 32 outputs from clock 33 to clock 64; these outputs are row 1 of the output matrix. Each subsequent cascade chain is one clock delayed with respect to the previous chain, so the last cascade chain generates the row 31 outputs on clocks 63 to 94.
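The net effect of the data movement described above is a standard 32x32 multiply-accumulate: each output element C[i][j] is the sum over k of A[i][k] * B[k][j], with the cascade chain for row i accumulating the partial products through the PCOUT/PCIN ports. The following is a behavioral sketch of that computation (not the RTL; plain 32-bit integers stand in for the DSP58 fixed-point datapath):

```cpp
// Behavioral sketch of the 32x32 multiplication performed by the DSP58 array.
// Each output element is accumulated the way a cascade chain does: the k-th
// DSP in the chain adds A[i][k] * B[k][j] to the partial sum cascaded up to it.
#include <array>
#include <cstdint>

constexpr int kDim = 32;  // one cascade chain per row, 32 DSP58s per chain

using Mat = std::array<std::array<int32_t, kDim>, kDim>;

Mat gemm32x32(const Mat& A, const Mat& B) {
    Mat C{};
    for (int i = 0; i < kDim; ++i) {          // cascade chain i produces row i of C
        for (int j = 0; j < kDim; ++j) {      // one output per clock from the top DSP
            int32_t acc = 0;                  // PCIN of the bottom-most DSP is zero
            for (int k = kDim - 1; k >= 0; --k) {
                acc += A[i][k] * B[k][j];     // MAC performed by one DSP58, result
            }                                 // passed upward on PCOUT
            C[i][j] = acc;
        }
    }
    return C;
}
```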
32x32 Matrix Multiplication Latency
For the first 32 clocks, row 0 of matrix A is loaded into the first cascade chain. Over the next 32 clocks, the first cascade chain calculates the first row of the output matrix, and over the following 32 clocks the other rows of the output matrix are generated. However, after 64 clocks, the first DSP cascade chain can receive the first row of data for the next 32x32 matrix.
Larger matrices are broken down into smaller 32x32 matrices. For example, a 1Kx1Kx1K matrix multiplication is represented as follows, where each box is a 32x32 submatrix:
The output matrix is:
Data Flow for larger matrices
Matrix A00 is first multiplied with matrix B00, which is the basic 32x32 matrix multiplication. Over the first 96 clocks, each DSP chain produces 32 outputs, so a total of 1K outputs are generated; these are partial sums for the final output and are written to 64 partial-sum BRAMs. After 64 clocks, the first cascade chain is done with the A00 x B00 submatrix and starts on A00 x B01 to calculate partial sums for the next column of the output matrix. Likewise, over the next 32 clocks, the other DSP cascade chains also complete the A00 x B00 multiplication and move on to the A00 x B01 submatrix multiplication. In this way, matrix A00 is multiplied with matrices B00, B01, B02, ..., B0,31.
This completes the A00 submatrix multiplications. Next, the A01 submatrix of matrix A is read and multiplied with the submatrices of matrix B. The partial sums are added to the partial sums previously generated and stored back. We thus keep moving along the first row of matrix A and multiply each of its submatrices with the submatrices of matrix B. This continues for 32 iterations, and in the 32nd iteration the data is written to the output BRAM instead of the partial-sum BRAMs. This completes the computation of the first row of the output matrix.
We then move to the next row of matrix A and repeat all of these steps. After 32 such iterations, the 1Kx1Kx1K matrix multiplication is complete.
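The blocked data flow described above corresponds to a tiled matrix multiplication in which the loop over the block-columns of A (equivalently, the block-rows of B) accumulates partial sums, mirroring the partial-sum BRAMs. The following software sketch illustrates only the ordering of the computation, not the hardware implementation:

```cpp
// Software illustration of the blocked data flow for a large GEMM
// (n = GEMM_SIZE, a multiple of 32). Loop order: bi walks the block-rows of A,
// bk walks along that block-row (A[bi][bk] is multiplied with B[bk][*]), and
// the bk loop accumulates partial sums, mirroring the partial-sum BRAMs.
#include <cstdint>
#include <vector>

void gemm_blocked(const std::vector<std::vector<int32_t>>& A,
                  const std::vector<std::vector<int32_t>>& B,
                  std::vector<std::vector<int32_t>>& C, int n) {
    constexpr int T = 32;                       // basic submatrix size (DSP58 array)
    const int nb = n / T;                       // number of 32x32 blocks per side
    for (int bi = 0; bi < nb; ++bi)             // block-row of A
        for (int bk = 0; bk < nb; ++bk)         // move along the block-row of A
            for (int bj = 0; bj < nb; ++bj)     // multiply with B[bk][bj]: B00, B01, ...
                for (int i = 0; i < T; ++i)
                    for (int j = 0; j < T; ++j) {
                        // Read the partial sum (zero on the first bk iteration),
                        // accumulate the 32x32 submatrix product, and store it back.
                        int32_t acc = (bk == 0) ? 0 : C[bi * T + i][bj * T + j];
                        for (int k = 0; k < T; ++k)
                            acc += A[bi * T + i][bk * T + k] * B[bk * T + k][bj * T + j];
                        C[bi * T + i][bj * T + j] = acc;
                    }
}
```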
Matrix Calculation Latency for large matrices
A 32x32 matrix calculation requires 96 clocks. However, the first cascade chain in the DSP58 array is done with its computation after 64 clocks and can start receiving data for the next submatrix, so for 32 clocks the previous and new submatrix calculations overlap. The total number of clocks required for a large matrix multiplication is therefore 64 * (number of submatrices) + 32.
In this design, the DSP clock operates at 750 MHz (1.33 ns period).
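As a worked example of the formula above, and assuming that "number of submatrices" refers to the number of 32x32 submatrix products, that is (GEMM_SIZE/32)^3, the following sketch estimates the cycle count and run time at the 750 MHz DSP clock for each supported GEMM_SIZE. These are estimates from the formula only, not measured results:

```cpp
// Estimate GEMM latency from the formula: cycles = 64 * (number of 32x32
// submatrix products) + 32, at the 750 MHz DSP clock. The interpretation of
// "number of submatrices" as (GEMM_SIZE/32)^3 is an assumption.
#include <cstdint>
#include <cstdio>

int main() {
    const double dsp_clock_hz = 750e6;                  // 750 MHz DSP clock (1.33 ns)
    const int sizes[] = {32, 64, 128, 256, 512, 1024};  // supported GEMM_SIZE values
    for (int n : sizes) {
        const uint64_t blocks = static_cast<uint64_t>(n) / 32;
        const uint64_t submatrix_products = blocks * blocks * blocks;
        const uint64_t cycles = 64 * submatrix_products + 32;
        std::printf("GEMM_SIZE=%4d : %10llu cycles, ~%10.1f us\n", n,
                    static_cast<unsigned long long>(cycles),
                    1e6 * cycles / dsp_clock_hz);
        // e.g. GEMM_SIZE=1024 -> 64 * 32768 + 32 = 2,097,184 cycles (~2.8 ms)
    }
    return 0;
}
```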
The following figure shows the block diagram of the design.