Implementation - 2024.1 English

AI Engine System Software Driver Reference Manual (UG1642)

Data Movement
The application uses the GMIO attribute to make external memory-mapped connections to and from global memory. These connections are created between the AI Engine kernel and the logical global memory port of the hardware platform design through the NoC. In this design, the PS program configures buffer descriptors in the DMAs of the AI Engine interface tiles to initiate read and write transactions between the AI Engine and DDR memory. The burst length of the memory-mapped transactions is 64-bit, and the AI Engine interface tile DMAs use physical memory addressing to read and write data in global memory.
Figure 1. Data Movement
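As a rough illustration of the traffic these GMIO connections carry, the following C++ sketch computes the size of one 1200x1200 int32 matrix (the size reported in the sample output) and how many memory-mapped bursts are needed to move it. The helper names `matrix_bytes` and `num_bursts` are hypothetical and are not part of the ADF or XRT APIs; the burst size is a parameter in bytes so that any burst length can be plugged in.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of a rows x cols matrix of 32-bit integers.
std::uint64_t matrix_bytes(std::uint64_t rows, std::uint64_t cols) {
    return rows * cols * sizeof(std::int32_t);
}

// Number of bursts needed to move `bytes` when each burst carries
// `burst_bytes` bytes; a partial final burst still counts as one burst.
std::uint64_t num_bursts(std::uint64_t bytes, std::uint64_t burst_bytes) {
    assert(burst_bytes > 0);
    return (bytes + burst_bytes - 1) / burst_bytes;
}
```

For example, one 1200x1200 int32 matrix occupies 5,760,000 bytes, so `num_bursts(matrix_bytes(1200, 1200), 8)` gives 720,000 bursts of 8 bytes (64 bits), while 64-byte bursts would need 90,000.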
Data Slicing
To compute the matrix multiplication on the AI Engine, matrix A is sliced horizontally and distributed equally among all the cores used, through the AI Engine AXI4-Stream network. Matrix B is transposed and fed to the first core in the design element by element. The first core shares input matrix B with the other AI Engine cores through the AXI4-Stream connection. Because the output is produced in z-order, the output matrix must be reordered.
Figure 2. Data Slicing
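The slicing scheme can be modeled on the host in plain C++. In the sketch below, each iteration of the outer `core` loop stands in for one AI Engine core that owns an equal horizontal slice of A, and B is transposed once so that every column of B becomes a contiguous row for the dot products. The names `Matrix`, `transpose`, and `sliced_matmul` are hypothetical. The AXI4-Stream forwarding of B between cores is not modeled, and the result is written directly in row-major order, so the z-order reordering that the real design needs does not appear here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::int32_t>;  // row-major storage

// Transpose an n x p row-major matrix into a p x n row-major matrix,
// so that each column of the input becomes a contiguous row.
Matrix transpose(const Matrix& b, std::size_t n, std::size_t p) {
    Matrix bt(p * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < p; ++j)
            bt[j * n + i] = b[i * p + j];
    return bt;
}

// Multiply A (m x n) by B (n x p), with A split horizontally into
// num_cores equal row slices; each `core` iteration models one core.
Matrix sliced_matmul(const Matrix& a, const Matrix& b, std::size_t m,
                     std::size_t n, std::size_t p, std::size_t num_cores) {
    assert(a.size() == m * n && b.size() == n * p);
    assert(num_cores > 0 && m % num_cores == 0);  // equal slices
    const std::size_t rows_per_core = m / num_cores;
    const Matrix bt = transpose(b, n, p);  // every core sees the same B^T
    Matrix c(m * p, 0);
    for (std::size_t core = 0; core < num_cores; ++core) {
        const std::size_t row0 = core * rows_per_core;
        for (std::size_t r = row0; r < row0 + rows_per_core; ++r)
            for (std::size_t col = 0; col < p; ++col) {
                std::int32_t acc = 0;
                // Dot product of row r of A with column col of B.
                for (std::size_t k = 0; k < n; ++k)
                    acc += a[r * n + k] * bt[col * n + k];
                c[r * p + col] = acc;
            }
    }
    return c;
}
```

Because the slices share no output rows, the per-core iterations are independent of one another, which is what allows them to run in parallel on separate cores. A host-side model like this can also serve as a reference when checking AI Engine results.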
Build Flow
The following figure shows the build flow for GMIO-based AI Engine designs.
AMD Vitis™ generates aie_control_xrt.cpp, which is cross-compiled to run on the target. The compiled application loads the generated AI Engine ELFs and CDOs (packaged into the XCLBIN) onto the corresponding tiles through the load XCLBIN API.
Figure 3. Build Flow
Runtime Execution
At runtime, the Linux application binary calls the AI Engine userspace driver and the runtime library, libadf_api_xrt.a. The AI Engine userspace driver abstracts the kernel-space driver, which handles runtime configuration as well as ELF loading.
Figure 4. Runtime Execution
Sample Output
Follow the Linux boot process to boot Linux on the target. At the Linux login prompt, log in with user root and password root. The AI Engine XCLBIN and executables are pre-installed in the /usr/bin/ directory.
root@xilinx-vck190-202x_x:~# aie-matrix-multiplication
Initializing ADF API...
[INFO] AIE GMIO Matrix Multiplication
[INFO] Matrix size(int32): 1200x1200
[   68.288633] zocl-drm axi:zyxclmm_drm: zocl_create_client: created KDS client for pid(824), ret: 0
[   68.297535] zocl-drm axi:zyxclmm_drm: zocl_destroy_client: client exits pid(824)
[   68.304999] zocl-drm axi:zyxclmm_drm: zocl_create_client: created KDS client for pid(824), ret: 0
[   68.349096] [drm] found kind 29(AIE_RESOURCES)
[80573.078]Loading PDI from DDR
[80573.164]Monolithic/Master Device
[80576.303]3.190 ms: PDI initialization time
[80580.153]+++Loading Image#: 0x0, Name: aie_image, Id: 0x1C000000
[80585.916]---Loading Partition#: 0x0, Id: 0x0
[80645.089] 55.070 ms for Partition#: 0x0, Size: 19127040 Bytes
[80647.906]Subsystem PDI Load: Done
[   68.349108] [drm] found kind 18(PDI)
[   68.448707] [drm] FPGA Manager load DONE
[   68.455278] [drm] Partition 1 already requested
XAIEFAL: INFO: Resource group Avail is created.
XAIEFAL: INFO: Resource group Static is created.
XAIEFAL: INFO: Resource group Generic is created.
[INFO] XCLBIN download complete
[INFO] AIE cores are done executing
[INFO] Running sanity check
[INFO] XGeMM Success!
[   68.469075] [drm] zocl_xclbin_read_axlf fe3eeecf-1b48-4862-4723-5aba0732fe7b ret: 0
[  114.412806] zocl-drm axi:zyxclmm_drm: zocl_destroy_client: client exits pid(824)
Customizing and Rebuilding
As previously mentioned, you can change the number of AI Engine cores used for matrix multiplication. However, because the data memory immediately available to each core is limited, reducing the number of AI Engine cores reduces the maximum matrix size that the application supports. To change the number of cores used, set the NUM_HW_ROWS and NUM_HW_COLS macros in the config.h header file. At most 400 AI Engine cores are available.
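The trade-off between core count and matrix size can be estimated with a little arithmetic. The C++ sketch below mimics a config.h that defines NUM_HW_ROWS and NUM_HW_COLS, checks the 400-core limit at compile time, and computes how many bytes of matrix A each core holds for an evenly sliced M x N int32 matrix. The values 8 and 50 are hypothetical, the helper names are invented for illustration, and the model counts only the A slice; it ignores B, the output buffers, and any memory a core can borrow from neighboring tiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical config.h values; the shipped file may use other numbers.
#define NUM_HW_ROWS 8
#define NUM_HW_COLS 50

constexpr std::size_t kMaxAieCores = 400;  // limit stated in this guide
constexpr std::size_t kNumCores = NUM_HW_ROWS * NUM_HW_COLS;
static_assert(kNumCores <= kMaxAieCores,
              "NUM_HW_ROWS * NUM_HW_COLS exceeds the available AI Engine cores");

// Rows of A assigned to each core when m rows are split evenly.
std::size_t rows_per_core(std::size_t m, std::size_t num_cores) {
    assert(num_cores > 0 && m % num_cores == 0);
    return m / num_cores;
}

// Bytes of int32 data in one core's slice of an m x n matrix A.
std::size_t slice_bytes(std::size_t m, std::size_t n, std::size_t num_cores) {
    return rows_per_core(m, num_cores) * n * sizeof(std::int32_t);
}

// True if one core's slice of A fits within budget_bytes of data memory.
bool slice_fits(std::size_t m, std::size_t n, std::size_t num_cores,
                std::size_t budget_bytes) {
    return slice_bytes(m, n, num_cores) <= budget_bytes;
}
```

With 400 cores, a 1200x1200 matrix gives each core 3 rows, or 14,400 bytes of A. With 100 cores, the slice grows to 12 rows, or 57,600 bytes, which no longer fits a 32 KB budget (the data memory of a single first-generation AI Engine tile, used here only as an example budget).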