Building the Design - 2024.1 English

Vitis Tutorials: AI Engine

Document ID
Release Date
2024.1 English

Estimated time: 11 minutes

make all

or, follow steps 1-3 as follows:

Step 1: Set the Vitis Utility Library path

XFLIB_DIR := $(shell readlink -f $(XFLIB_DIR_REL_PATH))

This path will have the folder utils in it along with other libraries. You will be using its L2 Data-Mover generator tool.

Step 2: Generate mm2s_mp.cpp and s2mm_mp.cpp Datamover kernels

make -f ./ GENKERNEL=$(XFLIB_DIR)/L2/scripts/generate_kernels SPEC=./kernel/spec.json TOOLDIR=./_krnlgen

Here you use the L2 Data-Mover generator tool ($(XFLIB_DIR)/L2/scripts/generate_kernels). This tool uses the kernel/spec.json specification to write kernel/mm2s_mp.cpp and kernel/s2mm_mp.cpp HLS kernel source files.

Step 3: Compile HLS PL Kernels

Following is an example of how the mm2s_mp kernel is compiled.

v++ -c                                                                 \
    -t hw                                                              \
    --platform xilinx_vck190_base_202410_1                             \
    --save-temps --optimize 2                                          \ 8 -I$(XFLIB_DIR)/L1/include                            \
    -I$(XFLIB_DIR)/L1/include/hw                                      \
    -I./kernel                                                         \
    -k mm2s_mp                                                          \
    --hls.clock 150000000:mm2s_mp                                       \
    --temp_dir ./build/_x_temp.hw.xilinx_vck190_base_202410_1          \
    --report_dir ./build/reports/_x.hw_emu.xilinx_vck190_base_202410_1 \
    -o './build/_x_temp.hw_emu.xilinx_vck190_base_202410_1/mm2s_mp.xo'  \

The same compilation options are used to compile the s2mm_mp, packet_sender, and packet_receiver kernels.

HLS PL Kernels

After coming up with 400 tile AI Engine design, the next step is the come up with the way to move data from DDR send it to the AI Engine. We do this by using the the AMD Vitis™ core development kit, to create kernel code in C++ meant to be accelerated on the FPGA. The kernel code is compiled by the Vitis Compiler (v++ -c) into kernel objects (XO). The following is a table describing each HLS PL kernel.

Kernel Name Description Fmax
mm2s_mp Dual-channel data-mover that moves data from DDR to AXI4-Stream. 411 MHz
packet_sender Packet switching kernel that packetizes AXI4-Stream data by generating a header packet and appropriately asserting TLAST 580 MHz
packet_receiver Packet switching kernel that evaluates packet headers from incoming streams and reroutes data to one of 4 AXI4-Streams 499.5 MHz
s2mm_mp Quad-channel data-mover that moves data from AXI4-Stream to DDR. 411 MHz

Using Vivado timing closure techniques, you can increase the FMax if needed. To showcase the example, integrate using the 300 MHz clock. There is also a 400 MHz timing-closed design in the beamforming tutorial.

alt text


The mm2s_mp is generated from the kernel/spec.json specification. Review this file. Notice the mm2s_mp kernel implementation is set to LoadDdrToStream, meaning this kernel is used to move data from DDR (AXI-MM) to AXI-Stream. It is specified to have two channels. The first channel moves data from a DDR buffer called ibuff to an AXI-stream called s0. This channel moves the i data out of DDR to AXI-Stream. The second channel moves j data from DDR buffer jbuff to an AXI-Stream s1 and streams the data directly into the AI Engine’s input_j port.


After mm2s_mp kernel loads i data onto an AXI-Stream, the s0 is the input to the packet_sender kernel. The packet_sender kernel takes raw i data and packetizes it for the AI Engine. Review the kernel/packet_sender.cpp definition. The packet_sender does the following:

  • generates a header AXI-Stream packet

  • reads the rx stream

  • writes 224 AXI-Stream data packets to one of the 100 tx streams

  • asserts TLAST appropriately on the last data packet

It repeats these actions so all 100 tx streams have a packet header and 224 data packets written to it. This is 1 iteration of data the AI Engine is expecting. The 100 tx streams are connected to the 100 input_i ports on the AI Engine.


After the AI Engine’s 100 N-Body Subsystems crunch the N-Body equations on the input_i and input_j data, it outputs four data packets on each of the 100 output_i ports. Each output data packet can have a header of 0, 1, 2, or 3, indicating that it is coming from nbody_kernel[0], nbody_kernel[1], nbody_kernel[2], or nbody_kernel[3] in each of the nbody_subsystems. The 100 output_i ports are connected to the 100 rx streams on the packet_receiver kernel. The packet_receiver kernel receives four packets from each of the 100 rx streams, and depending on the packet header, writes the data to tx0, tx1, tx2, or tx3 streams.


The s2mm_mp kernel is generated from the kernel/spec.json specification. Review this file again. Notice that the s2mm_mp kernel has an implementation StoreStreamToMaster which moves data from AXI-Streams to DDR. The s2mm_mp kernel has 4 channels: k0,k1,k2, and k3. Each stream writes the data coming from the tx0-tx3 streams to a DDR buffer.


Next Steps

After compiling the PL datamover kernels, you are ready to link the entire hardware design together in the next module, Module 04 - Full System Design.


GitHub issues will be used for tracking requests and bugs. For questions go to

Copyright © 2020–2024 Advanced Micro Devices, Inc

Terms and Conditions