Building the Kernel Module - 2023.1 English

Vitis Tutorials: Hardware Acceleration (XD099)

Now you will build the kernel module as a standalone kernel with AXI interfaces to memory, which are also used for simulation. To do this, use the following steps:

  cd  $CONV_TUTORIAL_DIR/hls_build
  vitis_hls -f build.tcl 

TIP: This step can take some time to complete.

An output similar to the following will be printed:

    HLS Testbench for Xilinx 2D Filter Example
    Image info
    - Width     :       1000
    - Height    :         30
    - Stride    :       1024
    - Bytes     :      30720
    Running FPGA accelerator
    Comparing results
    Test PASSED: Output matches reference
    INFO: [COSIM 212-1000] *** C/RTL co-simulation finished: PASS ***

This shows that an image with a width of 1000 and a height of 30 is simulated. The image dimensions have default parameters, which are kept small so that co-simulation runs quickly. Synthesis, however, targets a maximum image size of 1920x1080.

When the build and simulation are finished, launch the Vitis HLS GUI to analyze the performance estimate reports and implementation QoR as follows:

vitis_hls -p conv_filter_prj

After the GUI opens, the first thing to notice is the Synthesis Summary report, which shows the Performance and Resource Estimates as follows:

Resource Report

It shows that the top-level module uses 139 DSPs, essentially for the sum-of-products (SOP) operations, and that the Window2D data mover block uses 14 block RAMs.

One important thing to notice is the kernel's static performance estimate of 7.3 ms, which is very close to the target latency of 6.9 ms estimated for the kernel in the previous lab.

You can also get an accurate measurement of the kernel latency from the Co-simulation report. Go to the Solution > Open Report > Cosimulation menu in Vitis HLS to open it; the report will look similar to the following:

Co-Sim Report

Since you are simulating a 1000x30 image, the expected latency is 1000 × 30 = 30,000 cycles plus the fixed latency of some blocks (assuming one clock cycle per output pixel). The number shown in the report is 38,520, so the fixed latency is 8,520 cycles. When the actual image size is 1920x1080, this fixed latency is amortized across many more image lines, which you can verify by simulating with a larger image.

The loop initiation intervals (II) also verify that the kernel can achieve a throughput of one output sample per cycle. The expanded view of the synthesis report shows that all loops have II=1, as follows:

II Report

When you have verified that the throughput requirements are met and the resource consumption is acceptable, you can move forward and start integrating the full application. This consists of creating and compiling the host application that drives the kernel, and building the kernel using one of the AMD platforms for Alveo Data Center accelerator cards. This lab uses the Alveo U200 card.

In this lab, you learned about:

  • Optimized implementation of the convolution filter

  • Building the kernel and analyzing its performance using Vitis HLS

Next Lab Module: Building the 2-D convolutional Kernel and Host Application

Copyright © 2020–2023 Advanced Micro Devices, Inc.

Terms and Conditions