LeNet - 2024.1 English

Vitis Tutorials: AI Engine

Version: Vitis 2024.1

Table of Contents


Before You Begin

Building the LeNet Design

Hardware Design Details

Software Design Details

Throughput Measurement Details



Versal™ adaptive SoCs combine programmable logic (PL), processing system (PS), and AI Engines with leading-edge memory and interfacing technologies to deliver powerful heterogeneous acceleration for any application. The hardware and software are targeted for programming and optimization by data scientists and software and hardware developers. A host of tools, software, libraries, IP, middleware, and frameworks enable Versal adaptive SoCs to support all industry-standard design flows.

This tutorial uses the LeNet algorithm to implement a system-level design to perform image classification using the AI Engine and PL, including block RAM. The design demonstrates functional partitioning between the AI Engine and PL. It also highlights memory partitioning and hierarchy among DDR memory, PL (block RAM) and AI Engine memory.

The tutorial takes you through the hardware emulation and hardware flows in the context of a complete Versal adaptive SoC system integration. A Makefile is provided that you can modify to suit your own needs in a different context.



After completing the tutorial, you should be able to:

  • Build a complete system design by going through the steps of the Vitis™ unified software platform flow: creating the AI Engine Adaptive Data Flow (ADF) API graph, compiling the A72 host application, compiling the PL kernels, using the Vitis compiler (V++) to link the AI Engine and HLS kernels with the platform, and packaging the design. You will also be able to run the design through the hardware flow and through hardware emulation, which uses a mixed SystemC/RTL cycle-accurate/QEMU-based simulator.

  • Develop an understanding of Convolutional Neural Network (CNN) layer details using the LeNet algorithm and how the layers are mapped into data processing and compute blocks.

  • Develop an understanding of the kernels developed in the design: AI Engine kernels that process the convolutional and fully connected layers, and PL kernels that process the input rearrange function and the max pool and rearrange functions.

  • Develop an understanding of the AI Engine IP interface using the AXI4-Stream interface.

  • Develop an understanding of memory hierarchy in a system-level design involving DDR memory, PL block RAM, and AI Engine memory.

  • Develop an understanding of graph control APIs to enable run-time updates using the run-time parameter (RTP) interface.

  • Develop an understanding of performance measurement and functional/throughput debug at the application level.

Tutorial Overview

In this application tutorial, the LeNet algorithm is used to perform image classification on an input image using five AI Engine tiles and PL resources, including block RAM. A top-level block diagram is shown in the following figure.

An image is loaded from DDR memory through the Network on Chip (NoC) to block RAM and then to the AI Engine. The PL input pre-processing unit receives the input image and sends its output to the first AI Engine tile, which performs matrix multiplication. The output from the first AI Engine tile goes to a PL unit that performs the first level of max pooling and data rearrangement (M1R1). That output is fed to the second AI Engine tile, and the output from that tile is sent to the PL to perform the second level of max pooling and data rearrangement (M2R2). The result is then sent to a fully connected layer (FC1), which is implemented in two AI Engine tiles and uses the rectified linear unit (ReLU) as its activation function. The outputs from the two AI Engine tiles are then fed into a second fully connected layer implemented in the core04 AI Engine tile. The output is sent to a data conversion unit in the PL and then to DDR memory through the NoC.

In between the AI Engine and PL units is a datamover module (refer to the LeNet Controller in the following figure) that contains the following kernels:

  • mm2s: a memory-mapped to stream kernel to feed data from DDR memory through the NoC to the AI Engine Array.

  • s2mm: a stream to memory-mapped kernel to feed data from the AI Engine Array through the NoC to DDR memory.
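The max pool and ReLU operations named in the data path above can be illustrated with a minimal Python sketch. It is not the tutorial's RTL or AI Engine code: the 2x2 window with stride 2 is an assumption (the text does not state the pooling window), and the function names `max_pool_2x2` and `relu` are hypothetical.

```python
def max_pool_2x2(feature_map):
    """Reduce a 2-D feature map by taking the max of each 2x2 block.

    Assumes a 2x2 window with stride 2 (not stated in the tutorial text)
    and a feature map with even height and width.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def relu(values):
    """Rectified linear unit: negative values are clamped to zero."""
    return [max(0, v) for v in values]


pooled = max_pool_2x2([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
])
activated = relu([-3, 0, 7])
```

Each pooling stage quarters the number of samples per feature map in this sketch, which is one reason a rearrangement step is needed before the data is streamed to the next matrix multiplication.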

Image of LeNet Block Diagram

In the design, there are two major PL kernels. The input pre-processing unit, M1R1, and M2R2 are contained in the lenet_kernel RTL kernel, which has already been packaged as a Xilinx object (.xo) file. The datamover kernel dma_hls provides the interface between the AI Engine and DDR memory. The five AI Engine kernels all implement matrix multiplication. The matrix dimensions depend on the image dimension, weight dimension, and number of features.
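To see how the matrix dimensions follow from the image dimension, weight dimension, and number of features, consider a convolution lowered to a single matrix multiplication (the common im2col approach). The sketch below is only an illustration under stated assumptions: the lowering scheme, the stride-1 unpadded convolution, and the example layer shapes (classic LeNet values) are assumptions and are not taken from the tutorial's source files. The function name `conv_as_matmul_dims` is hypothetical.

```python
def conv_as_matmul_dims(image_hw, kernel_hw, in_channels, out_features):
    """Return (M, K, N) for a convolution lowered to A[M, K] @ B[K, N].

    Assumes stride 1 and no padding.
    M: number of output pixels (one row of A per sliding-window position)
    K: number of weights per output feature (window size x input channels)
    N: number of output features
    """
    out_h = image_hw[0] - kernel_hw[0] + 1
    out_w = image_hw[1] - kernel_hw[1] + 1
    m = out_h * out_w
    k = kernel_hw[0] * kernel_hw[1] * in_channels
    n = out_features
    return m, k, n


# Hypothetical layer shapes based on classic LeNet, for illustration only.
conv1_dims = conv_as_matmul_dims((28, 28), (5, 5), in_channels=1, out_features=6)
conv2_dims = conv_as_matmul_dims((12, 12), (5, 5), in_channels=6, out_features=16)
print(conv1_dims)  # (576, 25, 6)
print(conv2_dims)  # (64, 150, 16)
```

Under these assumptions, a larger image grows M, a larger weight window or more input channels grows K, and more output features grows N, which is the dependency the paragraph above describes.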

Directory Structure

|____design......................contains AI Engine kernel, HLS kernel source files, and input data files
|    |___aie_src.................contains all the aie source files
|    |___pl_src..................contains all the data mover source files
|    |___host_app_src............contains host application source files
|    |___directives..............contains directives for various Vitis compilation stages
|    |___exec_scripts............contains run commands
|    |___profiling_configs.......contains xrt.ini file
|    |___system_configs..........contains all system configuration files
|    |___vivado_metrics_scripts..contains scripts to get Vivado metric reports
|____images......................contains images that appear in the README.md
|____Makefile....................with recipes for each step of the design compilation
|____description.json............for XOAH
|____sample_env_setup.sh.........required to set up Vitis environment variables and libraries