Vitis Tutorials: AI Engine Development (XD100) - 2025.1 English - Learn how to target, develop, and deploy advanced algorithms using a Versal AI Engine array in conjunction with PL IP/kernels and software applications running on the embedded processors. - XD100
Document ID
XD100
Release Date
2025-08-25
Version
2025.1 English
AI Engine Development
AI Engine Tutorials
AIE Design Tutorials
AIE Lenet Tutorial
Introduction
Tutorial Overview
Before You Begin
Tools: Installing the Tools
Environment: Setting Up the Shell Environment
Super Sampling Rate FIR
Dual SSR 16 HW
Goal of this hardware implementation
Compile the graph
Build hardware and generate sd_card.img
Logging in for the first time
Support
Dual Stream SSR
Dual-Stream Input Impact
Designing the Graph
C++ Code Analysis
Compilation and Analysis
Support
Multi Kernel
Designing the Kernel
C++ Code Analysis
Data and Coefficients Management and Operation Scheduling
Compilation and Analysis
Support
Single Kernel
Filter Description
Designing the Kernel
Interfaces
Data and Coefficients Management
Coefficients and Data Update Scheduling
Compilation and Analysis
Vitis Analyzer
Script Utils
Support
Single Stream SSR
Super Sampling Rate FIR Filter
Super Sampling Rate and Polyphase
Organize Computation for a 2.5 Gsps Data Stream in 2 Phases
Designing the Graph
C++ Code Analysis
Compilation and Analysis
Support
Beamforming
Module 01 - Custom Platform
Options Table
Dependencies
Build Products
Introduction: What is a Custom Vitis Embedded Platform?
What is the Hardware Platform?
What is the Software Platform?
Platform Vivado Project
Create Platform Vivado Project
Create Block Design
Port Instantiation
AI Engine
AXI Debug Hub IP and Simulation Clock and Reset Generator IP
AXI SmartConnects
AXI Verification IPs
Clock Infrastructure
CIPS
NoC
Create Interface Connections
Clock Connections
AXI SmartConnect Connections
CIPS and NoC Connections
NoC Connections
Clocking Infrastructure Connections
CIPS Clocks
Create Address Segments
Set Platform Attributes
Control Interfaces Requirements
Memory Interface Requirements
Clock Requirements
Set Platform Attributes with for Loops
DDR4 Constraints
Create Wrapper for Block Design
Post Link Tcl Commands
Timing Closure
Emulation Setup
Platform Output Type
Wrap Up Vivado Project
Export Hardware XSA
Software Platform
Platform Create
Domain Create: AI Engine
Domain Create: Linux
BIF File
Boot Directory
Domain Create: Bare Metal
Generate Platform
References
Support
Module 02 - AI Engine Design
Options Table
Dependencies
Build Products
Introduction
AI Engine Kernels, Graphs, and Applications
AI Engine Kernels and Graphs
Cascading Chain Subgraph
Downlink Subgraph
Uplink Subgraph
Test Beamforming Graph
AI Engine Application
Sending Data to the Beamforming Kernels
AI Engine Kernels Parameters
AI Engine Subgraph Window Connections
AI Engine Application Data Files
Simulating the AI Engine Graph Application
Run-Time Event API for Performance Profiling
Conclusion
References
Support
Module 03 - PL Design
Building the Design
Dependencies
PL Master Kernels
PL Slave Kernels
Build Products
PL Kernels: Master and Slaves
PL Master Kernels
PL Master Execution Flow
Reset
Configuration
BLOCK_SIZE
NITER and ROLLOVER_ADDR
Start
Done
IP Kernelization
PL Slave Kernels
PL Slave Execution Flow
Reset
Configuration
BLOCK_SIZE
NITER and ROLLOVER_ADDR
Start
Done
AXI4-Stream Register Slice
Beamforming Design: Downlink AI Engine Graph
Beamforming Design: Uplink AI Engine Graph
References
Support
Module 04 - AI Engine and PL Integration
Building the Design
Build XCLBIN from Scratch
Options
Dependencies
Build Products
Introduction: Linking the System
Timing Summary
REV0: vck190_v1_0_wrapper_timing_summary_routed.rpt
REV1: vck190_v1_0_wrapper_timing_summary_routed.rpt
REV0 Configuration File (config.ini)
[connectivity] Section
Number of Kernels
Streaming Connections
[clock] Section
[advanced] Section
New XSA Platform: rev0
Timing Closure
Timing Closure Strategy
REV1: Configuration File (config_2regslice.ini)
[connectivity] Section
[clock] Section
[vivado] Section
New XSA Platform: rev1
References
Support
Module 05 - Baremetal Host Application
Introduction: Building a Bare-Metal System
Building the Design
Difference between main_partial.cpp and main_full.cpp
Generating the Platform
Compiling the PS Application Source Code
Linking the PS Application Source Code
Bare-Metal Source Code
PS Host Application
Main Function
test_dlbf/test_ulbf Functions
Reset
Configuration
Check RAM
Start
Wait for Done: Inputs
Wait for Done: Outputs
Verify Output
Test ULBF
References
Support
Module 06 - Running the Baremetal System
Building the Design: Hardware Emulation
Dependencies
Build Products
Running the System: Hardware Emulation
Building the Design: Hardware
Dependencies
Build Products
Running the System: Hardware
References
Support
Module 07 - Petalinux
Differences between Bare Metal and PetaLinux
Building the Design
Building the PetaLinux Software Platform
Create PetaLinux: Creating the PetaLinux Project with a BSP
Config PetaLinux: Updating the PetaLinux Project with an XSA
Config PetaLinux: Customizing the Root File System
Config PetaLinux: Updating the Device Tree
Config PetaLinux: Customizing Kernel Configuration
Config PetaLinux: Clean-Up
Build PetaLinux: Building the PetaLinux Image
Build PetaLinux: Building the SDK (Target Sysroot Generation)
Build PetaLinux: Installing the SDK (Target Sysroot Generation)
Build PetaLinux: Generating the Boot Image
Build the Versal Custom PetaLinux Platform
References
Support
Module 08 - Linux SW Application
Introduction: Programming the PS Host Application
Execution Flow Chart
Bind UIO Drivers with PL Kernels
Changes from 2025.1
Load AIE XCLBIN
Reset AI Engine
Load AI Engine with XCLBIN
Reset AI Engine in the Middle of Execution
Command-Line Arguments
Support
Module 09 - Running the Linux System
Running the System
Support
Polyphase Channelizer
Introduction
Channelizer Requirements
MATLAB Model
System Partitioning
Clock Rate and SSR Planning
Circular Buffer
Polyphase Filterbank
Cyclic Shift Buffer
IDFT
Design Overview
Polyphase Filterbank Design
Discrete Fourier Transform Design
Build and Run Design
Setup & Initialization
Hardware Emulation
Hardware
Estimating Power Using the Power Design Manager
Step 1: Building the Design for VCK190 and Executing Power Targets
Step 2: Creating a New Project
Step 3: Refining the AI Engine Power Estimate Using Simulated Design and Switching Activities
References
Support
Prime Factor FFT
Introduction
MATLAB Models
I/O Permutations (2D Case)
I/O Permutations (3D Case)
Design Overview
INPUT PERMUTE Kernel
FFT-7 Kernel
TRANSPOSE1 Kernel
FFT-9 Kernel
TRANSPOSE2 Kernel
FFT-16 Kernel
OUTPUT PERMUTE Kernel
Design Resources
Build and Run Design
Setup & Initialization
Hardware Emulation
Hardware
References
Support
2D FFT AIE vs HLS
AIE Implementation
AI Engine Implementation
Building the Design
Make Steps
Build the Entire Design with a Single Command
make kernels: Compiling PL Kernels
make graph: Creating the AI Engine ADF Graph for Vitis Compiler Flow
make xsa: Using the Vitis Tools to Link AI Engine and HLS Kernels with the Platform
make application: Compiling the Host Application
make package: Packaging the Design
make run_emu: Running Hardware Emulation
Running on Hardware
Hardware Design Details
Design Details
AI Engine and PL Kernels
dma_hls
Software Design Details
AI Engine Kernels and Graph Representation
Adaptive Data Flow (ADF) Graph
Defining the Graph Class
Top-Level Application
PL Data Mover Kernel
dma_hls (dma_hls.cpp)
Top Function Declaration
Top Function Definition
PS Host Application
HLS Implementation
HLS Implementation
Building the Design
Make Steps
Build the Entire Design with a Single Command
make kernels: Compile PL Kernels
make xsa: Using the Vitis Tools to Link HLS Kernels with the Platform
make application: Compile the Host Application
make package: Packaging the Design
make run_emu: Running Hardware Emulation
Running on Hardware
Hardware Design Details
Design Details
HLS/PL Kernels
FFT_2D
DMA_HLS
Software Design Details
HLS/DSP Kernel Representation
Data Flow
Define FFT Inputs
Required Headers and Function Declarations
FFT Core Config Structure
Top Function
Sub-Function Details
Reading Data
FFT Function
Writing Out Data
PL Data Mover Kernel
dma_hls (dma_hls.cpp)
Top Function Declaration
Top Function Definition
PS Host Application
FIR Filter AIE vs HLS
AI Engine Implementation
Building the Design
Make Steps
Build the Entire Design with a Single Command
make kernels: Compile PL Kernels
make graph: Creating the AI Engine ADF Graph for Vitis Compiler Flow
make xsa: Use Vitis Tools to Link AI Engine and HLS Kernels with the Platform
make application: Compile the Host Application
make package: Package the Design
make run_emu: Run Hardware Emulation
Run on Hardware
Hardware Design Details
Design Details
AI Engine and PL Kernels
Software Design Details
Data Flow Graph
Define the Graph Class
Instantiate DSPLib FIR Filters
Add Connectivity Information
Top Level Application
PL Kernels
datamover (datamover.cpp)
Arguments
pragma HLS INTERFACE s_axilite
pragma HLS INTERFACE axis
pragma HLS PIPELINE II=1
PS Host Application
Include graph.cpp
load_xclbin Function
Datamover Class
FIR Chain Class
Main Function
1. Check Command Line Argument
2. Open XCLBIN
3. Create and Initialize Data Mover Kernels and FIR Chain Graph
4. Run the Data Mover Kernel and FIR Chain Graph
5. Wait for Data Mover Kernels to Complete
6. Verify Output Results
7. Release Allocated Resources
References
AI Engine Documentation
Support
HLS Implementation
Building the Design
Make Steps
Build the Entire Design with a Single Command
make kernels: Compile PL Kernels
make xsa: Use Vitis Tools to Link HLS Kernels with the Platform
make application: Compile the Host Application
make package: Package the Design
make run_emu: Run Hardware Emulation
Run on Hardware
Hardware Design Details
Design Details
HLS PL Kernels
Software Design Details
N Body Simulator
Running the Simulation
Python Simulations on x86 Machine
Results
(Optional) Creating Animation GIFs
Next Steps
Support
Build the Design
AI Engine Design
A Single Nbody() Kernel
Four NBody() Kernels Packet Switched
Workload Distribution and input_j
100 N-Body Subsystems
Why Packet Switching?
(Optional) Simulate the AI Engine Design
References
Next Steps
Support
Building the Design
Step 1: Set the Vitis Utility Library path
Step 2: Generate mm2s_mp.cpp and s2mm_mp.cpp Datamover kernels
Step 3: Compile HLS PL Kernels
HLS PL Kernels
mm2s_mp
packet_sender
packet_receiver
s2mm_mp
References
Next Steps
Support
Building the Design
Full System Design
Design Implementation
References
Next Steps
Support
Building the Design
Step 1: Compile Host Software
Step 2: Link Host Software
Host Software
NBodySimulator API
Logger API
Log Levels
Host Applications
References
Next Steps
Building the Design
SD Card Image Generation
Booting the VCK190 Board
Running the Design on Hardware
References
Next Steps
Support
Creating the Animation GIF
Results
Latency Performance Comparisons
Design Throughput Calculations (Effective vs. Theoretical)
(Optional) Building x1_design and x10_design
Building the x1_design (simulates 128 particles)
Building the x10_design (simulates 1,280 particles)
Support
DDC Chain
Table of Contents
Introduction
Upgrading Tools, Device Speed Grade, and Makefile
Upgrading the Code
Converting Kernel Functions to Kernel Classes
Migrating from Windows to Buffers
Replacing Intrinsics with APIs
Relocating Global Variables to Kernel Class Data Members
Handling State Variables to Enable x86sim
Updating Older Pragmas
Supporting x86 Compilation and Simulation
Building and Running the Design
Setup and Initialization
x86 Functional Simulation
Hardware Simulation
Summary
Support
License
GeMM AIE vs DSP
AI Engine Implementation
Building the Design
Make Steps
Build the Entire Design with a Single Command
make kernels: Compiling PL Kernels
make graph: Creating the AI Engine ADF Graph for Vitis Compiler Flow
make xsa: Using the Vitis Tools to Link AI Engine and HLS Kernels with the Platform
make application: Compiling the Host Application
make package: Packaging the Design
make run_emu: Running Hardware Emulation
Running on Hardware
Hardware Design Details
Design Details
AI Engine and PL Kernels
dma_hls
Software Design Details
GeMM DSP58 Implementation
Building the Design
Make Steps
Build the Entire Design with a Single Command
make kernels: Generates the PL Kernels
make xsa: Using the Vitis Tools to Link HLS Kernels with the Platform
make application: Compile the Host Application
make package: Packaging the Design
make run_emu: Running Hardware Emulation
Running on Hardware
Hardware Design Details
PL Kernel Details
Platform Details
Software Design Details
Bilinear Interpolation
Introduction
Computing Interpolated Values
Design Assumptions
AI Engine Code Vectorization
Data Interface
Programmable Logic Component
PLIO Interface
AI Engine Test Vectors
AI Engine Kernel Processing
Kernel Data Interface
Kernel Code
Running the Example
Generating Test Vectors
Running x86 Simulation
Running AI Engine Simulation
Analyzing Results
Vitis Analyzer
Test Vector Comparison
Customizing the Example
Specifying a Test Image and Output Resolution
Multicore Processing
References
Support
2D IFFT 64K
Introduction
MATLAB Model
Design Overview
Design Approach
IFFT-256 Prototyping
Front-End IFFT-256 AI Engine Kernel
Memory Transpose PL Kernel
Back-End IFFT-256 AI Engine Kernel
Design Resources
Build and Run Design
Setup & Initialization
Hardware Emulation
Hardware
References
Support
FFT DFT on AIE
Introduction
Discrete Fourier Transform
Fast Fourier Transform
Stockham Fast Fourier Transform
DFT Designs on AI Engine
DFT as a Vector x Matrix Multiplication
High Throughput SSR=8 DFT Design
Throughput Measurement for the dft16 Design
FFT Designs on AI Engine
FFT Designs on AI Engine
Single Tile AI Engine API Design
Throughput and Latency Measurements for fft32_r2 Design
Optimization Technique: Batch Processing
Single-Tile DSPlib Design
Throughput and Latency Measurements for fft32_dsplib Design
Optimization Technique: Split Stages
Throughput and Latency Measurements for fft32_dsplib_split Design
Optimization Technique: Parallel Implementation
Throughput and Latency measurements for fft32_dsplib_ssr Design
Conclusion
Bitonic Sorting
Introduction
Small Bitonic Sorting Example
Stage 0
Stage 1
Stage 2
Stage 3
Profiling of \(N=16\) Bitonic Sort vs. std::sort()
Large Bitonic Sorting Example
Profiling of \(N=1024\) Bitonic Sort vs. std::sort()
References
Support
Farrow Filter
Introduction
Requirements and System Partitioning
Compute Analysis
Bandwidth Analysis
Storage Analysis
AI Engine Implementation and Optimization
Initial Farrow Design
First Farrow Optimization
Second Farrow Optimization
Final Farrow Optimization
Build and Run Design
Setup and Initialization
Hardware Emulation
Hardware
Summary and Conclusion
References
Support
1M Point FFT 32Gsps
Introduction
MATLAB Models
Design Overview
AI Engine Graph View
AI Engine Array View
VC1902 Floorplan View
AI Engine Design Validation
VC1902 Timing Closure
Design Resources
Build and Run Design
Setup & Initialization
Hardware
References
Support
Hough Transform
Introduction
What is the Hough Transform?
What is System Partitioning?
System Partitioning Methodology
Hough Transform MATLAB Model
System Partitioning
Goals
Parallelizing Over “Image Tiles”
Parallelizing Over “Theta”
Analyzing Storage Requirements
Analyzing Compute Requirements
Analyzing I/O Bandwidth Requirements
SIMD / Vectorization
Solution Synthesis
Partitioning Validation
Iterating to System Feasibility
Conclusions
References
Support
MUSIC Algorithm
Introduction
System Model
Subspace Algorithm
MUSIC Spectrum Estimation
MATLAB Model
AI Engine Subgraph Designs
IO Adapter Subgraph
QRD Subgraph
SVD Subgraph
DOA Subgraph
Scanner Subgraph
Finder Subgraph
Top-Level Design
Building the Design
Setup and Initialization
Hardware Emulation
Hardware
Hardware-in-the-Loop Demo
Architecture
System Operation
Performance Estimation
Software Version
MATLAB Folder Structure
Steps to Generate and Run HIL Demo Data
Archiving Demo Data
Playback Videos
Client and Server on MATLAB
Conclusions
References
Appendix
Deploying the SD Card Image
Booting the VCK190 Board
Simple Ethernet Configuration
Using a VPN
Running the PS Application
Testing with MATLAB
Support
Softmax Function
Introduction
Softmax Function Definition
Computing the Exponential Function
IEEE 754 Format Trick
Improving Accuracy
Adapting for Single-Precision Floating-Point
AI Engine Implementation
AI Engine Kernel Processing
Kernel Data Interface
Kernel Code
Running the Example
Generating Test Vectors
Running x86 Simulation
Running AI Engine Simulation
Analyzing Results
Vitis Analyzer
Test Vector Comparison
References
Support
TDM Mixer
Introduction
Corner-Turning using Tile DMA
Vectorization of the Mixer
Corner-Turning Concept
Local Tile DMA Tiling Parameters
TDM Mixer Graph Design
Input Buffer Tiling Parameters
Output Buffer Tiling Parameters
Baseline Mixer Design
Vitis Functional Simulation
Optimized Mixer Design
Conclusions
References
Support
Back-Projection for Synthetic Aperture Radar on AI Engines
Introduction
Goals
GOTCHA Volumetric SAR Data Set
References
Back-Projection Engine
Back-Projection Engine
Design Approach
DDR Buffers and PL URAM Buffers
BP Engine Graph and Kernel Scheduling
Graph View
Floorplan View
Resource Utilization
Throughput and Latency
Hardware Emulation
Hardware
Block Design: ifft2k_async()
Block Design: range_gen()
Block Design: diff3dsq()
Block Design: sqrt()
Block Design: dR_comp()
Block Design: fmod_floor()
Block Design: expjx()
Block Design: interp1()
Block Design: image_buffer()
Final SAR BP Engine Performance
References
Design Builds
Design Builds
Setup and Initialization
Single Engine Design Build
Multiple Engine Design Build
Multiple Engines
Multiple Engines
Overview
Placement Constraints
Device-Level Details
Hardware Throughput
Opportunities for Optimization
System Model
System Model
Introduction and Approach
Structural Similarity Index Measure
MATLAB System Model
Inner Loop Analysis
SAR Back-Projection Compute Workloads
Algorithm Adaptations for AI Engine
System Parameter Adaptations
ifft() Adaptations
Vectorized Functional Approximation
interp1() Adaptations
fmod_floor() Adaptations
Final AI Engine Algorithm Performance
SAR BP Engine Block Diagram
References
System Partitioning
System Partitioning
System Parameters & Performance Targets
AI Engine Prototyping
SAR BP Engine Design Proposal
Projected System Throughput
Projected System Resources
Next Steps
References
Conclusion
Conclusion
AIE Feature Tutorials
AIE A to Z
Custom Base Platform Creation
Platforms
Step 1: Build the AMD Versal™ Extensible Embedded Platform Example Design in Vivado
Step 2: Build the Platform in the Vitis Software Platform
AIE Application Creation
Step 1: Create a new AI Engine Application Project
Step 2: Build the Project and Run Through Emulation-AIE
PL Application Creation
Step 1: Modify the Graph for Use in Hardware Build
Step 2: Add PL Kernels
Step 3: Configure the Hardware Linking Project
Step 4: Build the System
PS Application Creation Run All
Step 1: Create a New Platform in the Bare-metal Domain
Step 2: Build the Baremetal AI Engine Control Application
Step 3: Package the Full System
Step 4: Run the System in Hardware Emulation
Step 5: Build the System targeting the Hardware
Step 6A: Run the System in Hardware via SD Boot
Step 6B: Run or Debug the System in Hardware using JTAG
Summary
Using GMIO
AI Engine GMIO Performance Profile
Design Introduction
Performance Profiling Methods
Profiling using C++ Class API
Profiling using AI Engine Cycles Received from AI Engine Kernels
Profiling using the Event API
Conclusion
AI Engine GMIO Programming Model
Step 1 - Synchronous GMIO Transfer
Run AI Engine Compiler and AI Engine Simulator
Step 2 - Asynchronous GMIO Transfer for Input and Synchronous GMIO Transfer for Output
Run AI Engine Compiler and AI Engine Simulator
Step 3 - Asynchronous GMIO Transfer and Hardware Flow
Run AI Engine Simulator and Hardware Flow
Conclusion
RTP Reconfiguration
Introduction
Overview
Steps
Asynchronous Scalar RTP
Asynchronous Array RTP
Asynchronous RTP Read
Synchronous RTP
Summary
Support
Packet Switching
Buffer-based AI Engine Kernels
Construct Graph with Packet Switching Capability
Packet Format
Prepare Data and Run AI Engine Simulator
Example PL Kernels for Packet Switching
Example PS code for Packet Switching
Run Hardware Emulation and Hardware Flows
Conclusion
Buffer-based AI Engine Kernels with Mixed Data Types
Prepare Data for AI Engine Simulator
PS Application and HW Emulation Flows
Conclusion
Support
Packet Stream-based AI Engine Kernels
Packet Stream Interfaces and Operations
Construct Graph for Packet Stream Kernels
Run the AI Engine Simulator, HW Emulation, and HW Flows
Conclusion
Support
AI Engine Versal Integration
Objectives
Tutorial Overview
Step 1: Launch Vitis Unified IDE
Step 2: Create and Build the AI Engine Component
Simulate the AI Engine Graph using the x86simulator
Build and Run the AI Engine Graph for Hardware
Step 3: Create and Build HLS Components
Step 4: Create and Build the Application Component
Step 5: Create the System Project
Step 6: Building and Running the System Project
Building and Running for Software Emulation
Building and Running for Hardware Emulation
Building and Running on Hardware
Step 7: Using the Analysis View
Support
Versal System Design Clocking Tutorial
Introduction
Objectives
Step 1 - Building ADF Graph
Step 2 - Clocking the PL Kernels
Step 3 - v++ Linker: Building the System
Step 4 - Compiling Host Code
Step 5 - Packaging Design and Running on Board
Challenge (Optional)
Build the design for Hardware Emulation
Summary
AI Engine Floating Point
Introduction
AI Engine Architecture Details
Fixed-Point Pipeline
Floating-point Pipeline
Floating-point intrinsics
Start, offset
fpneg, fpabs, fpadd, fpsub
fpneg
fpabs
fpneg_abs
fpadd, fpsub
fpadd_abs, fpsub_abs
fpmul
fpabs_mul
fpneg_mul
fpneg_abs_mul
fpmac, fpmsc, fpmac_abs, fpmsc_abs
fpmul_conf, fpmac_conf
Floating-Point Examples
FIR Filter
Real Floating-Point Filter
Complex Floating-Point Filter
Matrix Multiply
Support
DSP Library
Introduction
Part 1: Creating a Single Kernel Graph
Understanding the Source Files
Compile the application
Running the Design through Simulation
Using Vitis Analyzer to look at the Simulation Results
Part 2: Creating a Multi Kernel Graph
Changes to the Filter Graph from Part 1
Build AI Engine Emulation
Running the Design through Simulation
Using Vitis Analyzer to look at the Compilation and Simulation Results
Part 3: Optimizing Filter Performance
Changes to the Filter Graph from Part 1
Build AI Engine Emulation
Running the Design through Simulation
Using Vitis Analyzer to look at the Compilation and Simulation Results
Conclusion
Debug Walkthrough
Porting a Command Line Project to the Vitis IDE Project
Step 1: Launch Vitis Unified IDE
Step 2: Create an AI Engine Component
Step 3: Create HLS Components
Step 4: Create the Application Component
Step 5: Create the System Project
Support
AI Engine Simulation Debug Walkthrough
AI Engine Simulation Debug Walkthrough
Introduction
Features
Section 1
Build and Simulate in the Vitis IDE
Section 2
Debug Using printf
Section 3
Debug Using the Vitis IDE Debugger
Limitations
Section 4
Enabling Profile and Trace Options
Exercise Step
Section 5
Deadlock Detection
Section 6
Visualizing Deadlock in the Vitis Analyzer
Section 7
Debugging Memory Access Violations
Section 8
Kernel Debug
Section 9
Design Performance Debug
Calculating the Graph Throughput Using Graph Output
Section 10
Determine Average Throughput of PLIO
Support
Hardware-Emulation Debug Walkthrough
Hardware-Emulation Debug Walkthrough
Introduction
Features
Section 1
Build for Hardware Emulation Using the Vitis IDE
Section 2
Debug PL Kernels Using the Vivado Logic Simulator
Section 3
Performance of the AI Engine Using the Hardware Emulation Results
Calculating the Kernel Latency
Calculating the Graph Throughput Using the Graph Output
Section 4
Command Line Project Source Code Debug with the Vitis Unified IDE
Refer to Chapter 52: Debugging the System Project and AI Engine in UG1393 for more details on debugging the AI Engine.
Support
Hardware Debug Walkthrough
Design Execution and System Metrics
Features
Running the Design on Hardware
Analyzing Run Results
AI Engine Status Using XRT
Manual AI Engine Status Using the XBUtil Utility
Deadlock Detection Using XSDB
Error Handling and Reporting in the Host Application
XRT Error Handling APIs
Using XBUtil
Using APIs in the Host Application
Profiling Graph Throughput
Exercise Step
Profiling to Count Samples Sent and Received
Support
System Profiling
Features
Generating the Hardware Image
Hardware Profiling Features
XRT Flow
Open Multiple Profile Runs in the Vitis Analyzer
Profiling Data Explanation
AI Engine Core Profiling Data
AI Engine Memory Profiling Data
Interface Profiling Data
Profiling Data Analysis
XSDB Flow
Support
PL Kernel Analysis
Features
Getting the Design Files Ready
Profiling Using PL Profile Monitors
Inserting ILAs to Monitor Specific AXI Interfaces
Enable ILA in the Design
Set Up the Connection in Vivado
Examine the Captured Results
Support
AI Engine Event Trace and Analysis
Event Trace Analysis Features
Build the Design
Prepare for the Hardware Run
XRT Flow
Launch the Vitis Analyzer to Examine the Event Trace Files
Details of the Event Trace Data
XSDB Flow
Event Trace Considerations
Event Trace Choice Considerations
Number of Event Trace Streams Methodology
Event Trace Limitations
Event Trace Analysis Using HSDP
Set Up the SmartLynq+ Module and Connect to the Versal Device
Launch XSDB, and Offload Trace Information
Limitations
Debug the Host Code and Kernel Source Code using the Vitis IDE
Limitations of the Source Code Debug on Hardware
Support
X86 Simulation Debug Walkthrough
X86 Simulation Debug Walkthrough
Introduction
Features
Section 1
Build and Simulate in the Vitis IDE
Section 2
Debug Using printf()
Section 3
Debug Using printf with Vector Datatypes
Section 4
Debug Using the Vitis IDE Debugger
Section 5
x86simulator Options for Debugging
Data Dump
Deadlock Detection
Scenario 1
Scenario 2
Trace Report in the File
Trace Report in the Output Console
Section 6
Memory Access Violation and Valgrind Support
Set Up the Environment Variables
Section 6 Exercise Step
Section 7
Using the GDB Debugger in the Command Line
x86simulation on the Command Line
x86simulation with the GDB
x86simulator Using the GDB Server
Section 7 Exercise Step
Support
AIE DSP Lib Model Composer
Introduction
Before You Begin
Overview
Stage 1: Create and Simulate the Design
Stage 2: Further Analysis of the Design
Stage 3: Generate the Code and Perform Emulation-AI Engine
Stage 4: Increase the PLIO Bitwidth and Re-generate
Conclusion
AI Engine Emulation Waveform Analysis
Introduction
Objectives
Tutorial Overview
Design Overview
Transaction Level Modeling
Steps
Step 1: Build Design
Step 2: Launching Emulation with XSIM Waveform GUI
Step 3: Using XSIM Waveform GUI and QEMU
Exploring the Waveforms
Checking Proper Boot-up Using PMC
Transactions Generated by PS (QEMU) to PL/AIE
PL to AI Engine
AI Engine RTP Signals
AI Engine to PL to DDR Memory
Limitations
Step 4: Using Vitis Analyzer
Summary
AIE Performance Analysis
AI Engine Graph Execution and Measurement
Graph and Kernel Code
Graph Execution Model
Graph Performance Measurement
Design Optimization Considerations
Conclusion
Support
AI Engine Deadlock Analysis
Common Deadlock Scenarios
AI Engine Deadlock Example and Analysis in AI Engine Simulator
AI Engine Stall Analysis with Vitis Analyzer
AI Engine Deadlock Detection in the Hardware Emulation Flow
AI Engine Deadlock Detection in the Hardware Flow
Conclusion
Appendix (Optional)
Manual Dump and Register Reading to Detect AI Engine Status in Hardware Emulation and Hardware
Support
AI Engine Status Analysis
Setting Up and Running the Design
Option 1: Automated and Periodic AI Engine Status Output
Analyzing the Automated Status Output
Option 2: Manually Output the AI Engine Status
Analyzing the Manual Status Output
Conclusion
Support
Implementing IIR Filter
Implementing an IIR Filter on the AI Engine - Part 1a
Preliminaries
Kernel Code
Julia Script Notes
Adaptive Dataflow Graph
Testbench Code
Build and Run the Program
Conclusion
References
Support
Implementing an IIR Filter on the AI Engine - Part 1b
Recap
Julia Script
Adaptive Dataflow (ADF) Graph
Testbench Code
Building and Running the Design
Changing Coefficients During Runtime
Conclusion
Support
Implementing an IIR Filter on the AI Engine - Part 2a
Implementing an IIR Filter on the AI Engine - Part 2a
Preliminaries
Kernel Code
Testbench Code
Analysis
Conclusion
Support
Implementing an IIR Filter on the AI Engine - Part 2b
Implementing an IIR Filter on the AI Engine - Part 2b
Preliminaries
Kernel Header
Kernel Code (AI Engine API)
Graph Code
Testbench Code
Analysis (using AI Engine API)
Generated Code
Throughput
Kernel Code (LLI)
Conclusions
Support
Post Link Recompile
Direct AI Engine Recompile Makefile Flow
Initialization
Phase 1: Compile AI Engine application and PL Kernels and Link the System
Phase 2: Recompile the AI Engine Application, Package the New System, and Rerun Hardware Emulation
Perform On-Board Testing
Support
Post-Link Recompile of an AI Engine Application
Initialization
Phase 1: Creating a Fixed Platform from an AI Engine Application and PL Kernels
Phase 2: Using a Platform Generated by Vitis and Modifying the AI Engine Application
Perform On-Board Testing
Support
RTL IP with AIE Engines
Introduction
Objectives
Tutorial Overview
Step 1 - Creating custom RTL kernels with the Vivado Design Suite
Step 2 - Creating HLS kernels with Vitis compiler
Step 3 - Interfacing ADF graph to Programmable Logic
Step 4 - Building XCLBIN
Step 5 - Build Host Application
Step 6 - Package
Step 7 - Run Emulation
To View Emulation Waveforms
Summary
AIE A to Z Custom Linux Platform
AI Engine Graph Integration and Validation using a Custom Linux Platform
Prerequisites
Setting up the environment
Re-compiling ADF graph
Re-compiling Programmable Logic (PL) kernels targeting the custom platform
Hardware Emulation
Targeting Hardware
Support
AIE Compiler Features
Conditional Objects Instantiation
Introduction
Basics of Conditional Instantiation
Conditional Usage Examples
Case 1: Conditional Cascade Port
Case 2: Conditional Array of Sub-Graphs
Case 3: Conditional Sequential Sub-Graphs
Case 4: Conditional RTP Ports
Support
Data Multicasting
Introduction
Case 1: Stream and Buffer Multicasting
Case 2: Multirate Buffer Multicasting
Support
Multirate AI Engine Graphs
Introduction
Multirate Examples
I/O-buffer Interface
UpConv then DownConv
DownConv then UpConv
Split and Merge
Stream Interface
No Repetition Count Indicated
UpConv then DownConv
DownConv then UpConv
Split and Merge
Support
Two Tone Filter
Introduction
Before You Begin
Overview
AIE Independent Graphs
Compiling AI Engine Graphs for Independent Partitions
Step 1: Compile and Verify Each Partition with AIE simulator
Partition pr0 in folder pr0_gmio
Partition pr1 in folder pr1_rtp and partition pr2 in folder pr2_perf
Step 2: V++ linker to integrate the partitions
Step 3: Compile host code
Step 4: Package for hardware
Step 5: Run applications in HW
Summary
Support
Partition Reloading
Generate AI Engine-only and PL-only XCLBIN
Host code for controlling Graph and Partition Reloading
Reference Design 1: (./partition_reload_same_graph)
Reference Design 2: (./partition_reload_diff_graph)
Reference Design 3: (./AIE_reload_whole_array)
Summary
Support
AIE PL Interface
Introduction
Part 1 - Connecting RTL AXI4-Stream Interfaces (included in Block Design) to the AI Engine
Platform
Hardware Platform creation
Vitis V++ Link
Hardware Emulation
Part 2 - Connecting RTL AXI4-Stream interfaces (NOT included in Block Design) to the AI Engine
Hardware Platform
Vitis V++ Link
Hardware Emulation
README
Part 3 - Connecting Monitored RTL Interfaces to AI Engine
Creating the design
Running the Design in Hardware
Part 4 - Broadcasting Data to the AI Engine and the Programmable Logic
Creating the design
Hardware Emulation
AI Engine Algorithm Performance Optimization
AI Engine Kernel Optimization Lab
Introduction
DVB-S2 Soft Demodulator
Kernel Code
Lab A Walkthrough
Lab B Walkthrough
References
Support
A "Gentle" Introduction to AI Engine Kernel Programming
Overview
Introduction
A Brief Overview of AI Engine Tiles and Kernels
Scalar and Vector Processors
AVX Versus AI Engine APIs
Vector Addition with AVX Intrinsics
vadd_avx.cpp
Vector Addition with AIE API
vadd_aie.cpp
Building and Running the AVX CPU Program
Building and Running the AIE API Program
AI Engine APIs
Modified Kahn Process Network (KPN)
Which Applications are Best Suited for AI Engines?
Code Required to Create a Program to Run on an AI Engine
Kernel Code Structure
Graph Code Structure
Test Bench/Control Code Structure
AI Engine Kernel Input and Output Types
A Contrived Task to Illustrate How to Access AIE Kernel I/O Ports
Sample Code for Stream Input and Output
Unit Test for Squared Magnitude Module
Sample Code for Buffer Input and Output
Unit Test for Matrix Multiplication Module
Sample Code for Accumulator Cascade
Unit Test for Matrix-Vector Multiplication Module
Sample Code for Runtime Parameter (RTP)
Unit Test for SumDiff Module
Create the Contrived Task
Advanced Dataflow Graph
Top-level File for Contrived Design
Build the Contrived Design
Conclusion
Learning Resources for AI Engine Kernel Programming
Support
AI Engine-ML Tutorials
AIE-ML Design Tutorials
AIE-ML Programming
AI Engine-ML Architecture
Introduction
AI Engine-ML processor array
Support
Compute Optimization
AI Engine-ML matrix multiplication Instruction Set
IO or Compute bound?
Example 1
Tutorial Example
Code analysis
Running the tutorial
Performance Analysis
Conclusion
Support
Matrix Multiplication Compute Performance of the AI Engine-ML Tiles
Support
Tiling Parameters Programming
Introduction
Tiling parameter structure
A graphical Example
Some other examples
1D Linear with Zero-Padding before
1D Linear with Zero-Padding and Truncation
3D Linear with zero padding around
Support
Prime Factor FFT-1008 on AIE-ML
Introduction
Matlab Models
Design Overview
INPUT PERMUTE Kernel
FFT-7 Kernel
TRANSPOSE-0 Kernel
FFT-9 Kernel
TRANSPOSE-1 Kernel
FFT-16 Kernel
OUTPUT PERMUTE Kernel
Design Resources
Build and Run Design
Setup & Initialization
Hardware Emulation
Hardware
References
Support
License
AIE-ML LeNet Tutorial
Introduction
Tutorial Overview
Before You Begin
Tools: Installing the Tools
Environment: Setting Up the Shell Environment
AIE API based FFT for Many Instances Applications
Fast Fourier Transform
From the basics to the FFT
Cooley-Tukey FFT and its variants
Cooley-Tukey decimation formalization
Power-of-B/Radix-B and mixed radix FFT algorithms
The Stockham FFT algorithm
Bibliography
Support
Versal and AI Engine ML Basics
Versal adaptive SoC overview
Versal AI Engine ML overview
AI Engine programming basics
Support
Twiddle Factors Generation
Usage
Parameters setting
Support
Basic Verification Script
Using the script
Support
Softmax Function on AIE-ML
Introduction
Softmax Function Definition
Computing the Exponential Function
IEEE 754 Format Trick
Improving Accuracy
Adapting for bfloat16 Floating-Point
AI Engine Implementation
AI Engine Kernel Processing
Kernel Data Interface
Kernel Code
Running the Example
Generating Test Vectors
Running x86 Simulation
Running AI Engine Simulation
Analyzing Results
Vitis Analyzer
Test Vector Comparison
References
Support
Migrating Farrow Filter from AIE to AIE-ML
Introduction
Comparison of AIE vs AIE-ML Farrow Filter Design Implementation
Conclusion
Polyphase Channelizer on AIE-ML using Vitis Libraries
Introduction
Channelizer Requirements
System Partitioning
Filterbank System Partitioning
Filterbank Compute Requirements
Filterbank Storage Requirements
Filterbank I/O Bandwidth Requirements
Filterbank Library Characterization
Filterbank Library Optimization
IFFT-2D System Partitioning
Available Workflows for IFFT-2D IP
IFFT-2D Library Characterization
IFFT-2D Library Optimization
Design Summary
Design Resources
Build and Run Design
Setup & Initialization
Hardware Emulation
Hardware
References
Support
License
MNIST ConvNet on AIE-ML
Introduction
Virtual Python Environment Setup
Jupyter Notebook Model
Import the MNIST Image Database
Training & Testing the MNIST ConvNet Model
Using the MNIST ConvNet for Inference
Extracting Weights & Biases for AIE-ML Inference Solution
AIE-ML Inference Solution
Design Approach
Vitis Functional Simulation
MNIST ConvNet: AI Engine Graph View
MNIST ConvNet: AI Engine Floorplan View
MNIST ConvNet: AI Engine Resource Utilization
Vectorization of 3x3 Conv2D Layer Processing
MNIST ConvNet: Profiling & Vector Load
MNIST ConvNet: Throughput
Individual Layer Designs
Layer Design Details: conv2d_w1()
Layer Design Details: max_pooling2d_w2()
Layer Design Details: conv2d_w3()
Layer Design Details: max_pooling2d_w4()
Layer Design Details: conv2d_w5()
Layer Design Details: dense_w7()
Summary
References
Support
License
AIE-ML Feature Tutorials
A to Z Bare-metal Flow
Introduction
Support
Using GMIO with AIE-ML
Introduction
Objectives
Steps
Runtime Parameter Reconfiguration
Introduction
Objectives
Steps
Support
Packet Switching
Objectives
Steps
Support
Versal Integration for Hardware Emulation and Hardware
Introduction
Objectives
Tutorial Overview
Section 1: Compile AI Engine Code for AIE Simulator: Viewing Compilation Results in Vitis Analyzer
Important
Compiling an AI Engine ADF Graph for V++ Flow
Vitis Analyzer Compile Summary
Section 2: Simulate the AI Engine Graph using the aiesimulator and Viewing Trace and Profile Results in Vitis Analyzer
Section 3: Run the Hardware Emulation, and View Run Summary in Vitis Analyzer
1. Compiling HLS Kernels Using v++
2. Use V++ to Link AI Engine and HLS Kernels with the Platform
3. Compile the A72 Host Application
4. Package the Design
5. Run Hardware Emulation
Section 4: Build and Run on Hardware
Summary
Support
Matrix Compute with Vitis Libraries on AIE and AIE-ML
Introduction
AMD Versal Devices with AI Engine Variants
Tiling Parameter Programming
Buffer Descriptors
Introduction
Buffer Descriptors overview
Buffer Descriptor parameters
AI Engine ML registers bit width
BD counting
BD example
AI Engine Compiler Report
Support
Tiling Parameter Programming Documentation Examples
Introduction
Example usage
1D Examples
1D Linear Transferring From Source to Destination
1D Linear With Zero Pre-padding
1D Linear With Zero Pre- and Post-Padding
1D Linear With Zero Pre- and Post-Padding with Buffer Boundary Setting
2D Examples
2D Linear transfer from source to destination
2D Linear transfer with zero-padding on dimension 0
4x2 Matrix Transfer
2D shuffle from higher address to lower address
3D Examples
3D Linear Copy
3D Linear with zero padding
4D Example
Support
Tiling Parameter for External Memory
Introduction
Matrix Transposition in Interface Tile
Hardware Emulation and Hardware run
Support
Tiling Parameter for Memory Modules
Introduction
Test Case 1: Matrix Transpose
Test Case 2
Tensor Buffer Stream
tensor_descriptor
tensor_buffer_stream
Test Case 3
Test Case 4
Conclusion
Support
Tiling Parameter for Memory Tiles
Introduction
Test Case 1
Write Access
Read Access
Test Case 2
Test Case 3
Test Case 4
Support
Tiling Parameters Programming
Introduction
Tiling parameter structure
A graphical Example
Write Access (Kernel1 and Kernel2)
Read Access (Kernel3 and Kernel4)
Limitations
Support
AI Engine-ML Performance Analysis Tutorial
Objectives
Target Application Introduction
Steps - Version 1
Steps - Version 2
Steps - Version 3
Steps - Version 4
Conclusion
Support
AIE Compiler Features
Introduction
Objectives
Tutorial Sections
Conditional Objects
Case 1
Case 2
Case 3
Case 4
Multirate
UpConv then DownConv (Buffer)
DownConv then UpConv (Buffer)
Split and Merge (Buffer)
UpConv then DownConv (Stream)
DownConv then UpConv (Stream)
Split and Merge (Stream)
Multicast
Case 1: Stream and Buffer Multicasting
Case 2: Multirate Buffer Multicasting