Conclusion: The bandwidth is higher when accessing 256 MB of data (or less) within a single Pseudo Channel than when accessing multiple Pseudo Channels.
Vitis Tutorials: Hardware Acceleration (XD099)
Document ID: XD099
Release Date: 2022-12-01
Version: 2022.2 English
Vitis Tutorials: Hardware Acceleration
Feature Tutorials
Design Tutorials
Feature Tutorials
Getting Started with RTL Kernels
Package IP/Package XO Flow
Create a New Project
Add Kernel Sources
Open the IP Packager
Specify the Control Protocol
Edit Ports and Interfaces
Add Control Registers and Address Offsets
Check Integrity, Assign Properties, and Package IP
Next Steps
Host Code Programming
Setting Up the XRT Native API
Specifying the Device ID and Loading the XCLBIN
Setting up the kernel and kernel arguments
Transferring Data
Running the kernel and returning results
Next Steps
Using the RTL Kernel in a Vitis IDE Project
Add the Hardware Kernel (.xo)
Build the Project
(Optional) Build and Run the Hardware on the Target Platform
Makefile Use
Summary
Mixing C and RTL
Introduction
Tutorial Overview
Before You Begin
Accessing the Tutorial Reference Files
Building an Application with C++ Based Kernel
C++ Based Kernel
Host Code
Build the Application
Run Emulation
Review the Application Timeline
Putting it All Together
Building an Application with C++ and RTL-Based Kernels
RTL-Based Kernel
Create the Vitis Project
The Vivado Design Suite Project
Host Code Updates
Build and Emulation with C++ and RTL Based Kernels
Next Steps
Dataflow Debug and Optimization
Dataflow Viewer Basics
Taking the tour
First Lab
Module Hierarchy View
The Dataflow Graph Pane
Dataflow Properties Table
Viewing the Dataflow Graph after RTL co-simulation
Viewing Dataflow Performance using Waveforms
Next Step
FIFO Sizing for Performance and avoiding Deadlocks
Types of channels
Deadlock Detection and Analysis
Second Lab
Manual FIFO Sizing
Global FIFO Sizing
Automated FIFO Sizing
Takeaways
Using Multiple DDR Banks
Introduction
Tutorial Overview
Before You Begin
Accessing the Tutorial Reference Files
Tutorial Setup
Set v++ Linker Options
Conclusion
Using Multiple Compute Units
Introduction
Tutorial Overview
Before You Begin
Accessing the Tutorial Reference Files
Makefile Flow
Run Hardware Emulation
Inspect the Host Code
Emulation Result
Improve the Host Code for Concurrent Kernel Enqueuing
Increasing the Number of CUs
Run Hardware Emulation and Inspect the Change
Conclusion
Controlling Vivado Implementation
Introduction
Tutorial Overview
Before You Begin
Accessing the Tutorial Reference Files
Set Up the Vitis Environment
Controlling Vivado Synthesis and Implementation through the Vitis Compiler
Optimizing the Design in the Vivado Tool
Reuse the Optimized Checkpoint to Create the Device Binary
Putting it All Together
Conclusion
Optimizing for HBM
HBM Overview
Next Steps
Migrating to HBM
Application Overview
Using DDR
Run application using DDR
Migration to HBM
Run application using HBM
Next Step
HBM Bandwidth Explorations
Sequential Accesses
Conclusion: The bandwidth achieved for sequential accesses is mostly independent of the topology and is constant at about 13 GB/s.
Random Accesses
Conclusion: The bandwidth is higher when accessing 256 MB of data (or less) within a single Pseudo Channel than when accessing multiple Pseudo Channels.
Random Accesses with RAMA IP
Conclusion: The RAMA IP significantly improves memory access efficiency in cases where the required memory access exceeds 256 MB (one HBM pseudo-channel)
Summary
Host Memory Access
XRT and Platform version
Tutorial Description
Kernel structure
Host code
Kernel compilation
Running the application
Summary
Using GT Kernels and Ethernet IPs on Alveo
Features and Design Overview
Design Flow and Tutorial Steps
1. Generate IP
2. Package Kernels
3. Vitis Linking
Summary
Enabling FPGA to FPGA P2P Transfer using Native XRT C++ API
XRT and Platform version
Introduction
1. Understanding the original (non-p2p) version of the host code
2. Running original (non-p2p) version of the design
3. Understanding the changes required for p2p transfer
Steps required for p2p data transfer
4. Running the p2p version of the design
Appendix: Understand and review the design to reverse dataflow direction with same setup
Support
License
Design Tutorials
Convolution Example
Introduction and Performance Estimation
Video Filtering Applications and 2-D Convolution Filters
Performance Requirements for 1080p HD Video
Software Implementation
Running the Software Application
Hardware Implementation
Baseline Hardware Implementation Performance
Performance Estimation for Optimized Hardware Implementations
Design and Analysis of Hardware Kernel Module for 2-D Video Convolution Filter
2-D Convolution Filter Implementation
Top Level Structure of Kernel
Data Mover
Window2D: Line and Window Buffers
Building and Simulating the Kernel using Vitis HLS
Building the Kernel Module
Building the 2-D Convolution Kernel and Host Application
Host Application
Host Application Variants
Host Application Details
2D Filtering Requests
2D Filter Dispatcher
Building the Application
Kernel Build Options
Host Build Options
Application Runtime Options
Running Software Emulation
Running Hardware Emulation
System Run
Building the Hardware xclbin
Application Run Using FPGA Kernel
Profile Summary
Application Timeline
Conclusion
Bloom Filter Example
Overview of the Original Application
Tutorial Implementation
Next Steps
Experiencing Acceleration Performance
Next Steps
Architect a Device-Accelerated Application
Identify Functions to Accelerate on the FPGA
Evaluate the MurmurHash2 Function
Evaluate the First “for” Loop in the runCPU Function: “Hash” Functionality
Evaluate the Second “for” Loop in the runOnCPU Function: “Profile Compute Score” Functionality
Establish the Realistic Goal for the Overall Application
Determine the Maximum Achievable Throughput
Identifying Parallelization for an FPGA Application
Next Steps
Implementing the Kernel
Bloom4x: Kernel Implementation Using 4 Words in Parallel
Macro Architecture Implementation
Micro Architecture Implementation
Build the Kernel Using the Vitis Tool Flow
Review the Initial Host Code
Run Software Emulation, Hardware Emulation, and Hardware
Visualize the Resources Utilized
Review Profile Reports and Timeline Trace
Review Profile Summary Report
Review the Timeline Trace
Throughput Achieved
Opportunities for Performance Improvements
Bloom8x: Kernel Implementation Using 8 Words in Parallel
Run Hardware on the FPGA
Visualize the Resources Utilized
Review Profile Summary Report and Timeline Trace
Throughput Achieved
Bloom16x: Kernel Implementation Using 16 Words in Parallel
Run Hardware on the FPGA
Opportunities for Performance Improvements
Next Steps
Data Movement Between the Host and Kernel
Overlap of Host Data Transfer and Compute with Split Buffers
Host Code Modifications
Run the Application Using the Bloom8x Kernel
Review Profile Report and Timeline Trace for the Bloom8x Kernel
Run the Application Using the Bloom16x Kernel
Conclusion
Overlap of Host Data Transfer and Compute with Multiple Buffers
Host Code Modifications
Run the Application Using the Bloom8x Kernel
Review Profile Report and Timeline Trace for the Bloom8x Kernel
Overlap Between the Host CPU and FPGA
Host Code Modifications
Run the Application Using the Bloom8x Kernel
Review Profile Report and Timeline Trace for the Bloom8x Kernel
Review Profile Summary Report for the Bloom8x Kernel
Throughput Achieved
Opportunities for Performance Improvements
Using Multiple DDR Banks
Code Modifications
Run the Application Using 8 Words in Parallel
Review the Profile Report and Timeline Trace
Conclusion
RTL Systems Integration Example
ALPHA_MIX HLS C Kernel Creation
Hardware Emulation
Waveform Report
Profiling the Application
Run Guidance
Platform and System Diagrams
Profile Summary
Application Timeline
RTC_GEN RTL Kernel Creation
Determine Top Level Design Specification
Use RTL Kernel Wizard to Create Kernel Frame
RTC_GEN Kernel Development
Package the RTL Kernel
Traveling Salesperson Problem
Load the Vitis HLS Project
Launching the Vitis HLS GUI
Next Step
Understand the Design Structure
Design Structure
Next
Run the C Simulation
Run the Vitis HLS C Simulation
Next
Run the C Synthesis
Run Vitis HLS C Synthesis
Next
Run the RTL/C Cosimulation
Run the Co-Simulation
Next
Export the Design and Evaluate Performance in Vivado
Export the accelerated function and evaluate in Vivado
Review the Vivado results
Next Step
Improved Performance with 4 Parallel Distance Lookups
Load the project into Vitis HLS
Review Code Changes
Running C-simulation and C-synthesis
Bottom-Up RTL Kernel Design Flow Example
RTL Module: Aes
About the AES Encryption Algorithm
AES-ECB Encryption
AES-ECB Decryption
AES-CBC Encryption
AES-CBC Decryption
RTL Module Aes
Testbench
Usage
RTL Kernel: krnl_aes
Introduction
Kernel Features
IP Generation
Pack the Design into Vivado IP and Vitis Kernel
Step 1: Create Vivado project and add design sources
Step 2: Infer clock, reset, AXI interfaces and associate them with clock
Step 3: Set the definition of AXI control slave registers, including CTRL and user kernel arguments
Step 4: Package Vivado IP and generate Vitis kernel file
Testbench
Kernel Test System and Overlay (XCLBIN) Generation
Host Programming
Tutorial Usage
Before You Begin
Tutorial Steps
1. Generate IPs
2. Run Standalone Simulation
3. Package Vivado IP and Generate Vitis Kernel File
4. Build Kernel Testing System Overlay Files
For a hardware target
For a hardware emulation target
5. Compile Host Program
Finding the Device ID of Your Target Card
6. Run Hardware Emulation (Optional)
7. Run Host Program in Hardware Mode
RTL Kernel: krnl_cbc
Introduction
Kernel Feature
IP Generation
Packing the Design into Vivado IP and Vitis Kernel
1: Create the Vivado project and add design sources
2: Infer clock, reset, and AXI interfaces, and associate them with the clock
3: Set the definition of AXI control slave registers, including CTRL and user kernel arguments
4: Associate AXI master port to pointer argument and set data width
5: Package the Vivado IP and generate the Vitis kernel file
Manually creating the kernel XML file
Testbench
Kernel Test System and Overlay (XCLBIN) Generation
Host Programming
Tutorial usage
Before You Begin
Tutorial Steps
1. Generate IPs
2. Run Standalone Simulation
3. Package Vivado IP and Generate Vitis Kernel File
4. Build Kernel Testing System Overlay Files
For a hardware target
For a hardware emulation target
5. Compile Host Program
Finding the Device ID of Your Target Card
6. Run Hardware Emulation
7. Run Host Program in Hardware Mode
Cholesky Algorithm Acceleration
Workflows
The Vitis Flow
System Setup
Install Vitis Software Platform
Install Alveo U50 Accelerator card
Setup Environment to Run Vitis
Validate Alveo U50 Accelerator card
Algorithm Description
Run this design on CPU
Next
Module 1
Understanding Code Setup with Host and Kernel
Build and Emulate with Vitis
Using make
Vitis Analyzer for Application End-to-end Timeline Analysis
Vitis HLS for Kernel Optimizations
Wrap-up for module 1
Next
Module 2
Pipelining for Throughput
The INTERFACE Pragma
Next
Module 3
Kernel Resources Used (regular floating point versus double)
Takeaway for this module…
Next
Module 4
Code modifications for the Cholesky kernel
Running the design
Result Summary
Conclusion
XRT Host Code Optimization
Introduction
Tutorial Overview
Before You Begin
Accessing the Tutorial Reference Files
Model
Building the Device Binary (xclbin)
Host Code
host.cpp Main Functions
Lab 1: Pipelined Kernel Execution Using Out-of-Order Event Queue
Lab 2: Kernel and Host Code Synchronization
Lab 3: OpenCL API Buffer Size
Conclusion
Next Steps
Aurora Kernel on Alveo
Introduction
Develop krnl_aurora kernel
Generate Aurora 64B/66B Core IP
Generate AXI Stream Data FIFO IP
AXI Control Slave Module
krnl_aurora Top Module
Package krnl_aurora Kernel
strm_issue and strm_dump Kernel
Kernel Integration (Linking)
Host Program
One More Thing
1. About Aurora IP
2. RTL Kernel krnl_aurora
3. HLS Kernel strm_issue and strm_dump
4. Top-level Linking Consideration
Summary
Revision History
Single Source Shortest Path Application
Workflow
Section 1 - Understanding the Workflow
Single Source Shortest Path kernel based on Vitis Graph Library L2
Designing other Kernels
Using krnls_wa for Computing the weighted_average Weights
Using krnls_search for Results Query
Programming the Host
Writing the Makefile
Overview of the Host/Kernel Paradigm
Introducing the Makefile
Next
Environment Setup
Section 2 - Setting up the Environment
Prerequisites
Setting up the Vitis™ Environment
Downloading the Libraries
Setting up the Vitis Libraries
Setting Options
Next
Application
Section 3 - Creating and Running an Application
Download the Application and Navigate to the Working Directory
Get Moving with Alveo
Acceleration Basics
Acceleration Concepts
Identifying Acceleration
Alveo Overview
Xilinx Runtime (XRT) and APIs
Runtime SW Design
Memory Allocation Concepts
Alveo Guided Software Introduction
Guided SW Examples
Provided Design Files
Hardware Design Setup
Example 0: Loading an Alveo Image
Overview
Key Code
Running the Application
Extra Exercises
Key Takeaways
Example 1: Simple Memory Allocation
Overview
Key Code
Running the Application
Extra Exercises
Key Takeaways
Example 2: Aligned Memory Allocation
Overview
Key Code
Running the Application
Extra Exercises
Key Takeaways
Example 3: XRT Memory Allocation
Overview
Key Code
Running the Application
Extra Exercises
Key Takeaways
Example 4: Parallelizing the Data Path
Overview
Key Code
Running the Application
Extra Exercises
Key Takeaways
Example 5: Optimizing Compute and Transfer
Overview
Key Code
Running the Application
Extra Exercises
Key Takeaways
Example 6: Meet the Other Shoe
Overview
Key Code
Running the Application
Extra Exercises
Key Takeaways
Example 7: Image Resizing with Vitis Vision
Overview
Key Code
Running the Application
Extra Exercises
Example 8: Building Processing Pipelines with Vitis Vision
Overview
Key Code
Running the Application
Extra Exercises
Key Takeaways