In this section, you will first learn about the Vitis application development flow and then review a methodology for re-architecting a CPU application for FPGA-based acceleration and re-coding each kernel to meet the overall performance objective. An image of the Vitis application development flow is shown below.
- Application Compilation using G++
The host program, written in C/C++ and using the XRT native API, is compiled with the g++ compiler to create a host executable that runs on the x86 processor. The host program interacts with kernels in the PL region of the FPGA device.
- Kernel Compilation using Vitis HLS
Vitis HLS is a compiler that takes C/C++ source code as input and synthesizes it into an RTL design that is optimized for Xilinx FPGA products. Each C++ kernel must be synthesized using Vitis HLS to produce a Xilinx object (.xo) file. One or more .xo files can then be combined by the Vitis linker to produce the .xclbin file.
The steps for kernel development in Vitis HLS are as follows:
- Write the C/C++ code for the function
- Verify the code using C simulation
- Build the kernel using C synthesis
- Verify the generated kernel against the C++ outputs using co-simulation
- Review the HLS synthesis reports and co-simulation reports to analyze performance
- Repeat the previous steps until performance goals are met
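The first two steps in the list above can be sketched as follows. This is a minimal illustration, assuming a hypothetical vector-add kernel named `vadd`; the bundle names and the `run_csim` helper are also illustrative and do not come from the tutorial. The `#pragma HLS` lines are Vitis HLS directives that a standard compiler ignores, which is why the same source can be built with g++ for C simulation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical kernel: element-wise addition of two integer vectors.
// extern "C" keeps the symbol name unmangled, as is common for Vitis kernels.
extern "C" void vadd(const int* in1, const int* in2, int* out, int size) {
// Map the pointer arguments to AXI4 master interfaces (bundle names are assumptions).
#pragma HLS INTERFACE m_axi port=in1 bundle=gmem0
#pragma HLS INTERFACE m_axi port=in2 bundle=gmem1
#pragma HLS INTERFACE m_axi port=out bundle=gmem0
    for (int i = 0; i < size; ++i) {
// Ask the tool to start one loop iteration per clock cycle.
#pragma HLS PIPELINE II=1
        out[i] = in1[i] + in2[i];
    }
}

// C-simulation style check: runs the kernel function on the CPU and
// returns the number of elements that differ from a golden reference.
int run_csim(int size) {
    assert(size >= 0);
    std::vector<int> a(size), b(size), result(size), golden(size);
    for (int i = 0; i < size; ++i) {
        a[i] = i;
        b[i] = 2 * i;
        golden[i] = a[i] + b[i];
    }
    vadd(a.data(), b.data(), result.data(), size);

    int mismatches = 0;
    for (int i = 0; i < size; ++i) {
        if (result[i] != golden[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

In a Vitis HLS testbench, `main` would typically return this mismatch count, because a nonzero return value is reported as a failed simulation.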
- PL Kernel Linking using Vitis Tools
Xilinx object (.xo) files are linked with the target hardware platform by the Vitis linker to create a device binary file (.xclbin) that is loaded for execution on the Alveo accelerator card.
Tip: This step calls Vivado place and route to generate the .xclbin file.
To help define the architecture of the device binary, you can create a configuration file that specifies options such as how many instances of a kernel (compute units) to build into the device binary and how the kernels connect to global memory or to other kernels. This configuration file is passed to the Vitis linker to generate the .xclbin.
The Vitis compiler has three build targets, which define the nature and contents of the generated .xclbin file. Two emulation targets are used for validation and debugging: software emulation for C-based simulation and hardware emulation for RTL co-simulation. One hardware target builds the final project output to run on the Alveo card. The same host program can be used with any of the .xclbin targets.
Tip: Compiling for an emulation target is significantly faster than compiling for actual hardware. The emulation run is performed in a simulation environment, which offers enhanced debug visibility and does not require a physical accelerator card.
- Running the Application
Finally, when you run the application, the host program loads the .xclbin file generated by the Vitis compiler. The host application always runs on the CPU and can target either emulation mode on x86 or the physical accelerator platform.
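XRT chooses between emulation and hardware at run time from the `XCL_EMULATION_MODE` environment variable (`sw_emu` or `hw_emu`; unset for a hardware run). The sketch below shows one way a host program might report which target it is about to use. The `RunTarget` type and the helper functions are illustrative names, not part of XRT, and treating unknown values as a hardware run is an assumption made for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Illustrative type: the three run targets described in the text.
enum class RunTarget { SoftwareEmulation, HardwareEmulation, Hardware };

// Interpret the value of XCL_EMULATION_MODE. A null pointer (variable unset)
// or any other value is treated as a hardware run (assumption of this sketch).
RunTarget parse_run_target(const char* mode) {
    if (mode == nullptr) {
        return RunTarget::Hardware;
    }
    const std::string value(mode);
    if (value == "sw_emu") {
        return RunTarget::SoftwareEmulation;
    }
    if (value == "hw_emu") {
        return RunTarget::HardwareEmulation;
    }
    return RunTarget::Hardware;
}

// Read the environment of the current process.
RunTarget current_run_target() {
    return parse_run_target(std::getenv("XCL_EMULATION_MODE"));
}

// Human-readable label, for example for a log line printed by the host.
const char* to_string(RunTarget target) {
    switch (target) {
        case RunTarget::SoftwareEmulation: return "software emulation";
        case RunTarget::HardwareEmulation: return "hardware emulation";
        case RunTarget::Hardware:          return "hardware";
    }
    assert(false && "unhandled RunTarget");
    return "unknown";
}
```

Printing `to_string(current_run_target())` before loading the .xclbin makes it harder to confuse an emulation run with a run on the card when the same host executable is used for both.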
Developing Vitis Accelerated Applications
The methodology consists of two major phases:
- Architecting the application and identifying kernels with performance goals defined. The developer makes key decisions about the application architecture by determining which software functions should be mapped to device kernels, how much parallelism is needed, and how it should be delivered.
- Developing the C/C++ kernels to meet the goals established. The developer implements the kernels. This task primarily involves structuring source code and applying compiler pragmas to create the desired kernel architecture and meet the performance target. Review Design Principles for Software Programmers, which is intended for software developers who want to understand the process of synthesizing accelerated hardware from a software algorithm written in C/C++.
- Profile the C++ application using tools such as Valgrind, Callgrind, and gprof to create a baseline for analysis. The functions that consume the most execution time are good candidates to be offloaded to and accelerated on FPGAs.
- The maximum achievable throughput is limited by the PCIe bus. PCIe performance is influenced by many factors, such as the motherboard, drivers, target platform, and transfer sizes. Run DMA tests upfront, such as the xbutil dma test, to measure the effective throughput of PCIe transfers and thereby determine the upper bound of the acceleration potential.
- Identify the performance bottlenecks by reviewing the algorithm and analyzing any parallel paths. Accelerating one path may not give the expected acceleration for the overall application. When looking for acceleration candidates, consider the performance of the entire application, not just that of individual functions.
- Identify the overall acceleration potential and set the application performance goal.
- After the functions to be accelerated have been identified and the overall acceleration goals have been established, the next step is to determine what level of parallelization is needed to meet the goals.
- Overlap data transfers between host and device with computation on the FPGA so that idle time is minimal. Keep the device kernels active, performing new computations as early and as often as possible. Optimize data transfers to and from the device.
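To make the step of identifying the overall acceleration potential more concrete, the following sketch estimates application-level speedup with Amdahl's law and a PCIe transfer-time lower bound. Amdahl's law is a standard model that is not named in the text above, and every function name and number here is a hypothetical chosen for illustration; measured profiling data and DMA test results would replace them in practice.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction of the runtime
// (accel_fraction, between 0 and 1) is accelerated by kernel_speedup.
double overall_speedup(double accel_fraction, double kernel_speedup) {
    assert(accel_fraction >= 0.0 && accel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / kernel_speedup);
}

// Lower bound on the time needed to move a data set across PCIe,
// given the effective throughput measured by a DMA test.
double min_transfer_seconds(double bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second;
}

// Speedup of the whole application when transfers are NOT overlapped
// with compute, so their time is simply added to the accelerated run.
double speedup_with_transfers(double cpu_seconds, double accel_fraction,
                              double kernel_speedup, double transfer_seconds) {
    assert(cpu_seconds > 0.0 && transfer_seconds >= 0.0);
    assert(kernel_speedup > 0.0);
    const double untouched = cpu_seconds * (1.0 - accel_fraction);
    const double accelerated = cpu_seconds * accel_fraction / kernel_speedup;
    return cpu_seconds / (untouched + accelerated + transfer_seconds);
}
```

For example, if profiling shows that 80% of a 10 second run sits in the candidate function and the kernel is 10 times faster, the overall speedup is about 3.6x, and it drops to about 3.4x once a non-overlapped 0.1 second transfer is added. This is why the list above asks for the performance of the entire application to be considered rather than that of a single function.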
For a more complete examination of this topic, refer to Methodology for Accelerating Data Center Applications with the Vitis Software Platform, or refer to Design Principles for Software Programmers in the Vitis HLS User Guide (UG1399).
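The benefit of overlapping data transfers with compute, the last step in the list above, can be estimated with a simple pipeline model. The sketch assumes that work is split into equal batches, that host-to-device writes, kernel execution, and device-to-host reads can all proceed concurrently on different batches, and that enough buffers exist to keep each stage busy. The `BatchTimes` type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Per-batch durations in seconds (illustrative type).
struct BatchTimes {
    double write_s;    // host-to-device transfer
    double compute_s;  // kernel execution
    double read_s;     // device-to-host transfer
};

// No overlap: every batch is written, computed, and read back in sequence.
double sequential_seconds(int batches, const BatchTimes& t) {
    assert(batches >= 1);
    return batches * (t.write_s + t.compute_s + t.read_s);
}

// Full overlap: after the first batch fills the pipeline, one batch
// completes every "slowest stage" interval.
double overlapped_seconds(int batches, const BatchTimes& t) {
    assert(batches >= 1);
    const double slowest = std::max({t.write_s, t.compute_s, t.read_s});
    return (t.write_s + t.compute_s + t.read_s) + (batches - 1) * slowest;
}
```

With four batches that each take 1 s to write, 3 s to compute, and 1 s to read, the sequential schedule takes 20 s while the overlapped schedule takes 14 s. In this model the kernel is the slowest stage, so it stays busy after the pipeline fills, which is the "minimal idle time" condition described above.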
Methodology for Developing C/C++ Kernels
The software program can be automatically converted (or synthesized) into hardware, but achieving acceptable quality of results (QoR) requires additional work, such as rewriting the software to help the Vitis HLS tool reach the desired performance goals. To do this, you need to understand the best practices for writing software that executes well on the FPGA device. The next few sections discuss how to first identify macro-level architectural optimizations to structure your program and then apply fine-grained, micro-level architectural optimizations to reach your performance goals. The following key kernel requirements for optimal application performance should already have been identified during the architecture definition phase:
- Throughput goal
- Latency goal
- Datapath width
- Number of accelerated kernels
- Interface bandwidth
These requirements drive the kernel development and optimization process. Achieving the kernel throughput goal is the primary objective, as overall application performance is predicated on each kernel meeting the specified throughput.
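The relationship between the throughput goal, the datapath width, and the interface bandwidth can be illustrated with a small calculation. This is a rough sketch that assumes the kernel accepts new data on every clock cycle and never stalls; the function names, clock frequency, and sample size are hypothetical values, not figures from the tutorial.

```cpp
#include <cassert>
#include <cstdint>

// Samples that must be processed per clock cycle to reach the goal
// (ceiling division, so a partial sample rounds up to a full lane).
std::uint64_t required_samples_per_cycle(std::uint64_t goal_samples_per_s,
                                         std::uint64_t clock_hz) {
    assert(clock_hz > 0);
    return (goal_samples_per_s + clock_hz - 1) / clock_hz;
}

// Width of the datapath needed to carry that many samples in parallel.
std::uint64_t datapath_width_bits(std::uint64_t samples_per_cycle,
                                  std::uint64_t bits_per_sample) {
    return samples_per_cycle * bits_per_sample;
}

// Sustained interface bandwidth needed to feed the kernel at the goal rate.
std::uint64_t required_bandwidth_bytes_per_s(std::uint64_t goal_samples_per_s,
                                             std::uint64_t bits_per_sample) {
    return goal_samples_per_s * bits_per_sample / 8;
}
```

For example, a goal of 1.2 billion 32-bit samples per second with an assumed 300 MHz kernel clock needs 4 samples per cycle, a 128-bit datapath, and about 4.8 GB/s of interface bandwidth in each direction that carries the data.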
The kernel development methodology therefore follows a throughput-driven approach and works from the outside in. This approach has two phases, as described in the following figure:
- Defining and implementing the macro-architecture of the kernel
- Coding and optimizing the micro-architecture of the kernel
Refer to Methodology for Developing C/C++ Kernels for a detailed view on requirements, considerations, and how to re-architect the code for achieving higher performance.
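The two phases listed above can be illustrated with a small kernel. In this sketch the macro-architecture is a load, compute, store structure marked as a dataflow region, and the micro-architecture is the pipelining of each inner loop. The kernel name `scale_kernel`, the fixed chunk size, and the bundle names are assumptions made for the example, and plain arrays are used between stages so that the code also compiles with a standard C++ compiler, which ignores the `#pragma HLS` lines.

```cpp
#include <cassert>

constexpr int kChunk = 256;  // hypothetical fixed number of elements per call

// Macro-architecture stage 1: read from global memory into a local buffer.
static void load(const int* in, int buf[kChunk]) {
    for (int i = 0; i < kChunk; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }
}

// Macro-architecture stage 2: the actual computation on local data.
static void compute(const int in_buf[kChunk], int out_buf[kChunk], int scale) {
    for (int i = 0; i < kChunk; ++i) {
// Micro-architecture: start one multiply per clock cycle.
#pragma HLS PIPELINE II=1
        out_buf[i] = in_buf[i] * scale;
    }
}

// Macro-architecture stage 3: write results back to global memory.
static void store(const int buf[kChunk], int* out) {
    for (int i = 0; i < kChunk; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = buf[i];
    }
}

extern "C" void scale_kernel(const int* in, int* out, int scale) {
#pragma HLS INTERFACE m_axi port=in bundle=gmem0
#pragma HLS INTERFACE m_axi port=out bundle=gmem1
// Let the three stages run as concurrent tasks connected by buffers.
#pragma HLS DATAFLOW
    int in_buf[kChunk];
    int out_buf[kChunk];
    load(in, in_buf);
    compute(in_buf, out_buf, scale);
    store(out_buf, out);
}

// CPU-side check of the kernel function, usable as a C-simulation body.
void check_scale_kernel(int scale) {
    int in[kChunk];
    int out[kChunk];
    for (int i = 0; i < kChunk; ++i) {
        in[i] = i;
    }
    scale_kernel(in, out, scale);
    for (int i = 0; i < kChunk; ++i) {
        assert(out[i] == in[i] * scale);
    }
}
```

Separating memory access from computation in this way fixes the outer structure first, so that later tuning of the `compute` stage does not disturb how data enters and leaves the kernel.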