Although this may seem obvious, any hardware acceleration system can be broadly discussed in two parts: the hardware architecture and implementation, and the software that interacts with that hardware. For Vitis, regardless of any higher-level software frameworks you may be using in your application such as FFmpeg, GStreamer, or others, the software library that fundamentally interacts with the Alveo hardware is the Xilinx Runtime (XRT).
While XRT consists of many components, its primary role can be boiled down to three simple things:
Programming the FPGA/ACAP kernels and managing the life cycle of the hardware
Allocating memory and migrating that memory between the host CPU and the card
Managing the operation of the hardware: sequencing execution of kernels, setting kernel arguments, etc.
These three things are also, in the same order, the most to least “expensive” operations you can perform on an FPGA. Let’s examine that in more detail.
Programming the kernels into the hardware inherently takes some amount of time. Depending on the capacity of the FPGA, the PCIe bandwidth available to transfer the configuration image, etc., the time required is typically in the order of dozens to hundreds of milliseconds. This is generally a “one time deal” when you launch your application, so the configuration hit can be absorbed into general setup latency, but it’s important to be aware of. There are applications where Alveo is reprogrammed multiple times during operation to provide different large kernels. If you’re planning to build such an architecture, incorporate this configuration time into your application as seamlessly as possible. It’s also important to note that while many applications can simultaneously use the hardware, only one image can be programmed at any point in time.
Allocating memory and moving it around is the real “meat” of XRT. Allocating and managing memory effectively is a critical skill for developing acceleration architectures. If you don’t manage memory and memory migration efficiently, you will significantly impact your overall application performance - and not in the way you’d hope! Fortunately XRT provides many functions to interact with memory, and we will explore those specifics later on.
Finally, XRT manages the operation of the hardware by setting kernel arguments and managing the kernel execution flow. Kernels can run sequentially or in parallel, from one process or many, and in blocking or non-blocking ways. The exact way your software interacts with your kernels is under your control, and we will investigate some examples of this later on.
It’s worth noting that XRT is a low-level API. For very advanced or unusual use models you may wish to interact with it directly, but most designers choose to use a higher-level API such as OpenCL, the Xilinx Media Accelerator (XMA) framework, or others. The following figure shows a top-level view of the available APIs. For this document, we will focus primarily on the OpenCL API to make this introduction more approachable. If you’ve used OpenCL before you will find it mostly similar (although Xilinx does provide some extensions for FPGA-specific tasks.)
In the next section, you’ll learn the basic concepts of memory management you’ll need to understand to optimize your application.
Copyright© 2019-2021 Xilinx