Alveo Overview - 2023.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID
XD099
Release Date
2023-11-13
Version
2023.2 English

Before diving into the software, familiarize yourself with the capabilities of the AMD Alveo™ Data Center accelerator card itself. Each Alveo card combines three essential things: a powerful FPGA or SoC for acceleration, high-bandwidth DDR4 memory banks, and connectivity to a host server via a high-bandwidth PCIe link. This link is capable of transferring approximately 16 GiB of data per second between the Alveo card and the host at Gen3x16, and higher rates for newer generations.

Recall that unlike a CPU, GPU, or ASIC, an FPGA is effectively a blank slate. You have a pool of very low-level logic resources like flip-flops, gates, and SRAM, but very little fixed functionality. Every interface the device has, including the PCIe and external memory links, is implemented using at least some of those resources.

To ensure that the PCIe link, system monitoring, and board health interfaces are always available to the host processor, Alveo designs are split into a shell and role conceptual model. The shell contains all of the static functionality: external links, configuration, clocking, etc. Your design fills the role portion of the model with custom logic implementing your specific algorithm(s). You can see this topology reflected here:

Alveo Topology

The Alveo FPGA is further subdivided into multiple super logic regions (SLRs), which aid in the architecture of very high-performance designs. But this is a slightly more advanced topic that will remain largely unnoticed as you take your first steps into Alveo development.

Alveo cards have multiple on-card DDR4 memories. These memories have a high bandwidth to and from the Alveo device, and in the context of OpenCL™ are collectively referred to as the device global memory. Each memory bank has a capacity of 16 GiB and operates at 2400 MHzDDR. This memory has a very high bandwidth, and your kernels can easily saturate it if desired. There is, however, a latency hit for either reading from or writing to this memory, especially if the addresses accessed are not contiguous or we are reading/writing short data beats.

The PCIe lane, on the other hand, has a decently high bandwidth but not nearly as high as the DDR memory on the Alveo card itself. In addition, the latency hit involved with transferring data over PCIe is quite high. As a general rule of thumb, transfer data over PCIe as infrequently as possible. For continuous data processing, try designing your system so that your data is transferring while the kernels are processing something else. For instance, while waiting for Alveo to process one frame of video, you can be simultaneously transferring the next frame to the global memory from the CPU.

There are many more details we could cover on the FPGA architecture, the Alveo cards themselves, etc., but for introductory purposes, we have enough details to proceed. From the perspective of designing an acceleration architecture, the important points to remember are:

  • Moving data across PCIe is expensive — even at the latest generation, latency is high. For larger data transfers, bandwidth can easily become a system bottleneck.

  • Bandwidth and latency between the DDR4 and the FPGA is significantly better than over PCIe, but touching external memory is still expensive in terms of overall system performance.

  • Within the FPGA fabric, streaming from one operation to the next is effectively free (this is our train analogy from earlier; we will explore it in more detail later).