HBM Configuration and Use - 2021.2 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2022-03-29
Version: 2021.2 English

Some algorithms are memory bound, limited by the 77 GB/s bandwidth available on DDR-based Alveo cards. For those applications there are High Bandwidth Memory (HBM) based Alveo cards, providing up to 460 GB/s memory bandwidth. For the Alveo implementation, two 16-layer HBM (HBM2 specification) stacks are incorporated into the FPGA package and connected into the FPGA fabric with an interposer. A high-level diagram of the two HBM stacks is as follows.

Figure 1. High-Level Diagram of Two HBM Stacks

This implementation provides:

  • 8 GB of HBM on Alveo U50, or 16 GB on Alveo U55C
  • 32 HBM segments, called pseudo channels (PCs), of 256 MB each on Alveo U50 or 512 MB each on Alveo U55C
  • An independent AXI channel per pseudo channel for communication with the FPGA through a segmented crossbar switch
  • A two-channel memory controller per two PCs
  • 14.375 GB/s max theoretical bandwidth per PC
  • 460 GB/s (32 * 14.375 GB/s) max theoretical bandwidth for the HBM subsystem

Each PC has a theoretical maximum performance of 14.375 GB/s, which is less than the theoretical maximum of 19.25 GB/s for a DDR channel. To get better than DDR performance, designs must efficiently use multiple AXI masters into the HBM subsystem. The programmable logic has 32 HBM AXI interfaces. Through a built-in switch, each interface can access any memory location in any of the PCs on either HBM stack, which gives access to the full memory space: 8 GB on Alveo U50 and 16 GB on Alveo U55C. For more detailed information on the Alveo U50 and U55C, refer to the documentation for the respective card. For more detailed information on the HBM, refer to AXI High Bandwidth Controller LogiCORE IP Product Guide (PG276).
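The bandwidth figures above can be checked with a few lines of arithmetic. The following Python sketch uses only the numbers quoted in this section. The function names are illustrative and not part of any Xilinx tool. It shows how many PCs a design must drive in parallel before its theoretical bandwidth exceeds a single DDR channel, or the 77 GB/s quoted for DDR-based cards.

```python
import math

PC_BANDWIDTH_GBPS = 14.375   # max theoretical bandwidth per pseudo channel
NUM_PCS = 32                 # pseudo channels across both HBM stacks
DDR_CHANNEL_GBPS = 19.25     # max theoretical bandwidth of one DDR channel
DDR_CARD_GBPS = 77.0         # bandwidth quoted for DDR-based Alveo cards


def hbm_bandwidth(num_pcs):
    """Aggregate theoretical bandwidth when num_pcs PCs are driven in parallel."""
    return num_pcs * PC_BANDWIDTH_GBPS


def min_pcs_to_exceed(target_gbps):
    """Smallest number of PCs whose combined bandwidth exceeds target_gbps."""
    return math.floor(target_gbps / PC_BANDWIDTH_GBPS) + 1


print(hbm_bandwidth(NUM_PCS))               # 460.0
print(min_pcs_to_exceed(DDR_CHANNEL_GBPS))  # 2
print(min_pcs_to_exceed(DDR_CARD_GBPS))     # 6
```

These are theoretical ceilings. Achieved bandwidth depends on access patterns and switch congestion.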

Note: Because of the complexity and flexibility of the built-in switch, there are many combinations that result in congestion at a particular memory location or in the switch itself. Interleaved read and write transactions are less efficient than read-only or write-only traffic because of memory controller timing parameters (bus turnaround). Write transactions that span both HBM stacks also experience degraded performance and should be avoided. It is important to plan memory accesses so that each kernel accesses only a limited region of memory where possible, and to isolate the memory accesses for different kernels into different HBM PCs.
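A planned PC assignment can be checked against these guidelines before building. The following Python sketch is a hypothetical planning aid, not a Xilinx utility. It assumes that PCs 0 to 15 belong to one HBM stack and PCs 16 to 31 to the other; confirm this against PG276 for your device. The kernel names and ranges are invented for illustration.

```python
PCS_PER_STACK = 16  # assumption: PCs 0-15 on one stack, PCs 16-31 on the other


def stack_of(pc):
    """Index of the HBM stack that holds pseudo channel pc (under the assumption above)."""
    return pc // PCS_PER_STACK


def spans_both_stacks(first_pc, last_pc):
    """True if the inclusive PC range first_pc..last_pc touches both stacks."""
    return stack_of(first_pc) != stack_of(last_pc)


def shared_pcs(plan):
    """Map each PC used by more than one kernel to the kernels that use it."""
    users = {}
    for kernel, (first, last) in plan.items():
        for pc in range(first, last + 1):
            users.setdefault(pc, []).append(kernel)
    return {pc: kernels for pc, kernels in users.items() if len(kernels) > 1}


# Hypothetical plan: kernel name -> inclusive (first PC, last PC) range
plan = {"krnl_a": (0, 3), "krnl_b": (3, 5), "krnl_c": (14, 17)}

print(shared_pcs(plan))  # {3: ['krnl_a', 'krnl_b']}
print([k for k, (f, l) in plan.items() if spans_both_stacks(f, l)])  # ['krnl_c']
```

In this plan, krnl_a and krnl_b contend for PC 3, and krnl_c has a range that crosses the stack boundary, so both assignments are candidates for revision.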

Connection to the HBM is managed by the HBM Memory Subsystem (HMSS) IP, which enables all HBM PCs and automatically connects the XDMA to the HBM for host access to global memory. When used with the Vitis compiler, the HMSS is automatically customized to activate only the memory controllers and ports required by the --connectivity.sp option, and to connect both the user kernels and the XDMA to those memory controllers for optimal bandwidth and latency. Refer to the Using HBM Tutorial for additional information and examples.

In the following config file example, the kernel input ports in1 and in2 are connected to HBM PCs 0 and 1 respectively, and the output buffer out is written to HBM PCs 3 and 4. Each HBM PC is 256 MB, so the four PCs give this kernel access to a total of 1 GB of memory.

[connectivity]
sp=krnl.in1:HBM[0]
sp=krnl.in2:HBM[1]
sp=krnl.out:HBM[3:4]

Note: In the config file, only the mapping to the HBM pseudo channel is defined, and each AXI interface should only access a contiguous subset of the available 32 HBM PCs. The HMSS chooses the appropriate HBM port to access memory and to maximize bandwidth and minimize latency.
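To make the capacity arithmetic in the example concrete, the following Python sketch parses the sp lines shown above and totals the memory the kernel can reach. It is a minimal illustration that only handles the HBM[n] and HBM[n:m] forms used in this section. The parser and its names are hypothetical and are not part of the Vitis tools.

```python
import re

PC_SIZE_MB = 256  # Alveo U50 figure from the text; 512 on Alveo U55C

# Matches lines such as sp=krnl.in1:HBM[0] or sp=krnl.out:HBM[3:4]
SP_PATTERN = re.compile(
    r"^sp=(?P<kernel>\w+)\.(?P<arg>\w+):HBM\[(?P<first>\d+)(?::(?P<last>\d+))?\]$"
)


def parse_sp(line):
    """Return (kernel, argument, list of PCs) for one sp= line."""
    m = SP_PATTERN.match(line.strip())
    if m is None:
        raise ValueError(f"unsupported sp line: {line!r}")
    first = int(m["first"])
    last = int(m["last"]) if m["last"] is not None else first
    if not 0 <= first <= last <= 31:
        raise ValueError(f"PC range out of bounds: {line!r}")
    return m["kernel"], m["arg"], list(range(first, last + 1))


config = """\
[connectivity]
sp=krnl.in1:HBM[0]
sp=krnl.in2:HBM[1]
sp=krnl.out:HBM[3:4]
"""

mapping = {}
for line in config.splitlines():
    if line.startswith("sp="):
        kernel, arg, pcs = parse_sp(line)
        mapping[f"{kernel}.{arg}"] = pcs

used = sorted({pc for pcs in mapping.values() for pc in pcs})
total_mb = len(used) * PC_SIZE_MB

print(mapping)         # {'krnl.in1': [0], 'krnl.in2': [1], 'krnl.out': [3, 4]}
print(total_mb, "MB")  # 1024 MB
```

Because a range is written as a first and last PC, every mapping accepted by this parser is contiguous by construction, which matches the guidance in the note above.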

The HBM ports are located in the bottom SLR of the device. The HMSS automatically handles the placement and timing complexities of AXI interfaces crossing super logic regions (SLRs) in SSI technology devices. By default, if neither the --connectivity.sp nor the --connectivity.slr option is specified to v++, all kernel AXI interfaces access HBM[0] and all kernels are assigned to SLR0. However, you can specify the SLR assignments of kernels using the --connectivity.slr option. Refer to Assigning Compute Units to SLRs for more information.