Some algorithms are memory bound, limited by the 77 GB/s bandwidth available on DDR-based Alveo cards. For those applications there are High Bandwidth Memory (HBM) based Alveo cards, providing up to 460 GB/s memory bandwidth. For the Alveo implementation, two 16-layer HBM (HBM2 specification) stacks are incorporated into the FPGA package and connected into the FPGA fabric with an interposer. A high-level diagram of the two HBM stacks is as follows.
This implementation provides:
- 16 GB HBM using 512 MB pseudo channels (PCs) for Alveo U55C accelerator cards, as described in Alveo U55C Data Center Accelerator Cards Data Sheet (DS978)
- 8 GB total HBM using 256 MB PCs for Alveo U280 accelerator cards, and for U50 cards as described in Alveo U50 Data Center Accelerator Cards Data Sheet (DS965)
- An independent AXI channel for communication between the Vitis kernels and the HBM PCs through a segmented crossbar switch network
- A two-channel memory controller for addressing two PCs
- 14.375 GB/s max theoretical bandwidth per PC
- 460 GB/s (32 × 14.375 GB/s) max theoretical bandwidth for the HBM subsystem
Although each PC has a theoretical maximum performance of 14.375 GB/s, this is less than the theoretical maximum of 19.25 GB/s for a DDR channel. To get better than DDR performance, designs must efficiently integrate multiple AXI masters into the HBM subsystem. The programmable logic has 32 HBM AXI interfaces that can access any memory location in any of the PCs on either of the HBM stacks through a built-in switch network providing full access to the memory space.
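As a sketch of what this looks like on the kernel side, the following Vitis HLS fragment (the kernel body is illustrative; the kernel and argument names match the config example below) exposes three separate AXI master interfaces by assigning each pointer argument to its own m_axi bundle, so that each interface can later be mapped to its own HBM PC with --connectivity.sp:
extern "C" void krnl(const int* in1, const int* in2, int* out, int size) {
    // Each argument gets its own AXI master bundle so the linker can route
    // it to a separate HBM pseudo channel.
    #pragma HLS INTERFACE m_axi port=in1 offset=slave bundle=gmem0
    #pragma HLS INTERFACE m_axi port=in2 offset=slave bundle=gmem1
    #pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem2
    for (int i = 0; i < size; i++) {
    #pragma HLS PIPELINE II=1
        out[i] = in1[i] + in2[i];
    }
}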
Connection to the HBM is managed by the HBM Memory Subsystem (HMSS) IP, which enables all HBM PCs and automatically connects the XDMA to the HBM for host access to global memory. When used with the Vitis compiler, the HMSS is automatically customized to activate only the necessary memory controllers and ports, as specified by the --connectivity.sp option, and to connect both the user kernels and the XDMA to those memory controllers for optimal bandwidth and latency.
The syntax of the --connectivity.sp option to connect kernels to HBM PCs is:
sp=<compute_unit_name>.<argument>:<HBM_PC>
In the following config file example, the kernel input ports in1 and in2 are connected to HBM PCs 0 and 1, respectively, and the output buffer out is connected to HBM PCs 3 and 4:
[connectivity]
sp=krnl.in1:HBM[0]
sp=krnl.in2:HBM[1]
sp=krnl.out:HBM[3:4]
Each HBM PC is 256 MB, giving a total of 1 GB of memory access for this kernel. Refer to the Using HBM Tutorial for additional information and examples.
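As a hedged example of how such a config file might be applied (the file name and platform are placeholders), the connectivity section can be saved to a file and passed to the Vitis linker with the --config option:
v++ --link --target hw --platform <platform_name> --config hbm_config.cfg -o krnl.xclbin krnl.xo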
The --connectivity.sp syntax for HBM also lets you specify the switch network channel index that the HMSS should use when connecting the kernel interface. The --connectivity.sp syntax for specifying the switch network index is:
sp=<compute_unit_name>.<argument>:<bank_name>.<index>
When specifying the switch index, only one index can be specified per sp option. You cannot reuse an index that has already been used by another sp option or line in the config file.
In this case, the last line of the previous example could be rewritten as sp=krnl.out:HBM[3:4].3 to use switch channel 3 (S_AXI03), or as sp=krnl.out:HBM[3:4].4 to use switch channel 4 (S_AXI04), as shown in the figure above.
Either sp option routes the kernel's data transactions through one of the left-most network switch blocks, reducing implementation complexity. Using any index in the range 0 to 7 uses one of the two left-most switch blocks in the network; using any other index forces the use of additional switch blocks, which adds routing complexity and can negatively impact performance depending on the application.
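For example, a config file along the following lines (a sketch that reuses the kernel from the earlier example and picks arbitrary but unique indices) keeps all three interfaces on switch channels in the 0 to 7 range, so only the left-most switch blocks are used:
[connectivity]
sp=krnl.in1:HBM[0].0
sp=krnl.in2:HBM[1].1
sp=krnl.out:HBM[3:4].3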
The HBM ports are located in the bottom SLR of the device. The HMSS automatically handles the placement and timing complexities of AXI interfaces crossing super logic regions (SLRs) in SSI technology devices. By default, without the --connectivity.sp or --connectivity.slr options specified on v++, all kernel AXI interfaces access HBM[0] and all kernels are assigned to SLR0.
However, you can specify the SLR assignment of each kernel using the --connectivity.slr option. For devices or platforms with multiple SLRs, it is strongly recommended that you define CU assignments to specific SLRs. Refer to Assigning Compute Units to SLRs on Alveo Accelerator Cards for more information.
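As a sketch of how the two options can be combined, the following config file (assuming the compute unit from the earlier example is named krnl, and using the slr=<compute_unit_name>:<SLR_NUM> syntax) maps the kernel arguments to HBM PCs and pins the compute unit to SLR0, where the HBM ports reside:
[connectivity]
sp=krnl.in1:HBM[0]
sp=krnl.in2:HBM[1]
sp=krnl.out:HBM[3:4]
slr=krnl:SLR0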