HBM Overview - 2023.1 English

Vitis Tutorials: Hardware Acceleration (XD099)


There are some algorithms that are memory bound, limited by the 77 GB/s bandwidth available on DDR-based AMD Alveo™ cards. For those applications, there are high-bandwidth memory (HBM)-based Alveo cards, providing up to 460 GB/s memory bandwidth. This module will walk you through some of the structural differences between DDR and HBM and introduce how you can take advantage of the higher bandwidth.

DDR memories have been used in cards and computers for decades. A memory controller in the FPGA talks across traces on the printed circuit board (PCB) to an on-card DDR module. The memory controller sees all the memory in the DDR module. For Alveo cards with multiple DDR banks, the FPGA needs to implement a memory controller for each DDR module used in an application.

HBM is a memory technology that takes advantage of newer chip fabrication techniques to deliver more bandwidth, and more bandwidth per watt, than traditional DDR implementations. Memory manufacturers use stacked-die and through-silicon via (TSV) fabrication techniques to stack multiple smaller DDR-based memories into a single, larger, faster memory stack.

For the Alveo implementation, two 16-layer HBM (HBM2 specification) stacks are incorporated into the FPGA package and connected to the FPGA fabric through an interposer. The implementation provides:

  • 8 GB of HBM

  • 32 HBM pseudo channels (PCs), sometimes also referred to as banks, each of 256 MB (2 Gb)

  • An independent AXI channel per pseudo channel for communication with the FPGA through a segmented crossbar switch

  • A 2-channel memory controller per two PCs

  • 14.375 GB/s max theoretical bandwidth per PC

  • 460 GB/s (32 * 14.375 GB/s) max theoretical bandwidth for the HBM subsystem and 420 GB/s (~90% efficiency) achievable bandwidth

Each pseudo channel has a max theoretical bandwidth of 14.375 GB/s, less than the theoretical 19.25 GB/s of a DDR channel. To outperform DDR, a design must drive multiple AXI masters into the HBM subsystem efficiently.
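The figures above can be checked with a little arithmetic. The following is a minimal sketch that uses only the numbers quoted in this section; the function names are hypothetical and are not part of any Vitis or XRT API.

```python
import math

# Figures quoted in this section.
PC_COUNT = 32                 # HBM pseudo channels
PC_SIZE_MB = 256              # capacity of one pseudo channel
PC_BW_GBPS = 14.375           # max theoretical bandwidth of one pseudo channel
DDR_CHANNEL_BW_GBPS = 19.25   # max theoretical bandwidth of one DDR channel
DDR_CARD_BW_GBPS = 77.0       # bandwidth quoted for DDR-based cards


def hbm_capacity_gb(pc_count=PC_COUNT):
    """Total capacity in GB of the given number of pseudo channels."""
    return pc_count * PC_SIZE_MB / 1024


def hbm_peak_bw_gbps(pc_count):
    """Max theoretical bandwidth when pc_count pseudo channels run in parallel."""
    return pc_count * PC_BW_GBPS


def min_pcs_to_exceed(target_gbps):
    """Smallest number of pseudo channels whose combined peak exceeds the target."""
    return math.floor(target_gbps / PC_BW_GBPS) + 1


if __name__ == "__main__":
    print("capacity (GB):", hbm_capacity_gb())
    print("peak bandwidth (GB/s):", hbm_peak_bw_gbps(PC_COUNT))
    print("PCs to beat one DDR channel:", min_pcs_to_exceed(DDR_CHANNEL_BW_GBPS))
    print("PCs to beat a DDR card:", min_pcs_to_exceed(DDR_CARD_BW_GBPS))
```

These are theoretical peaks. Under this model a design needs at least two pseudo channels working in parallel to beat a single DDR channel, and at least six to beat the 77 GB/s quoted for DDR-based cards.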

The following figure will help you visualize the HBM subsystem and its connectivity to the FPGA: from the 32 AXI channels (shown by the 32 pairs of up/down arrows), into the segmented crossbar switch (shown by the eight white boxes highlighted in red), to the memory controllers leading to the pseudo channels:

Figure: HBM Overview

The crossbar switch is a hardened switch. It offers great flexibility with little change to the application, that is, a simple reconfiguration of the memory specification. The switch also consumes very little logic for this flexibility, leaving the Alveo device open for more kernel logic.

The segmented crossbar switch can become the bottleneck that limits an application's actual HBM performance. Let's review the structure of the switch to better understand how to use it. The switch is composed of eight 4x4 switch segments. One 4x4 segment is detailed in the following figure.

Figure: HBM 4x4 Switch

The fastest connections are from an AXI channel to the memory addresses of its aligned PC: M0→S0 (0-256 MB), M1→S1 (256-512 MB), and so on. Using only those connections would limit a design to accessing 32 individual PCs of 256 MB each. As a performance trade-off, the segmented crossbar switch allows any AXI master to access any address in the 8 GB HBM range. This differs from DDR configurations, where an AXI master port connected to DDR0 can access only the addresses within DDR0. For HBM, an access to an address outside the aligned PC traverses the segmented crossbar to reach the correct PC, either through the local 4x4 connectivity shown above or by crossing to another 4x4 switch over the L/R connections. When you instruct the tools to connect one master AXI interface to multiple PCs, an internal mechanism uses the memory specification for that particular kernel master.
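The address-to-PC relationship described above can be sketched in a few lines. This is a simplified model, not the controller's implementation: it assumes a linear address map in which PC i covers the i-th 256 MB window (matching M0→S0 and M1→S1 above) and that master Mi is aligned with PC i. All function names are hypothetical.

```python
# Assumed linear address map: PC i covers [i * 256 MB, (i + 1) * 256 MB).
PC_SIZE_BYTES = 256 * 1024 * 1024
PC_COUNT = 32
PCS_PER_SEGMENT = 4   # each 4x4 switch segment serves four pseudo channels


def pc_for_address(addr):
    """Return the index of the pseudo channel that holds the byte address."""
    if not 0 <= addr < PC_COUNT * PC_SIZE_BYTES:
        raise ValueError("address is outside the 8 GB HBM range")
    return addr // PC_SIZE_BYTES


def segment_for_pc(pc):
    """Return the index (0-7) of the 4x4 switch segment that serves the PC."""
    return pc // PCS_PER_SEGMENT


def is_aligned(master, addr):
    """True if the address falls in the PC aligned with the master (Mi -> Si)."""
    return pc_for_address(addr) == master


if __name__ == "__main__":
    addr = 300 * 1024 * 1024          # an address inside the 256-512 MB window
    pc = pc_for_address(addr)
    print("PC:", pc, "segment:", segment_for_pc(pc))
    print("aligned with M1:", is_aligned(1, addr))
    print("aligned with M0:", is_aligned(0, addr))
```

An address in the 256-512 MB window resolves to PC 1 in segment 0, so it is aligned for M1 but requires a hop through the switch for M0.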

Performance will be impacted by two factors:

  • Each connection in the switch has the same bandwidth.

  • Crossing from one 4x4 switch to another 4x4 switch increases latency.

The fastest connections will be from the AXI master to one of the four aligned pseudo channels in the same switch. If multiple masters span the address range, the left ↔ right switch structure can become bandwidth saturated.
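To reason about which master-to-PC connections stay local and which ones load the left/right links, a rough model can help. The sketch below assumes the eight segments form a linear chain, so a path from segment a to segment b crosses every boundary between them; this topology is an assumption made for illustration and is not confirmed by this section. The function names are hypothetical.

```python
PCS_PER_SEGMENT = 4
SEGMENT_COUNT = 8


def classify_access(master, pc):
    """Classify a master-to-PC connection as 'aligned', 'local', or 'remote'."""
    if master == pc:
        return "aligned"                  # Mi -> Si, the fastest path
    if master // PCS_PER_SEGMENT == pc // PCS_PER_SEGMENT:
        return "local"                    # stays inside one 4x4 segment
    return "remote"                       # must use the L/R connections


def segment_crossings(master, pc):
    """Number of segment boundaries crossed, assuming a linear chain."""
    return abs(master // PCS_PER_SEGMENT - pc // PCS_PER_SEGMENT)


def boundary_load(flows):
    """Count how many (master, pc) flows cross each of the 7 boundaries.

    Entry i of the result is the boundary between segment i and segment i + 1.
    """
    loads = [0] * (SEGMENT_COUNT - 1)
    for master, pc in flows:
        a = master // PCS_PER_SEGMENT
        b = pc // PCS_PER_SEGMENT
        for boundary in range(min(a, b), max(a, b)):
            loads[boundary] += 1
    return loads


if __name__ == "__main__":
    flows = [(0, 0), (0, 3), (0, 31), (4, 0)]
    for master, pc in flows:
        print(master, pc, classify_access(master, pc), segment_crossings(master, pc))
    print("flows per boundary:", boundary_load(flows))
```

Boundaries shared by many flows are the likely candidates for the left/right saturation described above. Keeping each master within its own 4x4 segment keeps every entry at zero.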

For more information on the HBM controller, refer to AXI High Bandwidth Memory Controller v1.0.