Some algorithms are memory bound, limited by the 77 GB/s bandwidth available on DDR-based Alveo cards. For those applications there are High Bandwidth Memory (HBM) based Alveo cards, providing up to 460 GB/s memory bandwidth. For the Alveo implementation, two 16-layer HBM (HBM2 specification) stacks are incorporated into the FPGA package and connected into the FPGA fabric with an interposer. A high-level diagram of the two HBM stacks is as follows.
This implementation provides:
- 16 GB HBM using 512 MB pseudo channels (PCs) for Alveo U55C accelerator cards, as described in Alveo U55C Data Center Accelerator Cards Data Sheet (DS978)
- 8 GB total HBM using 256 MB PCs for Alveo U280 accelerator cards, and for U50 cards as described in Alveo U50 Data Center Accelerator Cards Data Sheet (DS965)
- An independent AXI channel for communication between the Vitis kernels and the HBM PCs through a segmented crossbar switch network
- A two-channel memory controller for addressing two PCs
- 14.375 GB/s max theoretical bandwidth per PC
- 460 GB/s (32 × 14.375 GB/s) max theoretical bandwidth for the HBM subsystem
Although each PC has a theoretical maximum performance of 14.375 GB/s, this is less than the theoretical maximum of 19.25 GB/s for a DDR channel. To get better than DDR performance, designs must efficiently integrate multiple AXI masters into the HBM subsystem. The programmable logic has 32 HBM AXI interfaces that can access any memory location in any of the PCs on either of the HBM stacks through a built-in switch network providing full access to the memory space.
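As a sketch of what this looks like on the kernel side, the following Vitis HLS fragment (the kernel body is illustrative; the kernel and argument names match the config example below) exposes three separate AXI master interfaces by assigning each pointer argument to its own m_axi bundle, so that each interface can later be mapped to its own HBM PC with --connectivity.sp:
extern "C" void krnl(const int* in1, const int* in2, int* out, int size) {
    // Each argument gets its own AXI master bundle so the linker can route
    // it to a separate HBM pseudo channel.
    #pragma HLS INTERFACE m_axi port=in1 offset=slave bundle=gmem0
    #pragma HLS INTERFACE m_axi port=in2 offset=slave bundle=gmem1
    #pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem2
    for (int i = 0; i < size; i++) {
    #pragma HLS PIPELINE II=1
        out[i] = in1[i] + in2[i];
    }
}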
Connection to the HBM is managed by the HBM Memory Subsystem (HMSS) IP, which enables all HBM PCs and automatically connects the XDMA to the HBM for host access to global memory. When used with the Vitis compiler, the HMSS is automatically customized to activate only the necessary memory controllers and ports, as specified by the --connectivity.sp option, and to connect both the user kernels and the XDMA to those memory controllers for optimal bandwidth and latency.
The syntax of the --connectivity.sp option to connect kernels to HBM PCs is:
sp=<compute_unit_name>.<argument>:<HBM_PC>
In the following config file example, the kernel input ports in1 and in2 are connected to HBM PCs 0 and 1, respectively, and the output buffer out is connected to HBM PCs 3 and 4:
[connectivity]
sp=krnl.in1:HBM[0]
sp=krnl.in2:HBM[1]
sp=krnl.out:HBM[3:4]
Each HBM PC is 256 MB, giving a total of 1 GB of memory access for this kernel. Refer to the Using HBM Tutorial for additional information and examples.
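As a hedged example of how such a config file might be applied (the file name and platform are placeholders), the connectivity section can be saved to a file and passed to the Vitis linker with the --config option:
v++ --link --target hw --platform <platform_name> --config hbm_config.cfg -o krnl.xclbin krnl.xo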
The --connectivity.sp syntax for HBM also lets you specify the switch network channel index that the HMSS should use when connecting the kernel interface. The --connectivity.sp syntax for specifying the switch network index is:
sp=<compute_unit_name>.<argument>:<bank_name>.<index>
When specifying the switch index, only one index can be specified per sp option. You cannot reuse an index that has already been used by another sp option or line in the config file.
In this case, the last line of the previous example could be rewritten as sp=krnl.out:HBM[3:4].3 to use switch channel 3 (S_AXI03), or as sp=krnl.out:HBM[3:4].4 to use switch channel 4 (S_AXI04), as shown in the figure above.
Either sp option routes the kernel's data transactions through one of the left-most network switch blocks, reducing implementation complexity. Using any index in the range 0 to 7 uses one of the two left-most switch blocks in the network; using any other index forces the use of additional switch blocks, which adds routing complexity and can negatively impact performance depending on the application.
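For example, a config file along the following lines (a sketch that reuses the kernel from the earlier example and picks arbitrary but unique indices) keeps all three interfaces on switch channels in the 0 to 7 range, so only the left-most switch blocks are used:
[connectivity]
sp=krnl.in1:HBM[0].0
sp=krnl.in2:HBM[1].1
sp=krnl.out:HBM[3:4].3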
The HBM ports are located in the bottom SLR of the device. The HMSS automatically handles the placement and timing complexities of AXI interfaces crossing super logic regions (SLRs) in SSI technology devices. By default, without the --connectivity.sp or --connectivity.slr options specified on v++, all kernel AXI interfaces access HBM[0] and all kernels are assigned to SLR0.
However, you can specify the SLR assignment of each kernel using the --connectivity.slr option. For devices or platforms with multiple SLRs, it is strongly recommended that you define CU assignments to specific SLRs. Refer to Assigning Compute Units to SLRs on Alveo Accelerator Cards for more information.
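As a sketch of how the two options can be combined, the following config file (assuming the compute unit from the earlier example is named krnl, and using the slr=<compute_unit_name>:<SLR_NUM> syntax) maps the kernel arguments to HBM PCs and pins the compute unit to SLR0, where the HBM ports reside:
[connectivity]
sp=krnl.in1:HBM[0]
sp=krnl.in2:HBM[1]
sp=krnl.out:HBM[3:4]
slr=krnl:SLR0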