Improving HBM Bandwidth with Multiple M_AXI Interfaces

Vitis High-Level Synthesis User Guide (UG1399)

Document ID: UG1399
Release Date: 2023-12-18
Version: 2023.2 English

Vitis HLS provides features for top-level array or pointer arguments that improve the parallelism and throughput achievable from C++ code with only a few code changes.

Improvements to the ARRAY_PARTITION Pragma

The ARRAY_PARTITION pragma or directive can be applied to arrays on top-level arguments, at the interface of the top-level function. Used alongside the INTERFACE pragma, this combination lets Vitis HLS create multiple M_AXI interfaces for a single software argument to use with HBM Pseudo Channels. Use ARRAY_PARTITION factor=N to create N point-to-point connections between the M_AXI interfaces and HBM Pseudo Channels (or HBM_PC) to achieve up to N times the throughput of a single interface.

In the micro-architecture, the array accesses and the associated datapath must be explicitly parallelized to match the increased number of M_AXI interfaces. In practice, the source code must express the same degree of parallelism.

Array partitioning types can be complete, cyclic, or block, though for this application only cyclic or block are used. Each type can lead to a different micro-architecture.

Cyclic partitions can be used within a loop: consecutive accesses to the array connect to different M_AXI interfaces and their matching HBM_PC, and you can partially unroll the loop by the same factor.
```
constexpr int partitions = 4;
constexpr int NWORDS=32<<20;
void example(int a[NWORDS]  , int b[1]) {
    #pragma HLS INTERFACE m_axi port=a depth=NWORDS bundle=gmem max_widen_bitwidth=512
    #pragma HLS ARRAY_PARTITION cyclic variable=a factor=partitions
    #pragma HLS INTERFACE m_axi port=b depth=1 bundle=outmem 
    int tot=0;
accessloop:
    for (int i=0; i < NWORDS; ++i) {
        #pragma HLS PIPELINE II=1
        #pragma HLS UNROLL factor=partitions
        tot+= a[i] * i; // computation on a[i]
    }
    b[0]=tot;
}
```
The preceding example sizes the argument in two ways: it declares a sized array rather than a pointer, and it specifies the depth option on the INTERFACE pragma. The two methods are equivalent, and at least one is needed for C/RTL co-simulation to size the RTL buffers correctly. Neither is needed in the Vitis target flow, because pointers are automatically sized to 64 bits. Additionally, cyclic partitioning is independent of the array size.
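To make the cyclic mapping concrete, the following is a minimal host-side sketch in plain C++ (no HLS pragmas). It assumes the usual cyclic interleaving, where element i of the original array lands in partition i % factor at local index i / factor. The names PartitionSlot, cyclic_slot, and total_from_cyclic are hypothetical and are not part of Vitis HLS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not a Vitis HLS API): where element i of the
// original array lands after ARRAY_PARTITION cyclic factor=N.
struct PartitionSlot {
    std::size_t partition; // which partition (M_AXI interface / HBM_PC) holds it
    std::size_t local;     // index of the element inside that partition
};

// Assumed cyclic mapping: consecutive elements rotate through the partitions.
// factor must be greater than zero.
constexpr PartitionSlot cyclic_slot(std::size_t i, std::size_t factor) {
    return PartitionSlot{i % factor, i / factor};
}

// Software model of the partially unrolled access loop: each outer step reads
// the same local index from every partition, one word per interface, and
// rebuilds the original index to apply the a[i] * i computation.
// Uses long long instead of int to avoid overflow in the model.
long long total_from_cyclic(const std::vector<std::vector<int>>& parts) {
    assert(!parts.empty());
    const std::size_t factor = parts.size();
    long long tot = 0;
    for (std::size_t j = 0; j < parts[0].size(); ++j) {
        for (std::size_t k = 0; k < factor; ++k) {
            const std::size_t i = j * factor + k; // original index of parts[k][j]
            tot += static_cast<long long>(parts[k][j]) * static_cast<long long>(i);
        }
    }
    return tot;
}
```

With factor=4, elements 0, 1, 2, and 3 fall in four different partitions, which is why an unroll factor equal to the partition factor lets each unrolled copy of the loop body read from its own interface.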

Block partitions can be used when the design includes several instances of the same compute function. Each instance accesses a different block of the array through a different M_AXI interface and its matching HBM_PC.

```
constexpr int partitions = 4;
constexpr int NWORDS = 32 << 20;
// Maximum widened M_AXI port width in bits; 512 matches the cyclic example above
constexpr int MAXWBW = 512;

void my_pe(int a[NWORDS/partitions], int &b, int offset) { // PE processes a fraction of the data
    // offset is needed to adjust the coefficient to multiply the array value
    int tot = 0;
accessloop:
    for (int i = 0; i < NWORDS/partitions; ++i) {
        #pragma HLS PIPELINE II=1
        tot += a[i] * (i + offset); // computation on a[i]
    }
    b = tot;
}

void example(int a[NWORDS], int b[1]) {
    #pragma HLS INTERFACE m_axi port=a depth=NWORDS bundle=gmem max_widen_bitwidth=MAXWBW
    #pragma HLS ARRAY_PARTITION block variable=a factor=partitions
    #pragma HLS INTERFACE m_axi port=b depth=1 bundle=outmem

    int partial_totals[partitions], tot;
    #pragma HLS ARRAY_PARTITION complete variable=partial_totals
    #pragma HLS DATAFLOW
instanceloop_unrolled:
    for (int i = 0; i < partitions; ++i) {
        #pragma HLS UNROLL
        // each instance accesses a different block
        my_pe(&a[NWORDS*i/partitions], partial_totals[i], NWORDS*i/partitions);
    }

reduce:
    for (int i = tot = 0; i < partitions; ++i) { // the loop initializer also sets tot to 0
        #pragma HLS PIPELINE off
        tot += partial_totals[i];
    }
    b[0] = tot;
}
```

In the above example, the array sizing information is needed because the block partitioning is dependent on the size of the original array.
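The role of the offset argument can be checked with a small software model. The sketch below is plain C++ without HLS pragmas, and it assumes the array length is a multiple of the partition factor. The names pe_model and total_from_blocks are hypothetical stand-ins for my_pe and example. It illustrates that summing the per-block partial totals reproduces the result of a single pass over the whole array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software model of my_pe: processes one contiguous block and
// uses the block's starting offset to rebuild the index in the original array.
// Uses long long instead of int to avoid overflow in the model.
long long pe_model(const int* block, std::size_t len, std::size_t offset) {
    long long tot = 0;
    for (std::size_t i = 0; i < len; ++i) {
        tot += static_cast<long long>(block[i]) * static_cast<long long>(i + offset);
    }
    return tot;
}

// Block partitioning: partition p owns the contiguous range
// [p * len, (p + 1) * len), where len = n / factor.
// Assumes factor > 0 and that n is a multiple of factor.
long long total_from_blocks(const std::vector<int>& a, std::size_t factor) {
    assert(factor > 0 && a.size() % factor == 0);
    const std::size_t len = a.size() / factor;
    long long tot = 0;
    for (std::size_t p = 0; p < factor; ++p) {
        // Each call models one PE instance reading its own block.
        tot += pe_model(a.data() + p * len, len, p * len);
    }
    return tot;
}
```

Because the block boundaries are computed from the total element count, this model needs to know the array size up front, which mirrors the sizing requirement stated above.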

The generated M_AXI interfaces have bundle names derived from the original bundle name, with a decimal suffix added. For example, pragma HLS INTERFACE bundle=gmem with a partition factor=4 generates four M_AXI interfaces named m_axi_gmem_0, m_axi_gmem_1, m_axi_gmem_2, and m_axi_gmem_3. You must update the connectivity settings of the system to integrate all the M_AXI interfaces. Refer to Mapping Kernel Ports to Memory in Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393) for additional information.

In the examples above, the function signature that the software application uses to call example() changes because of ARRAY_PARTITION factor=4. The signature changes from example(int a[NWORDS], int b[1]) to the following:

```
void example(
    int a_0[NWORDS/4],
    int a_1[NWORDS/4],
    int a_2[NWORDS/4],
    int a_3[NWORDS/4],
    int b[1]);
```

Generated Helper Functions for Host Side Array Partitioning

The software application will have to partition the buffer containing the array into the same cyclic or block partitioning used by the HLS design as described above. To help with this task, Vitis HLS generates helper functions to split host array buffers into multiple partitions, and the reverse helper functions to recombine several partitions into one array. Those helper functions are:

<top_name>_Set_<array_name> splits an array into several partitions. For example, with factor=4 and the int data type:
```
void <top_name>_Set_<array_name>(int *dst[4], int *src, unsigned long long num_elements);
```
<top_name>_Get_<array_name> recombines several partitions into a single array. For example, with factor=4 and the int data type:
```
void <top_name>_Get_<array_name>(int *dst, int *src[4], unsigned long long num_elements);
```
Tip: The order of the arguments follows the convention: destination(s), source(s), number of elements.

The generated helper functions are available as assembly files: ./project/solution/impl/ip/drivers/example_v1_0/src/hbm_helper_x86_64.s and hbm_helper_aarch64.s

They are also available as compiled object files: ./project/solution/impl/ip/drivers/example_v1_0/src/hbm_helper_x86_64.o and hbm_helper_aarch64.o
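As a way to reason about what the split and recombine steps do to the data, the following is a reference model in plain C++. It is a sketch under stated assumptions, not the generated helper code: it assumes num_elements is the element count of the full array, that it is a multiple of the factor, and that the layouts are the standard cyclic interleave and contiguous block split. The model_* names are hypothetical, and unlike the generated helpers they take the factor as an explicit argument.

```cpp
#include <cassert>
#include <cstddef>

// Reference models only; these are NOT the generated hbm_helper functions.

// Cyclic split: element i goes to partition i % factor, local index i / factor.
void model_set_cyclic(int* const* dst, const int* src,
                      std::size_t factor, std::size_t num_elements) {
    assert(factor > 0 && num_elements % factor == 0);
    for (std::size_t i = 0; i < num_elements; ++i) {
        dst[i % factor][i / factor] = src[i];
    }
}

// Cyclic recombine: inverse of model_set_cyclic.
void model_get_cyclic(int* dst, int* const* src,
                      std::size_t factor, std::size_t num_elements) {
    assert(factor > 0 && num_elements % factor == 0);
    for (std::size_t i = 0; i < num_elements; ++i) {
        dst[i] = src[i % factor][i / factor];
    }
}

// Block split: partition p receives the contiguous range [p * len, (p + 1) * len).
void model_set_block(int* const* dst, const int* src,
                     std::size_t factor, std::size_t num_elements) {
    assert(factor > 0 && num_elements % factor == 0);
    const std::size_t len = num_elements / factor;
    for (std::size_t i = 0; i < num_elements; ++i) {
        dst[i / len][i % len] = src[i];
    }
}

// Block recombine: inverse of model_set_block.
void model_get_block(int* dst, int* const* src,
                     std::size_t factor, std::size_t num_elements) {
    assert(factor > 0 && num_elements % factor == 0);
    const std::size_t len = num_elements / factor;
    for (std::size_t i = 0; i < num_elements; ++i) {
        dst[i] = src[i / len][i % len];
    }
}
```

A model like this can serve as a host-side cross-check when validating buffer contents. How the generated helpers treat element counts that are not a multiple of the factor is not covered here.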

Note: Header files are not generated automatically, so you need to include function declarations in the software application. The following example shows a factor=4 and int datatype:

```
// Parameter types follow the helper prototypes shown above
extern "C" void example_Set_a(int **, int *, unsigned long long);
extern "C" void example_Get_a(int *, int **, unsigned long long);
```

Example usage:

```
int *host_in = new int[NWORDS];
for (size_t i = 0; i < NWORDS; ++i) {
    host_in[i] = ... // initialize the host input memory
}

xrt::bo bo_Inputs[partitions];
int* bufIn_map[partitions];
for (int i = 0; i < partitions; i++) {
    // buffer object matching kernel arguments/memory banks
    bo_Inputs[i] = xrt::bo(device, NWORDS*sizeof(int)/partitions, krnl.group_id(i));
    // map buffer object data to a user pointer for manipulation
    bufIn_map[i] = bo_Inputs[i].template map<int*>();
}

// partition the input data into the mapped partitions
example_Set_a(bufIn_map, host_in, NWORDS);

xrt::run run(krnl); // run object for that kernel

for (int i = 0; i < partitions; i++) {
    // sync updated buffer object contents to the board
    bo_Inputs[i].sync(XCL_BO_SYNC_BO_TO_DEVICE);
}
for (int i = 0; i < partitions; i++) {
    // set arguments according to the function signature
    run.set_arg(i, bo_Inputs[i]);
}
...
run.start();
run.wait();
```