Vitis HLS supports features on top-level array or pointer arguments to improve the parallelism and throughput achievable from C++ code with a few code changes.
Improvements to the ARRAY_PARTITION Pragma
The ARRAY_PARTITION pragma or directive can be applied to arrays on top-level arguments, at the interface of the top-level function. Used alongside the INTERFACE pragma, this combination lets Vitis HLS create multiple M_AXI interfaces for a single software argument to use with HBM Pseudo Channels. Use ARRAY_PARTITION factor=N to create N point-to-point connections between the M_AXI interfaces and HBM Pseudo Channels (or HBM_PC) to achieve up to N times the throughput of a single interface.
For the micro-architecture, an explicit parallelization of the array accesses and associated datapath is needed to match the increase of M_AXI interfaces. This translates to a similar parallelization from a coding perspective.
Array partitioning types can be complete, cyclic, or block, though for this application only cyclic or block are used. Each type can lead to a different micro-architecture.
- Cyclic partitions can be used within a loop: consecutive accesses of the array will connect to different M_AXI and their matching HBM_PC, and you can partially unroll the loop with the same factor.
constexpr int partitions = 4;
constexpr int NWORDS = 32 << 20;

void example(int a[NWORDS], int b[1]) {
#pragma HLS INTERFACE m_axi port=a depth=NWORDS bundle=gmem max_widen_bitwidth=512
#pragma HLS ARRAY_PARTITION cyclic variable=a factor=partitions
#pragma HLS INTERFACE m_axi port=b depth=1 bundle=outmem
    int tot = 0;
accessloop:
    for (int i = 0; i < NWORDS; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=partitions
        tot += a[i] * i; // computation on a[i]
    }
    b[0] = tot;
}
The preceding example shows the use of a sized array rather than a pointer, and the depth option is provided with the INTERFACE pragma. Both methods are equivalent and at least one is needed for C-RTL co-simulation, to size the RTL buffers correctly. The sizing information on the arguments and interfaces is not needed when used with the Vitis target flow as pointers are sized to 64 bits automatically. Additionally, the cyclic partitioning is size-independent.
- Block partitions can be used when the design includes several instances of the same compute function, and each compute function will access a different block of the array via different M_AXI and their matching HBM_PC.
constexpr int partitions = 4;
constexpr int NWORDS = 32 << 20;

void my_pe(int a[NWORDS/partitions], int &b, int offset) {
    // PE processes a fraction of the data
    // offset is needed to adjust the coefficient to multiply the array value
    int tot = 0;
accessloop:
    for (int i = 0; i < NWORDS/partitions; ++i) {
#pragma HLS PIPELINE II=1
        tot += a[i] * (i + offset); // computation on a[i]
    }
    b = tot;
}

void example(int a[NWORDS], int b[1]) {
#pragma HLS INTERFACE m_axi port=a depth=NWORDS bundle=gmem max_widen_bitwidth=MAXWBW
#pragma HLS ARRAY_PARTITION block variable=a factor=partitions
#pragma HLS INTERFACE m_axi port=b depth=1 bundle=outmem
    int partial_totals[partitions], tot;
#pragma HLS ARRAY_PARTITION complete variable=partial_totals
#pragma HLS DATAFLOW
instanceloop_unrolled:
    for (int i = 0; i < partitions; ++i) {
#pragma HLS UNROLL
        // each instance accesses a different block
        my_pe(&a[NWORDS*i/partitions], partial_totals[i], NWORDS*i/partitions);
    }
reduce:
    for (int i = tot = 0; i < partitions; ++i) {
#pragma HLS PIPELINE off
        tot += partial_totals[i];
    }
    b[0] = tot;
}
In the above example, the array sizing information is needed because the block partitioning is dependent on the size of the original array.
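The difference between the two partitioning types can be sketched as an index mapping from the original array to a partition and an offset inside it. The following is a minimal software model, assuming the usual definition of cyclic partitioning (elements interleaved across partitions) and block partitioning (contiguous chunks of equal size); the names PartitionSlot, cyclic_slot, and block_slot are hypothetical and are not part of Vitis HLS.

```cpp
#include <cassert>
#include <cstddef>

// Location of one element after partitioning: which partition (and hence
// which M_AXI interface and HBM_PC) holds it, and at what index inside it.
struct PartitionSlot {
    std::size_t partition;
    std::size_t offset;
};

// Cyclic (assumed mapping): consecutive elements rotate across partitions,
// so the result does not depend on the total array size.
inline PartitionSlot cyclic_slot(std::size_t i, std::size_t factor) {
    assert(factor > 0);
    return {i % factor, i / factor};
}

// Block (assumed mapping): each partition holds one contiguous chunk of
// size/factor elements. Assumes size is a multiple of factor.
inline PartitionSlot block_slot(std::size_t i, std::size_t factor,
                                std::size_t size) {
    assert(factor > 0 && size % factor == 0 && i < size);
    const std::size_t block = size / factor;
    return {i / block, i % block};
}
```

With factor=4 and a 16-element array, this model places element 9 in partition 1 at offset 2 under cyclic partitioning, and in partition 2 at offset 1 under block partitioning. Only block_slot takes the array size, which mirrors why the block example needs the sizing information.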
The generated M_AXI interfaces have bundle names derived from the original bundle name, with a decimal suffix added. For example, pragma HLS INTERFACE bundle=gmem with a partition factor=4 generates four M_AXI interfaces named m_axi_gmem_0, m_axi_gmem_1, m_axi_gmem_2, and m_axi_gmem_3. You will need to update the connectivity settings for the system to integrate all the M_AXI interfaces. Refer to Mapping Kernel Ports to Memory in Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393) for additional information.
In the examples above, the function signature used by the software application to call the example() module changes because of the use of ARRAY_PARTITION factor=4. The function signature changes from example(int a[NWORDS], int b[1]) to the following:
void example(int a_0[NWORDS/4], int a_1[NWORDS/4], int a_2[NWORDS/4], int a_3[NWORDS/4], int b[1]);
Generated Helper Functions for Host Side Array Partitioning
The software application will have to partition the buffer containing the array into the same cyclic or block partitioning used by the HLS design as described above. To help with this task, Vitis HLS generates helper functions to split host array buffers into multiple partitions, and the reverse helper functions to recombine several partitions into one array. Those helper functions are:
- <top_name>_Set_<array_name> is used to partition an array into several partitions. For example, with a factor=4 and int datatype:
  void <DUT>_Set_<ARG>(int *dst[4], int *src, unsigned long long num_elements);
- <top_name>_Get_<array_name> is used to recombine several partitions into a single array. For example, with a factor=4 and int datatype:
  void <DUT>_Get_<ARG>(int *dst, int *src[4], unsigned long long num_elements);
Tip: The order of the arguments follows the convention: destination(s), source(s), number of elements.
- Generated helper functions are available as assembly files: ./project/solution/impl/ip/drivers/example_v1_0/src/hbm_helper_x86_64.s & hbm_helper_aarch64.s
- The generated helper functions are also available as compiled files: ./project/solution/impl/ip/drivers/example_v1_0/src/hbm_helper_x86_64.o & hbm_helper_aarch64.o
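To make the behavior of the Set and Get helpers concrete, the following is a minimal reference model for the cyclic case, which could be used to cross-check the generated helpers on small inputs. It is a sketch, not the generated code: the names ref_set_cyclic and ref_get_cyclic are hypothetical, the element-to-partition mapping is assumed to be the usual cyclic interleaving, and the partition count is passed as an extra factor argument that the generated helpers do not take.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference model (not the generated helper): scatter src into
// `factor` cyclic partitions. Argument order follows the same convention:
// destination(s), source(s), number of elements.
inline void ref_set_cyclic(int *dst[], const int *src, std::size_t factor,
                           unsigned long long num_elements) {
    assert(factor > 0);
    for (unsigned long long i = 0; i < num_elements; ++i) {
        dst[i % factor][i / factor] = src[i];
    }
}

// Inverse operation: gather `factor` cyclic partitions back into one array.
inline void ref_get_cyclic(int *dst, int *const src[], std::size_t factor,
                           unsigned long long num_elements) {
    assert(factor > 0);
    for (unsigned long long i = 0; i < num_elements; ++i) {
        dst[i] = src[i % factor][i / factor];
    }
}
```

Scattering with ref_set_cyclic and gathering with ref_get_cyclic returns the original array, which is the property the generated Set and Get pair is described as providing.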
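To call the helper functions from host code, declare them first. For example, for the example top function and its argument a, with factor=4 and int datatype: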
extern "C" void example_Set_a(int **, int *, long);
extern "C" void example_Get_a(int *, int **, long);
Example usage:
int *host_in = new int[NWORDS];
for (size_t i = 0; i < NWORDS; ++i) {
    host_in[i] = ... // initialize the host input memory
}
xrt::bo bo_Inputs[partitions];
int* bufIn_map[partitions];
for (int i = 0; i < partitions; i++) {
    // buffer object matching kernel arguments/memory banks
    bo_Inputs[i] = xrt::bo(device, NWORDS*sizeof(int)/partitions, krnl.group_id(i));
    // Map buffer object data to user pointer for manipulation
    bufIn_map[i] = bo_Inputs[i].template map<int*>();
}
// partition the input data into the mapped partitions
example_Set_a(bufIn_map, host_in, NWORDS);
xrt::run run(krnl); // run object for that kernel
for (int i = 0; i < partitions; i++) {
    // sync updated bo contents to board
    bo_Inputs[i].sync(XCL_BO_SYNC_BO_TO_DEVICE);
}
for (int i = 0; i < partitions; i++) {
    // set arguments according to function signature
    run.set_arg(i, bo_Inputs[i]);
}
...
run.start();
run.wait();
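Once the result for b has been read back from the device (that step is elided in the usage above), it can be compared against a host-side reference. The following is a minimal sketch of such a reference for the example kernels, assuming that the kernel's int accumulation wraps modulo 2^32; the names ref_total and ref_total_blocked are hypothetical. The blocked variant mirrors the my_pe instances and shows why each instance needs its offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference for the example kernels: sum of a[i] * i, computed in unsigned
// 32-bit arithmetic so that overflow wraps (assumed to match the hardware)
// instead of being undefined behavior.
inline std::uint32_t ref_total(const std::vector<int> &a) {
    std::uint32_t tot = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        tot += static_cast<std::uint32_t>(a[i]) * static_cast<std::uint32_t>(i);
    }
    return tot;
}

// Same total, accumulated per block the way the my_pe instances do: each
// block adds its starting offset to the local index before multiplying.
inline std::uint32_t ref_total_blocked(const std::vector<int> &a,
                                       std::size_t factor) {
    assert(factor > 0 && a.size() % factor == 0);
    const std::size_t block = a.size() / factor;
    std::uint32_t tot = 0;
    for (std::size_t p = 0; p < factor; ++p) {
        const std::size_t offset = p * block;
        std::uint32_t partial = 0;
        for (std::size_t j = 0; j < block; ++j) {
            partial += static_cast<std::uint32_t>(a[offset + j]) *
                       static_cast<std::uint32_t>(offset + j);
        }
        tot += partial;
    }
    return tot;
}
```

A host test could compare static_cast<std::uint32_t>(b[0]) with ref_total(host_data); both reference functions return the same value for any input whose size is a multiple of the factor.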