Setting Data Width in OpenCL Kernels - 2021.2 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

For OpenCL kernels, the API provides attributes that support increasing the AXI data width used by the kernel. To eliminate manual code modifications, the compiler interprets the following OpenCL attributes to widen the data path and vectorize the algorithm:

  • vec_type_hint
  • reqd_work_group_size
  • xcl_zero_global_work_offset

Consider the combined effect of these attributes on the following kernel:

__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
    size_t idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}

In this case, the hard-coded interface is a 32-bit wide data path (int *c, int *a, int *b), which drastically limits the memory throughput if implemented directly. However, automatic widening and transformation are applied based on the values of the three attributes.

__attribute__((vec_type_hint(int)))
Declares that int is the main type used for computation and memory transfer (32 bits). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example, the factor is 512 bits / 32 bits = 16. This implies that, in theory, 16 values can be processed in parallel if vectorization can be applied.
__attribute__((reqd_work_group_size(X, Y, Z)))
Defines the total number of work items in a work group (where X, Y, and Z are positive constants). X*Y*Z is the maximum number of work items, and therefore defines the maximum possible vectorization factor that would saturate the memory bandwidth. In this example, the total number of work items is 64*1*1 = 64.

The actual vectorization factor applied is the greatest common divisor of the vectorization factor defined by the coded type or the vec_type_hint, and the maximum possible vectorization factor defined through reqd_work_group_size. In this example, that is the greatest common divisor of 16 and 64, which is 16.

The quotient of maximum possible vectorization factor divided by the actual vectorization factor provides the remaining loop count of the OpenCL description. As this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation. This is especially true if the vectorized OpenCL code has long latency.

The __attribute__((xcl_zero_global_work_offset)) attribute tells the compiler that no global offset parameter is used at runtime, and that all accesses are aligned. This gives the compiler valuable information about the alignment of the work groups, which in turn usually propagates to the alignment of the memory accesses and results in less hardware.

Note that applying these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of the local arrays in which data is stored. This usually works well, but can interact poorly in rare situations.

For example:

  • For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor.
    • The resulting access requires many multiplexers and creates a difficult problem for the scheduler (it might severely increase memory usage and compilation time). Xilinx recommends using partitioning factors that are powers of two (because the vectorization factor is always a power of two).
  • If the loop being vectorized has an unrelated resource constraint, the scheduler reports that the initiation interval (II) is not met.
    • This does not necessarily mean a loss of performance (the design usually still performs better) because the II is computed on the unrolled loop, in which each iteration does proportionally more work.
    • The scheduler informs you of the possible resource constraints, and resolving those will further improve the performance.
    • Note that a common cause is a local array that is not automatically reshaped (usually because it is accessed in a later section of the code in a non-vectorizable way).