When OpenCL kernels are submitted for execution on an OpenCL device, they execute within an index space called an ND range, which can have 1, 2, or 3 dimensions. The size of the ND range is called the global size in the OpenCL API. The work-group size defines the portion of the ND range that can be processed by a single invocation of a kernel compute unit (CU); it is called the local size in the OpenCL API. The OpenCL compiler can determine the work-group size based on the properties of the kernel and the selected device. After the work-group size (local size) is determined, the ND range (global size) is divided automatically into work-groups, and the work-groups are scheduled for execution on the device.
Although the OpenCL compiler can define the work-group size, specifying the REQD_WORK_GROUP_SIZE attribute on the kernel is highly recommended for FPGA implementations. The attribute enables performance optimization during the generation of the custom logic for the kernel.
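The division of the ND range into work-groups described above can be sketched in plain C. This is a minimal illustration, not part of the OpenCL API: the helper names work_groups_in_dim and total_work_groups are hypothetical, and the sketch assumes uniform work-groups, where the global size in each dimension is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups along one dimension.
 * Returns 0 when the local size is zero or does not evenly divide
 * the global size (this sketch assumes uniform work-groups). */
size_t work_groups_in_dim(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}

/* Hypothetical helper: total number of work-groups in an ND range
 * with 1, 2, or 3 dimensions. */
size_t total_work_groups(unsigned work_dim,
                         const size_t *global_size,
                         const size_t *local_size)
{
    assert(work_dim >= 1 && work_dim <= 3);
    size_t total = 1;
    for (unsigned d = 0; d < work_dim; ++d)
        total *= work_groups_in_dim(global_size[d], local_size[d]);
    return total;
}
```

For example, a global size of 1024x4x1 with a local size of 16x1x1 gives 64 work-groups in the first dimension and 256 work-groups in total.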
OpenCL kernel functions are executed exactly one time for each point in the ND range index space. This unit of work for each point in the ND range is called a work-item. Work-items are organized into work-groups, which are the unit of work scheduled onto compute units. The optional REQD_WORK_GROUP_SIZE attribute defines the work-group size of a compute unit that must be used as the local_work_size argument to clEnqueueNDRangeKernel. This allows the compiler to optimize the generated code appropriately for this kernel.
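Because the host must pass a local_work_size that agrees with the attribute, it can help to check the value before enqueueing. In the OpenCL specification, clEnqueueNDRangeKernel reports CL_INVALID_WORK_GROUP_SIZE when a specified local_work_size does not match the required work-group size. The following plain C sketch only mimics that comparison on the host side; REQD_LOCAL and local_size_matches are hypothetical names, and the required size of 16x1x1 is an assumed example value.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed example: mirrors reqd_work_group_size(16, 1, 1) on the kernel. */
static const size_t REQD_LOCAL[3] = {16, 1, 1};

/* Hypothetical host-side check: returns 1 when local_work_size matches
 * the required work-group size, 0 otherwise. Dimensions beyond work_dim
 * are treated as size 1. */
int local_size_matches(unsigned work_dim, const size_t *local_work_size)
{
    assert(work_dim >= 1 && work_dim <= 3);
    for (unsigned d = 0; d < 3; ++d) {
        size_t requested = (d < work_dim) ? local_work_size[d] : 1;
        if (requested != REQD_LOCAL[d])
            return 0;
    }
    return 1;
}
```

A host program could call such a check just before clEnqueueNDRangeKernel and report a clear error message instead of relying on the returned error code alone.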
Place this attribute before the kernel definition, or before the primary function specified for the kernel.
__attribute__((reqd_work_group_size(<X>, <Y>, <Z>)))
- <X>, <Y>, <Z>: Specify the size of the work-group for the kernel, with one value for each dimension of the three-dimensional index space.
The following OpenCL C kernel code shows a vector addition design where two arrays of data are summed into a third array. The required size of the work-group is 16x1x1, so the kernel body executes 16 times, once per work-item, to process one work-group.
#include <clc.h>
// For VHLS OpenCL C kernels, the full work group is synthesized
__attribute__ ((reqd_work_group_size(16, 1, 1)))
__kernel void
vadd(__global int* a,
     __global int* b,
     __global int* c)
{
    // Each work-item adds one pair of elements
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}
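One way to picture what a fixed 16x1x1 work-group means for this kernel is to emulate it in plain C, where each loop iteration plays the role of one work-item. This is only an illustrative sketch, not the code the tool generates: vadd_work_group and vadd_nd_range are hypothetical names, the global offset is assumed to be zero, and the global size is assumed to be a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE 16 /* matches reqd_work_group_size(16, 1, 1) */

/* Emulates one work-group of vadd: 16 iterations, one per work-item. */
void vadd_work_group(size_t group_id, const int *a, const int *b, int *c)
{
    for (size_t local_id = 0; local_id < WG_SIZE; ++local_id) {
        /* Value get_global_id(0) would return with a zero global offset */
        size_t idx = group_id * WG_SIZE + local_id;
        c[idx] = a[idx] + b[idx];
    }
}

/* Emulates the whole 1-D ND range as a sequence of work-groups. */
void vadd_nd_range(size_t global_size, const int *a, const int *b, int *c)
{
    assert(global_size % WG_SIZE == 0);
    for (size_t group_id = 0; group_id < global_size / WG_SIZE; ++group_id)
        vadd_work_group(group_id, a, b, c);
}
```

Because the trip count of the inner loop is a compile-time constant, a synthesis tool has a fixed amount of work per work-group to build logic for, which is the kind of optimization opportunity the attribute is meant to provide.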