The OpenCL-based execution model supports data-parallel and task-parallel programming models. An OpenCL host generally needs to call different kernels multiple times. These calls are enqueued on a command queue, either in a fixed sequence or in an out-of-order command queue. Depending on the availability of compute resources and task data, they are then scheduled for execution on the device.
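For instance, on platforms that support it, the host can request out-of-order scheduling when creating the queue. The following is a minimal sketch using the standard OpenCL C API; the helper name, and the assumption that the context and device are already set up, are illustrative:

#include <CL/cl.h>

// Sketch: create an out-of-order command queue so the runtime may schedule
// enqueued kernels as compute resources and task data become available.
cl_command_queue make_out_of_order_queue(cl_context context,
                                         cl_device_id device) {
    cl_int err;
    return clCreateCommandQueue(context, device,
                                CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
}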
Kernel calls can be enqueued for execution on a command queue using clEnqueueTask. The dispatching process is executed on the host processor. The dispatcher invokes kernel execution after transferring the kernel arguments to the accelerator running on the device. The dispatcher uses the low-level Xilinx Runtime (XRT) library to transfer kernel arguments and issue the trigger commands that start the compute. The overhead of dispatching the commands and arguments to the accelerator can be between 30 µs and 60 µs, depending on the number of arguments set for the kernel. You can reduce the impact of this overhead by minimizing the number of times the kernel needs to be executed, and by minimizing calls to clEnqueueTask.
Ideally, you should finish all the compute in a single call to clEnqueueTask.
You can minimize calls to clEnqueueTask by batching your data and invoking the kernel one time, with a loop wrapped around the original implementation, to avoid the overhead of multiple enqueue calls. Batching can also improve data transfer performance between the host and the accelerator, by transferring fewer large data packets rather than many small ones. For more information on reducing kernel execution overhead, see Kernel Execution. For example, the following kernel copies a 256-element input into a local buffer and adds a constant to each element:
#define SIZE 256
extern "C" {
void add(int *a, int *b, int inc) {
    int buff_a[SIZE];
    // Copy the input into a local buffer.
    for (int i = 0; i < SIZE; i++) {
        buff_a[i] = a[i];
    }
    // Add the constant and write the result back.
    for (int i = 0; i < SIZE; i++) {
        b[i] = buff_a[i] + inc;
    }
}
}
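For example, a host that processes N chunks of 256 elements by enqueuing this kernel once per chunk pays the 30-60 µs dispatch overhead N times. A hypothetical sketch of that pattern, assuming the queue, kernel, and buffers have been created elsewhere (none of these handle names come from the original):

#include <CL/cl.h>

// Sketch: one clEnqueueTask per 256-element chunk; every iteration pays
// the full dispatch overhead.
void run_unbatched(cl_command_queue queue, cl_kernel kernel,
                   cl_mem buf_a, cl_mem buf_b, int inc, int num_chunks) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
    clSetKernelArg(kernel, 2, sizeof(int), &inc);
    for (int n = 0; n < num_chunks; n++) {
        // A real application would refill buf_a between calls; omitted here.
        clEnqueueTask(queue, kernel, 0, NULL, NULL);
    }
    clFinish(queue);
}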
By adding a num_batches argument, the kernel can process multiple inputs of size 256 in a single call and avoid the overhead of multiple clEnqueueTask calls. The host application changes to allocate data and buffers in chunks of SIZE * num_batches, essentially batching the memory allocation and the transfer of data between the host and the device global memory.
#define SIZE 256
extern "C" {
void add(int *a, int *b, int inc, int num_batches) {
    int buff_a[SIZE];
    for (int j = 0; j < num_batches; j++) {
        // Copy the j-th chunk of the input into the local buffer.
        for (int i = 0; i < SIZE; i++) {
            buff_a[i] = a[j * SIZE + i];
        }
        // Add the constant and write the j-th chunk of the result.
        for (int i = 0; i < SIZE; i++) {
            b[j * SIZE + i] = buff_a[i] + inc;
        }
    }
}
}
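A hedged sketch of those host-side changes, allocating and transferring SIZE * num_batches elements and dispatching the kernel once, again using the standard OpenCL C API (function and variable names are illustrative, not from the original):

#include <CL/cl.h>
#include <vector>

#define SIZE 256

// Sketch: batch SIZE * num_batches elements into one buffer, one transfer,
// and a single clEnqueueTask dispatch.
void run_batched(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                 const std::vector<int> &host_a, std::vector<int> &host_b,
                 int inc, int num_batches) {
    size_t bytes = sizeof(int) * SIZE * num_batches;
    cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    // One large host-to-device transfer instead of many small ones.
    clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, bytes, host_a.data(),
                         0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
    clSetKernelArg(kernel, 2, sizeof(int), &inc);
    clSetKernelArg(kernel, 3, sizeof(int), &num_batches);

    // A single dispatch covers all num_batches chunks.
    clEnqueueTask(queue, kernel, 0, NULL, NULL);

    clEnqueueReadBuffer(queue, buf_b, CL_TRUE, 0, bytes, host_b.data(),
                        0, NULL, NULL);
    clReleaseMemObject(buf_a);
    clReleaseMemObject(buf_b);
}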