Often the compute-intensive task required by the host application can be defined inside a single kernel that is executed only once to work on the entire data range. Because each kernel execution carries overhead, invoking a single monolithic kernel can improve performance. Although the kernel is executed only once, parallelism is still achieved on the FPGA inside the kernel hardware. If properly coded, the kernel can achieve parallelism through techniques such as instruction-level parallelism (loop pipelining) and function-level parallelism (dataflow). These kernel coding techniques are discussed in Developing PL Kernels using C++.
When the kernel is compiled to a single hardware instance (or CU) on the FPGA, the simplest method of executing the kernel is to use clEnqueueTask, as shown below.
err = clEnqueueTask(commands, kernel, 0, NULL, NULL);
XRT schedules the workload (the data passed through OpenCL buffers as kernel arguments) and the kernel tasks to run on the accelerator on the AMD FPGA.
Although clEnqueueNDRangeKernel is supported (only for OpenCL kernels), AMD recommends using clEnqueueTask.
However, using a single clEnqueueTask to run the kernel is not
always feasible. For example, the kernel code can become too big and complex to optimize if it attempts to perform all compute-intensive tasks in a single execution. In other cases, multiple kernels are designed to perform different tasks on the FPGA in parallel, requiring multiple enqueue commands. Or the host application receives data over time, so not all the data can be processed at once. Therefore, depending on the situation and application, you might need to break the data and the task of the kernel into multiple clEnqueueTask commands. In this case, the choice of an out-of-order or in-order command queue determines how the kernel tasks are processed, as explained in Command Queues. In addition, multiple kernel tasks can be implemented as blocking or non-blocking events, as described in Event Synchronization. These choices can all affect the performance of the design.
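As an illustration of breaking the data into multiple clEnqueueTask commands, the following is a minimal host-side sketch, not the method prescribed by XRT. It only plans how a data range is split into slices and issues one enqueue per slice. The names Chunk, plan_chunks, EnqueueFn, and enqueue_in_chunks are hypothetical, and the callback is an assumed stand-in for the host code that would set the kernel arguments for a slice and call clEnqueueTask. It uses only the C++ standard library, so it makes no OpenCL calls itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One slice of the full data range, to be handled by a single kernel run.
// (Hypothetical type, introduced only for this sketch.)
struct Chunk {
    std::size_t offset;  // index of the first element in the slice
    std::size_t count;   // number of elements in the slice
};

// Split `total` elements into consecutive slices of at most `max_per_task`
// elements. The last slice may be shorter than the others.
std::vector<Chunk> plan_chunks(std::size_t total, std::size_t max_per_task) {
    assert(max_per_task > 0 && "slice size must be non-zero");
    std::vector<Chunk> chunks;
    for (std::size_t offset = 0; offset < total; offset += max_per_task) {
        chunks.push_back({offset, std::min(max_per_task, total - offset)});
    }
    return chunks;
}

// Hypothetical stand-in for the host code that would set the kernel
// arguments for one slice and then call clEnqueueTask on the command queue.
using EnqueueFn = std::function<void(const Chunk&)>;

// Issue one enqueue per slice, in order, and return how many were issued.
std::size_t enqueue_in_chunks(std::size_t total, std::size_t max_per_task,
                              const EnqueueFn& enqueue) {
    const std::vector<Chunk> chunks = plan_chunks(total, max_per_task);
    for (const Chunk& c : chunks) {
        enqueue(c);
    }
    return chunks.size();
}
```

For example, calling enqueue_in_chunks with 10 elements and a slice size of 4 invokes the callback three times, for slices of 4, 4, and 2 elements. Whether those kernel runs overlap or execute one after another then depends on the command queue type and on how the events are synchronized.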
The following topics discuss various methods you can use to run a kernel, run multiple kernels, or run multiple instances of the same kernel on the accelerator.