Kernel Execution - 2024.1 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID
UG1393
Release Date
2024-07-03
Version
2024.1 English

Often the compute intensive task required by the host application can be defined inside a single kernel, and the kernel is executed only once to work on the entire data range. Because there is an overhead associated with multiple kernel executions, invoking a single monolithic kernel can improve performance. Though the kernel is executed only one time, and works on the entire range of the data, the parallelism is achieved on the FPGA inside the kernel hardware. If properly coded, the kernel is capable of achieving parallelism by various techniques such as instruction-level parallelism (loop pipeline) and function-level parallelism (dataflow). These different kernel coding techniques are discussed in Developing PL Kernels using C++.

When the kernel is compiled to a single hardware instance (or CU) on the FPGA, the simplest method of executing the kernel is using clEnqueueTask as shown below.

err = clEnqueueTask(commands, kernel, 0, NULL, NULL);

XRT schedules the workload, or the data passed through OpenCL buffers from the kernel arguments, and schedules the kernel tasks to run on the accelerator on the AMD FPGA.

Important: Though using clEnqueueNDRangeKernel is supported (only for OpenCL kernel), AMD recommends using clEnqueueTask.

However, sometimes using a single clEnqueueTask to run the kernel is not always feasible due to various reasons. For example, the kernel code can become too big and complex to optimize if it attempts to perform all compute intensive tasks in a single execution. Sometimes multiple kernels can be designed performing different tasks on the FPGA in parallel, requiring multiple enqueue commands. Or the host application can be receiving data over time, and not all the data can be processed at one time. Therefore, depending on the situation and application, you may need to break the data and the task of the kernel into multiple clEnqueueTask commands. In this case, an out-of-order command queue, or an in-order command queue can determine how the kernel tasks are processed as explained in Command Queues. In addition, multiple kernel tasks can be implemented as blocking events, or non-blocking events as described in Event Synchronization. These can all affect the performance of the design.

The following topics discuss various methods you can use to run a kernel, run multiple kernels, or run multiple instances of the same kernel on the accelerator.