Sometimes the data processed by a compute unit (CU) passes from one stage of processing in the kernel to the next. In this case, the first stage of the kernel may be free to begin processing a new set of data while later stages are still working on the previous set. Like a factory assembly line, the kernel can accept new data while the original data moves down the line.
To understand this approach, assume a kernel has only one CU on the FPGA, and the host application enqueues the kernel multiple times with different sets of data. As shown in Using Host Pointer Buffers, the host application can migrate data to the device global memory ahead of the kernel execution, hiding the data transfer latency behind the kernel execution and enabling software pipelining.
However, by default, a kernel can start processing a new set of data only when it has finished processing the current set. Although clEnqueueMigrateMemObjects hides the data transfer time, the kernel executions themselves remain sequential.
By enabling host-to-kernel dataflow, you can further improve the performance of the accelerator by restarting the kernel with a new set of data while it is still processing the previous set. As discussed in Enabling Host-to-Kernel Dataflow, the kernel must implement the ap_ctrl_chain interface and must be written to permit processing data in stages. In this case, XRT restarts the kernel as soon as it is able to accept new data, thus overlapping multiple kernel executions. However, the host program must keep the command queue filled with requests so that the kernel can restart as soon as it is ready to accept new data.
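As an illustration of a kernel "written to permit processing data in stages", the following is a minimal sketch rather than a reference implementation. The kernel name `staged_kernel`, the three stage functions, the fixed `BATCH_SIZE`, and the placeholder arithmetic are all hypothetical. The sketch assumes the Vitis HLS `INTERFACE` pragma is used to request `ap_ctrl_chain` and that a `DATAFLOW` region lets the stages overlap; a real kernel would typically also declare its memory interfaces explicitly. A standard C++ compiler ignores the HLS pragmas, so the same code can be checked in software.

```cpp
#include <cstddef>

// Hypothetical fixed batch size for this sketch.
constexpr std::size_t BATCH_SIZE = 16;

// Stage 1: copy the input from global memory into a local buffer.
static void load(const int* in, int* buf) {
    for (std::size_t i = 0; i < BATCH_SIZE; ++i) {
        buf[i] = in[i];
    }
}

// Stage 2: placeholder computation standing in for the real processing.
static void compute(const int* in_buf, int* out_buf) {
    for (std::size_t i = 0; i < BATCH_SIZE; ++i) {
        out_buf[i] = in_buf[i] * 2 + 1;
    }
}

// Stage 3: copy the result from a local buffer back to global memory.
static void store(const int* buf, int* out) {
    for (std::size_t i = 0; i < BATCH_SIZE; ++i) {
        out[i] = buf[i];
    }
}

extern "C" void staged_kernel(const int* in, int* out) {
// Assumption: requests the ap_ctrl_chain block-level control protocol.
#pragma HLS INTERFACE ap_ctrl_chain port=return
// Assumption: lets the three stages below run as an overlapped pipeline.
#pragma HLS DATAFLOW
    int loaded[BATCH_SIZE];
    int computed[BATCH_SIZE];

    load(in, loaded);
    compute(loaded, computed);
    store(computed, out);
}
```

Because each stage only consumes the output of the stage before it, the `load` stage can start on the next enqueued data set once it has handed its buffer to `compute`, which is the property host-to-kernel dataflow relies on.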
The following is a conceptual diagram for host-to-kernel dataflow.
The longer the kernel takes to process a set of data from start to finish, the greater the opportunity to use host-to-kernel dataflow to improve performance. Rather than waiting until the kernel has finished processing one set of data, the next execution starts as soon as the kernel is ready to begin processing the next set. This allows temporal parallelism, where different stages of the same kernel process different sets of data from multiple clEnqueueTask commands in a pipelined manner.
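The benefit can be estimated with a simple timing model. The sketch below is an idealized illustration, not a measurement: it assumes every execution has the same start-to-finish `latency`, that the kernel can accept a new data set a fixed `interval` after the previous one started, and that the command queue is never empty. The function and parameter names are hypothetical, and host and data transfer overheads are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Total time when each execution must finish before the next one starts.
std::uint64_t sequential_time(std::uint64_t runs, std::uint64_t latency) {
    return runs * latency;
}

// Total time when execution k+1 starts `interval` time units after
// execution k started, so consecutive executions overlap.
std::uint64_t pipelined_time(std::uint64_t runs,
                             std::uint64_t latency,
                             std::uint64_t interval) {
    // The kernel cannot accept new data later than it finishes the old data.
    assert(interval >= 1 && interval <= latency);
    if (runs == 0) {
        return 0;
    }
    // The first execution pays the full latency; each later one adds
    // only one interval to the total.
    return latency + (runs - 1) * interval;
}
```

For example, with 8 enqueued executions, a latency of 100 time units, and an interval of 25, `sequential_time(8, 100)` gives 800 while `pipelined_time(8, 100, 25)` gives 275. The gap widens as the latency grows relative to the interval, which matches the observation that longer-running kernels have more to gain.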
For advanced designs, you can combine spatial parallelism, using multiple CUs to process data, with temporal parallelism, using host-to-kernel dataflow to overlap kernel executions on each CU.
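To see how the two forms of parallelism compound, the following sketch estimates the total time when executions are spread evenly across several CUs and each CU overlaps its own executions. It is an idealized model under stated assumptions: identical CUs, an even distribution of work, a fixed `latency` per execution, a fixed `interval` before a CU can accept new data, and no host or memory contention. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Estimated total time for `runs` executions spread evenly over `cus`
// compute units, with each CU overlapping its own executions.
std::uint64_t multi_cu_pipelined_time(std::uint64_t runs,
                                      std::uint64_t cus,
                                      std::uint64_t latency,
                                      std::uint64_t interval) {
    assert(cus >= 1);
    assert(interval >= 1 && interval <= latency);
    if (runs == 0) {
        return 0;
    }
    // The busiest CU receives ceil(runs / cus) executions and determines
    // when the whole batch completes.
    const std::uint64_t runs_on_busiest_cu = (runs + cus - 1) / cus;
    return latency + (runs_on_busiest_cu - 1) * interval;
}
```

With 8 executions, a latency of 100 time units, and an interval of 25, one CU needs 100 + 7 × 25 = 275 time units in this model, while two CUs need 100 + 3 × 25 = 175. In practice, shared global memory bandwidth and host scheduling can reduce the gain.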