Sometimes the compute-intensive task required by the host application can
process its data across multiple hardware instances of the same kernel, called
compute units (CUs), to achieve data parallelism on the FPGA. If a single kernel
has been compiled into multiple CUs, the clEnqueueTask command can be called
multiple times on an out-of-order command queue to enable data parallelism. Each
call of clEnqueueTask schedules a workload of data on a different CU, so the CUs
work in parallel.
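
The pattern described above can be sketched in host code as follows. This is a minimal sketch rather than a definitive implementation. It assumes the context, device, kernel, and buffers were created elsewhere, and that the kernel takes an input buffer as argument 0 and an output buffer as argument 1. The names run_on_all_cus, NUM_CHUNKS, in_bufs, and out_bufs are hypothetical. Transferring data to and from the device is omitted.

```c
#define CL_TARGET_OPENCL_VERSION 120
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS
#include <CL/cl.h>

/* Hypothetical: number of data chunks, ideally one per CU in the binary. */
#define NUM_CHUNKS 4

/* Enqueue the same kernel once per data chunk on an out-of-order queue.
 * Assumed kernel signature: arg 0 = input buffer, arg 1 = output buffer. */
cl_int run_on_all_cus(cl_context ctx, cl_device_id dev, cl_kernel krnl,
                      cl_mem in_bufs[NUM_CHUNKS], cl_mem out_bufs[NUM_CHUNKS])
{
    cl_int err;
    cl_event done[NUM_CHUNKS];
    cl_uint enqueued = 0;

    /* Out-of-order execution lets independent tasks run concurrently. */
    cl_command_queue q = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    if (err != CL_SUCCESS)
        return err;

    for (cl_uint i = 0; i < NUM_CHUNKS; ++i) {
        /* Argument values are captured when the task is enqueued, so the
         * same kernel object can be re-armed for the next chunk. */
        err = clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_bufs[i]);
        if (err == CL_SUCCESS)
            err = clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_bufs[i]);
        if (err == CL_SUCCESS)
            err = clEnqueueTask(q, krnl, 0, NULL, &done[i]);
        if (err != CL_SUCCESS)
            break;
        ++enqueued;
    }

    /* Wait for every task that was enqueued, then release its event. */
    if (enqueued > 0) {
        cl_int wait_err = clWaitForEvents(enqueued, done);
        if (err == CL_SUCCESS)
            err = wait_err;
        for (cl_uint i = 0; i < enqueued; ++i)
            clReleaseEvent(done[i]);
    }

    clReleaseCommandQueue(q);
    return err;
}
```

No dependencies are declared between the tasks (the wait list is empty), so the runtime is free to dispatch them to separate CUs. How many actually run concurrently depends on how many CUs were built into the device binary.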