Recommended: For improved
performance, AMD recommends using a single out-of-order command
queue and manage event dependencies and synchronizations explicitly, instead of using
multiple command queues.
The following figure shows an example with two in-order command queues, CQ0 and CQ1. The scheduler dispatches commands from each queue in order, but commands from CQ0 and CQ1 can be pulled out by the scheduler in any order. You must manage synchronization between CQ0 and CQ1 if required.
Figure 1. Example with Two In-Order Command Queues
The following is code extracted from host.cpp of the concurrent_kernel_execution example that sets up multiple in-order command queues and enqueues commands into each queue:
OCL_CHECK(
err,
cl::CommandQueue ooo_queue(context,
device,
CL_QUEUE_PROFILING_ENABLE |
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
&err));
...
printf("[OOO Queue]: Enqueueing scale kernel\n");
OCL_CHECK(
err,
err = ooo_queue.enqueueTask(
kernel_mscale,nullptr,&ooo_events[0]));
set_callback(ooo_events[0], "scale");
...
// This is an out of order queue, events can be executed in any order. Since
// this call depends on the results of the previous call we must pass the
// event object from the previous call to this kernel's event wait list.
printf("[OOO Queue]: Enqueueing addition kernel (Depends on scale)\n");
kernel_wait_events.resize(0);
kernel_wait_events.push_back(ooo_events[0]);
OCL_CHECK(err,
err = ooo_queue.enqueueTask(
kernel_madd,
&kernel_wait_events, // Event from previous call
&ooo_events[1]));
set_callback(ooo_events[1], "addition");
...
// This call does not depend on previous calls so we are passing nullptr
// into the event wait list. The runtime should schedule this kernel in
// parallel to the previous calls.
printf("[OOO Queue]: Enqueueing matrix multiplication kernel\n");
OCL_CHECK(err,
err = ooo_queue.enqueueTask(
kernel_mmult,
nullptr,
&ooo_events[2]));
set_callback(ooo_events[2], "matrix multiplication");