Lab 2: Kernel and Host Code Synchronization - 2022.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID: XD099
Release Date: 2022-12-01
Version: 2022.2 English

For this step, look at the source code in src/sync_host.cpp and examine the execution loop (line 55). This is the same code used in the previous section of this tutorial.

  // -- Execution -----------------------------------------------------------

  for(unsigned int i=0; i < numBuffers; i++) {
    tasks[i].run(api);
  }
  clFinish(api.getQueue());

In this example, the code implements a free-running pipeline: no synchronization is performed until the end, when clFinish is called on the command queue. While this creates an effective pipeline, the implementation has issues with buffer allocation as well as execution order, because buffers can only be released after they are no longer needed, which implies a synchronization point.

For example, issues arise if the numBuffers variable is increased to a large number, as it would be when processing a video stream. In that case, buffer allocation and memory usage become problematic because the host memory is pre-allocated and shared with the FPGA, and the example is likely to run out of memory.

Similarly, because each call to the accelerator is enqueued independently and without synchronization (out-of-order queue), the order of execution is unlikely to match the enqueue order. As a result, if the host code waits for a specific block to finish, that block might not complete until much later than expected, which effectively prevents any host-code parallelism while the accelerator is operating.
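
The out-of-order behavior comes from how the command queue is created. The following is a minimal sketch, not the tutorial's api class, of how such a queue is typically requested; with an in-order queue, commands would instead complete in enqueue order.

  #include <CL/cl.h>

  // Minimal sketch (not the tutorial's api class): request an out-of-order
  // command queue, so enqueued commands may complete in any order unless
  // events impose an explicit dependency.
  cl_command_queue make_ooo_queue(cl_context context, cl_device_id device) {
    cl_int err = CL_SUCCESS;
    cl_command_queue queue = clCreateCommandQueue(
        context, device,
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE,
        &err);
    // A real application should check err here.
    return queue;
  }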

To alleviate these issues, the OpenCL framework provides two methods of synchronization.

  • clFinish call

  • clWaitForEvents call
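
A minimal sketch of the difference between the two calls is shown below; the queue and event arguments are placeholders rather than the tutorial's api and task objects.

  #include <CL/cl.h>

  // clFinish: barrier on the whole queue. It returns only after every command
  // already enqueued on the queue has completed.
  void wait_for_everything(cl_command_queue queue) {
    clFinish(queue);
  }

  // clWaitForEvents: wait only for specific commands. Other enqueued commands
  // may still be executing when this returns.
  void wait_for_one(cl_event ev) {
    clWaitForEvents(1, &ev);
  }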

  1. Open the src/sync_host.cpp file in an editor, and look at the Execution region. To illustrate the behavior, make the following modifications to the execution loop.

    // -- Execution -----------------------------------------------------------
    
    int count = 0;
    for(unsigned int i=0; i < numBuffers; i++) {
      count++;
      tasks[i].run(api);
      if(count == 3) {
        count = 0;
        clFinish(api.getQueue());
      }
    }
    clFinish(api.getQueue());
    
  2. Compile and execute the sync_host.cpp code.

    make run TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=sync
    
  3. After the run completes, open the run summary with Vitis analyzer, then click Application Timeline in the left panel.

    vitis_analyzer sync/pass.hw.xilinx_u200_gen3x16_xdma_2_202110_1.xclbin.run_summary
    

    If you zoom in on the Application Timeline, an image similar to the following figure is displayed (clFinish_vitis.PNG).

    In the figure, the key elements are the red box named clFinish and the large gap between kernel enqueues that appears after every three invocations of the accelerator.

    The call to clFinish creates a synchronization point on the complete OpenCL command queue. This implies that all commands enqueued onto the given queue will have to be completed before clFinish returns control to the host program. As a result, all activities, including the buffer communication, need to be completed before the next set of three accelerator invocations can resume. This is effectively a barrier synchronization.

    While this enables a synchronization point where buffers can be released, and all processes are guaranteed to have completed, it also prevents overlap at the synchronization point.
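
    To illustrate the point about releasing buffers, the following hedged sketch (using plain OpenCL objects and clEnqueueTask rather than the tutorial's task class) shows how the buffers of a completed batch could be released at the clFinish barrier.

      #include <CL/cl.h>
      #include <cstddef>
      #include <vector>

      // Sketch only: enqueue work in batches of three, use clFinish as a
      // barrier, then release the buffers of the finished batch before the
      // next batch runs. Assumes the buffer count is a multiple of three.
      void run_batched(cl_command_queue queue, cl_kernel kernel,
                       std::vector<cl_mem>& buffers) {
        for (std::size_t i = 0; i < buffers.size(); ++i) {
          clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffers[i]);
          clEnqueueTask(queue, kernel, 0, nullptr, nullptr);
          if ((i + 1) % 3 == 0) {
            clFinish(queue);                    // barrier: the batch is fully done
            for (std::size_t j = i - 2; j <= i; ++j)
              clReleaseMemObject(buffers[j]);   // safe to release finished buffers
          }
        }
        clFinish(queue);                        // drain any remaining commands
      }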

  4. Look at an alternative synchronization scheme, where synchronization is based on the completion of an earlier accelerator invocation. Edit the sync_host.cpp file to change the execution loop as follows.

      // -- Execution -----------------------------------------------------------
    
      for(unsigned int i=0; i < numBuffers; i++) {
        if(i < 3) {
          tasks[i].run(api);
        } else {
          tasks[i].run(api, tasks[i-3].getDoneEv());
        }
      }
      clFinish(api.getQueue());
    
  5. Recompile the application, rerun the program, and review the run summary in Vitis analyzer:

    make run TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=sync
    vitis_analyzer sync/pass.hw.xilinx_u200_gen3x16_xdma_2_202110_1.xclbin.run_summary
    

    If you zoom in on the Application Timeline, an image similar to the following figure is displayed (clEventSync_vitis.PNG).

    In the later part of the timeline, there are five executions of pass without any unnecessary gaps. Even more telling are the data transfers at the point of the marker: three packages have been sent to the accelerator for processing, and one has already been received back. Because you synchronized the next scheduling of Write/Execute/Read on the completion of the first accelerator invocation, you now observe another write operation before the third pass has even completed. This clearly shows overlapping execution.

    In this case, you synchronized each new accelerator execution on the completion of the execution scheduled three invocations earlier, using the following event synchronization in the run method of the task class.

        if(prevEvent != nullptr) {
          clEnqueueMigrateMemObjects(api.getQueue(), 1, &m_inBuffer[0],
                                     0, 1, prevEvent, &m_inEv);
        } else {
          clEnqueueMigrateMemObjects(api.getQueue(), 1, &m_inBuffer[0],
                                     0, 0, nullptr, &m_inEv);
        }
    

    While this is the common synchronization scheme between enqueued objects in OpenCL, you can alternatively synchronize the host code by calling the following API.

      clWaitForEvents(1, prevEvent);
    

    This allows for additional host code computation while the accelerator is operating on earlier enqueued tasks. This is not explored further here, but rather left to you as an additional exercise.
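
    As a hint for that exercise, the following sketch (with a made-up host-side computation and placeholder names) shows the pattern: wait only on the event of an earlier invocation, and post-process its output while later invocations are still running on the accelerator.

      #include <CL/cl.h>
      #include <cstddef>

      // Sketch only: stand-in for whatever host-side processing the
      // application wants to overlap with the accelerator.
      static double sum_result(const float* data, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += data[i];
        return s;
      }

      // Block only until one earlier task has completed, then work on its
      // output while later enqueued tasks keep running on the device.
      double consume_when_ready(cl_event doneEv, const float* hostResult,
                                std::size_t n) {
        clWaitForEvents(1, &doneEv);
        return sum_result(hostResult, n);
      }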

    NOTE: Because this synchronization scheme allows the host code to operate after the completion of an event, it is possible to code up a buffer management scheme that avoids running out of memory for long-running applications.
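
    One possible shape for such a scheme is sketched below, with a hypothetical enqueue_task_using_slot() helper standing in for the Write/Execute/Read sequence: keep a fixed pool of buffer slots and wait on a slot's completion event before reusing it, so memory use stays bounded no matter how many iterations run.

      #include <CL/cl.h>
      #include <vector>

      // Hypothetical helper: enqueues one Write/Execute/Read sequence using
      // the buffers of the given slot and returns its completion event.
      cl_event enqueue_task_using_slot(cl_command_queue queue, unsigned slot);

      // Sketch only: at most poolSize tasks (and buffer sets) are in flight
      // at any time, so memory use is bounded for long-running applications.
      void run_bounded(cl_command_queue queue, unsigned numIterations) {
        const unsigned poolSize = 3;
        std::vector<cl_event> done(poolSize, nullptr);

        for (unsigned i = 0; i < numIterations; ++i) {
          unsigned slot = i % poolSize;
          if (done[slot] != nullptr) {
            clWaitForEvents(1, &done[slot]);  // previous user of this slot is done
            clReleaseEvent(done[slot]);       // its buffers can now be reused
          }
          done[slot] = enqueue_task_using_slot(queue, slot);
        }
        clFinish(queue);                      // drain the remaining work
        for (cl_event ev : done)
          if (ev != nullptr) clReleaseEvent(ev);
      }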