Lab 1: Pipelined Kernel Execution Using Out-of-Order Event Queue - 2023.1 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID: XD099
Release Date: 2023-08-02
Version: 2023.1 English

In this first exercise, you will look at pipelined kernel execution.

In this build, you are only dealing with a single instance of the kernel, or compute unit (CU), running in the hardware. However, as previously described, running a kernel also requires transmitting data to and from the CU. These activities should be pipelined to minimize the time the kernel sits idle waiting on the host application.

Open the host code, src/pipeline_host.cpp, and look at the execution loop starting at line 55.

  // -- Execution -----------------------------------------------------------

  for(unsigned int i=0; i < numBuffers; i++) {
    tasks[i].run(api);
  }
  clFinish(api.getQueue());

In this case, the code schedules all the buffers and lets them execute. Only at the end does it actually synchronize and wait for completion.

  1. Compile the host code (src/pipeline_host.cpp) using the following command.

    make TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=pipeline
    

    Compared to the kernel compilation time, this build step takes very little time.

  2. You are now ready to run the application.

    The runtime data is generated by the host program due to settings specified in the xrt.ini file, as described in Enabling Profiling in Your Application. This file is found at ./reference-files/auxFiles/xrt.ini, and is copied to the runPipeline directory by the make run command.

    The xrt.ini file contains the following settings:

    [Debug]
    opencl_trace=true
    device_trace=coarse
    

    Use the following command to run the application.

    make run TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=pipeline
    

    After the run completes, open the Application Timeline using the Vitis analyzer, then select the Application Timeline located in the left side panel.

    vitis_analyzer pipeline/xrt.run_summary
    

    NOTE: In the 2023.1 release this command opens the Analysis view of the new Vitis Unified IDE and loads the run summary as described in Working with the Analysis View. You can navigate to the various reports using the left pane of the Analysis view or by clicking on the links provided in the summary report.

    The Application Timeline view illustrates the full run of the executable. The three main sections of the timeline are:

    • OpenCL API Calls

    • Data Transfer

    • Kernel Enqueues

  3. Zoom in on the section illustrating the actual accelerator execution, and select one of the kernel enqueue blocks on Row 0 to see an image similar to the following figure.

     Figure: Ordered Queue

    The blue arrows identify dependencies, and you can see that every Write/Execute/Read task execution has a dependency on the previous Write/Execute/Read operation set. This effectively serializes the execution. In this case, the dependency is created by using an ordered queue.
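   In the OpenCL host API, this in-order behavior comes from the command queue's properties: a queue created without CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE executes commands in submission order. The sketch below shows how such a properties bitmask is typically assembled; the constant values are reproduced from CL/cl.h so the snippet compiles standalone, and the oooQueue flag mirrors the one in pipeline_host.cpp.

   ```cpp
#include <cstdint>
#include <iostream>

// Values reproduced from CL/cl.h (per the OpenCL spec) so this compiles standalone.
using cl_command_queue_properties = uint64_t;
constexpr cl_command_queue_properties CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE = 1 << 0;
constexpr cl_command_queue_properties CL_QUEUE_PROFILING_ENABLE              = 1 << 1;

int main() {
    bool oooQueue = false;  // matches the default in pipeline_host.cpp

    cl_command_queue_properties props = CL_QUEUE_PROFILING_ENABLE;
    if (oooQueue)
        props |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;

    // In a real host program this bitmask is passed when the command queue
    // is created, e.g. clCreateCommandQueue(context, device, props, &err).
    std::cout << ((props & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE)
                      ? "out-of-order" : "in-order")
              << std::endl;
}
   ```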

  4. Open the file src/pipeline_host.cpp in a text editor.

    In the Common Parameters section, shown at line 27 of pipeline_host.cpp, the oooQueue parameter is set to false.

     bool         oooQueue                 = false;
    

    You can break this dependency by changing the out-of-order parameter to true.

     bool         oooQueue                 = true;
    
  5. Recompile the application, rerun the program, and review the run_summary in the Vitis analyzer:

    make run TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=pipeline
    vitis_analyzer pipeline/xrt.run_summary
    

    If you zoom in on the Application Timeline and click any kernel enqueue, you should see results similar to the following figure.

    Figure: Out-of-Order Queue

    If you select other pass kernel enqueues, you will see that all 10 now show dependencies only within their own Write/Execute/Read group. This allows the read and write operations to overlap with the execution, effectively pipelining the software write, execute, and read. This can considerably improve overall performance because the communication overhead occurs concurrently with the execution of the accelerator.
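   A back-of-the-envelope model shows why the overlap pays off. The numbers below are illustrative, not measured; the sketch assumes ten buffers, that the kernel execution is the slowest stage, and that transfers overlap fully with compute once the queue is out of order.

   ```cpp
#include <algorithm>
#include <iostream>

int main() {
    const int N = 10;  // number of buffers, as in this lab

    // Illustrative per-task costs (arbitrary time units, not measurements).
    const int write = 2, execute = 5, read = 2;

    // In-order queue: each Write/Execute/Read set waits for the previous one.
    int serialized = N * (write + execute + read);

    // Out-of-order queue: the slowest stage dominates the steady state;
    // only the first write and the last read remain exposed.
    int pipelined = write + N * std::max({write, execute, read}) + read;

    std::cout << serialized << " " << pipelined << std::endl;  // 90 54
}
   ```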