In this first exercise, you will look at a pipelined kernel execution.
In this build, you are dealing with only a single instance of the kernel, or compute unit (CU), running in the hardware. However, as previously described, running a kernel also requires transmitting data to and from the CU. These activities should be pipelined to minimize the time the kernel sits idle waiting on the host application.
Open the host code, `src/pipeline_host.cpp`, and look at the execution loop starting at line 55.
```cpp
// -- Execution -----------------------------------------------------------
for (unsigned int i = 0; i < numBuffers; i++) {
  tasks[i].run(api);
}
clFinish(api.getQueue());
```
In this case, the code schedules all the buffers and lets them execute. Only at the end does it actually synchronize and wait for completion.
Compile and run the host code (`src/pipeline_host.cpp`) using the following command:

```
make TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=pipeline
```
Compared to the kernel compilation time, this build step takes very little time.
You are now ready to run the application.
The runtime data is generated by the host program due to settings specified in the `xrt.ini` file, as described in Enabling Profiling in Your Application. This file is found at `./reference-files/auxFiles/xrt.ini`, and is copied to the `runPipeline` directory by the `make run` command. The `xrt.ini` file contains the following settings:

```
[Debug]
opencl_trace=true
device_trace=coarse
```
Use the following command to run the application.
```
make run TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=pipeline
```
After the run completes, open the Application Timeline using the Vitis analyzer, then select the Application Timeline report in the left-side panel.

```
vitis_analyzer pipeline/xrt.run_summary
```
NOTE: In the 2023.1 release this command opens the Analysis view of the new Vitis Unified IDE and loads the run summary as described in Working with the Analysis View. You can navigate to the various reports using the left pane of the Analysis view or by clicking on the links provided in the summary report.
The Application Timeline view illustrates the full run of the executable. The three main sections of the timeline are:
- OpenCL API Calls
- Data Transfer
- Kernel Enqueues
Zoom in on the section illustrating the actual accelerator execution, and select one of the kernel enqueue blocks on Row 0 to see an image similar to the following figure.

The blue arrows identify dependencies, and you can see that every Write/Execute/Read task execution has a dependency on the previous Write/Execute/Read operation set. This effectively serializes the execution. In this case, the dependency is created by using an ordered queue.
Open the file `src/pipeline_host.cpp` in a text editor. In the Common Parameters section, as shown at line 27 of `pipeline_host.cpp`, the `oooQueue` parameter is set to `false`:

```cpp
bool oooQueue = false;
```

You can break this dependency by changing the out-of-order parameter to `true`:

```cpp
bool oooQueue = true;
```
Recompile the application, rerun the program, and review the run_summary in the Vitis analyzer:
```
make run TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=pipeline
vitis_analyzer pipeline/xrt.run_summary
```
If you zoom in on the Application Timeline, and click any kernel enqueue, you should see results similar to the following figure.
If you select other pass kernel enqueues, you will see that all 10 are now showing dependencies only within the Write/Execute/Read group. This allows the read and write operations to overlap with the execution, and you are effectively pipelining the software write, execute, and read. This can considerably improve the overall performance because the communication overhead is occurring concurrently with the execution of the accelerator.