Lab 3: OpenCL API Buffer Size - 2022.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID: XD099
Release Date: 2022-12-01
Version: 2022.2 English

In this final section of the tutorial, you will investigate how the buffer size impacts total performance. The focus is on the host code in src/buf_host.cpp.

The execution loop is the same as it was at the end of the prior section. However, in the src/buf_host.cpp file, the number of tasks to be processed has been increased to 100. The goal of this change is to issue 100 accelerator calls, transferring 100 buffers to the device and reading 100 buffers back. This enables the tool to produce a more accurate average throughput estimate per transfer.
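
A minimal sketch of such an execution loop is shown below; the queue, kernel, and buffer names are illustrative and do not reflect the actual identifiers in src/buf_host.cpp.

    // Sketch (hypothetical names): enqueue num_tasks write/execute/read
    // operations so throughput can be averaged over many transfers.
    #include <CL/cl2.hpp>
    #include <vector>

    void run_tasks(cl::CommandQueue &q, cl::Kernel &krnl,
                   std::vector<cl::Buffer> &in_bufs,
                   std::vector<cl::Buffer> &out_bufs,
                   int num_tasks /* 100 in this lab */) {
        for (int i = 0; i < num_tasks; ++i) {
            krnl.setArg(0, in_bufs[i]);
            krnl.setArg(1, out_bufs[i]);
            // Transfer the input buffer to the device.
            q.enqueueMigrateMemObjects({in_bufs[i]}, 0 /* host -> device */);
            // Run the accelerator once per buffer.
            q.enqueueTask(krnl);
            // Read the result buffer back to the host.
            q.enqueueMigrateMemObjects({out_bufs[i]}, CL_MIGRATE_MEM_OBJECT_HOST);
        }
        q.finish(); // wait for all tasks to complete
    }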

A second command line option (SIZE=) has also been added to specify the buffer size for a specific run. The actual buffer size transferred during a single write or read is 2 raised to the power of the specified argument (pow(2, argument)), multiplied by 512 bits.
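
For example, with SIZE=14 (the value used in the step below), a single transfer is pow(2, 14) × 64 bytes = 1 MB, because 512 bits equal 64 bytes. A short sketch of this calculation, with an illustrative function name, follows.

    #include <cstddef>

    // Sketch (illustrative): each transfer consists of pow(2, SIZE) 512-bit
    // words, so the size in bytes is pow(2, SIZE) * 64.
    std::size_t bytes_per_transfer(int size_arg) {
        return (static_cast<std::size_t>(1) << size_arg) * (512 / 8);
    }
    // Example: bytes_per_transfer(14) == 16384 * 64 == 1048576 bytes (1 MB)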

  1. Compile and run the host code.

    make run TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 SIZE=14 LAB=buf
    

    The SIZE value is passed as the second argument to the host executable.

    NOTE: If SIZE is not specified, it defaults to SIZE=14.

    This allows the code to execute the implementation with different buffer sizes and to measure throughput by monitoring the total compute time. This value is calculated in the test bench and reported through the FPGA Throughput output; a sketch of such a measurement is shown after this list.

  2. After the run completes, open the run summary in Vitis analyzer, and then click Application Timeline in the left side panel.

    vitis_analyzer buf/pass.hw.xilinx_u200_gen3x16_xdma_2_202110_1.xclbin.run_summary
    

    Examine the timeline to review the operation.

  3. To ease sweeping over different buffer sizes, an additional makefile target was created, which can be run using the following command.

    make TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 bufRunSweep
    

    NOTE: The sweeping script (auxFiles/run.py) requires a Python installation, which is available on most systems.

    Executing the sweep runs the design and records the FPGA throughput for buffer SIZE arguments from 8 to 19. The measured throughput values, together with the actual number of bytes per transfer, are written to the buf/results.csv file, which is printed at the end of the makefile execution.

    When you analyze these numbers, you should see a step function similar to the one shown in the following image.
    [Figure stepFunc.PNG: FPGA Throughput (MB/s) versus bytes per transfer]

    This image shows that the buffer size (x-axis, bytes per transfer) clearly impacts performance (y-axis, FPGA Throughput in MB/s), and that throughput starts to level out at a buffer size of around 2 MB.

    NOTE: This image is created with gnuplot from the results.csv file; if gnuplot is found on your system, the plot is displayed automatically after you run the sweep.
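
The throughput measurement referenced in step 1 can be sketched as follows; this assumes std::chrono timing around the complete execution loop and uses illustrative names rather than the actual code in src/buf_host.cpp.

    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <iostream>

    // Sketch (illustrative): time all tasks and report throughput as the
    // total number of bytes moved divided by the elapsed compute time.
    double report_throughput(std::size_t bytes_per_transfer, int num_tasks,
                             const std::function<void()> &run_all_tasks) {
        auto start = std::chrono::high_resolution_clock::now();
        run_all_tasks(); // enqueue and finish all write/execute/read tasks
        auto end = std::chrono::high_resolution_clock::now();
        double seconds = std::chrono::duration<double>(end - start).count();
        double mbytes = double(bytes_per_transfer) * num_tasks / (1024.0 * 1024.0);
        double mbps = mbytes / seconds; // analogous to the FPGA Throughput output
        std::cout << "FPGA Throughput: " << mbps << " MB/s" << std::endl;
        return mbps;
    }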

With respect to host code performance, this step function identifies a clear relationship between buffer size and total execution speed. As this example shows, when the default implementation of an algorithm operates on a small amount of input data, it is easy to alter the buffer size. The buffer size does not have to be dynamic and determined at run time, as done here, but the principle remains the same: instead of transmitting a single set of values for one invocation of the algorithm, you transmit multiple input values and repeat the algorithm execution within a single invocation of the accelerator.
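
As a hedged illustration of this idea, the following hypothetical kernel (not part of this lab's code) processes a whole batch of input sets packed into one larger buffer, instead of a single value set per accelerator call.

    // Hypothetical batched kernel: the host packs num_sets input sets into a
    // single buffer, and the kernel loops over all of them in one invocation.
    extern "C" void algo_batched(const int *in, int *out,
                                 int values_per_set, int num_sets) {
        for (int s = 0; s < num_sets; ++s) {
            for (int v = 0; v < values_per_set; ++v) {
                int idx = s * values_per_set + v;
                out[idx] = in[idx] * 2; // placeholder for the real algorithm body
            }
        }
    }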