In the Vitis environment, the host application can be written in C++ using the Xilinx Runtime (XRT) native C++ API. Just as the kernel code must be re-architected to enable parallelism in hardware and optimize memory accesses, the host program must be written carefully to achieve high performance over a CPU-only implementation. Even a kernel optimized to meet the required performance will not deliver optimal application performance if the CPU and the FPGA are poorly utilized. Here are some considerations when creating a host program (a minimal sketch of such a program follows the list below):
- Reducing the overhead of kernel enqueuing: Dispatching the commands and arguments from the host to the kernel adds overhead to every kernel enqueue. You can reduce the impact of this overhead by minimizing the number of times the host needs to enqueue the kernel.
- Maximizing the data transfer bandwidth between the host and device: Data transfers between host and device memory should be large enough to make full use of the PCIe bandwidth. At the same time, buffers should not be so large that kernel execution is delayed waiting for the transfer to complete.
- Data availability for the kernel compute: The host should send data to the FPGA device memory as early as possible so that the compute can start and the kernel is not starved for data.
- Overlapping data transfers with kernel computation: Applications such as database analytics or video processing have data sets much larger than the global device memory available on the acceleration card, so the data must be processed in blocks. Techniques that overlap the data transfers with the computation are critical to achieving high performance for these applications.
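For reference, the following is a minimal sketch of such a host program using the XRT native C++ API. The xclbin file name `vadd.xclbin`, the kernel name `vadd`, and its argument list are placeholders for illustration; adapt them to your own design.

```cpp
#include <vector>
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"
#include "xrt/xrt_bo.h"

int main() {
    constexpr size_t elements   = 1024 * 1024;
    constexpr size_t size_bytes = elements * sizeof(int);

    // Open the device and load the binary once; both are relatively expensive.
    xrt::device device(0);
    auto uuid = device.load_xclbin("vadd.xclbin");   // placeholder xclbin name
    xrt::kernel krnl(device, uuid, "vadd");          // placeholder kernel name

    // Allocate device buffers in the memory banks connected to the kernel arguments.
    xrt::bo in0(device, size_bytes, krnl.group_id(0));
    xrt::bo in1(device, size_bytes, krnl.group_id(1));
    xrt::bo out(device, size_bytes, krnl.group_id(2));

    std::vector<int> a(elements, 1), b(elements, 2), c(elements);

    // Host-to-device transfers: make the data available before the kernel starts.
    in0.write(a.data());
    in1.write(b.data());
    in0.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    in1.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    // Launch the kernel and wait for it to complete.
    auto run = krnl(in0, in1, out, static_cast<int>(elements));
    run.wait();

    // Device-to-host transfer of the results.
    out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    out.read(c.data());
    return 0;
}
```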
You may need to try different buffer sizes for the data movement between host and device memory to optimize the application performance. With the Vitis tools, you can explore the application using an emulation target for estimated performance results and finally run it on the hardware for accurate performance results. After running the application, you can use the Vitis analyzer to visualize the data movement between host and FPGA memory as well as between the kernel and device memory.
The following snapshot is based on the Timeline Trace (host application timeline) as seen in the Analysis view of the Vitis analyzer. This view displays the data movement along the horizontal "Time" axis, making it easy to identify potential performance improvement opportunities.
The "Data Transfer:Read" row shows when the host is reading data from the device memory, and the "Data Transfer:Write" row shows when the host is writing data to the device memory. The "Kernel Enqueue" row shows when the kernel is executing.
Just as task-level parallelism can be achieved within the kernel, similar parallelism can be achieved between the host CPU and the FPGA. When both are active at the same time, the application achieves higher performance. As you write the host code, look for ways to keep the FPGA and the CPU busy simultaneously. This is usually a property of the algorithm, and the designer has to structure the host program carefully to maximize the utilization of both the FPGA and the CPU.
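As an illustration, the XRT native API starts a kernel asynchronously: `xrt::kernel::operator()` returns an `xrt::run` handle immediately, so the host can keep doing useful work before calling `wait()`. The sketch below assumes the kernel and buffers from the earlier example; `do_cpu_work()` is a hypothetical placeholder for host-side computation.

```cpp
#include "xrt/xrt_bo.h"
#include "xrt/xrt_kernel.h"

// Hypothetical placeholder for useful host-side work, for example
// preparing the next batch of input data.
void do_cpu_work();

// Sketch: overlap one kernel execution with host-side work. Assumes the
// kernel and buffers were created as in the earlier sketch.
void compute_overlapped(xrt::kernel& krnl, xrt::bo& in0, xrt::bo& in1,
                        xrt::bo& out, int elements) {
    // Starting the kernel is non-blocking: operator() returns an xrt::run
    // handle as soon as the run has been dispatched to the device.
    auto run = krnl(in0, in1, out, elements);

    // The CPU is free while the FPGA computes.
    do_cpu_work();

    // Block only when the FPGA result is actually needed.
    run.wait();
    out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
}
```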
Looking at the Application Timeline above, it is easy to identify the gaps where the CPU is idle. Similarly, the FPGA is idle while the host is transferring data. The host program therefore needs to feed data to the device memory as soon as possible so that the kernel on the FPGA is not starved for data. Here, a single large buffer is sent from the host to the FPGA device memory before computing starts. The FPGA is idle until the data is completely transferred to the device memory, and the CPU is idle while the FPGA is executing the compute function. For this application, sending one large buffer is not required; the application can achieve better results by overlapping the host transfer with the kernel compute.
By changing the host code to split the data into multiple chunks, the kernel compute can be started earlier and the data transfer between the host and the FPGA can be hidden. The host data transfers and the accelerated function on the FPGA can then run in parallel, improving overall application performance, as sketched below. This technique is demonstrated and explained in detail in the Bloom Filter Tutorial.
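The following sketch shows one way to implement this chunked, overlapped scheme with the XRT native C++ API using two sets of buffers (double buffering). The kernel interface `(in, out, size)`, the kernel handle `krnl`, and the helper function name are assumptions for illustration, not the actual code from the Bloom Filter Tutorial.

```cpp
#include <vector>
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"
#include "xrt/xrt_bo.h"

// Sketch: split the input into chunks and overlap host-to-device transfers
// with kernel execution by alternating between two buffer sets.
// 'output' must be pre-sized to input.size() by the caller.
void process_in_chunks(xrt::device& device, xrt::kernel& krnl,
                       const std::vector<int>& input, std::vector<int>& output,
                       size_t chunk_elems) {
    const size_t chunk_bytes = chunk_elems * sizeof(int);
    const size_t num_chunks  = input.size() / chunk_elems;

    // Two buffer sets so chunk i+1 can be transferred while chunk i computes.
    xrt::bo in[2]  = { xrt::bo(device, chunk_bytes, krnl.group_id(0)),
                       xrt::bo(device, chunk_bytes, krnl.group_id(0)) };
    xrt::bo out[2] = { xrt::bo(device, chunk_bytes, krnl.group_id(1)),
                       xrt::bo(device, chunk_bytes, krnl.group_id(1)) };
    xrt::run runs[2];

    for (size_t i = 0; i < num_chunks; ++i) {
        size_t b = i % 2;  // alternate between the two buffer sets

        // Before reusing a buffer set, wait for the run that last used it
        // (chunk i-2) and collect its results.
        if (i >= 2) {
            runs[b].wait();
            out[b].sync(XCL_BO_SYNC_BO_FROM_DEVICE);
            out[b].read(output.data() + (i - 2) * chunk_elems);
        }

        // Transfer the next chunk and start the kernel. The transfer overlaps
        // with the kernel run still in flight on the other buffer set.
        in[b].write(input.data() + i * chunk_elems);
        in[b].sync(XCL_BO_SYNC_BO_TO_DEVICE);
        runs[b] = krnl(in[b], out[b], static_cast<int>(chunk_elems));
    }

    // Drain the runs that are still in flight after the loop.
    size_t first_pending = num_chunks > 2 ? num_chunks - 2 : 0;
    for (size_t i = first_pending; i < num_chunks; ++i) {
        size_t b = i % 2;
        runs[b].wait();
        out[b].sync(XCL_BO_SYNC_BO_FROM_DEVICE);
        out[b].read(output.data() + i * chunk_elems);
    }
}
```

With this arrangement, the blocking `sync()` call that transfers chunk i+1 over PCIe executes while the kernel is still computing chunk i, so most of the transfer time is hidden behind the compute time; smaller chunks also let the first kernel run start much earlier than a single large transfer would allow.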
The Vitis analyzer is not limited to showing the timeline trace: it also provides application guidance based on profiling, and presents the synthesis reports with timing and resource information as well as the critical path in your design. Refer to Using the Vitis Analyzer for more information on the tool.
For more information on host programming, refer to XRT Native API and the Example - Overlap of Host program and Kernel execution.