Best Practices for Acceleration with Vitis


Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393

Release Date: 2021-03-22

Version: 2020.2 English

Keep the following practices in mind when developing your application code and hardware functions in the Vitis™ core development kit.

Review the Methodology for Accelerating Applications with the Vitis Software Platform section for information about acceleration methodology.
Look to accelerate functions that have a high ratio of compute time to input and output data volume. Compute time can be greatly reduced using FPGA kernels, but data volume adds transfer latency.
Accelerate functions that have a self-contained control structure and do not require regular synchronization with the host.
Transfer large blocks of data from host to global device memory. One large transfer is more efficient than several smaller transfers. Run a bandwidth test to find the optimal transfer size.
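The reasoning behind preferring one large transfer can be illustrated with a first-order cost model. This is only a sketch under stated assumptions: the function name `transfer_time_s`, the fixed per-transfer overhead, and the constant-bandwidth term are all hypothetical simplifications, not Vitis or XRT APIs, and real numbers should come from a bandwidth test on the target platform.

```cpp
#include <cstddef>

// First-order cost model for host-to-device transfers (illustrative only).
// per_transfer_overhead_s and bandwidth_bytes_per_s are hypothetical inputs;
// measure them on the target platform with a bandwidth test.
double transfer_time_s(std::size_t total_bytes, std::size_t num_transfers,
                       double per_transfer_overhead_s,
                       double bandwidth_bytes_per_s) {
    // Every transfer pays a fixed setup cost, while the payload cost depends
    // only on the total volume, so splitting the data only adds overhead.
    return static_cast<double>(num_transfers) * per_transfer_overhead_s +
           static_cast<double>(total_bytes) / bandwidth_bytes_per_s;
}
```

With an assumed 10 µs overhead and 1 GB/s of bandwidth, moving 1 MiB in one transfer costs about 1.06 ms in this model, while splitting it into 64 transfers costs about 1.69 ms.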
Only copy data back to the host when necessary. Data written to global memory by one kernel can be read directly by another kernel. Global memory resources include PLRAM (small, but with the fastest access and lowest latency), HBM (moderate size and access speed, with some latency), and DDR (large, but with slower access and higher latency).
Take advantage of the multiple global memory resources to evenly distribute bandwidth across kernels.
Maximize bandwidth usage between kernel and global memory by performing 512-bit wide bursts.
Cache data in local memory within the kernels. Accessing local memories is much faster than accessing global memory.
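The wide-burst and local-caching points above can be sketched in kernel-style C++. This is a minimal illustration, not a reference implementation: the kernel name `add_bias`, the `Word512` struct, the buffer depth `kChunkWords`, and the bundle names are assumptions. `Word512` is a plain C++ stand-in for the 512-bit type (typically `ap_uint<512>`) so that the sketch compiles without the HLS headers; a standard compiler simply ignores the `#pragma HLS` lines.

```cpp
#include <cstddef>
#include <cstdint>

// Plain C++ stand-in for a 512-bit word (16 x 32-bit lanes). In Vitis HLS
// this role is usually played by ap_uint<512>; the struct is an assumption
// made so the sketch compiles without the HLS headers.
struct Word512 {
    std::uint32_t lane[16];
};

constexpr std::size_t kChunkWords = 64;  // hypothetical local buffer depth

// Hypothetical kernel: adds `bias` to every 32-bit lane of `in`.
extern "C" void add_bias(const Word512* in, Word512* out,
                         std::size_t num_words, std::uint32_t bias) {
#pragma HLS INTERFACE m_axi port = in bundle = gmem0
#pragma HLS INTERFACE m_axi port = out bundle = gmem1
    Word512 local[kChunkWords];  // on-chip buffer, much closer than global memory

    for (std::size_t base = 0; base < num_words; base += kChunkWords) {
        const std::size_t remaining = num_words - base;
        const std::size_t n = remaining < kChunkWords ? remaining : kChunkWords;

        // Sequential full-width reads give the tool a chance to infer a burst.
        for (std::size_t i = 0; i < n; ++i) {
#pragma HLS PIPELINE II = 1
            local[i] = in[base + i];
        }

        // All computation works on the local copy only.
        for (std::size_t i = 0; i < n; ++i) {
#pragma HLS PIPELINE II = 1
            for (int l = 0; l < 16; ++l) {
#pragma HLS UNROLL
                local[i].lane[l] += bias;
            }
        }

        // Sequential full-width writes back to global memory.
        for (std::size_t i = 0; i < n; ++i) {
#pragma HLS PIPELINE II = 1
            out[base + i] = local[i];
        }
    }
}
```

Placing `in` and `out` on separate bundles is a design choice that lets reads and writes use different memory ports. Whether the tool actually infers 512-bit bursts should be confirmed in the synthesis and guidance reports.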
In the host application, use events and non-blocking transactions to launch multiple requests in a parallel and overlapping manner.
In the FPGA, use different kernels to exploit task-level parallelism, and use multiple compute units (CUs) of the same kernel to exploit data-level parallelism. Running these in parallel further increases performance.
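For data-level parallelism, the host has to divide the input among the CUs. The helper below is a minimal sketch of one way to do that, assuming the data can be split into independent contiguous slices. `Slice` and `split_across_cus` are hypothetical names for illustration and are not part of any Vitis or XRT API.

```cpp
#include <cstddef>
#include <vector>

// One contiguous range of items assigned to a single compute unit.
struct Slice {
    std::size_t offset;
    std::size_t count;
};

// Splits total_items into num_cus near-equal contiguous slices.
// Hypothetical helper: the first (total_items % num_cus) slices get one
// extra item so that no compute unit is more than one item ahead.
std::vector<Slice> split_across_cus(std::size_t total_items,
                                    std::size_t num_cus) {
    std::vector<Slice> slices;
    if (num_cus == 0) {
        return slices;  // nothing to schedule on
    }
    const std::size_t base = total_items / num_cus;
    const std::size_t extra = total_items % num_cus;
    std::size_t offset = 0;
    for (std::size_t cu = 0; cu < num_cus; ++cu) {
        const std::size_t count = base + (cu < extra ? 1 : 0);
        slices.push_back({offset, count});
        offset += count;
    }
    return slices;
}
```

Each slice would then be passed to its own CU as a buffer offset and length, so that all CUs can run at the same time on disjoint parts of the data.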
Within the kernels, take advantage of task-level parallelism with dataflow, and of instruction-level parallelism with loop unrolling and loop pipelining, to maximize throughput.
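A minimal kernel-style sketch of these three techniques follows. It assumes a fixed problem size `kN`, and the names `load`, `scale`, `store`, and `scale_kernel` are hypothetical. Plain arrays stand in for the channels between stages so that the code compiles with a standard compiler, which ignores the `#pragma HLS` lines. Production kernels often use `hls::stream` between stages instead.

```cpp
constexpr int kN = 256;  // hypothetical fixed problem size (multiple of 4)

// Stage 1: read from global memory into an on-chip buffer.
static void load(const int* in, int buf[kN]) {
    for (int i = 0; i < kN; ++i) {
#pragma HLS PIPELINE II = 1
        buf[i] = in[i];
    }
}

// Stage 2: the outer loop is pipelined, the inner loop is unrolled by 4.
// In hardware the arrays may need partitioning to sustain four accesses
// per cycle; that tuning is left out of this sketch.
static void scale(const int a[kN], int b[kN], int factor) {
    for (int i = 0; i < kN; i += 4) {
#pragma HLS PIPELINE II = 1
        for (int j = 0; j < 4; ++j) {
#pragma HLS UNROLL
            b[i + j] = a[i + j] * factor;
        }
    }
}

// Stage 3: write the results back to global memory.
static void store(const int buf[kN], int* out) {
    for (int i = 0; i < kN; ++i) {
#pragma HLS PIPELINE II = 1
        out[i] = buf[i];
    }
}

// Hypothetical top-level kernel: DATAFLOW lets the three stages overlap.
extern "C" void scale_kernel(const int* in, int* out, int factor) {
#pragma HLS DATAFLOW
    int stage_a[kN];
    int stage_b[kN];
    load(in, stage_a);
    scale(stage_a, stage_b, factor);
    store(stage_b, out);
}
```

On a CPU the stages simply run one after another. The overlap between stages and the achieved initiation interval only show up after synthesis, so check the HLS report to confirm that the pragmas were applied as intended.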
Some Xilinx FPGAs contain multiple partitions called super logic regions (SLRs). Keep the kernel in the same SLR as the global memory bank that it accesses.
Use software and hardware emulation to validate your code frequently to make sure it is functionally correct.
Frequently review the Vitis Guidance report as it provides clear and actionable feedback regarding deficiencies in your project.