Baseline Hardware Implementation Performance - 2022.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

The simplest hardware implementation is obtained by passing the existing kernel source code through the Vitis HLS tool unchanged. The tool will pipeline the innermost loop with an initiation interval (II) of 1, so the design performs only one multiply-accumulate (MAC) per clock cycle. The performance can be estimated from the MAC count as follows:

MACs per Cycle        = 1
MACs per Output Pixel = 15*15 = 225
Hardware Fmax (MHz)   = 300
Throughput            = 300/225 = 1.33 MPixels/s = 1.33 MB/s
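
The estimate above can be reproduced with a few lines of Python. This is a minimal sketch and not part of the tutorial sources: the function name `baseline_throughput_mpixels` and its parameters are hypothetical, and the 15x15 filter size is taken from the convolution filter code this tutorial discusses. It assumes one MAC is issued every cycle and ignores pipeline fill and loop overhead.

```python
def baseline_throughput_mpixels(fmax_mhz, macs_per_cycle, filter_height, filter_width):
    """Estimate throughput in MPixels/s for a kernel limited by its MAC rate."""
    # Each output pixel needs one MAC per filter coefficient (15 * 15 = 225 here).
    macs_per_pixel = filter_height * filter_width
    # MHz * MACs per cycle gives millions of MACs per second.
    return fmax_mhz * macs_per_cycle / macs_per_pixel


# Baseline values from the text: 300 MHz clock, one MAC per cycle, 15x15 filter.
throughput = baseline_throughput_mpixels(300.0, 1, 15, 15)
print(f"Baseline throughput: {throughput:.2f} MPixels/s")
```

Raising `macs_per_cycle` in this sketch shows how the same estimate scales when more MACs run in parallel.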

Here the hardware clock frequency is assumed to be 300 MHz because, in general, this is the maximum supported clock frequency for the Xilinx Alveo U200 Data Center card when using the Vitis HLS based design flow. The baseline hardware implementation therefore delivers 1.33 MB/s. The convolution filter source code shown above can also be used to estimate how much memory bandwidth is needed at the input and output to sustain this throughput. The two inner loops perform 225 (15*15) reads at the input to calculate a single output pixel, so:

Output Memory Bandwidth = Throughput = 1.33 MB/s
Input Memory Bandwidth  = Throughput * 225 = 300 MB/s
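
The same bandwidth arithmetic, written as a small Python sketch. The helper `memory_bandwidth_mb_s` is a hypothetical name introduced for illustration. It assumes one byte per pixel (implied by the text equating MPixels/s with MB/s) and that no input data is reused between output pixels, as in the baseline loop nest.

```python
def memory_bandwidth_mb_s(throughput_mb_s, reads_per_output_pixel):
    """Return (input, output) memory bandwidth in MB/s with no input reuse."""
    # One byte is written for every output pixel produced.
    output_bw = throughput_mb_s
    # Every output pixel triggers a fresh read of each input sample it uses.
    input_bw = throughput_mb_s * reads_per_output_pixel
    return input_bw, output_bw


# Baseline throughput of 300/225 MB/s and 15*15 reads per output pixel.
input_bw, output_bw = memory_bandwidth_mb_s(300.0 / 225, 15 * 15)
print(f"Input: {input_bw:.0f} MB/s, Output: {output_bw:.2f} MB/s")
```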

For the baseline implementation, the memory bandwidth requirements are trivial, given that the PCIe and device DDR memory bandwidths on Xilinx acceleration cards and boards are on the order of tens of GB/s. As seen in previous sections, the throughput required for 60 FPS 1080p HD video is 373 MB/s, and the software implementation achieves 14.5 MB/s. So it is clear that, to meet the performance requirement:

Acceleration Factor to Meet 60FPS Performance = 373/1.33 = 280x
Acceleration Factor to Meet SW Performance    = 14.5/1.33 = 10.9x
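
As a quick check, the two acceleration factors can be recomputed in Python. This is an illustrative sketch: the function and constant names are hypothetical, and the three throughput figures are the rounded values quoted in this section.

```python
def acceleration_factor(target_mb_s, baseline_mb_s):
    """How many times faster than the baseline a design must be to reach the target."""
    return target_mb_s / baseline_mb_s


BASELINE_MB_S = 1.33   # baseline hardware estimate, rounded as in the text
VIDEO_MB_S = 373.0     # 60 FPS 1080p requirement quoted in the text
SOFTWARE_MB_S = 14.5   # software performance quoted in the text

to_meet_video = acceleration_factor(VIDEO_MB_S, BASELINE_MB_S)
to_meet_software = acceleration_factor(SOFTWARE_MB_S, BASELINE_MB_S)
print(f"60 FPS: {to_meet_video:.0f}x, software parity: {to_meet_software:.1f}x")
```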