From the above calculations, it is clear that you need to improve the performance of a baseline hardware implementation by 280x to process 60 FPS. One path you can take is to unroll the inner loops and pipeline them. For example, by unrolling the innermost loop, which iterates 15 times, you can improve performance by 15x. With that one change alone, the hardware performance is already better than the software-only implementation, but it is not yet good enough to meet the required video performance. Another approach is to unroll both inner loops for a 15 x 15 = 225x gain, which means a throughput of one output pixel per cycle. The performance and memory bandwidth requirements will be as follows:
Throughput = Fmax * Pixels produced per cycle = 300 MHz * 1 = 300 MPixels/s = 300 MB/s (at 1 byte per pixel)
Output Memory Bandwidth = Fmax * Pixels produced per cycle = 300 MB/s
Input Memory Bandwidth = Fmax * Input pixels read per output pixel = 300 MHz * 225 = 67.5 GB/s
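To make the unrolling concrete, here is a minimal HLS C++ sketch of what such a loop nest could look like. It is illustrative only: the function name, loop labels, and interface are hypothetical and not taken from the lab sources, and it assumes an 8-bit single-channel image with a 15x15 coefficient matrix.

```cpp
// Hypothetical HLS C++ sketch: both inner (coefficient) loops of a 15x15
// convolution are unrolled so one output pixel is produced per clock cycle.
// Function name, labels, and interface are illustrative only.
#define K 15  // convolution kernel dimension

void conv_filter(const unsigned char *in, const char *coeffs,
                 unsigned char *out, int width, int height) {
row_loop:
    for (int y = 0; y < height - K + 1; ++y) {
    col_loop:
        for (int x = 0; x < width - K + 1; ++x) {
#pragma HLS PIPELINE II=1  // target: one output pixel per cycle
            int acc = 0;
        ky_loop:
            for (int ky = 0; ky < K; ++ky) {
#pragma HLS UNROLL  // 15x
            kx_loop:
                for (int kx = 0; kx < K; ++kx) {
#pragma HLS UNROLL  // 15x, for 225 multiply-accumulates per cycle in total
                    acc += in[(y + ky) * width + (x + kx)] * coeffs[ky * K + kx];
                }
            }
            out[y * width + x] = (unsigned char)acc;  // sketch omits scaling/clamping
        }
    }
}
```

Note that, as written, each pipelined iteration reads all 225 input pixels directly from the input array, which is exactly what drives the 67.5 GB/s input bandwidth estimate above.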
The required output memory bandwidth scales linearly with throughput, but the input memory bandwidth has grown enormously and might not be sustainable. A closer look at the convolution filter reveals that it is not necessary to read all 225 (15x15) pixels from the input memory for every output pixel. An innovative caching scheme can be built to avoid such extensive use of input memory bandwidth.
The convolution filter belongs to a class of kernels known as stencil kernels, which can be optimized for extensive input data reuse, substantially reducing memory bandwidth requirements. With a caching scheme, you can bring the required input bandwidth down to the same level as the output bandwidth, which is around 300 MB/s. With the optimized data reuse scheme, even when both inner loops are unrolled, only one input pixel needs to be read per output pixel on average, hence an input memory bandwidth of 300 MB/s.
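A common way to realize this data reuse in HLS is a combination of line buffers (caching the previous K-1 image rows on-chip) and a KxK window of registers. The sketch below is a hypothetical illustration of this stencil caching pattern, not the lab's actual kernel; it assumes a maximum line width of 1920 pixels and reuses the illustrative names from the previous sketch.

```cpp
// Hypothetical HLS C++ sketch of the stencil caching scheme: K-1 on-chip line
// buffers plus a KxK sliding window, so only ONE new input pixel is read from
// external memory per output pixel. Names are illustrative only.
#define K 15

void conv_cached(const unsigned char *in, const char *coeffs,
                 unsigned char *out, int width, int height) {
    static unsigned char line_buf[K - 1][1920];  // caches the last K-1 rows
#pragma HLS ARRAY_PARTITION variable=line_buf complete dim=1
    unsigned char window[K][K];                  // KxK sliding window registers
#pragma HLS ARRAY_PARTITION variable=window complete dim=0

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
#pragma HLS PIPELINE II=1
            // Read exactly one new pixel from external memory per iteration.
            unsigned char new_pix = in[y * width + x];

            // Shift the window left by one column.
            for (int r = 0; r < K; ++r)
                for (int c = 0; c < K - 1; ++c)
                    window[r][c] = window[r][c + 1];

            // Fill the new rightmost column from the cached rows...
            for (int r = 0; r < K - 1; ++r)
                window[r][K - 1] = line_buf[r][x];
            window[K - 1][K - 1] = new_pix;

            // ...then rotate the line buffers to include the new pixel.
            for (int r = 0; r < K - 2; ++r)
                line_buf[r][x] = line_buf[r + 1][x];
            line_buf[K - 2][x] = new_pix;

            // Once the window is fully populated, compute one output pixel.
            if (y >= K - 1 && x >= K - 1) {
                int acc = 0;
                for (int r = 0; r < K; ++r)
                    for (int c = 0; c < K; ++c)
                        acc += window[r][c] * coeffs[r * K + c];
                out[(y - (K - 1)) * width + (x - (K - 1))] = (unsigned char)acc;
            }
        }
    }
}
```

Each pipelined iteration now reads exactly one new pixel from external memory, so the input bandwidth matches the output bandwidth at about 300 MB/s.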
Although you can reduce the input bandwidth, the achieved throughput will still be only 300 MB/s, which is less than the required 373 MB/s (1920 x 1080 pixels x 3 color channels x 60 FPS ≈ 373 MB/s). To deal with this, you can look for other ways to increase the throughput of the hardware. One approach is to duplicate kernel instances, also called compute units (CUs). In terms of heterogeneous computing, you increase the number of compute units so that you can process data in parallel. In the convolution filter case, you can process the three color channels (YUV) on separate compute units. When using three compute units, one for each color channel, the expected performance summary is as follows:
Throughput (estimated) = Performance of a Single Compute Unit * Number of Compute Units = 300 * 3 = 900 MB/s
Acceleration Against Software Implementation = 900 / 14.5 = 62x
Kernel Latency (per image, per color channel) = (1920 * 1080 pixels) / (300 MPixels/s) = 6.9 ms
Video Processing Rate = 1 / Kernel Latency = 144 FPS
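For reference, a host program that dispatches the three color planes to three compute units could look like the following hypothetical sketch using the XRT native C++ API. The xclbin path, the kernel name conv2d, the compute unit names, and the kernel argument list are all assumptions for illustration; the actual host code is covered in the next lab module.

```cpp
// Hypothetical host-side sketch (XRT native C++ API): dispatch the Y, U, and V
// planes of one frame to three separate compute units in parallel. The xclbin
// path, kernel name "conv2d", CU names, and argument list are assumptions.
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>

int main() {
    const size_t plane_bytes = 1920 * 1080;   // one 8-bit color plane

    auto device = xrt::device(0);
    auto uuid   = device.load_xclbin("conv2d.xclbin");

    // One kernel handle per compute unit: "kernel:{cu_name}" selects the CU.
    xrt::kernel cu[3] = {
        xrt::kernel(device, uuid, "conv2d:{conv2d_1}"),
        xrt::kernel(device, uuid, "conv2d:{conv2d_2}"),
        xrt::kernel(device, uuid, "conv2d:{conv2d_3}")};

    xrt::run runs[3];
    xrt::bo in_bo[3], out_bo[3];
    for (int ch = 0; ch < 3; ++ch) {          // ch = Y, U, V plane
        in_bo[ch]  = xrt::bo(device, plane_bytes, cu[ch].group_id(0));
        out_bo[ch] = xrt::bo(device, plane_bytes, cu[ch].group_id(1));
        // ... fill in_bo[ch] with the channel data, then sync to the device:
        in_bo[ch].sync(XCL_BO_SYNC_BO_TO_DEVICE);
        // Launch without waiting, so the three channels run concurrently.
        runs[ch] = cu[ch](in_bo[ch], out_bo[ch], 1920, 1080);
    }
    for (int ch = 0; ch < 3; ++ch) {          // wait for all channels
        runs[ch].wait();
        out_bo[ch].sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    }
    return 0;
}
```

Because the three runs are launched back to back without waiting, the three channels of a frame are processed concurrently, which is why the frame rate is 1 / kernel latency rather than one third of it.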
In this lab, you have learned about:
The basics of the convolution filter
Profiling the performance of a software-only implementation
Estimating the performance and requirements for hardware implementations
Given these performance estimates, the architecture selection, and the implementation details, the next lab shows how you can design the kernel hardware and end up with an accelerated application whose performance comes very close to these estimates.
Next Lab Module: Design and Analysis of Hardware Kernel Module for 2-D Video Convolution Filter
Copyright © 2020–2023 Advanced Micro Devices, Inc.