Running the Application - 2022.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID: XD099
Release Date: 2022-12-01
Version: 2022.2 English

With XRT initialized, run the application from the build directory with the following command:

./08_opencv_resize <path_to_image>

A sample image, fish.jpg, is provided in the design_source/test_data directory.

Because of the way we’ve configured the hardware in this example, your image must conform to certain requirements. Because we’re processing eight pixels per clock, your input width must be a multiple of eight.
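As a rough illustration of the width requirement, the check can be sketched in a few lines of C++. This is only a sketch under stated assumptions: `kPixelsPerClock`, `width_is_supported`, and `padded_width` are hypothetical names introduced here, not functions from the tutorial source or the Vitis Vision library, and padding is just one possible way to bring an image into conformance.

```cpp
// Hypothetical helpers for illustration; not part of the tutorial code.

// The hardware in this example is configured for eight pixels per clock.
constexpr int kPixelsPerClock = 8;

// Returns true when the image width is a positive multiple of the
// number of pixels processed per clock.
bool width_is_supported(int width, int pixels_per_clock = kPixelsPerClock) {
    return width > 0 && pixels_per_clock > 0 && width % pixels_per_clock == 0;
}

// Rounds a width up to the next multiple of pixels_per_clock, which is
// the width an image would need to be padded to (assumed workaround).
int padded_width(int width, int pixels_per_clock = kPixelsPerClock) {
    return ((width + pixels_per_clock - 1) / pixels_per_clock) * pixels_per_clock;
}
```

With these helpers, widths of 1920 and 3840 pass the check, while a width of 1921 does not and would need to be padded to 1928.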

If the width is not a multiple of eight, the program prints an error message telling you which condition was not satisfied. This is, of course, not a fundamental requirement of the library; it can process images of any resolution and with other numbers of pixels per clock. But if you can ensure that the input image meets these requirements, you can process it significantly faster.

In addition to the resized images from both the hardware and software OpenCV implementations, the program will output messages similar to this:

-- Example 8: OpenCV Image Resize and Blur --

OpenCV conversion done!  Image resized 1920x1080 to 640x360 and blurred 7x7!
Starting Xilinx OpenCL implementation...
Matrix has 3 channels
Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: ’./alveo_examples.xclbin’

OpenCV resize operation:            7.170 ms
OpenCL initialization:              275.349 ms
OCL input buffer initialization:    4.347 ms
OCL output buffer initialization:   0.131 ms
FPGA Kernel resize operation:       4.788 ms

In the previous example the CPU and the FPGA were pretty much tied for the small example. But while we’ve added a significant processing time for the CPU functions, the FPGA runtime hasn’t increased much at all!

Let’s now double the input size, going from a 1080p image to a 4K image. Change the code for this example, as we did with Example 7, and recompile.

Running the example again, we see something very interesting:

-- Example 8: OpenCV Image Resize and Blur --
OpenCV conversion done!  Image resized 1920x1080 to 3840x2160 and blurred 7x7!
Starting Xilinx OpenCL implementation...
Matrix has 3 channels
Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: ’./alveo_examples.xclbin’

OpenCV resize operation:            102.977 ms
OpenCL initialization:              250.000 ms
OCL input buffer initialization:    3.473 ms
OCL output buffer initialization:   7.827 ms
FPGA Kernel resize operation:       7.069 ms

What wizardry is this!? The CPU runtime has increased by more than 14x (from 7.170 ms to 102.977 ms), but the FPGA runtime has grown by only about 2.3 ms!

Like we said, FPGAs are _really_ good at doing things in pipelines. This algorithm isn’t I/O bound; it’s processor bound. We can decompose it to process more data faster (Amdahl’s Law) by calculating multiple pixels per clock, and by streaming from one operation to the next and doing more operations in parallel (Gustafson’s Law). We can even decompose the Gaussian Blur into individual component calculations and run those in parallel (which we have done, in the Vitis Vision library).

Now that we’re bound by computation and not bandwidth we can easily see the benefits of acceleration. If we put this in terms of FPS, our x86-class CPU instance can now process 9 frames per second while our FPGA card can handle a whopping 141. And adding additional operations will continue to bog down the CPU, but so long as you don’t run out of resources in the FPGA you can effectively continue this indefinitely. In fact, our kernels are still quite small compared to the resource availability on the Alveo U200 card.
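The frame rates quoted above follow directly from the per-frame times in the log. A minimal sketch of the conversion, assuming one frame is processed at a time and ignoring buffer transfer time (`frames_per_second` is a hypothetical helper name, not part of the tutorial code):

```cpp
// Hypothetical helper for illustration; not part of the tutorial code.

// Converts a per-frame processing time in milliseconds into a
// sustained frame rate, assuming frames are processed back to back.
double frames_per_second(double frame_time_ms) {
    return 1000.0 / frame_time_ms;
}
```

Feeding in the measured 102.977 ms for the CPU gives a little under 10 frames per second, and 7.069 ms for the FPGA kernel gives a little over 141, matching the whole-frame figures quoted above.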

To compare it to the previous example, again for a 1920x1200 input image, we get the results shown below. The comparison column will compare the “Scale Up” results from Example 7 with the scaled up results from this example.

Operation          Scale Down    Scale Up       Δ7→8
Software Resize    7.170 ms      102.977 ms     91.285 ms
Hardware Resize    4.788 ms      7.069 ms       385 µs
ΔAlveo→CPU         −2.382 ms     −95.908 ms     −90.9 ms

We hope you can see the advantage!