According to characteristic of bicubic interpolation, we get derivative int_{xy} by first calculating int_{y} and then int_{x} from those. And interpolator optimizated is a 8x acceleration compare with original algorithm using sliding window on FPGA. The implemention is shown in the figure below:
The kernel will do the following steps:
- Load data from HBM: Load the original image pixels to stream and eight-bits represents a pixel. If NPPC=1 refers one pixel processing for every clock and NPPC=8 refers eight pixels processing for every clock.
- Image resampling: In image processing, we apply cubic interpolation to a data set from sliding window which is our proposed a structure for image processing. Here we can process 8 pixels in a clock that we take full advantage of the features of URAM to support. We can get several(<=8) results using 8 interpolator with ervery eight pixels, and put these results into a stream.
- Pick out pixel: We would pick out real and effective pixels from a 72-bits unit and making up these pixels a 64-bits unit.
- Load output to HBM: Scan stream to get data and write back to HBM.