The convolutional filter consists of three independent CUs, each of which
processes one color channel of the video image stream. To keep the project simple and
focused on VSC, the example does not use an actual video stream. Instead, the example host
code generates a random image with three color channels and passes it to a
software function that uses the VSC-specific code. This results in C-threads for
sending data to the device and receiving data back from it, plus the compute()
call.
The function takes all the filter parameters, along with data pointers to the source and destination images. The code snippet below gives the definition of this function. The essential steps include:
- Create buffer pool handles (srcBufPool ...) to enable sending and receiving data between the host and the device. Attributes indicate if the buffers are inputs or outputs.
- Call conv_acc::send_while() using a lambda function. The lambda function allocates buffers on the device, copies host data to the device buffers, then calls the compute() function (which ultimately runs the hardware accelerated function). send_while() keeps calling the lambda function as long as it returns true.
- Call conv_acc::receive_all_in_order(), also using a lambda function, to receive the processed buffers.
- Use a join() call to wait and synchronize everything.
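As a rough analogy for the send, receive, and join steps listed above, the following standalone sketch mimics the same control flow with plain C++ threads. It is not the VSC API: MockAcc, run_mock, and the "input * 2" placeholder computation are hypothetical names invented for this illustration, and the real conv_acc functions may behave differently in detail.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for the send/receive/join pattern. NOT the VSC API.
class MockAcc {
public:
    // Queue one result; "input * 2" is a placeholder for the accelerated function.
    void compute(int input) {
        std::lock_guard<std::mutex> lock(m_);
        results_.push(input * 2);
        cv_.notify_all();
    }

    // Start a sender thread that calls body() until it returns false.
    void send_while(std::function<bool()> body) {
        sender_ = std::thread([this, body] {
            while (body()) {}
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;  // tell the receiver that no more results will arrive
            cv_.notify_all();
        });
    }

    // Start a receiver thread that calls body(result) once per result, in FIFO order.
    void receive_all_in_order(std::function<void(int)> body) {
        receiver_ = std::thread([this, body] {
            for (;;) {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return !results_.empty() || done_; });
                if (results_.empty()) return;  // sender finished and queue drained
                int r = results_.front();
                results_.pop();
                lock.unlock();
                body(r);
            }
        });
    }

    // Block until both threads have finished.
    void join() {
        assert(sender_.joinable() && receiver_.joinable());
        sender_.join();
        receiver_.join();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> results_;
    bool done_ = false;
    std::thread sender_, receiver_;
};

// Submit numJobs jobs and collect the results in the order they were sent.
std::vector<int> run_mock(int numJobs) {
    MockAcc acc;
    std::vector<int> received;
    int run = 0;
    acc.send_while([&]() -> bool {
        acc.compute(run);        // last call still computes, then returns false
        return ++run < numJobs;
    });
    acc.receive_all_in_order([&](int r) { received.push_back(r); });
    acc.join();
    return received;
}
```

Calling run_mock(3) submits jobs 0, 1, and 2 from the sender thread and collects the results 0, 2, and 4 on the receiver thread, in submission order. As in the steps above, the send body runs one final time on the iteration where it returns false.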
Tip: Refer to VPP_ACC Class API for an explanation of the various functions.
#include <cstring>
#include "conv_filter_acc_wrapper.hpp"

int conv_filter_execute_fpga(
    const char coeffs[FILTER_V_SIZE][FILTER_H_SIZE],
    float factor,
    short bias,
    unsigned short width,
    unsigned short height,
    unsigned int numImages,
    YUVImage srcImage,
    YUVImage dstImage
)
{
    auto srcBufPool = conv_acc::create_bufpool(vpp::input);
    auto dstBufPool = conv_acc::create_bufpool(vpp::output);
    auto coeffsBufPool = conv_acc::create_bufpool(vpp::input);
    int run = 0;
    int dataSizePerChannel = width * height;

    // sending input
    // run is captured by reference so that the increment persists between calls
    conv_acc::send_while([=, &run]()->bool {
        conv_acc::set_handle(run);
        unsigned char * srcBuf = (unsigned char *)conv_acc::alloc_buf(srcBufPool, 3*dataSizePerChannel);
        unsigned char * dstBuf = (unsigned char *)conv_acc::alloc_buf(dstBufPool, 3*dataSizePerChannel);
        char * coeffsBuf = (char *)conv_acc::alloc_buf(coeffsBufPool, FILTER_V_SIZE*FILTER_H_SIZE);

        // initialize all input data before parallel computes
        unsigned char * srcChannel[3] = {srcImage.yChannel, srcImage.uChannel, srcImage.vChannel};
        for (int ch = 0; ch < 3; ch++) {
            std::memcpy(srcBuf + ch*dataSizePerChannel, srcChannel[ch], dataSizePerChannel);
        }
        // copy exactly as many bytes as were allocated for the coefficients
        std::memcpy(coeffsBuf, coeffs, FILTER_V_SIZE*FILTER_H_SIZE);

        // execute conv_acc<NCU> parallel computes, one per color channel
        for (int ch = 0; ch < 3; ch++) {
            conv_acc::compute(coeffsBuf,
                              factor,
                              bias,
                              width,
                              height,
                              srcBuf + ch*dataSizePerChannel,
                              dstBuf + ch*dataSizePerChannel);
        }
        return (++run < numImages);
    });

    // receive lambda function for receive thread
    conv_acc::receive_all_in_order([=]() {
        int run = conv_acc::get_handle();  // handle set by the matching send iteration
        unsigned char * dstBuf = (unsigned char *)conv_acc::get_buf(dstBufPool);
        unsigned char * dstChannel[3] = {dstImage.yChannel, dstImage.uChannel, dstImage.vChannel};
        for (int ch = 0; ch < 3; ch++) {
            std::memcpy(dstChannel[ch], dstBuf + ch*dataSizePerChannel, dataSizePerChannel);
        }
    });

    // wait for both loops to finish
    conv_acc::join();
    return 0;
}
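The listing above packs the three color channels into one contiguous buffer and addresses each channel at offset ch*dataSizePerChannel, so that each compute() call works on its own slice. The following host-only sketch isolates that layout. PlanarImage, pack_planes, and unpack_planes are hypothetical names used only for this illustration; the tutorial's YUVImage type is assumed to expose per-channel pointers in a comparable way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical image type for this sketch (not the tutorial's YUVImage).
struct PlanarImage {
    std::vector<unsigned char> y, u, v;
};

// Copy the three planes into one contiguous buffer laid out as [Y | U | V].
std::vector<unsigned char> pack_planes(const PlanarImage& img, std::size_t planeSize) {
    assert(planeSize > 0);
    assert(img.y.size() == planeSize && img.u.size() == planeSize && img.v.size() == planeSize);
    std::vector<unsigned char> buf(3 * planeSize);
    const unsigned char* src[3] = {img.y.data(), img.u.data(), img.v.data()};
    for (int ch = 0; ch < 3; ch++) {
        // Channel ch starts at byte offset ch * planeSize.
        std::memcpy(buf.data() + ch * planeSize, src[ch], planeSize);
    }
    return buf;
}

// Split a [Y | U | V] buffer back into three separate planes.
PlanarImage unpack_planes(const std::vector<unsigned char>& buf, std::size_t planeSize) {
    assert(planeSize > 0);
    assert(buf.size() == 3 * planeSize);
    PlanarImage img{std::vector<unsigned char>(planeSize),
                    std::vector<unsigned char>(planeSize),
                    std::vector<unsigned char>(planeSize)};
    unsigned char* dst[3] = {img.y.data(), img.u.data(), img.v.data()};
    for (int ch = 0; ch < 3; ch++) {
        std::memcpy(dst[ch], buf.data() + ch * planeSize, planeSize);
    }
    return img;
}
```

Because the slices do not overlap, three independent workers can each read and write their own channel without coordinating with the others, which is the property the three CUs rely on.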