Navigate to the function runOnfpga in 02-bloom/reference_files/compute_score_fpga_kernel.cpp.
The algorithm has been updated to receive 512 bits of words at a time from the DDR, with the following arguments:
- input_words: 512-bit input data.
- output_flags: 512-bit output data.
- Additional arguments:
  - bloom_filter: Pointer to the array of Bloom filter coefficients.
  - Total number of words to be computed.
  - load_filter: Enables or disables loading of the coefficients. The coefficients only need to be loaded once.
The first step of the kernel development methodology requires structuring the kernel code into the Load-Compute-Store pattern. This means creating a top-level function, runOnfpga, with:
- Sub-functions, added in compute_hash_flags_dataflow, for Load, Compute, and Store.
- Local arrays or hls::stream variables to pass data between these functions.
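The Load-Compute-Store structure described above can be sketched in plain C++ as a software model. This is only an illustration under stated assumptions: std::queue stands in for hls::stream, the names load, compute, store, and run_model are hypothetical, and the compute step is a placeholder rather than the tutorial's hash computation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Assumption: std::queue is used as a software stand-in for hls::stream.
template <typename T>
using Stream = std::queue<T>;

// Load: move data from "global memory" (a vector here) into a stream.
void load(const std::vector<std::uint32_t>& mem, Stream<std::uint32_t>& out) {
    for (std::uint32_t v : mem) out.push(v);
}

// Compute: placeholder operation (hypothetical). The real kernel computes
// Bloom filter hash flags; this model just emits the low bit of each word.
void compute(Stream<std::uint32_t>& in, Stream<std::uint8_t>& out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        assert(!in.empty());
        std::uint32_t v = in.front();
        in.pop();
        out.push(static_cast<std::uint8_t>(v & 1u));
    }
}

// Store: drain the result stream back into "global memory".
void store(Stream<std::uint8_t>& in, std::vector<std::uint8_t>& mem, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        assert(!in.empty());
        mem.push_back(in.front());
        in.pop();
    }
}

// Top-level function: the three stages communicate only through streams.
std::vector<std::uint8_t> run_model(const std::vector<std::uint32_t>& input) {
    Stream<std::uint32_t> words;
    Stream<std::uint8_t> flags;
    std::vector<std::uint8_t> output;
    load(input, words);
    compute(words, flags, input.size());
    store(flags, output, input.size());
    return output;
}
```

In this model the stages run one after another. The point of keeping them as separate functions connected by streams is that the HLS compiler can later overlap them.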
The source code has the following INTERFACE pragmas for input_words, output_flags, and bloom_filter.
    #pragma HLS INTERFACE m_axi port=output_flags bundle=maxiport0 offset=slave
    #pragma HLS INTERFACE m_axi port=input_words bundle=maxiport0 offset=slave
    #pragma HLS INTERFACE m_axi port=bloom_filter bundle=maxiport1 offset=slave
where:
- m_axi: Characterizes the argument as an AXI master port.
- port: Specifies the name of the argument to be mapped to the AXI4 interface.
- offset=slave: Indicates that the base address of the pointer is made available through the AXI4-Lite slave interface of the kernel.
- bundle: Specifies the name of the m_axi interface. In this example, the input_words and output_flags arguments are mapped to maxiport0, and the bloom_filter argument is mapped to maxiport1.
The function runOnfpga loads the Bloom filter coefficients and calls the compute_hash_flags_dataflow function, which contains the main functionality of the Load, Compute, and Store functions.
Refer to the function compute_hash_flags_dataflow in the 02-bloom/cpu_src/compute_score_fpga_kernel.cpp file. The following block diagram shows how the compute kernel connects to the device DDR memories and how it feeds the compute hash block processing unit.
The kernel interface to the DDR memories is an AXI interface that is kept at its maximum width of 512 bits at the input and output. The compute_hash_flags function input can have a width different from 512 bits, set through “PARALLELIZATION”. To handle these width differences at the processing element boundaries, “Resize” blocks are inserted that adapt between the memory interface width and the processing unit interface width. Essentially, the blocks named “Buffer” are memory adapters that convert between the AXI interface and streams, and the “Resize” blocks adapt the stream widths according to the PARALLELIZATION factor chosen for the given configuration.
The input of the compute_hash_flags_dataflow function, input_words, is read as 512-bit burst reads from the global memory over an AXI interface, and data_from_gmem, a stream of 512-bit values, is created.
    hls_stream::buffer(data_from_gmem, input_words, total_size/(512/32));
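To make the count argument total_size/(512/32) concrete, here is a minimal software model of the memory-to-stream step. It is a sketch, not the tutorial's hls_stream::buffer implementation: Wide512, wide_count, and buffer_to_stream are hypothetical names, a std::array of sixteen 32-bit words stands in for ap_uint<512>, and std::queue stands in for hls::stream. It also assumes total_size is a multiple of 16.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Assumption: sixteen 32-bit words model one 512-bit value (ap_uint<512>).
using Wide512 = std::array<std::uint32_t, 16>;

constexpr std::size_t kWordsPerWide = 512 / 32;  // 16 words per 512-bit value

// Number of 512-bit values needed to carry total_size 32-bit words.
std::size_t wide_count(std::size_t total_size) {
    assert(total_size % kWordsPerWide == 0);  // assumed by this sketch
    return total_size / kWordsPerWide;
}

// Model of the "Buffer" adapter on the input side: read `count` 512-bit
// values from memory, in order, and push them into a stream.
void buffer_to_stream(std::queue<Wide512>& out,
                      const std::vector<Wide512>& mem,
                      std::size_t count) {
    assert(count <= mem.size());
    for (std::size_t i = 0; i < count; ++i) {
        out.push(mem[i]);
    }
}
```

For example, with total_size = 64 words, wide_count returns 64/16 = 4, so four 512-bit values are read from memory.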
The stream of parallel words, word_stream (each element holds PARALLELIZATION words), is created from data_from_gmem, because compute_hash_flags requires 128 bits to process 4 words in parallel.
    hls_stream::resize(word_stream, data_from_gmem, total_size/(512/32));
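The width conversion from 512 bits down to 128 bits can be modelled in plain C++ as follows. This is a sketch under assumptions, not the tutorial's hls_stream::resize code: the type and function names are hypothetical, std::queue stands in for hls::stream, PARALLELIZATION is taken to be 4, and the lowest-indexed words of each 512-bit value are assumed to be emitted first.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

// Assumption: sixteen 32-bit words model one 512-bit value.
using Wide512 = std::array<std::uint32_t, 16>;
// Assumption: four 32-bit words model parallel_words_t for PARALLELIZATION=4.
using ParallelWords = std::array<std::uint32_t, 4>;

// Split each 512-bit value into four 128-bit groups of 4 words each.
// `count` is the number of 512-bit values to consume.
void resize_down(std::queue<ParallelWords>& out,
                 std::queue<Wide512>& in,
                 std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        assert(!in.empty());
        Wide512 wide = in.front();
        in.pop();
        for (std::size_t group = 0; group < 4; ++group) {
            ParallelWords p{};
            for (std::size_t k = 0; k < 4; ++k) {
                p[k] = wide[group * 4 + k];  // lowest-indexed words first (assumed order)
            }
            out.push(p);
        }
    }
}
```

Each 512-bit value therefore yields 512/128 = 4 elements of the narrower stream, which is why the narrow side carries four times as many transfers as the wide side.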
The function compute_hash_flags_dataflow calls the compute_hash_flags function to compute the hashes of the parallel words.
With PARALLELIZATION=4, the output of compute_hash_flags, flag_stream, carries 4*8-bit = 32-bit values, which are used to create the stream of 512-bit values, data_to_gmem.
    hls_stream::resize(data_to_gmem, flag_stream, total_size/(512/8));
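The opposite conversion, packing 32-bit flag groups into 512-bit values, can be sketched the same way. Again this is only a software model under assumptions: the names are hypothetical, std::queue stands in for hls::stream, one byte models each 8-bit flag, and groups are assumed to fill the 512-bit value from the lowest position upward.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

// Assumption: four 8-bit flags model parallel_flags_t for PARALLELIZATION=4.
using ParallelFlags = std::array<std::uint8_t, 4>;
// Assumption: sixty-four 8-bit flags model one 512-bit value.
using WideFlags = std::array<std::uint8_t, 64>;

constexpr std::size_t kFlagsPerWide = 512 / 8;  // 64 flags per 512-bit value

// Number of 512-bit output values for total_size words (one flag per word).
std::size_t flag_wide_count(std::size_t total_size) {
    assert(total_size % kFlagsPerWide == 0);  // assumed by this sketch
    return total_size / kFlagsPerWide;
}

// Pack sixteen 32-bit flag groups into each 512-bit value.
void resize_up(std::queue<WideFlags>& out,
               std::queue<ParallelFlags>& in,
               std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        WideFlags wide{};
        for (std::size_t group = 0; group < 16; ++group) {
            assert(!in.empty());
            ParallelFlags f = in.front();
            in.pop();
            for (std::size_t k = 0; k < 4; ++k) {
                wide[group * 4 + k] = f[k];  // lowest positions filled first (assumed order)
            }
        }
        out.push(wide);
    }
}
```

With total_size = 64 words there are 64 flags, so flag_wide_count returns 64/64 = 1 and sixteen 32-bit groups are packed into a single 512-bit value.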
The stream of 512-bit values, data_to_gmem, is written as 512-bit values to the global memory over an AXI interface using output_flags.
    hls_stream::buffer(output_flags, data_to_gmem, total_size/(512/8));
The #pragma HLS DATAFLOW directive is added to enable task-level pipelining. It instructs the Vitis High-Level Synthesis (HLS) compiler to run all the functions simultaneously, creating a pipeline of concurrently running tasks.
    void compute_hash_flags_dataflow(
        ap_uint<512>* output_flags,
        ap_uint<512>* input_words,
        unsigned int bloom_filter[PARALLELIZATION][bloom_filter_size],
        unsigned int total_size)
    {
    #pragma HLS DATAFLOW
        hls::stream<ap_uint<512> > data_from_gmem;
        hls::stream<parallel_words_t> word_stream;
        hls::stream<parallel_flags_t> flag_stream;
        hls::stream<ap_uint<512> > data_to_gmem;
        . . . .
    }