Now that the top-level function, `runOnfpga`, is updated with the proper data widths and interface types, you need to identify which loops to optimize to improve latency and throughput.
The `runOnfpga` function reads the Bloom filter coefficients from the DDR using `maxiport1` and saves them into the `bloom_filter_local` local array. The coefficients only need to be read one time.

```cpp
if(load_filter==true)
{
  read_bloom_filter: for(int index=0; index<bloom_filter_size; index++) {
  #pragma HLS PIPELINE II=1
    // One DDR read per iteration.
    unsigned int tmp = bloom_filter[index];
    // Copy the same coefficient into every local bank.
    for (int j=0; j<PARALLELISATION; j++) {
      bloom_filter_local[j][index] = tmp;
    }
  }
}
```
- `#pragma HLS PIPELINE II=1` is added to initiate the burst DDR accesses and read the Bloom filter coefficients every cycle.
- The expected latency is about 16,000 cycles because `bloom_filter_size` is fixed to 16,000. You should confirm this after you run HLS Synthesis.
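To make the latency estimate concrete, here is a minimal software sketch of the load step. It is not the kernel code itself. It models copying each coefficient into every bank and applies the usual pipelined-loop estimate: depth plus II times (trip count minus 1). The names `kParallelisation`, `kBloomFilterSize`, `load_filter_banks`, and `pipelined_loop_cycles` are hypothetical. The bank count of 4 is taken from the four-way parallelism described in this tutorial. The pipeline depth is an assumed placeholder; the real value comes from the synthesis report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the tutorial's PARALLELISATION and bloom_filter_size.
constexpr std::size_t kParallelisation = 4;
constexpr std::size_t kBloomFilterSize = 16000;

using FilterBanks = std::vector<std::vector<std::uint32_t>>;

// Software model of the read_bloom_filter loop: each coefficient is read
// once from the source buffer and written into every local bank.
FilterBanks load_filter_banks(const std::vector<std::uint32_t>& bloom_filter) {
    assert(bloom_filter.size() == kBloomFilterSize);
    FilterBanks banks(kParallelisation,
                      std::vector<std::uint32_t>(kBloomFilterSize));
    for (std::size_t index = 0; index < kBloomFilterSize; ++index) {
        const std::uint32_t tmp = bloom_filter[index];  // one read per iteration
        for (std::size_t j = 0; j < kParallelisation; ++j) {
            banks[j][index] = tmp;
        }
    }
    return banks;
}

// Rough cycle estimate for a pipelined loop: the first iteration completes
// after `depth` cycles, and each later iteration starts `ii` cycles after
// the previous one. `depth` is an assumption supplied by the caller.
constexpr long pipelined_loop_cycles(long trip_count, long ii, long depth) {
    return depth + ii * (trip_count - 1);
}
```

With an assumed depth of 10 cycles, `pipelined_loop_cycles(16000, 1, 10)` evaluates to 16,009, which is why the loop is expected to take about 16,000 cycles.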
Within the `compute_hash_flags` function, the `for` loop is rearchitected as a nested `for` loop to compute four words in parallel.

```cpp
void compute_hash_flags (
        hls::stream<parallel_flags_t>& flag_stream,
        hls::stream<parallel_words_t>& word_stream,
        unsigned int                   bloom_filter_local[PARALLELISATION][bloom_filter_size],
        unsigned int                   total_size)
{
  compute_flags: for(int i=0; i<total_size/PARALLELISATION; i++)
  {
    #pragma HLS LOOP_TRIPCOUNT min=1 max=10000
    // Read PARALLELISATION packed 32-bit entries in one stream access.
    parallel_words_t parallel_entries = word_stream.read();
    parallel_flags_t inh_flags = 0;

    for (unsigned int j=0; j<PARALLELISATION; j++)
    {
      #pragma HLS UNROLL
      // Select the 32-bit entry for lane j.
      unsigned int curr_entry = parallel_entries(31+j*32, j*32);
      unsigned int frequency = curr_entry & 0x00ff;
      unsigned int word_id = curr_entry >> 8;
      unsigned hash_pu = MurmurHash2(word_id, 3, 1);
      unsigned hash_lu = MurmurHash2(word_id, 3, 5);
      bool doc_end= (word_id==docTag);
      // hash >> 5 selects the 32-bit filter word, hash & 0x1f the bit in it.
      unsigned hash1 = hash_pu&hash_bloom;
      bool inh1 = (!doc_end) && (bloom_filter_local[j][ hash1 >> 5 ] & ( 1 << (hash1 & 0x1f)));
      unsigned hash2=(hash_pu+hash_lu)&hash_bloom;
      bool inh2 = (!doc_end) && (bloom_filter_local[j][ hash2 >> 5 ] & ( 1 << (hash2 & 0x1f)));

      // Each lane owns one byte of the output flags word.
      inh_flags(7+j*8, j*8) = (inh1 && inh2) ? 1 : 0;
    }

    flag_stream.write(inh_flags);
  }
}
```
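- Added `#pragma HLS UNROLL`.
  - Unrolls the inner loop to make four copies of the hash functionality.
- Vitis HLS will try to pipeline the outer loop with `II=1`. With the inner loop unrolled, you can initiate the outer loop every clock cycle and compute four words in parallel.
- Added `#pragma HLS LOOP_TRIPCOUNT min=1 max=3500000`.
  - Lets the tool report the latency of the function after HLS synthesis. The loop bound depends on the runtime value `total_size`, so the pragma supplies the iteration range used for reporting only; it does not change the generated hardware.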
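The bit manipulation inside `compute_flags` can be checked on a host machine with a plain C++ model. The sketch below is an illustration under stated assumptions, not the kernel implementation. It replaces the `ap_uint` range selections with shifts on standard integers, takes already-masked hash values as inputs instead of calling `MurmurHash2`, and uses a deliberately tiny filter. The names `kLanes`, `kFilterWords`, `LaneHashes`, `filter_bit_set`, `split_entry`, and `compute_flags_word` are hypothetical. The 32-bit flags word with one byte per lane is assumed from the `inh_flags(7+j*8, j*8)` selection.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4;        // mirrors PARALLELISATION (assumed 4)
constexpr std::size_t kFilterWords = 8;  // tiny 256-bit filter, for illustration only

using FilterBank = std::array<std::uint32_t, kFilterWords>;
using FilterBanks = std::array<FilterBank, kLanes>;

// The filter is an array of 32-bit words: hash >> 5 selects the word and
// hash & 0x1f selects the bit inside that word.
inline bool filter_bit_set(const FilterBank& bank, std::uint32_t hash) {
    assert((hash >> 5) < kFilterWords);
    return (bank[hash >> 5] & (1u << (hash & 0x1fu))) != 0;
}

// Splits one 32-bit entry the way the kernel does: the low byte is the
// frequency and the upper 24 bits are the word ID.
struct Entry {
    std::uint32_t word_id;
    std::uint32_t frequency;
};
inline Entry split_entry(std::uint32_t curr_entry) {
    return Entry{curr_entry >> 8, curr_entry & 0xffu};
}

// Hypothetical per-lane inputs; in the kernel the two hashes come from
// MurmurHash2 masked with hash_bloom.
struct LaneHashes {
    std::uint32_t hash1;
    std::uint32_t hash2;
    bool doc_end;
};

// Model of one outer-loop iteration: every lane tests its own filter copy
// and writes a 0/1 result into its own byte of the flags word.
inline std::uint32_t compute_flags_word(const FilterBanks& banks,
                                        const std::array<LaneHashes, kLanes>& lanes) {
    std::uint32_t flags = 0;
    for (std::size_t j = 0; j < kLanes; ++j) {
        const bool inh1 = !lanes[j].doc_end && filter_bit_set(banks[j], lanes[j].hash1);
        const bool inh2 = !lanes[j].doc_end && filter_bit_set(banks[j], lanes[j].hash2);
        flags |= static_cast<std::uint32_t>(inh1 && inh2) << (8 * j);
    }
    return flags;
}
```

Because each lane reads only `banks[j]`, the four lookups have no shared memory port. This is the reason the kernel keeps `PARALLELISATION` copies of the filter rather than a single array.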