Now that the top-level function, `runOnfpga`, has been updated with the proper data widths and interface types, you need to identify which loops to optimize to improve latency and throughput.
The `runOnfpga` function reads the Bloom filter coefficients from the DDR using `maxiport1` and saves them into the `bloom_filter_local` local array. The coefficients only need to be read once.

```cpp
if(load_filter==true)
{
  read_bloom_filter: for(int index=0; index<bloom_filter_size; index++) {
    #pragma HLS PIPELINE II=1
    unsigned int tmp = bloom_filter[index];
    // Store the same coefficient in every parallel copy of the filter.
    for (int j=0; j<PARALLELISATION; j++) {
      bloom_filter_local[j][index] = tmp;
    }
  }
}
```
`#pragma HLS PIPELINE II=1` is added to initiate burst DDR accesses and read one Bloom filter coefficient every cycle. The expected latency is about 16,000 cycles because `bloom_filter_size` is fixed to 16,000. You should confirm this after you run HLS Synthesis.
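The 16,000-cycle figure follows from the usual estimate for a pipelined loop: one iteration's latency plus one initiation interval for each remaining iteration. The helper below is a minimal sketch of that arithmetic. The function name `pipelined_loop_latency` and the iteration latency of 10 cycles used in the comment are illustrative assumptions, not values taken from the design; the real pipeline depth comes from the HLS synthesis report.

```cpp
#include <cstdint>

// Rough cycle estimate for a pipelined loop:
//   latency ~= iteration_latency + II * (trip_count - 1)
// iteration_latency (the pipeline depth) is an assumed input here;
// read the actual value from the HLS synthesis report.
constexpr std::uint64_t pipelined_loop_latency(std::uint64_t trip_count,
                                               std::uint64_t ii,
                                               std::uint64_t iteration_latency) {
    return trip_count == 0 ? 0 : iteration_latency + ii * (trip_count - 1);
}

// Example: 16,000 iterations at II=1 with an assumed depth of 10 cycles
// gives 10 + 15,999 = 16,009 cycles, i.e. "about 16,000".
```

With `II=1` the trip count dominates, which is why the loop bound alone is a good first approximation of the latency.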
Within the `compute_hash_flags` function, the `for` loop is rearchitected as a nested `for` loop to compute 4 words in parallel.

```cpp
void compute_hash_flags (
        hls::stream<parallel_flags_t>& flag_stream,
        hls::stream<parallel_words_t>& word_stream,
        unsigned int                   bloom_filter_local[PARALLELISATION][bloom_filter_size],
        unsigned int                   total_size)
{
  compute_flags: for(int i=0; i<total_size/PARALLELISATION; i++)
  {
    #pragma HLS LOOP_TRIPCOUNT min=1 max=10000
    // Read PARALLELISATION packed 32-bit entries in one stream access.
    parallel_words_t parallel_entries = word_stream.read();
    parallel_flags_t inh_flags = 0;

    for (unsigned int j=0; j<PARALLELISATION; j++)
    {
      #pragma HLS UNROLL
      // Extract the j-th 32-bit entry: low 8 bits are the frequency, the rest is the word ID.
      unsigned int curr_entry = parallel_entries(31+j*32, j*32);
      unsigned int frequency = curr_entry & 0x00ff;
      unsigned int word_id = curr_entry >> 8;
      unsigned hash_pu = MurmurHash2(word_id, 3, 1);
      unsigned hash_lu = MurmurHash2(word_id, 3, 5);
      bool doc_end = (word_id==docTag);
      // Each lane j looks up its own copy of the filter.
      unsigned hash1 = hash_pu&hash_bloom;
      bool inh1 = (!doc_end) && (bloom_filter_local[j][ hash1 >> 5 ] & ( 1 << (hash1 & 0x1f)));
      unsigned hash2 = (hash_pu+hash_lu)&hash_bloom;
      bool inh2 = (!doc_end) && (bloom_filter_local[j][ hash2 >> 5 ] & ( 1 << (hash2 & 0x1f)));

      // Write the result into the j-th 8-bit lane of the flag word.
      inh_flags(7+j*8, j*8) = (inh1 && inh2) ? 1 : 0;
    }

    flag_stream.write(inh_flags);
  }
}
```
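The bit manipulation inside the unrolled loop can be checked in plain C++ without the HLS headers. The sketch below models three steps from the listing: splitting a 32-bit entry into word ID and frequency, testing one bit of the Bloom filter, and writing a result into the 8-bit lane of the flag word. All names here (`unpack_entry`, `test_bit`, `set_bit`, `set_flag`, `kFilterWords`) are hypothetical stand-ins, the filter is an assumed 8-word toy size, and a plain `std::uint32_t` replaces `parallel_flags_t`; the hash computation is left out, so hash values are passed in directly and must already be masked to the filter size.

```cpp
#include <array>
#include <cstdint>

constexpr unsigned kFilterWords = 8;  // assumed toy size: 8 x 32 = 256 filter bits
using Filter = std::array<std::uint32_t, kFilterWords>;

struct Entry {
    std::uint32_t word_id;    // upper 24 bits of the packed entry
    std::uint32_t frequency;  // lower 8 bits of the packed entry
};

// Mirrors: frequency = curr_entry & 0x00ff; word_id = curr_entry >> 8;
constexpr Entry unpack_entry(std::uint32_t curr_entry) {
    return Entry{curr_entry >> 8, curr_entry & 0xffu};
}

// Mirrors: filter[hash >> 5] & (1 << (hash & 0x1f))
// hash >> 5 selects the 32-bit word, hash & 0x1f selects the bit inside it.
// Precondition: hash < kFilterWords * 32.
inline bool test_bit(const Filter& filter, std::uint32_t hash) {
    return (filter[hash >> 5] & (1u << (hash & 0x1fu))) != 0;
}

// Helper for building a test filter (not part of the kernel listing).
inline void set_bit(Filter& filter, std::uint32_t hash) {
    filter[hash >> 5] |= 1u << (hash & 0x1fu);
}

// Mirrors: inh_flags(7+lane*8, lane*8) = hit ? 1 : 0;
// Clears the 8-bit lane, then writes 0 or 1 into it. Valid for lane 0..3.
constexpr std::uint32_t set_flag(std::uint32_t flags, unsigned lane, bool hit) {
    const std::uint32_t mask = 0xffu << (lane * 8);
    return (flags & ~mask) | (static_cast<std::uint32_t>(hit ? 1 : 0) << (lane * 8));
}
```

For example, hash value 37 maps to bit 5 of filter word 1, and a hit in lane 2 sets the flag word to `0x00010000`.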
- Added `#pragma HLS UNROLL`, which unrolls the inner loop to make four copies of the hash functionality.
- Vitis HLS will try to pipeline the outer loop with `II=1`. With the inner loop unrolled, a new outer-loop iteration can start every clock cycle, so 4 words are computed in parallel.
- Added `#pragma HLS LOOP_TRIPCOUNT min=1 max=3500000`, which lets Vitis HLS report the latency of the function after HLS Synthesis even though the loop bound depends on `total_size`. These bounds only affect the reported latency, not the generated hardware.