There are two steps we need to take to improve the situation of several processes needing for access to the variable p
. You can certainly make these changes manually, but you might find it easier to open a version of the code that has already been factored.
Select
Version2
in the Component dropdown of the Flow pane.The first change will be to the signature of the function
ntt
. We want to separate the input and output. This will allow inputs and outputs to proceed in parallel, without contention. These code changes can be seen in the following diff output:polyvec.h -void ntt(uint16_t p[N]); +void ntt(uint16_t p_in[N], uint16_t p_out[N]);
polyvec.cpp -void ntt(uint16_t p[N]) { +void ntt(uint16_t p_in[N], uint16_t p_out[N]) {
The other change is similar in that we basically want to separate the input and output of each loop statement within the
ntt
function. To do that, we’ll simply declare several arrays internally:uint16_t p_stage1[N]; uint16_t p_stage2[N]; uint16_t p_stage3[N]; uint16_t p_stage4[N]; uint16_t p_stage5[N]; uint16_t p_stage6[N];
And then each stage must then be modified to use those new arrays. For example Stage 3 reads inputs from
p_stage2[]
and generate outputs intop_stage3[]
:// Stage 3 len >>= 1;// = 32; ntt_stage3: for(start = 0; start < N; start = j + len) { int16_t zeta = zetas[k++]; ntt_stage3i: for(j = start; j < start + len; j++) { #pragma HLS PIPELINE int16_t t = fqmul(zeta, p_stage2[j + len]); p_stage3[j + len] = p_stage2[j] - t; p_stage3[j] = p_stage2[j] + t; } }
Now, we should have resolved the memory access conflicts. We can check and confirm this by re-running the Code Analyzer.
Run C Simulation on this component and open the HLS Code Analyzer, then open the graph for the function
ntt
:
With the dependencies removed, a clear structure emerges. What we now see is a sequence of processes with purely feed forward data transfer occurring in the region.
The TI of each stages have decreased again and are now in the range 135-578.
Version | polyvec_ntt TI | ntt TI |
---|---|---|
Version0: baseline code | 51.5M | 402k |
Version1: manual unroll | 804k | 6281 |
Version2: remove p[] dependencies | 230k | 1540 |
In addition, we can see that each data transfer is independent of any other, and has a write to read access mode (in the table below we searched to display only names with “p_”):
All of this is to say that this function is an excellent candidate for optimization using the dataflow pragma. We’ll do that now, in the next section: add dataflow pragma to enable the task level parallelism.