Dataflow is an HLS optimization that creates separate processing stages from a sequence of functions, where the output of one stage is stored in a buffer to become the input of the next stage. In this way, we implement Task Level Parallelism (TLP), where several computational tasks operate simultaneously.
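As a point of reference, a minimal, generic sketch of this pattern could look like the following (the stage functions here are hypothetical and are not part of the NTT code): each function call becomes a task, and the local array declared inside the region becomes the buffer between the tasks.

#define LEN 64

static void stage_a(const int in[LEN], int out[LEN]) {
    for (int i = 0; i < LEN; i++) out[i] = in[i] + 1;
}

static void stage_b(const int in[LEN], int out[LEN]) {
    for (int i = 0; i < LEN; i++) out[i] = 2 * in[i];
}

// Top-level dataflow region: with buffered channels, successive calls to
// pipeline_top overlap, so stage_a can start working on the next transaction
// while stage_b is still processing the previous one.
void pipeline_top(const int in[LEN], int out[LEN]) {
#pragma HLS DATAFLOW
    int tmp[LEN];
    stage_a(in, tmp);   // task 1: produces tmp
    stage_b(tmp, out);  // task 2: consumes tmp
}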
Returning to the context of the NTT, we want to create a dataflow region where each stage is a task. Each of the processes shown in the graph above, currently two nested loops each, therefore needs to be encapsulated in its own function. We created a function template: the different function instances in the software infer different modules in the hardware, and the template parameter carries the stage number.
// Using a template parameter to instantiate multiple different functions -> different hardware modules
template <int stage>
void ntt_stage(uint16_t p_in[N], uint16_t p_out[N]) {
    unsigned int start = 0, j = 0;
    unsigned int len = K >> stage;
    unsigned int k = 1 << stage;
    // Each outer iteration processes one group of len butterflies
    ntt_stage_loop1: for (start = 0; start < N; start = j + len) {
        int16_t zeta = zetas[k]; k++;   // twiddle factor for this group
        ntt_stage_loop2: for (j = start; j < start + len; j++) {
#pragma HLS PIPELINE
            // Cooley-Tukey butterfly combining p_in[j] and p_in[j + len]
            int16_t t = fqmul(zeta, p_in[j + len]);
            p_out[j + len] = p_in[j] - t;
            p_out[j]       = p_in[j] + t;
        }
    }
}
void ntt(uint16_t p_in[N], uint16_t p_out[N]) {
    uint16_t p_stage1[N];
    uint16_t p_stage2[N];
    uint16_t p_stage3[N];
    uint16_t p_stage4[N];
    uint16_t p_stage5[N];
    uint16_t p_stage6[N];
#pragma HLS DATAFLOW
    ntt_stage<0>(p_in,     p_stage1);
    ntt_stage<1>(p_stage1, p_stage2);
    ntt_stage<2>(p_stage2, p_stage3);
    ntt_stage<3>(p_stage3, p_stage4);
    ntt_stage<4>(p_stage4, p_stage5);
    ntt_stage<5>(p_stage5, p_stage6);
    ntt_stage<6>(p_stage6, p_out);
}
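The snippet above assumes that N, K, the zetas twiddle-factor table, and the modular multiplication fqmul() are defined elsewhere in the project. As a reference only, a minimal sketch of what these declarations could look like, modeled on the Kyber reference implementation, is shown below; the actual definitions used in the tutorial sources may differ (K in particular is project-specific).

// Sketch of the declarations assumed by ntt_stage()/ntt(), modeled on the
// Kyber reference implementation; the tutorial project may define these differently.
#include <stdint.h>

#define N        256    // number of polynomial coefficients
#define KYBER_Q  3329   // prime modulus q
#define QINV    -3327   // q^-1 mod 2^16, used by Montgomery reduction

extern const int16_t zetas[128];   // precomputed twiddle factors

// Montgomery reduction: returns a * 2^-16 mod q (centered representative)
static int16_t montgomery_reduce(int32_t a) {
    int16_t t = (int16_t)a * QINV;
    t = (a - (int32_t)t * KYBER_Q) >> 16;
    return t;
}

// Modular multiplication with Montgomery reduction, as in the Kyber reference code
static int16_t fqmul(int16_t a, int16_t b) {
    return montgomery_reduce((int32_t)a * b);
}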
Two further changes were made to the code compared to Version2:
- Use dataflow in the polyvec_ntt() function: now that ntt() runs in dataflow mode, we expect it to have a smaller initiation interval (II) than its full latency, so ntt() can be restarted earlier. Because it is called from within a loop, this loop inside the polyvec_ntt() function also needs to run in dataflow; otherwise the loop waits for the full latency of the ntt() function before starting the next iteration, which is not efficient given the code changes performed.
- Separate the input and output arrays: this saves the copy time that we had in the previous version and further removes dependencies.
void polyvec_ntt(polyvec *vin, polyvec *vout) {
    polyvec_ntt_loop: for (unsigned int i = 0; i < K; ++i) {
#pragma HLS DATAFLOW
        poly_ntt(vin->vec + i, vout->vec + i);
    }
}
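polyvec_ntt() relies on a polyvec type and a poly_ntt() wrapper that are not shown here. A minimal sketch of what they could look like, following the structure of the Kyber reference code, is given below as an assumption; the tutorial sources may define them differently.

// Hypothetical sketch of the types and wrapper assumed by polyvec_ntt();
// names follow the Kyber reference code, adapted to the separate-output style used here.
typedef struct { uint16_t coeffs[N]; } poly;
typedef struct { poly vec[K]; } polyvec;   // K polynomials per vector

// Thin wrapper: runs the dataflow ntt() on one polynomial, writing to a separate output
void poly_ntt(poly *in, poly *out) {
    ntt(in->coeffs, out->coeffs);
}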
As before, these code modifications have already been completed, so you can switch to the final Version3 component of the code and run the analysis one final time.
Select Version3 in the Component dropdown of the Flow pane, run C Simulation on this component, open the HLS Code Analyzer, and select function ntt to show the following analysis results:

In this final version of the code, the analyzer view shows just the 7 functions that implement the 7 stages and the channels that transfer data between them. This confirms that the dataflow optimization applies to this code.
The TI of each stage is similar to the previous version because the code didn't change; the values are now in the range 136-451.
| Version | polyvec_ntt TI | ntt TI |
|---|---|---|
| Version0: baseline code | 51.5M | 402k |
| Version1: manual unroll | 804k | 6281 |
| Version2: remove p[] dependencies | 230k | 1540 |
| Version3: function template + dataflow + independent IO | 199k | 1552 |
To confirm the successful application of the dataflow optimization, we can use the Dataflow Viewer, but this requires that C Synthesis is run.
In the Flow panel, under C Synthesis, press Run.
Once that completes, expand Reports and select Dataflow Viewer.
You need to expand the box of the nnt2_U0 process to get to this representation:
Whereas the Code Analyzer analyzes the C code to estimate synthesis results and provide early guidance, the Dataflow Viewer uses the results of synthesis to confirm those expectations. As shown in the view above, we can now confirm visually that the dataflow optimization is being applied to this code: it has automatically created 7 tasks and inferred ping-pong (PIPO) buffers to store the results between each task. Now that we have confirmed the presence of Task Level Parallelism in the code, we will check the performance improvement afforded by this optimization in the final section.