Version3 - 2025.1 English - XD261

Vitis Tutorials: Vitis HLS (XD261)

Document ID: XD261
Release Date: 2025-06-17
Version: 2025.1 English

Dataflow is an HLS optimization that turns a sequence of functions into processing stages, where the output of one stage is stored in a buffer and becomes the input of the next stage. This implements task-level parallelism (TLP): several computational tasks operate simultaneously.

Returning to the context of NTT, we want to create a dataflow region in which each stage is a task. Each of the processes shown in the graph above, currently two nested loops, therefore needs to be encapsulated in its own function. We created a function template that takes the stage number as its template parameter; each distinct instance of the function in the software infers a distinct module in the hardware.

// Using a template parameter to instantiate multiple different functions -> different hardware modules
template <int stage>
void ntt_stage(uint16_t p_in[N], uint16_t p_out[N]){
    unsigned int start=0, j=0;
    unsigned int len = K >> stage; 
    unsigned int k = 1 << stage; 

    ntt_stage_loop1: for(start = 0; start < N; start = j + len) {
        int16_t zeta = zetas[k]; k++;
        ntt_stage_loop2: for(j = start; j < start + len; j++) {
            #pragma HLS PIPELINE
            int16_t t = fqmul(zeta, p_in[j + len]);
            p_out[j + len] = p_in[j] - t;
            p_out[j] = p_in[j] + t;
        }
    }
}

void ntt(uint16_t p_in[N], uint16_t p_out[N]) { 
    uint16_t p_stage1[N];
    uint16_t p_stage2[N];
    uint16_t p_stage3[N];
    uint16_t p_stage4[N];
    uint16_t p_stage5[N];
    uint16_t p_stage6[N];
#pragma HLS DATAFLOW
    ntt_stage<0>(p_in,     p_stage1);
    ntt_stage<1>(p_stage1, p_stage2);
    ntt_stage<2>(p_stage2, p_stage3);
    ntt_stage<3>(p_stage3, p_stage4);
    ntt_stage<4>(p_stage4, p_stage5);
    ntt_stage<5>(p_stage5, p_stage6);
    ntt_stage<6>(p_stage6, p_out  );
}
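
To make the role of the template parameter more concrete, the per-stage constants can be tabulated on the host. The following is a minimal sketch, assuming N = 256 and K = 128 (these values are not shown in this section) and using hypothetical helper names (stage_len, stage_first_zeta, stage_groups, stage_butterflies, check_stage_constants) that are not part of the tutorial sources.

```cpp
#include <cassert>

// Assumed values; the tutorial sources define N and K in their own headers.
constexpr unsigned N = 256;
constexpr unsigned K = 128;
constexpr unsigned NUM_STAGES = 7;  // ntt_stage<0> ... ntt_stage<6>

// Butterfly distance of a stage (mirrors `len = K >> stage`).
constexpr unsigned stage_len(unsigned stage) { return K >> stage; }

// Index of the first twiddle factor read by a stage (mirrors `k = 1 << stage`).
constexpr unsigned stage_first_zeta(unsigned stage) { return 1u << stage; }

// Outer-loop iterations: each one advances `start` by 2 * len.
constexpr unsigned stage_groups(unsigned stage) { return N / (2 * stage_len(stage)); }

// Inner-loop iterations summed over the whole outer loop.
constexpr unsigned stage_butterflies(unsigned stage) {
    return stage_groups(stage) * stage_len(stage);
}

static_assert(stage_len(0) == 128 && stage_len(6) == 2, "len halves at every stage");
static_assert(stage_first_zeta(6) + stage_groups(6) == 128, "last zeta index read is 127");

// Checks that every stage does N/2 butterflies and reads one zeta per group.
inline void check_stage_constants() {
    for (unsigned s = 0; s < NUM_STAGES; ++s) {
        assert(stage_butterflies(s) == N / 2);
        assert(stage_groups(s) == stage_first_zeta(s));
    }
}
```

Under these assumptions, every stage performs the same number of butterflies (N/2); only the grouping and the range of zetas indices change from one instance to the next.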

Two further changes were made to the code compared to Version2:

  • Use dataflow in the polyvec_ntt() function: now that ntt() runs in dataflow mode, we expect its initiation interval (II) to be smaller than its full latency, so it can be restarted earlier. Because ntt() is called from within a loop inside polyvec_ntt(), that loop also needs to run in dataflow mode. Otherwise, each loop iteration waits for the full latency of ntt() before the next one starts, which wastes the benefit of the code changes.

  • Separate the input and output arrays: this saves the copy time that we had in the previous version and further removes dependencies.

void polyvec_ntt(polyvec *vin, polyvec *vout) {
    polyvec_ntt_loop:for( unsigned int i = 0; i < K; ++i) {
        #pragma HLS DATAFLOW
        poly_ntt(vin->vec+i, vout->vec+i);
    }
}
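
The reasoning behind the first change can be approximated with the usual pipelining estimate: if one call has latency L and can be restarted every II cycles, then a loop of n calls takes about n * L cycles without overlap and about L + (n - 1) * II cycles with overlap. The sketch below encodes that estimate; the function names and the cycle counts are hypothetical illustrations, not measurements from this design.

```cpp
#include <cassert>
#include <cstdint>

// Cycles for `calls` invocations when each one must finish before the next starts.
constexpr std::uint64_t sequential_cycles(std::uint64_t calls, std::uint64_t latency) {
    return calls * latency;
}

// Cycles when a new invocation can start every `ii` cycles (assumes ii <= latency):
// the first call pays the full latency, each later call adds one interval.
constexpr std::uint64_t overlapped_cycles(std::uint64_t calls, std::uint64_t latency,
                                          std::uint64_t ii) {
    return calls == 0 ? 0 : latency + (calls - 1) * ii;
}

// Hypothetical numbers: 128 calls, latency of 1400 cycles, II of 200 cycles.
static_assert(sequential_cycles(128, 1400) == 179200, "no overlap");
static_assert(overlapped_cycles(128, 1400, 200) == 26800, "overlapped calls");
```

When II equals the latency, both estimates coincide, which corresponds to the case where the loop does not run in dataflow mode.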

As before, these code modifications have already been completed, so you can switch to the next component, which contains the final Version3 of the code, and run the analysis one last time.

  1. Select Version3 in the Component dropdown of the Flow pane, run C Simulation on this component, open the HLS Code Analyzer, and select function ntt to show the following analysis results:

    Code Analyzer ntt with Dataflow

    In this final version of the code, the analyzer view shows just the 7 functions that implement the 7 stages and the channels that transfer data between them. This confirms that the code is suitable for the dataflow optimization.

    The TI of each stage is similar to the previous version because the code of the stages did not change; the values are now in the range 136-451.

Version                                                    polyvec_ntt TI    ntt TI
Version0: baseline code                                    51.5M             402k
Version1: manual unroll                                    804k              6281
Version2: remove p[] dependencies                          230k              1540
Version3: function template + dataflow + independent IO    199k              1552

To confirm that the dataflow optimization was applied successfully, we can use the Dataflow Viewer, but this requires running C Synthesis first.

  1. In the Flow pane, under C Synthesis, press Run.

  2. Once that completes, expand Reports and select Dataflow Viewer.

    You need to expand the box of the process nnt2_U0 to get to this representation:

    Dataflow Viewer ntt

Whereas the Code Analyzer analyzes the C code to estimate synthesis results and provide early guidance, the Dataflow Viewer uses the actual synthesis results to confirm those expectations. As shown in the view above, we can now confirm visually that the dataflow optimization is applied to this code. The tool has automatically created 7 tasks and inferred ping-pong (PIPO) buffers to store the results between consecutive tasks. Now that we have confirmed the presence of task-level parallelism in the code, we will check the performance improvement afforded by this optimization in the final section.
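
As a rough behavioral picture of the buffers inferred between two tasks, the sketch below models one of them as a two-bank (ping-pong) memory: the producer fills one bank while the consumer reads the bank filled in the previous round, and the roles are exchanged once both are done. This is a software illustration under that assumption, not the hardware that Vitis HLS generates; PingPong and run_rounds are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Behavioral model only: two banks with exchangeable producer/consumer roles.
template <std::size_t Size>
class PingPong {
public:
    // Bank currently owned by the producer.
    std::array<std::uint16_t, Size>& write_bank() { return banks_[write_sel_]; }
    // Bank currently owned by the consumer (filled during the previous round).
    const std::array<std::uint16_t, Size>& read_bank() const { return banks_[1 - write_sel_]; }
    // Called when both sides are done with their bank: the roles are exchanged.
    void swap() { write_sel_ = 1 - write_sel_; }

private:
    std::array<std::uint16_t, Size> banks_[2] = {};
    unsigned write_sel_ = 0;
};

// Pushes `rounds` blocks through the buffer and returns the sum seen by the consumer.
inline std::uint32_t run_rounds(unsigned rounds) {
    PingPong<4> buf;
    std::uint32_t consumed = 0;
    for (unsigned r = 0; r <= rounds; ++r) {
        if (r < rounds) {
            // Producer fills its bank with the value r + 1.
            buf.write_bank().fill(static_cast<std::uint16_t>(r + 1));
        }
        if (r > 0) {
            // Consumer reads the block produced in round r - 1.
            for (std::uint16_t v : buf.read_bank()) consumed += v;
        }
        buf.swap();
    }
    return consumed;
}
```

In this model the producer of block r and the consumer of block r - 1 work in the same round on different banks, which is the overlap that lets consecutive tasks run concurrently.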