Version2 - 2025.1 English - XD261

Vitis Tutorials: Vitis HLS (XD261)

Document ID
XD261
Release Date
2025-06-17
Version
2025.1 English

There are two steps we need to take to improve the situation of several processes needing for access to the variable p. You can certainly make these changes manually, but you might find it easier to open a version of the code that has already been factored.

  1. Select Version2 in the Component dropdown of the Flow pane.

    The first change will be to the signature of the function ntt. We want to separate the input and output. This will allow inputs and outputs to proceed in parallel, without contention. These code changes can be seen in the following diff output:

    polyvec.h
    -void ntt(uint16_t p[N]);
    +void ntt(uint16_t p_in[N], uint16_t p_out[N]);
    
    polyvec.cpp
    -void ntt(uint16_t p[N]) {
    +void ntt(uint16_t p_in[N], uint16_t p_out[N]) {
    

    The other change is similar in that we basically want to separate the input and output of each loop statement within the ntt function. To do that, we’ll simply declare several arrays internally:

        uint16_t p_stage1[N];
        uint16_t p_stage2[N];
        uint16_t p_stage3[N];
        uint16_t p_stage4[N];
        uint16_t p_stage5[N];
        uint16_t p_stage6[N];
    

    And then each stage must then be modified to use those new arrays. For example Stage 3 reads inputs from p_stage2[] and generate outputs into p_stage3[]:

        // Stage 3
        len >>= 1;// = 32;
        ntt_stage3: for(start = 0; start < N; start = j + len) {
            int16_t zeta = zetas[k++];
            ntt_stage3i: for(j = start; j < start + len; j++) {
    #pragma HLS PIPELINE
                int16_t t = fqmul(zeta, p_stage2[j + len]);
                p_stage3[j + len] = p_stage2[j] - t;
                p_stage3[j] = p_stage2[j] + t;
            }
        }
    

    Now, we should have resolved the memory access conflicts. We can check and confirm this by re-running the Code Analyzer.

  2. Run C Simulation on this component and open the HLS Code Analyzer, then open the graph for the function ntt:

    Code Analyzer ntt dependencies removed

With the dependencies removed, a clear structure emerges. What we now see is a sequence of processes with purely feed forward data transfer occurring in the region.

The TI of each stages have decreased again and are now in the range 135-578.

Version polyvec_ntt TI ntt TI
Version0: baseline code 51.5M 402k
Version1: manual unroll 804k 6281
Version2: remove p[] dependencies 230k 1540

In addition, we can see that each data transfer is independent of any other, and has a write to read access mode (in the table below we searched to display only names with “p_”):

Code Analyzer Channels ntt with dependencies removed

All of this is to say that this function is an excellent candidate for optimization using the dataflow pragma. We’ll do that now, in the next section: add dataflow pragma to enable the task level parallelism.