Version1 - 2025.1 English - XD261

Vitis Tutorials: Vitis HLS (XD261)

Document ID
XD261
Release Date
2025-06-17
Version
2025.1 English

Version1 is based on Version0 and the only code changes made so far are that we’ve manually unrolled the loop named ntt_stage_outer in the function ntt, the previous code is looking like this:

    ntt_stage_outer:for(len = K; len >= 2; len >>= 1) {
        for(start=0;...
            for(j=start;...
                // loop body 

and after unrolling ntt_stage_outer with induction variable len, the code is changed into:

    // Stage 0
    len = K; // K is defined to be 128, so len is 128
    for(start=0;...
        for(j=start;...
            // loop body 

    // Stage 1
    len >>= 1; // len is now 64
    for(start=0;...
        for(j=start;...
            // loop body 

    [..repeat..]

    // Stage 6
    len >>= 1; //  len is now 2
    for(start=0;...
        for(j=start;...
            // loop body 

    // manual unrolling stops here:
    // the original loop condition is len >= 2 so the next iteration will not happen.

Let’s begin by simulating the code provided in Version1.

  1. Run C Simulation in the Flow pane with Code Analyzer.

    Similar to before, run C Simulation but let’s start directly with Code Analyzer enabled, that’s the way the new versions are setup (you can disable Code Analyzer in the settings in the Component Settings window).

  2. Select Run under C Simulation in the Flow pane again.

    Let’s start C simulation and wait until it is run. When it completes, the Code Analyzer view will be enabled for selection under the C simulation reports.

  3. When it completes, expand the Reports dropdown in the Flow pane and click Code Analyzer.

    The Code Analyzer graph report will appear and show the top level hardware function for synthesis, polyvec_ntt. Similarly to the previous version, this function does not bring more insights to us. Let’s note the TI value of 804k cycles in a table for later reference and discussion. Let’s expand the process body and drill down the function hierarchy: we enter poly_ntt which calls ntt and we note its estimated TI value of 6281 clock.

Version polyvec_ntt TI ntt TI
Version0: baseline code 51.5M 402k
Version1: manual unroll 804k 6281

We can be pleased with this is a huge improvement over the previous version by explicitly exposing the parallelism.

  1. Use the Function dropdown to select ntt and wait for HLS to analyze this code.

    Upon selecting ntt, the HLS tool will process and display a detailed graph representing the function’s internal operations and their interdependencies.

    Code Analyzer view of ntt

    The graph illustrates the processes that comprise the function and the dependencies therein. Each node in the graph represents a computational process. In our function ntt, most of the nodes in this graph represent the sequential stages of the algorithm. The edges denote data dependencies between the processes, where the data in one array is accessed by multiple processes.

    At the bottom of the screen, there’s an additional panel where additional details on the processes and channels can be found:

    Code Analyzer ntt processes

    This “Processes” pane provides a list of all processes in the function. In addition, the transaction interval (TI) is provided as an estimate of performance, and code guidance can provide a suggestion for how one might refactor the code to improve performance, if applicable. The TI of each stages are in the range 643-1154 and are listed both in the graph and in the table.

    Furthermore, by selecting “Channels,” you can view an additional aspect of the analysis:

    Code Analyzer ntt channels

    The “Channels” pane lists every data transaction that was analyzed. A variable might show up in the list twice if multiple transactions occurred on that variable. The chart also provides data on each transaction, like the bitwidth, throughput, and total volume of data being passed.

    One thing is clear from these three views provided by the code analyzer; there is a multitude of dependencies which span several processes. We can dive in on the channel view to get more information.

    We can observe that array p is used several times. Let’s make use the search function to show only uses of p[*]:

  2. Select the search box and type p[*] to show only this array in the table. We’ve highlighted the first few lines of the table with their matching edges from the graph shown, they are also annotated in red on the screenshot.

    Code Analyzer ntt channels

In the main graph, we can expand any of the processes labelled ntt_stageN to investigate the source of the problem. In each of the stages of this algorithm, the data accesses that same variable p. This is fine on a sequential processor like a CPU but will restrict performance on a parallel architecture.

In the next section, we’ll address this dependency on p and show one simple way to eliminate this contention and pave the way for acceleration.