Version1 is based on Version0; the only code change so far is that we've manually unrolled the loop named ntt_stage_outer in the function ntt. The previous code looks like this:
```c
ntt_stage_outer: for (len = K; len >= 2; len >>= 1) {
    for (start = 0; ...
        for (j = start; ...
            // loop body
```
After unrolling ntt_stage_outer with induction variable len, the code becomes:
```c
// Stage 0
len = K; // K is defined to be 128, so len is 128
for (start = 0; ...
    for (j = start; ...
        // loop body

// Stage 1
len >>= 1; // len is now 64
for (start = 0; ...
    for (j = start; ...
        // loop body

[..repeat..]

// Stage 6
len >>= 1; // len is now 2
for (start = 0; ...
    for (j = start; ...
        // loop body

// Manual unrolling stops here: the original loop condition is
// len >= 2, so the next iteration would not happen.
```
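To make the transformation concrete, here is a minimal, self-contained sketch of the rolled and unrolled forms. Note the assumptions: N, Q, and the simple add/sub butterfly() are illustrative stand-ins (the real ntt uses a zeta-weighted Cooley-Tukey butterfly and the project's actual modulus); only the loop structure mirrors the code above.

```c
#include <stdint.h>

#define K 128
#define N 256   /* polynomial length; assumed to be 2*K for this sketch */
#define Q 3329  /* example modulus; stand-in for the project's actual q  */

/* Simplified stand-in for the real zeta-weighted Cooley-Tukey butterfly. */
static void butterfly(int32_t *a, int32_t *b) {
    int32_t t = *b;
    *b = (*a - t + Q) % Q;
    *a = (*a + t) % Q;
}

/* Rolled form: the original ntt_stage_outer loop. */
static void ntt_rolled(int32_t p[N]) {
    for (int len = K; len >= 2; len >>= 1)
        for (int start = 0; start < N; start += 2 * len)
            for (int j = start; j < start + len; j++)
                butterfly(&p[j], &p[j + len]);
}

/* Unrolled form: one copy of the start/j loop nest per stage.
 * Stages 1..6 are collapsed into a loop here for brevity; the tutorial's
 * Version1 repeats the nest verbatim, once per stage. */
static void ntt_unrolled(int32_t p[N]) {
    int len = K; /* Stage 0: len is 128 */
    for (int start = 0; start < N; start += 2 * len)
        for (int j = start; j < start + len; j++)
            butterfly(&p[j], &p[j + len]);
    for (int stage = 1; stage <= 6; stage++) {
        len >>= 1; /* Stages 1..6: len is 64, 32, 16, 8, 4, 2 */
        for (int start = 0; start < N; start += 2 * len)
            for (int j = start; j < start + len; j++)
                butterfly(&p[j], &p[j + len]);
    }
}
```

Both functions perform the same sequence of butterflies and therefore produce identical results; the unrolled form simply presents HLS with seven distinct loop nests (processes) to analyze instead of one.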
Let's begin by simulating the code provided in Version1.

Run C Simulation in the Flow pane with Code Analyzer.

Similar to before, run C Simulation, but this time start directly with Code Analyzer enabled; that is how the new versions are set up (you can disable Code Analyzer in the Component Settings window). Select Run under C Simulation in the Flow pane and wait for the simulation to complete. When it does, the Code Analyzer view becomes available under the C simulation reports: expand the Reports dropdown in the Flow pane and click Code Analyzer.
The Code Analyzer graph report will appear and show the top-level hardware function for synthesis, polyvec_ntt. As with the previous version, this function by itself does not give us much insight. Let's note its TI value of 804k cycles in a table for later reference and discussion, then expand the process body and drill down the function hierarchy: we enter poly_ntt, which calls ntt, and we note its estimated TI value of 6281 cycles.
| Version | polyvec_ntt TI | ntt TI |
|---|---|---|
| Version0: baseline code | 51.5M | 402k |
| Version1: manual unroll | 804k | 6281 |
This is a huge improvement over the previous version, roughly a 64x reduction in TI at both levels, obtained by explicitly exposing the parallelism.
Use the Function dropdown to select ntt and wait for HLS to analyze this code. Upon selecting ntt, the HLS tool will process the function and display a detailed graph of its internal operations and their interdependencies. The graph illustrates the processes that make up the function and the dependencies between them. Each node in the graph represents a computational process; in our function ntt, most of the nodes correspond to the sequential stages of the algorithm. The edges denote data dependencies between the processes, where the data in one array is accessed by multiple processes. At the bottom of the screen, an additional panel provides further details on the processes and channels:
This "Processes" pane lists all processes in the function. In addition, the transaction interval (TI) is provided as an estimate of performance, and code guidance can suggest how one might refactor the code to improve performance, where applicable. The TI of each stage is in the range 643-1154; the values are listed both in the graph and in the table.
Furthermore, by selecting “Channels,” you can view an additional aspect of the analysis:
The "Channels" pane lists every data transaction that was analyzed. A variable might appear in the list more than once if multiple transactions occurred on it. The pane also provides details on each transaction, such as the bitwidth, throughput, and total volume of data being passed.
One thing is clear from these three views provided by the Code Analyzer: there is a multitude of dependencies spanning several processes. We can drill into the Channels view to get more information.
We can observe that the array p is used several times. Let's use the search function to show only uses of p[*]: select the search box and type p[*] to filter the table down to this array. We've highlighted the first few lines of the table together with their matching edges in the graph; they are also annotated in red on the screenshot.
In the main graph, we can expand any of the processes labelled ntt_stageN to investigate the source of the problem. Each stage of the algorithm accesses the same array p. This is fine on a sequential processor like a CPU, but it will restrict performance on a parallel architecture.
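The extent of the contention is easy to check: the start/j loop nest of every stage sweeps the entire array. Here is a small sketch under the assumption of a 256-element array; instead of computing anything, it merely records which indices of p a single stage with a given len would touch:

```c
#include <stdbool.h>
#include <string.h>

#define N 256 /* assumed polynomial length for this sketch */

/* Record every index of p that one stage with the given len reads or writes. */
static void mark_stage_accesses(bool touched[N], int len) {
    memset(touched, 0, N * sizeof(bool));
    for (int start = 0; start < N; start += 2 * len)
        for (int j = start; j < start + len; j++) {
            touched[j] = true;       /* p[j] is both read and written     */
            touched[j + len] = true; /* p[j+len] is both read and written */
        }
}
```

For every len from 128 down to 2, the marked set covers all N indices: each stage writes values that the next stage reads, so the stages serialize as long as p lives in a single memory.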
In the next section, we'll address this dependency on p and show one simple way to eliminate this contention and pave the way for acceleration.