Code Analyzer let us analyze and improve the design rapidly without running C Synthesis. Let's now compare its estimates with the results from C Synthesis:
Version | CA: polyvec_ntt TI | CA: ntt TI | C Synthesis: polyvec_ntt latency = interval |
---|---|---|---|
Version0: baseline code | 51.5M | 402k | not computed |
Version1: manual unroll | 804k | 6281 | 688k |
Version2: remove p[] dependencies | 230k | 1540 | 295k |
Version3: function template + dataflow + independent IO | 199k | 1552 | 100k |
The Code Analyzer estimates guided us toward the correct code changes. Although they do not exactly match the C Synthesis results, they show the same trend and the same order of magnitude, which is what allowed us to make changes rapidly.
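To make the "same order of magnitude" statement concrete, the small sketch below divides the Code Analyzer polyvec_ntt TI by the C Synthesis figure for each version that was synthesized. The values are copied from the table above, so they carry its rounding; the struct and function names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// One table row: Code Analyzer estimate vs. C Synthesis result for polyvec_ntt.
// Names are illustrative; values are the rounded figures from the table.
struct Estimate {
    const char* version;
    double code_analyzer_ti;  // CA: polyvec_ntt TI
    double c_synthesis;       // C Synthesis: polyvec_ntt latency = interval
};

static const Estimate kEstimates[] = {
    {"Version1", 804e3, 688e3},
    {"Version2", 230e3, 295e3},
    {"Version3", 199e3, 100e3},
};

// Ratio of the Code Analyzer estimate to the C Synthesis result.
inline double estimate_ratio(const Estimate& e) {
    assert(e.c_synthesis > 0.0);
    return e.code_analyzer_ti / e.c_synthesis;
}

// True when the two figures differ by less than a factor of ten.
inline bool same_order_of_magnitude(const Estimate& e) {
    return std::fabs(std::log10(estimate_ratio(e))) < 1.0;
}

// Prints one ratio per synthesized version.
inline void print_comparison() {
    for (const Estimate& e : kEstimates) {
        std::printf("%s: CA / C Synthesis = %.2f\n", e.version, estimate_ratio(e));
    }
}
```

Calling `print_comparison()` gives ratios of roughly 1.17, 0.78 and 1.99, so every estimate is within a factor of two of the synthesis result. Version0 is left out because C Synthesis was not computed for it.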
The change between Version2 and Version3 also shows what the DATAFLOW pragma and task-level parallelism bring. Without dataflow, both the latency and the interval of ntt are equal to the sum of those of the functions it calls. With dataflow, the initiation interval (II) of ntt becomes equal to the largest interval among its sub-stages. So while the latency is the same in both versions, the initiation interval is much better: C Synthesis reports a latency of 2114 cycles and an II of 770 cycles, about a third of the latency, because the stages run concurrently, each on a different iteration of the data. Compounded over the whole design, Version3 runs about 3 times faster than Version2 (100k versus 295k cycles for polyvec_ntt), which is the benefit of the DATAFLOW pragma and task-level parallelism.
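The sum-versus-maximum rule can be written down as a small software model. This is a simplified sketch, not what the tool computes: it ignores control overhead and buffer stalls, and it assumes each stage's interval equals its latency. The three stage values used in the usage note are made up, picked only so that their sum and maximum match the 2114 and 770 cycles quoted above; the real design has its own stage count and per-stage figures.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Interval of a region whose stages run one after another (no dataflow):
// a new input can start only after every stage has finished.
inline long sequential_interval(const std::vector<long>& stage_cycles) {
    return std::accumulate(stage_cycles.begin(), stage_cycles.end(), 0L);
}

// Interval of the same region with task-level pipelining (dataflow):
// stages overlap on successive inputs, so the slowest stage sets the pace.
inline long dataflow_interval(const std::vector<long>& stage_cycles) {
    return stage_cycles.empty()
               ? 0L
               : *std::max_element(stage_cycles.begin(), stage_cycles.end());
}

// Cycles needed to process `calls` back-to-back inputs: the first result
// appears after the full latency, each later one after one more interval.
inline long total_cycles(const std::vector<long>& stage_cycles, long calls,
                         bool dataflow) {
    assert(calls >= 0);
    if (calls == 0) return 0;
    const long latency = sequential_interval(stage_cycles);
    const long interval = dataflow ? dataflow_interval(stage_cycles) : latency;
    return latency + (calls - 1) * interval;
}
```

With the hypothetical stages `{770, 700, 644}`, `sequential_interval` returns 2114 and `dataflow_interval` returns 770. For four back-to-back calls, `total_cycles` gives 8456 cycles without dataflow and 4424 with it, and the gap approaches the 2114/770 ratio as the number of calls grows.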
The resource estimates below, from C Synthesis targeting a Versal Premium device (xcvp1202-vsva2785-1LP-i-L), further show that resource usage did not bloat between the initial and final versions. The biggest changes are in the DSPs, used to perform the computations in parallel, and in the BRAMs, used to implement the intermediate ping-pong buffers.
Version | polyvec_ntt latency (cycles) | BRAM | DSP | FF | LUT |
---|---|---|---|---|---|
Version0: baseline code | not computed | 1 (~0%) | 3 (~0%) | 20844 (1%) | 23036 (2%) |
Version1: manual unroll | 688k | 1 (~0%) | 21 (~0%) | 115896 (6%) | 134623 (14%) |
Version2: remove p[] dependencies | 295k | 9 (~0%) | 21 (~0%) | 9449 (~0%) | 11701 (1%) |
Version3: function template + dataflow + independent IO | 100k | 19 (~0%) | 21 (~0%) | 13720 (~0%) | 17940 (1%) |