Comparison of all versions - 2025.1 English - XD261

Vitis Tutorials: Vitis HLS (XD261)

Document ID: XD261
Release Date: 2025-06-17
Version: 2025.1 English

Code Analyzer let us rapidly analyze and improve the design without running C Synthesis. Let’s now compare the estimates from Code Analyzer (CA) against those from C Synthesis:

| Version | CA: polyvec_ntt TI | CA: ntt TI | C-Synth: polyvec_ntt latency = interval |
|---|---|---|---|
| Version0: baseline code | 51.5M | 402k | not computed |
| Version1: manual unroll | 804k | 6281 | 688k |
| Version2: remove p[] dependencies | 230k | 1540 | 295k |
| Version3: function template + dataflow + independent IO | 199k | 1552 | 100k |

The Code Analyzer estimates guided us toward the correct code changes. Although they do not exactly match the C Synthesis estimates, they show the same trend and the same order of magnitude, which is what allowed us to make rapid changes.
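The "same trend and order of magnitude" claim can be checked with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from the table above (Version0 is left out because C Synthesis did not compute it), and the function and constant names are hypothetical, not part of any Vitis tool or API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// polyvec_ntt estimates in cycles for Version1..Version3, from the table above.
// The constant names are hypothetical and chosen for this sketch only.
const std::vector<double> kCodeAnalyzerTI = {804e3, 230e3, 199e3};
const std::vector<double> kCSynthInterval = {688e3, 295e3, 100e3};

// Ratio of the Code Analyzer estimate to the C Synthesis estimate.
double estimate_ratio(double code_analyzer, double c_synth) {
    assert(c_synth > 0.0);
    return code_analyzer / c_synth;
}

// True when the two estimates differ by less than a factor of 10.
bool same_order_of_magnitude(double code_analyzer, double c_synth) {
    const double r = estimate_ratio(code_analyzer, c_synth);
    return r > 0.1 && r < 10.0;
}

// True when every value is smaller than the one before it,
// i.e. each version improves on the previous one.
bool strictly_decreasing(const std::vector<double>& values) {
    for (std::size_t i = 1; i < values.size(); ++i) {
        if (values[i] >= values[i - 1]) return false;
    }
    return true;
}
```

With these inputs the ratios come out at roughly 1.17, 0.78, and 1.99 for Version1, Version2, and Version3, and both columns decrease from one version to the next.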

The change from Version2 to Version3 also makes the benefit of the DATAFLOW pragma and task-level parallelism obvious. Without dataflow, both the latency and the interval of ntt are equal to the sum of the latencies of the functions it calls. With dataflow, the initiation interval (II) of ntt becomes equal to the largest interval among its sub-stages. So while the latency is the same in both versions, the initiation interval is much better: C Synthesis reports a latency of 2114 cycles and an II of 770 cycles, which is about a third of the latency, because the stages run concurrently, each on a different iteration of the data. Compounded over all the calls, Version3 runs about 3 times faster than Version2 (100k versus 295k cycles in C Synthesis). This is the advantage of the DATAFLOW pragma and task-level parallelism.
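The sum-versus-maximum relationship described above can be sketched as a small timing model. This is a simplified illustration, not how Vitis HLS computes its reports: it assumes each stage's latency equals its interval, ignores handshake and buffer overhead, and uses hypothetical function names and a hypothetical call count.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Without dataflow the stages run back to back, so a new transaction
// can only start after all stages finish: interval = sum of the stages.
long sequential_interval(const std::vector<long>& stage_cycles) {
    return std::accumulate(stage_cycles.begin(), stage_cycles.end(), 0L);
}

// With dataflow the stages overlap on different transactions, so the
// slowest stage sets the pace: interval = maximum over the stages.
long dataflow_interval(const std::vector<long>& stage_cycles) {
    assert(!stage_cycles.empty());
    return *std::max_element(stage_cycles.begin(), stage_cycles.end());
}

// Cycles to complete n back-to-back transactions: the first one pays the
// full latency, each later one completes one interval after the previous.
long total_cycles(long latency, long interval, long n) {
    assert(n >= 1);
    return latency + (n - 1) * interval;
}
```

For three hypothetical stages of 100, 300, and 200 cycles, the sequential interval is 600 and the dataflow interval is 300. Plugging in the C Synthesis figures quoted above with a hypothetical count of 100 calls, `total_cycles(2114, 770, 100)` gives 78344 cycles against 211400 for `total_cycles(2114, 2114, 100)`, a speedup that approaches 2114 / 770 ≈ 2.7 as the number of calls grows.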

The resource estimates below, from C Synthesis targeting a Versal Premium device (xcvp1202-vsva2785-1LP-i-L), further show that the design resources did not bloat between the initial and final versions. The biggest changes are the DSP resources used to perform the computations in parallel and the BRAM used to implement the intermediate ping-pong buffers.

| Version | latency | BRAM | DSP | FF | LUT |
|---|---|---|---|---|---|
| Version0: baseline code | not computed | 1 (~0%) | 3 (~0%) | 20844 (1%) | 23036 (2%) |
| Version1: manual unroll | 688k | 1 (~0%) | 21 (~0%) | 115896 (6%) | 134623 (14%) |
| Version2: remove p[] dependencies | 295k | 9 (~0%) | 21 (~0%) | 9449 (~0%) | 11701 (1%) |
| Version3: function template + dataflow + independent IO | 100k | 19 (~0%) | 21 (~0%) | 13720 (~0%) | 17940 (1%) |