The C++ kernels destined for the device look-up tables (LUTs) and flip-flops (also known as the “fabric”) are automatically compiled with the high-level synthesis tool, Vitis HLS. In this tutorial, you run Vitis HLS “manually” to gain additional insight into the underlying synthesis technology and the Cholesky kernel algorithm.
Instructions for Vitis HLS
Open a terminal and set up Vitis.
Navigate to ./build/cholesky_kernel_hw_emu/cholesky_kernel. There should be yet another cholesky_kernel directory at that level.
Run vitis_hls -p cholesky_kernel & to start the Vitis high-level synthesis GUI. Vitis HLS now shows the high-level synthesis report.
In the GUI, expand the Synthesis Summary Report window.
Expand the loops and functions in the Performance & Resources section.
Right-click the II violation, and select Goto Source.
NOTE: You can restore the original Vitis HLS window layout via the “Window” menu -> “Reset Perspective”.
Initiation Interval
You see an initiation interval (II) violation of eight for two loops in this function. One of them looks like this:
for (int k = 0; k < j; k++)
{
    tmp += dataA[j][k] * dataA[j][k];
}
Because this version of the algorithm accumulates into a double, each iteration must wait for the previous addition to finish: the silicon needs eight cycles at 300 MHz to complete the operation before starting the next. As a result, a new sample can only be processed every eight cycles. This is the first bottleneck that you will tackle in the next module.
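The dependency behind this violation is the running sum tmp: each addition needs the result of the previous one. The sketch below, written in plain C++ rather than HLS-specific code, contrasts that serial accumulation with one common workaround, which keeps several independent partial sums so that consecutive additions do not depend on each other. The function names, the kLanes constant, and the choice of eight lanes (matching the eight-cycle interval described above) are illustrative assumptions, and this is not necessarily the approach taken in the next module.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative assumption: one partial sum per cycle of the eight-cycle interval.
constexpr std::size_t kLanes = 8;

// Serial accumulation: every iteration waits for the previous value of tmp.
double sum_of_squares_serial(const std::vector<double>& row, std::size_t j) {
    assert(j <= row.size());
    double tmp = 0.0;
    for (std::size_t k = 0; k < j; ++k) {
        tmp += row[k] * row[k];
    }
    return tmp;
}

// Interleaved accumulation: iteration k updates lane k % kLanes, so two
// consecutive iterations never touch the same accumulator.
double sum_of_squares_interleaved(const std::vector<double>& row, std::size_t j) {
    assert(j <= row.size());
    double partial[kLanes] = {0.0};
    for (std::size_t k = 0; k < j; ++k) {
        partial[k % kLanes] += row[k] * row[k];
    }
    // Final reduction of the independent partial sums.
    double total = 0.0;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        total += partial[lane];
    }
    return total;
}
```

Both functions compute the same mathematical sum. Because floating-point addition is not associative, the interleaved version can differ from the serial one in the last bits for general inputs.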
Kernel Latency
Look at the latency.
cholesky_kernel/solution/syn/report/cholesky_kernel_csynth.rpt
* Loop:
+--------------------+--------+---------+-------------+-----------+-----------+------------+----------+
|                    | Latency (cycles) |  Iteration  |  Initiation Interval  |    Trip    |          |
|      Loop Name     |   min  |   max   |   Latency   |  achieved |   target  |    Count   | Pipelined|
+--------------------+--------+---------+-------------+-----------+-----------+------------+----------+
|- VITIS_LOOP_32_..  |       ?|        ?|            3|          1|          1|           ?|    yes   |
|- Loop_first_col    |       ?|        ?|           34|          1|          1|           ?|    yes   |
|- Loop_col          |       ?|        ?|            ?|          -|          -|           ?|    no    |
| + Loop_diag        |      17|  2097161|           18|          8|          1|  1 ~ 262144|    yes   |
| + Loop_row         |       ?|        ?| 61 ~ 2097205|          -|          -|           ?|    no    |
|  ++ Loop_vec_mul   |      17|  2097161|           18|          8|          1|  1 ~ 262144|    yes   |
|- VITIS_LOOP_67_..  |       ?|        ?|            4|          1|          1|           ?|    yes   |
+--------------------+--------+---------+-------------+-----------+-----------+------------+----------+
Notice that:
The VITIS-prefixed loops are loops automatically labeled by Vitis HLS because no label was applied to them in the source code. The other loops did have a label, which is shown in the table.
The question marks (?) denote a metric that cannot be calculated because it depends on a scalar input to the function. In this example, the matrix size is configurable, so the latency varies with the size.
The last column, “Pipelined”, indicates whether a loop was constrained to process its inputs at each cycle. Simple loops and the innermost loops of a nest are generally the ones pipelined automatically by the tool.
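To see how the iteration latency, the initiation interval, and the trip count combine into the min and max latency columns, the sketch below evaluates a simple relation for a pipelined loop. The relation is inferred from the figures in the report excerpt above (it reproduces the Loop_diag and Loop_vec_mul rows) and is not quoted from the tool documentation. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Latency in cycles of a pipelined loop, using a relation that matches the
// report excerpt: (trip_count - 1) * II + iteration_latency - 1.
// Assumption: inferred from the table figures, not from the Vitis HLS docs.
std::uint64_t pipelined_loop_latency(std::uint64_t trip_count,
                                     std::uint64_t initiation_interval,
                                     std::uint64_t iteration_latency) {
    assert(trip_count >= 1 && iteration_latency >= 1);
    return (trip_count - 1) * initiation_interval + iteration_latency - 1;
}
```

With the Loop_diag values (II of 8, iteration latency of 18), pipelined_loop_latency(1, 8, 18) gives 17 and pipelined_loop_latency(262144, 8, 18) gives 2097161, the min and max shown in the table. With the target II of 1, the same maximum trip count would give 262160 cycles.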
As an input to the Cholesky function, the user passes the size of the matrix N (in the example you ran, it was 64).
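Counting the full iteration latency for every iteration (a conservative simplification, because the iterations of a pipelined loop overlap), the first loop requires N iterations at II=1 and takes about N x 3 cycles, since its iteration latency is 3.
The Loop_first_col loop takes N x 34. The Loop_col loop runs N times, and each of its iterations contains Loop_diag (N x 18) and Loop_row (N x (N x 18 + residual1)), where residual1 is the fixed overhead of one Loop_row iteration outside Loop_vec_mul. The remaining overhead of one Loop_col iteration is residual2. The last loop also requires N iterations, like the first one.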
You can roughly estimate the duration to be:
N(18N + N(18N + residual1) + residual2) = 18N<sup>3</sup> + (18 + residual1)N<sup>2</sup> + residual2 x N
So essentially, the algorithm latency grows with the cube of N, the size of the matrix.
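The cubic growth can be checked numerically with a small sketch that evaluates the rough estimate above. The function names are hypothetical, and residual1 and residual2 are left as parameters whose default of 0 is only a placeholder, not a measured value.

```cpp
#include <cassert>
#include <cstdint>

// Rough cycle estimate from the text:
//   18*N^3 + (18 + residual1)*N^2 + residual2*N
// Assumption: the residual defaults of 0 are placeholders, not measurements.
std::uint64_t estimate_cycles(std::uint64_t n,
                              std::uint64_t residual1 = 0,
                              std::uint64_t residual2 = 0) {
    return 18 * n * n * n + (18 + residual1) * n * n + residual2 * n;
}

// Ratio between the estimates for matrix sizes 2N and N (residuals at 0).
// A value close to 8 indicates that the cubic term dominates.
double doubling_ratio(std::uint64_t n) {
    assert(n >= 1);
    return static_cast<double>(estimate_cycles(2 * n)) /
           static_cast<double>(estimate_cycles(n));
}
```

For N = 64, the size used in the example run, estimate_cycles(64) evaluates to 4792320 with both residuals at 0, and doubling_ratio(64) is about 7.94: doubling the matrix size multiplies the estimated latency by almost eight.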