Vitis HLS for Kernel Optimizations - 2023.1 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID
XD099
Release Date
2023-08-02
Version
2023.1 English

The C++ kernels destined to be implemented in the device look-up tables (LUTs) and flip-flops (also known as the "fabric") are automatically compiled with the high-level synthesis tool, Vitis HLS. In this tutorial, you run Vitis HLS "manually" to gain additional insight into the underlying synthesis technology and the Cholesky kernel algorithm.

Instructions for Vitis HLS:
  1. Open a terminal and set up Vitis.

  2. Navigate to ./build/cholesky_kernel_hw_emu/cholesky_kernel.

    • There should be yet another cholesky_kernel directory at that level.

  3. Run vitis_hls -p cholesky_kernel & (to start the Vitis high-level synthesis GUI).

  4. Vitis HLS now shows the high-level synthesis report.

  5. In the GUI, expand the Synthesis Summary Report window.

  6. Expand the loops and function in the Performance & Resources section.

  7. Right-click the II violation and select Goto Source.

NOTE: You can restore the original Vitis HLS window layout via the “Window” menu -> “Reset Perspective”.

Initiation Interval

You see an initiation interval (II) violation of eight for two loops in this function. One of them looks like this:

// Accumulation into tmp creates a loop-carried dependency: each iteration
// must wait for the previous double-precision add to complete before it can start.
for (int k = 0; k < j; k++)
{
    tmp += dataA[j][k] * dataA[j][k];
}

Because this version of the algorithm accumulates into a double, the hardware needs eight cycles at 300 MHz to complete one addition before the next one can start. As a result, a new sample can only enter the loop every eight cycles. This is the first bottleneck that you will tackle in the next module.
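
One common way to reduce the II of a floating-point accumulation is to split it into several independent partial sums, so that consecutive iterations no longer update the same register. The following is a minimal, illustrative sketch only; the names NACC and acc are made up for this example, and this is not necessarily the change applied in the next module:

// Accumulate into NACC independent partial sums so that each one is only
// updated every NACC iterations, hiding the 8-cycle double-add latency.
const int NACC = 8;
double acc[NACC];
#pragma HLS array_partition variable=acc complete
for (int i = 0; i < NACC; i++)
    acc[i] = 0.0;

for (int k = 0; k < j; k++)
{
#pragma HLS pipeline II=1
    acc[k % NACC] += dataA[j][k] * dataA[j][k];  // rotate across the partial sums
}

double tmp = 0.0;
for (int i = 0; i < NACC; i++)
    tmp += acc[i];  // final reduction of the partial sums

The trade-off is a small amount of extra logic for the partial sums and a short final reduction loop.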

Kernel Latency

Look at the latency.

cholesky_kernel/solution/syn/report/cholesky_kernel_csynth.rpt

* Loop:
+--------------------+--------+---------+-------------+-----------+-----------+------------+----------+
|                    | Latency (cycles) |  Iteration  |  Initiation Interval  |    Trip    |          |
|       Loop Name    |  min   |   max   |   Latency   |  achieved |   target  |    Count   | Pipelined|
+--------------------+--------+---------+-------------+-----------+-----------+------------+----------+
|- VITIS_LOOP_32_..  |       ?|        ?|            3|          1|          1|           ?|    yes   |
|- Loop_first_col    |       ?|        ?|           34|          1|          1|           ?|    yes   |
|- Loop_col          |       ?|        ?|            ?|          -|          -|           ?|    no    |
| + Loop_diag        |      17|  2097161|           18|          8|          1| 1 ~ 262144 |    yes   |
| + Loop_row         |       ?|        ?| 61 ~ 2097205|          -|          -|           ?|    no    |
|  ++ Loop_vec_mul   |      17|  2097161|           18|          8|          1| 1 ~ 262144 |    yes   |
|- VITIS_LOOP_67_..  |       ?|        ?|            4|          1|          1|           ?|    yes   |
+--------------------+--------+---------+-------------+-----------+-----------+------------+----------+

Notice that:

  • Loops prefixed with VITIS_LOOP: these loops were labeled automatically by Vitis HLS because no label was applied to them in the source code. The other loops did have a label in the source, and that label is shown in the table (a short sketch of both cases follows this list).

  • The question marks (?) denote metrics that cannot be calculated because they depend on a scalar input to the function; in this example, the matrix size is configurable, so the latency varies with that size.

  • The last column, Pipelined, indicates whether a loop was constrained to process a new input on each cycle. Simple loops and the innermost loops of a nest are generally the ones pipelined automatically by the tool.
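
For reference, a loop label in the HLS C++ source is simply a C++ label placed immediately before the loop; it is then reused as the "Loop Name" in the report. The loop bodies and bounds below are illustrative, not the tutorial's exact source:

// Labeled loop: "Loop_diag" appears under "Loop Name" in the synthesis report.
Loop_diag:
for (int k = 0; k < j; k++) {
    tmp += dataA[j][k] * dataA[j][k];
}

// Unlabeled loop: Vitis HLS generates a name such as VITIS_LOOP_<line>_<index>.
for (int i = 0; i < dim; i++) {
    out[i] = in[i];
}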

As an input to the Cholesky function, the user passes the size of the matrix N (in the example you ran, it was 64).

  • The first loop runs N iterations at II=1 (iteration latency 3), so it completes in roughly N cycles.

  • Loop_first_col also runs N iterations at II=1 (iteration latency 34), so it too takes roughly N cycles.

  • Loop_col runs N times; each iteration executes Loop_diag (up to N iterations at II=8, roughly 8N cycles) followed by Loop_row, which itself runs up to N iterations, each one containing Loop_vec_mul (again roughly 8N cycles at II=8) plus a fixed overhead.

  • The last loop also runs N iterations, like the first one.

You can roughly estimate the total duration as: N(8N + N(8N + residual1) + residual2) = 8N³ + (8 + residual1)N² + residual2·N cycles, where the residual terms gather the fixed per-iteration overheads.

So essentially the algorithm latency grows with the cube of N, the size of the matrix.
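
To get a feel for that cubic growth, here is a small, purely illustrative snippet (not part of the tutorial sources) that evaluates only the dominant 8N³ term of the estimate and converts it to time at the 300 MHz kernel clock:

#include <cstdio>

int main() {
    const int sizes[] = {16, 64, 256};
    for (int N : sizes) {
        long long cycles = 8LL * N * N * N;    // dominant 8*N^3 term only
        double ms = cycles / 300e6 * 1e3;      // elapsed time at a 300 MHz clock
        std::printf("N = %3d : ~%lld cycles, ~%.2f ms\n", N, cycles, ms);
    }
    return 0;
}

Doubling the matrix size multiplies this dominant term by roughly eight.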