Vitis HLS for Kernel Optimizations - 2022.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID: XD099
Release Date: 2022-12-01
Version: 2022.2 English

The C++ kernels destined to be implemented in the device LUTs and flops (a.k.a. the “fabric”) are automatically compiled with the high-level synthesis tool Vitis HLS. In this tutorial we run Vitis HLS “manually” to gain additional insight into the underlying synthesis technology and the Cholesky kernel algorithm.

Instructions for Vitis HLS:
  1. Open a terminal and set up Vitis

  2. Navigate to ./build/cholesky_kernel_hw_emu/cholesky_kernel

    • There should be yet another cholesky_kernel directory at that level

  3. Run: vitis_hls -p cholesky_kernel & (to start the Vitis high-level synthesis GUI)

  4. Vitis HLS now shows the high-level synthesis report

  5. In the GUI expand the Synthesis Summary Report window

  6. Expand the loops and function in the Performance & Resources section

  7. Right-click the II violation to locate it in the code (the original page demonstrates this in a 50-second looping GIF).

Note: you can restore the original Vitis HLS window layout via the “Window” menu -> “Reset Perspective”.

Initiation Interval

The report flags an II violation on two loops in this function: both achieve an II of 8 instead of the target of 1. One of them looks like this:

// Loop only takes one element every 8 clock cycles!!!
// We expected one every cycle (II of 1)
for (int k = 0; k < j; k++)
{
    tmp += dataA[j][k] * dataA[j][k];
}

Because this version of the algorithm accumulates into a double, each addition needs 8 cycles at 300 MHz to complete, and the next iteration cannot start until the running sum is available. The loop can therefore accept only one new sample every 8 cycles. This is the first bottleneck that we’ll tackle in the next module.
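To make the dependency concrete, here is a minimal sketch of the accumulation loop next to one common way of relaxing it: spreading the sum over several independent partial accumulators. This is not necessarily the approach the next module takes. The function names, the `row` vector and the `LANES` constant are hypothetical, and `LANES = 8` is an assumption chosen to match the 8-cycle latency quoted above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Baseline: every iteration reads and writes the same accumulator, so each
// addition has to wait for the previous one to finish.
double sum_of_squares_serial(const std::vector<double>& row, std::size_t j) {
    assert(j <= row.size());
    double tmp = 0.0;
    for (std::size_t k = 0; k < j; k++) {
        tmp += row[k] * row[k];
    }
    return tmp;
}

// Assumption: 8 lanes, matching the 8-cycle accumulation latency in the text.
constexpr std::size_t LANES = 8;

// Sketch: consecutive iterations update different partial sums, so a given
// accumulator is only revisited every LANES iterations.
double sum_of_squares_interleaved(const std::vector<double>& row, std::size_t j) {
    assert(j <= row.size());
    double partial[LANES] = {0.0};
    for (std::size_t k = 0; k < j; k++) {
        partial[k % LANES] += row[k] * row[k];
    }
    // Final reduction of the partial sums.
    double tmp = 0.0;
    for (std::size_t i = 0; i < LANES; i++) {
        tmp += partial[i];
    }
    return tmp;
}
```

Both functions compute the same sum up to floating-point rounding, since the interleaved version adds the terms in a different order. Whether a synthesis tool then reaches an II of 1 depends on how it analyzes the accesses to `partial`, so treat this as an illustration of the idea rather than a guaranteed fix.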

Kernel Latency

Let’s now look at latency. The loop table below is taken from the synthesis report:

cholesky_kernel/solution/syn/report/cholesky_kernel_csynth.rpt

* Loop:
+--------------------+--------+---------+-------------+-----------+-----------+------------+----------+
|                    | Latency (cycles) |  Iteration  |  Initiation Interval  |    Trip    |          |
|       Loop Name    |  min   |   max   |   Latency   |  achieved |   target  |    Count   | Pipelined|
+--------------------+--------+---------+-------------+-----------+-----------+------------+----------+
|- VITIS_LOOP_32_..  |       ?|        ?|            3|          1|          1|           ?|    yes   |
|- Loop_first_col    |       ?|        ?|           34|          1|          1|           ?|    yes   |
|- Loop_col          |       ?|        ?|            ?|          -|          -|           ?|    no    |
| + Loop_diag        |      17|  2097161|           18|          8|          1| 1 ~ 262144 |    yes   |
| + Loop_row         |       ?|        ?| 61 ~ 2097205|          -|          -|           ?|    no    |
|  ++ Loop_vec_mul   |      17|  2097161|           18|          8|          1| 1 ~ 262144 |    yes   |
|- VITIS_LOOP_67_..  |       ?|        ?|            4|          1|          1|           ?|    yes   |
+--------------------+--------+---------+-------------+-----------+-----------+------------+----------+

Notice that:

  • Loops with the VITIS prefix were labeled automatically by Vitis HLS because the source code gives them no label. The other loops are labeled in the source, and that label is shown in the table.

  • A question mark (?) denotes a metric that cannot be calculated because it depends on a scalar input to the function. In this example the matrix size is configurable, so the latency varies with the size.

  • The last column, “Pipelined”, indicates whether a loop was pipelined, that is, scheduled to accept new inputs every II cycles (ideally every cycle). Simple loops and the innermost loops of a nest are the ones the tool generally pipelines automatically.
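The three points above can be illustrated with a small sketch. The function `sum_lower_triangle` and its arguments are hypothetical and not part of the tutorial sources. `#pragma HLS PIPELINE` and `#pragma HLS LOOP_TRIPCOUNT` are Vitis HLS pragmas; a regular C++ compiler ignores them, possibly with a warning. The bounds given to `LOOP_TRIPCOUNT` are an assumption based on the N = 64 example and only affect what the report can display, not the generated hardware.

```cpp
#include <cassert>

// Hypothetical example: sums the lower triangle (diagonal included) of an
// n x n row-major matrix. 'n' is a run-time scalar, so the trip counts are
// unknown at synthesis time and would appear as '?' in the report.
int sum_lower_triangle(const int* data, int n) {
    assert(n >= 0);
    int total = 0;

    // Unlabeled loop: Vitis HLS would generate a VITIS_LOOP_... name for it.
    for (int j = 0; j < n; j++) {
        total += data[j * n + j];  // diagonal elements
    }

    // Labeled loops: the report would list them under these names.
Loop_rows:
    for (int j = 0; j < n; j++) {
    Loop_cols:
        for (int k = 0; k < j; k++) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=0 max=63
            total += data[j * n + k];  // strictly lower elements
        }
    }
    return total;
}
```

The innermost loop, `Loop_cols`, is the one a tool would typically pipeline, which is why the explicit `PIPELINE` pragma is placed there in this sketch.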

As an input to the Cholesky function the user passes the size of the matrix N (in the example we ran it was 64).

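The estimate below simply multiplies trip count by iteration latency. This overestimates pipelined loops, but it is enough to see how latency scales. The first loop requires N iterations with an iteration latency of 3, so it takes at most N×3 cycles. The Loop_first_col loop takes at most N×34 cycles. The Loop_col loop runs N times, and each of its iterations contains Loop_diag (up to N iterations, about 18N cycles) and Loop_row (up to N iterations, each running Loop_vec_mul for about 18N cycles plus a fixed overhead, residual1), plus a fixed overhead of its own (residual2). The last loop also requires N iterations, like the first one.

So we can roughly estimate the duration of the dominant Loop_col to be: N(18N + N(18N + residual1) + residual2) = 18N^3 + (18 + residual1)N^2 + residual2·N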

So essentially the algorithm latency grows with the cube of N, the size of the matrix.
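As a sanity check on these numbers, the sketch below encodes two small models. The first is the usual approximation for a pipelined loop, (trip count − 1) × II + iteration latency. With the Loop_diag figures from the table (II = 8, iteration latency 18) it gives 18 cycles for 1 iteration and 2097162 for 262144 iterations, one cycle more than the 17 and 2097161 in the report. The second is the rough estimate from the text above. The function names are hypothetical, and the residual values are left as parameters since the source does not quantify them.

```cpp
#include <cassert>
#include <cstdint>

// Approximate cycle count of a pipelined loop: the first iteration needs the
// full iteration latency, and each later iteration starts II cycles after
// the previous one.
std::uint64_t pipelined_loop_cycles(std::uint64_t trip_count,
                                    std::uint64_t ii,
                                    std::uint64_t iteration_latency) {
    assert(trip_count >= 1);
    return (trip_count - 1) * ii + iteration_latency;
}

// Rough estimate from the text: N(18N + N(18N + residual1) + residual2).
// residual1 and residual2 are the unquantified fixed overheads.
std::uint64_t rough_loop_col_cycles(std::uint64_t n,
                                    std::uint64_t residual1,
                                    std::uint64_t residual2) {
    return n * (18 * n + n * (18 * n + residual1) + residual2);
}
```

With both residuals set to zero, the rough estimate is 4792320 cycles for N = 64 and 38043648 for N = 128, a ratio of about 7.9. That is close to the factor of 8 expected from cubic growth when N doubles.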

Adding a C++ testbench for the kernel

For this tutorial we provide a pre-made C++ “main” program to wrap around the kernel and simulate in the Vitis HLS environment.

Instructions:

  1. In a terminal, from the docs directory:

    cp -r ./hls_tb ./module1_baseline/build/cholesky_kernel_hw_emu/cholesky_kernel
    cp ./module1_baseline/src/cholesky_kernel.hpp ./module1_baseline/build/cholesky_kernel_hw_emu/cholesky_kernel/hls_tb
    
  2. If the Vitis HLS GUI was closed, open it again:

    cd ./module1_baseline/build/cholesky_kernel_hw_emu/cholesky_kernel
    vitis_hls -p cholesky_kernel &
    
  3. In the “Explorer” window left pane of the GUI, locate “Test Bench” under “Source”. Right-click -> “Add file…”, select test_hls.cpp. Repeat this operation for the two data files: matrix_input.dat and golden_result.dat in ./hls_tb/tb_data

  4. Select “Project” -> “Run C simulation” in the main menu. This runs a purely functional simulation called “Csim”; none of what HLS synthesizes is involved.

  5. Select “Solution” -> “Run C Synthesis” -> “Active Solution”

  6. Run “Solution” -> “Run C/RTL Cosimulation”. In the popup window, click OK.

Vitis HLS cosimulation runs a cycle-accurate RTL simulation, which shows the actual latency in clock cycles. In the test bench the matrix is 16x16.
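For readers who have not written an HLS test bench before, the sketch below shows the general self-checking pattern: run the function under test, compare against known-good values, and report pass or fail through a return code. It is not the provided test_hls.cpp. `cholesky_reference` is a hypothetical software stand-in for the kernel, whose real interface is declared in cholesky_kernel.hpp. The golden values are generated in code here instead of being read from golden_result.dat, and the tolerance is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the kernel: computes the lower-triangular factor
// L of a symmetric positive-definite n x n matrix A (row-major), A = L * L^T.
// Returns false if A is not positive definite.
bool cholesky_reference(const std::vector<double>& a, std::vector<double>& l, int n) {
    assert(n > 0 && a.size() == static_cast<std::size_t>(n) * n);
    l.assign(static_cast<std::size_t>(n) * n, 0.0);
    for (int j = 0; j < n; j++) {
        double diag = a[j * n + j];
        for (int k = 0; k < j; k++) diag -= l[j * n + k] * l[j * n + k];
        if (diag <= 0.0) return false;
        l[j * n + j] = std::sqrt(diag);
        for (int i = j + 1; i < n; i++) {
            double s = a[i * n + j];
            for (int k = 0; k < j; k++) s -= l[i * n + k] * l[j * n + k];
            l[i * n + j] = s / l[j * n + j];
        }
    }
    return true;
}

// Self-checking test on a 16x16 matrix. A[i][j] = min(i, j) + 1 is symmetric
// positive definite, and its Cholesky factor is the all-ones lower triangle,
// which serves as the golden result. Returns 0 on pass, 1 on failure.
int run_testbench() {
    const int n = 16;
    std::vector<double> a(static_cast<std::size_t>(n) * n), l;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] = (i < j ? i : j) + 1.0;

    if (!cholesky_reference(a, l, n)) return 1;

    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            const double golden = (j <= i) ? 1.0 : 0.0;
            if (std::fabs(l[i * n + j] - golden) > 1e-9) mismatches++;
        }
    }
    std::printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

In an actual test bench, `main()` would return this value. Vitis HLS treats a non-zero return from `main()` as a failed C simulation or cosimulation, so the return code is what makes the check effective.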