In this module, the code for the algorithm is moved into the header file cholesky_kernel.hpp.
The parallelization is now explicit, and the number of parallel compute units is determined by NCU, a constant set in cholesky_kernel.cpp through #define NCU 16.
NCU is passed as a template parameter to the chol_col_wrapper function (see below). The DATAFLOW pragma applies to the loop that calls chol_col NCU (here, 16) times:
template <typename T, int N, int NCU>
void chol_col_wrapper(int n,
                      T dataA[NCU][(N + NCU - 1) / NCU][N],
                      T dataj[NCU][N],
                      T tmp1,
                      int j)
{
#pragma HLS DATAFLOW
Loop_row:
    for (int num = 0; num < NCU; num++)
    {
#pragma HLS unroll factor = NCU
        chol_col<T, N, NCU>(n, dataA[num], dataj[num], tmp1, num, j);
    }
}
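To ensure that DATAFLOW can be applied, dataA is divided into NCU portions: its first dimension has size NCU, and each portion holds (N + NCU - 1) / NCU rows of N elements, so every call to chol_col receives its own portion, dataA[num].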
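The division of dataA into NCU portions can be illustrated with a small, plain C++ sketch (no HLS pragmas). The helper names rows_per_bank, bank_of, slot_of, and split_rows are hypothetical and do not come from the tutorial sources. The cyclic mapping (row i goes to portion i % NCU) is an assumption made for illustration; the text above only fixes the array shape, not how rows are assigned.

```cpp
#include <cassert>

// Rows per portion: ceiling of N / NCU, matching the
// (N + NCU - 1) / NCU dimension in the chol_col_wrapper signature.
constexpr int rows_per_bank(int n, int ncu) { return (n + ncu - 1) / ncu; }

// ASSUMPTION (illustrative only): rows are distributed cyclically,
// so row i lands in portion i % NCU at position i / NCU.
constexpr int bank_of(int row, int ncu) { return row % ncu; }
constexpr int slot_of(int row, int ncu) { return row / ncu; }

// Hypothetical helper: copy an N x N matrix into NCU portions.
// Call it with explicit template arguments, e.g. split_rows<float, 5, 2>(A, dataA).
template <typename T, int N, int NCU>
void split_rows(const T (&A)[N][N], T (&dataA)[NCU][(N + NCU - 1) / NCU][N])
{
    for (int i = 0; i < N; ++i)
    {
        // Each row must fit inside the per-portion row count.
        assert(slot_of(i, NCU) < rows_per_bank(N, NCU));
        for (int j = 0; j < N; ++j)
        {
            dataA[bank_of(i, NCU)][slot_of(i, NCU)][j] = A[i][j];
        }
    }
}
```

When N is not a multiple of NCU, the ceiling division leaves some trailing rows unused in the later portions; in this sketch they simply keep whatever value the caller initialized them with.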
Finally, the loop is unrolled with a factor of NCU. Because the loop has exactly NCU iterations, it is unrolled completely, which means NCU (i.e., 16) copies of chol_col are created, each working on its own chunk of the data.
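As a rough illustration of what complete unrolling means, the sketch below compares a rolled loop with its hand-expanded equivalent in plain C++. It is not HLS code: chol_col_stub is a hypothetical stand-in for chol_col, and DEMO_NCU is set to 4 (rather than 16) only to keep the expansion short. Each call touches only the portion it is given, so the calls are independent of one another.

```cpp
#include <cassert>

constexpr int DEMO_NCU = 4; // hypothetical small value; the tutorial uses 16
constexpr int DEMO_N = 8;   // hypothetical row length

// Hypothetical stand-in for chol_col: modifies only its own portion.
void chol_col_stub(int bank[DEMO_N], int num)
{
    for (int k = 0; k < DEMO_N; ++k)
    {
        bank[k] += num + 1;
    }
}

// Rolled form, mirroring the loop structure of chol_col_wrapper.
void wrapper_rolled(int data[DEMO_NCU][DEMO_N])
{
    for (int num = 0; num < DEMO_NCU; ++num)
    {
        chol_col_stub(data[num], num);
    }
}

// Conceptual result of unrolling by the full trip count:
// one call per portion, with no loop left.
void wrapper_unrolled(int data[DEMO_NCU][DEMO_N])
{
    chol_col_stub(data[0], 0);
    chol_col_stub(data[1], 1);
    chol_col_stub(data[2], 2);
    chol_col_stub(data[3], 3);
}
```

Both functions compute the same result in software. The difference only matters to the HLS tool, which can turn the independent calls of the unrolled form into separate hardware instances.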