Multiple cascaded kernel chains can run in parallel using the TP_SSR template parameter. The input matrix is split across the TP_DIM_A dimension, with one portion for each SSR rank (that is, each parallel cascade chain). The input vector is not split by SSR; it is only split when TP_CASC_LEN > 1. Each SSR rank produces an equal share of the output. The outputs of the SSR ranks should be concatenated to produce the final output of the matrix-vector multiplication.
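The row split and output concatenation described above can be sketched with a small host-side reference model. This is a conceptual illustration only: the function name matVecMulSsr is hypothetical, and the row-major storage and contiguous block of rows per rank are assumptions made for the sketch, not a statement of the library's data layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference model (not library code): computes A * b by giving each SSR rank
// an equal block of rows of A and the whole vector b, then concatenating the
// per-rank results in rank order.
// Assumption: A is stored row-major as dimA rows of dimB columns.
std::vector<int> matVecMulSsr(const std::vector<int>& a,
                              const std::vector<int>& b,
                              std::size_t dimA, std::size_t dimB,
                              std::size_t ssr) {
    assert(ssr > 0 && dimA % ssr == 0);  // rows must divide evenly across ranks
    assert(a.size() == dimA * dimB && b.size() == dimB);

    const std::size_t rowsPerRank = dimA / ssr;
    std::vector<int> out;
    out.reserve(dimA);

    for (std::size_t rank = 0; rank < ssr; ++rank) {
        // Each rank handles its own rows but reads the full input vector.
        for (std::size_t r = 0; r < rowsPerRank; ++r) {
            const std::size_t row = rank * rowsPerRank + r;
            int acc = 0;
            for (std::size_t c = 0; c < dimB; ++c) {
                acc += a[row * dimB + c] * b[c];
            }
            out.push_back(acc);  // appending in rank order concatenates the outputs
        }
    }
    return out;
}
```

With a 4x2 matrix and two ranks, each rank produces two of the four output samples, and the concatenated result matches the single-rank result.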
The number of rows in the matrix (TP_DIM_A) must be a multiple of 256 / 8 / sizeof(TT_DATA_A). This is equivalent to the number of samples of TT_DATA_A that can occupy a 256-bit register. When SSR is being used, the value of TP_DIM_A must also be a multiple of TP_SSR.
The number of columns in the matrix, which is also the size of the input vector (TP_DIM_B), must be a multiple of 256 / 8 / sizeof(TT_DATA_B). When multiple kernels are used in the cascade, the value of TP_DIM_B must also be a multiple of TP_CASC_LEN.
Matrix and vector input data can be zero-padded to meet these requirements.
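As an illustration of the sizing rules above, the following sketch computes the padded size of a dimension. The helper names samplesPer256Bits, gcdOf, and paddedDim are hypothetical and not part of the library, and std::int16_t simply stands in for a 2-byte TT_DATA_A.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of samples of type T that fit in a 256-bit register:
// 256 bits / 8 bits per byte / sizeof(T) bytes per sample.
template <typename T>
constexpr std::size_t samplesPer256Bits() {
    return 256 / 8 / sizeof(T);
}

// Greatest common divisor (Euclid's algorithm).
constexpr std::size_t gcdOf(std::size_t x, std::size_t y) {
    while (y != 0) {
        const std::size_t t = x % y;
        x = y;
        y = t;
    }
    return x;
}

// Hypothetical helper: smallest value >= dim that is a multiple of both
// `granularity` (samples per register) and `parallelFactor` (for example
// TP_SSR for TP_DIM_A). Both factors must be non-zero.
constexpr std::size_t paddedDim(std::size_t dim, std::size_t granularity,
                                std::size_t parallelFactor) {
    // A value is a multiple of both factors exactly when it is a multiple
    // of their least common multiple.
    const std::size_t step =
        granularity / gcdOf(granularity, parallelFactor) * parallelFactor;
    return (dim + step - 1) / step * step;
}
```

For example, with 2-byte samples (16 per register) and a parallel factor of 3, a dimension of 100 would be padded to 144, and the added rows would be filled with zeros. The same helper could be applied to TP_DIM_B using its own register granularity and TP_CASC_LEN.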
You can find a full list of descriptions and parameters in API Reference Overview.
Connections to the cascade and SSR ports can be made as follows:
for (int ssrIdx = 0; ssrIdx < TP_SSR; ssrIdx++) {
    for (int cascIdx = 0; cascIdx < TP_CASC_LEN; cascIdx++) {
        connect<>(inA[(ssrIdx * TP_CASC_LEN) + cascIdx], matrix_vector_mulGraph.inA[(ssrIdx * TP_CASC_LEN) + cascIdx]);
        connect<>(inB[(ssrIdx * TP_CASC_LEN) + cascIdx], matrix_vector_mulGraph.inB[(ssrIdx * TP_CASC_LEN) + cascIdx]);
    }
    connect<>(matrix_vector_mulGraph.out[ssrIdx], out[ssrIdx]);
}
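The port indexing in the loop above implies TP_SSR * TP_CASC_LEN ports for each of inA and inB, and TP_SSR ports for out. The following standalone sketch reproduces only the index arithmetic, outside of any graph code; portIndex and visitedPorts are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat port index used for inA and inB: ports of one SSR rank are adjacent,
// ordered by their position in the cascade chain.
constexpr std::size_t portIndex(std::size_t ssrIdx, std::size_t cascIdx,
                                std::size_t cascLen) {
    return ssrIdx * cascLen + cascIdx;
}

// Lists the flat indices in the order the nested loops visit them.
std::vector<std::size_t> visitedPorts(std::size_t ssr, std::size_t cascLen) {
    std::vector<std::size_t> ports;
    ports.reserve(ssr * cascLen);
    for (std::size_t ssrIdx = 0; ssrIdx < ssr; ++ssrIdx) {
        for (std::size_t cascIdx = 0; cascIdx < cascLen; ++cascIdx) {
            ports.push_back(portIndex(ssrIdx, cascIdx, cascLen));
        }
    }
    return ports;
}
```

With two SSR ranks and a cascade length of three, the loops visit indices 0 through 5 exactly once each, so ports 0 to 2 belong to the first rank and ports 3 to 5 to the second.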