The SSR operation is controlled by the parameter TP_SSR. SSR enables running multiple instances of a kernel in parallel, where each instance runs on a separate tile. The input data is split and distributed to the parallel kernel instances:
- Input matrix A is split and distributed to parallel kernels. The split is based on the columns, and thus TP_DIM_A_COLS must be divisible by TP_SSR.
- Input matrix B is not split and a copy of it is passed to each parallel kernel.
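
The data distribution described above can be modelled on the host side with a short sketch. This is not the library API: the function name `split_for_ssr`, the row-list matrix representation, and the assumption that each kernel receives one contiguous block of columns are all hypothetical, chosen only to illustrate the divisibility constraint and the fact that B is passed whole to every kernel.

```python
def split_for_ssr(a, b, tp_ssr):
    """Illustrative model of the SSR data split (hypothetical helper).

    a: matrix A as a list of rows.
    b: matrix B, handed unchanged to every kernel.
    Assumption: kernel k receives the k-th contiguous block of A's columns.
    """
    dim_a_cols = len(a[0])  # plays the role of TP_DIM_A_COLS
    if dim_a_cols % tp_ssr != 0:
        raise ValueError("TP_DIM_A_COLS must be divisible by TP_SSR")

    cols_per_kernel = dim_a_cols // tp_ssr
    kernel_inputs = []
    for k in range(tp_ssr):
        start = k * cols_per_kernel
        # Take the same column range from every row of A.
        a_part = [row[start:start + cols_per_kernel] for row in a]
        # B is not split: each kernel gets the full matrix.
        kernel_inputs.append((a_part, b))
    return kernel_inputs


a = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
b = [[1, 0],
     [0, 1],
     [1, 1],
     [0, 0]]
parts = split_for_ssr(a, b, tp_ssr=2)
```

With TP_SSR = 2 and four columns in A, each of the two modelled kernels receives two columns of A together with the full B. A value such as TP_SSR = 3 is rejected because 4 is not divisible by 3.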