Multiple kernels can be chained together in a cascade using the TP_CASC_LEN template parameter. The input matrix and vector are split along TP_DIM_B and processed by TP_CASC_LEN kernels. Each kernel passes its accumulated partial results to the next kernel over a cascade stream, and the final kernel in the chain writes the complete result to the output port. Cascade connections are made internally to the matrix multiply graph and external interfaces to the graph remain unchanged.
Each AI Engine kernel in the array is given a sub-matrix and a split of the vector, so the interface to the graph is an array of ports for both A and B. The split occurs along the TP_DIM_B dimension. For example, the matrix data to each kernel is of size TP_DIM_A x (TP_DIM_B/TP_CASC_LEN), and the vector to each kernel contains TP_DIM_B/TP_CASC_LEN elements.
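The split and accumulation described above can be modelled on the host with plain C++. This is only an illustrative sketch, not the library's kernel code: the function name cascadedMatVec is hypothetical, the matrix is assumed to be stored row-major, the data type is assumed to be int, and TP_DIM_B is assumed to be an exact multiple of TP_CASC_LEN. A single accumulator vector stands in for the cascade stream.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the cascade; not the AI Engine kernel code.
// a is assumed row-major with dimA rows and dimB columns; b has dimB elements.
std::vector<int> cascadedMatVec(const std::vector<int>& a,
                                const std::vector<int>& b,
                                std::size_t dimA, std::size_t dimB,
                                std::size_t cascLen) {
    // Assumption: the split is even, so dimB must be divisible by cascLen.
    assert(cascLen > 0 && dimB % cascLen == 0);
    assert(a.size() == dimA * dimB && b.size() == dimB);

    const std::size_t split = dimB / cascLen;  // columns handled per kernel
    std::vector<int> acc(dimA, 0);             // stands in for the cascade stream

    for (std::size_t k = 0; k < cascLen; ++k) {
        // Kernel k sees a dimA x split sub-matrix and split vector elements.
        const std::size_t colBegin = k * split;
        for (std::size_t row = 0; row < dimA; ++row) {
            for (std::size_t j = 0; j < split; ++j) {
                acc[row] += a[row * dimB + colBegin + j] * b[colBegin + j];
            }
        }
        // acc now holds the partial result that would be passed to kernel k + 1.
    }
    return acc;  // after the last kernel, acc is the full matrix-vector product
}
```

With a 2 x 4 matrix and a cascade length of 2, each modelled kernel handles a 2 x 2 sub-matrix and 2 vector elements, and the result matches the uncascaded (cascade length 1) product.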