Multiple kernels can be chained together in a cascade using the template parameter TP_CASC_LEN. The input matrix and vector are split along the TP_DIM_B dimension and processed by TP_CASC_LEN kernels. Each kernel passes its accumulated partial results to the next kernel over a cascade stream, and the final kernel in the chain writes the result to the output port. Cascade connections are made internally within the matrix-vector multiply graph, so the external interfaces to the graph remain unchanged.
Each AI Engine kernel in the cascade is given a sub-matrix and a slice of the vector, so the interface to the graph is an array of ports for both A and B. The split occurs along the TP_DIM_B dimension: the matrix data passed to each kernel has size TP_DIM_A x TP_DIM_B/TP_CASC_LEN, and the vector slice passed to each kernel contains TP_DIM_B/TP_CASC_LEN elements.
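For illustration, the following sketch shows the per-kernel data sizes implied by this split. The parameter values are hypothetical examples, not library defaults:

    // Illustrative only: per-kernel data sizes implied by splitting along TP_DIM_B.
    constexpr unsigned TP_DIM_A    = 32; // rows of the matrix
    constexpr unsigned TP_DIM_B    = 64; // columns of the matrix, and length of the vector
    constexpr unsigned TP_CASC_LEN = 4;  // number of kernels in the cascade

    // Each kernel receives a TP_DIM_A x (TP_DIM_B / TP_CASC_LEN) sub-matrix
    // and a (TP_DIM_B / TP_CASC_LEN)-element slice of the vector.
    constexpr unsigned colsPerKernel   = TP_DIM_B / TP_CASC_LEN;   // 16 columns per kernel
    constexpr unsigned matrixPerKernel = TP_DIM_A * colsPerKernel; // TT_DATA_A samples per kernel
    constexpr unsigned vectorPerKernel = colsPerKernel;            // TT_DATA_B samples per kernel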
The number of rows in the matrix (TP_DIM_A) must be a multiple of 256 / 8 / sizeof(TT_DATA_A), which is the number of samples of TT_DATA_A that can occupy a 256-bit register.
The number of columns in the matrix, and the size of the input vector (TP_DIM_B), must be a multiple of 256 / 8 / sizeof(TT_DATA_B). When multiple kernels are used in cascade, TP_DIM_B must also be a multiple of TP_CASC_LEN.
Matrix and vector input data can be zero-padded to meet these requirements.
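As a rough illustration of that padding, the following host-side sketch rounds the matrix dimensions up to the required multiples and zero-fills the extra entries. The data types and helper names are hypothetical and not part of the library API:

    #include <numeric>
    #include <vector>
    #include <cstdint>

    using TT_DATA_A = int16_t; // example matrix data type (2 bytes)
    using TT_DATA_B = int16_t; // example vector data type (2 bytes)
    constexpr unsigned TP_CASC_LEN = 2;

    // Samples of each type that fit in a 256-bit register, per the rules above.
    constexpr unsigned samplesPer256A = 256 / 8 / sizeof(TT_DATA_A);
    constexpr unsigned samplesPer256B = 256 / 8 / sizeof(TT_DATA_B);

    constexpr unsigned roundUp(unsigned value, unsigned multiple) {
        return ((value + multiple - 1) / multiple) * multiple;
    }

    // Zero-pad a row-major matrix so that the row count is a multiple of
    // samplesPer256A and the column count satisfies both the 256-bit and
    // cascade-length requirements. Padded entries contribute zero to the result.
    std::vector<TT_DATA_A> padMatrix(const std::vector<TT_DATA_A>& m, unsigned rows, unsigned cols) {
        const unsigned colMultiple = std::lcm(samplesPer256B, TP_CASC_LEN);
        const unsigned paddedRows  = roundUp(rows, samplesPer256A);
        const unsigned paddedCols  = roundUp(cols, colMultiple);
        std::vector<TT_DATA_A> padded(paddedRows * paddedCols, 0);
        for (unsigned r = 0; r < rows; ++r) {
            for (unsigned c = 0; c < cols; ++c) {
                padded[r * paddedCols + c] = m[r * cols + c];
            }
        }
        return padded;
    }

The input vector can be zero-extended to the same padded column count in the same way.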
Find a full list of descriptions and parameters in the API Reference Overview.
Connections to the graph's arrays of input ports and to its output port can be made as follows:
    for (int i = 0; i < TP_CASC_LEN; i++) {
        connect<>(inA[i], matrix_vector_mulGraph.inA[i]);
        connect<>(inB[i], matrix_vector_mulGraph.inB[i]);
    }
    connect<>(matrix_vector_mulGraph.out, out);
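For context, a minimal sketch of an enclosing graph that owns these ports is shown below. MatVecGraphT stands for a fully parameterized matrix_vector_mul_graph type (see the API Reference Overview for the actual template parameter list), and the wrapper class name is hypothetical:

    #include <adf.h>
    using namespace adf;

    // Hypothetical wrapper graph; MatVecGraphT is the parameterized library graph
    // and CASC_LEN must match the TP_CASC_LEN used in that type.
    template <typename MatVecGraphT, unsigned CASC_LEN>
    class my_top_graph : public graph {
       public:
        port<input>  inA[CASC_LEN]; // one split of the matrix per cascaded kernel
        port<input>  inB[CASC_LEN]; // one slice of the vector per cascaded kernel
        port<output> out;           // final result from the last kernel in the chain

        MatVecGraphT matrix_vector_mulGraph;

        my_top_graph() {
            for (int i = 0; i < CASC_LEN; i++) {
                connect<>(inA[i], matrix_vector_mulGraph.inA[i]);
                connect<>(inB[i], matrix_vector_mulGraph.inB[i]);
            }
            connect<>(matrix_vector_mulGraph.out, out);
        }
    };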