matrix_mult performs a General Matrix Multiply (GEMM), taking two input matrices of configurable dimensions and data type.
These are the template parameters used to configure the Matrix Multiply graph class.
Parameters:
TT_DATA_A | describes the type of individual data samples input of Matrix A to the gemm function. This is a typename and must be one of the following: int16, cint16, int32, cint32, float, cfloat. |
TT_DATA_B | describes the type of individual data samples input of Matrix B to the gemm function. This is a typename and must be one of the following: int16, cint16, int32, cint32, float, cfloat. The following rules apply:
|
TP_DIM_A | is an unsigned integer which describes the number of elements along the unique dimension (rows) of Matrix A. |
TP_DIM_AB | is an unsigned integer which describes the number of elements along the common dimension of Matrix A (columns) and Matrix B (rows). |
TP_DIM_B | is an unsigned integer which describes the number of elements along the unique dimension (columns) of Matrix B. |
TP_SHIFT | describes power of 2 shift down applied to the accumulation of product terms before each output. TP_SHIFT must be in the range 0 to 59 (61 for AIE1). |
TP_RND | describes the selection of rounding to be applied during the shift down stage of processing. Although TP_RND accepts unsigned integer values, descriptive macros are recommended where
|
TP_DIM_A_LEADING | describes the scheme in which the data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1. Note, a COL_MAJOR matrix can be transposed to become a ROW_MAJOR matrix. |
TP_DIM_B_LEADING | describes the scheme in which the data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1. |
TP_DIM_OUT_LEADING | describes the scheme in which the data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1. |
TP_ADD_TILING_A | describes whether or not to add an additional kernel to rearrange the matrix samples into their required position. Setting this option to 0 indicates that the rearrangement will be done externally to the AIE matrix multiply graph. |
TP_ADD_TILING_B | describes whether or not to add an additional kernel to rearrange the matrix samples into their required position. Setting this option to 0 indicates that the rearrangement will be done externally to the AIE matrix multiply graph. |
TP_ADD_DETILING_OUT | describes whether or not to add an additional kernel to rearrange the matrix samples into their required position. Setting this option to 0 indicates that the rearrangement will be done externally to the AIE matrix multiply graph. |
TP_INPUT_WINDOW_VSIZE_A | describes the number of samples in the window API used for input to Matrix A. It must be of size TP_DIM_A*TP_DIM_AB. |
TP_INPUT_WINDOW_VSIZE_B | describes the number of samples in the window API used for input to Matrix B. It must be of size TP_DIM_B*TP_DIM_AB. Note, the output window will be of size TP_DIM_A * TP_DIM_B. |
TP_CASC_LEN | describes the number of AIE kernels the matrix multiplication will be divided into in series. TP_CASC_LEN splits the operation over shared dimension TP_DIM_AB, where each kernel utilizes the cascade stream to pass partial accumulation results to the next kernel. In effect, dot(A,B) + C. |
TP_SAT | describes the selection of saturation to be applied during the shift down stage of processing. TP_SAT accepts unsigned integer values, where:
|
TP_SSR | describes the number of kernels (or cascaded kernel chains) that will compute the matrix multiplication in parallel. Each SSR rank will receive an equal-sized split (along the unique dimension) of Matrix A data. Matrix B data is not split when TP_SSR > 1; it is split only when TP_CASC_LEN > 1. The Matrix B inputs to a chain of cascaded kernels are the same for every SSR rank. |
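The way TP_CASC_LEN and TP_SSR divide the work can be illustrated with a plain C++ reference model. This is only a sketch, not the library's kernel code: the function names `gemm_ref` and `gemm_split` are hypothetical, both matrices are assumed to be row-major, the common dimension is assumed to divide evenly across cascade stages, and the shift is modelled as a plain right shift with no rounding or saturation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain reference GEMM. A is dimA x dimAB, B is dimAB x dimB, both row-major.
// Each accumulated sum is shifted down by `shift` bits (no rounding or saturation).
std::vector<int32_t> gemm_ref(const std::vector<int16_t>& a, const std::vector<int16_t>& b,
                              std::size_t dimA, std::size_t dimAB, std::size_t dimB,
                              unsigned shift) {
    std::vector<int32_t> out(dimA * dimB);
    for (std::size_t r = 0; r < dimA; ++r) {
        for (std::size_t c = 0; c < dimB; ++c) {
            int64_t acc = 0;
            for (std::size_t k = 0; k < dimAB; ++k) {
                acc += static_cast<int64_t>(a[r * dimAB + k]) * b[k * dimB + c];
            }
            out[r * dimB + c] = static_cast<int32_t>(acc >> shift);
        }
    }
    return out;
}

// The same product, organised the way the parameter descriptions split it:
//  - SSR: the rows of A are divided equally among `ssr` ranks; every rank uses all of B.
//  - Cascade: the common dimension is divided among `cascLen` stages; each stage adds
//    its partial dot product to the value received from the previous stage.
std::vector<int32_t> gemm_split(const std::vector<int16_t>& a, const std::vector<int16_t>& b,
                                std::size_t dimA, std::size_t dimAB, std::size_t dimB,
                                unsigned shift, std::size_t cascLen, std::size_t ssr) {
    assert(dimA % ssr == 0);       // equal-sized split of Matrix A across SSR ranks
    assert(dimAB % cascLen == 0);  // assumption: even split of the common dimension
    const std::size_t rowsPerRank = dimA / ssr;
    const std::size_t kPerStage = dimAB / cascLen;

    std::vector<int32_t> out(dimA * dimB);
    for (std::size_t rank = 0; rank < ssr; ++rank) {
        for (std::size_t r = rank * rowsPerRank; r < (rank + 1) * rowsPerRank; ++r) {
            for (std::size_t c = 0; c < dimB; ++c) {
                int64_t cascade = 0;  // value carried from stage to stage
                for (std::size_t stage = 0; stage < cascLen; ++stage) {
                    int64_t partial = 0;
                    for (std::size_t k = stage * kPerStage; k < (stage + 1) * kPerStage; ++k) {
                        partial += static_cast<int64_t>(a[r * dimAB + k]) * b[k * dimB + c];
                    }
                    cascade += partial;  // dot(A, B) + C for this stage
                }
                // The shift is applied once, to the final accumulation.
                out[r * dimB + c] = static_cast<int32_t>(cascade >> shift);
            }
        }
    }
    return out;
}
```

For any valid split, `gemm_split` should return the same values as `gemm_ref`, which is the point of the model: the cascade and SSR parameters change how the computation is distributed, not its result.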
template <
    typename TT_DATA_A,
    typename TT_DATA_B,
    unsigned int TP_DIM_A,
    unsigned int TP_DIM_AB,
    unsigned int TP_DIM_B,
    unsigned int TP_SHIFT,
    unsigned int TP_RND,
    unsigned int TP_DIM_A_LEADING = ROW_MAJOR,
    unsigned int TP_DIM_B_LEADING = COL_MAJOR,
    unsigned int TP_DIM_OUT_LEADING = ROW_MAJOR,
    unsigned int TP_ADD_TILING_A = 1,
    unsigned int TP_ADD_TILING_B = 1,
    unsigned int TP_ADD_DETILING_OUT = 1,
    unsigned int TP_INPUT_WINDOW_VSIZE_A = TP_DIM_A * TP_DIM_AB,
    unsigned int TP_INPUT_WINDOW_VSIZE_B = TP_DIM_B * TP_DIM_AB,
    unsigned int TP_CASC_LEN = 1,
    unsigned int TP_SAT = 1,
    unsigned int TP_SSR = 1
    >
class matrix_mult_graph: public graph

// fields

port <input> inA[TP_CASC_LEN * TP_SSR]
port <input> inB[TP_CASC_LEN * TP_SSR]
port <output> out[TP_SSR]
kernel m_MatmultKernels[TP_CASC_LEN * TP_SSR]
kernel untiler[TP_SSR]
kernel tilerA[TP_CASC_LEN * TP_SSR]
kernel tilerB[TP_CASC_LEN * TP_SSR]
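The numeric constraints stated in the parameter descriptions, and the port array sizes in the declaration, can be checked before instantiating the graph. The following is a minimal sketch in standalone C++. `MatMultParams` and every function in it are hypothetical helpers, not part of the library, and the even split of TP_DIM_AB across cascade stages is an assumption rather than a documented rule.

```cpp
#include <cassert>

// Hypothetical helper: holds the numeric template arguments of one configuration.
struct MatMultParams {
    unsigned dimA;          // TP_DIM_A
    unsigned dimAB;         // TP_DIM_AB
    unsigned dimB;          // TP_DIM_B
    unsigned shift;         // TP_SHIFT
    unsigned windowVsizeA;  // TP_INPUT_WINDOW_VSIZE_A
    unsigned windowVsizeB;  // TP_INPUT_WINDOW_VSIZE_B
    unsigned cascLen;       // TP_CASC_LEN
    unsigned ssr;           // TP_SSR
};

// TP_SHIFT must be in the range 0 to 59 (61 for AIE1).
constexpr bool shift_in_range(unsigned shift, bool isAie1) {
    return shift <= (isAie1 ? 61u : 59u);
}

// The input windows must hold exactly one Matrix A and one Matrix B.
constexpr bool windows_match(const MatMultParams& p) {
    return p.windowVsizeA == p.dimA * p.dimAB && p.windowVsizeB == p.dimB * p.dimAB;
}

// The output window holds TP_DIM_A * TP_DIM_B samples.
constexpr unsigned output_window_size(const MatMultParams& p) {
    return p.dimA * p.dimB;
}

// inA and inB are arrays of TP_CASC_LEN * TP_SSR ports; out is an array of TP_SSR ports.
constexpr unsigned num_input_ports(const MatMultParams& p) {
    return p.cascLen * p.ssr;
}
constexpr unsigned num_output_ports(const MatMultParams& p) {
    return p.ssr;
}

constexpr bool is_valid(const MatMultParams& p, bool isAie1) {
    return p.cascLen >= 1 && p.ssr >= 1
        && shift_in_range(p.shift, isAie1)
        && windows_match(p)
        && p.dimA % p.ssr == 0        // equal-sized split of Matrix A across SSR ranks
        && p.dimAB % p.cascLen == 0;  // assumption: even split across cascade stages
}
```

With these helpers, a 16 x 32 by 32 x 16 configuration using TP_CASC_LEN = 2 and TP_SSR = 2 would need four inA ports, four inB ports, and two out ports, and its output window would hold 256 samples.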