#include "matrix_mult_graph.hpp"
matrix_mult performs a GEneral Matrix Multiply (GEMM), taking two input matrices of configurable dimensions and data type.
These are the templates to configure the Matrix Multiply graph class.
TT_DATA_A | describes the type of individual data samples input of Matrix A to the gemm function. This is a typename and must be one of the following: int16, cint16, int32, cint32, float, cfloat. |
TT_DATA_B | describes the type of individual data samples input of Matrix B to the gemm function. This is a typename and must be one of the following: int16, cint16, int32, cint32, float, cfloat. The following rules apply:
TP_DIM_A | is an unsigned integer which describes the number of elements along the unique dimension (rows) of Matrix A. |
TP_DIM_AB | is an unsigned integer which describes the number of elements along the common dimension of Matrix A (columns) and Matrix B (rows). |
TP_DIM_B | is an unsigned integer which describes the number of elements along the unique dimension (columns) of Matrix B. |
TP_SHIFT | describes the power-of-2 shift down applied to the accumulation of product terms before each output. TP_SHIFT must be in the range 0 to 61. |
TP_RND | describes the selection of rounding to be applied during the shift down stage of processing. Although TP_RND accepts unsigned integer values, descriptive macros are recommended where
TP_DIM_A_LEADING | describes the scheme in which the data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1. Note, a COL_MAJOR matrix can be transposed to become a ROW_MAJOR matrix. |
TP_DIM_B_LEADING | describes the scheme in which the data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1. |
TP_DIM_OUT_LEADING | describes the scheme in which the data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1. |
TP_ADD_TILING_A | describes whether or not to add an additional kernel to rearrange the matrix samples into their required position. Setting this option to 0 indicates that the rearrangement will be done externally to the AIE matrix multiply graph. |
TP_ADD_TILING_B | describes whether or not to add an additional kernel to rearrange the matrix samples into their required position. Setting this option to 0 indicates that the rearrangement will be done externally to the AIE matrix multiply graph. |
TP_ADD_DETILING_OUT | describes whether or not to add an additional kernel to rearrange the matrix samples into their required position. Setting this option to 0 indicates that the rearrangement will be done externally to the AIE matrix multiply graph. |
TP_INPUT_WINDOW_VSIZE_A | describes the number of samples in the window API used for input to Matrix A. It must be of size TP_DIM_A*TP_DIM_AB*N. Typical use has N=1, however N>1 can be utilised to minimise overhead of window API. This parameter is optional and has a default value of TP_DIM_A*TP_DIM_AB (N=1). |
TP_INPUT_WINDOW_VSIZE_B | describes the number of samples in the window API used for input to Matrix B. It must be of size TP_DIM_B*TP_DIM_AB*M. Typical use has M=1, however M>1 can be utilised to minimise overhead of window API. This parameter is optional and has a default value of TP_DIM_B*TP_DIM_AB (M=1). Note, the output window will be of size: (TP_INPUT_WINDOW_VSIZE_A/TP_DIM_AB * TP_INPUT_WINDOW_VSIZE_B/TP_DIM_AB). When N and M are both 1, the output window size will be TP_DIM_A * TP_DIM_B. |
TP_CASC_LEN | describes the number of AIE Tiles to split the GEMM operation into. TP_CASC_LEN splits the operation over TP_DIM_AB, where each kernel utilises the cascade stream to pass partial accumulation results to the next kernel. In effect, dot(A,B) + C. Note, it is also possible to tile the operation over multiple AIE tiles by instantiating multiple GEMM graphs with smaller dimensions. |
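The note on TP_DIM_A_LEADING says that a COL_MAJOR matrix can be transposed to become a ROW_MAJOR matrix. The following is a minimal plain C++ sketch of that relationship, not library code: the constants only mirror the ROW_MAJOR = 0, COL_MAJOR = 1 values from the table, and every name in it (`SKETCH_ROW_MAJOR`, `element_offset`, `make_matrix`, `make_transposed_row_major`) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-ins for the ROW_MAJOR = 0 / COL_MAJOR = 1 values in the table.
constexpr unsigned int SKETCH_ROW_MAJOR = 0;
constexpr unsigned int SKETCH_COL_MAJOR = 1;

// Linear offset of element (row, col) in a rows x cols matrix.
constexpr std::size_t element_offset(unsigned int leading, std::size_t rows,
                                     std::size_t cols, std::size_t row,
                                     std::size_t col) {
    return leading == SKETCH_ROW_MAJOR ? row * cols + col : col * rows + row;
}

// Store a rows x cols matrix whose element (r, c) is 10*r + c,
// using the requested memory layout.
inline std::vector<int> make_matrix(unsigned int leading, std::size_t rows,
                                    std::size_t cols) {
    assert(leading == SKETCH_ROW_MAJOR || leading == SKETCH_COL_MAJOR);
    std::vector<int> buf(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            buf[element_offset(leading, rows, cols, r, c)] =
                static_cast<int>(10 * r + c);
    return buf;
}

// Store the transpose (cols x rows) of the same matrix in row-major order.
// Element (i, j) of the transpose equals element (j, i) of the original.
inline std::vector<int> make_transposed_row_major(std::size_t rows,
                                                  std::size_t cols) {
    std::vector<int> buf(rows * cols);
    for (std::size_t i = 0; i < cols; ++i)
        for (std::size_t j = 0; j < rows; ++j)
            buf[element_offset(SKETCH_ROW_MAJOR, cols, rows, i, j)] =
                static_cast<int>(10 * j + i);
    return buf;
}
```

In this sketch the column-major buffer of a 2 x 3 matrix is identical, sample for sample, to the row-major buffer of its 3 x 2 transpose, which is why the two layouts can be exchanged by transposing.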
template <
    typename TT_DATA_A,
    typename TT_DATA_B,
    unsigned int TP_DIM_A,
    unsigned int TP_DIM_AB,
    unsigned int TP_DIM_B,
    unsigned int TP_SHIFT,
    unsigned int TP_RND,
    unsigned int TP_DIM_A_LEADING = ROW_MAJOR,
    unsigned int TP_DIM_B_LEADING = COL_MAJOR,
    unsigned int TP_DIM_OUT_LEADING = ROW_MAJOR,
    unsigned int TP_ADD_TILING_A = 1,
    unsigned int TP_ADD_TILING_B = 1,
    unsigned int TP_ADD_DETILING_OUT = 1,
    unsigned int TP_INPUT_WINDOW_VSIZE_A = TP_DIM_A * TP_DIM_AB,
    unsigned int TP_INPUT_WINDOW_VSIZE_B = TP_DIM_B * TP_DIM_AB,
    unsigned int TP_CASC_LEN = 1
    >
class matrix_mult_graph: public graph

// typedefs

typedef matrix_mult <TT_DATA_A, TT_DATA_B, TP_DIM_A, (TP_DIM_AB/TP_CASC_LEN), TP_DIM_B, TP_SHIFT, TP_RND, TP_DIM_A_LEADING, TP_DIM_B_LEADING, TP_DIM_OUT_LEADING, (TP_INPUT_WINDOW_VSIZE_A/TP_CASC_LEN), (TP_INPUT_WINDOW_VSIZE_B/TP_CASC_LEN), cascIn, cascOut> matMultCasc
typedef typename std::conditional < (TP_CASC_LEN == 1), matMultCasc <false, false>, no_kernel>::type onlyMatMult
typedef typename std::conditional < (TP_CASC_LEN > 1), matMultCasc <false, true>, onlyMatMult>::type firstMatMult
typedef typename std::conditional < (TP_CASC_LEN > 1), matMultCasc <true, false>, firstMatMult>::type lastMatMult
typedef typename std::conditional < (TP_CASC_LEN > 2), matMultCasc <true, true>, lastMatMult>::type middleMatMult
typedef tilerKernelClass <tilingScheme.Atile, tilingScheme.ABtile, dimAPerKernel, (TP_DIM_AB/TP_CASC_LEN), TP_DIM_A_LEADING, TT_DATA_A> TilerClassA
typedef tilerKernelClass <tilingScheme.ABtile, tilingScheme.Btile, (TP_DIM_AB/TP_CASC_LEN), dimBPerKernel, TP_DIM_B_LEADING, TT_DATA_B> TilerClassB
typedef untilerKernelClass <tilingScheme.Atile, tilingScheme.Btile, dimAPerKernel, dimBPerKernel, TP_DIM_OUT_LEADING, outType_t <TT_DATA_A, TT_DATA_B>> DetilerClassOut

// structs

struct no_kernel

// fields

port <input> inA[TP_CASC_LEN]
port <input> inB[TP_CASC_LEN]
port <output> out
kernel m_MatmultKernels[TP_CASC_LEN]
kernel untiler
kernel tilerA[TP_CASC_LEN]
kernel tilerB[TP_CASC_LEN]
static constexpr middleMatMult::tilingStruct tilingScheme
static constexpr unsigned int dimAPerKernel
static constexpr unsigned int dimBPerKernel
static constexpr bool isRedundantTilerA
static constexpr bool isRedundantTilerB
static constexpr bool isRedundantTilerOut
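The onlyMatMult, firstMatMult, lastMatMult and middleMatMult typedefs form a chain of std::conditional selections keyed on TP_CASC_LEN. The sketch below reproduces that chain with stand-in types to show which cascade-in/cascade-out combination each position resolves to. `mock_kernel`, `mock_no_kernel`, `kernel_selection` and `check_kernel_selection` are hypothetical names for illustration only; the real kernel class takes many more template parameters.

```cpp
#include <cassert>
#include <type_traits>

// Stand-in for the kernel class: only the two cascade flags are modelled.
template <bool CascIn, bool CascOut>
struct mock_kernel {
    static constexpr bool casc_in = CascIn;
    static constexpr bool casc_out = CascOut;
};

// Stand-in for the placeholder type used when no kernel applies.
struct mock_no_kernel {};

// Same selection chain as the typedefs in the class synopsis.
template <unsigned int CASC_LEN>
struct kernel_selection {
    using only = typename std::conditional<(CASC_LEN == 1),
        mock_kernel<false, false>, mock_no_kernel>::type;
    using first = typename std::conditional<(CASC_LEN > 1),
        mock_kernel<false, true>, only>::type;
    using last = typename std::conditional<(CASC_LEN > 1),
        mock_kernel<true, false>, first>::type;
    using middle = typename std::conditional<(CASC_LEN > 2),
        mock_kernel<true, true>, last>::type;
};

// A single kernel has no cascade ports, and every alias falls back to it.
static_assert(std::is_same<kernel_selection<1>::middle,
                           mock_kernel<false, false>>::value,
              "CASC_LEN == 1 resolves every position to the stand-alone kernel");

// With two kernels there is no true middle; the alias falls back to 'last'.
static_assert(std::is_same<kernel_selection<2>::middle,
                           mock_kernel<true, false>>::value,
              "CASC_LEN == 2 has only a first and a last kernel");

// Runtime spot-check: from three kernels up, the middle one has both ports.
inline void check_kernel_selection() {
    assert(kernel_selection<3>::middle::casc_in &&
           kernel_selection<3>::middle::casc_out);
}
```

Each alias falls back to the previous one when its condition is false, so every name in the chain is a valid kernel type for any TP_CASC_LEN of 1 or more, even for positions that do not exist in a short cascade.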
port <input> inA [TP_CASC_LEN]
The input A data to the function. This input is a window of samples of TT_DATA_A type. The number of samples in the window is described by TP_INPUT_WINDOW_VSIZE_A, which is derived from TP_DIM_A and TP_DIM_AB.
port <input> inB [TP_CASC_LEN]
The input B data to the function. This input is a window of samples of TT_DATA_B type. The number of samples in the window is described by TP_INPUT_WINDOW_VSIZE_B, which is derived from TP_DIM_AB and TP_DIM_B.
port <output> out
A window API of TP_INPUT_WINDOW_VSIZE_A/TP_DIM_AB * TP_INPUT_WINDOW_VSIZE_B/TP_DIM_AB samples, or simply TP_DIM_A * TP_DIM_B samples of a derived output type.
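The window-size arithmetic stated above can be checked with a few constexpr helpers. This is a sketch of the documented formulas only; the helper names and the example dimensions (16, 8, 4) are arbitrary assumptions and are not part of the library.

```cpp
#include <cassert>

// True when vsize is a non-zero whole multiple of dim_unique * dim_ab,
// i.e. the window holds N >= 1 complete matrices.
constexpr bool is_whole_multiple(unsigned int vsize, unsigned int dim_unique,
                                 unsigned int dim_ab) {
    return vsize != 0 && vsize % (dim_unique * dim_ab) == 0;
}

// Output window size as documented:
// (TP_INPUT_WINDOW_VSIZE_A / TP_DIM_AB) * (TP_INPUT_WINDOW_VSIZE_B / TP_DIM_AB).
constexpr unsigned int output_window_vsize(unsigned int dim_ab,
                                           unsigned int vsize_a,
                                           unsigned int vsize_b) {
    return (vsize_a / dim_ab) * (vsize_b / dim_ab);
}

// Example dimensions, chosen only for illustration.
constexpr unsigned int DIM_A = 16, DIM_AB = 8, DIM_B = 4;

// Default windows (N = M = 1) give a TP_DIM_A * TP_DIM_B output window.
static_assert(output_window_vsize(DIM_AB, DIM_A * DIM_AB, DIM_B * DIM_AB) ==
                  DIM_A * DIM_B,
              "default windows produce DIM_A * DIM_B output samples");

// Same formula with validity checks on both input windows.
inline unsigned int checked_output_window_vsize(unsigned int dim_a,
                                                unsigned int dim_ab,
                                                unsigned int dim_b,
                                                unsigned int vsize_a,
                                                unsigned int vsize_b) {
    assert(is_whole_multiple(vsize_a, dim_a, dim_ab));
    assert(is_whole_multiple(vsize_b, dim_b, dim_ab));
    return output_window_vsize(dim_ab, vsize_a, vsize_b);
}
```

With these example dimensions the default input windows hold 128 and 32 samples, and the output window holds 64.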
kernel m_MatmultKernels [TP_CASC_LEN]
The array of kernels that will be created and mapped onto AIE tiles. The TP_CASC_LEN kernels will be connected to each other by the cascade interface.
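Splitting the operation over TP_DIM_AB means that each kernel in the chain multiplies a slice of the common dimension and adds its partial result to the accumulation received from the previous kernel. The plain C++ reference below models that arithmetic on the host; it is a sketch under simplifying assumptions (row-major storage for both inputs, integer data, no shift or rounding, no tiling) and does not use the AIE cascade stream. `matrix`, `partial_gemm` and `cascaded_gemm` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix storage for this sketch.
using matrix = std::vector<long long>;

// acc += A[:, k_begin:k_end] * B[k_begin:k_end, :]
// Models the work of one kernel in the cascade chain.
inline void partial_gemm(const matrix& a, const matrix& b, matrix& acc,
                         std::size_t dim_a, std::size_t dim_ab,
                         std::size_t dim_b, std::size_t k_begin,
                         std::size_t k_end) {
    for (std::size_t i = 0; i < dim_a; ++i)
        for (std::size_t j = 0; j < dim_b; ++j)
            for (std::size_t k = k_begin; k < k_end; ++k)
                acc[i * dim_b + j] += a[i * dim_ab + k] * b[k * dim_b + j];
}

// Split the common dimension into casc_len equal slices and pass the
// running accumulation from one slice to the next.
inline matrix cascaded_gemm(const matrix& a, const matrix& b,
                            std::size_t dim_a, std::size_t dim_ab,
                            std::size_t dim_b, std::size_t casc_len) {
    assert(casc_len > 0 && dim_ab % casc_len == 0);
    matrix acc(dim_a * dim_b, 0);
    const std::size_t slice = dim_ab / casc_len;
    for (std::size_t kernel = 0; kernel < casc_len; ++kernel)
        partial_gemm(a, b, acc, dim_a, dim_ab, dim_b,
                     kernel * slice, (kernel + 1) * slice);
    return acc;
}
```

Because each output sample is a sum over the common dimension, the result is the same for any cascade length that divides that dimension evenly; only the distribution of work changes.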
kernel untiler
The kernel that will be created when output tiling is enabled ( TP_ADD_DETILING_OUT = 1 ).
kernel tilerA [TP_CASC_LEN]
The array of kernels that will be created when tiling on input A is enabled ( TP_ADD_TILING_A = 1 ). The kernels will pre-process the data and send it through the cascade interface to the corresponding m_MatmultKernels.
kernel tilerB [TP_CASC_LEN]
The array of kernels that will be created when tiling on input B is enabled ( TP_ADD_TILING_B = 1 ). The kernels will pre-process the data and send it through the cascade interface to the corresponding m_MatmultKernels.
kernel* getKernels ()
Access function to get pointer to kernel (or first kernel in a chained configuration).
matrix_mult_graph ()
This is the constructor function for the Matrix Multiply graph.