AI Engine API User Guide (AIE) 2023.2
Overview
The AIE API encapsulates the matrix multiplication functionality in the aie::mmul class template.
This class template is parameterized with the matrix multiplication shape (MxKxN), the data types and, optionally, the requested accumulation precision. The resulting class defines a function that performs the multiplication and a data type for the result that can be converted to an accumulator/vector. The function interprets the input vectors as matrices as described by the shape parameters.
The following code snippet shows a portable sample blocked multiplication using the aie::mmul class. The matrices are assumed to be pre-tiled as defined by the mmul shape (MxK for A, KxN for B, and MxN for C).
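As a host-side illustration of the blocked scheme, the sketch below models the mul/mac semantics in plain C++. It is not AIE API code: `ref_mmul` and `mmul_blocked` are hypothetical names, accumulation is done in 64-bit integers without any shift, and tiles are assumed to be stored one after another in row-major tile order.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for aie::mmul<M, K, N, int16, int16>.
// Only the mul/mac semantics are modelled; this is not the AIE API type.
template <unsigned M, unsigned K, unsigned N>
struct ref_mmul {
    static constexpr unsigned size_A = M * K;
    static constexpr unsigned size_B = K * N;
    static constexpr unsigned size_C = M * N;

    std::array<std::int64_t, size_C> acc{};

    // acc = A x B, with A (MxK) and B (KxN) given in row-major layout.
    void mul(const std::int16_t* a, const std::int16_t* b) {
        acc.fill(0);
        mac(a, b);
    }

    // acc += A x B.
    void mac(const std::int16_t* a, const std::int16_t* b) {
        for (unsigned m = 0; m < M; ++m)
            for (unsigned n = 0; n < N; ++n)
                for (unsigned k = 0; k < K; ++k)
                    acc[m * N + n] += std::int64_t{a[m * K + k]} * b[k * N + n];
    }
};

// rowA, colA and colB count tiles, not elements (assumption of this sketch).
// A holds rowA x colA tiles of MxK, B holds colA x colB tiles of KxN,
// and the result holds rowA x colB tiles of MxN.
template <unsigned M, unsigned K, unsigned N>
std::vector<std::int64_t> mmul_blocked(unsigned rowA, unsigned colA, unsigned colB,
                                       const std::int16_t* pA, const std::int16_t* pB)
{
    using MMUL = ref_mmul<M, K, N>;
    assert(colA >= 1);  // at least one tile along the shared dimension

    std::vector<std::int64_t> c(static_cast<std::size_t>(rowA) * colB * MMUL::size_C);
    for (unsigned z = 0; z < rowA; ++z) {
        for (unsigned j = 0; j < colB; ++j) {
            MMUL tile;
            // The first product initializes the tile, the rest accumulate into it.
            tile.mul(pA + (z * colA) * MMUL::size_A, pB + j * MMUL::size_B);
            for (unsigned i = 1; i < colA; ++i)
                tile.mac(pA + (z * colA + i) * MMUL::size_A,
                         pB + (i * colB + j) * MMUL::size_B);
            std::copy(tile.acc.begin(), tile.acc.end(),
                      c.begin() + (z * colB + j) * MMUL::size_C);
        }
    }
    return c;
}
```

Calling `mmul_blocked<2, 2, 2>(1, 2, 1, a, b)` multiplies one row of two 2x2 tiles of A by one column of two 2x2 tiles of B and returns a single 2x2 result tile.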
Classes
struct aie::mmul< M_Elems, K_Elems, N_Elems, TypeA, TypeB, AccumTag >
Type that encapsulates a blocked matrix multiplication C = A x B.
Matrix Multiplication Modes
Supported Matrix Multiplication Modes
Arch. | 8b x 4b | 8b x 8b | 16b x 8b | 8b x 16b | 16b x 16b | 32b x 16b | 16b x 32b | 32b x 32b | bfloat16 x bfloat16 | float x float |
---|---|---|---|---|---|---|---|---|---|---|
AIE | 4x8x4 4x16x4a 8x8x4a 2x8x8 4x8x8a 1x16x8 2x16x8a 4x16x8a | 4x4x4 8x4x4a 4x8x4a 4x4x8a | 4x4x8a 4x4x4a 8x8x1ab | 4x4x4a 2x4x8a 4x4x8a 4x2x8a 8x8x1ab | 2x4x8a 4x4x4a 4x2x4a 2x2x4 2x4x4a 4x4x2a 2x2x8a | 4x2x2 2x4x8a 4x4x4a | 4x2x4a 2x2x2 2x4x2a 2x8x2a 4x2x2a 4x4x2a 2x4x4a 4x4x1a | 4x2x4a 2x2x2a 2x4x2ab 2x8x2ab 4x2x2a 4x4x2a 2x4x4a 4x4x1ab | ||
AIE-ML | 4x16x8 8x16x8a 4x32x8ab | 4x8x4ab 4x16x4ab 8x8x4ab 2x8x8 4x8x8 8x8x8a 1x16x8ab 2x16x8ab 4x16x8ab | 4x4x4ab 8x4x4ab 4x8x4 4x4x8 2x8x8 | 4x4x8ab 4x4x4ab | 4x4x4 2x4x8 4x4x8ab 4x2x8 8x2x8a | 2x4x8 4x4x4 4x2x4 | 2x4x8 4x4x4 | 4x2x4a 4x4x4ab 8x2x4a | 4x8x4 8x8x4a 4x16x8ab |
Arch. | 16b x c16b | 16b x c32b | c16b x 16b | c16b x c16b | c16b x 32b | c16b x c32b | 32b x c16b | 32b x c32b | c32b x 16b | c32b x c16b | c32b x 32b | c32b x c32b | float x cfloat | cfloat x float | cfloat x cfloat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AIE | 4x2x2 4x4x4a 4x4x1 | 2x4x2a 2x4x4a 2x8x2a 4x4x2a 4x4x1a | 2x2x4 2x2x8a 2x4x4a 2x4x8a 4x2x4a 4x4x2a 4x4x4a | 2x2x2 2x4x2a 2x8x2a 2x4x4a 4x2x2a 4x4x2a 4x2x4a 4x4x1a | 2x2x2 2x4x2a 2x8x2a 2x4x4a 4x2x2a 4x4x2a 4x2x4a 4x4x1a | 2x2x2a 2x4x2a 4x2x1a | 2x2x2 2x4x2a 2x8x2a 2x4x4a 4x2x2a 4x4x2a 4x2x4a 4x4x1a | 2x2x2a 2x4x2a 4x2x1a | 2x4x2a 2x8x2a 2x4x4a 4x4x2a | 2x2x2a 2x4x2a 4x4x1a | 1x2x2 2x2x2a 2x4x2a 4x4x1a | 1x2x2a 2x2x1a 2x2x1 | 2x2x2a 2x4x2a 4x2x1a | 2x2x2a 2x4x2a 4x4x1a 2x4x1ab | 2x2x2a 2x2x4a 2x4x2a 4x2x2a 4x2x1a |
AIE-ML | 2x4x8ab 4x4x4ab | 1x4x8ab 2x4x8ab | 1x2x4ab 1x2x8ab 2x2x8ab 1x4x8ab 2x4x8ab | 1x2x8ab |
- Note
- a - Emulated using multiple intrinsic calls
- b - Requires additional data manipulation
GEMM leveraging multidimensional addressing
- Note
- Multi-dimensional addressing and the corresponding tensor buffer streams were introduced with AIE-ML
Below is an example of an optimized bfloat16 GEMM kernel in which both input matrices, A and B, are addressed in the following 4D patterns:
It is assumed that the data for both input matrices are pre-tiled and that the tiles are laid out in column-major order in memory.
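For readers unfamiliar with multidimensional addressing, the generic sketch below shows how a 4D pattern of (size, stride) pairs expands into a sequence of element offsets. The `dim` struct, the `offsets_4d` function and the sizes and strides used with them are purely illustrative assumptions; they are neither the patterns used by the kernel nor the AIE API tensor descriptor interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// One dimension of a hypothetical address pattern:
// iterate `size` times, advancing by `stride` elements each time.
struct dim {
    std::size_t size;
    std::ptrdiff_t stride;
};

// Enumerates the element offsets visited by a 4D pattern.
// d[0] is the innermost (fastest varying) dimension.
std::vector<std::ptrdiff_t> offsets_4d(const std::array<dim, 4>& d)
{
    for (const dim& x : d) assert(x.size > 0);

    std::vector<std::ptrdiff_t> out;
    for (std::size_t i3 = 0; i3 < d[3].size; ++i3)
        for (std::size_t i2 = 0; i2 < d[2].size; ++i2)
            for (std::size_t i1 = 0; i1 < d[1].size; ++i1)
                for (std::size_t i0 = 0; i0 < d[0].size; ++i0)
                    out.push_back(static_cast<std::ptrdiff_t>(i0) * d[0].stride +
                                  static_cast<std::ptrdiff_t>(i1) * d[1].stride +
                                  static_cast<std::ptrdiff_t>(i2) * d[2].stride +
                                  static_cast<std::ptrdiff_t>(i3) * d[3].stride);
    return out;
}
```

With the illustrative pattern `{{2, 1}, {2, 4}, {1, 0}, {1, 0}}`, the function visits two consecutive elements, jumps four elements ahead, and visits two more.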
Supported Sparse Matrix Multiplication Modes
AIE-ML introduced hardware support for sparse matrix multiplication. For an M x K x N matrix multiplication with A being M x K, B being K x N, and C being M x N, a sparse B matrix may be stored in memory using a data layout which avoids storing zero values.
- Note
- Sparse matrix multiplications require that the sparse data be stored in column major layout. An internal transpose of the partially decompressed data is required by the underlying intrinsics and is carried out automatically by the API.
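The idea of a layout that avoids storing zero values can be sketched on the host with a generic mask-plus-values scheme. This is an assumption made for illustration only: the encoding used by `sparse_vector` and the underlying hardware is not described in this section and may differ. `sparse_b`, `compress` and `mul_sparse` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical compressed form of a K x N matrix B, stored column by column.
struct sparse_b {
    unsigned K = 0;
    unsigned N = 0;
    std::vector<std::uint64_t> mask;  // per column: bit k is set if B[k][n] != 0
    std::vector<std::int8_t> values;  // nonzero values only, in column-major order
};

// Builds the compressed form from a dense column-major B.
sparse_b compress(const std::vector<std::int8_t>& b_col_major, unsigned K, unsigned N)
{
    assert(K <= 64 && b_col_major.size() == static_cast<std::size_t>(K) * N);
    sparse_b s;
    s.K = K;
    s.N = N;
    s.mask.assign(N, 0);
    for (unsigned n = 0; n < N; ++n)
        for (unsigned k = 0; k < K; ++k) {
            const std::int8_t v = b_col_major[static_cast<std::size_t>(n) * K + k];
            if (v != 0) {
                s.mask[n] |= std::uint64_t{1} << k;
                s.values.push_back(v);
            }
        }
    return s;
}

// C (M x N, row-major) = A (M x K, row-major) x B (compressed).
std::vector<std::int32_t> mul_sparse(const std::vector<std::int8_t>& a, unsigned M,
                                     const sparse_b& b)
{
    assert(a.size() == static_cast<std::size_t>(M) * b.K);
    std::vector<std::int32_t> c(static_cast<std::size_t>(M) * b.N, 0);
    std::size_t next = 0;  // index of the next stored nonzero value
    for (unsigned n = 0; n < b.N; ++n)
        for (unsigned k = 0; k < b.K; ++k) {
            if (((b.mask[n] >> k) & 1u) == 0) continue;  // zero entry: nothing stored
            const std::int32_t v = b.values[next++];
            for (unsigned m = 0; m < M; ++m)
                c[m * b.N + n] += std::int32_t{a[m * b.K + k]} * v;
        }
    return c;
}
```

Only the nonzero entries of B contribute multiply-accumulate work, which is the property the sparse modes exploit.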
Arch. | 8b x 4b | 8b x 8b | 16b x 8b | 16b x 16b | bfloat16 x bfloat16 |
---|---|---|---|---|---|
AIE-ML | 4x32x8 | 4x16x8 8x16x8a 4x16x16ab | 2x16x8 4x16x8a | 2x8x8 4x8x8a 2x8x16ab | 4x16x4 4x16x8ab |
- Note
- a - Emulated using multiple intrinsic calls
- b - Requires additional data manipulation
The following example shows an optimized int8 * sparse int8 GEMM:
Class Documentation
◆ aie::mmul
struct aie::mmul< M_Elems, K_Elems, N_Elems, TypeA, TypeB, AccumTag >
Type that encapsulates a blocked matrix multiplication C = A x B.
Objects of this type encapsulate the current result of the multiplication. The first result is computed with the mul method. New multiplications can be accumulated using the mac method.
- Template Parameters
- M_Elems: Rows in matrix A.
- K_Elems: Columns in matrix A / Rows in matrix B.
- N_Elems: Columns in matrix B.
- TypeA: Type of the elements in matrix A. It must meet ElemBaseType.
- TypeB: Type of the elements in matrix B. By default, it is the same as TypeA. It must meet ElemBaseType.
- AccumTag: Type of the elements of the accumulator that contains the results to be written in matrix C. It must meet AccumElemBaseType. If not specified, the default accumulation type for multiplications of TypeA x TypeB is used.
Public Types
Type | Definition
---|---
accum_type | typename mmul_impl::accum_type
mmul_impl | detail::mmul< M_Elems, K_Elems, N_Elems, TypeA, TypeB, detail::to_native_accum_bits_for_mul_types_tag< TypeA, TypeB, AccumTag >()>
Public Member Functions
Member | Description
---|---
mmul () | Constructor.
mmul (const accum_type &acc) | Constructor.
mmul (const binary_op< accum_type, bool, Operation::Zero > &op) | Constructor.
template<typename T> mmul (const vector< T, size_C > &v, int shift=0) | Constructor.
template<VectorOrOp VecA, VectorOrOp VecB> requires (VecA::size() == size_A && VecB::size() == size_B && std::is_same_v<typename VecA::value_type, TypeA> && std::is_same_v<typename VecB::value_type, TypeB>) void mac (const VecA &a, const VecB &b) | Multiply the two given matrices and add the product to the result.
template<unsigned ElemsA = size_A, unsigned ElemsB = size_B> requires (arch::is(arch::AIE_ML) && ElemsB == size_B) void mac (const vector< TypeA, ElemsA > &a, const sparse_vector< TypeB, ElemsB > &b) | Multiply the two given matrices and add the product to the result.
template<VectorOrOp VecA, VectorOrOp VecB> requires (VecA::size() == size_A && VecB::size() == size_B && std::is_same_v<typename VecA::value_type, TypeA> && std::is_same_v<typename VecB::value_type, TypeB>) void mul (const VecA &a, const VecB &b) | Initialize the result value with the multiplication of the two given matrices.
template<unsigned ElemsA, unsigned ElemsB = size_B> requires (arch::is(arch::AIE_ML) && ElemsB == size_B) void mul (const vector< TypeA, ElemsA > &a, const sparse_vector< TypeB, ElemsB > &b) | Initialize the result value with the multiplication of the two given matrices.
operator accum_type () const | Conversion operator to accumulator.
mmul & operator= (const accum_type &acc) | Reinitialize the mmul object using the given accumulator.
accum_type to_accum () const | Return the result of the multiplication as an accumulator.
template<typename T> vector< T, size_C > to_vector (int shift=0) const | Return the result of the multiplication as a vector of the requested type.
Static Public Member Functions
Member | Description
---|---
static constexpr unsigned size () | Returns the number of elements in matrix C.
Static Public Attributes
Attribute | Description
---|---
static constexpr unsigned K = K_Elems | Number of columns in matrix A, and number of rows in matrix B.
static constexpr unsigned M = M_Elems | Number of rows in matrix A.
static constexpr unsigned N = N_Elems | Number of columns in matrix B.
static constexpr unsigned size_A = M * K | Number of elements in matrix A.
static constexpr unsigned size_B = K * N | Number of elements in matrix B.
static constexpr unsigned size_C = M * N | Number of elements in matrix C.
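As a worked example of how the static attributes follow from the shape parameters, the sketch below reproduces the arithmetic in a standalone struct. `mmul_shape` is a hypothetical helper written for this illustration; it is not part of the AIE API.

```cpp
// Hypothetical helper that mirrors only the shape constants of aie::mmul.
template <unsigned M_Elems, unsigned K_Elems, unsigned N_Elems>
struct mmul_shape {
    static constexpr unsigned M = M_Elems;      // rows in A
    static constexpr unsigned K = K_Elems;      // columns in A, rows in B
    static constexpr unsigned N = N_Elems;      // columns in B
    static constexpr unsigned size_A = M * K;   // elements in A
    static constexpr unsigned size_B = K * N;   // elements in B
    static constexpr unsigned size_C = M * N;   // elements in C
    static constexpr unsigned size() { return size_C; }
};

// A 4x8x4 shape takes 32 elements of A and 32 elements of B
// and produces 16 elements of C.
using shape_4x8x4 = mmul_shape<4, 8, 4>;
static_assert(shape_4x8x4::size_A == 32, "A is 4x8");
static_assert(shape_4x8x4::size_B == 32, "B is 8x4");
static_assert(shape_4x8x4::size_C == 16, "C is 4x4");
static_assert(shape_4x8x4::size() == 16, "size() reports the elements in C");
```

These values determine the vector lengths that mul and mac expect for their a and b arguments.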
Member Typedef Documentation
◆ accum_type
using aie::mmul< M_Elems, K_Elems, N_Elems, TypeA, TypeB, AccumTag >::accum_type = typename mmul_impl::accum_type
◆ mmul_impl
using aie::mmul< M_Elems, K_Elems, N_Elems, TypeA, TypeB, AccumTag >::mmul_impl = detail::mmul<M_Elems, K_Elems, N_Elems, TypeA, TypeB, detail::to_native_accum_bits_for_mul_types_tag<TypeA, TypeB, AccumTag>()>
Constructor & Destructor Documentation
◆ mmul() [1/4]
mmul () [inline]
Constructor.
Data is undefined.
◆ mmul() [2/4]
mmul (const accum_type &acc) [inline]
Constructor.
Data is initialized from the given accumulator.
Data is expected to be in row-major layout.
- Parameters
- acc: Accumulator from which the data is initialized.
◆ mmul() [3/4]
mmul (const binary_op< accum_type, bool, Operation::Zero > &op) [inline]
Constructor.
Data is initialized from the given operation modifier.
- Parameters
- op: aie::op_zero operation.
◆ mmul() [4/4]
template<typename T>
mmul (const vector< T, size_C > &v, int shift=0) [inline]
Constructor.
Data is initialized from the given vector.
Data is expected to be in row-major layout.
- Parameters
- v: Vector from which the data is initialized.
- shift: Upshift in bits to be applied to the input data. This parameter is ignored for floating-point types.
Member Function Documentation
◆ mac() [1/2]
template<VectorOrOp VecA, VectorOrOp VecB>
requires (VecA::size() == size_A && VecB::size() == size_B && std::is_same_v<typename VecA::value_type, TypeA> && std::is_same_v<typename VecB::value_type, TypeB>)
void mac (const VecA &a, const VecB &b) [inline]
Multiply the two given matrices and add the product to the result.
- Parameters
- a: Represents the A input matrix with row-major data layout. The number of elements must be mmul::size_A (M * K). It must meet VectorOrOp.
- b: Represents the B input matrix with row-major data layout. The number of elements must be mmul::size_B (K * N). It must meet VectorOrOp.
◆ mac() [2/2]
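template<unsigned ElemsA = size_A, unsigned ElemsB = size_B>
requires (arch::is(arch::AIE_ML) && ElemsB == size_B)
void mac (const vector< TypeA, ElemsA > &a, const sparse_vector< TypeB, ElemsB > &b) [inline]
Multiply the two given matrices and add the product to the result.
Matrix B is sparse.
- Parameters
- a: Vector that represents the A input matrix.
- b: Sparse vector that represents the B input matrix.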
◆ mul() [1/2]
template<VectorOrOp VecA, VectorOrOp VecB>
requires (VecA::size() == size_A && VecB::size() == size_B && std::is_same_v<typename VecA::value_type, TypeA> && std::is_same_v<typename VecB::value_type, TypeB>)
void mul (const VecA &a, const VecB &b) [inline]
Initialize the result value with the multiplication of the two given matrices.
- Parameters
- a: Represents the A input matrix with row-major data layout. The number of elements must be mmul::size_A (M * K). It must meet VectorOrOp.
- b: Represents the B input matrix with row-major data layout. The number of elements must be mmul::size_B (K * N). It must meet VectorOrOp.
◆ mul() [2/2]
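template<unsigned ElemsA, unsigned ElemsB = size_B>
requires (arch::is(arch::AIE_ML) && ElemsB == size_B)
void mul (const vector< TypeA, ElemsA > &a, const sparse_vector< TypeB, ElemsB > &b) [inline]
Initialize the result value with the multiplication of the two given matrices.
Matrix B is sparse.
- Parameters
- a: Vector that represents the A input matrix.
- b: Sparse vector that represents the B input matrix.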
◆ operator accum_type()
operator accum_type () const [inline]
Conversion operator to accumulator.
◆ operator=()
mmul & operator= (const accum_type &acc) [inline]
Reinitialize the mmul object using the given accumulator.
- Parameters
- acc: Accumulator from which the data is initialized.
◆ size()
static constexpr unsigned size () [inline]
Returns number of elements in matrix C.
◆ to_accum()
accum_type to_accum () const [inline]
Return the result of the multiplication as an accumulator.
◆ to_vector()
template<typename T>
vector< T, size_C > to_vector (int shift=0) const [inline]
Return the result of the multiplication as a vector of the requested type.
- Parameters
- shift: Downshift in bits to be applied to the output data. This parameter is ignored for floating-point types.
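The effect of the shift arguments on integer data can be sketched with plain integer arithmetic. The helpers `upshift` and `downshift` below are hypothetical and model a single lane only; the rounding and saturation behavior of the real conversions is not modelled here, so the downshift in this sketch simply rounds toward negative infinity.

```cpp
#include <cassert>
#include <cstdint>

// Upshift, as applied when a vector element initializes an accumulator lane.
inline std::int64_t upshift(std::int32_t v, int shift)
{
    assert(shift >= 0 && shift < 31);
    return static_cast<std::int64_t>(v) * (std::int64_t{1} << shift);
}

// Downshift, as applied when an accumulator lane is converted back to a vector element.
// Uses an arithmetic right shift (guaranteed for signed types since C++20);
// no rounding mode or saturation is modelled in this sketch.
inline std::int32_t downshift(std::int64_t acc, int shift)
{
    assert(shift >= 0 && shift < 63);
    return static_cast<std::int32_t>(acc >> shift);
}
```

Upshifting by `shift` bits and later downshifting by the same amount returns the original value as long as no overflow occurs, which is why the two parameters are typically chosen as a pair.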
Member Data Documentation
◆ K
static constexpr unsigned K = K_Elems
Number of columns in matrix A, and number of rows in matrix B.
◆ M
static constexpr unsigned M = M_Elems
Number of rows in matrix A.
◆ N
static constexpr unsigned N = N_Elems
Number of columns in matrix B.
◆ size_A
static constexpr unsigned size_A = M * K
Number of elements in matrix A.
◆ size_B
static constexpr unsigned size_B = K * N
Number of elements in matrix B.
◆ size_C
static constexpr unsigned size_C = M * N
Number of elements in matrix C.