AOCL-DLP provides multiple GEMM variants optimized for different precision requirements and use cases.
Choosing a GEMM Variant:
Select the appropriate GEMM variant based on your precision and performance requirements:
- High precision: aocl_gemm_f32f32f32of32 for full float32 operations
- Balanced precision: aocl_gemm_bf16bf16f32of32 for reduced memory with float32 accumulation
- Quantized operations: various u8/s8 variants for maximum performance
For a complete list of GEMM variants, see the GEMM API Reference and the GEMM Guide Wiki.
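To make the precision trade-off of the bf16 variant concrete, the following sketch converts float32 values to bfloat16 bit patterns and back in plain C. The helper names f32_to_bf16_sketch and bf16_to_f32_sketch are hypothetical and are not part of AOCL-DLP; the library's own bfloat16 type and conversion routines are not shown here. bfloat16 keeps the float32 sign and exponent but only the top 7 mantissa bits, so values occupy half the memory at roughly 2 to 3 significant decimal digits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an AOCL-DLP function): float32 -> bfloat16 bits,
   rounding to nearest with ties to even. */
uint16_t f32_to_bf16_sketch(float x)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t)); /* assumes IEEE-754 binary32 */
    memcpy(&bits, &x, sizeof bits);

    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: keep the sign and top payload bits, force a quiet NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }
    /* Add 0x7FFF plus the lowest kept bit so exact ties round to even */
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}

/* Hypothetical helper: bfloat16 bits -> float32 (exact, no rounding needed). */
float bf16_to_f32_sketch(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Round-tripping a value through these helpers shows the rounding error an input incurs before the float32 accumulation takes over; for normal values the relative error stays below 2^-8.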
Basic GEMM Call Pattern:
The basic pattern for calling AOCL-DLP GEMM functions follows this structure:
#include "aocl_dlp.h"
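// Basic f32 GEMM call: C = alpha * A * B + beta * C
// Assumes a (m x k), b (k x n), and c (m x n) are row-major float arrays
// with leading dimensions lda >= k, ldb >= n, ldc >= n.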
aocl_gemm_f32f32f32of32(
'R', // Storage format (R=row-major, C=column-major)
'N', // TransA (N=no transpose, T=transpose)
'N', // TransB
m, n, k, // Matrix dimensions
1.0f, // alpha scalar
a, lda, 'N', // Matrix A, leading dimension, memory format
b, ldb, 'N', // Matrix B, leading dimension, memory format
0.0f, // beta scalar
c, ldc, // Matrix C, leading dimension
NULL // Post-operations metadata (NULL = no post-ops)
);
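When trying the call above for the first time, it can help to check the result against a naive reference. The routine below is a minimal sketch of the same formula, C = alpha * A * B + beta * C, for the row-major, non-transposed case. The name ref_gemm_f32_rowmajor is hypothetical; this is not how AOCL-DLP computes the product, only a slow baseline to compare against.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of AOCL-DLP):
   C = alpha * A * B + beta * C for row-major, non-transposed operands.
   A is m x k (lda >= k), B is k x n (ldb >= n), C is m x n (ldc >= n). */
void ref_gemm_f32_rowmajor(size_t m, size_t n, size_t k,
                           float alpha, const float *a, size_t lda,
                           const float *b, size_t ldb,
                           float beta, float *c, size_t ldc)
{
    assert(lda >= k && ldb >= n && ldc >= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                acc += a[i * lda + p] * b[p * ldb + j];
            }
            /* With beta == 0 the old contents of C are ignored entirely */
            c[i * ldc + j] = (beta == 0.0f)
                                 ? alpha * acc
                                 : alpha * acc + beta * c[i * ldc + j];
        }
    }
}
```

An optimized kernel typically sums in a different order than this loop, so compare the two outputs with a small tolerance rather than for exact equality.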
Matrix Reordering for Performance:
For a matrix that is reused across many GEMM calls, reordering it once up front can significantly improve performance, because the cost of the reorder is amortized over every call that uses it:
// Get buffer size needed for reordering matrix B
msz_t buffer_size = aocl_get_reorder_buf_size_f32f32f32of32(
'R', // Storage order (row-major)
'N', // TransB
'B', // Matrix to reorder ('A' or 'B')
k, n, // Dimensions (rows and cols of B)
NULL // Post-operations metadata
);
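// Allocate buffer and reorder matrix B (malloc requires <stdlib.h>)
float* reordered_b = (float*)malloc(buffer_size);
if (reordered_b == NULL) {
    // Handle the allocation failure here; do not call the reorder API
}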
aocl_reorder_f32f32f32of32(
'R', // Storage order
'N', // TransB
'B', // Matrix to reorder
b, // Source matrix
reordered_b, // Destination buffer
k, n, // Dimensions
ldb, // Leading dimension of source
NULL // Post-operations metadata
);
// Use reordered matrix in GEMM calls
aocl_gemm_f32f32f32of32(
'R', 'N', 'N', m, n, k,
1.0f, a, lda, 'N',
reordered_b, ldb, 'R', // 'R' indicates reordered format
0.0f, c, ldc, NULL
);
// Clean up
free(reordered_b);
For detailed buffer size and reorder APIs, see the API Lifecycle documentation.
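The layout that aocl_reorder_f32f32f32of32 produces is internal to the library and is not described here. As a rough intuition for why a reorder can pay off, the toy sketch below packs a row-major B into fixed-width column panels so that the inner loop reads it contiguously. Every name in it (PANEL_WIDTH, toy_packed_len, toy_pack_b, toy_gemm_packed) and the panel width of 4 are assumptions made for illustration, not AOCL-DLP behavior.

```c
#include <assert.h>
#include <stddef.h>

enum { PANEL_WIDTH = 4 }; /* hypothetical panel width, for illustration only */

/* Number of floats needed for the packed copy of a k x n matrix. */
size_t toy_packed_len(size_t k, size_t n)
{
    size_t panels = (n + PANEL_WIDTH - 1) / PANEL_WIDTH;
    return panels * k * PANEL_WIDTH;
}

/* Copy row-major B (k x n, leading dimension ldb) into column panels of
   PANEL_WIDTH columns each, zero-padding the last panel. */
void toy_pack_b(const float *b, size_t ldb, size_t k, size_t n, float *packed)
{
    assert(ldb >= n);
    size_t out = 0;
    for (size_t j0 = 0; j0 < n; j0 += PANEL_WIDTH) {
        for (size_t p = 0; p < k; ++p) {
            for (size_t jj = 0; jj < PANEL_WIDTH; ++jj) {
                size_t j = j0 + jj;
                packed[out++] = (j < n) ? b[p * ldb + j] : 0.0f;
            }
        }
    }
}

/* C = A * B using the packed copy of B; A (m x k) and C (m x n) are row-major. */
void toy_gemm_packed(size_t m, size_t n, size_t k,
                     const float *a, size_t lda,
                     const float *packed, float *c, size_t ldc)
{
    assert(lda >= k && ldc >= n);
    for (size_t j0 = 0; j0 < n; j0 += PANEL_WIDTH) {
        const float *panel = packed + (j0 / PANEL_WIDTH) * k * PANEL_WIDTH;
        for (size_t i = 0; i < m; ++i) {
            float acc[PANEL_WIDTH] = {0};
            for (size_t p = 0; p < k; ++p) {
                for (size_t jj = 0; jj < PANEL_WIDTH; ++jj) {
                    /* Each panel row is contiguous, so this read is sequential */
                    acc[jj] += a[i * lda + p] * panel[p * PANEL_WIDTH + jj];
                }
            }
            for (size_t jj = 0; jj < PANEL_WIDTH && j0 + jj < n; ++jj) {
                c[i * ldc + j0 + jj] = acc[jj];
            }
        }
    }
}
```

The pattern mirrors the library workflow above: query the packed size, pack once, then reuse the packed buffer for every multiplication that shares the same B.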
Batch GEMM Operations:
AOCL-DLP supports batch GEMM operations, which process multiple independent matrix multiplications in a single call:
// Batch f32 GEMM example
aocl_batch_gemm_f32f32f32of32(
transa_array, transb_array, // Arrays of transpose flags
m_array, n_array, k_array, // Arrays of dimensions
alpha_array, // Array of alpha scalars
a_array, lda_array, mtagA_array, // Arrays of A matrices and metadata
b_array, ldb_array, mtagB_array, // Arrays of B matrices and metadata
beta_array, // Array of beta scalars
c_array, ldc_array, // Arrays of C matrices
batch_size, // Number of GEMM operations
post_ops_array // Array of post-operations metadata
);
For more information, see aocl_batch_gemm_f32f32f32of32.
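The parameter grouping above suggests that entry i of each array describes problem i of the batch. Under that assumption, the plain C sketch below spells out what a batch call computes for row-major, non-transposed operands. The function ref_batch_gemm_f32 is hypothetical, and the argument types and order of the real API may differ, so consult the API reference before relying on this mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference (not part of AOCL-DLP): for each problem g,
   C_g = alpha_g * A_g * B_g + beta_g * C_g, row-major, no transposes. */
void ref_batch_gemm_f32(size_t batch_size,
                        const size_t *m, const size_t *n, const size_t *k,
                        const float *alpha,
                        const float *const *a, const size_t *lda,
                        const float *const *b, const size_t *ldb,
                        const float *beta,
                        float *const *c, const size_t *ldc)
{
    for (size_t g = 0; g < batch_size; ++g) {
        /* Each problem carries its own dimensions and leading dimensions */
        assert(lda[g] >= k[g] && ldb[g] >= n[g] && ldc[g] >= n[g]);
        for (size_t i = 0; i < m[g]; ++i) {
            for (size_t j = 0; j < n[g]; ++j) {
                float acc = 0.0f;
                for (size_t p = 0; p < k[g]; ++p) {
                    acc += a[g][i * lda[g] + p] * b[g][p * ldb[g] + j];
                }
                float *out = &c[g][i * ldc[g] + j];
                /* With beta == 0 the old contents of C are ignored */
                *out = (beta[g] == 0.0f) ? alpha[g] * acc
                                         : alpha[g] * acc + beta[g] * *out;
            }
        }
    }
}
```

Because every problem has its own sizes and scalars, one batch can mix shapes, such as a 1 x 1 product alongside a 2 x 2 product.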