GEMM Operations - 5.2 English - 57404

AOCL User Guide (57404)

Document ID: 57404
Release Date: 2025-12-29
Version: 5.2 English

AOCL-DLP provides multiple GEMM variants optimized for different precision requirements and use cases.

Choosing a GEMM Variant:

Select the appropriate GEMM variant based on your precision and performance requirements:

  • High precision: aocl_gemm_f32f32f32of32 for full float32 operations

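  • Balanced precision: aocl_gemm_bf16bf16f32of32 for bfloat16 inputs, which halve the memory footprint of A and B relative to float32, with float32 accumulation and output

  • Quantized operations: the u8/s8 integer variants, for maximum performance when 8-bit inputs are acceptable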

For a complete list of GEMM variants, see the GEMM API Reference and the GEMM Guide Wiki.
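To make the precision trade-off of the bf16 variant concrete, the following minimal sketch converts float32 values to bfloat16 and accumulates a dot product in float32. It does not call AOCL-DLP; the helper names f32_to_bf16, bf16_to_f32, and dot_bf16_f32acc are hypothetical, and the rounding mode (round to nearest, ties to even) is an assumption for illustration rather than a statement about the library's internals.

```c
#include <stdint.h>
#include <string.h>

/* Convert float32 to bfloat16 (the upper 16 bits of the float32 encoding),
   rounding to nearest with ties to even. NaN handling is omitted:
   this sketch assumes finite inputs. */
uint16_t f32_to_bf16(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* Widen bfloat16 back to float32; this direction is exact. */
float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Dot product with bfloat16 inputs and a float32 accumulator. */
float dot_bf16_f32acc(const uint16_t *a, const uint16_t *b, int k)
{
    float acc = 0.0f;
    for (int p = 0; p < k; ++p)
        acc += bf16_to_f32(a[p]) * bf16_to_f32(b[p]);
    return acc;
}
```

bfloat16 keeps the float32 exponent range but only 8 bits of significand precision, so inputs such as 1.00390625 (1 + 2^-8) round to 1.0. Accumulating in float32 limits how much additional error the summation itself introduces.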

Basic GEMM Call Pattern:

The basic pattern for calling AOCL-DLP GEMM functions follows this structure:

#include "aocl_dlp.h"

// Basic f32 GEMM call: C = alpha * A * B + beta * C
aocl_gemm_f32f32f32of32(
    'R',        // Storage format (R=row-major, C=column-major)
    'N',        // TransA (N=no transpose, T=transpose)
    'N',        // TransB
    m, n, k,    // Matrix dimensions
    1.0f,       // alpha scalar
    a, lda, 'N', // Matrix A, leading dimension, memory format (N=not reordered)
    b, ldb, 'N', // Matrix B, leading dimension, memory format (N=not reordered)
    0.0f,       // beta scalar
    c, ldc,     // Matrix C, leading dimension
    NULL        // Post-operations metadata (NULL = no post-ops)
);
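To clarify what the call above computes, here is a plain reference implementation of the same operation for the row-major, no-transpose case. It is a sketch for checking results, not the library's algorithm; ref_gemm_f32 is a hypothetical helper name, and skipping the read of C when beta is zero follows the usual BLAS convention, which is assumed here rather than confirmed for AOCL-DLP.

```c
#include <assert.h>
#include <stddef.h>

/* Reference row-major GEMM: C = alpha * A * B + beta * C.
   A is m x k, B is k x n, C is m x n, no transposes.
   Each leading dimension is the element stride between consecutive rows. */
void ref_gemm_f32(int m, int n, int k, float alpha,
                  const float *a, int lda,
                  const float *b, int ldb,
                  float beta, float *c, int ldc)
{
    /* In row-major storage a row must fit within its leading dimension. */
    assert(lda >= k && ldb >= n && ldc >= n);

    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += a[(size_t)i * lda + p] * b[(size_t)p * ldb + j];

            float *out = &c[(size_t)i * ldc + j];
            /* Assumed BLAS-style convention: beta == 0 ignores old C. */
            *out = alpha * acc + (beta == 0.0f ? 0.0f : beta * *out);
        }
    }
}
```

A leading dimension larger than the row length lets a matrix be a view into a wider buffer. For example, a 2 x 3 matrix stored with lda = 4 has one unused element at the end of each row.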

Matrix Reordering for Performance:

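For matrices that are reused across many GEMM calls (for example, a constant weight matrix), reordering once up front can significantly improve performance, because the one-time reordering cost is amortized over every later call: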

// Get buffer size needed for reordering matrix B
msz_t buffer_size = aocl_get_reorder_buf_size_f32f32f32of32(
    'R',     // Storage order (row-major)
    'N',     // TransB
    'B',     // Matrix to reorder ('A' or 'B')
    k, n,    // Dimensions (rows and cols of B)
    NULL     // Post-operations metadata
);

// Allocate buffer and reorder matrix B
// (malloc requires <stdlib.h>; check reordered_b for NULL before use)
float* reordered_b = (float*)malloc(buffer_size);
aocl_reorder_f32f32f32of32(
    'R',           // Storage order
    'N',           // TransB
    'B',           // Matrix to reorder
    b,             // Source matrix
    reordered_b,   // Destination buffer
    k, n,          // Dimensions
    ldb,           // Leading dimension of source
    NULL           // Post-operations metadata
);

// Use reordered matrix in GEMM calls
aocl_gemm_f32f32f32of32(
    'R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N',
    reordered_b, ldb, 'R',  // 'R' indicates reordered format
    0.0f, c, ldc, NULL
);

// Clean up
free(reordered_b);

For detailed buffer size and reorder APIs, see the API Lifecycle documentation.
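The idea behind reordering can be illustrated with a simplified packing scheme that copies B into contiguous column panels, so that an inner kernel reads memory sequentially. This is only a conceptual sketch: the panel width NR, the layout, and the function names pack_b_size, pack_b, and gemm_packed_b are all hypothetical, and the actual reordered format used by AOCL-DLP is internal to the library and may differ.

```c
#include <stddef.h>

enum { NR = 4 };  /* hypothetical panel width, chosen for illustration */

/* Bytes needed for the packed copy of a k x n matrix B.
   The column count is rounded up to a multiple of NR. */
size_t pack_b_size(int k, int n)
{
    size_t panels = ((size_t)n + NR - 1) / NR;
    return panels * (size_t)k * NR * sizeof(float);
}

/* Copy row-major B (k x n, leading dimension ldb) into NR-wide column
   panels stored one after another, zero-padding the last panel. */
void pack_b(const float *b, int ldb, int k, int n, float *packed)
{
    for (int j0 = 0; j0 < n; j0 += NR)
        for (int p = 0; p < k; ++p)
            for (int jj = 0; jj < NR; ++jj) {
                int j = j0 + jj;
                *packed++ = (j < n) ? b[(size_t)p * ldb + j] : 0.0f;
            }
}

/* C = A * B using the packed copy of B; A and C are row-major. */
void gemm_packed_b(int m, int n, int k, const float *a, int lda,
                   const float *packed, float *c, int ldc)
{
    for (int j0 = 0; j0 < n; j0 += NR) {
        const float *panel = packed + (size_t)(j0 / NR) * k * NR;
        for (int i = 0; i < m; ++i)
            for (int jj = 0; jj < NR && j0 + jj < n; ++jj) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += a[(size_t)i * lda + p] * panel[(size_t)p * NR + jj];
                c[(size_t)i * ldc + j0 + jj] = acc;
            }
    }
}
```

As with the library calls above, the packed buffer is sized first, filled once, and then reused by every multiplication that shares the same B.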

Batch GEMM Operations:

AOCL-DLP supports batch GEMM operations, which process multiple matrix multiplications in a single call. Each argument is an array with one entry per GEMM in the batch:

// Batch f32 GEMM example
aocl_batch_gemm_f32f32f32of32(
    transa_array, transb_array,  // Arrays of transpose flags
    m_array, n_array, k_array,   // Arrays of dimensions
    alpha_array,                 // Array of alpha scalars
    a_array, lda_array, mtagA_array,  // Arrays of A matrices and metadata
    b_array, ldb_array, mtagB_array,  // Arrays of B matrices and metadata
    beta_array,                  // Array of beta scalars
    c_array, ldc_array,          // Arrays of C matrices
    batch_size,                  // Number of GEMM operations
    post_ops_array               // Array of post-operations metadata
);

For more information, see aocl_batch_gemm_f32f32f32of32.
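The array-per-argument convention of the batch call can be sketched with a reference loop in which entry g of every array describes one independent GEMM. This is an illustration of the semantics only: ref_batch_gemm_f32 is a hypothetical helper, it assumes row-major storage without transposes or post-operations, and its parameter list is simplified and does not reproduce the library's signature.

```c
#include <stddef.h>

/* Reference batch GEMM: for each g in [0, batch_size),
   C[g] = alpha[g] * A[g] * B[g] + beta[g] * C[g].
   Assumes row-major storage and no transposes. */
void ref_batch_gemm_f32(const int *m, const int *n, const int *k,
                        const float *alpha,
                        const float *const *a, const int *lda,
                        const float *const *b, const int *ldb,
                        const float *beta,
                        float *const *c, const int *ldc,
                        int batch_size)
{
    for (int g = 0; g < batch_size; ++g) {
        for (int i = 0; i < m[g]; ++i) {
            for (int j = 0; j < n[g]; ++j) {
                float acc = 0.0f;
                for (int p = 0; p < k[g]; ++p)
                    acc += a[g][(size_t)i * lda[g] + p]
                         * b[g][(size_t)p * ldb[g] + j];

                float *out = &c[g][(size_t)i * ldc[g] + j];
                /* Assumed convention: beta == 0 ignores old C. */
                *out = alpha[g] * acc
                     + (beta[g] == 0.0f ? 0.0f : beta[g] * *out);
            }
        }
    }
}
```

Because each entry carries its own dimensions, scalars, and leading dimensions, the GEMMs in one batch do not need to share a shape.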