
AOCL User Guide (57404)

Document ID
57404
Release Date
2024-12-14
Version
5.0 English

4. AOCL-BLAS#

AOCL-BLAS is a high-performance implementation of the Basic Linear Algebra Subprograms (BLAS). The BLAS were designed to provide the essential kernels of matrix and vector computation and are the most commonly used computationally intensive operations in dense numerical linear algebra. Select kernels have been optimized for AMD “Zen”-based processors, for example, the AMD EPYC™, AMD Ryzen™, and AMD Ryzen™ Threadripper™ processor families.

AOCL-BLAS is developed as a fork of BLIS (flame/blis), which is developed by members of the Science of High-Performance Computing (SHPC) group in the Institute for Computational Engineering and Sciences at The University of Texas at Austin and other collaborators (including AMD). All known features and functionalities of BLIS are retained and supported in the AOCL-BLAS library, along with the standard BLAS and CBLAS interfaces. C++ template interfaces for the BLAS functionalities are also included.

4.1. Installation#

AOCL-BLAS can be installed from source or from pre-built binaries, as described in the following sections.

4.1.1. Using Pre-Built Binaries#

AOCL-BLAS library binaries for Linux are available at the following URL: https://www.amd.com/en/developer/aocl/dense.html

Also, the AOCL-BLAS binary can be installed from the AOCL master installer tar file (https://www.amd.com/en/developer/aocl.html).

The master installer includes the following:

  • Single-threaded and multi-threaded AOCL-BLAS binaries.

  • Binaries built with amdzen config with LP64 and ILP64 integer support.

  • Multi-threaded AOCL-BLAS binary (libblis-mt) built with OpenMP threading mode.

The tar file includes pre-built binaries of other AMD libraries as explained in Using Master Package.

4.1.2. Build from Source - Introduction#

When building from source, two different build systems are supported:

  • CMake - supported on Linux and Windows

  • configure/make - only supported on Linux

First we consider the different options available when creating and using AOCL-BLAS, and then look at some examples and platform-specific information for Linux and Windows.

The AOCL-BLAS source code is available on GitHub: amd/blis

Clone the Git repository amd/blis.git

Prerequisites

The following dependencies must be met for installing AOCL-BLAS:

  • Target CPU ISA supporting AVX2 and FMA (and preferably AVX512)

  • Python version 3.4 or later

  • GNU Make 4.2 or later

  • CMake 3.22.0 or later

  • Compilers: either of

    • GCC versions 12.2 through 13.1

    • AOCC version 4.2 or 5.0

4.1.3. Hardware Configuration#

AOCL-BLAS supports a wide range of different architectures, with optimizations focused on AMD “Zen” and compatible processors. AOCL-BLAS can be compiled for specific hardware by specifying the appropriate configuration option:

  • auto - This configuration generates a binary optimized for the build machine’s AMD “Zen” core architecture and is useful when you build the library on the target system. Starting from the AOCL-BLAS 2.1 release, the auto configuration option selects the appropriate build configuration based on the target CPU architecture. For example, for a build machine using a 1st Gen AMD EPYC™ processor (code name “Naples”), the zen configuration is auto-selected; for a 2nd Gen AMD EPYC™ processor (code name “Rome”), the zen2 configuration is auto-selected. From AOCL-BLAS 3.0 onward, zen3 is auto-selected for 3rd Gen AMD EPYC™ processors (code name “Milan”). From AOCL-BLAS 4.0 onward, zen4 is auto-selected for 4th Gen AMD EPYC™ processors (code name “Genoa”, “Bergamo”, or “Siena”). From AOCL-BLAS 5.0 onward, zen5 is auto-selected for 5th Gen AMD EPYC™ processors (code name “Turin”).

  • zen - This configuration generates a binary compatible with AMD “Zen” architecture and is optimized for it. The architecture of the build machine is not relevant.

  • zen2 - This configuration generates a binary compatible with the AMD “Zen2” architecture and is optimized for it. The architecture of the build machine is not relevant.

  • zen3 - This configuration generates a binary compatible with the AMD “Zen3” architecture and is optimized for it. The architecture of the build machine is not relevant.

  • zen4 - This configuration generates a binary compatible with the AMD “Zen4” architecture and is optimized for it. The architecture of the build machine is not relevant.

  • zen5 - This configuration generates a binary compatible with the AMD “Zen5” architecture and is optimized for it. The architecture of the build machine is not relevant.

  • amdzen - The library built using this configuration generates a binary compatible with, and optimized for, the AMD “Zen”, AMD “Zen2”, AMD “Zen3”, AMD “Zen4”, and AMD “Zen5” architectures. A slower generic code path, compatible with older x86-64 processors, is also included. The architecture of the build machine is not relevant. The architecture of the target machine is checked at runtime, and the relevant optimizations are selected automatically. This feature is also called Dynamic Dispatch. For more information, refer to the Dynamic Dispatch section below.

Table 4.1 AOCL-BLAS Specifying Hardware Configuration#

Example desired config: amdzen

CMake option:
-DBLIS_CONFIG_FAMILY=amdzen

configure option:
./configure … amdzen

Usage: Linux: no default is set; Windows: “auto” is the default choice.

When using configure, the desired config should be the last argument.
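For instance, either build system can be pointed at the amdzen family as follows (the install prefix is illustrative):

```shell
# CMake: select the amdzen (dynamic dispatch) configuration family
cmake . -DBLIS_CONFIG_FAMILY=amdzen -DCMAKE_INSTALL_PREFIX=$HOME/aocl-blas

# configure: the desired config must be the last argument
./configure --prefix=$HOME/aocl-blas amdzen
```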

4.1.4. API Compatibility Layers (Calling AOCL-BLAS)#

AOCL-BLAS supports various API compatibility layers:

  • The BLAS/CBLAS standard enables application portability between various libraries. See Netlib BLAS for background information. They can be called from programs written in Fortran, C, C++ and compatible languages. Simple BLAS and CBLAS examples in Fortran and C are available in section AOCL-BLAS Usage in Fortran and C.

  • AOCL-BLAS also includes BLIS-specific APIs that provide more flexibility and control to help achieve the best performance in some situations. Details of these C interfaces are available in the BLIS documentation.

The following table lists all the supported layers and the CMake and configure options to control them, with the default setting in bold:

Table 4.2 AOCL-BLAS API Compatibility Layers#

API Layer

Header file

CMake options
————————————
configure options

Usage

BLAS (Fortran)

N/A

-DENABLE_BLAS=ON
-DENABLE_BLAS=OFF
————————————
--enable-blas
--disable-blas
Use this option when calling AOCL-BLAS from Fortran applications.

API Name Format: DGEMM

BLAS (C)

blis.h

-DENABLE_BLAS=ON
-DENABLE_BLAS=OFF
————————————
--enable-blas
--disable-blas
Use this option when calling AOCL-BLAS from a C application using BLAS-style parameters.

API Name Format: dgemm_

CBLAS

cblas.h

-DENABLE_CBLAS=ON
-DENABLE_CBLAS=OFF
————————————
--enable-cblas
--disable-cblas
Use this option when calling AOCL-BLAS from a C application using CBLAS-style parameters. If enabled, the BLAS API is also enabled.

API Name Format: cblas_dgemm

BLIS - C
(Non-standard)

blis.h

Enabled by default

————————————
Enabled by default

This is an AOCL-BLAS-specific (non-standard) interface; it provides the most flexibility in calling AOCL-BLAS for best performance. However, applications using it will not be portable to other BLAS/CBLAS-compatible libraries.

API Name Format: bli_gemm
API Name Format: bli_gemm_ex

BLIS - CPP
(Non-standard)

blis.hh

Enabled by default

————————————
Enabled by default

This is an AOCL-BLAS-specific (non-standard) C++ interface. It follows the same parameter order as CBLAS. However, applications using it will not be portable to other BLAS/CBLAS-compatible libraries.

API Name Format: blis::gemm

4.1.5. API Compatibility - Advanced Options#

The API compatibility can be further extended to meet additional requirements for input sizes (ILP64) and different ways in which complex numbers are returned by functions in the BLAS interface, which is related to the choice of compiler. The following table explains such options, with the default setting in bold:

Table 4.3 AOCL-BLAS Compatibility Advanced Options#

Build characteristic

CMake options
————————————
configure options

Usage

Choice of compiler and complex function return type

-DCOMPLEX_RETURN=gnu
-DCOMPLEX_RETURN=intel CC=clang CXX=clang++
————————————
--complex-return=gnu
--complex-return=intel CC=clang CXX=clang++
GNU and AOCC (based on LLVM) are currently supported.

Refer to Returning Complex Numbers for more information.

Integer size (LP64 vs. ILP64)

-DBLAS_INT_SIZE=32
-DBLAS_INT_SIZE=64
————————————
--blas-int-size=32
--blas-int-size=64
This option can be used to specify the integer types used in external BLAS and CBLAS interfaces.

4.1.6. Other Key Options#

The following table lists other key AOCL-BLAS build options and the CMake and configure options to control them, with the default setting in bold (where appropriate):

Table 4.4 AOCL-BLAS Specifying Other Build Options#

Build characteristic

CMake options
————————————
configure options

Usage

Serial or multithreaded (OpenMP)

-DENABLE_THREADING=openmp
-DENABLE_THREADING=no
————————————
--enable-threading=openmp
--enable-threading=no
Enabling multithreading with OpenMP will add a link time dependency on the compiler’s OpenMP runtime library.

Enable/Disable dynamic thread scaling

-DENABLE_AOCL_DYNAMIC=ON
-DENABLE_AOCL_DYNAMIC=OFF
————————————
--enable-aocl-dynamic
--disable-aocl-dynamic
In multi-threaded builds with AOCL dynamic enabled, AOCL-BLAS may reduce the number of threads used within an API call below the number requested when the problem size is small.

Installation directory

-DCMAKE_INSTALL_PREFIX=</desired/location/>
Default location: /usr/local/
————————————
--prefix=</desired/location/>
Default location: /usr/local/
Specifies target directory for installing AOCL-BLAS. Installation will include lib, include and share subdirectories.
On Linux these will be placed inside a directory lp64 or ilp64 as appropriate.

Enable LPGEMM add-on

-DENABLE_ADDON="aocl_gemm"
Default: disabled
————————————
-a aocl_gemm
Default: disabled
LPGEMM provides a range of BF16 and INT8 GEMM operations with many supported pre-/post-ops, targeted at AI applications. See LPGEMM in AOCL-BLAS for details.

Change dynamic dispatch environment variables

-DRENAME_BLIS_ARCH_TYPE=<user-defined-name>
(Default name BLIS_ARCH_TYPE)
-DRENAME_BLIS_MODEL_TYPE=<user-defined-name>
(Default name BLIS_MODEL_TYPE)
————————————
--rename-blis-arch-type=
(Default name BLIS_ARCH_TYPE)
--rename-blis-model-type=
(Default name BLIS_MODEL_TYPE)
If dynamic dispatch is enabled in the configuration (e.g. amdzen), the default runtime choice of code path based on the hardware can be overridden by environment variables.
These options allow the environment variables BLIS_ARCH_TYPE and BLIS_MODEL_TYPE to be renamed. See Dynamic Dispatch for more details.

Disable dynamic dispatch environment variables

-DDISABLE_BLIS_ARCH_TYPE=ON
-DDISABLE_BLIS_ARCH_TYPE=OFF
————————————
--enable-blis-arch-type
--disable-blis-arch-type
If dynamic dispatch is enabled in the configuration (e.g. amdzen), use of these environment variables
(BLIS_ARCH_TYPE, BLIS_MODEL_TYPE, and AOCL_ENABLE_INSTRUCTIONS) can alternatively be disabled entirely. See Dynamic Dispatch for more details.

4.1.7. Build AOCL-BLAS from Source on Linux#

Below are some examples for configuration and build commands on Linux, for both CMake and configure build systems.

4.1.7.1. Single-Thread AOCL-BLAS#

Complete the following steps to install a single-thread AOCL-BLAS:

  1. Clone the AOCL-BLAS Git repository (amd/blis.git).

  2. Configure the library as required:

    # CMake commands
    
    # GCC (Default)
    $ cmake . -DENABLE_CBLAS=ON -DCMAKE_INSTALL_PREFIX=<your-install-dir> \
      -DBLIS_CONFIG_FAMILY=auto
    
    # AOCC
    $ CC=clang CXX=clang++ cmake . -DENABLE_CBLAS=ON \
      -DCMAKE_INSTALL_PREFIX=<your-install-dir> \
      -DCOMPLEX_RETURN=intel -DBLIS_CONFIG_FAMILY=auto
    
    # Alternatively, using configure
    
    # GCC (Default)
    $ ./configure --enable-cblas --prefix=<your-install-dir> auto
    
    # AOCC
    $ ./configure --enable-cblas --prefix=<your-install-dir> \
      --complex-return=intel CC=clang CXX=clang++ auto
    
  3. To build the library, use the command:

    $ make
    
  4. To install the library on the build machine, use the command:

    $ make install
    

4.1.7.2. Multi-Thread AOCL-BLAS#

Complete the following steps to install a multi-thread AOCL-BLAS:

  1. Clone the AOCL-BLAS Git repository (amd/blis.git).

  2. Configure the library as required:

    # CMake commands
    
    # GCC (Default)
    $ cmake . -DENABLE_CBLAS=ON -DCMAKE_INSTALL_PREFIX=<your-install-dir> \
      -DENABLE_THREADING=[Mode] -DBLIS_CONFIG_FAMILY=auto
    
    # AOCC
    $ CC=clang CXX=clang++ cmake . -DENABLE_CBLAS=ON \
      -DCMAKE_INSTALL_PREFIX=<your-install-dir> -DCOMPLEX_RETURN=intel \
      -DENABLE_THREADING=[Mode] -DBLIS_CONFIG_FAMILY=auto
    
    # Alternatively, using configure
    
    # GCC (Default)
    $ ./configure --enable-cblas --prefix=<your-install-dir> \
      --enable-threading=[Mode] auto
    
    # AOCC
    $ ./configure --enable-cblas --prefix=<your-install-dir> \
      --enable-threading=[Mode] \
      --complex-return=intel CC=clang CXX=clang++ auto
    

    Mode is one of {openmp, no}; the no option disables multi-threading.

  3. To build the library, use the command:

    $ make
    
  4. To install the library on the build machine, use the command:

    $ make install
    

4.1.8. Build AOCL-BLAS from Source on Windows#

GitHub URL: amd/blis

AOCL-BLAS uses CMake along with Microsoft Visual Studio for building binaries from the sources on Windows. The following sections explain the GUI and command-line schemes of building the binaries and test suite.

Prerequisites

  • Windows 10/11 or Windows Server 2019/2022

  • LLVM 15/16 for AMD “Zen3” and AMD “Zen4” support (or LLVM 11 for AMD “Zen2” support)

  • LLVM plug-in for Microsoft Visual Studio (if the latest version of LLVM is installed separately, this plug-in enables linking Visual Studio with the installed LLVM toolchain)

  • For more information on CMake versions validated, refer to Build Utilities

  • Microsoft Visual Studio 2019 (build 16.8.7) and 2022 (build 17.3.2 through 17.7.5)

  • Microsoft Visual Studio tools (as shown in Microsoft Visual Studio Prerequisites):

    • Python development

    • Desktop development with C++: C++ Clang-Cl for v142 build tool (x64/x86)

_images/image8_aocl.png

Figure 4.1 Microsoft Visual Studio Prerequisites#

4.1.8.1. Building AOCL-BLAS Using GUI#

4.1.8.2. Preparing Project with CMake GUI#

Complete the following steps in the CMake GUI:

  1. Set the source (folder containing AOCL-BLAS source code) and build (folder in which the project files will be generated, for example, out) folder paths as shown in the following figure:

    _images/image10_aocl.png

    Figure 4.2 CMake Source and Build Folders#

    It is not recommended to use a folder named build since build is used by the Linux build system.

  2. Click on the Configure button to prepare the project options.

  3. Set the generator to Visual Studio 17 2022 and the compiler to ClangCl as shown in the following figure:

    _images/image11_aocl.png

    Figure 4.3 Set Generator and Compiler#

  4. Update the options based on the project requirements. All the available options are listed in the following table:

    Table 4.5 CMake Config Options#

    Feature

    CMake Parameter

    AMD CPU architecture

    BLIS_CONFIG_FAMILY=zen / zen2 / zen3 / zen4 / zen5 / amdzen

    Shared library

    BUILD_SHARED_LIBS=ON

    Static library

    BUILD_SHARED_LIBS=OFF

    Debug/Release build type

    CMAKE_BUILD_TYPE=Debug / Release

    Enable single threading (disables AOCL dynamic dispatch)

    ENABLE_THREADING=no(default)

    Enable multi-threading (enables AOCL dynamic dispatch with OpenMP)

    ENABLE_THREADING=openmp

    AOCL Dynamic (automatically selected depending on the value of ENABLE_THREADING)

    ENABLE_AOCL_DYNAMIC=ON/OFF

    Enable BLAS/CBLAS support

    ENABLE_BLAS=ON ENABLE_CBLAS=ON

    Enable 32-bit integer size in BLIS and BLAS APIs

    INT_SIZE=32 and BLAS_INT_SIZE=32

    Enable 64-bit integer size in BLIS and BLAS APIs

    INT_SIZE=64 and BLAS_INT_SIZE=64

    Absolute path to the OpenMP library, including the library name

    OpenMP_libomp_LIBRARY

    Table 4.6 CMake Config Options (all variables and their default values)#

    CMake Parameter

    BUILD_SHARED_LIBS=ON(default)/OFF

    ENABLE_THREADING=no(default)/openmp

    INT_SIZE=auto(default)/32/64

    BLAS_INT_SIZE=32(default)/64

    ENABLE_BLAS=ON/OFF(default)

    ENABLE_CBLAS=ON/OFF(default)

    ENABLE_MIXED_DT=ON(default)/OFF

    ENABLE_SUP_HANDLING=ON(default)/OFF

    ENABLE_AOCL_DYNAMIC=ON(default)/OFF

    COMPLEX_RETURN=gnu(default)/intel

    ENABLE_NO_UNDERSCORE_API=ON/OFF(default)

    ENABLE_UPPERCASE_API=ON/OFF(default)

    ENABLE_SYSTEM=ON(default)/OFF

    THREAD_PART_JRIR=slab(default)/rr

    ENABLE_PBA_POOLS=ON(default)/OFF

    ENABLE_SBA_POOLS= ON(default)/OFF

    ENABLE_MEM_TRACING=ON/OFF(default)

    ENABLE_MIXED_DT_EXTRA_MEM=ON(default)/OFF

    ENABLE_TRSM_PREINVERSION=ON(default)/OFF

    FORCE_VERSION=no(default)/<user-defined>

    DISABLE_BLIS_ARCH_TYPE=ON/OFF(default)

    RENAME_BLIS_ARCH_TYPE=BLIS_ARCH_TYPE(default)/<user-defined>

    RENAME_BLIS_MODEL_TYPE=BLIS_MODEL_TYPE(default)/<user-defined>

    For the detailed documentation of all the options, configure CMake with PRINT_CONFIGURE_HELP=ON.

  5. To generate the Microsoft Visual Studio project in the out folder, click on the Generate button as shown in the following figure:

    _images/image12_aocl.png

    Figure 4.4 CMake Configure and Generate Project Settings#

4.1.8.3. Building the Project in Visual Studio GUI#

Complete the following steps in the Microsoft Visual Studio GUI:

  1. Open the project generated by CMake (build folder) in Preparing Project with CMake GUI.

  2. To generate AOCL-BLAS binaries, build the AOCL-LibBlis project or libs/libblis target. The library files will be generated in the out folder based on the project settings.

    For example, blis/out/Release/AOCL-LibBlis-Win-MT.dll or AOCL-LibBlis-Win-MT.lib

  3. To install the binaries (or to build and install them), build the INSTALL project under CMakePredefinedTargets.

4.1.8.4. Building AOCL-BLAS Using Command-Line Arguments#

The project configuration and build procedures can be triggered from the command prompt as well. The corresponding steps are described in the following sections.

Configuring the Project in Command Prompt

In the AOCL-BLAS project folder, create a folder out. Open the command prompt in this directory and run the following command to configure the project:

$ cmake -S .. -B . -G "Visual Studio 17 2022" ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DBLIS_CONFIG_FAMILY=amdzen -DBUILD_SHARED_LIBS=ON ^
  -DENABLE_THREADING=openmp -DCOMPLEX_RETURN=intel ^
  -DOpenMP_libomp_LIBRARY="C:\Program Files\LLVM\lib\libomp.lib" -TClangCL

You can refer to CMake Config Options and update the parameter options in the command according to the project requirements, or run the following command for a detailed description of the available options:

$ cmake -S .. -B . -G "Visual Studio 17 2022" -DPRINT_CONFIGURE_HELP=ON

Building the Project in Command Prompt

Open a command prompt in the blis\out directory and invoke CMake with the build command and the Release or Debug configuration. For example:

$ cmake --build . --config Release

For building the library using multiple threads, run the following command:

$ cmake --build . --config Release -j

The library files would be generated in the Release or Debug folder based on the project settings.

4.1.9. Verifying AOCL-BLAS Installation#

The AOCL-BLAS source directory contains test cases that demonstrate the usage of the AOCL-BLAS APIs. To execute the tests, navigate to the AOCL-BLAS source directory and run one of the following commands, as appropriate for your operating system and choice of AOCL-BLAS build method (configure+make is Linux-only):

# Build tests using configure+make (Linux only)
$ make checkblas checkblis

# Build tests using CMake (Linux or Windows shared library)
$ cmake --build . --config Release --target checkblas checkblis

# Build tests using CMake (Windows static library)
$ cmake --build . --config Release --target checkblis

4.2. Application Development Using AOCL-BLAS#

This section explains the different types of APIs provided by AOCL-BLAS. It describes how to call them and link with the library.

4.2.1. Linking Application with AOCL-BLAS#

The AOCL-BLAS library can be linked statically or dynamically with the user application. Separate binaries are provided for the single-threaded and multi-threaded implementations.

The basic build command is as follows:

$ gcc test_blis.c -I<path-to-AOCL-BLAS-header> <link-options> \
  -o test_blis.x

The following table explains different options depending on a particular build configuration:

Table 4.7 AOCL-BLAS Application - Link Options#

Application Type

Linking Type

Link Options

Single-threaded

Static

<path-to-AOCL-BLAS-library>/libblis.a -lm -lpthread

Single-threaded

Dynamic

-L<path-to-AOCL-BLAS-library> -lblis -lm -lpthread

Multi-threaded

Static

<path-to-AOCL-BLAS-library>/libblis-mt.a -lm -fopenmp

Multi-threaded

Dynamic

-L<path-to-AOCL-BLAS-library> -lblis-mt -lm -fopenmp
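Combining the table with the basic build command above, the four combinations might look as follows (the install path /opt/aocl-blas is illustrative):

```shell
# Single-threaded, static
gcc test_blis.c -I/opt/aocl-blas/include /opt/aocl-blas/lib/libblis.a \
    -lm -lpthread -o test_blis.x

# Single-threaded, dynamic
gcc test_blis.c -I/opt/aocl-blas/include -L/opt/aocl-blas/lib -lblis \
    -lm -lpthread -o test_blis.x

# Multi-threaded, static
gcc test_blis.c -I/opt/aocl-blas/include /opt/aocl-blas/lib/libblis-mt.a \
    -lm -fopenmp -o test_blis.x

# Multi-threaded, dynamic
gcc test_blis.c -I/opt/aocl-blas/include -L/opt/aocl-blas/lib -lblis-mt \
    -lm -fopenmp -o test_blis.x
```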

Example - Dynamic Linking and Execution

AOCL-BLAS can be built as a shared library. By default, the library is built as both static and shared libraries. Complete the following steps to build a shared lib version of AOCL-BLAS and link it with the user application:

  1. During configuration, enable the support for the shared lib using the following command:

    $ ./configure --disable-static --enable-shared zen
    
  2. Link the application with the generated shared library using the following command:

    $ gcc CBLAS_DGEMM_usage.c -I/path/to/include/aocl-blas/ \
      -L/path/to/aocl-blas/lib -lblis -lm -lpthread -o CBLAS_DGEMM_usage.x
    
  3. Ensure that the directory containing the shared library is in the library load path. Run the application using the following commands (this demo uses CBLAS_DGEMM_usage.c):

    $ export LD_LIBRARY_PATH="/path/to/aocl-blas/lib"
    $ ./CBLAS_DGEMM_usage.x
    a =
    1.000000 2.000000
    3.000000 4.000000
    b =
    5.000000 6.000000
    7.000000 8.000000
    c =
    19.000000   22.000000
    43.000000   50.000000
    

The same header can be used for both static and shared libraries on Windows. To access a DLL’s public data symbols and objects, you can define BLIS_EXPORT=__declspec(dllimport) to import those symbols explicitly. Importing is not required for:

  • The AOCL-BLAS and CBLAS interface users

  • Most of the cases where BLIS interface is used

4.2.2. AOCL-BLAS Usage in Fortran and C#

Simple BLAS and CBLAS examples in Fortran and C are in the following subsections.

4.2.2.1. Using BLAS API in Fortran#

For example, the following Fortran code performs a double-precision general matrix-matrix multiplication by calling the DGEMM BLAS API. A sample command to compile it and link it with the AOCL-BLAS library is shown after the code:

! File: BLAS_DGEMM_usage.f
! Example code to demonstrate BLAS DGEMM usage

program dgemm_usage

implicit none
EXTERNAL DGEMM

DOUBLE PRECISION, ALLOCATABLE :: a(:,:)
DOUBLE PRECISION, ALLOCATABLE :: b(:,:)
DOUBLE PRECISION, ALLOCATABLE :: c(:,:)
INTEGER I, J, M, N, K, lda, ldb, ldc
DOUBLE PRECISION alpha, beta

M=2
N=M
K=M
lda=M
ldb=K
ldc=M
alpha=1.0
beta=0.0

ALLOCATE(a(lda,K), b(ldb,N), c(ldc,N))
a=RESHAPE((/ 1.0, 3.0, &
             2.0, 4.0 /), (/lda,K/))
b=RESHAPE((/ 5.0, 7.0, &
             6.0, 8.0 /), (/ldb,N/))

WRITE(*,*) ("a =")
DO I = LBOUND(a,1), UBOUND(a,1)
    WRITE(*,*) (a(I,J), J=LBOUND(a,2), UBOUND(a,2))
END DO
WRITE(*,*) ("b =")
DO I = LBOUND(b,1), UBOUND(b,1)
    WRITE(*,*) (b(I,J), J=LBOUND(b,2), UBOUND(b,2))
END DO

CALL DGEMM('N','N',M,N,K,alpha,a,lda,b,ldb,beta,c,ldc)

WRITE(*,*) ("c =")
DO I = LBOUND(c,1), UBOUND(c,1)
    WRITE(*,*) (c(I,J), J=LBOUND(c,2), UBOUND(c,2))
END DO

end program dgemm_usage

A sample compilation command with gfortran compiler for the code above:

$ gfortran -ffree-form BLAS_DGEMM_usage.f /path/to/libblis.a

4.2.2.2. Using BLAS API in C#

Following is the C version of the Fortran code in section Using BLAS API in Fortran. It uses the standard BLAS API. The following process takes place during the execution of the code:

  1. The matrices are transposed to account for the row-major storage of C and the column-major convention of BLAS (inherited from Fortran).

  2. The function arguments are passed by address again to be in line with Fortran conventions.

  3. There is a trailing underscore in the function name (dgemm_) as BLAS APIs require Fortran compilers to add a trailing underscore.

  4. blis.h is included as a header. A sample command to compile it and link with the AOCL-BLAS library is also shown in the following code:

    // File: BLAS_DGEMM_usage.c
    // Example code to demonstrate BLAS DGEMM usage
    
    #include <stdio.h>
    #include "blis.h"
    
    #define DIM 2
    
    int main() {
    
       double a[DIM * DIM] = { 1.0, 2.0, 3.0, 4.0 };
       double b[DIM * DIM] = { 5.0, 7.0, 6.0, 8.0 };
       double c[DIM * DIM];
       int I, J, M, N, K, lda, ldb, ldc; double alpha, beta;
    
       M = DIM;
       N = M; K = M;
       lda = M; ldb = K; ldc = M;
       alpha = 1.0;
       beta = 0.0;
    
       printf("a = \n");
       for ( I = 0; I < M; I ++ ) {
         for ( J = 0; J < K; J ++ ) {
           printf("%f\t", a[J * K + I]);
         }
         printf("\n");
       }
    
       printf("b = \n");
       for ( I = 0; I < K; I ++ ) {
         for ( J = 0; J < N; J ++ ) {
           printf("%f\t", b[J * N + I]);
         }
         printf("\n");
       }
    
       dgemm_("N","N",&M,&N,&K,&alpha,a,&lda,b,&ldb,&beta,c,&ldc);
       printf("c = \n");
    
       for ( I = 0; I < M; I ++ ) {
         for ( J = 0; J < N; J ++ ) {
           printf("%f\t", c[J * N + I]);
         }
         printf("\n");
       }
    
       return 0;
    }
    

A sample compilation command with a gcc compiler for the code above:

$ gcc BLAS_DGEMM_usage.c -I/path/to/include/aocl-blas/ \
  /path/to/libblis.a -lpthread -lm

4.2.2.3. Using CBLAS API in C#

This section contains an example application written in C code using the CBLAS API for DGEMM. The following process takes place during the execution of the code:

  1. The CBLAS Layout option is used to choose row-major layout which is consistent with C.

  2. The function arguments are passed by value.

  3. cblas.h is included as a header. A sample command to compile it and link with the AOCL-BLAS library is also shown in the following code:

    // File: CBLAS_DGEMM_usage.c
    // Example code to demonstrate CBLAS DGEMM usage
    
    #include <stdio.h>
    #include "cblas.h"
    #define DIM 2
    
    int main() {
    
       double a[DIM * DIM] = { 1.0, 2.0, 3.0, 4.0 };
       double b[DIM * DIM] = { 5.0, 6.0, 7.0, 8.0 };
       double c[DIM * DIM];
       int I, J, M, N, K, lda, ldb, ldc; double alpha, beta;
    
       M = DIM;
       N = M; K = M;
       lda = M; ldb = K; ldc = M;
       alpha = 1.0;
       beta = 0.0;
    
       printf("a = \n");
       for ( I = 0; I < M; I ++ ) {
         for ( J = 0; J < K; J ++ ) {
           printf("%f\t", a[I * K + J]);
         }
         printf("\n");
       }
    
       printf("b = \n");
       for ( I = 0; I < K; I ++ ) {
         for ( J = 0; J < N; J ++ ) {
           printf("%f\t", b[I * N + J]);
         }
         printf("\n");
       }
    
       cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K,
       alpha, a, lda, b, ldb, beta, c, ldc);
    
       printf("c = \n");
       for ( I = 0; I < M; I ++ ) {
         for ( J = 0; J < N; J ++ ) {
           printf("%f\t", c[I * N + J]);
         }
         printf("\n");
       }
    
       return 0;
    }
    

Note

To get the CBLAS API with AOCL-BLAS, you must provide the flag --enable-cblas to the configure command while building the AOCL-BLAS library.

A sample compilation command with a gcc compiler for the code above is as follows:

$ gcc CBLAS_DGEMM_usage.c -I/path/to/include/aocl-blas/ \
  /path/to/libblis.a -lpthread -lm

4.2.2.4. Returning Complex Numbers#

The GNU Fortran compiler (gfortran), AOCC (Flang), and the Intel Fortran compiler (ifort) have different conventions for returning complex numbers from C functions:

  • The AOCC (Flang) and Intel (ifort) compilers return complex numbers using a hidden first argument: the caller must pass a pointer to the return value as the first parameter.

  • The GNU compiler (gfortran) returns complex numbers in registers; thus, the complex number is returned as the return value of the function itself.

gfortran Example:

  • Configure Option:

    --complex-return=gnu
    
  • API Call:

    ret_value = cdotc_(&n, x, &incx, y, &incy);
    

AOCC Example:

  • Configure Option:

    --complex-return=intel CC=clang CXX=clang++
    
  • API Call:

    cdotc_(&ret_value, &n, x, &incx, y, &incy);
    

This feature is currently enabled only for cdotc, cdotu, zdotc, and zdotu APIs.

4.3. Migrating/Porting#

Applications written for MKL, OpenBLAS, or any other library using the standard BLAS or CBLAS interfaces can be ported to AOCL-BLAS with minimal or no changes.

Complete the following steps to port from BLAS or CBLAS to AOCL-BLAS:

  1. Update the source code to include the correct header files.

  2. Update the build script or makefile to use the correct compile and link options.

The following table lists the compiler and linker options to use while porting to AOCL-BLAS:

Table 4.8 Porting to AOCL-BLAS#

MKL
  Header file: mkl.h
  Link options: -lmkl_intel_lp64 -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lmkl_intel_thread

OpenBLAS
  Header file: cblas.h
  Link options: -lopenblas

AOCL-BLAS (single-threaded)
  Header file: blis.h/cblas.h
  Link options: -lm -lblis -lpthread

AOCL-BLAS (multi-threaded)
  Header file: blis.h/cblas.h
  Link options: -lm -fopenmp -lblis-mt
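As a sketch, the link line for a hypothetical app.c would change as follows when porting (libraries are assumed to be on the default search paths):

```shell
gcc app.c -lopenblas -o app              # OpenBLAS
gcc app.c -lm -lblis -lpthread -o app    # AOCL-BLAS, single-threaded
gcc app.c -lm -fopenmp -lblis-mt -o app  # AOCL-BLAS, multi-threaded
```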

4.4. Using AOCL-BLAS Library Features#

4.4.1. Dynamic Dispatch#

Starting from AOCL 3.1, AOCL-BLAS supports the Dynamic Dispatch feature. It enables you to use the same binary with different code paths optimized for different architectures.

4.4.1.1. Purpose#

Before Dynamic Dispatch, the user had to build different binaries for each CPU architecture, that is, AMD “Zen”, AMD “Zen2”, and AMD “Zen3” architectures. Furthermore, when building the application, users had to ensure that they used the correct AMD “Zen”-based library as needed for the platform. This becomes challenging when using AOCL-BLAS on a cluster having nodes of different architectures.

Dynamic Dispatch addresses this issue by building a single binary compatible with all the AMD “Zen” architectures. At the runtime, the Dynamic Dispatch feature enables optimizations specific to the detected AMD “Zen” architecture.

4.4.1.2. On Non-AMD “Zen” Architectures#

The Dynamic Dispatch feature supports AMD “Zen”, AMD “Zen2”, AMD “Zen3”, AMD “Zen4”, and AMD “Zen5” architectures in a single binary. However, it also includes a generic architecture to support older x86-64 processors. The generic architecture uses a pure C implementation of the APIs and does not use any architecture-specific features.

The specific compiler flags used for building the library with generic configuration are:

-O2 -funsafe-math-optimizations -ffp-contract=fast -Wall \
-Wno-unused-function -Wfatal-errors

Note

As no architecture-specific optimizations or vectorized kernels are enabled, performance with the generic architecture may be significantly lower than with the architecture-specific implementations.

Previous AOCL-BLAS releases identified the processor based on Family, Model, and other cpuid features, and selected the appropriate code path based on the preprogrammed choices. With Dynamic Dispatch, an unknown processor would fall through to the slow generic code path, although users could override this by setting an environment variable BLIS_ARCH_TYPE to a suitable value.

From AOCL-BLAS 4.2, additional cpuid tests based on AVX2 and AVX512 instruction support are used to enable AMD “Zen3”, AMD “Zen4” or AMD “Zen5” code paths to be selected by default on suitable x86-64 processors (i.e. future AMD processors and current or future Intel processors). These AMD Zen code paths are not (re-)optimized specifically for these different architectures but should perform better than the slow generic code path.

To be more specific:

  • AVX2 support requires AVX2 and FMA3.

  • AVX512 support requires AVX512 F, DQ, CD, BW, and VL.
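These requirement lists can be checked from application code as well. The sketch below uses the GCC/Clang `__builtin_cpu_supports` builtin (x86 only) to mirror the two feature tests above; it is an illustration with hypothetical function names, not the detection logic AOCL-BLAS itself uses.

```c
/* Illustrative feature checks mirroring the requirements listed above.
 * Requires GCC or Clang on an x86-64 target; not AOCL-BLAS code. */

/* Returns 1 if the AVX2 code-path requirements (AVX2 + FMA3) are met. */
int has_avx2_path(void)
{
    return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma");
}

/* Returns 1 if the AVX512 code-path requirements (F, DQ, CD, BW, VL) are met. */
int has_avx512_path(void)
{
    return __builtin_cpu_supports("avx512f")  &&
           __builtin_cpu_supports("avx512dq") &&
           __builtin_cpu_supports("avx512cd") &&
           __builtin_cpu_supports("avx512bw") &&
           __builtin_cpu_supports("avx512vl");
}
```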

4.4.1.3. Using Dynamic Dispatch#

Building AOCL-BLAS

Dynamic Dispatch must be enabled while building the AOCL-BLAS library. This is done by building the library for amdzen configuration as explained in Build AOCL-BLAS from Source on Linux.

Code Path Information

Dynamic Dispatch can print debugging information on the selected code path. This is enabled by setting the environment variable BLIS_ARCH_DEBUG=1.

Architecture Selection at Runtime

For most use cases, Dynamic Dispatch will detect the underlying architecture and enable appropriate code paths and optimizations. However, AOCL-BLAS can be forced to use a specific architecture by setting either the environment variable AOCL_ENABLE_INSTRUCTIONS or BLIS_ARCH_TYPE as follows:

AOCL_ENABLE_INSTRUCTIONS=value <AOCL-BLAS linked application>

or

BLIS_ARCH_TYPE=value <AOCL-BLAS linked application>

where value = {avx512, avx2, zen5, zen4, zen3, zen2, zen, generic}. Note the following:

  • The code path names are not case sensitive but the environment variable names are.

  • Options for older x86-64 vector ISAs (e.g. avx, sse2) are also supported but in general will correspond to the generic code path.

  • In AOCL-BLAS builds with configuration amdzen, avx512 is an alias for zen4 and avx2 is an alias for zen3.

  • AOCL_ENABLE_INSTRUCTIONS is intended to become the standard option for controlling dynamic dispatch (where supported) across all the AOCL components.

  • BLIS_ARCH_TYPE is specific to the BLIS code used in AOCL-BLAS.

  • If both are specified, BLIS_ARCH_TYPE takes precedence and AOCL_ENABLE_INSTRUCTIONS is ignored by AOCL-BLAS. This provides the option of using AOCL_ENABLE_INSTRUCTIONS to control the other AOCL libraries while specifying a different option for AOCL-BLAS using BLIS_ARCH_TYPE.

  • The operation of AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE differs slightly:

    • If AOCL_ENABLE_INSTRUCTIONS is in operation, AOCL-BLAS will check if the instruction set required by the code path selected is enabled within the library and supported by the processor. If not, it will use the default choice for that architecture. In other words, AOCL_ENABLE_INSTRUCTIONS should be used to restrict a processor to an earlier instruction set, rather than try to force a later one on an older processor.

    • By contrast, if BLIS_ARCH_TYPE is in operation, that code path will be used irrespective of the compatibility with the processor.

  • Specifying a particular code path completely overrides the automatic selection; thus, the following scenarios are possible:

    • A code path that is unavailable in the AOCL-BLAS build is selected. This results in an error message from the AOCL-BLAS library, which then aborts. This only applies when using BLIS_ARCH_TYPE (at AOCL 4.2 it also applied to AOCL_ENABLE_INSTRUCTIONS).

    • A code path executes instructions unavailable on the processor being used, for example, trying to run the AMD “Zen4” code path (which may use AVX512 instructions) on an AMD “Zen3” or older system. If this happens, the program may stop with an “illegal instruction” error. This applies only when BLIS_ARCH_TYPE is used; whether an illegal instruction is actually executed can depend on the routine and the problem size.

In some circumstances, AOCL-BLAS aborting on an error from BLIS_ARCH_TYPE being set incorrectly may not be acceptable. If you are building AOCL-BLAS from source, there are two options to mitigate this issue. One is to change the environment variable used from BLIS_ARCH_TYPE to another name, for example:

$ ./configure --enable-cblas --prefix=<your-install-dir> \
  --rename-blis-arch-type=MY_BLIS_ARCH_TYPE amdzen
... make aocl-blas library
... compile program linking with aocl-blas
$ export BLIS_ARCH_TYPE=zen3
$ export MY_BLIS_ARCH_TYPE=zen2
$ ./program.exe

This will cause program.exe (which uses AOCL-BLAS) to ignore the setting of BLIS_ARCH_TYPE to zen3. Instead, it will take the value of MY_BLIS_ARCH_TYPE and use the zen2 code path. When --rename-blis-arch-type is used, AOCL_ENABLE_INSTRUCTIONS remains enabled in the build, but MY_BLIS_ARCH_TYPE (in this example) would take precedence if both are set.

Alternatively, the mechanism to allow manual selection of code path can be disabled:

$ ./configure --enable-cblas --prefix=<your-install-dir> \
  --disable-blis-arch-type amdzen

In this case, Dynamic Dispatch will still occur among the included code paths, but only by automatic selection based on the processor architecture. Manual selection of code path by both AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE is disabled.

Model Selection at Runtime

Recent AMD “Zen” generations have added more diverse choices of core designs and cache characteristics. For example, Milan and Milan-X variants at AMD “Zen3”; Genoa, Bergamo, and Genoa-X variants at AMD “Zen4”. Some AOCL-BLAS APIs may be tuned differently for these different models. The appropriate model will be selected automatically by Dynamic Dispatch.

However, AOCL can be forced to use a specific model by setting the environment variable BLIS_MODEL_TYPE as follows:

BLIS_MODEL_TYPE=value <AOCL-BLAS linked application>

where value = {Milan, Milan-X, Genoa, Bergamo, Genoa-X, Turin, Turin-Dense}. Note the following:

  • Different model values correspond to specific BLIS_ARCH_TYPE values (either set automatically or explicitly by the user). Thus, Milan and Milan-X correspond to AMD “Zen3”; Genoa, Bergamo, and Genoa-X correspond to AMD “Zen4”, and Turin and Turin-Dense correspond to AMD “Zen5”.

  • Incorrect values of BLIS_MODEL_TYPE do not cause an error; the default model type for the selected architecture will be used.

  • The number of APIs that have different optimizations by model type is currently very small. Setting this environment variable may provide consistent results across different models if consistency is a higher priority than best performance.

As with BLIS_ARCH_TYPE, when building BLAS from source, the name of the environment variable used to set the model type can be changed, for example:

$ ./configure --enable-cblas --prefix=<your-install-dir> \
  --rename-blis-model-type=MY_BLIS_MODEL_TYPE amdzen

Disabling the mechanism that allows manual selection of the BLAS architecture also disables manual selection of the model:

$ ./configure --enable-cblas --prefix=<your-install-dir> \
  --disable-blis-arch-type amdzen

Setting either of these environment variables makes sense only when using a build of AOCL-BLAS that includes multiple code paths.

Thus, AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE are disabled by default in all the builds containing only a single code path.

4.4.2. AOCL-BLAS - Running the Test Suite#

The AOCL-BLAS source directory contains a test suite to verify the functionality of AOCL-BLAS and BLAS APIs. The test suite invokes the APIs with different inputs and verifies that the results are within the expected tolerance limits.

For more information, refer to amd/blis.

4.4.2.1. Multi-Thread Test Suite Performance#

Starting from AOCL-BLAS 3.1, if the number of threads is not specified, AOCL-BLAS uses a number of threads equal to the number of cores available on the system. A higher number of threads results in better performance for the medium to large matrix sizes found in practical use cases.

However, a higher number of threads results in poor performance for the very small sizes used by the test and check features. Hence, you must specify the number of threads when running the test/test suite.

The recommended number of threads to run the test suite is 1 or 2.

Running Test Suite

Execute the following command to invoke the test suite:

$ OMP_NUM_THREADS=2 make test

The sample output from the execution of the command is as follows:

$ OMP_NUM_THREADS=2 make test

Compiling obj/zen3/testsuite/test_addm.o
Compiling obj/zen3/testsuite/test_addv.o

<<< More compilation output >>>

Compiling obj/zen3/testsuite/test_xpbym.o
Compiling obj/zen3/testsuite/test_xpbyv.o

Linking test_libblis-mt.x against 'lib/zen3/libblis-mt.a -lm
-lpthread -fopenmp -lrt' Running test_libblis-mt.x with output
redirected to 'output.testsuite'

check-blistest.sh: All BLIS tests passed! Compiling
obj/zen3/blastest/cblat1.o Compiling obj/zen3/blastest/abs.o

<<< More compilation output >>>

Compiling obj/zen3/blastest/wsfe.o
Compiling obj/zen3/blastest/wsle.o
Archiving obj/zen3/blastest/libf2c.a

Linking cblat1.x against 'libf2c.a lib/zen3/libblis-mt.a -lm
-lpthread -fopenmp -lrt' Running cblat1.x > 'out.cblat1'

<<< More compilation output >>>

Linking zblat3.x against 'libf2c.a lib/zen3/libblis-mt.a -lm
-lpthread -fopenmp -lrt' Running zblat3.x <
'./blastest/input/zblat3.in' (output to 'out.zblat3')

check-blastest.sh: All BLAS tests passed!

4.4.3. Testing/Benchmarking#

The AOCL-BLAS source includes API-specific test drivers; this section explains how to use the GEMM driver with a specific set of matrix sizes.

The source file for the GEMM benchmark is test/test_gemm.c and the executable is test/test_gemm_blis.x.

Complete the following steps to execute the GEMM tests on specific input parameters:

Enabling File Inputs

By default, file input/output is disabled (start, end, and step sizes are used instead). To enable file inputs, complete the following steps:

  1. Open the file test/test_gemm.c.

  2. Uncomment the macro at the start of the file:

    #define FILE_IN_OUT
    

Building Test Driver

Execute the following commands to build the test driver:

$ cd tests
$ make -j blis

Creating an Input File

The input file accepts matrix sizes and strides in the following format: each entry is on its own line, and the dimensions within an entry are separated by spaces:

m k n lda ldb ldc

Where:

  • Matrix A is of size m x k

  • Matrix B is of size k x n

  • Matrix C is of size m x n

  • lda is leading dimension of matrix A

  • ldb is leading dimension of matrix B

  • ldc is leading dimension of matrix C

This test application (test_gemm.c) assumes column-major storage of matrices.

The valid values of lda, ldb, and ldc for the GEMM operation C = beta*C + alpha*A*B are as follows:

  • lda >= m

  • ldb >= k

  • ldc >= m
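The size and leading-dimension rules above can be captured in a small helper. The following sketch is a hypothetical validity check (not part of AOCL-BLAS); the comment also notes how an element is addressed under the column-major convention used by test_gemm.c.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if (m, k, n, lda, ldb, ldc) describe a
 * valid column-major GEMM problem per the rules above, 0 otherwise.
 * Under column-major storage, element A(i, j) lives at a[i + (size_t)j * lda]. */
int gemm_dims_valid(int m, int n, int k, int lda, int ldb, int ldc)
{
    return m > 0 && n > 0 && k > 0 &&
           lda >= m &&   /* A is m x k */
           ldb >= k &&   /* B is k x n */
           ldc >= m;     /* C is m x n */
}
```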

Running the Tests

Execute the following commands to run the tests:

$ cd tests
$ ./test_gemm_blis.x <input file name> <output file name>

An execution sample (with the test driver) for GEMM is as follows:

$ cat inputs.txt
200 100 100 200 200 200
10  4   1   100 100 100
4000 4000 400 4000 4000 4000
$ ./test_gemm_blis.x inputs.txt outputs.txt
_BLAS          m    k    n   cs_a cs_b cs_c gflops
data_gemm_blis 200  100  100 200  200  200  27.211
data_gemm_blis 10   4    1   100  100  100  0.027
data_gemm_blis 4000 4000 400 4000 4000 4000 45.279
$ cat outputs.txt
m    k    n    cs_a  cs_b cs_c  gflops
200  100  100  200   200  200   27.211
10   4    1    100   100  100   0.027
4000 4000 400  4000  4000 4000  45.279
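The gflops column is presumably derived from the 2*m*n*k floating-point operations a GEMM performs divided by the elapsed time; the exact code in test_gemm.c is not reproduced here, so the helper below is an assumed reconstruction of that formula.

```c
/* Assumed reconstruction of the gflops metric reported above:
 * a GEMM performs 2*m*n*k floating-point operations (one multiply and
 * one add per inner-product term). */
double gemm_gflops(long m, long n, long k, double seconds)
{
    return (2.0 * (double)m * (double)n * (double)k) / (seconds * 1.0e9);
}
```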

4.4.4. AOCL-BLAS Utility APIs#

This section explains some of the AOCL-BLAS APIs used to get the AOCL-BLAS library configuration information and for configuring optimization tuning parameters.

Table 4.9 AOCL-BLAS Utility APIs#

  bli_info_get_version_str()
      Returns the version string, in the form AOCL-BLAS 5.0.0 Build yyyyddmm.

  bli_info_get_info_value()
      Returns the value of INFO from the previous call by this user thread to a BLAS2 or BLAS3 routine. For more information, refer to Error Handling in AOCL-BLAS.

  bli_info_get_enable_openmp() / bli_info_get_enable_pthreads() / bli_info_get_enable_threading()
      Return true if OpenMP/pthreads/threading is enabled and false otherwise.

  bli_thread_get_num_threads()
      Returns the default number of threads used for subsequent BLAS calls.

  bli_thread_set_num_threads(dim_t n_threads)
      Sets the number of threads for subsequent BLAS calls.

  bli_thread_set_ways(dim_t jc, dim_t pc, dim_t ic, dim_t jr, dim_t ir)
      Sets the number of threads for the different levels of parallelization, as per the GotoBLAS five-loop architecture.

For more details on the threading-related APIs, see amd/blis

4.5. Debugging and Troubleshooting#

4.5.1. Error Handling in AOCL-BLAS#

The original Netlib BLAS defined an error-handling function, XERBLA, which is called within BLAS2 and BLAS3 routines if an incorrect input argument is detected. Only incorrect matrix and vector sizes, and invalid option arguments (transpose, upper or lower for a symmetric matrix, and so on), can be detected. BLAS does not detect extreme values (such as Inf or NaN) within the supplied matrices and vectors; it is the user’s responsibility to check for these if required.

The functionality of Netlib’s XERBLA is to print a message to standard output and stop execution of the process. Stopping is extremely unhelpful in many applications and usage scenarios. Thus, AOCL-BLAS, in common with other similar libraries, has traditionally disabled the stop statement. In AOCL 4.2, the functionality of AOCL-BLAS has been enhanced to give users more choice over both stopping the application on error and printing a message on error. In AOCL 5.0 this functionality was added to the similar cblas_xerbla error handling function. The choices are specified by setting each of the environment variables BLIS_STOP_ON_ERROR and BLIS_PRINT_ON_ERROR to 0 or 1 to respectively disable or enable the functionality. The default values for each are:

Table 4.10 AOCL-BLAS - Error Handlers#

  Environment Variable     Default Value
  BLIS_STOP_ON_ERROR       0
  BLIS_PRINT_ON_ERROR      1

When the stop on error is disabled, no error code is passed back to the user application through the BLAS interface arguments, unlike the INFO argument used in LAPACK routines. Therefore, AOCL-BLAS has also added an extra function to return the value of INFO from the previous call to a BLAS routine made by the same thread. The function can be called as follows:

In C/C++:

#include <blis.h>
...
gint_t info_value = bli_info_get_info_value();

In Fortran:

integer :: info_value
integer, external :: bli_info_get_info_value
...
info_value = bli_info_get_info_value()

If the returned value is not zero, the value indicates the argument in the preceding BLAS call that was incorrect.
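The non-zero INFO values follow the Netlib argument-numbering convention. As a rough illustration, the following pure-C sketch (a hypothetical helper, not an AOCL-BLAS function) returns the 1-based position of the first invalid DGEMM argument, mirroring the checks the reference DGEMM performs:

```c
#include <ctype.h>

/* Hypothetical illustration of the Netlib DGEMM argument checks:
 * returns 0 if all arguments are valid, otherwise the 1-based position
 * of the first invalid argument (as XERBLA/INFO would report it). */
static int valid_trans(char t)
{
    t = (char)toupper((unsigned char)t);
    return t == 'N' || t == 'T' || t == 'C';
}

int dgemm_check(char transa, char transb, int m, int n, int k,
                int lda, int ldb, int ldc)
{
    if (!valid_trans(transa)) return 1;
    if (!valid_trans(transb)) return 2;
    if (m < 0) return 3;
    if (n < 0) return 4;
    if (k < 0) return 5;
    int nrowa = (toupper((unsigned char)transa) == 'N') ? m : k;
    int nrowb = (toupper((unsigned char)transb) == 'N') ? k : n;
    if (lda < (nrowa > 1 ? nrowa : 1)) return 8;   /* args 6, 7 are alpha, A */
    if (ldb < (nrowb > 1 ? nrowb : 1)) return 10;  /* arg 9 is B */
    if (ldc < (m > 1 ? m : 1)) return 13;          /* args 11, 12 are beta, C */
    return 0;
}
```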

Note

Errors from an incorrect setting of the BLIS_ARCH_TYPE environment variable (used to override the default choice in dynamic dispatch, refer to Using Dynamic Dispatch for details) are handled by a separate error mechanism and will not be affected by the environment variables BLIS_STOP_ON_ERROR and BLIS_PRINT_ON_ERROR.

4.5.2. Debugging Build Using GDB#

The AOCL-BLAS library can be debugged on Linux using GDB. To enable debugging support, build the library with the --enable-debug flag. Use the following commands to configure and build the debug version of AOCL-BLAS:

$ cd blis_src
$ ./configure --enable-cblas --enable-debug auto
$ make -j

Use the following commands to build the application with debug support and link it against the AOCL-BLAS binary:

$ cd blis_src
$ gcc -g -O0 -I<path-to-AOCL-BLAS-header> test_gemm.c \
  <path-to-AOCL-BLAS-library>/libblis.a -lpthread -lm \
  -o test_gemm_blis.x

You can debug the application using gdb. A sample output of the gdb session is as follows:

$ gdb ./test_gemm_blis.x
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-12.el8

Reading symbols from ./test_gemm_blis.x...done.
(gdb) break bli_gemm_small
Breakpoint 1 at 0x677543: file kernels/zen/3/bli_gemm_small.c, line 110.
(gdb) run
Starting program: /home/dipal/work/blis_dtl/test/test_gemm_blis.x
Using host libthread_db library "/lib64/libthread_db.so.1".
BLIS Library version is : AOCL BLIS 3.1

Breakpoint 1, bli_gemm_small (alpha=0x7fffffffcf40, a=0x2471b30,
b=0x7fffffffd1c0, beta=0x2465400 <BLIS_ZERO>,
c=0x4fe66e <bli_obj_equals+300>, cntx=0x7fffffffb320, cntl=0x0) at
kernels/zen/3/bli_gemm_small.c:110

110 {

(gdb) bt
#0 bli_gemm_small (alpha=0x7fffffffcf40, a=0x2471b30,
b=0x7fffffffd1c0, beta=0x2465400
<BLIS_ZERO>,
c=0x4fe66e <bli_obj_equals+300>, cntx=0x7fffffffb320, cntl=0x0) at
kernels/zen/3/bli_gemm_small.c:110

#1 0x00000000007caab6 in bli_gemm_front (alpha=0x7fffffffd1c0,
a=0x7fffffffd120, b=0x7fffffffd080,
beta=0x7fffffffcfe0, c=0x7fffffffcf40, cntx=0x2471b30,
rntm=0x7fffffffce50, cntl=0x0) at frame/3/gemm/bli_gemm_front.c:83

#2 0x00000000005baf42 in bli_gemmnat (alpha=0x7fffffffd1c0,
a=0x7fffffffd120, b=0x7fffffffd080,
beta=0x7fffffffcfe0, c=0x7fffffffcf40, cntx=0x2471b30,
rntm=0x7fffffffce50) at frame/ind/oapi/bli_l3_nat_oapi.c:83

#3 0x00000000005474a2 in dgemm_ (transa=0x7fffffffd363 "N\320\a",
transb=0x7fffffffd362 "NN\320\a",
m=0x7fffffffd36c, n=0x7fffffffd364, k=0x7fffffffd368,
alpha=0x24733c0, a=0x7ffff53e2040, lda=0x7fffffffd378,

b=0x7ffff355d040, ldb=0x7fffffffd374, beta=0x2473340,
c=0x7ffff16d8040, ldc=0x7fffffffd370) at frame/compat/bla_gemm.c:559
#4 0x0000000000413a1c in main (argc=1, argv=0x7fffffffd988) at
test_gemm.c:321 (gdb)

4.5.3. Viewing Logs#

The AOCL-BLAS library provides Debug and Trace features:

  • Trace Log identifies the code path taken in terms of the function call chain. It prints the information on the functions invoked and their order.

  • Debug Log prints the other debugging information, such as values of input parameters, content, and data structures.

The key features of this functionality are as follows:

  • Can be enabled/disabled at compile time.

  • When these features are disabled at compile time, they consume no runtime resources and do not affect performance.

  • Compile time option is available to control the depth of trace/log levels.

  • All the traces are thread safe.

  • Performance data, such as execution time and gflops achieved, are also printed for xGEMM APIs.

4.5.3.1. Function Call Tracing#

The function call tracing is implemented using hard instrumentation of the AOCL-BLAS code. Here, the functions are grouped as per their position in the call stack. You can configure the level up to which the traces must be generated.

Complete the following steps to enable and view the traces:

  1. Enable the trace support as follows:

    1. Modify the source code to enable tracing.

      Open file <aocl-blas folder>/aocl_dtl/aocldtlcf.h
      
    2. Change the following macro from 0 to 1:

      #define AOCL_DTL_TRACE_ENABLE 0
      
  2. Configure the trace depth level.

    1. Modify the source code to specify the trace depth level.

      Open file <aocl-blas folder>/aocl_dtl/aocldtlcf.h
      
    2. Change the following macro as required. Beginning with Level 5 is a good compromise between detail and resource requirements. The higher the level, the deeper the call stack traced; a lower level reduces the depth of the call stack used for trace generation.

      #define AOCL_DTL_TRACE_LEVEL AOCL_DTL_LEVEL_TRACE_5
      
  3. Build the library as explained in Build AOCL-BLAS from Source on Linux.

  4. Run the application to generate the trace data.

    The trace output file for each thread is generated in the current folder.

    The following figure shows a sample running the call tracing function using the test_gemm application:

    _images/image5_aocl.jpeg

    Figure 4.5 Sample Run of Function Call Tracing#

    The trace data for each thread is saved in the file with appropriate naming conventions. The .txt extension is used to signify the readable file:

    P<process id>_T<thread id>_aocldtl_trace.txt

  5. View the trace data.

    The output of the call trace is in a readable format; you can open the file in any text editor. The first column shows the level in the call stack for the given function.
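The file-naming convention above can also be reproduced programmatically, for example when an application wants to locate its own trace files. The helper below is a hypothetical sketch of that convention, not an AOCL-BLAS API:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: builds the trace-file name used by the DTL feature,
 * P<process id>_T<thread id>_aocldtl_trace.txt, into buf. */
void trace_file_name(char *buf, size_t n, long pid, long tid)
{
    snprintf(buf, n, "P%ld_T%ld_aocldtl_trace.txt", pid, tid);
}
```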

4.5.3.2. Debug Logging#

The debug logging works very similarly to function call tracing and uses the same infrastructure. However, it can be enabled independently of the trace feature to avoid cluttering the overall debugging information. This feature is primarily used to print the input values of the AOCL-BLAS APIs. Additionally, it can be used to print any arbitrary debugging data (buffers, matrices, arrays, or text).

Complete the following steps to enable and view the debug logs:

  1. Enable the debug log support as follows:

    1. Modify the source code to enable debug logging.

      Open file <aocl-blas folder>/aocl_dtl/aocldtlcf.h
      
    2. Change the following macro from 0 to 1:

      #define AOCL_DTL_LOG_ENABLE 0
      
  2. Configure the trace depth level.

    1. Modify the source code to specify the debug log depth level.

      Open file <aocl-blas folder>/aocl_dtl/aocldtlcf.h
      
    2. Change the following macro as required. Beginning with Level 5 is a good compromise between detail and resource requirements. The higher the level (the maximum is 10), the deeper the call stack traced; a lower level reduces the depth of the call stack used for trace generation.

      #define AOCL_DTL_TRACE_LEVEL AOCL_DTL_LEVEL_TRACE_5
      
  3. Build the library as explained in Build AOCL-BLAS from Source on Linux.

  4. Run the application to generate the trace data.

    The trace output files for each thread are generated in the current folder.

    The following figure shows a sample running of AOCL-BLAS with the debug logs enabled using the test_gemm application:

    _images/image6_aocl.jpeg

    Figure 4.6 Sample Run with Debug Logs Enabled#

    The debug logs for each thread are saved in the file with appropriate naming conventions. The .txt extension is used to signify the readable file:

    P<process id>_T<thread id>_aocldtl_log.txt

  5. View the debug logs.

    The output of the debug logs is in a readable format; you can open the file in any text editor. The following figure shows the sample output for one of the threads of the test_gemm application:

    $ cat P3386555_T0_aocldtl_log.txt
    dgemm_blis_impl D N N 4000 4000 4000 1.300000 0.000000 4000 4000 0.700000 0.000000 4000 nt=1 911.148 ms 70.482 GFLOPS
    dgemm_blis_impl D N N 4000 4000 4000 1.300000 0.000000 4000 4000 0.700000 0.000000 4000 nt=8 121.024 ms 557.641 GFLOPS
    

4.5.3.3. Usage and Limitations#

The debug and trace logs have the following usage and limitations:

  • When tracing is enabled, there can be a significant drop in performance.

  • Only functions that have the trace feature in their code can be traced. To get trace information for any other function, the source code must be updated to add the trace/log macros to it.

  • Call tracing and debug logging are resource-intensive and can generate large amounts of data. Depending on the hardware configuration (disk space, number of cores and threads) used for the execution, logging may result in a sluggish or non-responsive system.

4.5.4. Checking AOCL-BLAS Operation Progress#

The AOCL libraries may be used to perform lengthy computations (for example, matrix multiplications and solvers involving large matrices). These operations can run for hours.

The AOCL progress feature provides a mechanism for the application to check the progress of the computation. The AOCL libraries (AOCL-BLAS and AOCL-LAPACK) periodically update the application with the progress made through a callback function.

Usage

The application must define the callback function in a specific format and register it with the AOCL library.

Callback Definition

The callback function prototype must be defined as follows:

dim_t AOCL_progress(const char* const api, const dim_t lapi, const dim_t progress,
const dim_t current_thread, const dim_t total_threads)

However, you can modify the function name as per your preference.

The following table explains different parameters passed to the callback function:

Table 4.11 Callback Parameters#

  Parameter        Purpose
  api              Name of the API currently running
  lapi             Length of the API name string (*api)
  progress         Linear progress made in the current thread so far
  current_thread   Current thread ID
  total_threads    Total number of threads used to perform the operation

Callback Registration

The callback function must be registered with the library for reporting the progress. Each library has its own callback registration function. The registration can be done by calling:

AOCL_BLIS_set_progress(AOCL_progress); // for AOCL-BLAS

Example

The library invokes the callback function at appropriate intervals; it is up to the user to consume the information appropriately. The following example shows how to use it to print the progress to standard output:

dim_t AOCL_progress(const char* const api, const dim_t lapi,
                    const dim_t progress, const dim_t current_thread,
                    const dim_t total_threads)
{
    printf("\n%s, total thread = %lld, processed %lld element by thread %lld.",
           api, total_threads, progress, current_thread);

    return 0;
}

Register the callback with:

AOCL_BLIS_set_progress(AOCL_progress); // for AOCL-BLAS

The result is displayed in the following format (output truncated):

$ BLIS_NUM_THREADS=5 ./test_gemm_blis.x

dgemm, total thread = 5, processed 11796480 element by thread 4.
dgemm, total thread = 5, processed 17694720 element by thread 0.
dgemm, total thread = 5, processed 5898240 element by thread 2.
dgemm, total thread = 5, processed 20643840 element by thread 0.
dgemm, total thread = 5, processed 14745600 element by thread 3.
dgemm, total thread = 5, processed 14745600 element by thread 4.

Limitations

  • The feature only shows whether the operation is progressing; it does not provide an estimate or percentage of completion.

  • A separate callback must be registered for AOCL-BLAS, AOCL-LAPACK, and AOCL-ScaLAPACK.

4.6. LPGEMM in AOCL-BLAS#

4.6.1. Add-on in AOCL-BLAS#

An add-on in AOCL-BLAS provides additional APIs, operations, and/or implementations that may be useful to certain users. It can be a standalone extension of AOCL-BLAS that does not depend on any other add-on, although add-ons may utilize existing functionality or kernels within the core framework.

An add-on should never provide APIs that conflict with the interfaces belonging to the BLIS typed or object API. Thus, a properly constructed/functioning add-on would never interfere with or change the core BLIS functionality or the standard BLAS and CBLAS APIs.

Low Precision GEMM (LPGEMM) APIs were added as an add-on named aocl_gemm in AOCL-BLAS 4.1; they are used in Deep Neural Network (DNN) inference applications. For example, low precision DNNs take image pixels as unsigned 8-bit (u8) input together with quantized pre-trained weights of signed 8-bit (s8) width, and produce signed 32-bit or downscaled/quantized 8-bit output.

At the same time, these APIs are expected to utilize architecture features such as AVX512VNNI instructions, which are designed to take inputs in u8 and s8 and produce an s32 output at high throughput. Similarly, AVX512BF16 instructions expect input in the Brain Floating Point (bfloat16) type to provide higher throughput at lower precision than 32-bit.

4.6.2. API Naming and Arguments#

LPGEMM API names start with the prefix aocl_gemm_, followed by the data types of input matrices A and B, the accumulation type, and the output matrix C type.

For example, the aocl_gemm_u8s8s32os32() API expects input matrix A to be unsigned 8-bit (u8) and matrix B to be signed 8-bit (s8), with accumulation matrix C in signed 32-bit (s32) and an output matrix type of signed 32-bit (os32).
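To make the encoding concrete, the typedefs below spell out the operand types implied by the aocl_gemm_u8s8s32os32 name. These typedefs are purely illustrative and are not part of the aocl_gemm API:

```c
#include <stdint.h>

/* Hypothetical typedefs decoding the name aocl_gemm_u8s8s32os32: */
typedef uint8_t lpgemm_a_t;   /* u8  : input matrix A            */
typedef int8_t  lpgemm_b_t;   /* s8  : input matrix B            */
typedef int32_t lpgemm_acc_t; /* s32 : accumulation type         */
typedef int32_t lpgemm_c_t;   /* os32: output matrix C type      */
```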

4.6.3. Post-Operations#

Low precision GEMM operations are highly useful in AI applications, where precision can be traded for performance. In DNN applications, element-wise operations such as adding bias, clipping the output, ReLU, and GeLU are performed on the GEMM output; these are referred to here as post-operations (post-ops).

In LPGEMM, these post-ops are fused with the GEMM operation to avoid repeated memory accesses, thereby improving performance. The LPGEMM APIs take an additional argument through which the user provides information about the post-ops to perform after the GEMM operation.
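The fusion idea can be illustrated with a naive reference loop: the post-op (ReLU here) is applied to each output element while it is still in a register, instead of in a second pass over C. This is a scalar, column-major sketch for illustration only; it in no way represents the optimized LPGEMM kernels.

```c
#include <stddef.h>

/* Naive column-major sgemm with a fused ReLU post-op (illustration only):
 * C := relu(beta*C + alpha*A*B), A is m x k, B is k x n, C is m x n. */
void sgemm_relu(int m, int n, int k, float alpha, const float *a, int lda,
                const float *b, int ldb, float beta, float *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
            float v = beta * c[i + (size_t)j * ldc] + alpha * acc;
            /* fused post-op: ReLU applied before the store, no second pass */
            c[i + (size_t)j * ldc] = v > 0.0f ? v : 0.0f;
        }
    }
}
```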

4.6.4. APIs and Post-Ops in aocl_gemm#

4.6.4.1. Architecture Features and APIs#

Table 4.12 Required Architecture Features and APIs#

  AVX512-VNNI
      aocl_gemm_u8s8s32os32
      aocl_gemm_u8s8s32os8
      aocl_gemm_s8s8s32os32
      aocl_gemm_s8s8s32os8

  AVX2
      aocl_gemm_u8s8s16os16
      aocl_gemm_u8s8s16os8
      aocl_gemm_u8s8s16ou8
      aocl_gemm_s8s8s16os16
      aocl_gemm_s8s8s16os8

  AVX512-BF16
      aocl_gemm_bf16bf16f32of32
      aocl_gemm_bf16bf16f32obf16
      aocl_gemm_bf16s4f32of32
      aocl_gemm_bf16s4f32obf16

  AVX512 / AVX2
      aocl_gemm_f32f32f32of32

4.6.4.2. Utility APIs in aocl_gemm Add-on#

Table 4.13 GEMM API Supported Post-ops#

  Add bias
      Adds bias to the GEMM output before storing into C; the bias data is passed by the user through the post-op interface.

  ReLU
      Performs the ReLU operation on the GEMM output. f(x) = 0 when x <= 0 and f(x) = x when x > 0.

  PReLU
      Performs the Parametric ReLU operation on the GEMM output based on a scale given by the user. f(x) = x when x > 0 and f(x) = scale*x when x <= 0.

  SWISH
      Sigmoid Weighted Linear Unit (SiLU when beta = 1). SWISH(x) = x*sigmoid(beta*x).

  GeLU_Tanh
      Performs tanh-based GeLU on the GEMM output. GeLU_Tanh(x) = 0.5*x*(1 + tanh(0.797884*(x + 0.044715*x^3))).

  GeLU_ERF
      Performs erf-based GeLU on the GEMM output. GeLU_Erf(x) = 0.5*x*(1 + erf(x*0.707107)).

  Scale
      Performs a scale operation on the GEMM output based on the scale provided by the user.

  Clip
      Performs a clip operation on the GEMM output based on minimum and maximum values given by the user.

  Matrix Add
      Performs elementwise addition of a given matrix D to the GEMM output C: C := (beta*C + alpha*A*B) + D.

  Matrix Mul
      Performs elementwise multiplication of a given matrix D with the GEMM output and updates C: C := (beta*C + alpha*A*B) * D.

LPGEMM APIs support reordering the entire input matrix before calling GEMM as well as on-the-go packing, where the GEMM API takes care of packing the matrix internally. The following utility APIs are used to reorder the input weight matrix before calling GEMM:

Table 4.14 Utility APIs in aocl_gemm Add-on#

  aocl_get_reorder_buff_size_XXX()
      Returns the buffer size required to reorder an input matrix, where XXX corresponds to each of the data type combinations specified in Required Architecture Features and APIs, for example, u8s8s32os32.

  aocl_reorder_XXX()
      Reorders the given input and writes it into the output buffer.

  aocl_gelu_tanh_f32()
      Performs the tanh-based GeLU operation on each element of the given input buffer and writes the result to the output buffer.

  aocl_gelu_erf_f32()
      Performs the erf-based GeLU operation on each element of the given input buffer and writes the result to the output buffer.

  aocl_softmax_f32()
      Performs the softmax operation on the given input buffer and writes the result to the output buffer.

  aocl_gemm_eltwise_ops_XX
      Performs a sequence of element-wise operations on a given input matrix. The sequence can contain any of the supported post-ops.

4.6.5. Enabling aocl_gemm Add-on#

To enable the aocl_gemm add-on while building AOCL-BLAS from source on Linux:

  • Building with GCC:

$ ./configure -a aocl_gemm --enable-cblas --enable-threading=openmp \
  --prefix=<your-install-dir> CC=gcc CXX=g++ [auto | amdzen]
  • Building with AOCC:

$ ./configure -a aocl_gemm --enable-cblas --enable-threading=openmp \
  --prefix=<your-install-dir> CC=clang CXX=clang++ [auto | amdzen]
  • The aocl_gemm add-on feature is supported on Windows with Clang version 18.0 and later. To enable it, add -DENABLE_ADDON="aocl_gemm" to the CMake command line as described in Build AOCL-BLAS from Source on Windows.

  • Refer to the blis.h file for the prototypes of all LPGEMM APIs.

  • Some LPGEMM APIs are supported only when architecture features such as avx512vnni and avx512bf16 are available on the machine, as described in Required Architecture Features and APIs. These APIs return without doing anything when the required features are not available.

  • Transpose support for A and B is not available for AVX2 s16 APIs.

4.6.6. Sample Application 1#

The following sample application uses the LPGEMM APIs without post-ops:

/*
$ gcc test_lpgemm.c -o ./test_lpgemm.x -I/aocl-blis_install_directory/include/amdzen/
-L/aocl-blis_install_directory/lib/amdzen/ -lblis-mt -lm

Note: Export the blis library path to LD_LIBRARY_PATH before running the
executable ./test_lpgemm.x
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "blis.h"

// Example program to demonstrate LPGEMM API usage.
// aocl_gemm_f32f32f32of32 (A:float, B:float, C:float) used here.
int main()
{
    dim_t m = 1024;
    dim_t n = 1024;
    dim_t k = 1024;

    // Leading dimensions for row major matrices.
    dim_t lda = k;
    dim_t ldb = n;
    dim_t ldc = n;

    err_t err = BLIS_SUCCESS;

    // Initialize to NULL before the first possible bailout, so that the
    // cleanup path only frees successful allocations.
    float *a = NULL;
    float *b = NULL;
    float *c = NULL;

    a = (float *)bli_malloc_user(sizeof(float) * m * k, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    b = (float *)bli_malloc_user(sizeof(float) * n * k, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    c = (float *)bli_malloc_user(sizeof(float) * m * n, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    // Functions to fill the matrices with data can be added here.
    float alpha = 2.2;
    float beta = 9.15;
    char storage = 'r'; // Row major. Use 'c' for column major.
    char transa = 'n'; // No transpose. Transpose is not supported for all APIs.
    char transb = 'n';
    char reordera = 'n';
    char reorderb = 'n';

    aocl_gemm_f32f32f32of32(storage, transa, transb,
                          m, n, k,
                          alpha,
                          a, lda, reordera,
                          b, ldb, reorderb,
                          beta,
                          c, ldc,
                          NULL);

bailout:

    if (a != NULL)
    {
        bli_free_user(a);
    }
    if (b != NULL)
    {
        bli_free_user(b);
    }
    if (c != NULL)
    {
        bli_free_user(c);
    }

    return 0;
}

4.6.7. Sample Application 2#

The following sample application demonstrates the use of an LPGEMM API with a reordered B matrix and post-ops:

/*
$ gcc test_lpgemm.c -o ./test_lpgemm.x -I/aocl-blis_install_directory/include/amdzen/
-L/aocl-blis_install_directory/lib/amdzen/ -lblis-mt -lm

Note: Export the blis library path to LD_LIBRARY_PATH before running the
executable ./test_lpgemm.x
*/


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "blis.h"

// aocl_gemm_bf16bf16f32of32 (A:bfloat16, B:bfloat16, C:float) used here.
// 3 post-ops - bias + gelu_tanh + clip used here.
int main()
{
    dim_t m = 1024;
    dim_t n = 1024;
    dim_t k = 1024;

    // Leading dimensions for row major matrices.
    dim_t lda = k;
    dim_t ldb = n;
    dim_t ldc = n;

    err_t err = BLIS_SUCCESS;

    // Initialize all pointers to NULL before the first possible bailout,
    // so that the cleanup path only frees successful allocations.
    bfloat16 *a = NULL;
    bfloat16 *b = NULL;
    float *c = NULL;
    bfloat16 *b_reorder = NULL;
    aocl_post_op *post_ops = NULL;

    a = (bfloat16 *)bli_malloc_user(sizeof(bfloat16) * m * k, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    b = (bfloat16 *)bli_malloc_user(sizeof(bfloat16) * n * k, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    c = (float *)bli_malloc_user(sizeof(float) * m * n, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    // Functions to fill the matrices with data can be added here.
    float alpha = 2.95;
    float beta = 3.5;
    char storage = 'r'; // Row major. Use 'c' for column major.
    char transa = 'n'; // No transpose. Transpose not supported.
    char transb = 'n';
    char reordera = 'n';
    char reorderb = 'r'; // B matrix will be reordered.

    // Initialize post-ops struct.
    post_ops = (aocl_post_op *)bli_malloc_user(sizeof(aocl_post_op), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    // Zero the struct so the cleanup path sees NULL for unset members.
    memset(post_ops, 0, sizeof(aocl_post_op));

    dim_t max_post_ops_seq_length = 3; // bias + gelu_tanh + clip
    post_ops->seq_vector =
        (AOCL_POST_OP_TYPE *) bli_malloc_user(
                        max_post_ops_seq_length * sizeof(AOCL_POST_OP_TYPE),
                        &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    // 1 bias instance, needs to be allocated dynamically.
    post_ops->seq_vector[0] = BIAS;
    post_ops->bias =
            bli_malloc_user(1 * sizeof(aocl_post_op_bias), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    // Need to use the output accumulation (float) type for bias.
    (post_ops->bias + 0)->bias = bli_malloc_user(n * sizeof(float), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    // Add function to fill bias array here.

    // 2 element-wise post-ops, need to be allocated dynamically.
    post_ops->seq_vector[1] = ELTWISE; // For gelu_tanh
    post_ops->seq_vector[2] = ELTWISE; // For clip

    post_ops->eltwise =
            bli_malloc_user(2 * sizeof(aocl_post_op_eltwise), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    memset(post_ops->eltwise, 0, 2 * sizeof(aocl_post_op_eltwise));

    // Gelu tanh.
    (post_ops->eltwise + 0)->is_power_of_2 = FALSE;
    (post_ops->eltwise + 0)->scale_factor = NULL;
    (post_ops->eltwise + 0)->algo.alpha = NULL;
    (post_ops->eltwise + 0)->algo.beta = NULL;
    (post_ops->eltwise + 0)->algo.algo_type = GELU_TANH;

    // Clip.
    (post_ops->eltwise + 1)->is_power_of_2 = FALSE;
    (post_ops->eltwise + 1)->scale_factor = NULL;
    // Min bound is represented by alpha.
    (post_ops->eltwise + 1)->algo.alpha =
                            bli_malloc_user(sizeof(float), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    // Max bound is represented by beta.
    (post_ops->eltwise + 1)->algo.beta =
                            bli_malloc_user(sizeof(float), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    // Set some min/max bounds.
    *((float*)(post_ops->eltwise + 1)->algo.alpha) = -64.5;
    *((float*)(post_ops->eltwise + 1)->algo.beta) = 3.9;
    (post_ops->eltwise + 1)->algo.algo_type = CLIP;

    post_ops->seq_length = 3;

    // Reorder B matrix. This pre-packs the B matrix so that packing
    // costs are not incurred when executing GEMM.
    siz_t b_reorder_buffer_size =
        aocl_get_reorder_buf_size_bf16bf16f32of32(storage, transb, 'B', k, n);
    b_reorder =
        (bfloat16*)bli_malloc_user(b_reorder_buffer_size, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    aocl_reorder_bf16bf16f32of32(storage, transb, 'B',
                                 b, b_reorder,
                                 k, n, ldb);

    aocl_gemm_bf16bf16f32of32(storage, transa, transb,
                              m, n, k,
                              alpha,
                              a, lda, reordera,
                              b_reorder, ldb, reorderb,
                              beta,
                              c, ldc,
                              post_ops);

bailout:
    if (post_ops != NULL)
    {
        if (post_ops->eltwise != NULL)
        {
            if ((post_ops->eltwise + 1)->algo.alpha != NULL)
            {
                bli_free_user((post_ops->eltwise + 1)->algo.alpha);
            }
            if ((post_ops->eltwise + 1)->algo.beta != NULL)
            {
                bli_free_user((post_ops->eltwise + 1)->algo.beta);
            }
            bli_free_user(post_ops->eltwise);
        }
        if (post_ops->bias != NULL)
        {
            if ((post_ops->bias + 0)->bias != NULL)
            {
                bli_free_user((post_ops->bias + 0)->bias);
            }
            bli_free_user(post_ops->bias);
        }
        if (post_ops->seq_vector != NULL)
        {
            bli_free_user(post_ops->seq_vector);
        }
        bli_free_user(post_ops);
    }
    if (b_reorder != NULL)
    {
        bli_free_user(b_reorder);
    }
    if (a != NULL)
    {
        bli_free_user(a);
    }
    if (b != NULL)
    {
        bli_free_user(b);
    }
    if (c != NULL)
    {
        bli_free_user(c);
    }

    return 0;
}

4.6.8. Sample Application 3#

The following sample application demonstrates the use of the LPGEMM downscale API with multiple scale post-ops and int4-to-int8 B matrix reordering:

/*
$ gcc test_lpgemm.c -o ./test_lpgemm.x -I/aocl-blis_install_directory/include/amdzen/
-L/aocl-blis_install_directory/lib/amdzen/ -lblis-mt -lm

Note: Export the blis library path to LD_LIBRARY_PATH before running the
executable ./test_lpgemm.x
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "blis.h"

// aocl_gemm_u8s8s32os8 (A:uint8_t, B:int8_t, C:int8_t) used here.
// 3 post-ops - scale + matrix_add + scale used here.
int main()
{
    dim_t m = 1024;
    dim_t n = 1024;
    dim_t k = 1024;

    // Leading dimensions for row major matrices.
    dim_t lda = k;
    dim_t ldb = n;
    dim_t ldc = n;

    err_t err = BLIS_SUCCESS;

    // Initialize all pointers to NULL before the first possible bailout,
    // so that the cleanup path only frees successful allocations.
    uint8_t *a = NULL;
    int8_t *b = NULL;
    int8_t *c = NULL;
    int8_t *b_reorder = NULL;
    aocl_post_op *post_ops = NULL;

    a = (uint8_t *)bli_malloc_user(sizeof(uint8_t) * m * k, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    // int4_t B matrix represented using int8_t, but with half the int8_t size.
    b = (int8_t *)bli_malloc_user((sizeof(int8_t) * n * k) / 2, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    c = (int8_t *)bli_malloc_user(sizeof(int8_t) * m * n, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    // Functions to fill the matrices with data can be added here.
    int32_t alpha = 2;
    int32_t beta = 9;
    char storage = 'r'; // Row major. Use 'c' for column major.
    char transa = 'n'; // No transpose. Transpose not supported.
    char transb = 'n';
    char reordera = 'n';
    char reorderb = 'r';

    // Initialize post-ops struct.
    post_ops = (aocl_post_op *)bli_malloc_user(sizeof(aocl_post_op), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    // Zero the struct so the cleanup path sees NULL for unset members.
    memset(post_ops, 0, sizeof(aocl_post_op));

    // Downscale parameters need to be passed as a post-op, even
    // if a downscale-specific API is invoked.
    dim_t max_post_ops_seq_length = 3; // scale + matrix_add + scale

    post_ops->seq_vector =
        (AOCL_POST_OP_TYPE *) bli_malloc_user(
                        max_post_ops_seq_length * sizeof(AOCL_POST_OP_TYPE),
                        &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    // 2 scaling post-ops, the first for normal scaling and the second for
    // downscaling; the scale structs need to be allocated dynamically.
    post_ops->sum =
            bli_malloc_user(2 * sizeof(aocl_post_op_sum), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    memset(post_ops->sum, 0, 2 * sizeof(aocl_post_op_sum));

    // For the first scale, use a scalar zero point and scale factor.
    post_ops->seq_vector[0] = SCALE;
    (post_ops->sum + 0)->is_power_of_2 = FALSE;
    (post_ops->sum + 0)->buff = NULL;
    (post_ops->sum + 0)->zero_point =
         bli_malloc_user(1 * sizeof(float), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    *((int8_t*)((post_ops->sum + 0)->zero_point)) = 3;
    (post_ops->sum + 0)->zero_point_len = 1;
    (post_ops->sum + 0)->scale_factor =
         bli_malloc_user(1 * sizeof(float), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    *((float*)((post_ops->sum + 0)->scale_factor)) = 3.9;
    (post_ops->sum + 0)->scale_factor_len = 1;

    // Matrix add post-op.
    post_ops->matrix_add =
            bli_malloc_user(1 * sizeof(aocl_post_op_matrix_add), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    post_ops->seq_vector[1] = MATRIX_ADD;
    (post_ops->matrix_add + 0)->matrix =
         bli_malloc_user(sizeof(int8_t) * m * n, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    (post_ops->matrix_add + 0)->ldm = n;
    // Add function to fill matrix_add matrix here.

    // For the second scale, use a vector zero point and scale factor.
    // This scale post-op is purely for downscaling/quantization.
    post_ops->seq_vector[2] = SCALE;
    (post_ops->sum + 1)->is_power_of_2 = FALSE;
    (post_ops->sum + 1)->buff = NULL;
    (post_ops->sum + 1)->zero_point =
         bli_malloc_user(n * sizeof(float), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    (post_ops->sum + 1)->zero_point_len = n;
    (post_ops->sum + 1)->scale_factor =
         bli_malloc_user(n * sizeof(float), &err);
    if (err != BLIS_SUCCESS) { goto bailout; }
    (post_ops->sum + 1)->scale_factor_len = n;
    // Add function to fill zero point and scale factor here.

    post_ops->seq_length = 3;

    // Reorder B matrix. This pre-packs the B matrix so that packing
    // costs are not incurred when executing GEMM. Here the int4 B matrix
    // is reordered along with converting each element to the int8 type.
    siz_t b_reorder_buffer_size =
      aocl_get_reorder_buf_size_u8s4s32os32(storage, transb, 'B', k, n);

    b_reorder = (int8_t*)bli_malloc_user(b_reorder_buffer_size, &err);
    if (err != BLIS_SUCCESS) { goto bailout; }

    aocl_reorder_u8s4s32os32(storage, transb, 'B',
                             b, b_reorder,
                             k, n, ldb);

    aocl_gemm_u8s8s32os8(storage, transa, transb,
                         m, n, k,
                         alpha,
                         a, lda, reordera,
                         b_reorder, ldb, reorderb,
                         beta,
                         c, ldc,
                         post_ops);

bailout:
    if (post_ops != NULL)
    {
        if (post_ops->sum != NULL)
        {
            if ((post_ops->sum + 0)->zero_point != NULL)
            {
                bli_free_user((post_ops->sum + 0)->zero_point);
            }
            if ((post_ops->sum + 0)->scale_factor != NULL)
            {
                bli_free_user((post_ops->sum + 0)->scale_factor);
            }
            if ((post_ops->sum + 1)->zero_point != NULL)
            {
                bli_free_user((post_ops->sum + 1)->zero_point);
            }
            if ((post_ops->sum + 1)->scale_factor != NULL)
            {
                bli_free_user((post_ops->sum + 1)->scale_factor);
            }
            bli_free_user(post_ops->sum);
        }
        if (post_ops->matrix_add != NULL)
        {
            if ((post_ops->matrix_add + 0)->matrix != NULL)
            {
                bli_free_user((post_ops->matrix_add + 0)->matrix);
            }
            bli_free_user(post_ops->matrix_add);
        }
        if (post_ops->seq_vector != NULL)
        {
            bli_free_user(post_ops->seq_vector);
        }
        bli_free_user(post_ops);
    }
    if (b_reorder != NULL)
    {
        bli_free_user(b_reorder);
    }
    if (a != NULL)
    {
        bli_free_user(a);
    }
    if (b != NULL)
    {
        bli_free_user(b);
    }
    if (c != NULL)
    {
        bli_free_user(c);
    }

    return 0;
}