4. AOCL-BLAS#
AOCL-BLAS is a high-performance implementation of the Basic Linear Algebra Subprograms (BLAS). The BLAS were designed to provide the essential kernels of matrix and vector computation and are among the most commonly used computationally intensive operations in dense numerical linear algebra. Select kernels have been optimized for AMD "Zen"-based processors, for example, the AMD EPYC™, AMD Ryzen™, and AMD Ryzen™ Threadripper™ processors.
AOCL-BLAS is developed as a fork of BLIS (flame/blis), which is developed by members of the Science of High-Performance Computing (SHPC) group in the Institute for Computational Engineering and Sciences at The University of Texas at Austin, together with other collaborators (including AMD). All known features and functionalities of BLIS are retained and supported in the AOCL-BLAS library, along with the standard BLAS and CBLAS interfaces. C++ template interfaces for the BLAS functionality are also included.
4.1. Installation#
AOCL-BLAS can be installed from source or from pre-built binaries, as described in the following sections.
4.1.1. Using Pre-Built Binaries#
AOCL-BLAS library binaries for Linux are available at the following URL: https://www.amd.com/en/developer/aocl/dense.html
Also, the AOCL-BLAS binary can be installed from the AOCL master installer tar file (https://www.amd.com/en/developer/aocl.html).
The master installer includes the following:
Single threaded and multi-threaded AOCL-BLAS binaries.
Binaries built with amdzen config with LP64 and ILP64 integer support.
Multi-threaded AOCL-BLAS binary (libblis-mt) built with OpenMP threading mode.
The tar file includes pre-built binaries of other AMD libraries as explained in Using Master Package.
4.1.2. Build from Source - Introduction#
When building from source, two different build systems are supported:
CMake - supported on Linux and Windows
configure/make - only supported on Linux
First we consider the different options available when creating and using AOCL-BLAS, and then look at some examples and platform-specific information for Linux and Windows.
The AOCL-BLAS source code is available at GitHub URL: amd/blis
Clone the Git repository amd/blis.git
Prerequisites
The following dependencies must be met for installing AOCL-BLAS:
Target CPU ISA supporting AVX2 and FMA (and preferably AVX512)
Python versions 3.4 or later
GNU Make 4.2 or later
CMake 3.22.0 or later
Compilers: either of
GCC versions 12.2 through 13.1
AOCC versions 4.2 or 5.0
4.1.3. Hardware Configuration#
AOCL-BLAS supports a wide range of different architectures, with optimizations focused on AMD “Zen” and compatible processors. AOCL-BLAS can be compiled for specific hardware by specifying the appropriate configuration option:
auto - This configuration generates a binary optimized for the build machine's AMD "Zen" core architecture, and is useful when you build the library on the target system. Starting from the AOCL-BLAS 2.1 release, the auto option selects the appropriate build configuration based on the build machine's CPU architecture. For example, on a 1st Gen AMD EPYC™ processor (code name "Naples"), the zen configuration is auto-selected; on a 2nd Gen AMD EPYC™ processor (code name "Rome"), the zen2 configuration is auto-selected. From AOCL-BLAS 3.0, zen3 is auto-selected for the 3rd Gen AMD EPYC™ processor (code name "Milan"). From AOCL-BLAS 4.0, zen4 is auto-selected for the 4th Gen AMD EPYC™ processors (code names "Genoa", "Bergamo", and "Siena"). From AOCL-BLAS 5.0, zen5 is auto-selected for the 5th Gen AMD EPYC™ processors (code name "Turin").
zen - This configuration generates a binary compatible with AMD “Zen” architecture and is optimized for it. The architecture of the build machine is not relevant.
zen2 - This configuration generates a binary compatible with the AMD "Zen2" architecture and is optimized for it. The architecture of the build machine is not relevant.
zen3 - This configuration generates a binary compatible with the AMD "Zen3" architecture and is optimized for it. The architecture of the build machine is not relevant.
zen4 - This configuration generates a binary compatible with the AMD "Zen4" architecture and is optimized for it. The architecture of the build machine is not relevant.
zen5 - This configuration generates a binary compatible with the AMD "Zen5" architecture and is optimized for it. The architecture of the build machine is not relevant.
amdzen - The library built using this configuration generates a binary compatible with and optimized for AMD “Zen”, AMD “Zen2”, AMD “Zen3”, AMD “Zen4” and AMD “Zen5” architectures. A slower generic code path, compatible with older x86-64 processors is also included. The architecture of the build machine is not relevant. The architecture of the target machine is checked during the runtime, based on which the relevant optimizations are picked up automatically. This feature is also called Dynamic Dispatch. For more information, refer to the Dynamic Dispatch section below.
| Example desired config | CMake option | configure option | Usage |
|---|---|---|---|
| amdzen | -DBLIS_CONFIG_FAMILY=amdzen | ./configure … amdzen | Linux: no default is set. Windows: auto is the default choice. When using configure, the desired config should be the last argument. |
4.1.4. API Compatibility Layers (Calling AOCL-BLAS)#
AOCL-BLAS supports various API compatibility layers:
The BLAS/CBLAS standard enables application portability between various libraries; see Netlib BLAS for background information. These interfaces can be called from programs written in Fortran, C, C++, and compatible languages. Simple BLAS and CBLAS examples in Fortran and C are available in the section AOCL-BLAS Usage in Fortran and C.
AOCL-BLAS also includes BLIS-specific APIs that provide more flexibility and control to help achieve the best performance in some situations. Details of these C interfaces are available at:
BLIS Typed: Documentation, Examples
BLIS Object: Documentation, Examples
The following table lists all the supported layers and the CMake and configure options to control them:

| API Layer | Header file | CMake options | configure options | Usage |
|---|---|---|---|---|
| BLAS (Fortran) | N/A | -DENABLE_BLAS=ON / -DENABLE_BLAS=OFF | --enable-blas / --disable-blas | Use this option when calling AOCL-BLAS from Fortran applications. API name format: DGEMM |
| BLAS (C) | blis.h | -DENABLE_BLAS=ON / -DENABLE_BLAS=OFF | --enable-blas / --disable-blas | Use this option when calling AOCL-BLAS from a C application using BLAS-style parameters. API name format: dgemm_ |
| CBLAS | cblas.h | -DENABLE_CBLAS=ON / -DENABLE_CBLAS=OFF | --enable-cblas / --disable-cblas | Use this option when calling AOCL-BLAS from a C application using CBLAS-style parameters. If enabled, the BLAS API is also enabled. API name format: cblas_dgemm |
| BLIS - C (non-standard) | blis.h | Enabled by default | Enabled by default | This AOCL-BLAS-specific (non-standard) interface provides the most flexibility in calling AOCL-BLAS for best performance. However, applications using it will not be portable to other BLAS/CBLAS-compatible libraries. API name formats: bli_gemm, bli_gemm_ex |
| BLIS - C++ (non-standard) | blis.hh | Enabled by default | Enabled by default | This AOCL-BLAS-specific (non-standard) C++ interface follows the same parameter order as CBLAS. However, applications using it will not be portable to other BLAS/CBLAS-compatible libraries. API name format: blis::gemm |
4.1.5. API Compatibility - Advanced Options#
API compatibility can be further extended to meet additional requirements: larger input sizes (ILP64) and the different ways in which complex numbers are returned by functions in the BLAS interface, which depends on the choice of compiler. The following table explains these options:

| Build characteristic | CMake options | configure options | Usage |
|---|---|---|---|
| Choice of compiler and complex function return type | -DCOMPLEX_RETURN=gnu / -DCOMPLEX_RETURN=intel CC=clang CXX=clang++ | --complex-return=gnu / --complex-return=intel CC=clang CXX=clang++ | GNU and AOCC (based on LLVM) are currently supported. Refer to Returning Complex Numbers for more information. |
| Integer size (LP64 vs. ILP64) | -DBLAS_INT_SIZE=32 / -DBLAS_INT_SIZE=64 | --blas-int-size=32 / --blas-int-size=64 | Specifies the integer type used in the external BLAS and CBLAS interfaces. |
4.1.6. Other Key Options#
The following table lists other key AOCL-BLAS build options and the CMake and configure options to control them:

| Build characteristic | CMake options | configure options | Usage |
|---|---|---|---|
| Serial or multithreaded (OpenMP) | -DENABLE_THREADING=openmp / -DENABLE_THREADING=no | --enable-threading=openmp / --enable-threading=no | Enabling multithreading with OpenMP adds a link-time dependency on the compiler's OpenMP runtime library. |
| Enable/disable dynamic thread scaling | -DENABLE_AOCL_DYNAMIC=ON / -DENABLE_AOCL_DYNAMIC=OFF | --enable-aocl-dynamic / --disable-aocl-dynamic | In multithreaded builds with AOCL dynamic enabled, AOCL-BLAS may reduce the number of threads used within each API call below the number requested when the problem parameters are small. |
| Installation directory | -DCMAKE_INSTALL_PREFIX=</desired/location/> (default location: /usr/local/) | --prefix=</desired/location/> (default location: /usr/local/) | Specifies the target directory for installing AOCL-BLAS. The installation includes lib, include, and share subdirectories. On Linux these are placed inside an lp64 or ilp64 directory as appropriate. |
| Enable LPGEMM add-on | -DENABLE_ADDON="aocl_gemm" (default: disabled) | -a aocl_gemm (default: disabled) | LPGEMM provides a range of BF16 and INT8 GEMM operations with many supported pre-/post-ops, targeted at AI applications. See LPGEMM in AOCL-BLAS for details. |
| Rename dynamic dispatch environment variables | -DRENAME_BLIS_ARCH_TYPE=<user-defined-name> (default name BLIS_ARCH_TYPE) and -DRENAME_BLIS_MODEL_TYPE=<user-defined-name> (default name BLIS_MODEL_TYPE) | --rename-blis-arch-type=<user-defined-name> and --rename-blis-model-type=<user-defined-name> | If dynamic dispatch is enabled in the configuration (e.g., amdzen), the default runtime choice of code path based on the hardware can be overridden by environment variables. These options allow the environment variables BLIS_ARCH_TYPE and BLIS_MODEL_TYPE to be renamed. See Dynamic Dispatch for more details. |
| Disable dynamic dispatch environment variables | -DDISABLE_BLIS_ARCH_TYPE=ON / -DDISABLE_BLIS_ARCH_TYPE=OFF | --enable-blis-arch-type / --disable-blis-arch-type | If dynamic dispatch is enabled in the configuration (e.g., amdzen), use of the environment variables BLIS_ARCH_TYPE, BLIS_MODEL_TYPE, and AOCL_ENABLE_INSTRUCTIONS can instead be disabled entirely. See Dynamic Dispatch for more details. |
4.1.7. Build AOCL-BLAS from Source on Linux#
Below are some examples for configuration and build commands on Linux, for both CMake and configure build systems.
4.1.7.1. Single-Thread AOCL-BLAS#
Complete the following steps to install a single-thread AOCL-BLAS:
Clone the AOCL-BLAS Git repository (amd/blis.git).
Configure the library as required:
# CMake commands
# GCC (default)
$ cmake . -DENABLE_CBLAS=ON -DCMAKE_INSTALL_PREFIX=<your-install-dir> \
      -DBLIS_CONFIG_FAMILY=auto
# AOCC
$ cmake . -DENABLE_CBLAS=ON -DCMAKE_INSTALL_PREFIX=<your-install-dir> \
      -DCOMPLEX_RETURN=intel CC=clang CXX=clang++ -DBLIS_CONFIG_FAMILY=auto

# Alternatively, using configure
# GCC (default)
$ ./configure --enable-cblas --prefix=<your-install-dir> auto
# AOCC
$ ./configure --enable-cblas --prefix=<your-install-dir> \
      --complex-return=intel CC=clang CXX=clang++ auto
To build the library, use the command:
$ make
To install the library on the build machine, use the command:
$ make install
4.1.7.2. Multi-Thread AOCL-BLAS#
Complete the following steps to install a multi-thread AOCL-BLAS:
Clone the AOCL-BLAS Git repository (amd/blis.git).
Configure the library as required:
# CMake commands
# GCC (default)
$ cmake . -DENABLE_CBLAS=ON -DCMAKE_INSTALL_PREFIX=<your-install-dir> \
      -DENABLE_THREADING=[Mode] -DBLIS_CONFIG_FAMILY=auto
# AOCC
$ cmake . -DENABLE_CBLAS=ON -DCMAKE_INSTALL_PREFIX=<your-install-dir> \
      -DCOMPLEX_RETURN=intel CC=clang CXX=clang++ \
      -DENABLE_THREADING=[Mode] -DBLIS_CONFIG_FAMILY=auto

# Alternatively, using configure
# GCC (default)
$ ./configure --enable-cblas --prefix=<your-install-dir> \
      --enable-threading=[Mode] auto
# AOCC
$ ./configure --enable-cblas --prefix=<your-install-dir> \
      --enable-threading=[Mode] \
      --complex-return=intel CC=clang CXX=clang++ auto
Here, [Mode] is one of {openmp, no}; the no option disables multi-threading.
To build the library, use the command:
$ make
To install the library on the build machine, use the command:
$ make install
4.1.8. Build AOCL-BLAS from Source on Windows#
GitHub URL: amd/blis
AOCL-BLAS uses CMake along with Microsoft Visual Studio for building binaries from the sources on Windows. The following sections explain the GUI and command-line schemes of building the binaries and test suite.
Prerequisites
Windows 10/11 or Windows Server 2019/2022
LLVM 15/16 for AMD “Zen3” and AMD “Zen4” support (or LLVM 11 for AMD “Zen2” support)
LLVM plug-in for Microsoft Visual Studio (if latest version of LLVM is installed separately, this plugin enables linking Visual Studio with the installed LLVM toolchain)
For more information on CMake versions validated, refer to Build Utilities
Microsoft Visual Studio 2019 (build 16.8.7) and 2022 (build 17.3.2 through 17.7.5)
Microsoft Visual Studio tools (as shown in Microsoft Visual Studio Prerequisites):
Python development
Desktop development with C++: C++ Clang-Cl for v142 build tool (x64/x86)
Figure 4.1 Microsoft Visual Studio Prerequisites#
4.1.8.1. Building AOCL-BLAS Using GUI#
4.1.8.2. Preparing Project with CMake GUI#
Complete the following steps in the CMake GUI:
Set the source (folder containing AOCL-BLAS source code) and build (folder in which the project files will be generated, for example, out) folder paths as shown in the following figure:
Figure 4.2 CMake Source and Build Folders#
Using a folder named build is not recommended, since build is used by the Linux build system.
Click on the Configure button to prepare the project options.
Set the generator to Visual Studio 17 2022 and the compiler to ClangCl as shown in the following figure:
Figure 4.3 Set Generator and Compiler#
Update the options based on the project requirements. All the available options are listed in the following table:
Table 4.5 CMake Config Options#

| Feature | CMake Parameter |
|---|---|
| AMD CPU architecture | BLIS_CONFIG_FAMILY=zen / zen2 / zen3 / zen4 / zen5 / amdzen |
| Shared library | BUILD_SHARED_LIBS=ON |
| Static library | BUILD_SHARED_LIBS=OFF |
| Debug/Release build type | CMAKE_BUILD_TYPE=Debug / Release |
| Enable single threading (disables AOCL dynamic dispatch) | ENABLE_THREADING=no (default) |
| Enable multi-threading (enables AOCL dynamic dispatch with OpenMP) | ENABLE_THREADING=openmp |
| AOCL Dynamic (automatically selected depending on the value of ENABLE_THREADING) | ENABLE_AOCL_DYNAMIC=ON/OFF |
| Enable BLAS/CBLAS support | ENABLE_BLAS=ON ENABLE_CBLAS=ON |
| Enable 32-bit integer size in BLIS and BLAS APIs | INT_SIZE=32 and BLAS_INT_SIZE=32 |
| Enable 64-bit integer size in BLIS and BLAS APIs | INT_SIZE=64 and BLAS_INT_SIZE=64 |
| Absolute path to the OpenMP library, including the library name | OpenMP_libomp_LIBRARY |
Table 4.6 CMake Config Options (all variables and their default values)#

BUILD_SHARED_LIBS=ON(default)/OFF
ENABLE_THREADING=no(default)/openmp
INT_SIZE=auto(default)/32/64
BLAS_INT_SIZE=32(default)/64
ENABLE_BLAS=ON/OFF(default)
ENABLE_CBLAS=ON/OFF(default)
ENABLE_MIXED_DT=ON(default)/OFF
ENABLE_MIXED_DT_EXTRA_MEM=ON(default)/OFF
ENABLE_SUP_HANDLING=ON(default)/OFF
ENABLE_AOCL_DYNAMIC=ON(default)/OFF
COMPLEX_RETURN=gnu(default)/intel
ENABLE_NO_UNDERSCORE_API=ON/OFF(default)
ENABLE_UPPERCASE_API=ON/OFF(default)
ENABLE_SYSTEM=ON(default)/OFF
THREAD_PART_JRIR=slab(default)/rr
ENABLE_PBA_POOLS=ON(default)/OFF
ENABLE_SBA_POOLS=ON(default)/OFF
ENABLE_MEM_TRACING=ON/OFF(default)
ENABLE_TRSM_PREINVERSION=ON(default)/OFF
FORCE_VERSION=no(default)/<user-defined>
DISABLE_BLIS_ARCH_TYPE=ON/OFF(default)
RENAME_BLIS_ARCH_TYPE=BLIS_ARCH_TYPE(default)/<user-defined>
RENAME_BLIS_MODEL_TYPE=BLIS_MODEL_TYPE(default)/<user-defined>
For the detailed documentation of all the options, configure CMake with PRINT_CONFIGURE_HELP=ON.
To generate the Microsoft Visual Studio project in the out folder, click on the Generate button as shown in the following figure:
Figure 4.4 CMake Configure and Generate Project Settings#
4.1.8.3. Building the Project in Visual Studio GUI#
Complete the following steps in the Microsoft Visual Studio GUI:
Open the project generated by CMake (build folder) in Preparing Project with CMake GUI.
To generate AOCL-BLAS binaries, build the AOCL-LibBlis project or libs/libblis target. The library files will be generated in the out folder based on the project settings.
For example, blis/out/Release/AOCL-LibBlis-Win-MT.dll or AOCL-LibBlis-Win-MT.lib
To install the binaries (or to build and install them), build the INSTALL project under CMakePredefinedTargets.
4.1.8.4. Building AOCL-BLAS Using Command-Line Arguments#
The project configuration and build procedures can be triggered from the command prompt as well. The corresponding steps are described in the following sections.
Configuring the Project in Command Prompt
In the AOCL-BLAS project folder, create a folder out. Open the command prompt in this directory and run the following command to configure the project:
$ cmake -S .. -B . -G "Visual Studio 17 2022" -DCMAKE_BUILD_TYPE=Release ^
      -DBLIS_CONFIG_FAMILY=amdzen -DBUILD_SHARED_LIBS=ON ^
      -DENABLE_THREADING=openmp -DCOMPLEX_RETURN=intel ^
      -DOpenMP_libomp_LIBRARY="C:\Program Files\LLVM\lib\libomp.lib" -T ClangCL
You can refer to CMake Config Options and update the parameter options in the command according to the project requirements, or run the following command for a detailed description of the available options:
$ cmake -S .. -B . -G "Visual Studio 17 2022" -DPRINT_CONFIGURE_HELP=ON
Building the Project in Command Prompt
Open command prompt in the blis\out directory. Invoke CMake with the build command with release or debug option. For example:
$ cmake --build . --config Release
For building the library using multiple threads, run the following command:
$ cmake --build . --config Release -j
The library files would be generated in the Release or Debug folder based on the project settings.
4.1.9. Verifying AOCL-BLAS Installation#
The AOCL-BLAS source directory contains the test cases which demonstrate the usage of AOCL-BLAS APIs. To execute the tests, navigate to the AOCL-BLAS source directory and run one of the following commands, as appropriate for your operating system and choice of AOCL-BLAS build method (configure+make is Linux-only):
# Build tests using configure+make (Linux only)
$ make checkblas checkblis

# Build tests using CMake (Linux, or Windows shared library)
$ cmake --build . --config Release --target checkblas checkblis

# Build tests using CMake (Windows static library)
$ cmake --build . --config Release --target checkblis
4.2. Application Development Using AOCL-BLAS#
This section explains the different types of APIs provided by AOCL-BLAS. It describes how to call them and link with the library.
4.2.1. Linking Application with AOCL-BLAS#
The AOCL-BLAS library can be linked statically or dynamically with the user application. Separate binaries are provided for the single-threaded and multi-threaded implementations.
The basic build command is as follows:
$ gcc test_blis.c -I<path-to-AOCL-BLAS-header> <link-options> \
-o test_blis.x
The following table explains the link options depending on a particular build configuration:

| Application Type | Linking Type | Link Options |
|---|---|---|
| Single-threaded | Static | /path/to/libblis.a -lm -lpthread |
| Single-threaded | Dynamic | -L/path/to/lib -lblis -lm -lpthread |
| Multi-threaded | Static | /path/to/libblis-mt.a -fopenmp -lm -lpthread |
| Multi-threaded | Dynamic | -L/path/to/lib -lblis-mt -fopenmp -lm -lpthread |
Example - Dynamic Linking and Execution
AOCL-BLAS can be built as a shared library; by default, both static and shared libraries are built. Complete the following steps to build a shared-library version of AOCL-BLAS and link it with the user application:
During configuration, enable shared library support using the following command:
$ ./configure --disable-static --enable-shared zen
Link the application with the generated shared library using the following command:
$ gcc CBLAS_DGEMM_usage.c -I/path/to/include/aocl-blas/ \
      -L/path/to/libblis.so -lblis -lm -lpthread -o CBLAS_DGEMM_usage.x
Ensure that the shared library is available in the library load path, then run the application (this demo uses CBLAS_DGEMM_usage.c):

$ export LD_LIBRARY_PATH="/path/to/libblis.so"
$ ./CBLAS_DGEMM_usage.x
a =
1.000000 2.000000
3.000000 4.000000
b =
5.000000 6.000000
7.000000 8.000000
c =
19.000000 22.000000
43.000000 50.000000
The same header can be used for both static and shared libraries on Windows. To access a DLL's public data symbols and objects, you can define BLIS_EXPORT=__declspec(dllimport) to import those symbols explicitly. Importing is not required for:
Users of the BLAS and CBLAS interfaces
Most cases where the BLIS interface is used
4.2.2. AOCL-BLAS Usage in Fortran and C#
Simple BLAS and CBLAS examples in Fortran and C are in the following subsections.
4.2.2.1. Using BLAS API in Fortran#
For example, the following Fortran code performs a double-precision general matrix-matrix multiplication by calling the DGEMM BLAS API. A sample command to compile and link it with the AOCL-BLAS library is shown after the code:
! File: BLAS_DGEMM_usage.f
! Example code to demonstrate BLAS DGEMM usage
program dgemm_usage
implicit none
EXTERNAL DGEMM
DOUBLE PRECISION, ALLOCATABLE :: a(:,:)
DOUBLE PRECISION, ALLOCATABLE :: b(:,:)
DOUBLE PRECISION, ALLOCATABLE :: c(:,:)
INTEGER I, J, M, N, K, lda, ldb, ldc
DOUBLE PRECISION alpha, beta
M=2
N=M
K=M
lda=M
ldb=K
ldc=M
alpha=1.0
beta=0.0
ALLOCATE(a(lda,K), b(ldb,N), c(ldc,N))
a=RESHAPE((/ 1.0, 3.0, &
2.0, 4.0 /), (/lda,K/))
b=RESHAPE((/ 5.0, 7.0, &
6.0, 8.0 /), (/ldb,N/))
WRITE(*,*) ("a =")
DO I = LBOUND(a,1), UBOUND(a,1)
WRITE(*,*) (a(I,J), J=LBOUND(a,2), UBOUND(a,2))
END DO
WRITE(*,*) ("b =")
DO I = LBOUND(b,1), UBOUND(b,1)
WRITE(*,*) (b(I,J), J=LBOUND(b,2), UBOUND(b,2))
END DO
CALL DGEMM('N','N',M,N,K,alpha,a,lda,b,ldb,beta,c,ldc)
WRITE(*,*) ("c =")
DO I = LBOUND(c,1), UBOUND(c,1)
WRITE(*,*) (c(I,J), J=LBOUND(c,2), UBOUND(c,2))
END DO
end program dgemm_usage
A sample compilation command with the gfortran compiler for the code above:
$ gfortran -ffree-form BLAS_DGEMM_usage.f /path/to/libblis.a
4.2.2.2. Using BLAS API in C#
Following is the C version of the Fortran code in section Using BLAS API in Fortran. It uses the standard BLAS API. Note the following:
The matrices are transposed to account for the row-major storage of C and the column-major convention of BLAS (inherited from Fortran).
The function arguments are passed by address, again in line with Fortran conventions.
There is a trailing underscore in the function name (dgemm_), as BLAS APIs require Fortran compilers to add a trailing underscore.
blis.h is included as a header.
A sample command to compile it and link with the AOCL-BLAS library is shown after the code:

// File: BLAS_DGEMM_usage.c
// Example code to demonstrate BLAS DGEMM usage
#include <stdio.h>
#include "blis.h"

#define DIM 2

int main()
{
    double a[DIM * DIM] = { 1.0, 2.0, 3.0, 4.0 };
    double b[DIM * DIM] = { 5.0, 7.0, 6.0, 8.0 };
    double c[DIM * DIM];
    int I, J, M, N, K, lda, ldb, ldc;
    double alpha, beta;

    M = DIM;
    N = M;
    K = M;
    lda = M;
    ldb = K;
    ldc = M;
    alpha = 1.0;
    beta = 0.0;

    printf("a = \n");
    for ( I = 0; I < M; I++ )
    {
        for ( J = 0; J < K; J++ )
            printf("%f\t", a[J * K + I]);
        printf("\n");
    }

    printf("b = \n");
    for ( I = 0; I < K; I++ )
    {
        for ( J = 0; J < N; J++ )
            printf("%f\t", b[J * N + I]);
        printf("\n");
    }

    dgemm_("N", "N", &M, &N, &K, &alpha, a, &lda, b, &ldb, &beta, c, &ldc);

    printf("c = \n");
    for ( I = 0; I < M; I++ )
    {
        for ( J = 0; J < N; J++ )
            printf("%f\t", c[J * N + I]);
        printf("\n");
    }

    return 0;
}
A sample compilation command with the gcc compiler for the code above:
$ gcc BLAS_DGEMM_usage.c -I/path/to/include/aocl-blas/ \
/path/to/libblis.a -lpthread -lm
4.2.2.3. Using CBLAS API in C#
This section contains an example C application using the CBLAS API for DGEMM. Note the following:
The CBLAS layout option is used to choose the row-major layout, which is consistent with C.
The function arguments are passed by value.
cblas.h is included as a header.
A sample command to compile it and link with the AOCL-BLAS library is shown after the code:

// File: CBLAS_DGEMM_usage.c
// Example code to demonstrate CBLAS DGEMM usage
#include <stdio.h>
#include "cblas.h"

#define DIM 2

int main()
{
    double a[DIM * DIM] = { 1.0, 2.0, 3.0, 4.0 };
    double b[DIM * DIM] = { 5.0, 6.0, 7.0, 8.0 };
    double c[DIM * DIM];
    int I, J, M, N, K, lda, ldb, ldc;
    double alpha, beta;

    M = DIM;
    N = M;
    K = M;
    lda = M;
    ldb = K;
    ldc = M;
    alpha = 1.0;
    beta = 0.0;

    printf("a = \n");
    for ( I = 0; I < M; I++ )
    {
        for ( J = 0; J < K; J++ )
            printf("%f\t", a[I * K + J]);
        printf("\n");
    }

    printf("b = \n");
    for ( I = 0; I < K; I++ )
    {
        for ( J = 0; J < N; J++ )
            printf("%f\t", b[I * N + J]);
        printf("\n");
    }

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, alpha, a, lda, b, ldb, beta, c, ldc);

    printf("c = \n");
    for ( I = 0; I < M; I++ )
    {
        for ( J = 0; J < N; J++ )
            printf("%f\t", c[I * N + J]);
        printf("\n");
    }

    return 0;
}
Note
To get the CBLAS API with AOCL-BLAS, you must pass the --enable-cblas flag to the configure command (or -DENABLE_CBLAS=ON to CMake) while building the AOCL-BLAS library.
A sample compilation command with the gcc compiler for the code above is as follows:
$ gcc CBLAS_DGEMM_usage.c -I/path/to/include/aocl-blas/ \
/path/to/libblis.a -lpthread -lm
4.2.2.4. Returning Complex Numbers#
The GNU Fortran compiler (gfortran), AOCC (Flang), and the Intel Fortran compiler (ifort) have different conventions for returning complex numbers from C functions:
The AOCC (Flang) and Intel (ifort) compilers return complex numbers using a hidden first argument: the caller must pass a pointer to the return value as the first parameter.
The GNU (gfortran) compiler returns complex numbers in registers; the complex number is the return value of the function itself.
gfortran Example:
Configure Option:
--complex-return=gnu
API Call:
ret_value = cdotc_(&n, x, &incx, y, &incy);
AOCC Example:
Configure Option:
--complex-return=intel CC=clang CXX=clang++
API Call:
cdotc_(&ret_value, &n, x, &incx, y, &incy);
This feature is currently enabled only for the cdotc, cdotu, zdotc, and zdotu APIs.
4.3. Migrating/Porting#
Applications written for MKL, OpenBLAS, or any other library using the standard BLAS or CBLAS interfaces can be ported to AOCL-BLAS with minimal or no changes.
Complete the following steps to port from BLAS or CBLAS to AOCL-BLAS:
Update the source code to include the correct header files.
Update the build script or makefile to use the correct compile and link options.
The following table lists the compiler and linker options to use while porting to AOCL-BLAS:
| | MKL | OpenBLAS | AOCL-BLAS (single-threaded) | AOCL-BLAS (multi-threaded) |
|---|---|---|---|---|
| Header File | mkl.h | cblas.h | blis.h or cblas.h | blis.h or cblas.h |
| Link Options | -lmkl_rt | -lopenblas | -lblis -lm -lpthread | -lblis-mt -fopenmp -lm -lpthread |
4.4. Using AOCL-BLAS Library Features#
4.4.1. Dynamic Dispatch#
Starting from AOCL 3.1, AOCL-BLAS supports the Dynamic Dispatch feature, which enables the same binary to be used with different code paths optimized for different architectures.
4.4.1.1. Purpose#
Before Dynamic Dispatch, users had to build a different binary for each CPU architecture, that is, for the AMD "Zen", AMD "Zen2", and AMD "Zen3" architectures. Furthermore, when building an application, users had to ensure that they linked against the correct AMD "Zen"-based library for the platform. This becomes challenging when using AOCL-BLAS on a cluster with nodes of different architectures.
Dynamic Dispatch addresses this issue by building a single binary compatible with all the AMD “Zen” architectures. At the runtime, the Dynamic Dispatch feature enables optimizations specific to the detected AMD “Zen” architecture.
4.4.1.2. On Non-AMD “Zen” Architectures#
The Dynamic Dispatch feature supports AMD “Zen”, AMD “Zen2”, AMD “Zen3”, AMD “Zen4”, and AMD “Zen5” architectures in a single binary. However, it also includes a generic architecture to support older x86-64 processors. The generic architecture uses a pure C implementation of the APIs and does not use any architecture-specific features.
The specific compiler flags used for building the library with generic configuration are:
-O2 -funsafe-math-optimizations -ffp-contract=fast -Wall \
-Wno-unused-function -Wfatal-errors
Note
As no architecture-specific optimizations or vectorized kernels are enabled, performance with the generic architecture may be significantly lower than with an architecture-specific implementation.
Previous AOCL-BLAS releases identified the processor based on Family,
Model, and other cpuid features, and selected the appropriate code
path based on the preprogrammed choices. With Dynamic Dispatch, an
unknown processor would fall through to the slow generic code path,
although users could override this by setting an environment variable
BLIS_ARCH_TYPE
to a suitable value.
From AOCL-BLAS 4.2, additional cpuid tests based on AVX2 and AVX512 instruction support are used to enable AMD “Zen3”, AMD “Zen4” or AMD “Zen5” code paths to be selected by default on suitable x86-64 processors (i.e. future AMD processors and current or future Intel processors). These AMD Zen code paths are not (re-)optimized specifically for these different architectures but should perform better than the slow generic code path.
To be more specific:
AVX2 support requires AVX2 and FMA3.
AVX512 support requires AVX512 F, DQ, CD, BW, and VL.
4.4.1.3. Using Dynamic Dispatch#
Building AOCL-BLAS
Dynamic Dispatch must be enabled while building the AOCL-BLAS library. This is done by building the library for amdzen configuration as explained in Build AOCL-BLAS from Source on Linux.
Code Path Information
Dynamic Dispatch can print debugging information on the selected code
path. This is enabled by setting the environment variable
BLIS_ARCH_DEBUG=1
.
Architecture Selection at Runtime
For most use cases, Dynamic Dispatch will detect the underlying
architecture and enable appropriate code paths and optimizations.
However, AOCL-BLAS can be forced to use a specific architecture by
setting either the environment variable AOCL_ENABLE_INSTRUCTIONS
or
BLIS_ARCH_TYPE
as follows:
AOCL_ENABLE_INSTRUCTIONS=value <AOCL-BLAS linked application>
or
BLIS_ARCH_TYPE=value <AOCL-BLAS linked application>
where value is one of {avx512, avx2, zen5, zen4, zen3, zen2, zen, generic}.
Note the following:
The code path names are not case sensitive, but the environment variable names are.
Options for older x86-64 vector ISAs (e.g., avx, sse2) are also supported but in general correspond to the generic code path.
In AOCL-BLAS builds with the amdzen configuration, avx512 is an alias for zen4 and avx2 is an alias for zen3.
AOCL_ENABLE_INSTRUCTIONS is intended to become the standard option for controlling dynamic dispatch (where supported) across all AOCL components, whereas BLIS_ARCH_TYPE is specific to the BLIS code used in AOCL-BLAS.
If both are specified, BLIS_ARCH_TYPE takes precedence and AOCL_ENABLE_INSTRUCTIONS is ignored by AOCL-BLAS. This provides the option of using AOCL_ENABLE_INSTRUCTIONS to control other AOCL libraries while specifying different behavior for AOCL-BLAS through BLIS_ARCH_TYPE.
The operation of AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE differs slightly. If AOCL_ENABLE_INSTRUCTIONS is in operation, AOCL-BLAS checks whether the instruction set required by the selected code path is enabled within the library and supported by the processor; if not, it uses the default choice for that architecture. In other words, AOCL_ENABLE_INSTRUCTIONS should be used to restrict a processor to an earlier instruction set, rather than to force a later one on an older processor. By contrast, if BLIS_ARCH_TYPE is in operation, the specified code path is used irrespective of its compatibility with the processor.
Specifying a particular code path completely overrides the automatic selection, so the following failure scenarios are possible:
A code path unavailable in the AOCL-BLAS build is selected. This results in an error message from the AOCL-BLAS library, which will then abort. This only applies when using BLIS_ARCH_TYPE (at AOCL 4.2 it also applied to AOCL_ENABLE_INSTRUCTIONS).
A code path executes instructions unavailable on the processor being used, for example, running the AMD “Zen4” code path (which may use AVX512 instructions) on an AMD “Zen3” or older system. If this happens, the program may stop with an “illegal instruction” error. This applies only when BLIS_ARCH_TYPE is used; whether an illegal instruction is actually executed may depend on the routine called and the problem size.
In some circumstances, AOCL-BLAS aborting on an error from
BLIS_ARCH_TYPE
being set incorrectly may not be acceptable. If you
are building AOCL-BLAS from source, there are two options to mitigate
this issue. One is to change the environment variable used from
BLIS_ARCH_TYPE
to another name, for example:
$ ./configure --enable-cblas --prefix=<your-install-dir> \
--rename-blis-arch-type=MY_BLIS_ARCH_TYPE amdzen
... make aocl-blas library
... compile program linking with aocl-blas
$ export BLIS_ARCH_TYPE=zen3
$ export MY_BLIS_ARCH_TYPE=zen2
$ ./program.exe
This will cause program.exe (which uses AOCL-BLAS) to ignore the setting of BLIS_ARCH_TYPE to zen3. Instead, it will take the value of MY_BLIS_ARCH_TYPE and use the zen2 code path. When --rename-blis-arch-type is used, AOCL_ENABLE_INSTRUCTIONS remains enabled in the build, but MY_BLIS_ARCH_TYPE (in this example) would take precedence if both are set.
Alternatively, the mechanism to allow manual selection of code path can be disabled:
$ ./configure --enable-cblas --prefix=<your-install-dir> \
--disable-blis-arch-type amdzen
In this case, Dynamic Dispatch will still occur among the included
code paths, but only by automatic selection based on the processor
architecture. Manual selection of code path by both AOCL_ENABLE_INSTRUCTIONS
and BLIS_ARCH_TYPE
is disabled.
Model Selection at Runtime
Recent AMD “Zen” generations have added more diverse choices of core designs and cache characteristics. For example, Milan and Milan-X variants at AMD “Zen3”; Genoa, Bergamo, and Genoa-X variants at AMD “Zen4”. Some AOCL-BLAS APIs may be tuned differently for these different models. The appropriate model will be selected automatically by Dynamic Dispatch.
However, AOCL can be forced to use a specific model by setting the
environment variable BLIS_MODEL_TYPE
as follows:
BLIS_MODEL_TYPE=value <AOCL-BLAS linked application>
where value = {Milan, Milan-X, Genoa, Bergamo, Genoa-X, Turin, Turin-Dense}
Note the following:
Different model values correspond to specific BLIS_ARCH_TYPE values (either set automatically or explicitly by the user). Thus, Milan and Milan-X correspond to AMD “Zen3”; Genoa, Bergamo, and Genoa-X correspond to AMD “Zen4”; and Turin and Turin-Dense correspond to AMD “Zen5”.
Incorrect values of BLIS_MODEL_TYPE do not cause an error; the default model type for the selected architecture will be used.
The number of APIs that have different optimizations by model type is currently very small. Setting this environment variable may provide consistent results across different models if consistency is a higher priority than best performance.
As with BLIS_ARCH_TYPE
, when building BLAS from source, the name of
the environment variable used to set the model type can be changed,
for example:
$ ./configure --enable-cblas --prefix=<your-install-dir> \
--rename-blis-model-type=MY_BLIS_MODEL_TYPE amdzen
Disabling the mechanism that allows manual selection of the BLAS architecture also disables manual selection of the model:
$ ./configure --enable-cblas --prefix=<your-install-dir> \
--disable-blis-arch-type amdzen
Setting either of these environment variables makes sense only when using a build of AOCL-BLAS that includes multiple code paths. Thus, AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE are disabled by default in all builds containing only a single code path.
4.4.2. AOCL-BLAS - Running the Test Suite#
The AOCL-BLAS source directory contains a test suite to verify the functionality of AOCL-BLAS and BLAS APIs. The test suite invokes the APIs with different inputs and verifies that the results are within the expected tolerance limits.
For more information, refer to amd/blis.
4.4.2.1. Multi-Thread Test Suite Performance#
Starting from AOCL-BLAS 3.1, if the number of threads is not specified, AOCL-BLAS uses a number of threads equal to the number of cores available on the system. A higher number of threads results in better performance for the medium-to-large matrices found in practical use cases.
However, a higher number of threads results in poor performance for the very small sizes used by the test and check features. Hence, you must specify the number of threads while running the tests/test suite.
The recommended number of threads to run the test suite is 1 or 2.
Running Test Suite
Execute the following command to invoke the test suite:
$ OMP_NUM_THREADS=2 make test
The sample output from the execution of the command is as follows:
$ OMP_NUM_THREADS=2 make test
Compiling obj/zen3/testsuite/test_addm.o
Compiling obj/zen3/testsuite/test_addv.o
<<< More compilation output >>>
Compiling obj/zen3/testsuite/test_xpbym.o
Compiling obj/zen3/testsuite/test_xpbyv.o
Linking test_libblis-mt.x against 'lib/zen3/libblis-mt.a -lm
-lpthread -fopenmp -lrt' Running test_libblis-mt.x with output
redirected to 'output.testsuite'
check-blistest.sh: All BLIS tests passed! Compiling
obj/zen3/blastest/cblat1.o Compiling obj/zen3/blastest/abs.o
<<< More compilation output >>>
Compiling obj/zen3/blastest/wsfe.o
Compiling obj/zen3/blastest/wsle.o
Archiving obj/zen3/blastest/libf2c.a
Linking cblat1.x against 'libf2c.a lib/zen3/libblis-mt.a -lm
-lpthread -fopenmp -lrt' Running cblat1.x > 'out.cblat1'
<<< More compilation output >>>
Linking zblat3.x against 'libf2c.a lib/zen3/libblis-mt.a -lm
-lpthread -fopenmp -lrt' Running zblat3.x <
'./blastest/input/zblat3.in' (output to 'out.zblat3')
check-blastest.sh: All BLAS tests passed!
4.4.3. Testing/Benchmarking#
The AOCL-BLAS source has an API-specific test driver; this section explains how to use it for a specific set of matrix sizes.
The source file for the GEMM benchmark is test/test_gemm.c and the executable is test/test_gemm_blis.x.
Complete the following steps to execute the GEMM tests on specific input parameters:
Enabling File Inputs
By default, file input/output is disabled (instead it uses start, end, and step sizes). To enable the file inputs, complete the following steps:
Open the file
test/test_gemm.c
.Uncomment the macro at the start of the file:
#define FILE_IN_OUT
Building Test Driver
Execute the following commands to build the test driver:
$ cd tests
$ make -j blis
Creating an Input File
The input file accepts matrix sizes and strides in the following format: each dimension is separated by a space and each entry is separated by a new line.
For example: m k n lda ldb ldc. Where:
Matrix A is of size m x k
Matrix B is of size k x n
Matrix C is of size m x n
lda is leading dimension of matrix A
ldb is leading dimension of matrix B
ldc is leading dimension of matrix C
This test application (test_gemm.c) assumes column-major storage of matrices.
The valid values of lda, ldb, and ldc for a GEMM operation C = beta*C + alpha* A * B, are as follows:
lda >= m
ldb >= k
ldc >= m
Running the Tests
Execute the following commands to run the tests:
$ cd tests
$ ./test_gemm_blis.x <input file name> <output file name>
An execution sample (with the test driver) for GEMM is as follows:
$ cat inputs.txt
200 100 100 200 200 200
10 4 1 100 100 100
4000 4000 400 4000 4000 4000
$ ./test_gemm_blis.x inputs.txt outputs.txt
_BLAS m k n cs_a cs_b cs_c gflops
data_gemm_blis 200 100 100 200 200 200 27.211
data_gemm_blis 10 4 1 100 100 100 0.027
data_gemm_blis 4000 4000 400 4000 4000 4000 45.279
$ cat outputs.txt
m k n cs_a cs_b cs_c gflops
200 100 100 200 200 200 27.211
10 4 1 100 100 100 0.027
4000 4000 400 4000 4000 4000 45.279
4.4.4. AOCL-BLAS Utility APIs#
This section explains some of the AOCL-BLAS APIs used to get the AOCL-BLAS library configuration information and for configuring optimization tuning parameters.
API | Usage
---|---
bli_info_get_version_str() | Returns the version string.
bli_info_get_int_type_size() | Returns the value of BLIS_INT_TYPE_SIZE (the integer type size).
bli_info_get_enable_openmp() / bli_info_get_enable_pthreads() | Returns true if OpenMP/pthreads are enabled and false otherwise.
bli_thread_get_num_threads() | Returns the default number of threads used for the subsequent BLAS calls.
bli_thread_set_num_threads() | Sets the number of threads for the subsequent BLAS calls.
bli_thread_set_ways() | Sets the number of threads for different levels of parallelization as per the GotoBLAS five-loop architecture.
For more details on the threading-related APIs, see amd/blis
4.5. Debugging and Troubleshooting#
4.5.1. Error Handling in AOCL-BLAS#
The original Netlib BLAS defined an error handling function XERBLA, which is called within BLAS2 and BLAS3 routines if an incorrect input argument is detected. Only incorrect matrix or vector sizes and invalid option arguments (for specifying a transposed matrix, the upper or lower part of a symmetric matrix, and so on) can be detected. BLAS does not detect extreme values (such as Inf or NaN) within the supplied matrices and vectors; it is the user's responsibility to check for these if required.
The functionality of Netlib’s XERBLA
is to print a message to
standard output and stop execution of the process. Stopping is
extremely unhelpful in many applications and usage scenarios. Thus,
AOCL-BLAS, in common with other similar libraries, has traditionally
disabled the stop statement. In AOCL 4.2, the functionality of
AOCL-BLAS has been enhanced to give users more choice over both
stopping the application on error and printing a message on error.
In AOCL 5.0 this functionality was added to the similar cblas_xerbla
error handling function.
The choices are specified by setting each of the environment
variables BLIS_STOP_ON_ERROR
and BLIS_PRINT_ON_ERROR
to 0 or 1 to
respectively disable or enable the functionality. The default values
for each are:
Environment Variable | Default Value
---|---
BLIS_STOP_ON_ERROR | 0
BLIS_PRINT_ON_ERROR | 1
When the stop on error is disabled, no error code is passed back to
the user application through the BLAS interface arguments, unlike the
INFO
argument used in LAPACK routines. Therefore, AOCL-BLAS has also
added an extra function to return the value of INFO
from the previous
call to a BLAS routine made by the same thread. The function can be
called as follows:
**In C/C++:**
#include <blis.h>
...
gint_t info_value = bli_info_get_info_value();
**In Fortran:**
integer :: info_value
integer, external :: bli_info_get_info_value
...
info_value = bli_info_get_info_value()
If the returned value is not zero, the value indicates the argument
in the preceding BLAS call that was incorrect.
Note
Errors from an incorrect setting of the BLIS_ARCH_TYPE
environment variable (used to override the default choice in dynamic
dispatch, refer to Using Dynamic Dispatch for
details) are handled by a separate error mechanism and will not be
affected by the environment variables BLIS_STOP_ON_ERROR
and
BLIS_PRINT_ON_ERROR
.
4.5.2. Debugging Build Using GDB#
The AOCL-BLAS library can be debugged on Linux using GDB. To enable debugging support, build the library with the --enable-debug flag. Use the following commands to configure and build the debug version of AOCL-BLAS:
$ cd blis_src
$ ./configure --enable-cblas --enable-debug auto
$ make -j
Use the following commands to link the application with the binary and build application with debug support:
$ cd blis_src
$ gcc -g -O0 -I<path-to-AOCL-BLAS-header> test_gemm.c \
<path-to-AOCL-BLAS-library>/libblis.a -lpthread -lm \
-o test_gemm_blis.x
You can debug the application using gdb. A sample output of the gdb session is as follows:
$ gdb ./test_gemm_blis.x
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-12.el8
Reading symbols from ./test_gemm_blis.x...done.
(gdb) break bli_gemm_small
Breakpoint 1 at 0x677543: file kernels/zen/3/bli_gemm_small.c, line 110.
(gdb) run
Starting program: /home/dipal/work/blis_dtl/test/test_gemm_blis.x
Using host libthread_db library "/lib64/libthread_db.so.1".
BLIS Library version is : AOCL BLIS 3.1
Breakpoint 1, bli_gemm_small (alpha=0x7fffffffcf40, a=0x2471b30,
b=0x7fffffffd1c0, beta=0x2465400 <BLIS_ZERO>,
c=0x4fe66e <bli_obj_equals+300>, cntx=0x7fffffffb320, cntl=0x0) at
kernels/zen/3/ bli_gemm_small.c:110
110 {
(gdb) bt
#0 bli_gemm_small (alpha=0x7fffffffcf40, a=0x2471b30,
b=0x7fffffffd1c0, beta=0x2465400
<BLIS_ZERO>,
c=0x4fe66e <bli_obj_equals+300>, cntx=0x7fffffffb320, cntl=0x0) at
kernels/zen/3/ bli_gemm_small.c:110
#1 0x00000000007caab6 in bli_gemm_front (alpha=0x7fffffffd1c0,
a=0x7fffffffd120, b=0x7fffffffd080,
beta=0x7fffffffcfe0, c=0x7fffffffcf40, cntx=0x2471b30,
rntm=0x7fffffffce50, cntl=0x0) at frame/3/gemm/bli_gemm_front.c:83
#2 0x00000000005baf42 in bli_gemmnat (alpha=0x7fffffffd1c0,
a=0x7fffffffd120, b=0x7fffffffd080,
beta=0x7fffffffcfe0, c=0x7fffffffcf40, cntx=0x2471b30,
rntm=0x7fffffffce50) at frame/ind/oapi/bli_l3_nat_oapi.c:83
#3 0x00000000005474a2 in dgemm_ (transa=0x7fffffffd363 "N\320\a",
transb=0x7fffffffd362 "NN\320\a",
m=0x7fffffffd36c, n=0x7fffffffd364, k=0x7fffffffd368,
alpha=0x24733c0, a=0x7ffff53e2040, lda=0x7fffffffd378,
b=0x7ffff355d040, ldb=0x7fffffffd374, beta=0x2473340,
c=0x7ffff16d8040, ldc=0x7fffffffd370) at frame/compat/bla_gemm.c:559
#4 0x0000000000413a1c in main (argc=1, argv=0x7fffffffd988) at
test_gemm.c:321 (gdb)
4.5.3. Viewing Logs#
The AOCL-BLAS library provides Debug and Trace features:
Trace Log identifies the code path taken in terms of the function call chain. It prints the information on the functions invoked and their order.
Debug Log prints the other debugging information, such as values of input parameters, content, and data structures.
The key features of this functionality are as follows:
Can be enabled/disabled at compile time.
When these features are disabled at compile time, they do not consume any runtime resources and hence do not affect performance.
Compile time option is available to control the depth of trace/log levels.
All the traces are thread safe.
Performance data, such as execution time and gflops achieved, are also printed for xGEMM APIs.
4.5.3.1. Function Call Tracing#
The function call tracing is implemented using hard instrumentation of the AOCL-BLAS code. Here, the functions are grouped as per their position in the call stack. You can configure the level up to which the traces must be generated.
Complete the following steps to enable and view the traces:
Enable the trace support as follows:
Modify the source code to enable tracing.
Open file <aocl-blas folder>/aocl_dtl/aocldtlcf.h
Change the following macro from 0 to 1:
#define AOCL_DTL_TRACE_ENABLE 0
Configure the trace depth level.
Modify the source code to specify the trace depth level.
Open file <aocl-blas folder>/aocl_dtl/aocldtlcf.h
Change the following macro as required. Level 5 is a good starting compromise between detail and resource requirements. The higher the level, the deeper the call stack included in the trace; a lower level reduces the depth of the call stack used for trace generation.
#define AOCL_DTL_TRACE_LEVEL AOCL_DTL_LEVEL_TRACE_5
Build the library as explained in Build AOCL-BLAS from Source on Linux.
Run the application to generate the trace data.
The trace output file for each thread is generated in the current folder.
The following figure shows a sample running the call tracing function using the test_gemm application:
Figure 4.5 Sample Run of Function Call Tracing#
The trace data for each thread is saved in the file with appropriate naming conventions. The .txt extension is used to signify the readable file:
P<process id>_T<thread id>_aocldtl_trace.txt
View the trace data.
The output of the call trace is in a readable format, you can open the file in any of the text editors. The first column shows the level in call stack for the given function.
4.5.3.2. Debug Logging#
The debug logging works very similar to the function call tracing and uses the same infrastructure. However, it can be enabled independent of the trace feature to avoid cluttering of the overall debugging information. This feature is primarily used to print the input values of the AOCL-BLAS APIs. Additionally, it can also be used to print any arbitrary debugging data (buffers, matrices, arrays, or text).
Complete the following steps to enable and view the debug logs:
Enable the debug log support as follows:
Modify the source code to enable debug logging.
Open file <aocl-blas folder>/aocl_dtl/aocldtlcf.h
Change the following macro from 0 to 1:
#define AOCL_DTL_LOG_ENABLE 0
Configure the trace depth level.
Modify the source code to specify the debug log depth level.
Open file <aocl-blas folder>/aocl_dtl/aocldtlcf.h
Change the following macro as required. Level 5 is a good starting compromise between detail and resource requirements. The higher the level (maximum is 10), the deeper the call stack included; a lower level reduces the depth of the call stack used for trace generation.
#define AOCL_DTL_TRACE_LEVEL AOCL_DTL_LEVEL_TRACE_5
Build the library as explained in Build AOCL-BLAS from Source on Linux.
Run the application to generate the trace data.
The trace output files for each thread are generated in the current folder.
The following figure shows a sample running of AOCL-BLAS with the debug logs enabled using the test_gemm application:
Figure 4.6 Sample Run with Debug Logs Enabled#
The debug logs for each thread are saved in the file with appropriate naming conventions. The .txt extension is used to signify the readable file:
P<process id>_T<thread id>_aocldtl_log.txt
View the debug logs.
The output of the debug logs is in a readable format, you can open the file in any text editor. The following figure shows the sample output for one of the threads of the test_gemm application:
$ cat P3386555_T0_aocldtl_log.txt
dgemm_blis_impl D N N 4000 4000 4000 1.300000 0.000000 4000 4000 0.700000 0.000000 4000 nt=1 911.148 ms 70.482 GFLOPS
dgemm_blis_impl D N N 4000 4000 4000 1.300000 0.000000 4000 4000 0.700000 0.000000 4000 nt=8 121.024 ms 557.641 GFLOPS
4.5.3.3. Usage and Limitations#
The debug and trace logs have the following usage and limitations:
When tracing is enabled, there could be a significant drop in the performance.
Only a function that has the trace feature in the code can be traced. To get the trace information for any other function, the source code must be updated to add the trace/log macros in them.
Call tracing and debug logging are resource-intensive processes and can generate a large amount of data. Depending on the hardware configuration (disk space, number of cores and threads) used for the execution, logging may result in a sluggish or non-responsive system.
4.5.4. Checking AOCL-BLAS Operation Progress#
The AOCL libraries may be used to perform lengthy computations (for example, matrix multiplications and solvers involving large matrices). These computations may run for hours.
The AOCL progress feature provides a mechanism for the application to check the progress of a computation. The AOCL libraries (AOCL-BLAS and AOCL-LAPACK) periodically update the application with the progress made through a callback function.
Usage
The application must define the callback function in a specific format and register it with the AOCL library.
Callback Definition
The callback function prototype must be defined as follows:
dim_t AOCL_progress(const char* const api, const dim_t lapi, const dim_t progress,
const dim_t current_thread, const dim_t total_threads)
However, you can modify the function name as per your preference.
The following table explains different parameters passed to the callback function:
Parameter | Purpose
---|---
api | Name of the API currently running
lapi | Length of the API name string (*api)
progress | Linear progress made in the current thread so far
current_thread | Current thread ID
total_threads | Total number of threads used to perform the operation
Callback Registration
The callback function must be registered with the library for reporting the progress. Each library has its own callback registration function. The registration can be done by calling:
AOCL_BLIS_set_progress(AOCL_progress); // for AOCL-BLAS
Example
The library invokes the callback function at appropriate intervals; it is up to the user to consume this information appropriately. The following example shows how to use it to print the progress to standard output:
dim_t AOCL_progress(const char* const api, const dim_t lapi,
const dim_t progress,const dim_t current_thread,
const dim_t total_threads)
{
printf("\n%s, total thread = %lld, processed %lld element by thread %lld.",
api, total_threads, progress, current_thread);
return 0;
}
Register the callback with:
AOCL_BLIS_set_progress(AOCL_progress); // for AOCL-BLAS
The result is displayed in following format (output truncated):
$ BLIS_NUM_THREADS=5 ./test_gemm_blis.x
dgemm, total thread = 5, processed 11796480 element by thread 4.
dgemm, total thread = 5, processed 17694720 element by thread 0.
dgemm, total thread = 5, processed 5898240 element by thread 2.
dgemm, total thread = 5, processed 20643840 element by thread 0.
dgemm, total thread = 5, processed 14745600 element by thread 3.
dgemm, total thread = 5, processed 14745600 element by thread 4.
Limitations
The feature only shows whether the operation is progressing; it does not provide an estimate or percentage of completion.
A separate callback must be registered for AOCL-BLAS, AOCL-LAPACK, and AOCL-ScaLAPACK.
4.6. LPGEMM in AOCL-BLAS#
4.6.1. Add-on in AOCL-BLAS#
An add-on in AOCL-BLAS provides additional APIs, operations, and/or implementations that may be useful to certain users. It can be a standalone extension of AOCL-BLAS that does not depend on any other add-on, although add-ons may utilize existing functionality or kernels within the core framework.
An add-on should never provide APIs that conflict with the interfaces belonging to the BLIS typed or object API. Thus, a properly constructed/functioning add-on would never interfere with or change the core BLIS functionality or the standard BLAS and CBLAS APIs.
Low Precision GEMM (LPGEMM) APIs are added as an add-on feature with the name aocl_gemm in AOCL-BLAS 4.1; they are used in the inference phase of Deep Neural Network (DNN) applications. For example, a low-precision DNN may take as input image pixels that are unsigned 8-bit (u8) and quantized pre-trained weights that are signed 8-bit (s8), and produce signed 32-bit or downscaled/quantized 8-bit output.
At the same time, these APIs are expected to utilize architecture features such as AVX512-VNNI instructions, which are designed to take inputs in u8 and s8 and produce an s32 output at high throughput. Similarly, AVX512-BF16 instructions expect input in the Brain Floating Point (bfloat16) type to provide higher throughput at less precision than 32-bit.
4.6.2. API Naming and Arguments#
LPGEMM APIs start with the prefix aocl_gemm_ followed by the data types of input matrices A and B, the accumulation type, and the output matrix C type. For example, the aocl_gemm_u8s8s32os32() API expects input matrix A to be unsigned 8-bit (u8), matrix B to be signed 8-bit (s8), the accumulation matrix C to be signed 32-bit (s32), and the output matrix type to be signed 32-bit (os32).
4.6.3. Post-Operations#
The low precision GEMM operations are highly useful in AI applications, where the precision requirements can be traded with performance. In DNN applications element-wise operations, such as adding bias, clipping the output, ReLU, and GeLU are performed on the GEMM output which are referred here as post-operations (post-ops).
In LPGEMM, these post-ops are fused with the GEMM operation to avoid repeated access to memory and thereby, improving the performance. In the LPGEMM APIs, an additional argument is added for the user to provide information about the post-ops needed to perform after the GEMM operation.
4.6.4. APIs and Post-Ops in aocl_gemm#
4.6.4.1. Architecture Features and APIs#
Architecture Features Required | API
---|---
AVX512-VNNI | aocl_gemm_u8s8s32os32
 | aocl_gemm_u8s8s32os8
 | aocl_gemm_s8s8s32os32
 | aocl_gemm_s8s8s32os8
AVX2 | aocl_gemm_u8s8s16os16
 | aocl_gemm_u8s8s16os8
 | aocl_gemm_u8s8s16ou8
 | aocl_gemm_s8s8s16os16
 | aocl_gemm_s8s8s16os8
AVX512-BF16 | aocl_gemm_bf16bf16f32of32
 | aocl_gemm_bf16bf16f32obf16
 | aocl_gemm_bf16s4f32of32
 | aocl_gemm_bf16s4f32obf16
AVX512 / AVX2 | aocl_gemm_f32f32f32of32
4.6.4.2. Utility APIs in aocl_gemm Add-on#
Post-op | Description
---|---
Add bias | Adds bias to the GEMM output before storing into C, where the bias data is passed by the user using the post-op interface.
ReLU | Performs the ReLU operation on the GEMM output. f(x) = 0 when x <= 0 and f(x) = x when x > 0.
PReLU | Performs the Parametric ReLU operation on the GEMM output based on a scale given by the user. f(x) = x when x > 0 and f(x) = scale*x when x <= 0.
SWISH | Sigmoid Weighted Linear Unit (SiLU) when beta=1. SWISH(x) = x*sigmoid(beta*x)
GeLU_Tanh | Performs tanh-based GeLU on the GEMM output. GeLU_Tanh(x) = 0.5*x*(1 + tanh(0.797884*(x + 0.044715*x^3)))
GeLU_ERF | Performs erf-based GeLU on the GEMM output. GeLU_Erf(x) = 0.5*x*(1 + erf(x*0.707107))
Scale | Performs a scale operation on the GEMM output based on the scale provided by the user.
Clip | Performs a clip operation on the GEMM output based on minimum and maximum values given by the user.
Matrix Add | Performs elementwise addition of a given matrix D to the GEMM output C. C := (beta*C + alpha*A*B) + D
Matrix Mul | Performs elementwise multiplication of a given matrix D with the GEMM output and updates C. C := (beta*C + alpha*A*B) * D
LPGEMM APIs support reordering the entire input matrix before calling GEMM, as well as on-the-go packing, where the GEMM API takes care of packing the matrix internally. The following utility APIs are used to reorder the input weight matrix before calling GEMM:
API | Description
---|---
aocl_get_reorder_buff_size_XXX() | Returns the buffer size required to reorder an input matrix, where XXX corresponds to each of the data type combinations specified in Required Architecture Features and APIs. For example, u8s8s32os32.
aocl_reorder_XXX() | Reorders the given input and writes it into the output buffer.
aocl_gelu_tanh_f32() | Performs the tanh-based GeLU operation on each element of the given input buffer and writes the result to the output buffer.
aocl_gelu_erf_f32() | Performs the erf-based GeLU operation on each element of the given input buffer and writes the result to the output buffer.
aocl_softmax_f32() | Performs the softmax operation on the given input buffer and writes the result to the output buffer.
aocl_gemm_eltwise_ops_XX | Performs a sequence of element-wise operations on a given input matrix. The sequence can contain any of the supported post-ops.
4.6.5. Enabling aocl_gemm Add-on#
Enabling aocl_gemm add-on while building AOCL-BLAS from Source on Linux:
Building with GCC:
$ ./configure -a aocl_gemm --enable-cblas --enable-threading=openmp \
--prefix=<your-install-dir> CC=gcc CXX=g++ [auto | amdzen]
Building with AOCC:
$ ./configure -a aocl_gemm --enable-cblas --enable-threading=openmp \
--prefix=<your-install-dir> CC=clang CXX=clang++ [auto | amdzen]
The aocl_gemm add-on feature is supported on Windows with clang version 18.0 and above. To enable it, add -DENABLE_ADDON="aocl_gemm" to the cmake command line as mentioned in Build AOCL-BLAS from Source on Windows.
Refer to blis.h file for all the prototypes of LPGEMM APIs.
Some LPGEMM APIs are supported only when architecture features such as avx512vnni and avx512bf16 are available on the machine, as mentioned in Required Architecture Features and APIs. These APIs return without doing anything when those features are not available.
Transpose support for A and B is not available for AVX2 s16 APIs.
4.6.6. Sample Application 1#
The following sample application is to use the LPGEMM APIs without post-ops:
/*
$gcc test_lpgemm.c -o ./test_lpgemm.x -I/aocl-blis_install_directory/include/amdzen/
-L/aocl-blis_install_directory/lib/amdzen/ -lblis-mt -lm
Note: Export blis library path to LD_LIBRARY_PATH before running the
executable ./test_lpgemm.x
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "blis.h"
// Example program to demonstrate LPGEMM API usage.
// aocl_gemm_f32f32f32of32 (A:float, B:float, C:float) used here.
int main()
{
dim_t m = 1024;
dim_t n = 1024;
dim_t k = 1024;
// Leading dimensions for row major matrices.
dim_t lda = k;
dim_t ldb = n;
dim_t ldc = n;
err_t err = BLIS_SUCCESS;
float *a = (float *)bli_malloc_user(sizeof(float) * m * k, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
float *b = (float *)bli_malloc_user(sizeof(float) * n * k, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
float *c = (float *)bli_malloc_user(sizeof(float) * m * n, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// Functions to fill the matrices with data can be added here.
float alpha = 2.2;
float beta = 9.15;
char storage = 'r'; // Row major. Use 'c' for column major.
char transa = 'n'; // No transpose. Transpose not supported for all API's.
char transb = 'n';
char reordera = 'n';
char reorderb = 'n';
aocl_gemm_f32f32f32of32(storage, transa, transb,
m, n, k,
alpha,
a, lda, reordera,
b, ldb, reorderb,
beta,
c, ldc,
NULL);
bailout:
if (a != NULL)
{
bli_free_user(a);
}
if (b != NULL)
{
bli_free_user(b);
}
if (c != NULL)
{
bli_free_user(c);
}
return 0;
}
4.6.7. Sample Application 2#
The following sample application is to demonstrate usage of LPGEMM API with reordered B matrix and post-ops:
/*
$gcc test_lpgemm.c -o ./test_lpgemm.x -I/aocl-blis_install_directory/include/amdzen/
-L/aocl-blis_install_directory/lib/amdzen/ -lblis-mt -lm
Note: Export blis library path to LD_LIBRARY_PATH before running the
executable ./test_lpgemm.x
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "blis.h"
// aocl_gemm_bf16bf16f32of32 (A:bfloat16, B:bfloat16, C:float) used here.
// 3 post-ops - bias + gelu_tanh + clip used here.
int main()
{
dim_t m = 1024;
dim_t n = 1024;
dim_t k = 1024;
// Leading dimensions for row major matrices.
dim_t lda = k;
dim_t ldb = n;
dim_t ldc = n;
err_t err = BLIS_SUCCESS;
bfloat16 *a = (bfloat16 *)bli_malloc_user(sizeof(bfloat16) * m * k, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
bfloat16 *b = (bfloat16 *)bli_malloc_user(sizeof(bfloat16) * n * k, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
float *c = (float *)bli_malloc_user(sizeof(float) * m * n, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// Functions to fill the matrices with data can be added here.
float alpha = 2.95;
float beta = 3.5;
char storage = 'r'; // Row major. Use 'c' for column major.
char transa = 'n'; // No transpose. Transpose not supported.
char transb = 'n';
char reordera = 'n';
char reorderb = 'r'; // B matrix will be reordered.
// Initialize post-ops struct.
aocl_post_op *post_ops = NULL;
post_ops = (aocl_post_op *)bli_malloc_user(sizeof(aocl_post_op), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
dim_t max_post_ops_seq_length = 3; // bias + gelu_tanh + clip
post_ops->seq_vector =
(AOCL_POST_OP_TYPE *) bli_malloc_user(
max_post_ops_seq_length * sizeof(AOCL_POST_OP_TYPE),
&err);
if (err != BLIS_SUCCESS) { goto bailout; }
// 1 bias instance, need to allocate dynamically.
post_ops->seq_vector[0] = BIAS;
post_ops->bias =
bli_malloc_user(1 * sizeof(aocl_post_op_bias), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// The bias array must use the output accumulation (float) type.
(post_ops->bias + 0)->bias = bli_malloc_user(n * sizeof(float), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// Add function to fill bias array here.
// 2 element-wise post-ops, need to allocate dynamically.
post_ops->seq_vector[1] = ELTWISE; // For gelu_tanh
post_ops->seq_vector[2] = ELTWISE; // For clip
post_ops->eltwise =
bli_malloc_user(2 * sizeof(aocl_post_op_eltwise), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// Gelu tanh.
(post_ops->eltwise + 0)->is_power_of_2 = FALSE;
(post_ops->eltwise + 0)->scale_factor = NULL;
(post_ops->eltwise + 0)->algo.alpha = NULL;
(post_ops->eltwise + 0)->algo.beta = NULL;
(post_ops->eltwise + 0)->algo.algo_type = GELU_TANH;
// Clip.
(post_ops->eltwise + 1)->is_power_of_2 = FALSE;
(post_ops->eltwise + 1)->scale_factor = NULL;
// Min bound is represented by alpha.
(post_ops->eltwise + 1)->algo.alpha =
bli_malloc_user(sizeof(float), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// Max bound is represented by beta.
(post_ops->eltwise + 1)->algo.beta =
bli_malloc_user(sizeof(float), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// Set some min/max bounds.
*((float*)(post_ops->eltwise + 1)->algo.alpha) = -64.5;
*((float*)(post_ops->eltwise + 1)->algo.beta) = 3.9;
(post_ops->eltwise + 1)->algo.algo_type = CLIP;
post_ops->seq_length = 3;
// Reorder the B matrix; this pre-packs B so that packing
// costs are not incurred when executing GEMM.
siz_t b_reorder_buffer_size =
aocl_get_reorder_buf_size_bf16bf16f32of32(storage, transb, 'B', k, n );
bfloat16* b_reorder =
(bfloat16*)bli_malloc_user(b_reorder_buffer_size, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
aocl_reorder_bf16bf16f32of32(storage, transb, 'B',
b, b_reorder,
k, n, ldb);
aocl_gemm_bf16bf16f32of32(storage, transa, transb,
m, n, k,
alpha,
a, lda, reordera,
b_reorder, ldb, reorderb,
beta,
c, ldc,
post_ops);
bailout:
// Guard: post_ops may still be NULL if an earlier allocation failed.
if (post_ops != NULL)
{
if (post_ops->eltwise != NULL)
{
if ((post_ops->eltwise + 1)->algo.alpha != NULL)
{
bli_free_user((post_ops->eltwise + 1)->algo.alpha);
}
if ((post_ops->eltwise + 1)->algo.beta != NULL)
{
bli_free_user((post_ops->eltwise + 1)->algo.beta);
}
bli_free_user(post_ops->eltwise);
}
if (post_ops->bias != NULL)
{
if ((post_ops->bias + 0)->bias != NULL)
{
bli_free_user((post_ops->bias + 0)->bias);
}
bli_free_user(post_ops->bias);
}
if (post_ops->seq_vector != NULL)
{
bli_free_user(post_ops->seq_vector);
}
bli_free_user(post_ops);
}
if (b_reorder != NULL)
{
bli_free_user(b_reorder);
}
if (a != NULL)
{
bli_free_user(a);
}
if (b != NULL)
{
bli_free_user(b);
}
if (c != NULL)
{
bli_free_user(c);
}
return 0;
}
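Sample 2 needs bfloat16 input data. Assuming AOCL's bfloat16 is a 16-bit container with the standard bfloat16 bit layout, one simple way to produce test values is to truncate the low 16 mantissa bits of a float (round-toward-zero, which is adequate for test data). The name float_to_bf16_trunc is hypothetical, not an AOCL-BLAS API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: produce a bfloat16 bit pattern from a float by
// dropping the low 16 mantissa bits. This keeps the sign, the full
// 8-bit exponent, and the top 7 mantissa bits -- the bfloat16 layout.
static uint16_t float_to_bf16_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits)); // type-pun via memcpy, avoids aliasing UB
    return (uint16_t)(bits >> 16);
}
```

A fill function for the bfloat16 matrices could convert each float test value with this helper before storing it.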
4.6.8. Sample Application 3#
The following sample application demonstrates usage of the LPGEMM downscale API with multiple scale post-ops and int4-to-int8 B matrix reordering:
/*
$gcc test_lpgemm.c -o ./test_lpgemm.x -I/aocl-blis_install_directory/include/amdzen/
-L/aocl-blis_install_directory/lib/amdzen/ -lblis-mt -lm
Note: Export the blis library path to LD_LIBRARY_PATH before running the
executable ./test_lpgemm.x
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "blis.h"
// aocl_gemm_u8s8s32os8 (A:uint8_t, B:int8_t, C:int8_t) used here.
// 3 post-ops - scale + matrix_add + scale used here.
int main()
{
dim_t m = 1024;
dim_t n = 1024;
dim_t k = 1024;
// Leading dimensions for row major matrices.
dim_t lda = k;
dim_t ldb = n;
dim_t ldc = n;
err_t err = BLIS_SUCCESS;
uint8_t *a = (uint8_t *)bli_malloc_user(sizeof(uint8_t) * m * k, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// int4_t B matrix represented using int8_t, but with half the int8_t size.
int8_t *b = (int8_t *)bli_malloc_user((sizeof(int8_t) * n * k) / 2, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
int8_t *c = (int8_t *)bli_malloc_user(sizeof(int8_t) * m * n, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// Functions to fill the matrices with data can be added here.
int32_t alpha = 2;
int32_t beta = 9;
char storage = 'r'; // Row major. Use 'c' for column major.
char transa = 'n'; // No transpose. Transpose not supported.
char transb = 'n';
char reordera = 'n';
char reorderb = 'r';
// Initialize post-ops struct.
aocl_post_op *post_ops = NULL;
post_ops = (aocl_post_op *)bli_malloc_user(sizeof(aocl_post_op), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// Downscale parameters need to be passed as a post-op, even
// if a downscale-specific API is invoked.
dim_t max_post_ops_seq_length = 3; // scale + matrix_add + scale
post_ops->seq_vector =
(AOCL_POST_OP_TYPE *) bli_malloc_user(
max_post_ops_seq_length * sizeof(AOCL_POST_OP_TYPE),
&err);
if (err != BLIS_SUCCESS) { goto bailout; }
// 2 scaling post-ops: the first for normal scaling, the second for
// downscaling. The scale struct must be allocated dynamically.
post_ops->sum =
bli_malloc_user(2 * sizeof(aocl_post_op_sum), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
// For first scale, using scalar zero point and scale factor.
post_ops->seq_vector[0] = SCALE;
(post_ops->sum + 0)->is_power_of_2 = FALSE;
(post_ops->sum + 0)->buff = NULL;
(post_ops->sum + 0)->zero_point =
bli_malloc_user(1 * sizeof(float), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
*((int8_t*)((post_ops->sum + 0)->zero_point)) = 3;
(post_ops->sum + 0)->zero_point_len = 1;
(post_ops->sum + 0)->scale_factor =
bli_malloc_user(1 * sizeof(float), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
*((float*)((post_ops->sum + 0)->scale_factor)) = 3.9;
(post_ops->sum + 0)->scale_factor_len = 1;
// Matrix add post-op.
post_ops->matrix_add =
bli_malloc_user(1 * sizeof(aocl_post_op_matrix_add), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
post_ops->seq_vector[1] = MATRIX_ADD;
(post_ops->matrix_add + 0)->matrix =
bli_malloc_user(sizeof(int8_t) * m * n, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
(post_ops->matrix_add + 0)->ldm = n;
// Add function to fill matrix_add matrix here.
// For second scale, using vector zero point and scale factor.
// This scale post-op is purely for downscaling/quantization.
post_ops->seq_vector[2] = SCALE;
(post_ops->sum + 1)->is_power_of_2 = FALSE;
(post_ops->sum + 1)->buff = NULL;
(post_ops->sum + 1)->zero_point =
bli_malloc_user(n * sizeof(float), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
(post_ops->sum + 1)->zero_point_len = n;
(post_ops->sum + 1)->scale_factor =
bli_malloc_user(n * sizeof(float), &err);
if (err != BLIS_SUCCESS) { goto bailout; }
(post_ops->sum + 1)->scale_factor_len = n;
// Add function to fill zero point and scale factor here.
post_ops->seq_length = 3;
// Reorder the B matrix; this pre-packs B so that packing costs are
// not incurred when executing GEMM. Here the int4 B matrix is
// reordered and each element is converted to int8 type.
siz_t b_reorder_buffer_size =
aocl_get_reorder_buf_size_u8s4s32os32(storage, transb, 'B', k, n );
int8_t* b_reorder = (int8_t*)bli_malloc_user(b_reorder_buffer_size, &err);
if (err != BLIS_SUCCESS) { goto bailout; }
aocl_reorder_u8s4s32os32(storage, transb, 'B',
b, b_reorder,
k, n, ldb);
aocl_gemm_u8s8s32os8(storage, transa, transb,
m, n, k,
alpha,
a, lda, reordera,
b_reorder, ldb, reorderb,
beta,
c, ldc,
post_ops);
bailout:
// Guard: post_ops may still be NULL if an earlier allocation failed.
if (post_ops != NULL)
{
if (post_ops->sum != NULL)
{
if ((post_ops->sum + 0)->zero_point != NULL)
{
bli_free_user((post_ops->sum + 0)->zero_point);
}
if ((post_ops->sum + 0)->scale_factor != NULL)
{
bli_free_user((post_ops->sum + 0)->scale_factor);
}
if ((post_ops->sum + 1)->zero_point != NULL)
{
bli_free_user((post_ops->sum + 1)->zero_point);
}
if ((post_ops->sum + 1)->scale_factor != NULL)
{
bli_free_user((post_ops->sum + 1)->scale_factor);
}
bli_free_user(post_ops->sum);
}
if (post_ops->matrix_add != NULL)
{
if ((post_ops->matrix_add + 0)->matrix != NULL)
{
bli_free_user((post_ops->matrix_add + 0)->matrix);
}
bli_free_user(post_ops->matrix_add);
}
if (post_ops->seq_vector != NULL)
{
bli_free_user(post_ops->seq_vector);
}
bli_free_user(post_ops);
}
if (b_reorder != NULL)
{
bli_free_user(b_reorder);
}
if (a != NULL)
{
bli_free_user(a);
}
if (b != NULL)
{
bli_free_user(b);
}
if (c != NULL)
{
bli_free_user(c);
}
return 0;
}
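Sample 3 stores the int4 B matrix two values per int8_t byte (hence the halved allocation size). A sketch of packing two signed 4-bit values into one byte follows; placing the first value in the low nibble is an assumption here, and the nibble order actually expected by aocl_reorder_u8s4s32os32 should be confirmed against the AOCL-BLAS documentation:

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: pack two signed 4-bit values (each in [-8, 7])
// into a single byte, first value in the low nibble. The nibble order
// expected by the AOCL reorder API is an assumption, not a guarantee.
static int8_t pack_int4_pair(int8_t lo, int8_t hi)
{
    return (int8_t)((((uint8_t)hi & 0x0Fu) << 4) | ((uint8_t)lo & 0x0Fu));
}
```

A fill loop for the packed B buffer would then write (n * k) / 2 bytes, each produced from two consecutive int4 test values.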