AOCL-LAPACK - 5.0 English

AOCL Performance Tuning Guide (63859)

Document ID: 63859
Release Date: 2024-10-10
Version: 5.0 English

3. AOCL-LAPACK

AOCL-LAPACK provides several build-time and run-time options that improve performance for different use cases. The most prominent are multi-threading, x86 ISA selection, and AOCL-specific optimizations. In general, setting the flag ENABLE_AMD_FLAGS to ON during CMake configuration turns on many optimization and interface options. The following subsections describe these options and their effect on the performance of AOCL-LAPACK. Each option can be enabled or disabled either at configure time, before the build, or at run time through the corresponding environment variable.

3.1. Enable AMD Optimizations

All performance optimizations and other library features added by AOCL are enabled by setting ENABLE_AMD_FLAGS to ON for the GCC compiler, or ENABLE_AMD_AOCC_FLAGS to ON for the AOCC compiler. These options turn on the following salient features:

  1. AOCL performance optimizations for Zen family of CPUs

  2. Parallelization using OpenMP for Shared Memory Parallelization

  3. Usage of Extended BLAS API(s) available in AOCL-BLAS

3.2. Enable / Disable Multithreading

AOCL-LAPACK supports multi-threading using OpenMP in selected APIs. This feature is enabled by default when AOCL-LAPACK is compiled with ENABLE_AMD_FLAGS=ON or ENABLE_AMD_AOCC_FLAGS=ON. However, you can disable multi-threading by setting ENABLE_MULTITHREADING=NO.

The LAPACK interface APIs that support multi-threading automatically choose an optimal number of threads. However, you can explicitly set the number of threads through the environment variable or the OpenMP runtime APIs. In that case, the number of threads is selected as follows:

Thread Criteria                                                                | Threads Used by API
User-specified threads are greater than AOCL-LAPACK computed optimal threads   | AOCL-LAPACK computed optimal threads
User-specified threads are less than AOCL-LAPACK computed optimal threads      | User-specified threads
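The selection rule in the table above amounts to taking the smaller of the two values. The sketch below models that rule only; it is not the library's code. The function name effective_threads is hypothetical, and the "optimal" value passed in is a made-up number, since the library computes it internally. The commented OMP_NUM_THREADS line assumes the standard OpenMP environment variable is the one meant here.

```shell
# Hypothetical helper that models the documented thread-selection rule.
# $1 = threads requested by the user
# $2 = optimal threads as computed by the library (illustrative value only)
effective_threads() {
    user=$1
    optimal=$2
    if [ "$user" -gt "$optimal" ]; then
        # Requests above the computed optimum are capped at the optimum.
        echo "$optimal"
    else
        # Requests at or below the optimum are honored as given.
        echo "$user"
    fi
}

effective_threads 16 8   # prints 8
effective_threads 4 8    # prints 4

# Assumption: the thread count is requested through the standard OpenMP variable.
# export OMP_NUM_THREADS=4
```

In practice, this means that requesting more threads than the library considers optimal does not oversubscribe the API, while a smaller request is always respected.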

3.3. Build-Time ISA Selection

To support binary portability across different architectures, the default compiler flags are set to -mtune=native -mavx2 -mfma -O3 when the library is compiled with the ENABLE_AMD_FLAGS or ENABLE_AMD_AOCC_FLAGS option. This means that AOCL-LAPACK requires at least AVX2 and Fused Multiply Accumulate (FMA) support from the target CPU.

However, the library can be compiled with a different ISA flag, such as AVX512, depending on the ISA supported by the target CPU.

Set the flag LF_ISA_CONFIG to the desired ISA. The available options are Auto, AVX2 (default), AVX512, and None. For example:

$ cmake .. -DLF_ISA_CONFIG=AVX512 -DENABLE_AMD_FLAGS=ON
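One possible way to choose the LF_ISA_CONFIG value on a Linux build host is to inspect the CPU feature flags before configuring. The sketch below is an illustration under assumptions: the helper name pick_isa_config is hypothetical, it relies on the Linux /proc/cpuinfo flag name avx512f, and it falls back to the documented default (AVX2) when AVX512 is not reported.

```shell
# Hypothetical helper: choose an LF_ISA_CONFIG value from a CPU flag list.
# $1 = space-separated CPU feature flags (for example, the "flags" line
#      of /proc/cpuinfo on Linux)
pick_isa_config() {
    case " $1 " in
        *" avx512f "*) echo "AVX512" ;;   # AVX-512 Foundation is reported
        *)             echo "AVX2" ;;     # documented default
    esac
}

# Demo with a literal flag list (illustrative, not read from a real CPU)
pick_isa_config "fpu sse2 avx avx2 fma avx512f"   # prints AVX512

# Possible usage on a Linux build host (left commented so the sketch
# runs without CMake or the AOCL-LAPACK sources):
# flags=$(grep -m1 '^flags' /proc/cpuinfo)
# cmake .. -DLF_ISA_CONFIG="$(pick_isa_config "$flags")" -DENABLE_AMD_FLAGS=ON
```

A binary built this way targets the build host, so it trades the portability of the default AVX2 build for a host-specific configuration.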

3.4. Run-Time ISA Selection

For select functions, AOCL-LAPACK automatically dispatches to a suitable code path based on the ISA of the target CPU. However, you can select a different ISA code path using the environment variable AOCL_ENABLE_INSTRUCTIONS. Valid values for AOCL_ENABLE_INSTRUCTIONS are SSE2, AVX, AVX2, AVX512, and GENERIC. All values are case-insensitive.

When you set AOCL_ENABLE_INSTRUCTIONS to an ISA higher than the target CPU supports, AOCL-LAPACK chooses the best code path supported on that CPU. If you choose a lower-level ISA, that ISA is used. Any ISA selection lower than AVX2 defaults to the generic reference code path.

Case 1: On an AVX2-only machine (example: AMD Zen1 / Zen2 / Zen3)

  • Setting AOCL_ENABLE_INSTRUCTIONS=AVX2 takes the avx2 path.

  • Setting AOCL_ENABLE_INSTRUCTIONS=AVX512 takes the avx2 path.

  • Setting AOCL_ENABLE_INSTRUCTIONS=generic, sse2, or avx takes the reference path.

Case 2: On an AVX512 machine (example: AMD Zen4 / Zen5)

  • Setting AOCL_ENABLE_INSTRUCTIONS=AVX512 takes the avx512 path.

  • Setting AOCL_ENABLE_INSTRUCTIONS=AVX2 takes the avx2 path.

  • Setting AOCL_ENABLE_INSTRUCTIONS=generic, sse2, or avx takes the reference path.

Case 3: Setting AOCL_ENABLE_INSTRUCTIONS to any value other than avx512, avx2, avx, sse2, or generic results in an error.

Performance varies based on the function and size of the inputs.
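The three cases above can be summarized as a small decision function. The sketch below only models the documented behavior; it does not call the library. The function name resolve_isa_path and its second argument (the highest ISA the CPU supports, either avx2 or avx512) are hypothetical.

```shell
# Hypothetical model of the documented dispatch rules.
# $1 = value given to AOCL_ENABLE_INSTRUCTIONS (case-insensitive)
# $2 = highest ISA supported by the CPU: avx2 or avx512
resolve_isa_path() {
    requested=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    cpu_max=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    case "$requested" in
        avx512)
            # A request above what the CPU supports falls back to avx2.
            if [ "$cpu_max" = "avx512" ]; then echo "avx512"; else echo "avx2"; fi ;;
        avx2)
            echo "avx2" ;;
        generic|sse2|avx)
            # Anything below AVX2 uses the generic reference path.
            echo "reference" ;;
        *)
            # Case 3: unrecognized values are an error.
            echo "error"
            return 1 ;;
    esac
}

resolve_isa_path AVX512 avx2     # prints avx2
resolve_isa_path avx2 avx512     # prints avx2
resolve_isa_path SSE2 avx512     # prints reference

# Setting the variable for a single run of a hypothetical application binary:
# AOCL_ENABLE_INSTRUCTIONS=AVX2 ./my_app
```

Forcing a lower ISA this way can be useful for comparing code paths on the same machine, since performance depends on the function and input size.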

3.5. Using AOCL-BLAS

AOCL-LAPACK can be linked with any Netlib BLAS compliant library when compiled with the standard CMake options described in the AOCL User Guide. However, AOCL-LAPACK also provides an option to link explicitly with the AOCL-BLAS library at compile time. This option enables AOCL-LAPACK to invoke lower-level AOCL-BLAS APIs directly, which could result in better performance for certain APIs on AMD “Zen” CPUs. To force AOCL-LAPACK to use the AOCL-BLAS library, provide the option ENABLE_AOCL_BLAS in the CMake configuration:

$ cmake -DENABLE_AMD_AOCC_FLAGS=ON -DENABLE_AOCL_BLAS=ON ...

Provide the path of the AOCL-BLAS library using one of the following methods:

  • Set the AOCL_ROOT environment variable to the root path where the AOCL-BLAS library ($AOCL_ROOT/lib) and header files ($AOCL_ROOT/include) are located:

    $ export AOCL_ROOT=<path to AOCL-BLAS>

  • Specify the root path of the AOCL-BLAS library through the CMake option AOCL_ROOT:

    $ cmake -DENABLE_AMD_AOCC_FLAGS=ON -DENABLE_AOCL_BLAS=ON \
    -DAOCL_ROOT=<path to AOCL-BLAS> ...

The path specified in AOCL_ROOT must contain the directories include and lib, which hold the AOCL-BLAS header files and library binary respectively.
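Before running CMake, it can help to confirm that the chosen root has the expected layout. The following sketch checks only for the two directories named above; the function name check_aocl_root is hypothetical, and the check does not verify that the directories actually contain AOCL-BLAS files.

```shell
# Hypothetical pre-configure check for the AOCL_ROOT directory layout.
# $1 = candidate root path for AOCL-BLAS
check_aocl_root() {
    root=$1
    if [ -d "$root/include" ] && [ -d "$root/lib" ]; then
        echo "ok"
    else
        echo "missing include/ or lib/ under $root" >&2
        return 1
    fi
}

# Possible usage (left commented so the sketch runs without an installation):
# check_aocl_root "$AOCL_ROOT" && \
#     cmake -DENABLE_AMD_AOCC_FLAGS=ON -DENABLE_AOCL_BLAS=ON -DAOCL_ROOT="$AOCL_ROOT" ...
```

The function returns a non-zero status when either directory is absent, so it can gate the configure step in a build script.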

3.6. Using Extended BLAS APIs

As mentioned earlier, usage of Extended BLAS APIs is enabled by setting ENABLE_AMD_FLAGS or ENABLE_AMD_AOCC_FLAGS to ON. To disable this feature, set the CMake option ENABLE_BLAS_EXT_GEMMT to OFF:

$ cmake -DENABLE_AMD_FLAGS=ON -DENABLE_BLAS_EXT_GEMMT=OFF ...