AOCL-BLAS - 5.0 English

AOCL Performance Tuning Guide (63859)

Document ID: 63859
Release Date: 2024-10-10
Version: 5.0 English

2. AOCL-BLAS

2.1. AOCL-BLAS Thread Control

Multi-threaded builds of AOCL-BLAS provide several mechanisms for setting the desired number of threads, both during initialization and at runtime, as explained below.

2.1.1. AOCL-BLAS Initialization

During AOCL-BLAS initialization, the preferred number of threads in the BLAS routines can be set by an application in multiple ways as follows:

  • bli_thread_set_num_threads(nt) AOCL-BLAS library API

  • Valid value of BLIS_NUM_THREADS environment variable

  • omp_set_num_threads(nt) OpenMP library API

  • Valid value of OMP_NUM_THREADS environment variable

If the application uses none of these, the default behaviour is compiler-dependent, but it often defaults to using all logical cores enabled on the node as the preferred number of threads.

If the number of threads is set in more than one of these ways, AOCL-BLAS applies them in the order listed above, with the first item having the highest precedence.

The following tables describe sample scenarios for setting the number of threads during AOCL-BLAS initialization, each for the code shown before it:

#include <omp.h>

int main()
{
   // Pseudo code: use the OpenMP API to set the number of threads

   omp_set_num_threads(16);
   dgemm_( );   // arguments omitted for brevity
   // ...
   return 0;
}

| Sample Command Executed | Number of Threads Set During AOCL-BLAS Initialization | Remarks |
|---|---|---|
| $ BLIS_NUM_THREADS=8 ./my_blis_program | 8 | BLIS_NUM_THREADS takes precedence over omp_set_num_threads(16). |
| $ ./my_blis_program | 16 | BLIS_NUM_THREADS is not set, so omp_set_num_threads(16) takes effect. |
| $ OMP_NUM_THREADS=4 ./my_blis_program | 16 | BLIS_NUM_THREADS is not set, so omp_set_num_threads(16) takes effect because it has higher precedence than OMP_NUM_THREADS. |
| $ BLIS_NUM_THREADS=8 OMP_NUM_THREADS=4 ./my_blis_program | 8 | BLIS_NUM_THREADS is set to 8, so omp_set_num_threads(nt) and OMP_NUM_THREADS have no effect. |

int main()
{
   // pseudo code

   dgemm_( );
   // ...
   return 0;
}

| Sample Command Executed | Number of Threads Set During AOCL-BLAS Initialization | Remarks |
|---|---|---|
| $ BLIS_NUM_THREADS=8 ./my_blis_program | 8 | BLIS_NUM_THREADS has the highest precedence among the settings used. |
| $ ./my_blis_program | 64 | BLIS_NUM_THREADS is not set, omp_set_num_threads() is not issued, and OMP_NUM_THREADS is not set. Assuming 64 logical cores, the number of threads is 64. If the program is launched under numactl --physcpubind=<...>, the number of cores derived from that option is used instead. |
| $ OMP_NUM_THREADS=4 ./my_blis_program | 4 | BLIS_NUM_THREADS is not set, omp_set_num_threads() is not issued, and OMP_NUM_THREADS is set to 4. |
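The precedence rules behind the two tables above can be sketched as a small stand-alone model. This is only an illustration, not AOCL-BLAS code: the struct thread_sources and the function resolve_init_threads are hypothetical names, a value of 0 stands for "not set or not called", and runtime_default stands for whatever the OpenMP runtime would choose on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the four initialization-time sources.
 * A value of 0 means "not set / not called". */
typedef struct {
    int bli_api;   /* value passed to bli_thread_set_num_threads(nt) */
    int blis_env;  /* value of BLIS_NUM_THREADS */
    int omp_api;   /* value passed to omp_set_num_threads(nt) */
    int omp_env;   /* value of OMP_NUM_THREADS */
} thread_sources;

/* Scan the sources from highest to lowest precedence and return the
 * first one that is set; otherwise fall back to the runtime default. */
int resolve_init_threads(thread_sources s, int runtime_default)
{
    assert(runtime_default > 0);
    const int by_precedence[] = { s.bli_api, s.blis_env, s.omp_api, s.omp_env };
    for (size_t i = 0; i < sizeof by_precedence / sizeof by_precedence[0]; ++i) {
        if (by_precedence[i] > 0)
            return by_precedence[i];
    }
    return runtime_default;
}
```

For example, the third row of the first table (omp_set_num_threads(16) in the code, OMP_NUM_THREADS=4 in the environment) corresponds to resolve_init_threads((thread_sources){0, 0, 16, 4}, 64), which yields 16 in this model.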

2.1.2. Runtime

Once the number of threads is set during AOCL-BLAS initialization, it will be used in subsequent BLAS routine execution until the application modifies the number of threads to be used (for example, using the omp_set_num_threads() API).

The following table describes the sample scenarios for setting the number of threads during runtime:

int main()
{
   // Pseudo code for sample usage of the OpenMP API to set the
   // number of threads in the application during runtime

   do {
      // Later checks override earlier ones, so the largest
      // matching threshold decides the thread count.
      if (m < 500)
         omp_set_num_threads(8);
      if (m >= 500)
         omp_set_num_threads(16);
      if (m >= 3000)
         omp_set_num_threads(32);

      dgemm_( );
   } while (test_case_counter--);
   // ...
   return 0;
}

| Sample Command Executed | m | Number of Threads for this BLAS Call | Remarks |
|---|---|---|---|
| $ ./my_blis_program | 100 | 8 | Application issued omp_set_num_threads(8) |
| | 500 | 16 | Application issued omp_set_num_threads(16) |
| | 200 | 8 | Application re-issued omp_set_num_threads(8) |
| | 4000 | 32 | Application issued omp_set_num_threads(32) |
| | 1000 | 16 | Application re-issued omp_set_num_threads(16) |
| | 500 | 16 | Application re-issued omp_set_num_threads(16) |
| | 100 | 8 | Application re-issued omp_set_num_threads(8) |
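The size-based selection used in this scenario can be factored into a helper, sketched below. The function name threads_for_m is hypothetical, and the thresholds (500 and 3000) and thread counts (8, 16, 32) are simply the ones from the pseudo code above, not tuned recommendations.

```c
#include <assert.h>

/* Map the matrix dimension m to the thread count used in the
 * pseudo code: 8 below 500, 16 from 500, and 32 from 3000. */
int threads_for_m(int m)
{
    assert(m >= 0);
    if (m >= 3000)
        return 32;
    if (m >= 500)
        return 16;
    return 8;
}
```

An application could then call omp_set_num_threads(threads_for_m(m)) once before each BLAS call instead of repeating the chain of conditions.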

2.1.3. Runtime Thread Control

AOCL-BLAS libraries that are multi-threaded with OpenMP provide two mechanisms for controlling the number of threads that AOCL-BLAS functions use: the standard OpenMP mechanisms, and AOCL-BLAS specific environment variables and function calls. The AOCL-BLAS specific mechanisms can either set the overall number of threads for AOCL-BLAS to use, or set the threading separately for the different loops within the AOCL-BLAS Level 3 routines (for example, DGEMM). These are called the automatic way and the manual way, respectively. For more information, refer to Multithreading.md.

When more than one of the following is set or called by the user, AOCL-BLAS applies this order of precedence, highest first:

  1. The AOCL-BLAS manual way values set using bli_thread_set_ways() by the application.

  2. Valid value(s) of any of the BLIS_*_NT environment variables.

  3. Value set using bli_thread_set_num_threads(nt) by the application.

  4. Valid value set for the environment variable BLIS_NUM_THREADS.

  5. omp_set_num_threads(nt) issued by the application.

  6. Valid value set for the environment variable OMP_NUM_THREADS.

  7. The default number of threads used by the chosen OpenMP runtime library when OMP_NUM_THREADS is not set.

Two other factors may override these settings:

  1. OpenMP parallelism at higher level(s) in the code calling AOCL-BLAS, that is, the number of active levels and the level at which the AOCL-BLAS call occurs.

  2. The effect of AOCL Dynamic (if enabled), as described in the next section.

Note

From AOCL 4.1, support for calling AOCL-BLAS within nested OpenMP parallelism has been improved. Hence, using the standard OpenMP mechanisms should be sufficient for most of the use cases.

2.2. AOCL Dynamic

The AOCL Dynamic feature enables AOCL-BLAS to dynamically change the number of threads.

This feature is enabled by default. It can be enabled or disabled at configuration time using the options --enable-aocl-dynamic and --disable-aocl-dynamic, respectively.

You can also specify the preferred number of threads using the environment variables BLIS_NUM_THREADS or OMP_NUM_THREADS. If both are specified, BLIS_NUM_THREADS takes precedence.

The following table summarizes how the number of threads is determined based on the status of AOCL Dynamic and the user configuration using the variable BLIS_NUM_THREADS:

| AOCL Dynamic | BLIS_NUM_THREADS | Number of Threads Used by AOCL-BLAS |
|---|---|---|
| Disabled | Unset | Number of logical cores. |
| Disabled | Set | BLIS_NUM_THREADS |
| Enabled | Unset | Number of threads determined by AOCL Dynamic. |
| Enabled | Set | Minimum of BLIS_NUM_THREADS and the number of threads determined by AOCL Dynamic. |
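The four rows of this table reduce to a short decision function, sketched below as a stand-alone model. The function effective_threads and its parameters are hypothetical: blis_num_threads is 0 when the variable is unset, and dynamic_choice is a stand-in for the count AOCL Dynamic would pick internally, which this sketch does not attempt to compute.

```c
#include <assert.h>

/* Model of the table: which thread count AOCL-BLAS ends up using.
 * blis_num_threads == 0 means BLIS_NUM_THREADS is unset.
 * dynamic_choice is a placeholder for AOCL Dynamic's own decision. */
int effective_threads(int dynamic_enabled, int blis_num_threads,
                      int dynamic_choice, int logical_cores)
{
    assert(dynamic_choice > 0 && logical_cores > 0);
    if (!dynamic_enabled)
        return blis_num_threads > 0 ? blis_num_threads : logical_cores;
    if (blis_num_threads <= 0)
        return dynamic_choice;
    /* Enabled and set: the smaller of the two values is used. */
    return blis_num_threads < dynamic_choice ? blis_num_threads : dynamic_choice;
}
```

In this model, BLIS_NUM_THREADS acts as an upper bound when AOCL Dynamic is enabled: setting it to 64 while AOCL Dynamic prefers 8 threads yields 8, whereas setting it to 4 yields 4.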

2.2.1. Limitations

The AOCL Dynamic feature has the following limitations:

  • Supported only for threading using OpenMP.

  • Supports only DGEMM, ZGEMM, DTRSM, ZTRSM, DGEMMT, DSYRK, DTRMM, SGEMV, DSCAL, ZDSCAL, DDOT, DNRM2, DZNRM2, and DAXPY APIs.

  • Specifying more threads than the number of cores may degrade performance because of over-subscription of the cores.

Based on the input parameters (such as size, transpose, and storage format), the optimal code path for the given number of threads would be executed. This can be single-threaded even if the number of threads set is more than 1.

2.3. AOCL-BLAS Multi-Thread Tuning

The AOCL-BLAS library can be used on multiple platforms and in multiple applications. Multi-threading adds more configuration options at runtime. This section explains the number of threads and CPU affinity settings that can be tuned to get the best performance for your requirements.

2.3.1. Library Usage Scenarios

  • The application and library are single-threaded:

    This is straightforward and needs no special instructions. You can export BLIS_NUM_THREADS=1 to indicate that you are running AOCL-BLAS in single-threaded mode. If both BLIS_NUM_THREADS and OMP_NUM_THREADS are set, the former takes precedence over the latter.

  • The application is single-threaded and the library is multi-threaded:

    You can use either OMP_NUM_THREADS or BLIS_NUM_THREADS to define the number of threads for the library. However, BLIS_NUM_THREADS is recommended, especially if you want AOCL-BLAS to use a different value from the OpenMP parallel regions in your application program or in other libraries.

    Example:

    $ export BLIS_NUM_THREADS=128 # Here, AOCL-BLAS runs at 128 threads.
    

    Apart from setting the number of threads, you must pin the threads to the cores using GOMP_CPU_AFFINITY or numactl as follows:

    $ BLIS_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-127 <./application>

    Or

    $ BLIS_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-127 numactl --interleave=all <./application>

    Or

    $ BLIS_NUM_THREADS=128 OMP_PROC_BIND=close numactl -C 0-127 --interleave=all <./test_application>


    Note

    For the Clang compiler, it is mandatory to use OMP_PROC_BIND=true in addition to the thread pinning (if numactl is used). For example, for a matrix size of 200 and 32 threads, DGEMM performance is lower when run without an OMP_PROC_BIND setting and improves once OMP_PROC_BIND=true is set. This problem is not observed with libgomp using the gcc compiler, where the processor affinity defined using numactl is sufficient. We recommend always setting OMP_PROC_BIND when using numactl.

  • The application is multi-threaded and the library is running a single-thread:

    When the application is multi-threaded and its number of threads is set using OMP_NUM_THREADS, it is mandatory to set BLIS_NUM_THREADS to one. Otherwise, AOCL-BLAS will run in multi-threaded mode with the number of threads equal to OMP_NUM_THREADS, which may result in poor performance.

  • The application and library are both multi-threaded:

    This is a typical scenario of nested parallelism. Whether or not nested parallelism will occur depends upon the number of levels of parallelism active in the OpenMP runtime. This can be queried using the OpenMP API call omp_get_max_active_levels and set using the environment variable OMP_MAX_ACTIVE_LEVELS or the API call omp_set_max_active_levels.

    Assuming multiple levels are active, to individually control the threading at application and at the AOCL-BLAS library level, you can either:

    • Use both OMP_NUM_THREADS and BLIS_NUM_THREADS.

      • The number of threads launched by the application is OMP_NUM_THREADS.

      • Each application thread spawns BLIS_NUM_THREADS threads.

      • For better performance, ensure that Number of Physical Cores = OMP_NUM_THREADS * BLIS_NUM_THREADS.

      Thread pinning for the application and the library can be done using OMP_PROC_BIND:

      $ OMP_NUM_THREADS=4 BLIS_NUM_THREADS=8 OMP_PROC_BIND=spread,close <./application>
      

      At an outer level, the threads are spread and at the inner level, the threads are scheduled closer to their master threads.

    • Use more advanced options for OMP_NUM_THREADS, for example:

      $ OMP_NUM_THREADS=4,8 OMP_PROC_BIND=spread,close <./application>
      
    • Use the OpenMP API call omp_set_num_threads within the application code to set the number of threads to be used in subsequent library calls:

      omp_set_num_threads(8);
      dgemm_("N","N",&M,&N,&K,&alpha,a,&lda,b,&ldb,&beta,c,&ldc);
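The sizing rule Number of Physical Cores = OMP_NUM_THREADS * BLIS_NUM_THREADS from the nested scenario can be checked with a small helper. This is a minimal sketch with a hypothetical function name, inner_threads_for. It only does the arithmetic and does not query the hardware or the OpenMP runtime.

```c
#include <assert.h>

/* Given the number of physical cores and the number of application
 * (outer) threads, return the BLIS_NUM_THREADS value that satisfies
 * physical_cores == outer_threads * inner_threads.
 * Returns 0 when the cores cannot be divided evenly. */
int inner_threads_for(int physical_cores, int outer_threads)
{
    assert(physical_cores > 0 && outer_threads > 0);
    if (physical_cores % outer_threads != 0)
        return 0;
    return physical_cores / outer_threads;
}
```

For instance, on a machine assumed to have 32 physical cores, 4 application threads leave 8 threads per AOCL-BLAS call, which matches the OMP_NUM_THREADS=4 BLIS_NUM_THREADS=8 example above.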
      

2.4. AOCL-BLAS Level-3 Block-Size Tuning

The performance of AOCL-BLAS level-3 operations such as DGEMM and DTRSM is largely determined by the block sizes used by AOCL-BLAS. A matrix-matrix multiplication with large m, n, and k dimensions is partitioned into sub-problems of the specified block sizes.

Many HPC and scientific applications and benchmarks run on high-end clusters of machines, each with multiple cores. They run programs with multiple instances, either through Message Passing Interface (MPI) based APIs or as separate instances of each program. Depending on whether the application using AOCL-BLAS runs in multi-instance mode or as a single instance, the specified block sizes will have an impact on the overall performance.

The default values for the block sizes in the AOCL-BLAS GitHub repository (amd/blis) are set to extract the best performance for such HPC applications/benchmarks, which use single-threaded AOCL-BLAS and run in multi-instance mode on AMD EPYC™ processors based on “Zen” cores. However, if your application runs as a single instance, the block sizes for optimal performance will differ.

The following settings will help you choose the optimal values for the block sizes based on the way the application is run:

2nd Gen AMD EPYC™ Processors (codenamed “Rome”)

  1. Open the file bli_family_zen2.h in the AOCL-BLAS source:

    config/zen2/bli_family_zen2.h

  2. For applications/benchmarks running in multi-instance mode and using multi-threaded AOCL-BLAS, ensure that the macro AOCL_BLIS_MULTIINSTANCE is set to 0. As of the AOCL 2.x release, this is the default setting. The HPL benchmark is found to generate better performance numbers using the following setting for multi-threaded AOCL-BLAS:

    #define AOCL_BLIS_MULTIINSTANCE 0
    

1st Gen AMD EPYC™ Processors (codenamed “Naples”)

  1. Open the file bli_family_zen.h in the AOCL-BLAS source:

    config/zen/bli_family_zen.h

  2. Ensure that the macro BLIS_ENABLE_ZEN_BLOCK_SIZES is defined:

    #define BLIS_ENABLE_ZEN_BLOCK_SIZES
    

Multi-Instance Mode

For applications/benchmarks running in multi-instance mode, ensure that the macro BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES is set to 0. As of the AOCL 2.x release, this is the default setting:

#define BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES 0

The optimal block sizes for this mode on AMD EPYC™ are defined in the file config/zen/bli_cntx_init_zen.c:

bli_blksz_init_easy( &blkszs[ BLIS_MC ],  144,  240,  144,   72 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ],  256,  512,  256,  256 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4080, 2040, 4080, 4080 );

Single-Instance Mode

For the applications running as a single instance, ensure that the macro BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES is set to 1:

#define BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES 1

The optimal block sizes for this mode on AMD EPYC™ are defined in the file config/zen/bli_cntx_init_zen.c:

bli_blksz_init_easy( &blkszs[ BLIS_MC ],  144,  510,  144,   72 );
bli_blksz_init_easy( &blkszs[ BLIS_KC ],  256, 1024,  256,  256 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4080, 4080, 4080, 4080 );
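To get a feel for what the two settings mean in memory terms, the following sketch computes the size of one MC x KC block of double-precision data. It rests on two assumptions that are not stated in this guide: that the second numeric argument of bli_blksz_init_easy is the double-precision value (following the s, d, c, z ordering used in upstream BLIS), and that MC x KC elements form one block. The function name block_bytes_d is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by one MC x KC block of double-precision elements. */
size_t block_bytes_d(size_t mc, size_t kc)
{
    assert(mc > 0 && kc > 0);
    return mc * kc * sizeof(double);
}
```

Under these assumptions, and with 8-byte doubles, the multi-instance values (MC = 240, KC = 512) give 983040 bytes per block, while the single-instance values (MC = 510, KC = 1024) give 4177920 bytes, roughly four times as much. The larger block fits the case where a single instance does not have to share the cache with other instances.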

2.5. Performance Suggestions for Skinny Matrices

AOCL-BLAS provides selective packing for GEMM when one or two dimensions of a matrix are exceedingly small. Selective packing is applicable only when sup is enabled. For optimal performance:

# C = beta*C + alpha*A*B
# Dimension (Dim) of A - m x k
# Dimension (Dim) of B - k x n
# Dimension (Dim) of C - m x n
# Assume all are stored in row-major format.

# If m >> n, packing A will give a better performance:
$ BLIS_PACK_A=1 ./test_gemm_blis.x

# If m << n, packing B will give a better performance:
$ BLIS_PACK_B=1 ./test_gemm_blis.x
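The rule above can be written as a small decision helper, sketched here for illustration. The guide does not say how much larger one dimension must be to count as "much larger", so the ratio parameter is a hypothetical threshold chosen by the caller. The names pack_choice, choose_packing, and pack_env_name are also hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PACK_NONE, PACK_A, PACK_B } pack_choice;

/* Pick a packing hint from the shape of C (m x n).
 * ratio is a caller-chosen stand-in for ">>" and "<<". */
pack_choice choose_packing(long m, long n, long ratio)
{
    assert(m > 0 && n > 0 && ratio > 1);
    if (m >= ratio * n)
        return PACK_A;   /* m >> n */
    if (n >= ratio * m)
        return PACK_B;   /* m << n */
    return PACK_NONE;
}

/* Name of the environment variable to set to 1, or NULL for none. */
const char *pack_env_name(pack_choice c)
{
    switch (c) {
    case PACK_A: return "BLIS_PACK_A";
    case PACK_B: return "BLIS_PACK_B";
    default:     return NULL;
    }
}
```

With a threshold of 8, for example, m = 10000 and n = 16 selects BLIS_PACK_A, whereas a roughly square problem selects neither. Whether packing actually helps for a given shape is best confirmed by measurement.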