4.4.3. AOCL-BLAS Multi-Thread Tuning

AOCL User Guide (57404)

Document ID: 57404
Release Date: 2025-12-29
Version: 5.2 English

The AOCL-BLAS library can be used on multiple platforms and in many kinds of applications. Multi-threading adds runtime configuration options. This section explains how to tune the number of threads and the CPU affinity settings to get the best performance in each of the following usage scenarios.

  • The application and library are single-threaded:

    This is straightforward; no special instructions are needed. You can export BLIS_NUM_THREADS=1 to indicate that AOCL-BLAS runs in single-threaded mode. If both BLIS_NUM_THREADS and OMP_NUM_THREADS are set, BLIS_NUM_THREADS takes precedence.
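
The precedence rule can be sketched as a small shell helper. The name effective_blas_threads is hypothetical, and the function only models the precedence described above; it does not query AOCL-BLAS itself.

```shell
# Hypothetical helper: models only the documented precedence between the two
# environment variables; it does not query the AOCL-BLAS library.
effective_blas_threads() {
    if [ -n "${BLIS_NUM_THREADS:-}" ]; then
        echo "$BLIS_NUM_THREADS"      # BLIS_NUM_THREADS wins when both are set
    elif [ -n "${OMP_NUM_THREADS:-}" ]; then
        echo "$OMP_NUM_THREADS"       # used only when BLIS_NUM_THREADS is unset
    else
        echo "unset"                  # neither set: the default is left to the library and runtime
    fi
}

export OMP_NUM_THREADS=8 BLIS_NUM_THREADS=1
effective_blas_threads                # prints 1
```

Calling the helper before launching an application is a quick way to confirm which of the two variables will apply.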

  • The application is single-threaded and the library is multi-threaded:

    You can use either OMP_NUM_THREADS or BLIS_NUM_THREADS to define the number of threads for the library. However, BLIS_NUM_THREADS is recommended, especially if AOCL-BLAS needs a different thread count than the OpenMP parallel regions in your application or in other libraries.

    Example:

    $ export BLIS_NUM_THREADS=128 # Here, AOCL-BLAS runs with 128 threads.
    

    Apart from setting the number of threads, you must pin the threads to the cores using GOMP_CPU_AFFINITY or numactl as follows:

    $ BLIS_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-127 <./application>
    
    Or

    $ BLIS_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-127 numactl --interleave=all <./application>

    Or

    $ BLIS_NUM_THREADS=128 OMP_PROC_BIND=close numactl -C 0-127 --interleave=all <./application>
    

    Note

    With the Clang compiler, you must set OMP_PROC_BIND=true in addition to thread pinning when numactl is used. For example, running DGEMM with a matrix size of 200 and 32 threads without any OMP_PROC_BIND setting gives lower performance; setting OMP_PROC_BIND=true improves it. This problem is not observed with the GCC libgomp OpenMP runtime, where the processor affinity defined using numactl is sufficient. It is recommended that you always set OMP_PROC_BIND when using numactl.
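
The pinning commands above pair N threads with the CPU list 0-(N-1). A minimal sketch of deriving that list from the thread count follows. The helper name affinity_range is hypothetical, and it assumes the target cores are numbered contiguously from 0, as in the examples above. The launch line is printed rather than executed because ./application is a placeholder.

```shell
# Hypothetical helper: builds the CPU list "0-(N-1)" for N threads.
# Assumes cores are numbered contiguously starting at 0.
affinity_range() {
    echo "0-$(( $1 - 1 ))"
}

threads=32
# Print the launch line instead of running it; ./application is a placeholder.
echo "BLIS_NUM_THREADS=$threads OMP_PROC_BIND=true numactl -C $(affinity_range "$threads") --interleave=all ./application"
# prints: BLIS_NUM_THREADS=32 OMP_PROC_BIND=true numactl -C 0-31 --interleave=all ./application
```

On systems where the cores of interest are not contiguous, the CPU list has to be written out by hand instead.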

  • The application is multi-threaded and the library is single-threaded:

    When the application is multi-threaded and its number of threads is set using OMP_NUM_THREADS, you must set BLIS_NUM_THREADS to 1. Otherwise, AOCL-BLAS runs in multi-threaded mode with the number of threads equal to OMP_NUM_THREADS, which yields poor performance.
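
A launch script can guard against this case. The sketch below uses a hypothetical function name, default_blas_single_thread, and assumes the script is the place where the environment is prepared. It sets BLIS_NUM_THREADS to 1 only when OMP_NUM_THREADS is set and BLIS_NUM_THREADS is not.

```shell
# Hypothetical guard for a launch script: if OMP_NUM_THREADS is set and
# BLIS_NUM_THREADS is not, force AOCL-BLAS to a single thread.
default_blas_single_thread() {
    if [ -n "${OMP_NUM_THREADS:-}" ] && [ -z "${BLIS_NUM_THREADS:-}" ]; then
        export BLIS_NUM_THREADS=1
    fi
}

export OMP_NUM_THREADS=16
unset BLIS_NUM_THREADS
default_blas_single_thread
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS BLIS_NUM_THREADS=$BLIS_NUM_THREADS"
# prints: OMP_NUM_THREADS=16 BLIS_NUM_THREADS=1
```

An explicit BLIS_NUM_THREADS value is left untouched, so the guard does not interfere with the nested-parallelism setup described in the next scenario.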

  • The application and library are both multi-threaded:

    This is a typical scenario of nested parallelism. The number of levels of parallelism active in the OpenMP runtime determines whether nested parallelism occurs. This number can be queried using the OpenMP API call omp_get_max_active_levels and set using the environment variable OMP_MAX_ACTIVE_LEVELS or the API call omp_set_max_active_levels.

    Assuming multiple levels are active, to control the threading individually at the application level and at the AOCL-BLAS library level, you can either:
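
As a rough pre-launch check, the environment variable can be inspected from the shell. The helper name nested_levels_status is hypothetical. It reads only OMP_MAX_ACTIVE_LEVELS and assumes the value, if set, is a plain integer; the application can still change the setting at run time through omp_set_max_active_levels.

```shell
# Hypothetical check: reports whether OMP_MAX_ACTIVE_LEVELS permits a second
# (nested) level of parallelism. Reads the environment variable only.
nested_levels_status() {
    levels=${OMP_MAX_ACTIVE_LEVELS:-}
    if [ -z "$levels" ]; then
        echo "unset (runtime default applies)"
    elif [ "$levels" -ge 2 ]; then
        echo "nested parallelism allowed"
    else
        echo "nested regions serialized"
    fi
}

export OMP_MAX_ACTIVE_LEVELS=2
nested_levels_status                  # prints: nested parallelism allowed
```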

    • Use both OMP_NUM_THREADS and BLIS_NUM_THREADS.

      • The number of threads launched by the application is OMP_NUM_THREADS.

      • Each application thread spawns BLIS_NUM_THREADS threads.

      • For better performance, ensure that the number of physical cores equals OMP_NUM_THREADS * BLIS_NUM_THREADS.

      Thread pinning for the application and the library can be done using OMP_PROC_BIND:

      $ OMP_NUM_THREADS=4 BLIS_NUM_THREADS=8 OMP_PROC_BIND=spread,close <./application>
      

      At the outer level, the threads are spread across the available cores; at the inner level, the threads are scheduled close to their master threads.

    • Use a comma-separated list in OMP_NUM_THREADS to set the number of threads for each nesting level, for example:

      $ OMP_NUM_THREADS=4,8 OMP_PROC_BIND=spread,close <./application>
      
    • Use the OpenMP API call omp_set_num_threads within the application code to set the number of threads to be used in subsequent library calls:

      omp_set_num_threads(8);  /* thread count for the library calls that follow */
      dgemm_("N","N",&M,&N,&K,&alpha,a,&lda,b,&ldb,&beta,c,&ldc);
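
The sizing rule for nested parallelism (number of physical cores = OMP_NUM_THREADS * BLIS_NUM_THREADS) can be checked before launch. The sketch below uses a hypothetical helper, check_nested_sizing, which takes the physical core count and the two thread counts as plain integers; it does not handle the comma-separated form of OMP_NUM_THREADS.

```shell
# Hypothetical helper: checks the sizing rule
#   physical cores = OMP_NUM_THREADS * BLIS_NUM_THREADS
# Arguments: physical core count, outer threads, inner threads (plain integers).
check_nested_sizing() {
    cores=$1; outer=$2; inner=$3
    total=$((outer * inner))
    if [ "$total" -eq "$cores" ]; then
        echo "ok: $outer x $inner = $cores cores"
    elif [ "$total" -gt "$cores" ]; then
        echo "oversubscribed: $outer x $inner = $total threads on $cores cores"
    else
        echo "underused: $outer x $inner = $total threads on $cores cores"
    fi
}

check_nested_sizing 32 4 8            # prints: ok: 4 x 8 = 32 cores
```

On Linux, one common way to obtain the physical core count for the first argument is to count unique core and socket pairs reported by lscpu, for example with lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l. That command is an assumption about the host environment and is not part of AOCL.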