The AOCL-BLAS library can be used on multiple platforms and in many applications. Multi-threading adds more configuration options at run time. This section explains the thread-count and CPU-affinity settings that you can tune to get the best performance in different usage scenarios.
The application and library are single-threaded:
This is straightforward: no special instructions are needed. You can export BLIS_NUM_THREADS=1 to indicate that you are running AOCL-BLAS in single-threaded mode. If both BLIS_NUM_THREADS and OMP_NUM_THREADS are set, the former takes precedence over the latter.
The application is single-threaded and the library is multi-threaded:
You can use either OMP_NUM_THREADS or BLIS_NUM_THREADS to define the number of threads for the library. However, it is recommended that you use BLIS_NUM_THREADS, especially if you want AOCL-BLAS to use a different thread count from the OpenMP parallel regions in your application program or in other libraries.
Example:
$ export BLIS_NUM_THREADS=128 # Here, AOCL-BLAS runs with 128 threads.
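The precedence between the two variables described in this section can be sketched as a small shell function. This is only an illustration of the documented rule, not how the library itself is implemented. The function name effective_blas_threads is hypothetical, and the case where neither variable is set is deliberately not modeled because the default is defined by the library.

```shell
#!/bin/sh
# Hypothetical helper (not part of AOCL-BLAS): report the thread count that
# the documented precedence rule would select for the library.
effective_blas_threads() {
    if [ -n "${BLIS_NUM_THREADS:-}" ]; then
        echo "$BLIS_NUM_THREADS"   # BLIS_NUM_THREADS takes precedence when set
    elif [ -n "${OMP_NUM_THREADS:-}" ]; then
        echo "$OMP_NUM_THREADS"    # otherwise OMP_NUM_THREADS applies
    else
        echo "unset"               # neither set: library default, not modeled here
    fi
}

# Usage: run in a subshell so the caller's environment is left untouched.
( BLIS_NUM_THREADS=128; OMP_NUM_THREADS=16; effective_blas_threads )   # prints 128
```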
Apart from setting the number of threads, you must pin the threads to the cores using GOMP_CPU_AFFINITY or numactl as follows:
$ BLIS_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-127 <./application>
Or
$ BLIS_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-127 numactl --i=all <./application>
$ BLIS_NUM_THREADS=128 OMP_PROC_BIND=close numactl -C 0-127 --interleave=all <./test_application>
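The core list in these commands has to match the thread count. A minimal sketch of deriving one from the other follows. The helper name cpu_range is hypothetical, ./application is a placeholder, and the sketch assumes the cores you want are numbered contiguously from 0, which may not hold on every system.

```shell
#!/bin/sh
# Hypothetical helper: build a CPU list such as "0-127" for a given thread count.
# Assumption: the target cores are numbered contiguously starting at 0.
cpu_range() {
    threads=$1
    echo "0-$((threads - 1))"
}

# Print (rather than run) a launch line that pins one thread per core.
threads=128
echo "BLIS_NUM_THREADS=$threads GOMP_CPU_AFFINITY=$(cpu_range "$threads") ./application"
```

Printing the command first makes it easy to check the affinity range before launching the real workload.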
Note
For the Clang compiler, it is mandatory to use OMP_PROC_BIND=true in addition to the thread pinning (if numactl is used). For example, for a matrix size of 200 and 32 threads, DGEMM performance is lower if you run it without setting OMP_PROC_BIND, and it improves when you set OMP_PROC_BIND=true. This problem is not observed with the GCC libgomp OpenMP runtime. For the GCC compiler, the processor affinity defined using numactl is sufficient. It is recommended that you always set OMP_PROC_BIND when using numactl.
The application is multi-threaded and the library is single-threaded:
When the application is multi-threaded and its number of threads is set using OMP_NUM_THREADS, it is mandatory to set BLIS_NUM_THREADS to one. Otherwise, AOCL-BLAS runs in multi-threaded mode with the number of threads equal to OMP_NUM_THREADS, which yields poor performance.
The application and library are both multi-threaded:
This is a typical scenario of nested parallelism. The number of levels of parallelism active in the OpenMP runtime determines whether nested parallelism occurs. This can be queried using the OpenMP API call omp_get_max_active_levels and set using the environment variable OMP_MAX_ACTIVE_LEVELS or the API call omp_set_max_active_levels.
Assuming multiple levels are active, to individually control the threading at the application level and at the AOCL-BLAS library level, you can do one of the following:
Use both OMP_NUM_THREADS and BLIS_NUM_THREADS.
The number of threads launched by the application is OMP_NUM_THREADS.
Each application thread spawns BLIS_NUM_THREADS threads.
For better performance, ensure that Number of Physical Cores = OMP_NUM_THREADS * BLIS_NUM_THREADS.
Thread pinning for the application and the library can be done using OMP_PROC_BIND:
$ OMP_NUM_THREADS=4 BLIS_NUM_THREADS=8 OMP_PROC_BIND=spread,close <./application>
At the outer level, the threads are spread apart; at the inner level, the threads are scheduled close to their master threads.
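Use more advanced options for OMP_NUM_THREADS, for example:
$ OMP_NUM_THREADS=4,8 OMP_PROC_BIND=spread,close <./application>
Here the comma-separated list gives the thread count for each nesting level: 4 at the outer level and 8 at the inner level.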
Use the OpenMP API call omp_set_num_threads within the application code to set the number of threads to be used in subsequent library calls:
omp_set_num_threads(8);
dgemm_("N","N",&M,&N,&K,&alpha,a,&lda,b,&ldb,&beta,c,&ldc);
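For the first option above, the rule Number of Physical Cores = OMP_NUM_THREADS * BLIS_NUM_THREADS can be checked with a short calculation before launching. The sketch below is illustrative only. The helper name blis_threads_per_app_thread is hypothetical, and the physical core count is passed in by hand because a generic tool may report logical rather than physical cores.

```shell
#!/bin/sh
# Hypothetical helper: given the physical core count and the number of
# application threads (OMP_NUM_THREADS), print the BLIS_NUM_THREADS value
# that satisfies: physical cores = OMP_NUM_THREADS * BLIS_NUM_THREADS.
blis_threads_per_app_thread() {
    cores=$1
    omp_threads=$2
    # Reject counts that are non-positive or do not divide the cores evenly.
    if [ "$omp_threads" -le 0 ] || [ $((cores % omp_threads)) -ne 0 ]; then
        echo "error: $omp_threads application threads do not evenly divide $cores cores" >&2
        return 1
    fi
    echo $((cores / omp_threads))
}

# 32 physical cores shared by 4 application threads.
blis_threads_per_app_thread 32 4   # prints 8
```

The result matches the earlier example, where OMP_NUM_THREADS=4 and BLIS_NUM_THREADS=8 together occupy 32 cores.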