2. AOCL-BLAS
2.1. AOCL-BLAS Thread Control
Multi-threaded builds of AOCL-BLAS provide several mechanisms for setting the desired number of threads during initialization and at runtime, as explained below.
2.1.1. AOCL-BLAS Initialization
During AOCL-BLAS initialization, an application can set the preferred number of threads for the BLAS routines in any of the following ways:

1. bli_thread_set_num_threads(nt) AOCL-BLAS library API
2. Valid value of the BLIS_NUM_THREADS environment variable
3. omp_set_num_threads(nt) OpenMP library API
4. Valid value of the OMP_NUM_THREADS environment variable

If the application does none of these, the default behaviour is compiler-dependent, but the preferred number of threads is often the number of logical cores enabled on the node.
If the number of threads is set in more than one of these ways, AOCL-BLAS applies the order of precedence listed above.
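The precedence rule can be modelled in a few lines of C. This is only a sketch of the rule as stated here, not the library's implementation: the function names are hypothetical, a value of 0 stands for "not issued", and treating a non-numeric or non-positive environment value as "not set" is an assumption of this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a thread-count string such as the value of BLIS_NUM_THREADS.
 * Returns the count if the text is a positive integer, otherwise 0,
 * which this sketch treats as "not set" (an assumption). */
int parse_thread_count(const char *text)
{
    if (text == NULL || *text == '\0')
        return 0;
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (*end != '\0' || value <= 0 || value > INT_MAX)
        return 0;
    return (int)value;
}

/* Pick the preferred thread count in the order listed above.
 * An API argument of 0 means the corresponding call was never issued. */
int preferred_threads(int bli_api_nt, const char *blis_env,
                      int omp_api_nt, const char *omp_env,
                      int runtime_default)
{
    assert(runtime_default > 0);
    if (bli_api_nt > 0)                     /* bli_thread_set_num_threads(nt) */
        return bli_api_nt;
    int nt = parse_thread_count(blis_env);  /* BLIS_NUM_THREADS */
    if (nt > 0)
        return nt;
    if (omp_api_nt > 0)                     /* omp_set_num_threads(nt) */
        return omp_api_nt;
    nt = parse_thread_count(omp_env);       /* OMP_NUM_THREADS */
    if (nt > 0)
        return nt;
    return runtime_default;                 /* nothing was set */
}
```

For example, preferred_threads(0, "8", 16, "4", 64) models an application that calls omp_set_num_threads(16) and is launched with BLIS_NUM_THREADS=8 and OMP_NUM_THREADS=4. In a real program the strings would come from getenv().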
The following tables describe sample scenarios for setting the number of threads during AOCL-BLAS initialization for the respective codes:
    #include <omp.h>

    int main(void)
    {
        // Pseudo code: use the OpenMP API to set the number of threads
        omp_set_num_threads(16);
        dgemm_( /* arguments omitted */ );
        // ...
        return 0;
    }
| Sample Command Executed | Number of Threads Set During AOCL-BLAS Initialization | Remarks |
|---|---|---|
| | 8 | |
| | 16 | |
| | 16 | |
| | 8 | |
    int main(void)
    {
        // Pseudo code: the application does not set the number of threads
        dgemm_( /* arguments omitted */ );
        // ...
        return 0;
    }
| Sample Command Executed | Number of Threads Set During AOCL-BLAS Initialization | Remarks |
|---|---|---|
| | 8 | |
| | 64 | |
| | 4 | |
2.1.2. Runtime
Once the number of threads is set during AOCL-BLAS initialization, it is used in subsequent BLAS routine executions until the application modifies the number of threads to be used (for example, using the omp_set_num_threads() API).
The following table describes sample scenarios for setting the number of threads during runtime:
    #include <omp.h>

    int main(void)
    {
        // Pseudo code: sample usage of the OpenMP API to set the
        // number of threads in the application during runtime
        do {
            if (m < 500)
                omp_set_num_threads(8);
            else if (m < 3000)
                omp_set_num_threads(16);
            else
                omp_set_num_threads(32);
            dgemm_( /* arguments omitted */ );
        } while (test_case_counter--);
        // ...
        return 0;
    }
| Sample Command Executed | m | Number of Threads for this BLAS Call | Remarks |
|---|---|---|---|
| | 100 | 8 | Application issued |
| | 500 | 16 | Application issued |
| | 200 | 8 | Application re-issued |
| | 4000 | 32 | Application issued |
| | 1000 | 16 | Application re-issued |
| | 500 | 16 | Application re-issued |
| | 100 | 8 | Application re-issued |
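The size-based selection in the pseudo code can be isolated into a small helper, which makes the thresholds easy to check against the table. This is a minimal sketch; the function name threads_for_m is hypothetical, and the thresholds (500 and 3000) and thread counts (8, 16, 32) are simply those of the sample application, not tuned recommendations.

```c
#include <assert.h>

/* Thread count chosen by the sample application for a given m,
 * using the same thresholds as the pseudo code above. */
int threads_for_m(int m)
{
    assert(m >= 0);
    if (m >= 3000)
        return 32;
    if (m >= 500)
        return 16;
    return 8;
}
```

An application would pass the result to omp_set_num_threads() before each BLAS call, for example omp_set_num_threads(threads_for_m(m)).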
2.1.3. Runtime Thread Control
AOCL-BLAS libraries that are multi-threaded using OpenMP provide two mechanisms for controlling the number of threads that AOCL-BLAS functions use: the standard OpenMP mechanisms, and AOCL-BLAS specific environment variables and function calls. The AOCL-BLAS specific mechanisms can either set the overall number of threads for AOCL-BLAS to use, or set the threading separately for the different loops within the AOCL-BLAS Level 3 routines (for example, DGEMM). These are called the automatic and the manual ways, respectively. For more information, refer to Multithreading.md.
The order of precedence used in AOCL-BLAS, among the settings that the user has set or called, is as follows:

1. The AOCL-BLAS manual way values set using bli_thread_set_ways() by the application.
2. Valid value(s) of any of the BLIS_*_NT environment variables.
3. Value set using bli_thread_set_num_threads(nt) by the application.
4. Valid value set for the environment variable BLIS_NUM_THREADS.
5. omp_set_num_threads(nt) issued by the application.
6. Valid value set for the environment variable OMP_NUM_THREADS.
7. The default number of threads used by the chosen OpenMP runtime library when OMP_NUM_THREADS is not set.
Two other factors may override these settings:

- OpenMP parallelism at higher level(s) in the code calling AOCL-BLAS, that is, the number of active levels and the level at which the AOCL-BLAS call occurs.
- The effect of AOCL Dynamic (if enabled), as described in the next section.
Note
From AOCL 4.1, support for calling AOCL-BLAS within nested OpenMP parallelism has been improved. Hence, using the standard OpenMP mechanisms should be sufficient for most of the use cases.
2.2. AOCL Dynamic
The AOCL Dynamic feature enables AOCL-BLAS to dynamically change the number of threads.
This feature is enabled by default. It can be enabled or disabled at configuration time using the options --enable-aocl-dynamic and --disable-aocl-dynamic, respectively.
You can also specify the preferred number of threads using the environment variables BLIS_NUM_THREADS or OMP_NUM_THREADS. If both are specified, BLIS_NUM_THREADS takes precedence.
The following table summarizes how the number of threads is determined based on the status of AOCL Dynamic and the user configuration using the variable BLIS_NUM_THREADS:
| AOCL Dynamic | BLIS_NUM_THREADS | Number of Threads Used by AOCL-BLAS |
|---|---|---|
| Disabled | Unset | Number of logical cores. |
| Disabled | Set | |
| Enabled | Unset | Number of threads determined by AOCL Dynamic. |
| Enabled | Set | Minimum of |
2.2.1. Limitations
The AOCL Dynamic feature has the following limitations:

- Supported only for threading using OpenMP.
- Supports only the DGEMM, ZGEMM, DTRSM, ZTRSM, DGEMMT, DSYRK, DTRMM, SGEMV, DSCAL, ZDSCAL, DDOT, DNRM2, DZNRM2, and DAXPY APIs.
- Specifying more threads than the number of cores may result in deteriorated performance because of over-subscription of cores.
- Based on the input parameters (such as size, transpose, and storage format), the optimal code path for the given number of threads is executed. This path can be single-threaded even if the number of threads set is more than 1.
2.3. AOCL-BLAS Multi-Thread Tuning
The AOCL-BLAS library can be used on multiple platforms and in multiple applications. Multi-threading adds more configuration options at runtime. This section explains the number-of-threads and CPU affinity settings that can be tuned to get the best performance for your requirements.
2.3.1. Library Usage Scenarios
The application and library are single-threaded:
This is straightforward and needs no special instructions. You can export BLIS_NUM_THREADS=1 to indicate that you are running AOCL-BLAS in single-threaded mode. If both BLIS_NUM_THREADS and OMP_NUM_THREADS are set, the former takes precedence over the latter.

The application is single-threaded and the library is multi-threaded:
You can use either OMP_NUM_THREADS or BLIS_NUM_THREADS to define the number of threads for the library. However, it is recommended that you use BLIS_NUM_THREADS, especially if you wish to set a value for AOCL-BLAS that differs from the one used by OpenMP parallel regions in your application program or in other libraries. Example:

    $ export BLIS_NUM_THREADS=128 # Here, AOCL-BLAS runs at 128 threads.

Apart from setting the number of threads, you must pin the threads to the cores using GOMP_CPU_AFFINITY or numactl as follows:

    $ BLIS_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-127 <./application>

Or

    $ BLIS_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-127 numactl --interleave=all <./application>
    $ BLIS_NUM_THREADS=128 OMP_PROC_BIND=close numactl -C 0-127 --interleave=all <./test_application>

Note
For the Clang compiler, it is mandatory to set OMP_PROC_BIND=true in addition to the thread pinning (if numactl is used). For example, for a matrix size of 200 and 32 threads, DGEMM performance is lower when run without an OMP_PROC_BIND setting and improves once OMP_PROC_BIND=true is set. This problem is not observed with libgomp using the gcc compiler, where the processor affinity defined using numactl is sufficient. We advise always setting OMP_PROC_BIND when using numactl.

The application is multi-threaded and the library is single-threaded:
When the application is multi-threaded and the number of threads is set using OMP_NUM_THREADS, it is mandatory to set BLIS_NUM_THREADS to one. Otherwise, AOCL-BLAS runs in multi-threaded mode with the number of threads equal to OMP_NUM_THREADS, which may result in poor performance.

The application and library are both multi-threaded:
This is a typical scenario of nested parallelism. Whether nested parallelism occurs depends on the number of levels of parallelism active in the OpenMP runtime. This number can be queried using the OpenMP API call omp_get_max_active_levels and set using the environment variable OMP_MAX_ACTIVE_LEVELS or the API call omp_set_max_active_levels.
Assuming multiple levels are active, you can control the threading at the application level and at the AOCL-BLAS library level individually in one of the following ways:

- Use both OMP_NUM_THREADS and BLIS_NUM_THREADS.
  - The number of threads launched by the application is OMP_NUM_THREADS.
  - Each application thread spawns BLIS_NUM_THREADS threads.
  - For better performance, ensure that Number of Physical Cores = OMP_NUM_THREADS * BLIS_NUM_THREADS.

  Thread pinning for the application and the library can be done using OMP_PROC_BIND:

      $ OMP_NUM_THREADS=4 BLIS_NUM_THREADS=8 OMP_PROC_BIND=spread,close <./application>

  At the outer level, the threads are spread; at the inner level, the threads are scheduled close to their master threads.
- Use more advanced options for OMP_NUM_THREADS, for example:

      $ OMP_NUM_THREADS=4,8 OMP_PROC_BIND=spread,close <./application>

- Use the OpenMP API call omp_set_num_threads within the application code to set the number of threads to be used in subsequent library calls:

      omp_set_num_threads(8);
      dgemm_("N","N",&M,&N,&K,&alpha,a,&lda,b,&ldb,&beta,c,&ldc);
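The sizing rule for nested parallelism (physical cores = OMP_NUM_THREADS * BLIS_NUM_THREADS) can be turned into a small calculation. The sketch below uses a hypothetical helper name, and rounding down when the core count is not an exact multiple is a choice made for this example rather than something the guide prescribes.

```c
#include <assert.h>

/* Given the physical core count and the number of application-level
 * (outer) threads, return a per-thread library (inner) count such that
 * outer * inner does not exceed the core count whenever
 * outer <= physical_cores. */
int inner_threads_for(int physical_cores, int outer_threads)
{
    assert(physical_cores > 0 && outer_threads > 0);
    int inner = physical_cores / outer_threads;
    return inner > 0 ? inner : 1;   /* never return fewer than one thread */
}
```

With 4 application threads on a 32-core machine (an assumed core count that matches the 4 x 8 example above), the helper returns 8, which would be the value to export as BLIS_NUM_THREADS.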
2.4. AOCL-BLAS Level-3 Block-Size Tuning
The performance of AOCL-BLAS level-3 operations such as DGEMM and DTRSM is largely determined by the block sizes used by AOCL-BLAS. A matrix-matrix multiplication with large m, n, and k dimensions is partitioned into sub-problems of the specified block sizes.
Many HPC and scientific applications and benchmarks run on high-end clusters of machines, each with multiple cores. They run as multiple instances, either through Message Passing Interface (MPI) based APIs or as separate instances of each program. The specified block sizes affect overall performance differently depending on whether the application using AOCL-BLAS runs in multi-instance mode or as a single instance.
The default block sizes in the AOCL-BLAS GitHub repository (amd/blis) are set to extract the best performance for such HPC applications/benchmarks, which use single-threaded AOCL-BLAS and run in multi-instance mode on AMD EPYC™ processors with “Zen” cores. However, if your application runs as a single instance, the optimal block sizes would differ.
The following settings will help you choose the optimal values for the block sizes based on the way the application is run:
2nd Gen AMD EPYC™ Processors (codenamed “Rome”)
Open the file bli_family_zen2.h in the config/zen2 directory of the AOCL-BLAS source:

    $ cd config/zen2

For applications/benchmarks running in multi-instance mode and using multi-threaded AOCL-BLAS, ensure that the macro AOCL_BLIS_MULTIINSTANCE is set to 0. As of the AOCL 2.x release, this is the default setting. The HPL benchmark is found to generate better performance numbers using the following setting for multi-threaded AOCL-BLAS:

    #define AOCL_BLIS_MULTIINSTANCE 0

1st Gen AMD EPYC™ Processors (codenamed “Naples”)
Open the file bli_family_zen.h in the config/zen directory of the AOCL-BLAS source:

    $ cd config/zen

Ensure that the macro BLIS_ENABLE_ZEN_BLOCK_SIZES is defined:

    #define BLIS_ENABLE_ZEN_BLOCK_SIZES

Multi-Instance Mode
For applications/benchmarks running in multi-instance mode, ensure that the macro BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES is set to 0. As of the AOCL 2.x release, the following is the default setting:

    #define BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES 0

The optimal block sizes for this mode on AMD EPYC™ are defined in the file config/zen/bli_cntx_init_zen.c:

    bli_blksz_init_easy( &blkszs[ BLIS_MC ], 144, 240, 144, 72 );
    bli_blksz_init_easy( &blkszs[ BLIS_KC ], 256, 512, 256, 256 );
    bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4080, 2040, 4080, 4080 );

Single-Instance Mode
For applications running as a single instance, ensure that the macro BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES is set to 1:

    #define BLIS_ENABLE_SINGLE_INSTANCE_BLOCK_SIZES 1

The optimal block sizes for this mode on AMD EPYC™ are defined in the file config/zen/bli_cntx_init_zen.c:

    bli_blksz_init_easy( &blkszs[ BLIS_MC ], 144, 510, 144, 72 );
    bli_blksz_init_easy( &blkszs[ BLIS_KC ], 256, 1024, 256, 256 );
    bli_blksz_init_easy( &blkszs[ BLIS_NC ], 4080, 4080, 4080, 4080 );
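To get a feel for how the block sizes partition a problem, the following sketch counts the sub-problems for a given m, n, and k. It assumes that the four numeric arguments of bli_blksz_init_easy are the single, double, single-complex, and double-complex values in that order, so the double-precision sizes are MC=240, KC=512, NC=2040 in multi-instance mode and MC=510, KC=1024, NC=4080 in single-instance mode. The helper names are hypothetical, and the count says nothing about how AOCL-BLAS schedules the blocks.

```c
#include <assert.h>

/* Number of blocks of size blk needed to cover a dimension of length dim
 * (ceiling division); the last block may be partial. */
long num_blocks(long dim, long blk)
{
    assert(dim >= 0 && blk > 0);
    return (dim + blk - 1) / blk;
}

/* Total number of MC x KC x NC sub-problems for an m x n x k product,
 * with m partitioned by MC, n by NC, and k by KC. */
long num_subproblems(long m, long n, long k, long mc, long nc, long kc)
{
    return num_blocks(m, mc) * num_blocks(n, nc) * num_blocks(k, kc);
}
```

For a 4096 x 4096 x 4096 DGEMM, num_subproblems(4096, 4096, 4096, 240, 2040, 512) models the multi-instance sizes and num_subproblems(4096, 4096, 4096, 510, 4080, 1024) the single-instance sizes. The larger single-instance blocks produce fewer, bigger sub-problems.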
2.5. Performance Suggestions for Skinny Matrices
AOCL-BLAS provides selective packing for GEMM when one or two dimensions of a matrix are exceedingly small. Selective packing is applicable only when sup is enabled. For optimal performance:
    # C = beta*C + alpha*A*B
    # Dimension (Dim) of A - m x k
    # Dimension (Dim) of B - k x n
    # Dimension (Dim) of C - m x n
    # Assume all are stored in row-major format.
    # If m >> n, packing A gives better performance:
    $ BLIS_PACK_A=1 ./test_gemm_blis.x
    # If m << n, packing B gives better performance:
    $ BLIS_PACK_B=1 ./test_gemm_blis.x
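The m >> n and m << n rule can be written as a small decision helper. The sketch below is illustrative only: the names are hypothetical, and the ratio parameter is an invented threshold for "much larger", since the guide gives no number. Choose it by measurement on your own workload.

```c
#include <assert.h>

/* Which packing variable to try, following the rule above. */
typedef enum { PACK_NONE, PACK_A, PACK_B } pack_hint;

/* ratio is a hypothetical threshold for "much larger than";
 * it is not an AOCL-BLAS parameter. */
pack_hint suggest_packing(long m, long n, long ratio)
{
    assert(m > 0 && n > 0 && ratio > 1);
    if (m / n >= ratio)
        return PACK_A;      /* m >> n: try BLIS_PACK_A=1 */
    if (n / m >= ratio)
        return PACK_B;      /* m << n: try BLIS_PACK_B=1 */
    return PACK_NONE;       /* comparable sizes: no suggestion */
}
```

With an assumed ratio of 8, suggest_packing(10000, 16, 8) returns PACK_A and suggest_packing(16, 10000, 8) returns PACK_B. The environment variable itself still has to be set when launching the program.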