1. Introduction#
AMD Optimizing CPU Libraries (AOCL) are a set of numerical libraries optimized for AMD “Zen”-based processors, including EPYC™, Ryzen™ Threadripper™, and Ryzen™. This document provides performance tuning recommendations and guidance on optimization flags for advanced users of AOCL to experiment with.
AOCL comprises the following libraries:
AOCL-BLAS is a portable software framework for performing high-performance Basic Linear Algebra Subprograms (BLAS) functionality.
AOCL-LAPACK is a portable library for dense matrix computations that provides the functionality present in the Linear Algebra Package (LAPACK).
AOCL-FFTW (Fastest Fourier Transform in the West) is a comprehensive collection of fast C routines for computing the Discrete Fourier Transform (DFT) and various special cases.
AOCL-LibM is a software library containing elementary math functions optimized for x86-64 processor based machines.
AOCL-Utils is a library which provides APIs to check the available CPU features/flags, cache topology, and so on of AMD “Zen”-based CPUs.
AOCL-ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed memory machines. It depends on external libraries including BLAS and LAPACK for linear algebra computations.
AOCL-RNG is a library that provides a set of pseudo-random number generators, quasi-random number generators, and statistical distribution functions optimized for AMD “Zen”-based processors.
AOCL-SecureRNG is a library that provides APIs to access the cryptographically secure random numbers generated by the AMD hardware random number generator.
AOCL-Sparse is a library containing the basic linear algebra subroutines for sparse matrices and vectors optimized for AMD “Zen”-based CPUs.
AOCL-LibMem is AMD’s optimized implementation of memory manipulation functions for AMD “Zen”-based CPUs.
AOCL-Cryptography is AMD’s optimized implementation of cryptographic functions.
AOCL-Compression is a software framework of various lossless data compression and decompression methods tuned and optimized for AMD “Zen”-based CPUs.
AOCL-DA is a data analytics library providing optimized building blocks for data analysis and classical machine learning problems.
For information on installing and using all the AMD optimized libraries, refer to the AOCL User Guide under the “Documentation” section of AMD Developer Central (https://www.amd.com/en/developer/aocl.html).
1.1. General Tuning Guidelines#
This section covers common tuning guidelines applicable to most AOCL libraries.
AOCL performance may be affected by system configuration settings. The documents in the next table give a useful overview of some of the architectural, BIOS and operating system details for different generations of AMD EPYC™ servers.
Architecture | Processor family | Tuning guide
Zen5 | AMD EPYC 9005 (“Turin”) | To be added when available
Zen4 | AMD EPYC 9004 (“Genoa”, “Bergamo”) |
Zen4 | AMD EPYC 8004 (“Siena”) |
Zen3 | AMD EPYC 7003 (“Milan”) |
Zen2 | AMD EPYC 7002 (“Rome”) |
Zen | AMD EPYC 7001 (“Naples”) |
A key aspect to consider is the interaction between any parallelism at the application level and parallelism within specific AOCL libraries, and how this maps to the NUMA regions in the hardware.
Note
It is vital to remember that if an OpenMP parallel library is used and OpenMP parameters are not explicitly set (via environment variables or API calls), the default behaviour is compiler-dependent but often uses all logical cores enabled on the node. Explicitly setting OMP_NUM_THREADS (as a minimum) is highly recommended.
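As a minimal sketch, the thread count being purely illustrative, the simplest way to cap library threading for a single run is:
$ OMP_NUM_THREADS=8 <./application>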
Different parallelism scenarios to consider:
The application and library are single-threaded:
This is straightforward - no special instructions are needed.
The application is single-threaded and the library is multi-threaded:
You can either use the OMP_NUM_THREADS environment variable or make calls to the omp_set_num_threads() C/C++/Fortran API if it is desirable to vary the number of threads to be used for different parts of your application. For simplicity, we will only consider the simple case in all further scenarios.
The operating system may migrate threads between cores during program execution, which can reduce performance. This can be prevented in a number of ways by binding threads to cores using environment variables or by using the Linux command numactl to launch the application.
$ OMP_NUM_THREADS=8 OMP_PROC_BIND=close OMP_PLACES=cores <./application>
$ OMP_NUM_THREADS=8 OMP_PROC_BIND=close OMP_PLACES={24:8} <./application>
$ OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY=24-31 <./application>
$ OMP_NUM_THREADS=8 numactl -C 24-31 <./application>
The last three options specify the exact logical cores in the system to bind to.
Note
numactl defines affinity places and threads might context-switch between the cores defined by numactl. To avoid context switching you should also set OMP_PROC_BIND.
The application is multi-threaded and the library is single-threaded:
You should set the threading options desired for the application-level parallelism, as shown in the previous section.
The application and library are both multi-threaded:
This is a typical scenario of nested parallelism. The default number of active levels of parallelism depends upon the compiler but is often just one. Control over the number of active levels and the desired amount of parallelism at each level can be done using more advanced options for the OMP_NUM_THREADS environment variable and the OMP_MAX_ACTIVE_LEVELS environment variable. Consult the OpenMP specification for further details.
Thread pinning for the application and the library can be done using more advanced options for OMP_PROC_BIND, e.g.:
export OMP_PROC_BIND=spread,close
At an outer (application) level, the threads are spread and at the inner (AOCL library) level, the threads are scheduled closer to their master threads. More complicated scenarios could involve multiple levels of parallelism within the application and/or libraries. Finer-grained control is possible by adding suitable OpenMP API calls within the application.
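As an illustrative sketch of such a nested setup (the thread counts and the cores place list are assumptions, not recommendations): four application-level threads, each spawning eight library-level threads, with the outer level spread across the places and the inner level kept close to its master thread:
export OMP_MAX_ACTIVE_LEVELS=2
export OMP_NUM_THREADS=4,8
export OMP_PROC_BIND=spread,close
export OMP_PLACES=cores
<./application>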
Multiple copies of the application running simultaneously:
This could be independently started copies (aka “task farming”) or an application running multiple processes connected via MPI. The same considerations apply as in earlier scenarios regarding controlling any threading within the application or any libraries called. In general, you should spread the application processes across the cores available while keeping threads spawned close, i.e., on adjacent cores.
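As a sketch of the task-farming case (the core ranges and thread counts are illustrative and assume the first 16 cores are free), two independent copies can be launched on disjoint sets of adjacent cores so that their threads do not compete for the same cores:
$ OMP_NUM_THREADS=8 OMP_PROC_BIND=close numactl -C 0-7 <./application> &
$ OMP_NUM_THREADS=8 OMP_PROC_BIND=close numactl -C 8-15 <./application> &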
Note
If simple options for setting OMP_PLACES are used, e.g.
OMP_PROC_BIND=close OMP_PLACES=cores
then all the threads for all processes may map to the same subset of cores, leaving other cores idle. This can be avoided by using the other options, i.e. using specific core numbers when setting OMP_PLACES or using numactl to control thread placement.
Note
If using a batch queue system (e.g. Slurm, PBS, LSF, IBM Spectrum Symphony, etc) to run the applications, the batch scheduler may control process and thread placement, either automatically or via options it provides. Consult the documentation for the batch queue system you are using, as it is highly likely you should use their options and not set placement directly.
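For instance, under Slurm a hybrid MPI/OpenMP job might be launched along the following lines; the task and thread counts are illustrative, and binding defaults vary by site and Slurm version, so treat this only as a sketch and confirm against your scheduler's documentation:
$ export OMP_NUM_THREADS=8
$ srun --ntasks=16 --cpus-per-task=8 --cpu-bind=cores <./application>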
NUMA considerations
Depending upon OS configuration, multiple NUMA regions may be present within each socket of AMD Ryzen and Threadripper desktop and AMD EPYC server systems (this is generally referred to as NUMA Per Socket or NPS), and AMD EPYC server systems also scale to 2 sockets. Thus, when running multiple threads within a process, the placement of memory can significantly influence the performance. In Linux, memory is often allocated on a “first touch” basis, meaning that the memory will be assigned close to the thread that initializes it. If this occurs in a single NUMA region, the memory access latency can be higher and the aggregate bandwidth lower if this memory is then accessed simultaneously by multiple threads that reside on cores across multiple NUMA regions. Where possible, this can be avoided by allocating and initializing the memory on the thread that will use it. You can also use numactl to interleave the memory across specified NUMA regions. This will not achieve perfect placement based on memory utilization, but will spread the memory access overheads more evenly, allowing higher aggregate memory bandwidth.
For example, on a two-socket system with 64 cores per socket, and NPS=4, to pin threads and distribute memory across all the cores and memory regions on the second socket:
$ OMP_NUM_THREADS=64 numactl -C 64-127 -i 4-7 <./application>
or
$ OMP_NUM_THREADS=64 numactl -N 4-7 -i 4-7 <./application>
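The NUMA node numbers and core ranges used above depend on the NPS setting and the platform, so before choosing values it is worth confirming the topology of your own system with standard Linux tools, for example:
$ numactl -H
$ lscpu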