AOCL Performance Tuning Guide (63859)

Document ID: 63859
Release Date: 2024-10-10
Version: 5.0 English

1. Introduction

AMD Optimizing CPU Libraries (AOCL) are a set of numerical libraries optimized for AMD “Zen”-based processors, including EPYC™, Ryzen™ Threadripper™, and Ryzen™. This document provides performance tuning recommendations and guidance on optimization flags for advanced users of AOCL to experiment with.

AOCL comprises the following libraries:

  • AOCL-BLAS is a portable software framework providing high-performance Basic Linear Algebra Subprograms (BLAS) functionality.

  • AOCL-LAPACK is a portable library for dense matrix computations that provides the functionality present in the Linear Algebra Package (LAPACK).

  • AOCL-FFTW (Fastest Fourier Transform in the West) is a comprehensive collection of fast C routines for computing the Discrete Fourier Transform (DFT) and various special cases.

  • AOCL-LibM is a software library containing elementary math functions optimized for x86-64 processor-based machines.

  • AOCL-Utils is a library that provides APIs to query the available CPU features/flags, cache topology, and other properties of AMD “Zen”-based CPUs.

  • AOCL-ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed memory machines. It depends on external libraries including BLAS and LAPACK for linear algebra computations.

  • AOCL-RNG is a library that provides a set of pseudo-random number generators, quasi-random number generators, and statistical distribution functions optimized for AMD “Zen”-based processors.

  • AOCL-SecureRNG is a library that provides APIs to access the cryptographically secure random numbers generated by the AMD hardware random number generator.

  • AOCL-Sparse is a library containing the basic linear algebra subroutines for sparse matrices and vectors optimized for AMD “Zen”-based CPUs.

  • AOCL-LibMem is AMD’s optimized implementation of memory manipulation functions for AMD “Zen”-based CPUs.

  • AOCL-Cryptography is AMD’s optimized implementation of cryptographic functions.

  • AOCL-Compression is a software framework of various lossless data compression and decompression methods tuned and optimized for AMD “Zen”-based CPUs.

  • AOCL-DA is a data analytics library providing optimized building blocks for data analysis and classical machine learning problems.

For information on installing and using all the AMD optimized libraries, refer to the AOCL User Guide under the “Documentation” section of AMD Developer Central (https://www.amd.com/en/developer/aocl.html).

1.1. General Tuning Guidelines

This section covers common tuning guidelines applicable to most AOCL libraries.

AOCL performance may be affected by system configuration settings. The documents in the following table give a useful overview of some of the architectural, BIOS, and operating system details for different generations of AMD EPYC™ servers.

A key aspect to consider is the interaction between any parallelism at the application level and parallelism within specific AOCL libraries, and how this maps to the NUMA regions of the hardware.

Note

It is vital to remember that if an OpenMP-parallel library is used and OpenMP parameters are not explicitly set (via environment variables or API calls), the default behavior is compiler-dependent but often uses all logical cores enabled on the node. Explicitly setting OMP_NUM_THREADS (at a minimum) is highly recommended.
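
For example, to cap a run at 16 OpenMP threads (an illustrative count; match it to the cores allocated to your job):

$ OMP_NUM_THREADS=16 <./application>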

Different parallelism scenarios to consider:

  • The application and library are single-threaded:

    This is straightforward; no special settings are needed.

  • The application is single-threaded and the library is multi-threaded:

    You can either use the OMP_NUM_THREADS environment variable or call the omp_set_num_threads() C/C++/Fortran API if you need to vary the number of threads used in different parts of your application. For simplicity, all further scenarios consider only the environment-variable approach.

    The operating system may migrate threads between cores during program execution, which can reduce performance. This can be prevented by binding threads to cores, either with OpenMP environment variables or by launching the application with the Linux numactl command:

    $ OMP_NUM_THREADS=8 OMP_PROC_BIND=close OMP_PLACES=cores <./application>
    $ OMP_NUM_THREADS=8 OMP_PROC_BIND=close OMP_PLACES={24:8} <./application>
    $ OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY=24-31 <./application>
    $ OMP_NUM_THREADS=8 numactl -C 24-31 <./application>
    

    The last three commands bind the threads to specific logical cores in the system (here, cores 24-31).

    Note

    numactl defines the set of cores on which the process may run, and threads may still context-switch between the cores in that set. To avoid such context switching, you should also set OMP_PROC_BIND.
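
    For example, combining the two (a sketch; the core range is illustrative):

    $ OMP_NUM_THREADS=8 OMP_PROC_BIND=close numactl -C 24-31 <./application>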

  • The application is multi-threaded and the library is single-threaded:

    You should set the threading options desired for the application-level parallelism, as shown in the previous section.

  • The application and library are both multi-threaded:

    This is a typical scenario of nested parallelism. The default number of active levels of parallelism depends upon the compiler but is often just one. The number of active levels and the desired amount of parallelism at each level can be controlled using the more advanced, comma-separated form of the OMP_NUM_THREADS environment variable together with the OMP_MAX_ACTIVE_LEVELS environment variable. Consult the OpenMP specification for further details.
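
    For example, a sketch requesting four threads at the application level, each spawning eight library-level threads (counts are illustrative):

    export OMP_NUM_THREADS=4,8
    export OMP_MAX_ACTIVE_LEVELS=2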

    Thread pinning for the application and the library can be done using more advanced options for OMP_PROC_BIND, e.g.:

    export OMP_PROC_BIND=spread,close
    

    At the outer (application) level, the threads are spread out; at the inner (AOCL library) level, the threads are scheduled close to their master threads. More complicated scenarios could involve multiple levels of parallelism within the application and/or libraries. Finer-grained control is possible by adding suitable OpenMP API calls within the application.
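
    Putting these pieces together, a sketch of a complete invocation (thread counts are illustrative):

    $ OMP_NUM_THREADS=4,8 OMP_MAX_ACTIVE_LEVELS=2 OMP_PROC_BIND=spread,close OMP_PLACES=cores <./application>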

  • Multiple copies of the application running simultaneously:

    These could be independently started copies (also known as “task farming”) or an application running as multiple processes connected via MPI. The same considerations apply as in earlier scenarios regarding controlling any threading within the application or any libraries called. In general, you should spread the application processes across the available cores while keeping the threads each process spawns close together, i.e., on adjacent cores.
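
    For example, two independently started copies pinned to disjoint core ranges (a sketch; choose ranges to match your topology):

    $ OMP_NUM_THREADS=8 OMP_PROC_BIND=close numactl -C 0-7 <./application> &
    $ OMP_NUM_THREADS=8 OMP_PROC_BIND=close numactl -C 8-15 <./application> &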

    Note

    If simple options for setting OMP_PLACES are used, e.g., OMP_PROC_BIND=close OMP_PLACES=cores, then all the threads of all processes may map to the same subset of cores, leaving other cores idle. This can be avoided by using the other options, i.e., specifying explicit core numbers in OMP_PLACES or using numactl to control thread placement.

    Note

    If you use a batch queue system (e.g., Slurm, PBS, LSF, IBM Spectrum Symphony) to run the applications, the batch scheduler may control process and thread placement, either automatically or via options it provides. Consult the documentation for the batch queue system you are using; it is highly likely you should use its options rather than setting placement directly.
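
    For example, under Slurm, placement is typically requested through the scheduler rather than set directly (a sketch; exact flags and defaults vary with site configuration):

    $ OMP_NUM_THREADS=8 srun --ntasks=4 --cpus-per-task=8 --cpu-bind=cores <./application>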

NUMA considerations

Depending upon the OS configuration, multiple NUMA regions may be present within each socket of AMD Ryzen and Threadripper desktop and AMD EPYC server systems (this is generally referred to as NUMA Per Socket or NPS), and AMD EPYC server systems also scale to two sockets. Thus, when running multiple threads within a process, the placement of memory can significantly influence performance.

In Linux, memory is often allocated on a “first touch” basis, meaning that memory is assigned to the NUMA region close to the thread that first initializes it. If all memory is first touched within a single NUMA region but is later accessed simultaneously by threads residing on cores across multiple NUMA regions, the memory access latency can be higher and the aggregate bandwidth lower. Where possible, this can be avoided by allocating and initializing memory on the thread that will use it. You can also use numactl to interleave the memory across specified NUMA regions. This will not achieve perfect placement based on memory utilization, but it will spread the memory access overheads more evenly, allowing higher aggregate memory bandwidth.

For example, on a two-socket system with 64 cores per socket and NPS=4, to pin threads to all the cores of the second socket and distribute memory across its memory regions:

$ OMP_NUM_THREADS=64 numactl -C 64-127 -i 4-7 <./application>

or

$ OMP_NUM_THREADS=64 numactl -N 4-7 -i 4-7 <./application>
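
You can check the NUMA node numbering and the mapping of cores and memory to nodes on your system with:

$ numactl -H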