
17. Linking AOCL to Applications#

This section provides examples of how AOCL can be linked with the HPL benchmark and the MUMPS sparse solver library.

17.1. High-Performance LINPACK Benchmark (HPL)#

HPL is a software package that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory computers. It is a LINPACK benchmark that measures the floating-point rate of execution for solving a linear system of equations.

To build an HPL binary from the source code, set the MPxxx and LAxxx variables in your architecture-specific Makefile (Make.<arch>) to match the installed locations of your MPI and linear algebra libraries. For AOCL-BLAS, use the F77 interface with F2CDEFS = -DAdd_ -DF77_INTEGER=int -DStringSunStyle.
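
For reference, a minimal sketch of the relevant Make.<arch> entries, assuming Open MPI and a multi-threaded AOCL-BLAS build (the install prefixes and library file names below are illustrative assumptions; adjust them to your installation):

MPdir   = /opt/openmpi                 # MPI install prefix (assumed path)
MPinc   = -I$(MPdir)/include
MPlib   = $(MPdir)/lib/libmpi.so
LAdir   = /opt/aocl/amd-blis           # AOCL-BLAS install prefix (assumed path)
LAinc   = -I$(LAdir)/include
LAlib   = $(LAdir)/lib/libblis-mt.a    # multi-threaded AOCL-BLAS archive (name may differ)
F2CDEFS = -DAdd_ -DF77_INTEGER=int -DStringSunStyle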

For optimal performance, build the multi-threaded AOCL-BLAS with the following configuration:

$ ./configure --enable-cblas -t openmp --prefix=<path> auto
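
After configuring, a typical build and install of AOCL-BLAS from source looks like the following (standard BLIS make targets; adjust the job count as needed):

$ make -j
$ make install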

Set up HPL.dat before running the benchmark.

17.1.1. Configuring HPL.dat#

The HPL.dat file contains the configuration parameters. The most important parameters are Problem Size (N), Process Grid (P and Q), and Block Size (NB).

Table 17.1 Configuration Parameters#

  • Problem Size (N): For best results, set the problem size large enough to use 80-90% of the available memory.

  • Process Grid (P and Q): P x Q must match the number of MPI ranks. P and Q must be as close to each other as possible; if they cannot be equal, Q must be the larger.

  • Block Size (NB): HPL uses the block size for the data distribution and for the computational granularity. Set NB=240 for optimal performance.

  • Broadcast (BCASTs): Set BCASTs=2. The increasing-2-ring (2rg) broadcast algorithm gives better performance than the default broadcast algorithm.
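
The following HPL.dat excerpt illustrates these settings. Only the lines that are typically modified in the standard HPL.dat template are shown; the problem size and the 4 x 8 process grid are illustrative values (matching the 32-rank example in the next section), and the remaining template lines keep their defaults:

HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
168960       Ns          (illustrative; size N to use 80-90% of memory)
1            # of NBs
240          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
8            Qs
...
2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)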

17.1.2. Running the Benchmark#

Configuring the right combination of multi-threading (through the OpenMP library) and MPI is important for optimal performance. Set the number of MPI ranks equal to the number of L3 caches in the system.
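
One way to count the L3 cache instances on Linux is to query sysfs (this assumes a kernel that exposes the cache id attribute; on x86 systems index3 normally corresponds to the L3 cache, and recent versions of lscpu also report the instance count):

$ cat /sys/devices/system/cpu/cpu*/cache/index3/id | sort -un | wc -l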

The HPL benchmark typically produces a better single-node performance number with the following configurations, depending on which generation of AMD EPYC™ processor is used:

  • 2nd Gen AMD EPYC™ Processors (codenamed “Rome”)

    A dual-socket AMD EPYC 7742 system has a total of 2 x 64 cores arranged in 32 CCXs, each with its own L3 cache (four cores per CCX). For maximum performance, use 32 MPI ranks with 4 OpenMP threads each, so that each MPI rank is bound to one CCX and 4 threads share each L3 cache.

    Set the following environment variables while building and running the tests:

    $ export BLIS_IC_NT=4
    $ export BLIS_JC_NT=1
    

    Execute the following command to run the test:

    $ mpirun -np 32 --report-bindings --map-by ppr:1:l3cache,pe=4 \
      -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores ./xhpl
    

    The BLIS_IC_NT and BLIS_JC_NT parameters are set to control DGEMM parallelization within each shared L3 cache and improve the performance further.

  • 3rd Gen AMD EPYC™ Processors (codenamed “Milan”)

    The number of MPI ranks and the maximum thread count per MPI rank depend on the specific EPYC SKU. For better performance, bind each MPI rank to a CCX when using 4 OpenMP threads; when using 8 threads, bind each rank to a CCD instead.

    Set the following environment variables while building and running the tests:

    $ export BLIS_IC_NT=8
    $ export BLIS_JC_NT=1
    

    Execute the following command to run the test:

    $ mpirun -np 16 --report-bindings --map-by ppr:1:l3cache,pe=8 \
      -x OMP_NUM_THREADS=8 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores ./xhpl
    

17.2. MUMPS Sparse Solver Library#

MUltifrontal Massively Parallel Solver (MUMPS: http://mumps-solver.org/) is an open-source package for solving systems of linear equations of the form:

\[Ax = b,\]

where \(A\) is a square sparse matrix. On distributed-memory computers, \(A\) can be one of the following:

  • Unsymmetric

  • Symmetric positive definite

  • General symmetric

MUMPS implements a direct method based on a multi-frontal approach which performs the Gaussian factorization:

\[A = LU,\]

where \(L\) is a lower triangular matrix and \(U\) is an upper triangular matrix. If the matrix is symmetric, the factorization:

\[A = LDL^T,\]

where \(D\) is a block diagonal matrix, is performed. The system \(Ax = b\) is solved in the following steps:

  1. Analysis

    During the analysis, preprocessing (including reordering) and a symbolic factorization are performed. This step depends on external ordering libraries such as METIS and SCOTCH, or on PORD (included in the MUMPS source). \(A_{pre}\) denotes the preprocessed matrix.

  2. Factorization

    During the factorization, \(A_{pre} = LU\) or \(A_{pre} = LDL^T\), depending on the symmetry of the preprocessed matrix, is computed. The original matrix is first distributed (or redistributed) onto the processors according to the mapping computed during the analysis. The numerical factorization is then a sequence of dense factorizations on the frontal matrices.

  3. Solution

    The solution \(x_{pre}\) of:

    \[LUx_{pre} = b_{pre} \quad\text{or}\quad LDL^T x_{pre} = b_{pre}\]

    is computed, where \(x_{pre}\) and \(b_{pre}\) are the transformed solution \(x\) and right-hand side \(b\), respectively, associated with the preprocessed matrix \(A_{pre}\). The solution is obtained through the forward elimination step:

    \[Ly = b_{pre} \quad\text{or}\quad LDy = b_{pre},\]

    followed by the backward elimination step:

    \[Ux_{pre} = y \quad\text{or}\quad L^T x_{pre} = y.\]

    The solution \(x_{pre}\) is finally processed to obtain the solution \(x\) of the original system \(Ax = b\).

The AOCL libraries can be integrated with the MUMPS sparse solver to perform highly optimized linear algebra operations on AMD “Zen”-based processors.

17.2.1. Enabling AOCL with MUMPS#

17.2.1.1. Using Spack on Linux#

Complete the following steps to enable AOCL with MUMPS on Linux:

  1. Set up Spack on the target machine.

  2. Link the AOCL libraries AOCL-BLAS, AOCL-LAPACK, and AOCL-ScaLAPACK while installing MUMPS. Use the following Spack commands to install MUMPS with:

    • gcc compiler:

      $ spack install mumps ^amdblis ^amdlibflame ^amdscalapack ^aoclutils
      
    • aocc compiler:

      $ spack install mumps ^amdblis ^amdlibflame ^amdscalapack ^aoclutils %aocc
      
    • To use an external reordering library (for example, METIS), run the following command:

      $ spack install mumps ^metis ^amdblis ^amdlibflame ^amdscalapack ^aoclutils
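
    Before installing, you can preview the resolved dependency tree, and after installing you can locate and load the package, using standard Spack commands (the spec shown is the gcc variant above):

      $ spack spec mumps ^amdblis ^amdlibflame ^amdscalapack ^aoclutils
      $ spack find --paths mumps
      $ spack load mumps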
      

17.2.1.2. On Windows#

GitHub URL: amd/mumps-build

Prerequisites

Ensure that the following prerequisites are met:

  • CMake and Ninja Makefile Generator - Ensure that Ninja is installed/updated in the Microsoft Visual Studio installation folder:

    C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\Ninja
    
  • Download the latest Ninja binary from the URL:

    ninja-build/ninja

  • The Intel® oneAPI toolkit must include the C, C++, and Fortran compilers, and MPI. For more information, refer to the Intel documentation (https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html).

  • Pre-built AOCL libraries for AOCL-BLAS, AOCL-LAPACK, and AOCL-ScaLAPACK.

  • If the reordering library is METIS, complete the following steps:

    1. Download the METIS library sources from the SuiteSparse public repository (grup-gu/SuiteSparse.git).

    2. Build the METIS library from the metis folder:

      $ cd SuiteSparse\metis-5.1.0
      
    3. In metis/include/metis.h, define IDXTYPEWIDTH and REALTYPEWIDTH to 32 or 64, based on the required integer size.

    4. Configure:

      $ cmake -S . -B ninja_build_dir -G "Ninja" -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON
      
    5. Build the project:

      $ cmake --build ninja_build_dir --verbose
      

    The library metis.lib is generated in ninja_build_dir\lib.

  • Boost libraries installed on Windows (the installation path is passed later through -DBOOST_ROOT).

Building MUMPS Sources

Complete the following steps to build the MUMPS sources on Windows:

  1. Check out the MUMPS build repository from AOCL GitHub (amd/mumps-build).

  2. Open the Intel oneAPI command prompt for Intel 64 for Microsoft Visual Studio 2019 from the Windows search box.

  3. Edit the default options in options.cmake in mumps/cmake/.

  4. Remove the build directory if it already exists.

  5. Configure the MUMPS project using Ninja:

    cmake -S . -B ninja_build_dir -G "Ninja" -DENABLE_AOCL=ON ^
    -DENABLE_MKL=OFF -DBUILD_TESTING=ON ^
    -DCMAKE_INSTALL_PREFIX="</mumps/install/path>" -Dscotch=ON -Dopenmp=ON ^
    -DBUILD_SHARED_LIBS=OFF -Dparallel=ON -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON ^
    -DCMAKE_BUILD_TYPE=Release ^
    -DUSER_PROVIDED_BLIS_LIBRARY_PATH="<path/to/AOCL-BLAS/library/path>" ^
    -DUSER_PROVIDED_BLIS_INCLUDE_PATH="<path/to/AOCL-BLAS/headers/path>" ^
    -DUSER_PROVIDED_LAPACK_LIBRARY_PATH="<path/to/AOCL-LAPACK/library/path>" ^
    -DUSER_PROVIDED_LAPACK_INCLUDE_PATH="<path/to/AOCL-LAPACK/headers/path>" ^
    -DUSER_PROVIDED_SCALAPACK_LIBRARY_PATH="<path/to/scalapack/library/path>" ^
    -DUSER_PROVIDED_METIS_LIBRARY_PATH="<path/to/metis/library/path>" ^
    -DUSER_PROVIDED_METIS_INCLUDE_PATH="<path/to/metis/include/path>" ^
    -DUSER_PROVIDED_IMPILIB_ILP64_PATH="<path/to/intel/mpi/lib/ilp64>" ^
    -DCMAKE_C_COMPILER="icx.exe" -DCMAKE_CXX_COMPILER="icx.exe" ^
    -DCMAKE_Fortran_COMPILER="ifx.exe" -DBOOST_ROOT="<path/to/boost>" ^
    -Dintsize64=OFF -DMUMPS_UPSTREAM_VERSION="5.6.2"

    The following options are enabled in the command:

    • -DENABLE_AOCL=ON: <Enable AOCL libraries>

    • -DENABLE_MKL=OFF: <Enable MKL libraries>

    • -DBUILD_TESTING=ON: <Enable linking MUMPS to the test application>

    • -Dscotch=ON: <Enable Metis library for reordering>

    • -Dopenmp=ON: <Enable multithreading using OpenMP>

    • -Dintsize64=OFF: <Enable LP64, that is, 32-bit integer size>

    • -DBUILD_SHARED_LIBS=OFF: <Enable static library>

    • -Dparallel=ON: <Enable multithreading>

    • -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON: <Enable verbose build log>

    • -DCMAKE_BUILD_TYPE=Release: <Enable Release build>

    • -DUSER_PROVIDED_BLIS_LIBRARY_PATH="<path/to/blas/lib>"

    • -DUSER_PROVIDED_BLIS_INCLUDE_PATH="<path/to/blas/header>"

    • -DUSER_PROVIDED_LAPACK_LIBRARY_PATH="<path/to/lapack/lib>"

    • -DUSER_PROVIDED_LAPACK_INCLUDE_PATH="<path/to/lapack/include/header>"

    • -DUSER_PROVIDED_SCALAPACK_LIBRARY_PATH="<path/to/scalapack/lib>"

    • -DUSER_PROVIDED_METIS_LIBRARY="<Metis/library/with/absolute/path>"

    • -DUSER_PROVIDED_METIS_LIBRARY_PATH="<path/to/metis/lib>"

    • -DUSER_PROVIDED_METIS_INCLUDE_PATH="<path/to/metis/header>"

    • -DCMAKE_C_COMPILER="<intel c compiler>"

    • -DCMAKE_Fortran_COMPILER="<intel fortran compiler>"

    • -DBOOST_ROOT="<path/to/BOOST/INSTALLATION>"

    • -DUSER_PROVIDED_IMPILIB_ILP64_PATH="<path/to/64-bit/Intel MPI library>"

    • -DMUMPS_UPSTREAM_VERSION="<valid/supported MUMPS source versions: 5.4.1, 5.5.0, 5.5.1, and 5.6.2>"

  6. Alternatively, the AOCL dependencies can be configured using AOCL_ROOT. Define the environment variable AOCL_ROOT to point to the AOCL libraries installation:

    set "AOCL_ROOT=C:\Program Files\AMD\AOCL-Windows"
    
    cmake -S . -B ninja_build_dir -G "Ninja" -DENABLE_AOCL=ON ^
    -DENABLE_MKL=OFF -DBUILD_TESTING=ON ^
    -DCMAKE_INSTALL_PREFIX="</mumps/install/path>" -Dscotch=ON ^
    -Dopenmp=ON -DBUILD_SHARED_LIBS=OFF -Dparallel=ON ^
    -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_BUILD_TYPE=Release ^
    -DUSER_PROVIDED_METIS_LIBRARY_PATH="<path/to/metis/library/path>" ^
    -DUSER_PROVIDED_METIS_INCLUDE_PATH="<path/to/metis/include/path>" ^
    -DUSER_PROVIDED_IMPILIB_ILP64_PATH="<path/to/intel/mpi/lib/ilp64>" ^
    -DCMAKE_C_COMPILER="icx.exe" -DCMAKE_CXX_COMPILER="icx.exe" ^
    -DCMAKE_Fortran_COMPILER="ifx.exe" -DBOOST_ROOT="<path/to/boost>" ^
    -Dintsize64=OFF -DMUMPS_UPSTREAM_VERSION="5.6.2"
    
  7. Toggle or edit the options in step 5 to select (see the example below):

    1. Debug or Release build

    2. LP64 or ILP64 libraries
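
    For example, a Debug build with ILP64 (64-bit integer) libraries would flip these two options in the step 5 command (an assumption based on the option descriptions above; all other options stay the same):

    -DCMAKE_BUILD_TYPE=Debug -Dintsize64=ON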

  8. Build the project:

    cmake --build ninja_build_dir --config Release --target install --verbose
    
  9. Run the executable in ninja_build_dir\tests:

    mpiexec -env I_MPI_PIN_DOMAIN cache3 -n 2 amd_mumps_aocl LFAT5.mtx 1 1 1