Correctly combining multi-threading (via the OpenMP library) with MPI is important for optimal performance. As a rule, set the number of MPI tasks equal to the number of L3 caches in the system.
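On Linux, the number of L3 cache instances can be read from sysfs; a minimal sketch (the sysfs path is standard on x86 Linux but is an assumption for other platforms):

```shell
# Count distinct L3 cache instances: each L3 exposes the CPU list that
# shares it, so counting unique lists gives the number of L3 caches.
l3_count=$(cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list 2>/dev/null | sort -u | wc -l)
echo "L3 caches (candidate MPI task count): $l3_count"
```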
The HPL benchmark typically produces its best single-node performance number with the following configurations, depending on which generation of AMD EPYC™ processor is used:
2nd Gen AMD EPYC™ Processors (codenamed “Rome”)
A dual-socket AMD EPYC 7742 system has 2 x 64 cores arranged as 32 CCXs, with each CCX containing four cores that share an L3 cache. For maximum performance, use 32 MPI ranks with 4 OpenMP threads each: one MPI rank bound to each CCX, giving four threads per L3 cache.
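The rank and thread counts above follow directly from the topology; a quick arithmetic sketch using the 7742 figures from the text:

```shell
# Dual-socket EPYC 7742 (Rome): 2 sockets x 64 cores = 128 cores total.
TOTAL_CORES=128
CORES_PER_CCX=4      # Rome: four cores share one L3 cache (one CCX)
# One MPI rank per CCX/L3, one OpenMP thread per core inside the CCX.
MPI_RANKS=$(( TOTAL_CORES / CORES_PER_CCX ))
OMP_THREADS=$CORES_PER_CCX
echo "ranks=$MPI_RANKS threads=$OMP_THREADS"   # ranks=32 threads=4
```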
Set the following flags while building and running the tests:
$ export BLIS_IC_NT=4
$ export BLIS_JC_NT=1
Execute the following command to run the test:
$ mpirun -np 32 --report-bindings --map-by ppr:1:l3cache,pe=4 -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores ./xhpl
The BLIS_IC_NT and BLIS_JC_NT parameters control DGEMM parallelization within each shared L3 cache, further improving performance.
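The product of the two BLIS loop factors should match the OpenMP thread count per rank (with the other BLIS loop factors left at their default of 1). A simple pre-launch consistency check, using the values from the Rome configuration above:

```shell
export BLIS_IC_NT=4
export BLIS_JC_NT=1
export OMP_NUM_THREADS=4
# The IC and JC loop factors multiply to the total thread count BLIS
# uses per MPI rank; a mismatch over- or under-subscribes the L3 cache.
if [ $(( BLIS_IC_NT * BLIS_JC_NT )) -eq "$OMP_NUM_THREADS" ]; then
    echo "BLIS thread factorization matches OMP_NUM_THREADS"
else
    echo "mismatch: adjust BLIS_IC_NT/BLIS_JC_NT" >&2
fi
```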
3rd Gen AMD EPYC™ Processors (codenamed “Milan”)
The number of MPI ranks, and the maximum thread count per rank, depend on the specific EPYC SKU. For best performance, bind each MPI rank to a CCX when using 4 OpenMP threads per rank; when using 8 threads per rank, bind each rank to a CCD instead.
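For the 8-threads-per-rank case, the rank count again follows from the core count. A sketch assuming a 128-core dual-socket Milan system (the text does not name a SKU, so the core count here is an assumption consistent with the -np 16 launch below):

```shell
# Assumed example: dual-socket 128-core Milan system (SKU not given in the text).
TOTAL_CORES=128
THREADS_PER_RANK=8   # one MPI rank per CCD when using 8 OpenMP threads
MPI_RANKS=$(( TOTAL_CORES / THREADS_PER_RANK ))
echo "ranks=$MPI_RANKS"   # ranks=16
```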
Set the following flags while building and running the tests:
$ export BLIS_IC_NT=8
$ export BLIS_JC_NT=1
Execute the following command to run the test:
$ mpirun -np 16 --report-bindings --map-by ppr:1:l3cache,pe=8 -x OMP_NUM_THREADS=8 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores ./xhpl