5. AOCL-FFTW
Following are the tuning guidelines to get the best performance out of AMD optimized FFTW:
Use the configure option
--enable-amd-opt
to build the targeted library. This option enables all the improvements and optimizations meant for AMD EPYC™ CPUs. It is the mandatory master optimization switch and must be set before enabling any of the other optional configure options, such as:
--enable-amd-mpifft
--enable-amd-mpi-vader-limit
--enable-amd-trans
--enable-amd-fast-planner
--enable-amd-top-n-planner
--enable-amd-app-opt
--enable-dynamic-dispatcher
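As a minimal sketch of how these options combine, the following dry-run script assembles a configure line and prints it instead of running it. Only `--enable-amd-opt` and `--enable-amd-fast-planner` are taken from this guide; the SIMD and OpenMP options are standard FFTW configure options chosen here as an assumption, and should be adjusted to the target CPU and use case.

```shell
#!/bin/sh
# Dry-run sketch: assemble a configure line for AOCL-FFTW and print it.

# Master switch; every other --enable-amd-* option depends on it.
flags="--enable-amd-opt"

# One optional AMD feature from the list above, chosen only as an example.
flags="$flags --enable-amd-fast-planner"

# Assumption: stock FFTW options for SIMD code paths and OpenMP threading.
flags="$flags --enable-sse2 --enable-avx --enable-avx2 --enable-openmp"

configure_cmd="./configure $flags"
echo "$configure_cmd"
```

To build for real, the printed command would be run from the top of the AOCL-FFTW source tree, followed by `make`.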
When enabling the AMD CPU-specific improvements with the configure option
--enable-amd-opt
, do not use the configure option --enable-generic-simd128
or --enable-generic-simd256.
An optional configure option
--enable-amd-trans
is provided; it may improve the performance of transpose operations for very large FFT problem sizes. Use this feature only when running in single-threaded, single-instance mode.
Use the configure option
--enable-amd-mpifft
to enable MPI FFT-related optimizations. This option is not mandatory, but it benefits most MPI problem types and sizes.
An optional configure option
--enable-amd-mpi-vader-limit
that controls the enabling of AMD’s new MPI transpose algorithms is supported. When using this configure option, you must set --mca btl_vader_eager_limit
appropriately (the currently preferred value is 65536) in the mpirun command.
You can enable the AMD-optimized fast planner using the optional configure option
--enable-amd-fast-planner
. Use this option to reduce planning time without much loss in performance. It is supported for single and double precision.
To minimize single-threaded run-to-run variations, you can enable the Top N planner feature using the configure option
--enable-amd-top-n-planner
. It uses the WISDOM feature to generate and reuse a set of top N plans for the given size (N is currently set to 3). It is supported only for single-threaded execution runs.
For best performance, use the
PATIENT
planner flag of FFTW.
A sample run of the FFTW bench test application with the PATIENT planner flag is as follows:
$ ./bench -opatient -s icf65536
Where,
-s
option requests a speed/performance run, and the icf options stand for in-place, complex data type, and forward transform.
When configured with
--enable-openmp
and running a multi-threaded test, set the OpenMP variables as:
set OMP_PROC_BIND=TRUE OMP_PLACES=cores
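In a Bourne-style shell (sh, bash), a bare `set VAR=value` does not place the variable in the environment of a child process, so the same settings would typically be exported. A minimal sketch, assuming such a shell:

```shell
# Export the OpenMP affinity settings so that ./bench inherits them.
export OMP_PROC_BIND=TRUE   # bind each OpenMP thread to the place it starts on
export OMP_PLACES=cores     # define one place per physical core

# Print the values to confirm what child processes will see.
echo "OMP_PROC_BIND=$OMP_PROC_BIND OMP_PLACES=$OMP_PLACES"
```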
Then, run the test bench executable binary using numactl as follows:
$ numactl --interleave=0,1,2,3 ./bench -opatient -onthreads=64 -s icf65536
Where,
numactl --interleave=0,1,2,3
sets the memory interleave policy on nodes 0, 1, 2, and 3.
When running an MPI FFTW test, set the appropriate MPI mapping, binding, and rank options. For example, to run FFTW with 64 MPI ranks on a 64-core AMD EPYC™ processor, use:
$ mpirun --map-by core --rank-by core --bind-to core -np 64 ./mpi-bench -opatient -s icf65536
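When the library was configured with `--enable-amd-mpi-vader-limit`, the eager limit mentioned earlier has to be passed on the same command line. The dry-run sketch below only assembles and prints such a command. It assumes an Open MPI installation that provides the vader BTL, and it uses the value 65536 that this guide currently prefers; the variable names are illustrative.

```shell
#!/bin/sh
# Dry-run sketch: build the mpirun command line and print it.
ranks=64             # one MPI rank per core on a 64-core processor
eager_limit=65536    # value currently preferred by this guide

mpirun_cmd="mpirun --mca btl_vader_eager_limit $eager_limit"
mpirun_cmd="$mpirun_cmd --map-by core --rank-by core --bind-to core -np $ranks"
mpirun_cmd="$mpirun_cmd ./mpi-bench -opatient -s icf65536"

echo "$mpirun_cmd"
```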
Use the configure option
--enable-amd-app-opt
to enable AMD’s application optimization layer in AOCL-FFTW, which helps uplift the performance of various HPC and scientific applications. For more information, refer to the AOCL-FFTW chapter in the AOCL User Guide.
To build a single portable optimized library that can run on a wide range of CPU architectures, a dynamic dispatcher feature is implemented. Use the
--enable-dynamic-dispatcher
configure option to enable this feature for Linux-based systems. The set of x86 CPUs on which the single portable library can work depends on the highest level of CPU SIMD instruction set with which it is configured.
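Because the usable CPU range depends on the SIMD level chosen at configure time, it can help to check which SIMD extensions a given host reports before deploying a portable build. The helper below is a hypothetical sketch, not part of AOCL-FFTW: `simd_levels` is an illustrative name, and the list of flags it looks for (sse2, avx, avx2, avx512f) is an assumption about which levels matter.

```shell
#!/bin/sh
# Hypothetical helper: print which of a few SIMD flags appear in a flag list.
# $1 is a space-separated list such as the "flags" line of /proc/cpuinfo.
simd_levels() {
  for want in sse2 avx avx2 avx512f; do
    case " $1 " in
      *" $want "*) printf '%s ' "$want" ;;
    esac
  done
}

# On Linux, inspect the first CPU's flags if /proc/cpuinfo is readable.
if [ -r /proc/cpuinfo ]; then
  host_flags=$(grep -m1 '^flags' /proc/cpuinfo)
  echo "SIMD levels reported by this host: $(simd_levels "$host_flags")"
fi
```

A host that lacks the highest SIMD level used at configure time would fall outside the set of CPUs the portable library is expected to run on.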