8.5.3. Runtime Thread Control

AOCL User Guide (57404)

Document ID: 57404
Release Date: 2025-12-29
Version: 5.2 English

AOCL-DLP lets you control threading behavior at run time through environment variables and API calls.

AOCL-DLP Threading Variables:

| Variable | Type | Description | Example Values |
| --- | --- | --- | --- |
| DLP_NUM_THREADS | Integer | Sets total number of threads for GEMM operations | 1, 4, 8, 16 |
| DLP_IC_NT | Integer | Number of threads for inner loop parallelization (IC dimension) | 1, 2, 4 |
| DLP_JC_NT | Integer | Number of threads for outer loop parallelization (JC dimension) | 1, 2, 4 |

Note: When both DLP_IC_NT and DLP_JC_NT are set, DLP_NUM_THREADS is ignored.
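The precedence rule in the note can be sketched as a small shell helper. This is an illustration only: `effective_dlp_threads` is a hypothetical function, not part of AOCL-DLP, and it assumes that the total thread count for a decomposition is the product of the JC and IC values. The behavior when only one of DLP_IC_NT or DLP_JC_NT is set is not described here, so the sketch simply falls back to DLP_NUM_THREADS in that case.

```shell
# Hypothetical helper (not part of AOCL-DLP): report the thread count that
# the current environment variables request.
effective_dlp_threads() {
    if [ -n "${DLP_IC_NT:-}" ] && [ -n "${DLP_JC_NT:-}" ]; then
        # Both decomposition variables are set, so DLP_NUM_THREADS is ignored.
        # Assumption: the total is the product JC x IC.
        echo $(( DLP_JC_NT * DLP_IC_NT ))
    elif [ -n "${DLP_NUM_THREADS:-}" ]; then
        echo "${DLP_NUM_THREADS}"
    else
        # Nothing usable is set; the library default applies (not modelled here).
        echo "default"
    fi
}

effective_dlp_threads
```

Calling the helper before launching the application is a quick way to confirm which of the variables will actually take effect.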

OpenMP Environment Variables:

When the OpenMP threading model is used, these variables affect performance:

| Variable | Type | Description | Example Values |
| --- | --- | --- | --- |
| OMP_NUM_THREADS | Integer | Number of OpenMP threads | 1, 4, 8, 16 |
| OMP_PROC_BIND | String | Thread affinity policy | close, spread, true |
| OMP_PLACES | String | Thread placement specification | cores, sockets, threads |
| OMP_WAIT_POLICY | String | Wait policy for idle threads: active (spin) or passive (yield or sleep) | active, passive |
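A variable that is set but not exported never reaches the application, so it can help to print the OpenMP settings a child process will inherit before starting a run. The following is a minimal sketch; `print_omp_settings` is a hypothetical helper name, and it relies on `printenv`, which reports exported variables only.

```shell
# Hypothetical helper: list the OpenMP variables as a child process would see them.
print_omp_settings() {
    for name in OMP_NUM_THREADS OMP_PROC_BIND OMP_PLACES OMP_WAIT_POLICY; do
        # printenv fails when the variable is not exported to the environment.
        value=$(printenv "$name") || value="(unset)"
        printf '%s=%s\n' "$name" "$value"
    done
}

print_omp_settings
```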

Additional Environment Variables:

| Variable | Description | Example |
| --- | --- | --- |
| AOCL_ENABLE_INSTRUCTIONS | Specify target instruction set | AVX2, AVX512 |
| AOCL_ENABLE_LPGEMM_LOGGER | Enable detailed logging for low-precision GEMM operations. This variable only takes effect if the library is built with DLP_ENABLE_LOGGING=ON. | 1, true, yes |
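Before exporting AOCL_ENABLE_INSTRUCTIONS, it may be worth checking that the CPU reports the requested instruction set. The sketch below is Linux-specific and makes two assumptions that are not taken from this guide: `cpu_supports_isa` is a hypothetical helper, and AVX512 is approximated by the `avx512f` flag in /proc/cpuinfo (the library may require further AVX-512 subsets).

```shell
# Hypothetical helper (not part of AOCL): succeed if the CPU flag list in $2
# contains the feature assumed to correspond to the instruction set in $1.
cpu_supports_isa() {
    case "$1" in
        AVX512) needed="avx512f" ;;   # assumption: foundation flag only
        AVX2)   needed="avx2" ;;
        *)      return 1 ;;           # unknown name: treat as unsupported
    esac
    case " $2 " in
        *" $needed "*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Linux only: read the feature flags of the first CPU in /proc/cpuinfo.
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)

requested="AVX2"
if cpu_supports_isa "$requested" "$flags"; then
    export AOCL_ENABLE_INSTRUCTIONS="$requested"
else
    echo "CPU does not report $requested; leaving AOCL_ENABLE_INSTRUCTIONS unset" >&2
fi
```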

Usage Examples:

Basic Threading Configuration:

# Set 8 threads for all GEMM operations
export DLP_NUM_THREADS=8
./your_application

# Use 2x4 thread decomposition (2 for JC, 4 for IC)
export DLP_JC_NT=2
export DLP_IC_NT=4
./your_application

OpenMP Optimization for NUMA Systems:

# Multi-socket systems - bind to specific NUMA node with interleaved memory
# Example: 128 cores total, using second socket (cores 64-127, NUMA node 1),
# so 64 threads give one thread per core
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl --cpunodebind=1 --interleave=1 ./your_application

# Alternative: Bind to specific core range
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl -C 64-127 --interleave=1 ./your_application

# Single-socket systems - keep threads and memory local
export OMP_WAIT_POLICY=active
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=16
numactl --cpunodebind=0 --membind=0 ./your_application
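The examples above hard-code OMP_NUM_THREADS for a known topology. As a rough sketch, the value can instead be derived from the number of physical cores on the target NUMA node. This assumes the util-linux `lscpu` tool is installed; `cores_on_node` is a hypothetical helper that parses `lscpu -p=CORE,NODE` output and counts each core ID once, so SMT siblings are not double-counted.

```shell
# Hypothetical helper: count distinct physical cores on one NUMA node.
# Reads "CORE,NODE" CSV lines on stdin; $1 is the NUMA node number.
cores_on_node() {
    awk -F, -v node="$1" '
        /^#/ { next }                       # skip lscpu header comments
        $2 == node && !seen[$1]++ { n++ }   # count each core ID once
        END { print n + 0 }
    '
}

node=1
if command -v lscpu >/dev/null 2>&1; then
    count=$(lscpu -p=CORE,NODE | cores_on_node "$node")
    if [ "$count" -gt 0 ]; then
        export OMP_NUM_THREADS="$count"
        echo "NUMA node $node: OMP_NUM_THREADS=$OMP_NUM_THREADS"
    else
        echo "No cores found on NUMA node $node" >&2
    fi
fi
```

The exported value can then be combined with the `numactl --cpunodebind` invocations shown above.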

Recommended Production Command:

For optimal AOCL-DLP performance on multi-socket systems, use the following comprehensive command template:

# Optimal configuration for second socket with 128 cores
# Adjust OMP_NUM_THREADS based on your system's core count per socket
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application

API-Based Thread Control:

AOCL-DLP provides runtime APIs for thread control that take precedence over environment variables:

#include "aocl_dlp.h"

// Set total number of threads
dlp_thread_set_num_threads(8);

// Set thread decomposition (JC threads, IC threads)
dlp_thread_set_ways(2, 4);

// Get current thread configuration
int total_threads = dlp_thread_get_num_threads();

For comprehensive performance optimization strategies and detailed tuning information, refer to the Performance Guide at amd/aocl-dlp.