AOCL-DLP provides comprehensive runtime control over threading behavior through environment variables and API calls.
AOCL-DLP Threading Variables:
| Variable | Type | Description | Example Values |
|---|---|---|---|
| `DLP_NUM_THREADS` | Integer | Sets total number of threads for GEMM operations | `8` |
| `DLP_IC_NT` | Integer | Number of threads for inner loop parallelization (IC dimension) | `4` |
| `DLP_JC_NT` | Integer | Number of threads for outer loop parallelization (JC dimension) | `2` |
Note: When both DLP_IC_NT and DLP_JC_NT are set, DLP_NUM_THREADS is ignored.
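The precedence rule in the note above can be sketched as a small shell helper. This is an illustration only, not the library's implementation: `dlp_effective_threads` is a hypothetical name, the total is assumed to be the product of the JC and IC counts, and the fallback to `DLP_NUM_THREADS` when only one decomposition variable is set is an assumption, since the note covers only the case where both are set.

```shell
# Hypothetical helper: reports which setting would determine the thread count.
# Prints "<source> <total>". Runs in a subshell so it leaks no variables.
dlp_effective_threads() (
    if [ -n "${DLP_IC_NT:-}" ] && [ -n "${DLP_JC_NT:-}" ]; then
        # Documented rule: with both set, DLP_NUM_THREADS is ignored.
        # Assumption: the total is the product JC x IC.
        echo "decomposition $((DLP_JC_NT * DLP_IC_NT))"
    elif [ -n "${DLP_NUM_THREADS:-}" ]; then
        # Assumption: this also applies when only one of IC/JC is set.
        echo "num_threads ${DLP_NUM_THREADS}"
    else
        # Nothing set: the default is chosen by the library, not modeled here.
        echo "default unknown"
    fi
)
```

Calling it before launching an application shows at a glance whether a leftover `DLP_NUM_THREADS` export would actually be used.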
OpenMP Environment Variables:
When using the OpenMP threading model, these variables affect performance:
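| Variable | Type | Description | Example Values |
|---|---|---|---|
| `OMP_NUM_THREADS` | Integer | Number of OpenMP threads | `16`, `64`, `128` |
| `OMP_PROC_BIND` | String | Thread affinity policy | `close` |
| `OMP_PLACES` | String | Thread placement specification | `cores` |
| `OMP_WAIT_POLICY` | String | Thread wait policy (`active` keeps waiting threads spinning, `passive` lets them sleep) | `active` |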
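Because these settings come from the environment, it can help to print what a process would actually inherit before a benchmark run. The helper below is a minimal sketch; `omp_env_report` is a hypothetical name, and it only reads the four variables used in this section.

```shell
# Hypothetical helper: prints each OpenMP variable, or "(unset)" if missing.
omp_env_report() (
    for name in OMP_NUM_THREADS OMP_PROC_BIND OMP_PLACES OMP_WAIT_POLICY; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        echo "$name=${value:-(unset)}"
    done
)
```

An `(unset)` entry means the OpenMP runtime will fall back to its own implementation-defined default for that setting.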
Additional Environment Variables:
| Variable | Description | Example |
|---|---|---|
| | Specify target instruction set | |
| | Enable detailed logging for low-precision GEMM operations. This variable only takes effect if the library is built with | |
Usage Examples:
Basic Threading Configuration:
# Set 8 threads for all GEMM operations
export DLP_NUM_THREADS=8
./your_application
# Use 2x4 thread decomposition (2 for JC, 4 for IC)
export DLP_JC_NT=2
export DLP_IC_NT=4
./your_application
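When tuning a decomposition like the 2x4 split above, it is convenient to list every JC x IC pair that uses a given thread budget exactly. The sketch below does that; `dlp_decompositions` is a hypothetical helper, and it assumes the total thread count is the product of the two values.

```shell
# Hypothetical helper: lists every JC x IC pair whose product equals $1.
dlp_decompositions() (
    total=$1
    jc=1
    while [ "$jc" -le "$total" ]; do
        # Keep only values of JC that divide the thread budget evenly.
        if [ $((total % jc)) -eq 0 ]; then
            echo "DLP_JC_NT=$jc DLP_IC_NT=$((total / jc))"
        fi
        jc=$((jc + 1))
    done
)
```

For a budget of 8 threads this yields the four pairs 1x8, 2x4, 4x2, and 8x1. Which one performs best depends on the matrix shapes and should be measured.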
OpenMP Optimization for NUMA Systems:
# Multi-socket systems - bind to specific NUMA node with interleaved memory
# Example: 128 cores total, using second socket (cores 64-127, NUMA node 1)
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64  # one thread per core on the 64-core second socket
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl --cpunodebind=1 --interleave=1 ./your_application
# Alternative: Bind to specific core range
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl -C 64-127 --interleave=1 ./your_application
# Single-socket systems - keep threads and memory local
export OMP_WAIT_POLICY=active
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=16
numactl --cpunodebind=0 --membind=0 ./your_application
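The core range passed to `numactl -C` in the examples above can be derived from the core count and socket index. The sketch below assumes that cores are split evenly across sockets and numbered contiguously per socket, which is not true on every machine. Check the real layout with `numactl --hardware` or `lscpu` before relying on it. `socket_core_range` is a hypothetical helper name.

```shell
# Hypothetical helper: socket_core_range TOTAL_CORES SOCKETS SOCKET_INDEX
# Assumes an even split and contiguous core numbering per socket.
socket_core_range() (
    total_cores=$1
    sockets=$2
    socket=$3
    per_socket=$((total_cores / sockets))
    first=$((socket * per_socket))
    echo "${first}-$((first + per_socket - 1))"
)
```

For the 128-core, two-socket example, `socket_core_range 128 2 1` gives `64-127`, and `OMP_NUM_THREADS` would then be set to the per-socket count of 64.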
Recommended Production Command:
For optimal AOCL-DLP performance on multi-socket systems, use the following comprehensive command template:
# Optimal configuration for second socket with 128 cores
# Adjust OMP_NUM_THREADS based on your system's core count per socket
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application
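To reuse the template across machines, the node and thread count can be parameterized. The helper below is a sketch that only prints the resulting command line so it can be inspected before anything runs. `dlp_numa_cmd` is a hypothetical name, and the fixed `active`, `cores`, and `close` values are copied from the template above.

```shell
# Hypothetical helper: dlp_numa_cmd NUMA_NODE THREADS COMMAND [ARGS...]
# Prints the command line instead of executing it (dry run).
dlp_numa_cmd() (
    node=$1
    threads=$2
    shift 2
    echo "OMP_WAIT_POLICY=active OMP_NUM_THREADS=$threads" \
         "OMP_PLACES=cores OMP_PROC_BIND=close" \
         "numactl --cpunodebind=$node --interleave=$node $*"
)
```

Running `dlp_numa_cmd 1 128 ./your_application` prints the same settings as the template, on a single line.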
API-Based Thread Control:
AOCL-DLP provides runtime APIs for thread control that take precedence over environment variables:
#include "aocl_dlp.h"
// Set total number of threads
dlp_thread_set_num_threads(8);
// Set thread decomposition (JC threads, IC threads)
dlp_thread_set_ways(2, 4);
// Get current thread configuration
int total_threads = dlp_thread_get_num_threads();
For comprehensive performance optimization strategies and detailed tuning information, please refer to the Performance Guide at