Environment Variables

ZenDNN User Guide (57300)

Document ID
57300
Release Date
2026-04-13
Revision
5.2.1 English

The environment variables used to set up paths, control logs, and tune performance are listed here.

The settings given in the following table are used in the ZenDNN library and apply to zentorch and zentf.

Table 1. ZenDNN Environment Variables common to all frameworks
Environment Variable Description Default Value/User Defined Value
Generic (Set up paths and control logs)
ZENDNNL_<module>_LOG_LEVEL

Enables ZenDNN logs. See Logging and Debugging for details on how to use logs.

0
ZENDNNL_LRU_CACHE_CAPACITY

Sets the maximum capacity of the LRU cache for blocked weights of the MatMul algorithm.

You can modify it as required. (a)

1024
ZENDNNL_EMBAG_THREAD_ALGO

Sets the Embedding Bag thread type. The default is the recommended setting for RecSys models.

1
ZENDNNL_EMBAG_ALGO

Sets the Embedding Bag backend kernel.

For FP32/BF16/INT4:

  • 1 = Native
  • 2 = FBGEMM

2
OMP_DYNAMIC

OMP variable that controls dynamic adjustment of the number of OMP threads. Refer to the OpenMP documentation for details.

FALSE
Optimized (Tune performance)
OMP_NUM_THREADS

Sets the number of OMP threads. Generally, this is equal to the number of cores present.

Set it based on the number of cores in your system. (a)

128
OMP_WAIT_POLICY

Sets the behavior of waiting threads. Refer to the OpenMP documentation for details.

ACTIVE
KMP_BLOCKTIME

Sets the amount of time, in milliseconds, that a thread should wait before sleeping when a parallel region ends. Setting it to 1 minimizes idle time and can improve responsiveness for short tasks by quickly putting threads to sleep after work is complete.

Note: Do not set this for Recommender System models.
1
KMP_TPAUSE

Controls the behavior of threads when they are waiting for work, aiming to reduce CPU usage. Setting it to 0 indicates threads should not enter an active wait state, optimizing CPU efficiency.

0
KMP_FORKJOIN_BARRIER_PATTERN

Specifies the synchronization pattern for fork/join barriers. dist,dist means a distributed barrier pattern is applied both when threads are forked and joined, potentially reducing synchronization contention.

dist,dist
KMP_PLAIN_BARRIER_PATTERN

Sets the synchronization pattern for plain barriers to dist,dist indicating a distributed pattern that helps manage thread synchronization efficiently during plain barriers.

dist,dist
KMP_REDUCTION_BARRIER_PATTERN

Controls the barrier pattern used in reduction operations (for example, sum or product of arrays across threads). Using dist,dist specifies a distributed pattern to enhance efficiency.

dist,dist
KMP_AFFINITY

Determines how threads are bound to CPU cores. The setting granularity=fine,compact,1,0 binds each thread to a single hardware thread (fine granularity) and places consecutive threads as close to each other as possible (compact), which helps reduce memory access latency and improve cache utilization.

granularity=fine,compact,1,0
ZENDNNL_MATMUL_ALGO

Specifies the MatMul algo to be used.

For FP32/BF16/INT8:

  • AUTO (Auto-Tuner)
  • 0 = Static Decision Tree
  • 1 = AOCL_DLP (Blocked with weight-caching)
  • 2 = oneDNN (Blocked with weight-caching)
  • 3 = LIBXSMM-blocked
  • 4 = AOCL_DLP
  • 5 = oneDNN
  • 6 = LIBXSMM

AUTO is an experimental feature and should be used with at least 8 application warm-up iterations.

Note: Different workloads on different frameworks (PyTorch, TensorFlow) have specific ZENDNNL_MATMUL_ALGO settings for optimized performance. See Optimal Environment Variable Settings for zentorch and Optimal Environment Variable Settings for zentf.

ZENDNNL_MATMUL_ALGO=1

(a) You must set these environment variables explicitly.
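As a quick reference, the values listed in Table 1 can be collected into a shell snippet such as the following. This is only a sketch that copies the table values as-is; the values for OMP_NUM_THREADS and ZENDNNL_LRU_CACHE_CAPACITY in particular are assumptions that you should replace with settings suited to your system and workload.

```shell
# Values copied from Table 1; adjust them for your own system and workload.

# Per the table footnote, these two must be set explicitly.
export ZENDNNL_LRU_CACHE_CAPACITY=1024
export OMP_NUM_THREADS=128          # assumption: replace with your core count

# OpenMP runtime behavior.
export OMP_DYNAMIC=FALSE
export OMP_WAIT_POLICY=ACTIVE
export KMP_BLOCKTIME=1              # the table advises not setting this for Recommender System models
export KMP_TPAUSE=0
export KMP_FORKJOIN_BARRIER_PATTERN=dist,dist
export KMP_PLAIN_BARRIER_PATTERN=dist,dist
export KMP_REDUCTION_BARRIER_PATTERN=dist,dist
export KMP_AFFINITY=granularity=fine,compact,1,0

# ZenDNN kernel selection.
export ZENDNNL_MATMUL_ALGO=1
export ZENDNNL_EMBAG_ALGO=2
export ZENDNNL_EMBAG_THREAD_ALGO=1
```

Sourcing such a snippet before launching the application applies the settings to that shell session only.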

LLVM OpenMP

The LLVM OpenMP runtime provides the libraries and compiler directives needed to implement parallelism in programs.

Developers can use LLVM OpenMP 18.1.8 to compile and run parallel programs written in Fortran and C/C++, taking advantage of shared-memory parallelism to improve the performance and scalability of their applications.

The LLVM OpenMP implementation supports various features, including:

  • Compiler directives for specifying parallel regions, tasks, and data dependencies
  • Library routines for creating and managing teams, parallel loops, and synchronization
  • Environment variables for controlling OpenMP behavior

Complete the following steps to install and use LLVM OpenMP in your Conda environment:

  1. conda install -c conda-forge llvm-openmp=18.1.8=hf5423f3_1 --no-deps -y
  2. export LD_PRELOAD="<path to conda>/pkgs/llvm-openmp-18.1.8-hf5423f3_1/lib/libiomp5.so:$LD_PRELOAD"
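The preload step above can be made a little more defensive, as in the following sketch. CONDA_ROOT is a hypothetical variable standing in for "<path to conda>", and its fallback value of $HOME/miniconda3 is only an assumption; point it at your actual Conda installation. The snippet checks that the library exists before preloading it and avoids leaving a trailing colon when LD_PRELOAD is empty.

```shell
# CONDA_ROOT is a hypothetical placeholder for "<path to conda>";
# the fallback location is an assumption, not a documented default.
CONDA_ROOT="${CONDA_ROOT:-$HOME/miniconda3}"
OMP_LIB="$CONDA_ROOT/pkgs/llvm-openmp-18.1.8-hf5423f3_1/lib/libiomp5.so"

if [ -f "$OMP_LIB" ]; then
  # Prepend the library; add the ":" separator only if LD_PRELOAD is already set.
  export LD_PRELOAD="$OMP_LIB${LD_PRELOAD:+:$LD_PRELOAD}"
else
  echo "libiomp5.so not found at $OMP_LIB" >&2
fi
```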

The following additional settings tune performance when zentf is used with the TensorFlow framework.

Table 2. zentf Environment Variables-Generic
Environment Variable Description Default Value/User Defined Value
TF_ENABLE_ZENDNN_OPTS

Set TF_ENABLE_ONEDNN_OPTS=0 when you want to enable vanilla training and inference.

Set it to 1 along with TF_ENABLE_ONEDNN_OPTS=0 to enable ZenDNN for inference.

0
TF_ENABLE_ONEDNN_OPTS

By default, TensorFlow ships with oneDNN optimizations enabled. Set it to 0 when you enable ZenDNN.

1
TF_ZENDNN_PLUGIN_BF16

Set it to 1 to enable Automatic Mixed Precision (AMP) for BF16.

0
zentf Environment Variables-Optimization
USE_ZENDNN_MATMUL_DIRECT

For optimal single-core MatMul execution, set it to 1.

1
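Based on the descriptions in Table 2, enabling ZenDNN for inference with zentf might look like the following sketch. Enabling BF16 Automatic Mixed Precision is optional and is shown here only as an illustration; leave TF_ZENDNN_PLUGIN_BF16 at 0 if your workload should not use it.

```shell
# Disable the oneDNN optimizations that TensorFlow enables by default,
# then enable ZenDNN for inference (see Table 2).
export TF_ENABLE_ONEDNN_OPTS=0
export TF_ENABLE_ZENDNN_OPTS=1

# Optional: enable Automatic Mixed Precision (AMP) for BF16.
export TF_ZENDNN_PLUGIN_BF16=1

# Single-core MatMul execution setting from the optimization table.
export USE_ZENDNN_MATMUL_DIRECT=1
```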