The environment variables used to set up paths, control logs, and tune performance are listed here.
The settings given in the following table are used in the ZenDNN library and apply to zentorch and zentf.
| Environment Variable | Description | Default Value/User Defined Value |
|---|---|---|
| Generic (Setup paths and control logs) | | |
| ZENDNNL_<module>_LOG_LEVEL | Enables ZenDNN logs. See Logging and Debugging for details on how to use logs. | 0 |
| ZENDNNL_LRU_CACHE_CAPACITY | Sets the maximum capacity of the LRU cache for blocked weights of the MatMul algo. You can modify it as required [a]. | 1024 |
| ZENDNNL_EMBAG_THREAD_ALGO | Sets Embedding Bag thread type. This is the recommended setting for RecSys models. | 1 |
| ZENDNNL_EMBAG_ALGO | Sets the embedding bag backend kernel. For FP32/BF16/INT4: | 2 |
| OMP_DYNAMIC | OMP variable to control dynamic adjustment of OMP threads. Refer to OpenMP documentation for details. | FALSE |
| Optimized (Tune performance) | | |
| OMP_NUM_THREADS | Sets the number of OMP threads. Generally, this is equal to the number of cores present. Set it based on the number of cores in your system [a]. | 128 |
| OMP_WAIT_POLICY | Sets the behavior of waiting threads. Refer to the OMP documentation for details. | ACTIVE |
| KMP_BLOCKTIME | Sets the amount of time, in milliseconds, that a thread should wait before sleeping when a parallel region ends. Setting it to 1 minimizes idle time and can improve responsiveness for short tasks by quickly putting threads to sleep after work is complete. Note: Do not set this for Recommender System models. | 1 |
| KMP_TPAUSE | Controls the behavior of threads when they are waiting for work, aiming to reduce CPU usage. Setting it to 0 indicates threads should not enter an active wait state, optimizing CPU efficiency. | 0 |
| KMP_FORKJOIN_BARRIER_PATTERN | Specifies the synchronization pattern for fork/join barriers. dist,dist means a distributed barrier pattern is applied both when threads are forked and joined, potentially reducing synchronization contention. | dist,dist |
| KMP_PLAIN_BARRIER_PATTERN | Sets the synchronization pattern for plain barriers. dist,dist selects a distributed pattern that helps manage thread synchronization efficiently during plain barriers. | dist,dist |
| KMP_REDUCTION_BARRIER_PATTERN | Controls the barrier pattern used in reduction operations (for example, sum or product of arrays across threads). Using dist,dist specifies a distributed pattern to enhance efficiency. | dist,dist |
| KMP_AFFINITY | Determines how threads are bound to CPU cores. The setting granularity=fine,compact,1,0 specifies fine-grained affinity with threads compacted to as few cores as possible, which helps minimize memory access latency and maximize cache utilization. | granularity=fine,compact,1,0 |
| ZENDNNL_MATMUL_ALGO | Specifies the MatMul algo to be used. For FP32/BF16/INT8: Auto is an experimental feature and should be used with at least 8 application warm-up iterations. Note: Different workloads on different frameworks (PyTorch, TensorFlow) have specific ZENDNNL_MATMUL_ALGO settings for optimized performance. See Optimal Environment Variable Settings for zentorch and Optimal Environment Variable Settings for zentf. | |
| [a] You must set these environment variables explicitly. | | |
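
As a minimal sketch, the values listed in the table above could be exported in a shell session before launching a workload. The values are copied from the table. `OMP_NUM_THREADS=128` is only the table's example and should be replaced with the core count of your system. `ZENDNNL_<module>_LOG_LEVEL` and `ZENDNNL_MATMUL_ALGO` are left out on purpose, because the module name and the best MatMul setting depend on the workload.

```shell
# Generic settings (values copied from the table above)
export ZENDNNL_LRU_CACHE_CAPACITY=1024
export ZENDNNL_EMBAG_THREAD_ALGO=1
export ZENDNNL_EMBAG_ALGO=2
export OMP_DYNAMIC=FALSE

# Performance tuning
# Assumption: a 128-core system; replace 128 with your own core count.
export OMP_NUM_THREADS=128
export OMP_WAIT_POLICY=ACTIVE
# Per the table, do not set KMP_BLOCKTIME for Recommender System models.
export KMP_BLOCKTIME=1
export KMP_TPAUSE=0
export KMP_FORKJOIN_BARRIER_PATTERN=dist,dist
export KMP_PLAIN_BARRIER_PATTERN=dist,dist
export KMP_REDUCTION_BARRIER_PATTERN=dist,dist
export KMP_AFFINITY=granularity=fine,compact,1,0
```

Keeping these exports in a small script that is sourced before each run makes it easier to reproduce the same settings across benchmarks.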
LLVM OpenMP
The LLVM OpenMP runtime provides the libraries and compiler directives needed to implement parallelism in programs.
Developers can use LLVM OpenMP 18.1.8 to compile and run parallel programs written in Fortran and C/C++, taking advantage of shared memory parallelism and improving the performance and scalability of their applications.
The LLVM OpenMP implementation supports various features, including:
- Compiler directives for specifying parallel regions, tasks, and data dependencies
- Library routines for creating and managing teams, parallel loops, and synchronization
- Environment variables for controlling OpenMP behavior
Complete the following steps to install and use LLVM OpenMP in your Conda environment:
- conda install -c conda-forge llvm-openmp=18.1.8=hf5423f3_1 --no-deps -y
- export LD_PRELOAD="<path to conda>/pkgs/llvm-openmp-18.1.8-hf5423f3_1/lib/libiomp5.so:$LD_PRELOAD"
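
Because the preload path depends on where Conda is installed, it can help to check that the library file exists before changing `LD_PRELOAD`. The following is a sketch only: `preload_if_present` is a hypothetical helper, not part of ZenDNN or Conda, and the path in the usage comment is the placeholder from the step above.

```shell
# Hypothetical helper: prepend a shared library to LD_PRELOAD only if the file exists.
preload_if_present() {
  lib="$1"
  if [ -f "$lib" ]; then
    # Append the previous LD_PRELOAD value only when it is non-empty,
    # which avoids leaving a trailing colon.
    export LD_PRELOAD="$lib${LD_PRELOAD:+:$LD_PRELOAD}"
  else
    echo "warning: $lib not found; LD_PRELOAD unchanged" >&2
    return 1
  fi
}

# Usage (replace <path to conda> with your Conda installation directory):
# preload_if_present "<path to conda>/pkgs/llvm-openmp-18.1.8-hf5423f3_1/lib/libiomp5.so"
```

A non-zero return value signals that the file was not found, so a launch script can stop early instead of silently running with a different OpenMP runtime.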
Additional settings used to tune performance with zentf for the TensorFlow framework
| Environment Variable | Description | Default Value/User Defined Value |
|---|---|---|
| TF_ENABLE_ZENDNN_OPTS | Set it to 1 along with | 0 |
| TF_ENABLE_ONEDNN_OPTS | By default, TensorFlow is shipped with oneDNN optimizations enabled. Hence, set it to 0 when you enable ZenDNN. | 1 |
| TF_ZENDNN_PLUGIN_BF16 | Set it to 1 to enable Automatic Mixed Precision (AMP) for BF16. | 0 |
| zentf Environment Variables - Optimization | | |
| USE_ZENDNN_MATMUL_DIRECT | For optimal single-core MatMul execution, set it to 1. | 1 |
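
A minimal sketch of how the zentf settings in this table might be combined in a shell session, assuming zentf is already installed. The values follow the descriptions in the table. Enabling BF16 is optional and is shown only as an example.

```shell
# Enable ZenDNN optimizations and turn off the default oneDNN optimizations
export TF_ENABLE_ZENDNN_OPTS=1
export TF_ENABLE_ONEDNN_OPTS=0

# Optional: enable Automatic Mixed Precision (AMP) for BF16
export TF_ZENDNN_PLUGIN_BF16=1

# Table value for optimal single-core MatMul execution
export USE_ZENDNN_MATMUL_DIRECT=1
```

These variables need to be set in the environment of the process that imports TensorFlow, so export them before starting the Python interpreter.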