AOCL supports multi-threaded execution for many of its math kernels. You can
control the number of threads via environment variables such as
OMP_NUM_THREADS or through library-specific APIs.
Specifying a Prime Number of Threads
If you specify a prime number of threads greater than 11 (e.g., 13, 17, 19), AOCL internally reduces the thread count by 1. This enables efficient 2D parallelization, which relies on factorizing the thread count to distribute the workload evenly.
Using a prime number can lead to:
Reduced thread utilization (e.g., 13 becomes 12 internally)
Potential performance degradation due to uneven workload partitioning
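The rule described above can be modeled with a few shell helpers. This is a minimal sketch for reasoning about thread counts, not AOCL's internal code: the function names (is_prime, effective_threads, balanced_grid) are hypothetical, and the "most balanced factor pair" is only an assumed illustration of why a factorizable count suits a 2D split.

```shell
#!/usr/bin/env bash
# Illustrative model only. These helpers mirror the rule described in the
# text; they are NOT part of AOCL, and all names are hypothetical.

# Succeeds (exit status 0) if $1 is a prime number.
is_prime() {
  local n=$1 d
  if (( n < 2 )); then return 1; fi
  for (( d = 2; d * d <= n; d++ )); do
    if (( n % d == 0 )); then return 1; fi
  done
  return 0
}

# Prints the thread count used under the described rule: a prime count
# greater than 11 is reduced by 1; any other count is left unchanged.
effective_threads() {
  local n=$1
  if (( n > 11 )) && is_prime "$n"; then
    echo $(( n - 1 ))
  else
    echo "$n"
  fi
}

# Prints the most balanced factor pair "a x b" with a * b == $1 and a <= b.
# Assumption: shown only to illustrate how factorization helps a 2D split.
balanced_grid() {
  local n=$1 a best=1
  for (( a = 1; a * a <= n; a++ )); do
    if (( n % a == 0 )); then best=$a; fi
  done
  echo "${best} x $(( n / best ))"
}
```

For example, effective_threads 13 prints 12, and balanced_grid 12 prints "3 x 4", whereas balanced_grid 13 prints "1 x 13": a prime count leaves only a one-dimensional split.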
Workaround
To keep the prime thread count, set the following environment variable:
export BLIS_JC_NT=<prime_number>
Performance may still be suboptimal with this workaround, because the internal parallelization strategies may not be fully effective with prime thread counts.
Recommendation
For best performance, use thread counts that are:
Powers of two (e.g., 4, 8, 16)
Divisible by the number of physical cores
Composite numbers that allow better factorization
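A candidate thread count can be checked against these properties before it is exported. The helper below is a hypothetical sketch, not an AOCL tool: the name check_thread_count and its output format are assumptions, and "divisible by the number of physical cores" is implemented in its literal sense (the thread count is a whole multiple of the core count). The caller supplies the physical core count.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of AOCL.
# Usage: check_thread_count <threads> <physical_cores>
# Reports which of the listed properties the thread count has.
check_thread_count() {
  local n=$1 cores=$2 d
  local pow2=no by_cores=no composite=no

  # Power of two: exactly one bit is set.
  if (( n > 0 && (n & (n - 1)) == 0 )); then pow2=yes; fi

  # Literal reading: the thread count is a whole multiple of the core count.
  if (( cores > 0 && n % cores == 0 )); then by_cores=yes; fi

  # Composite: has a divisor other than 1 and itself.
  for (( d = 2; d * d <= n; d++ )); do
    if (( n % d == 0 )); then composite=yes; break; fi
  done

  echo "power_of_two=$pow2 divisible_by_cores=$by_cores composite=$composite"
}
```

Calling check_thread_count 13 8 reports "no" for all three properties, which marks 13 as a poor choice under the guidance above; check_thread_count 16 8 reports "yes" for all three.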