No code changes are required. Once installed, simply run your vLLM inference workload as usual. The plugin will be automatically detected and used for inference on supported x86 CPUs that meet the required ISA features. While optimized for AMD EPYC™ CPUs, it may also function on other compatible x86 processors.
Note: Upon importing vLLM, you should see the following message in the logs:
INFO [__init__.py] Platform plugin zentorch is activated
Environment Configuration
Running the plugin with ZENDNNL_MATMUL_ALGO=1 (the default) is recommended.
Environment Variables
export VLLM_CPU_KVCACHE_SPACE=120 # GB for KV cache
export VLLM_CPU_OMP_THREADS_BIND=0-127 # CPU cores to use
export TORCHINDUCTOR_FREEZING=1
export VLLM_USE_AOT_COMPILE=0
export TORCHINDUCTOR_AUTOGRAD_CACHE=0
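If you launch vLLM from a Python script rather than a shell, the same settings can be applied through `os.environ` before vLLM is imported. The sketch below mirrors the values listed above. The helper name `zentorch_vllm_env` and its parameters are hypothetical, and the core range assumes contiguous core IDs starting at 0, which may not match your machine's topology.

```python
import os


def zentorch_vllm_env(num_cores: int, kv_cache_gb: int) -> dict[str, str]:
    """Build the environment variables listed above for a given core count.

    Assumes cores are numbered contiguously from 0 to num_cores - 1.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    if kv_cache_gb < 1:
        raise ValueError("kv_cache_gb must be at least 1")
    core_range = "0" if num_cores == 1 else f"0-{num_cores - 1}"
    return {
        "VLLM_CPU_KVCACHE_SPACE": str(kv_cache_gb),  # GB for KV cache
        "VLLM_CPU_OMP_THREADS_BIND": core_range,     # CPU cores to use
        "TORCHINDUCTOR_FREEZING": "1",
        "VLLM_USE_AOT_COMPILE": "0",
        "TORCHINDUCTOR_AUTOGRAD_CACHE": "0",
    }


if __name__ == "__main__":
    # Apply the settings before `import vllm` so they are visible at import time.
    os.environ.update(zentorch_vllm_env(num_cores=128, kv_cache_gb=120))
    print(os.environ["VLLM_CPU_OMP_THREADS_BIND"])
```

Setting the variables before the import matters because libraries commonly read their configuration when they are first loaded.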
Performance Libraries
Install and preload tcmalloc and llvm-openmp for best performance:
# tcmalloc
# The following command is for Ubuntu
sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD
# llvm-openmp
conda install -c conda-forge llvm-openmp=18.1.8=hf5423f3_1 -y
export LD_PRELOAD="$CONDA_PREFIX/lib/libiomp5.so:$LD_PRELOAD"
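A mistyped path in `LD_PRELOAD` is easy to miss, because the dynamic loader only prints a warning and continues. The sketch below reports `LD_PRELOAD` entries that are given as paths but do not exist on disk. The helper name `missing_preloads` is hypothetical. Bare library names without a slash are skipped, since the loader resolves those through its own search path.

```python
import os


def missing_preloads(ld_preload: str) -> list[str]:
    """Return path-style LD_PRELOAD entries that are not existing files."""
    # LD_PRELOAD entries may be separated by colons or spaces.
    entries = [p for p in ld_preload.replace(" ", ":").split(":") if p]
    # Only check entries that contain a slash; bare names are resolved
    # by the dynamic loader and cannot be verified with a file check.
    return [p for p in entries if "/" in p and not os.path.isfile(p)]


if __name__ == "__main__":
    missing = missing_preloads(os.environ.get("LD_PRELOAD", ""))
    if missing:
        print("LD_PRELOAD entries not found:", missing)
    else:
        print("All path-style LD_PRELOAD entries exist.")
```

Run this in the same shell session in which the exports above were applied, so that it sees the same `LD_PRELOAD` value as the vLLM process.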
Example
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B", dtype="bfloat16")
params = SamplingParams(temperature=0.8, top_p=0.95)
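outputs = llm.generate(["Hello, world!"], sampling_params=params)
# generate() returns one RequestOutput per prompt; print only the generated text
for output in outputs:
    print(output.outputs[0].text)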
Note: These hardware recommendations are specific to vLLM CPU workloads. ZenTorch can be used independently and may have different requirements or optimizations for other use cases.
Support and Feedback
For questions, feedback, or to contribute, visit the AMD ZenDNN PyTorch Plugin GitHub page.