No code changes are required. Once the plugin is installed, simply run your vLLM inference workload as usual: the plugin is automatically detected and used for inference on supported x86 CPUs that provide the required ISA features. While it is optimized for AMD EPYC™ CPUs, it may also work on other compatible x86 processors.
# Example: Standard vLLM inference code
from vllm import LLM, SamplingParams

# The zentorch plugin is detected and used automatically if installed
llm = LLM(model="microsoft/phi-2")
params = SamplingParams(temperature=0.0, top_p=0.95)
outputs = llm.generate(["Hello, world!"], sampling_params=params)
# generate() returns one RequestOutput per prompt; print the generated text
for output in outputs:
    print(output.outputs[0].text)
The zentorch plugin accelerates attention automatically when it is installed and the workload runs on supported x86 CPUs, with the best performance on AMD EPYC™ CPUs.
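As a quick sanity check, you can try importing the plugin before starting inference; this is only a sketch, assuming the package is importable as zentorch, and is not part of the vLLM API.

# Optional check: if this import succeeds, the plugin is installed
# and vLLM can pick it up automatically (package name "zentorch" assumed).
try:
    import zentorch  # noqa: F401
    print("zentorch plugin is installed")
except ImportError:
    print("zentorch plugin not found; vLLM will use the default CPU path")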
Recommendation
For optimal performance with vLLM CPU inference, set the temperature parameter to 0.0 (greedy decoding) and use supported x86 CPUs, with the best results on the latest AMD EPYC™ CPUs. Also, if NUMA is enabled on the hardware platform, it is recommended to use the NPS (NUMA nodes per socket) setting that performs best for your workload.
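As a minimal sketch of applying these recommendations, the snippet below uses vLLM's CPU backend environment variables to bind OpenMP threads to a single NUMA node and to size the KV cache. The core range and cache size are assumptions; adjust them to your topology (for example, as reported by numactl -H).

import os
# Assumed values: cores 0-31 form one NUMA node; 40 GiB reserved for the KV cache.
os.environ["VLLM_CPU_OMP_THREADS_BIND"] = "0-31"
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "40"

from vllm import LLM, SamplingParams
llm = LLM(model="microsoft/phi-2")
params = SamplingParams(temperature=0.0)  # greedy decoding, as recommended above
print(llm.generate(["Hello, world!"], sampling_params=params)[0].outputs[0].text)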
Support and Feedback
For questions, feedback, or to contribute, visit the AMD ZenDNN PyTorch Plugin GitHub page.