The zentorch vLLM plugin integrates zentorch with vLLM's V1 engine to deliver optimized large language model inference on AMD EPYC™ CPUs. By leveraging ZenDNN's highly optimized kernels, this plugin accelerates both attention and non-attention operations in vLLM, providing significant throughput improvements for popular LLMs.
The plugin uses vLLM's platform and general plugin entry points to:
- Inject zentorch optimization passes into torch.compile
- Disable the default replacement with Intel oneDNN kernels so that zentorch kernels are substituted instead
- Enable CPU-only profiling
Key Features
- Plug-and-Play Acceleration: No code modifications are required. Install zentorch alongside vLLM and acceleration is applied automatically.
- Seamless vLLM Integration: vLLM detects zentorch and transparently uses ZenDNN-optimized GEMM and Embedding kernels on supported CPUs.
- Optimized for Modern x86 CPUs: Delivers its best performance on AMD EPYC™ processors, and also supports other x86 CPUs that provide the required instruction set extensions.
- Powered by ZenDNN: Leverages AMD's ZenDNN library for state-of-the-art, CPU-optimized neural network operations.
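
As an illustration of the plug-and-play claim, the sketch below is an ordinary vLLM offline-inference script with no zentorch import anywhere in it. It is a minimal example under assumptions: the model name `facebook/opt-125m` is only a placeholder, the helper names `stack_available` and `run_demo` are hypothetical, and a CPU build of vLLM is assumed to be installed. The availability check runs without either package; the generation step runs only when both are present.

```python
from importlib.util import find_spec


def stack_available():
    """Report which of the two packages can be found in this environment."""
    return {name: find_spec(name) is not None for name in ("vllm", "zentorch")}


def run_demo(model="facebook/opt-125m"):
    """Generate a short completion with plain vLLM code (placeholder model name)."""
    # Imported lazily so stack_available() works even without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    params = SamplingParams(temperature=0.0, max_tokens=32)
    outputs = llm.generate(["The capital of France is"], params)
    return [out.outputs[0].text for out in outputs]


if __name__ == "__main__":
    status = stack_available()
    print(status)
    if all(status.values()):
        # Nothing here refers to zentorch; the plugin is expected to be
        # discovered by vLLM on its own when it is installed.
        print(run_demo())
```

The same script should run unchanged with or without zentorch installed, which is the point of the plugin approach; only the kernels used underneath are expected to differ.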