The vLLM-zentorch plugin brings together zentorch and vLLM to deliver efficient, plug-and-play large language model (LLM) inference on modern x86 CPU servers. By leveraging ZenDNN's highly optimized kernels, this plugin accelerates both attention and non-attention operations in vLLM, providing significant throughput improvements for popular LLMs.
zentorch is designed for acceleration of PyTorch workloads on CPUs, offering drop-in, high-performance implementations of key deep learning operations. When used with vLLM, zentorch automatically replaces default attention mechanisms and other compute-intensive kernels with ZenDNN-optimized versions—no code changes required. While optimized for AMD EPYC™ CPUs, the plugin supports any x86 CPU with the required ISA features.
Key Features
- Plug-and-Play Acceleration: No code modifications required—just install zentorch alongside vLLM for automatic acceleration.
- Seamless vLLM Integration: vLLM detects zentorch and transparently uses ZenDNN-optimized attention and non-attention kernels for supported CPUs.
- Optimized for Modern x86 CPU Servers: Delivers best-in-class performance on AMD EPYC™ processors, while supporting a broad range of x86 CPUs with the necessary instruction set.
- Powered by ZenDNN: Leverages AMD's ZenDNN library for state-of-the-art, CPU-optimized neural network operations.
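The plug-and-play behavior above hinges on vLLM noticing that zentorch is importable in the same environment. A minimal, stdlib-only sketch of that kind of detection (the helper name `zentorch_available` is illustrative and not part of either library's API):

```python
import importlib.util


def zentorch_available() -> bool:
    # find_spec returns None when the package is not installed,
    # so this probe is safe whether or not zentorch is present.
    return importlib.util.find_spec("zentorch") is not None


if zentorch_available():
    print("zentorch found: ZenDNN-optimized kernels can be used")
else:
    print("zentorch not installed: default CPU kernels will be used")
```

Because the check only probes for the package rather than importing it, it costs nothing when zentorch is absent and requires no changes to user code, which is what makes the acceleration drop-in.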
Compatibility
- vLLM: v0.9.0 or later (v0.9.0 is the explicitly tested release; earlier versions may not work)
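Since only vLLM v0.9.0 and later is covered, a version gate at startup can catch unsupported installs early. A stdlib-only sketch under the assumption that the installed vLLM version string is available; the parser keeps only the leading numeric release segment, so pre-release suffixes like `rc1` are ignored:

```python
MIN_VLLM = (0, 9, 0)  # minimum supported vLLM release, per the compatibility note


def parse_version(v: str) -> tuple:
    # "0.9.1rc1" -> (0, 9, 1): take leading digits of each dotted piece.
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def vllm_supported(version: str) -> bool:
    return parse_version(version) >= MIN_VLLM


print(vllm_supported("0.9.0"))  # True
print(vllm_supported("0.8.5"))  # False
```

In a real deployment, the version string would come from `importlib.metadata.version("vllm")`; a dedicated parser such as `packaging.version` handles more exotic version strings than this sketch does.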