ZenDNN continues its focus on optimizing inference performance for Recommender Systems and Large Language Models on AMD EPYC™ CPUs. This latest upgrade brings a host of enhancements designed to push the boundaries of efficiency and speed: significant bfloat16 performance improvements, expanded support for cutting-edge models like Llama 3.1 and 3.2, and new INT4 quantized datatype support, including the Activation-aware Weight Quantization (AWQ) algorithm and optimized quantized DLRM models.
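To make the INT4 support concrete: weight-only INT4 quantization schemes of this kind typically store each weight as a 4-bit integer in [-8, 7] with a shared floating-point scale per group of weights. The sketch below is purely illustrative (it is not ZenDNN's or AWQ's actual implementation, and the function names are invented for this example); it shows the round-trip for one weight group under symmetric group-wise quantization:

```python
def quantize_int4(weights, group_size=8):
    """Symmetric group-wise INT4 quantization (illustrative sketch):
    each group of `group_size` weights shares one float scale, and
    values are rounded to 4-bit integers in [-8, 7]."""
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group onto the INT4 range.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        quantized.append([max(-8, min(7, round(w / scale))) for w in group])
    return quantized, scales

def dequantize_int4(quantized, scales):
    """Recover approximate float weights from INT4 codes and scales."""
    return [q * s for qs, s in zip(quantized, scales) for q in qs]

weights = [0.12, -0.53, 0.34, 0.07, -0.91, 0.45, 0.28, -0.16]
q, s = quantize_int4(weights)
recovered = dequantize_int4(q, s)
# Worst-case round-trip error is bounded by half the group scale.
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

AWQ refines this basic scheme by using activation statistics to decide which weight channels to protect from quantization error, which is why it preserves accuracy better than naive rounding.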
Under the hood, ZenDNN’s AMD-specific optimizations operate at every level. In addition to highly optimized operator microkernels, they include comprehensive graph optimizations such as pattern identification, graph reordering, and operator fusion. Notable improvements include optimized embedding-bag kernels and an enhanced zenMatMul matrix-splitting strategy, both designed to maximize throughput and minimize latency. The result is a significant performance boost over the vanilla frameworks.

Beyond these optimizations, the ZenDNN plug-ins offer broad compatibility, integrating seamlessly with popular frameworks like TensorFlow and PyTorch. This release adds support for PyTorch 2.7 and TensorFlow 2.19, along with a new vLLM + zentorch plugin that delivers better performance on various models compared to vLLM-IPEX. We’ve also enabled Java® integration by contributing and upstreaming a new PluggableDevice feature to the TensorFlow-Java repository, strengthening its core capabilities.
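For PyTorch users, the zentorch plug-in is exposed as a `torch.compile` backend, so enabling the ZenDNN optimizations is a one-line change. The sketch below assumes the `torch` and `zentorch` packages are installed on an AMD EPYC™ system; the model here is a placeholder:

```python
import torch
import zentorch  # importing registers the "zentorch" compile backend

# Placeholder model for illustration; any eager-mode nn.Module works.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Route compilation through the zentorch backend so ZenDNN's graph
# optimizations and microkernels are applied during inference.
compiled_model = torch.compile(model, backend="zentorch")

with torch.no_grad():
    out = compiled_model(torch.randn(32, 128))
```

Because the integration happens at the compile-backend level, existing model code and checkpoints need no other changes.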