Dynamically Quantized Models

ZenDNN User Guide (57300)

Document ID: 57300
Release Date: 2026-04-13
Revision: 5.2.1 English

Quantizing models

Hugging Face models can be quantized using Int8DynamicActivationInt8WeightConfig from TorchAO. For dynamically quantized models, zentorch supports INT8 activations that are quantized dynamically at per-token granularity, combined with INT8 weights that are quantized at per-channel granularity.
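To make the two granularities concrete, the following is a minimal pure-Python sketch of symmetric INT8 quantization applied to a small linear layer. It is an illustration only, not the TorchAO or zentorch implementation: the function names are hypothetical, and the symmetric range of [-127, 127] is an assumption made for simplicity.

```python
def quantize_rows_symmetric_int8(matrix):
    """Quantize each row of a 2-D list with its own symmetric INT8 scale."""
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # Guard against an all-zero row so the division below is safe.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def int8_dynamic_linear(activations, weights):
    """Compute activations @ weights.T using INT8 operands (illustrative)."""
    # Per-token: one scale per activation row, computed at run time.
    q_act, act_scales = quantize_rows_symmetric_int8(activations)
    # Per-channel: one scale per output channel (one row of the weights).
    # In a real flow these scales are computed once, at quantization time.
    q_w, w_scales = quantize_rows_symmetric_int8(weights)
    output = []
    for q_row, a_scale in zip(q_act, act_scales):
        out_row = []
        for w_row, w_scale in zip(q_w, w_scales):
            acc = sum(a * w for a, w in zip(q_row, w_row))  # integer accumulate
            out_row.append(acc * a_scale * w_scale)  # dequantize the result
        output.append(out_row)
    return output
```

Each activation row (token) and each weight row (output channel) gets its own scale, so one outlier only reduces precision within its own row rather than across the whole tensor.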

To quantize a model, use the TorchAoConfig integration in Hugging Face Transformers, as shown in the following example.


import torch

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TorchAoConfig,
)
from torchao.quantization import Int8DynamicActivationInt8WeightConfig
from torchao.quantization.quant_primitives import MappingType

quantization_config = TorchAoConfig(
    Int8DynamicActivationInt8WeightConfig(
        version=2,
        act_mapping_type=MappingType.SYMMETRIC,
    )
)

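# args.model_name is assumed to hold a Hugging Face model ID or a local path
# supplied by the calling script.
quantized_model = AutoModelForCausalLM.from_pretrained(
    args.model_name,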
    torch_dtype=torch.bfloat16,
    device_map="cpu",
    quantization_config=quantization_config,
    trust_remote_code=True,
)
Note:
  • Ensure that the required dependencies are installed. Quote the version specifiers so that the shell does not treat >= as a redirection: pip install "transformers>=4.57.6" "torchao==0.16.0"
  • zentorch v5.2.1 is compatible with TorchAO. AMD Quark is no longer required for quantization.
  • Use MappingType.SYMMETRIC for optimal performance with zentorch.
  • Use scale_dtype=torch.bfloat16 for compatibility with AMD EPYC™ CPU optimizations.
  • Pass safe_serialization=False to save_pretrained when saving the quantized model, for compatibility with zentorch.
  • For per-group quantization, we recommend a group_size of 128, as this configuration has been validated by zentorch across a broad set of mainstream models.
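The saving note above can be sketched as a small helper. This is a hedged illustration rather than a documented zentorch utility: save_for_zentorch and output_dir are hypothetical names, and the helper assumes that the model and tokenizer objects expose the standard Hugging Face save_pretrained method.

```python
def save_for_zentorch(model, tokenizer, output_dir):
    """Save a quantized model and its tokenizer to output_dir (hypothetical helper)."""
    # safe_serialization=False follows the note above: it asks save_pretrained
    # not to use the safetensors format when writing the checkpoint.
    model.save_pretrained(output_dir, safe_serialization=False)
    # The tokenizer is saved alongside so the directory can be reloaded as a unit.
    tokenizer.save_pretrained(output_dir)
    return output_dir
```

With the objects from the quantization example, a call might look like save_for_zentorch(quantized_model, tokenizer, "quantized-model-dir"), where the directory name is a placeholder.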