Weight Only Quantized Models

ZenDNN User Guide (57300)

Document ID
57300
Release Date
2026-04-13
Revision
5.2.1 English

Quantizing Models

Hugging Face models can be quantized with the Int4WeightOnlyOpaqueTensorConfig and IntxWeightOnlyConfig configurations from TorchAO. zentorch supports Weight Only Quantization (WOQ) with 4-bit weights and BF16 activations. Per-group quantization granularity is available with both configurations; per-channel granularity is available only with IntxWeightOnlyConfig.
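To make the difference between the two granularities concrete, the following minimal sketch quantizes one row of weights to signed 4-bit integers with a symmetric mapping, once with a single scale for the whole row (per-channel) and once with one scale per group. It is an illustration in plain Python, not TorchAO's implementation: the helper names, the sample row, and the choice of scale (largest magnitude divided by 7) are assumptions, and TorchAO's exact scale and rounding rules may differ.

```python
def quantize_symmetric_int4(weights, group_size):
    """Quantize a 1-D list of floats to signed 4-bit integers, one scale per group."""
    qmin, qmax = -8, 7  # representable range of a signed 4-bit integer
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric mapping: no zero point, the scale alone maps floats to integers.
        # Assumption: scale = max_abs / 7; real libraries may pick a different divisor.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q_values.append(max(qmin, min(qmax, round(w / scale))))
    return q_values, scales


def dequantize(q_values, scales, group_size):
    """Map the integers back to floats using the scale of each value's group."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


def total_abs_error(weights, group_size):
    """Sum of absolute round-trip errors for a given group size."""
    q_values, scales = quantize_symmetric_int4(weights, group_size)
    restored = dequantize(q_values, scales, group_size)
    return sum(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical row: small weights plus one outlier in the second half.
row = [0.10, -0.20, 0.05, 0.15, 4.00, -0.30, 0.25, 0.10]

per_channel_error = total_abs_error(row, group_size=len(row))  # one scale per row
per_group_error = total_abs_error(row, group_size=4)           # one scale per 4 weights

print("per-channel error:", per_channel_error)
print("per-group error:  ", per_group_error)
```

With per-channel granularity the outlier sets the scale for the whole row, so the small weights in the first half all round to zero. With per-group granularity the outlier only affects the group that contains it, which is why smaller groups usually preserve accuracy better at the cost of storing more scales.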

To quantize a model, use the TorchAoConfig integration in Hugging Face Transformers as shown below:

  • For per-group quantization (recommended, group_size=128):
    from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
    from torchao.quantization.quant_api import IntxWeightOnlyConfig
    from torchao.quantization.quant_primitives import MappingType
    from torchao.quantization.granularity import PerGroup
    import torch
    
    quantization_config = TorchAoConfig(
        IntxWeightOnlyConfig(
            weight_dtype=torch.int4,
            mapping_type=MappingType.SYMMETRIC,
            scale_dtype=torch.bfloat16,
            granularity=PerGroup(128),
        )
    )
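    # model_name is the Hugging Face model ID or local path of the model to quantize; define it before this call.
    quantized_model = AutoModelForCausalLM.from_pretrained(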
        model_name,
        dtype=torch.bfloat16,
        device_map="cpu",
        quantization_config=quantization_config,
        trust_remote_code=True
    )
  • For per-channel quantization, replace PerGroup(128) with PerChannel() (the remaining imports and the from_pretrained call are the same as in the per-group example):
    from torchao.quantization.granularity import PerChannel
    
    quantization_config = TorchAoConfig(
        IntxWeightOnlyConfig(
            weight_dtype=torch.int4,
            mapping_type=MappingType.SYMMETRIC,
            scale_dtype=torch.bfloat16,
            granularity=PerChannel(),
        )
    )
  • For per-group quantization with Int4WeightOnlyOpaqueTensorConfig (group_size=128):
    import torch
    from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
    from torchao.prototype.int4_opaque_tensor import Int4WeightOnlyOpaqueTensorConfig
    
    quantization_config = TorchAoConfig(Int4WeightOnlyOpaqueTensorConfig(group_size=128))
    
    quantized_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        dtype=torch.bfloat16,
        device_map="cpu",
        quantization_config=quantization_config,
        trust_remote_code=True
    )
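The storage cost of the scales under each granularity can be estimated with simple arithmetic. The sketch below is an idealized estimate, not a measurement of TorchAO's tensor layout: it assumes tightly packed 4-bit weights and one 16-bit scale per group, and it ignores zero-point tensors, padding, and layers that are not quantized. The function name and the row length of 4096 are hypothetical.

```python
def bits_per_weight(in_features, group_size=None, weight_bits=4, scale_bits=16):
    """Average stored bits per weight for one row of a linear layer.

    group_size=None models per-channel quantization (one scale for the whole row).
    """
    if group_size is None:
        n_scales = 1
    else:
        n_scales = -(-in_features // group_size)  # ceiling division
    return weight_bits + scale_bits * n_scales / in_features


# Hypothetical row length of 4096 input features.
per_group = bits_per_weight(4096, group_size=128)  # 32 BF16 scales per row
per_channel = bits_per_weight(4096)                # 1 BF16 scale per row

print("per-group bits per weight:  ", per_group)
print("per-channel bits per weight:", per_channel)
```

Under these assumptions, per-group quantization with group_size=128 adds 16/128 = 0.125 bits per weight on top of the 4-bit values, while the per-channel overhead is negligible for long rows.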
Attention: zentorch does not currently support Mixture-of-Experts (MoE) models with any quantization scheme.
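Because MoE models are not supported, it can help to check a model's configuration before attempting quantization. The following is a heuristic sketch only, not part of zentorch or TorchAO: the function name is hypothetical, and the attribute names are an assumption based on expert-count fields that several Hugging Face MoE configurations expose. A model that stores its expert count under a different name would not be detected.

```python
from types import SimpleNamespace

# Assumption: attribute names used by some Hugging Face MoE configs for the expert count.
_EXPERT_COUNT_ATTRS = ("num_local_experts", "num_experts", "n_routed_experts")


def looks_like_moe(config):
    """Return True if the config exposes an expert count greater than one."""
    for attr in _EXPERT_COUNT_ATTRS:
        count = getattr(config, attr, None)
        if isinstance(count, int) and count > 1:
            return True
    return False


# Stand-in config objects for illustration; real configs come from the model repository.
dense_config = SimpleNamespace(hidden_size=4096)
moe_config = SimpleNamespace(hidden_size=4096, num_local_experts=8)

print(looks_like_moe(dense_config))
print(looks_like_moe(moe_config))
```

In practice the config object would come from AutoConfig.from_pretrained(model_name) in Transformers, and the check would run before the from_pretrained call that applies the quantization config.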