Quantizing models
Hugging Face models can be quantized using
Int8DynamicActivationInt8WeightConfig from TorchAO. Zentorch
supports dynamically quantized models in which activations are quantized to INT8 at run time with
per-token granularity and weights are quantized to INT8 with per-channel granularity.
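To make the two granularities concrete, here is a minimal sketch in plain PyTorch. It is an illustration of the arithmetic only, not the TorchAO or zentorch implementation; the helper name quantize_symmetric_int8 and the tensor sizes are assumptions chosen for the example.

```python
import torch


def quantize_symmetric_int8(x: torch.Tensor, dim: int):
    """Symmetric INT8 quantization with one scale per slice along `dim`.

    Hypothetical helper for illustration; not a TorchAO or zentorch API.
    """
    # Largest magnitude along the reduced dimension; clamp avoids division by zero.
    max_abs = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale


torch.manual_seed(0)
activations = torch.randn(4, 16)  # [tokens, in_features]
weights = torch.randn(8, 16)      # [out_features, in_features]

# Per-token: one scale per activation row, computed from the live input.
q_act, act_scale = quantize_symmetric_int8(activations, dim=-1)  # scale shape [4, 1]

# Per-channel: one scale per output channel (weight row), computed once.
q_w, w_scale = quantize_symmetric_int8(weights, dim=-1)          # scale shape [8, 1]

# Integer matmul, then rescale each output by (token scale * channel scale).
acc = q_act.to(torch.int32) @ q_w.to(torch.int32).t()
approx = acc.to(torch.float32) * act_scale * w_scale.t()

exact = activations @ weights.t()
max_err = (approx - exact).abs().max().item()
print(f"max abs error vs. float matmul: {max_err:.4f}")
```

Because every token and every output channel carries its own scale, a single outlier row does not reduce the precision of the others.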
To quantize a model, use the TorchAoConfig integration in Hugging Face Transformers, as shown below:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TorchAoConfig,
)
from torchao.quantization import Int8DynamicActivationInt8WeightConfig
from torchao.quantization.quant_primitives import MappingType

# INT8 dynamic activations and INT8 weights, with symmetric activation mapping.
quantization_config = TorchAoConfig(
    Int8DynamicActivationInt8WeightConfig(
        version=2,
        act_mapping_type=MappingType.SYMMETRIC,
    )
)

# args.model_name is expected to hold the Hugging Face model ID or a local path.
# The model is quantized while it is loaded on the CPU.
quantized_model = AutoModelForCausalLM.from_pretrained(
    args.model_name,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
    quantization_config=quantization_config,
    trust_remote_code=True,
)
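Once loaded, the quantized model can be written to disk with save_pretrained. The following is a minimal sketch, assuming the model and tokenizer objects expose the standard Transformers save_pretrained method; the helper name save_quantized and the output directory are hypothetical. The safe_serialization=False argument follows the zentorch compatibility note in this section.

```python
from pathlib import Path


def save_quantized(model, tokenizer, output_dir: str) -> Path:
    """Save a quantized model and its tokenizer to output_dir.

    Hypothetical helper for illustration; it only wraps save_pretrained.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # safe_serialization=False is the setting the notes recommend for zentorch.
    model.save_pretrained(out, safe_serialization=False)
    tokenizer.save_pretrained(out)
    return out
```

With the objects from the snippet above, a call might look like save_quantized(quantized_model, AutoTokenizer.from_pretrained(args.model_name), "quantized-model").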
Note:
- Ensure you have the required dependencies installed:
  pip install "transformers>=4.57.6" torchao==0.16.0
- zentorch v5.2.1 is compatible with TorchAO. AMD Quark is no longer required for quantization.
- Use MappingType.SYMMETRIC for optimal performance with zentorch.
- Use scale_dtype=torch.bfloat16 for compatibility with AMD EPYC™ CPU optimizations.
- Use safe_serialization=False when saving for compatibility with zentorch.
- For per-group quantization, we recommend a group_size of 128, as this configuration has been validated by zentorch across a broad set of mainstream models.
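As an illustration of what group_size means, the sketch below quantizes a weight matrix with one scale per group of 128 consecutive input features. It is plain PyTorch arithmetic, not the TorchAO or zentorch implementation. The helper name quantize_per_group_int8, the tensor sizes, and the INT8 bit width are assumptions made for the example; the text above does not state which bit width per-group quantization uses.

```python
import torch


def quantize_per_group_int8(weight: torch.Tensor, group_size: int = 128):
    """Symmetric per-group quantization of a [out_features, in_features] weight.

    Hypothetical helper for illustration; INT8 is an assumed bit width.
    """
    out_features, in_features = weight.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    # Split each row into groups of `group_size` consecutive input features.
    grouped = weight.reshape(out_features, in_features // group_size, group_size)
    # One scale per (row, group); clamp avoids division by zero.
    max_abs = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q = torch.round(grouped / scale).clamp(-127, 127).to(torch.int8)
    return q.reshape(out_features, in_features), scale.squeeze(-1)


torch.manual_seed(0)
weight = torch.randn(8, 512)
q_weight, scales = quantize_per_group_int8(weight, group_size=128)

# 512 input features / 128 per group = 4 scales per output row.
print(q_weight.shape, scales.shape)
```

Smaller groups track local weight ranges more closely at the cost of storing more scales; a group_size of 128 is the value the note above reports as validated.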