Quantizing Models
Hugging Face models can be quantized with TorchAO's
Int4WeightOnlyOpaqueTensorConfig and IntxWeightOnlyConfig. zentorch
supports weight-only quantization (WOQ) with 4-bit weights and BF16
activations. Per-group granularity is available with both configs;
per-channel granularity is available only with IntxWeightOnlyConfig.
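To make the difference between the two granularities concrete, the following is a minimal pure-Python sketch of symmetric 4-bit quantization for a single weight row. It is an illustration only, not the TorchAO or zentorch implementation: the helper names (quantize_symmetric_int4, dequantize, mean_abs_error) are hypothetical, and the scale rule used here (largest magnitude divided by 7) is one common convention that may differ from what TorchAO computes internally.

```python
def quantize_symmetric_int4(row, group_size=None):
    """Quantize one weight row (one output channel) to signed 4-bit integers.

    group_size=None -> a single scale for the whole row (per-channel).
    group_size=g    -> one scale per g consecutive weights (per-group).
    Assumption: scale = max|w| / 7, values clamped to [-8, 7].
    """
    if not row:
        raise ValueError("row must not be empty")
    size = len(row) if group_size is None else group_size
    if len(row) % size != 0:
        raise ValueError("row length must be a multiple of group_size")

    quantized, scales = [], []
    for start in range(0, len(row), size):
        group = row[start:start + size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(-8, min(7, round(w / scale))) for w in group)
    return quantized, scales


def dequantize(quantized, scales):
    """Map 4-bit integers back to floats using the scale of their group."""
    size = len(quantized) // len(scales)
    return [q * scales[i // size] for i, q in enumerate(quantized)]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# A row mixing small and large weights: a single per-channel scale is
# dominated by the large values, so the small ones lose most of their precision.
row = [0.02, -0.01, 0.03, -0.04, 2.0, -1.5, 0.5, 1.0]
for label, g in (("per-channel", None), ("per-group(4)", 4)):
    q, scales = quantize_symmetric_int4(row, g)
    err = mean_abs_error(row, dequantize(q, scales))
    print(f"{label}: {len(scales)} scale(s), mean abs error {err:.4f}")
```

In this toy example the per-group variant stores more scales but reconstructs the small weights far more accurately, which is the usual motivation for preferring per-group quantization.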
To quantize a model, use the TorchAoConfig integration in Hugging Face Transformers as shown below:
- For per-group quantization (recommended, group_size=128):

      import torch
      from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
      from torchao.quantization.quant_api import IntxWeightOnlyConfig
      from torchao.quantization.quant_primitives import MappingType
      from torchao.quantization.granularity import PerGroup

      # 4-bit symmetric weights with one BF16 scale per group of 128 weights.
      quantization_config = TorchAoConfig(
          IntxWeightOnlyConfig(
              weight_dtype=torch.int4,
              mapping_type=MappingType.SYMMETRIC,
              scale_dtype=torch.bfloat16,
              granularity=PerGroup(128),
          )
      )

      # model_name is a Hugging Face model ID or local path, defined by the caller.
      quantized_model = AutoModelForCausalLM.from_pretrained(
          model_name,
          dtype=torch.bfloat16,
          device_map="cpu",
          quantization_config=quantization_config,
          trust_remote_code=True,
      )

- For per-channel quantization, replace PerGroup(128) with PerChannel():

      # Other imports are the same as in the per-group example.
      from torchao.quantization.granularity import PerChannel

      quantization_config = TorchAoConfig(
          IntxWeightOnlyConfig(
              weight_dtype=torch.int4,
              mapping_type=MappingType.SYMMETRIC,
              scale_dtype=torch.bfloat16,
              granularity=PerChannel(),
          )
      )

- Using Int4WeightOnlyOpaqueTensorConfig for per-group weight-only quantization:

      import torch
      from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
      from torchao.prototype.int4_opaque_tensor import Int4WeightOnlyOpaqueTensorConfig

      quantization_config = TorchAoConfig(Int4WeightOnlyOpaqueTensorConfig(group_size=128))

      # model_name is a Hugging Face model ID or local path, defined by the caller.
      quantized_model = AutoModelForCausalLM.from_pretrained(
          model_name,
          dtype=torch.bfloat16,
          device_map="cpu",
          quantization_config=quantization_config,
          trust_remote_code=True,
      )
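As a rough, back-of-the-envelope illustration of what the settings above imply for storage, the sketch below estimates the average number of bits stored per weight when each group shares one 16-bit (BF16) scale. It assumes symmetric mapping with no stored zero point and ignores padding or any packed-layout overhead, so actual sizes may differ; the function names, and the row length of 4096 used for the per-channel case, are hypothetical.

```python
def bits_per_weight(weight_bits=4, scale_bits=16, group_size=128):
    """Average stored bits per weight: the payload plus one scale shared by each group."""
    return weight_bits + scale_bits / group_size


def weight_bytes(num_weights, **kwargs):
    """Approximate storage in bytes for num_weights quantized weights."""
    return num_weights * bits_per_weight(**kwargs) / 8


# Per-group with group_size=128: 4 bits + 16/128 bits of scale overhead.
print(bits_per_weight())                 # 4.125

# Per-channel behaves like one group spanning a full row
# (4096 is a hypothetical input dimension).
print(bits_per_weight(group_size=4096))  # 4.00390625

# Approximate compression ratio relative to 16-bit BF16 weights.
print(16 / bits_per_weight())
```

Under these assumptions, per-group scales add only about 3% over the raw 4-bit payload.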
Attention: Currently, we do not support
Mixture-of-Experts (MoE) models for any quantization scheme.
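Because MoE models are not supported, it can help to fail early before attempting quantization. The sketch below is a heuristic, not part of zentorch or Transformers: it inspects a model config object for expert-count fields that some MoE architectures are known to expose. The field list and the function name looks_like_moe are assumptions, and the list is not exhaustive, so a False result does not guarantee a dense model.

```python
from types import SimpleNamespace

# Assumption: config fields used by some MoE architectures to store the
# number of experts. Extend this tuple for other model families as needed.
MOE_CONFIG_FIELDS = ("num_local_experts", "num_experts", "n_routed_experts")


def looks_like_moe(config):
    """Return True if the config exposes an expert count greater than one."""
    for field in MOE_CONFIG_FIELDS:
        value = getattr(config, field, None)
        if isinstance(value, int) and value > 1:
            return True
    return False


# Stand-in config objects for illustration; in practice the object could be
# the `config` attribute of a loaded model or the result of loading a config.
dense_config = SimpleNamespace(hidden_size=4096)
moe_config = SimpleNamespace(hidden_size=4096, num_local_experts=8)

print(looks_like_moe(dense_config))  # False
print(looks_like_moe(moe_config))    # True
```

A caller could raise an error when the check returns True instead of passing a quantization config to from_pretrained.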