The zentorch.llm.optimize API has been deprecated.
You can run generative models using torch.compile(model, backend="zentorch"), but for optimal performance we recommend using vLLM. See vLLM-zentorch Plugin for more details.
zentorch supports Weight-Only Quantization (WOQ) models with both per-channel and per-group quantization granularity, enabling efficient 4-bit quantized inference. It also supports dynamic quantization, which pairs INT8 activations at per-token granularity with INT8 quantized weights. Both approaches target large language models on AMD EPYC™ CPUs and offer significant memory savings with minimal impact on model accuracy.
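To make the memory savings concrete, the following is a rough back-of-the-envelope sketch, not a measurement. It assumes weights are stored tightly packed and that each quantization group carries one 2-byte (bfloat16) scale. The helper name `estimated_weight_bytes` and the weight count of one billion are illustrative assumptions, and real checkpoints also contain layers that may not be quantized.

```python
def estimated_weight_bytes(num_weights, bits_per_weight, group_size=None, scale_bytes=2):
    """Rough storage estimate: packed weights plus one scale per group (if grouped)."""
    packed = num_weights * bits_per_weight / 8
    if group_size is None:
        # Per-channel scales are negligible next to the weights, so they are ignored here.
        return packed
    num_scales = num_weights / group_size
    return packed + num_scales * scale_bytes


# Illustrative round number, not the parameter count of any specific model.
num_weights = 1_000_000_000

bf16_bytes = estimated_weight_bytes(num_weights, 16)
int8_bytes = estimated_weight_bytes(num_weights, 8)
int4_group_bytes = estimated_weight_bytes(num_weights, 4, group_size=128)

for label, size in [("bf16", bf16_bytes), ("int8", int8_bytes), ("int4, group 128", int4_group_bytes)]:
    print(f"{label:>16}: {size / 1e9:.3f} GB ({bf16_bytes / size:.2f}x smaller than bf16)")
```

Under these assumptions, the per-group scales add only a small overhead on top of the 4-bit weights.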
Quantizing Models
Use the following steps to quantize Hugging Face models with different TorchAO configurations. Step 1 differs for each configuration; steps 2 through 5 are common to all configurations.
Step 1
Weight-only Quantization using IntxWeightOnlyConfig with per-channel granularity
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization.quant_api import IntxWeightOnlyConfig
from torchao.quantization.quant_primitives import MappingType
# Step 1: Create quantization config with IntxWeightOnlyConfig per-channel granularity
quantization_config = TorchAoConfig(
    IntxWeightOnlyConfig(
        weight_dtype=torch.int4,
        mapping_type=MappingType.SYMMETRIC,
        scale_dtype=torch.bfloat16,
    )
)
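As an illustration of what symmetric per-channel 4-bit quantization means numerically, here is a minimal pure-Python sketch. It is a textbook formulation (one scale per output row, zero point fixed at 0, integer range [-8, 7]) and is not taken from TorchAO; the exact scale formula, rounding, and packing used by IntxWeightOnlyConfig may differ. The function names are hypothetical.

```python
def quantize_per_channel_symmetric(weight, bits=4):
    """Quantize each row (output channel) with its own scale and no zero point."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # [-8, 7] for 4 bits
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer step, then clamp into the signed range.
        q_rows.append([max(qmin, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows, scales):
    """Map integers back to floats using the per-row scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[0.7, -0.35, 0.1, 0.0], [-1.4, 0.2, 0.7, 1.4]]
q, scales = quantize_per_channel_symmetric(weight)
restored = dequantize_per_channel(q, scales)

for original_row, restored_row, scale in zip(weight, restored, scales):
    worst = max(abs(a - b) for a, b in zip(original_row, restored_row))
    print(f"scale={scale:.4f} worst-case error={worst:.4f}")
```

Because each row has its own scale, a row with small weights is not forced onto the coarse grid of a row with large weights.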
Weight-only Quantization using Int4WeightOnlyOpaqueTensorConfig
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.prototype.int4_opaque_tensor import Int4WeightOnlyOpaqueTensorConfig
quantization_config = TorchAoConfig(
    Int4WeightOnlyOpaqueTensorConfig(group_size=128)
)
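The group_size=128 setting controls how many consecutive weights share quantization parameters. The sketch below only illustrates that granularity idea with a simple symmetric scale per group; it is an assumption-laden toy, and the actual scheme and tensor layout used by Int4WeightOnlyOpaqueTensorConfig may be different. The helper `group_scales` is hypothetical.

```python
def group_scales(row, group_size, qmax=7):
    """Return one symmetric scale per contiguous group of `group_size` weights."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    scales = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scales.append(max_abs / qmax if max_abs > 0 else 1.0)
    return scales


# A row of 256 small weights with a single large outlier in the second half.
row = [0.01 * ((i % 7) - 3) for i in range(256)]
row[200] = 3.5

per_channel = group_scales(row, group_size=len(row))  # one scale for the whole row
per_group = group_scales(row, group_size=128)         # one scale per 128 weights

print("per-channel scale:", per_channel)
print("per-group scales: ", per_group)
```

With per-group granularity the outlier only coarsens the grid of its own group, while the first 128 weights keep a much finer scale. The cost is one extra scale for every group.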
Dynamic quantization with Int8DynamicActivationInt8WeightConfig
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int8DynamicActivationInt8WeightConfig
from torchao.quantization.quant_primitives import MappingType
quantization_config = TorchAoConfig(
    Int8DynamicActivationInt8WeightConfig(
        version=2,
        act_mapping_type=MappingType.SYMMETRIC,
    )
)
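To show how dynamic quantization differs from weight-only quantization, the following pure-Python sketch quantizes weights once ahead of time and quantizes activations per token at run time, then accumulates in integers and rescales. It is a conceptual sketch under simple assumptions (symmetric scales, range clamped to [-128, 127], no bias) and does not reproduce the TorchAO or zentorch kernels; all function names are hypothetical.

```python
def quantize_rows_int8(matrix):
    """Symmetric int8 quantization with one scale per row."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dynamic_int8_linear(activations, q_weight, weight_scales):
    """activations: [tokens][in] floats; q_weight: [out][in] ints; returns [tokens][out] floats."""
    # Activation scales are computed at run time, one per token (row).
    q_act, act_scales = quantize_rows_int8(activations)
    output = []
    for q_row, a_scale in zip(q_act, act_scales):
        out_row = []
        for w_row, w_scale in zip(q_weight, weight_scales):
            acc = sum(a * w for a, w in zip(q_row, w_row))  # integer accumulation
            out_row.append(acc * a_scale * w_scale)         # rescale back to float
        output.append(out_row)
    return output


weight = [[0.5, -0.25, 0.125], [1.0, 0.5, -1.0]]      # [out][in]
activations = [[1.0, 2.0, 3.0], [-0.1, 0.0, 0.1]]     # [tokens][in]

q_weight, weight_scales = quantize_rows_int8(weight)  # done once, ahead of time
approx = dynamic_int8_linear(activations, q_weight, weight_scales)
exact = [[sum(a * w for a, w in zip(act, w_row)) for w_row in weight] for act in activations]

print("approx:", approx)
print("exact: ", exact)
```

Per-token scales let a token with small activations (the second row here) use the full int8 range instead of sharing a scale with a token that has much larger values.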
Steps 2 through 5 - Common for all the aforementioned configurations
model_name = "meta-llama/Llama-3.2-1B-Instruct"
output_dir = "./quantized_model"
# Step 2: Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
    quantization_config=quantization_config,
    trust_remote_code=True,
)
# Step 3: Save the quantized model
quantized_model.save_pretrained(
    output_dir,
    safe_serialization=False,
)
# Step 4: Save the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
tokenizer.save_pretrained(output_dir)
# Step 5: Test the quantized model
input_text = "what are we having for dinner?"
Sample Output
You mentioned we were planning to go to a restaurant
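The Step 5 listing above shows only the prompt. One possible way to produce a reply from it is sketched below using the standard Hugging Face tokenizer call, generate, and decode methods. This is an assumption about how the test could be run, not the original Step 5 code; the helper name `generate_reply` and the max_new_tokens value are hypothetical, and the original may apply a chat template or compile the model first.

```python
def generate_reply(model, tokenizer, prompt, max_new_tokens=32):
    """Tokenize the prompt, generate a continuation, and decode it to text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The first (and only) sequence in the batch holds prompt plus continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Intended usage with the objects created in steps 2 and 4 (not executed here):
# print(generate_reply(quantized_model, tokenizer, input_text))
```

The helper takes the model and tokenizer as arguments so it can be reused with a model reloaded from the output directory.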
- Ensure you have the required dependencies installed:
  pip install transformers==4.57.6 torchao==0.16.0
- The MappingType.SYMMETRIC option enables symmetric quantization, which is recommended for optimal performance with zentorch.
- The scale_dtype=torch.bfloat16 option ensures compatibility with AMD EPYC™ CPU optimizations.
- Use safe_serialization=False when saving for compatibility with zentorch.
- WOQ quantized models are only supported with freezing enabled:
  export TORCHINDUCTOR_FREEZING=1
  export VLLM_USE_AOT_COMPILE=0
  export TORCHINDUCTOR_AUTOGRAD_CACHE=0
- The sample output of the quantized model example described earlier is expected to differ from run to run.
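When a Python launcher script is used instead of shell exports, the same three variables could be set in-process. The sketch below assumes they must be in place before torch or vLLM is imported, since such settings are commonly read at import time; that ordering requirement is an assumption, and `apply_woq_env` is a hypothetical helper name.

```python
import os

# Variable names and values are the ones listed in the notes above.
ZENTORCH_WOQ_ENV = {
    "TORCHINDUCTOR_FREEZING": "1",
    "VLLM_USE_AOT_COMPILE": "0",
    "TORCHINDUCTOR_AUTOGRAD_CACHE": "0",
}


def apply_woq_env(env=os.environ):
    """Set the variables, overwriting existing values just as `export` would."""
    for name, value in ZENTORCH_WOQ_ENV.items():
        env[name] = value
    return {name: env[name] for name in ZENTORCH_WOQ_ENV}


applied = apply_woq_env()  # call this before importing torch or vllm
print(applied)
```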
Running Quantized Models
Quantized models are supported in vLLM via the zentorch backend. This release introduces functional support for Weight-Only Quantization (WOQ). See vLLM-zentorch Plugin for detailed instructions on running models with vLLM.
Use the quantized model directory (for example, ./quantized_model) as the model path instead of the original Hugging Face model name.
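As a rough sketch of pointing vLLM at the saved directory, the helper below uses vLLM's offline LLM and SamplingParams classes. It assumes vLLM and the vLLM-zentorch plugin are already installed and configured as described in the plugin documentation, and it does not configure the plugin itself. The function name and the max_tokens value are hypothetical.

```python
from pathlib import Path


def run_quantized_with_vllm(model_dir, prompt, max_tokens=32):
    """Generate text with vLLM from a locally saved quantized model directory."""
    model_path = Path(model_dir)
    # save_pretrained writes config.json, so its absence suggests a wrong path.
    if not (model_path / "config.json").is_file():
        raise FileNotFoundError(f"no config.json found in {model_path}")

    # Imported lazily so the path check works even where vLLM is not installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=str(model_path))
    outputs = llm.generate([prompt], SamplingParams(max_tokens=max_tokens))
    return outputs[0].outputs[0].text


# Intended usage (not executed here):
# print(run_quantized_with_vllm("./quantized_model", "what are we having for dinner?"))
```

Checking for config.json first gives a clear error when the original Hugging Face model name is passed by mistake instead of the local directory.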