Hugging Face Language Models

ZenDNN User Guide (57300)

Document ID: 57300
Release Date: 2026-04-13
Revision: 5.2.1 English

The zentorch.llm.optimize API has been deprecated.

You can run generative models using torch.compile(model, backend="zentorch"), but for optimal performance we recommend vLLM. See vLLM-zentorch Plugin for more details.
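As a minimal sketch of the torch.compile path mentioned above, the helper below compiles an already-loaded model with the zentorch backend. The function name compile_with_zentorch is hypothetical, and the sketch assumes that importing the zentorch package is what makes the "zentorch" backend name available to torch.compile; verify this against your installation.

```python
def compile_with_zentorch(model):
    """Return `model` compiled with the zentorch backend (hypothetical helper).

    Imports are done lazily so this file can be loaded on machines that do
    not have torch or zentorch installed.
    """
    import torch
    import zentorch  # noqa: F401  (assumed: the import registers the backend)

    model.eval()  # inference only; disables dropout and similar layers
    return torch.compile(model, backend="zentorch")


# Typical use, assuming `model` is a Hugging Face causal LM loaded on CPU:
#   compiled = compile_with_zentorch(model)
#   with torch.no_grad():
#       output_ids = compiled.generate(**inputs, max_new_tokens=32)
```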

zentorch supports Weight-Only Quantization (WOQ) models with per-channel or per-group quantization granularity, enabling efficient 4-bit quantized inference. It also supports dynamic quantization, which combines INT8 activations quantized at per-token granularity with INT8 quantized weights. Both schemes target large language models on AMD EPYC™ CPUs and provide significant memory savings with minimal impact on model accuracy.
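To make the difference between per-channel and per-group granularity concrete, the following pure-Python sketch quantizes one weight row to signed 4-bit codes with symmetric scaling. It is an illustration only, not the TorchAO or zentorch implementation: the helper names and the scale formula (largest magnitude divided by 7) are assumptions chosen for clarity.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # range of a signed 4-bit integer


def quantize_block(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization of one block that shares a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(v / scale))) for v in values]
    return codes, scale


def quantize_row(row: List[float], group_size: int) -> Tuple[List[int], List[float]]:
    """Quantize a row in blocks of `group_size` elements.

    group_size == len(row) gives one scale per row (per-channel);
    a smaller group_size gives one scale per group (per-group).
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), group_size):
        block_codes, scale = quantize_block(row[start:start + group_size])
        codes.extend(block_codes)
        scales.append(scale)
    return codes, scales


def dequantize_row(codes: List[int], scales: List[float], group_size: int) -> List[float]:
    """Map 4-bit codes back to floats using the scale of each element's group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# A row whose first half is much smaller in magnitude than its second half.
row = [0.01, -0.02, 0.03, -0.04, 1.0, -2.0, 3.0, -4.0]
for label, group_size in (("per-channel", len(row)), ("per-group", 4)):
    codes, scales = quantize_row(row, group_size)
    restored = dequantize_row(codes, scales, group_size)
    small_err = max(abs(a - b) for a, b in zip(row[:4], restored[:4]))
    print(label, "scales stored:", len(scales), "max error on small weights:", small_err)
```

Per-group quantization stores more scales, but each scale only has to cover the values in its own group, so small weights are not flattened by a large outlier elsewhere in the row.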

Quantizing Models

Use the following steps to quantize Hugging Face models with different TorchAO configurations. Step 1 depends on the configuration you choose; steps 2 through 5 are the same for all configurations.

Step 1

Weight-only Quantization using IntxWeightOnlyConfig with per-channel granularity

import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization.quant_api import IntxWeightOnlyConfig
from torchao.quantization.quant_primitives import MappingType

# Step 1: Create quantization config with IntxWeightOnlyConfig per-channel granularity
quantization_config = TorchAoConfig(
    IntxWeightOnlyConfig(
        weight_dtype=torch.int4,
        mapping_type=MappingType.SYMMETRIC,
        scale_dtype=torch.bfloat16,
    )
)

Weight-only Quantization using Int4WeightOnlyOpaqueTensorConfig

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.prototype.int4_opaque_tensor import Int4WeightOnlyOpaqueTensorConfig

quantization_config = TorchAoConfig(
    Int4WeightOnlyOpaqueTensorConfig(group_size=128)
)

Dynamic quantization with Int8DynamicActivationInt8WeightConfig

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int8DynamicActivationInt8WeightConfig
from torchao.quantization.quant_primitives import MappingType

quantization_config = TorchAoConfig(
    Int8DynamicActivationInt8WeightConfig(
        version=2,
        act_mapping_type=MappingType.SYMMETRIC,
    )
)
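The "dynamic" part of dynamic quantization means that activation scales are computed at inference time from the values actually seen, while the weights are quantized ahead of time. The pure-Python sketch below illustrates symmetric INT8 quantization with one scale per token (one per row of the activation matrix). It is a conceptual illustration under assumed conventions (a symmetric range of -127 to 127 and a hypothetical function name), not the TorchAO or zentorch kernel.

```python
from typing import List, Tuple

INT8_MAX = 127  # symmetric range [-127, 127]


def quantize_activations_per_token(
    activations: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Quantize each token's activation row with its own runtime scale."""
    quantized: List[List[int]] = []
    scales: List[float] = []
    for row in activations:
        max_abs = max((abs(v) for v in row), default=0.0)
        scale = max_abs / INT8_MAX if max_abs > 0 else 1.0
        quantized.append(
            [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in row]
        )
        scales.append(scale)
    return quantized, scales


# Two tokens with very different activation magnitudes.
tokens = [[0.5, -1.0, 0.25], [10.0, 20.0, -40.0]]
q, s = quantize_activations_per_token(tokens)
print("codes:", q)
print("scales:", s)
```

Because each token gets its own scale, a token with large activations does not reduce the resolution available to a token with small ones.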

Steps 2 through 5 - Common for all the aforementioned configurations

model_name = "meta-llama/Llama-3.2-1B-Instruct"
output_dir = "./quantized_model"

# Step 2: Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
    quantization_config=quantization_config,
    trust_remote_code=True,
)

# Step 3: Save the quantized model
quantized_model.save_pretrained(
    output_dir,
    safe_serialization=False,
)

# Step 4: Save the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
tokenizer.save_pretrained(output_dir)

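# Step 5: Test the quantized model
input_text = "what are we having for dinner?"
# Assumed completion of this step: tokenize the prompt, generate a short
# continuation (max_new_tokens is an illustrative value), and print it.
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    output_ids = quantized_model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))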

Sample Output

You mentioned we were planning to go to a restaurant

Note:
  • Ensure you have the required dependencies installed: pip install transformers==4.57.6 torchao==0.16.0
  • The MappingType.SYMMETRIC option enables symmetric quantization which is recommended for optimal performance with zentorch.
  • The scale_dtype=torch.bfloat16 option ensures compatibility with AMD EPYC™ CPU optimizations.
  • Use safe_serialization=False when saving for compatibility with zentorch.
  • WOQ models are supported only with freezing enabled. Set the following environment variables before running:
    export TORCHINDUCTOR_FREEZING=1
    export VLLM_USE_AOT_COMPILE=0
    export TORCHINDUCTOR_AUTOGRAD_CACHE=0
  • The sample output shown for the quantized model example is illustrative; the output is expected to differ from run to run.
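If you launch inference from a Python script rather than a shell, the environment variables listed in the note can also be set programmatically. The sketch below assumes that setting them before torch or vLLM is imported has the same effect as the shell export commands; the helper name apply_freezing_env is hypothetical.

```python
import os
from typing import MutableMapping

# Values copied from the export commands in the note above.
FREEZING_ENV = {
    "TORCHINDUCTOR_FREEZING": "1",
    "VLLM_USE_AOT_COMPILE": "0",
    "TORCHINDUCTOR_AUTOGRAD_CACHE": "0",
}


def apply_freezing_env(env: MutableMapping[str, str] = os.environ) -> None:
    """Write the freezing-related settings into `env` (os.environ by default)."""
    for name, value in FREEZING_ENV.items():
        env[name] = value


# Call this first, then import torch or vllm further down in the script.
apply_freezing_env()
```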

Running Quantized Models

Quantized models are supported in vLLM via the zentorch backend. This release introduces functional support for Weight-Only Quantization (WOQ). See vLLM-zentorch Plugin for detailed instructions on running models with vLLM.

Note: When running quantized models, use the path to your quantized model directory (for example, ./quantized_model) as the model path instead of the original Hugging Face model name.
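The sketch below shows one way to pass the local directory to vLLM's offline inference API. It assumes vLLM and the vLLM-zentorch plugin are already installed and configured as described in the plugin section. Both helper names are hypothetical, and the config.json check is only a convenience to catch a wrong path early.

```python
from pathlib import Path


def resolve_quantized_model_dir(path: str) -> str:
    """Return the absolute path of a saved model directory.

    save_pretrained() writes a config.json file, so its absence suggests
    that `path` does not point at a saved model.
    """
    model_dir = Path(path).expanduser().resolve()
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(f"No config.json found in {model_dir}")
    return str(model_dir)


def generate_with_vllm(model_dir: str, prompt: str, max_tokens: int = 32) -> str:
    """Generate text from the local quantized model using vLLM."""
    from vllm import LLM, SamplingParams  # lazy import; requires vLLM

    llm = LLM(model=resolve_quantized_model_dir(model_dir))
    outputs = llm.generate([prompt], SamplingParams(max_tokens=max_tokens))
    return outputs[0].outputs[0].text


# Example call, using the directory saved in the quantization steps:
#   print(generate_with_vllm("./quantized_model", "what are we having for dinner?"))
```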