Weight Only Quantized Models

Hugging Face models are quantized using the AMD Quark tool. After downloading the Quark release zip file, install Quark and follow these steps:
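The release zip typically contains a Python wheel along with the examples tree. A minimal install sketch, assuming the amd_quark-<version> naming used by Quark releases, is shown below; substitute the actual file names from your download:

unzip amd_quark-0.8.zip
cd amd_quark-0.8
pip install amd_quark-0.8-py3-none-any.whl   # illustrative wheel name; use the one in the archive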

  1. Navigate to the examples/torch/language_modeling/llm_ptq/ directory.
  2. Install the necessary dependencies:
    pip install -r requirements.txt
    pip install -r ../llm_eval/requirements.txt
  3. Run one of the following commands to quantize the model (a filled-in example follows this list):
    • For per-channel quantization:
      OMP_NUM_THREADS=<physical-cores-num> numactl --physcpubind=<physical-cores-list> python quantize_quark.py \
      --model_dir <hugging_face_model_id> --device cpu --data_type bfloat16 --model_export hf_format \
      --quant_algo awq --quant_scheme w_int4_per_group_sym --group_size -1 \
      --num_calib_data 128 --dataset pileval_for_awq_benchmark --seq_len 128 --output_dir <output_dir> \
      --pack_method order
    • For per-group quantization:
      OMP_NUM_THREADS=<physical-cores-num> numactl --physcpubind=<physical-cores-list> python quantize_quark.py \
      --model_dir <hugging_face_model_id> --device cpu --data_type bfloat16 --model_export hf_format \
      --quant_algo awq --quant_scheme w_int4_per_group_sym --group_size <group_size> --num_calib_data 128 \
      --dataset pileval_for_awq_benchmark --seq_len 128 --output_dir <output_dir> --pack_method order
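As a concrete illustration, a per-group AWQ run on a 32-core machine might look like the following; the core counts, model ID, and output directory are placeholder values chosen for this example, not prescribed settings:

OMP_NUM_THREADS=32 numactl --physcpubind=0-31 python quantize_quark.py \
--model_dir meta-llama/Llama-2-7b-hf --device cpu --data_type bfloat16 --model_export hf_format \
--quant_algo awq --quant_scheme w_int4_per_group_sym --group_size 128 --num_calib_data 128 \
--dataset pileval_for_awq_benchmark --seq_len 128 --output_dir ./llama2-7b-awq-w4 --pack_method order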
Note: The channel/out_features dimension (a property of your model) must be divisible by the specified group_size. To find the channel and out_features values in your model, refer to the model definition. We recommend a group_size of 128, as this configuration has been validated by zentorch across a broad set of mainstream models.

For example:

The Llama-3.2 model contains multiple linear layers subject to quantization, with out_features values of [2048, 512, 512, 2048, 8192, 8192, 2048, 128256].

Similarly, the Llama-2 model has linear layers that can be quantized with out_features values of [4096, 4096, 4096, 4096, 11008, 11008, 4096, 32000].

The ChatGLM model includes linear layers with out_features values of [4608, 4096, 27392, 4096, 65024].

For effective quantization, the chosen group_size must be a factor of each channel/out_features value within the model.
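One way to enumerate these values is to walk the model's modules and record the out_features of every linear layer. The sketch below is illustrative, assuming a Hugging Face causal LM and a candidate group_size of 128; it is not part of the Quark workflow itself:

import torch
from transformers import AutoModelForCausalLM

group_size = 128  # candidate group size to validate
model = AutoModelForCausalLM.from_pretrained("<hugging_face_model_id>", torch_dtype=torch.bfloat16)

# List every linear layer's out_features and flag any value that is
# not divisible by the candidate group_size.
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        divisible = module.out_features % group_size == 0
        print(f"{name}: out_features={module.out_features}, divisible by {group_size}: {divisible}")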

To export the quantized model in Quark's safetensors format, which the zentorch.load_quantized_model API shown below consumes, set --model_export quark_safetensors:

OMP_NUM_THREADS=<physical-cores-num> numactl --physcpubind=<physical-cores-list> python quantize_quark.py \
--model_dir <hugging_face_model_id> --device cpu --data_type bfloat16 --model_export quark_safetensors \
--quant_algo awq --quant_scheme w_int4_per_group_sym --group_size -1 --num_calib_data 128 \
--dataset pileval_for_awq_benchmark --seq_len 128 --output_dir <output_dir> --pack_method order
Note: zentorch v5.1 is compatible with Quark v0.8. Make sure you download the right version.
Table 1. Constraints for zentorch WOQ with the AWQ algorithm

Constraint                              Remarks
--device cpu                            zentorch supports only the CPU device.
--data_type bfloat16                    Currently, zentorch supports only the BFloat16 model data type.
--group_size -1                         group_size -1 selects per-channel quantization; for per-group quantization, the channel/out_features dimension must be divisible by the group_size value.
--quant_algo awq                        Currently, the zentorch release supports only the AWQ quantization algorithm.
--quant_scheme w_int4_per_group_sym     Currently, the zentorch release supports only the w_int4_per_group_sym quantization scheme.
--pack_method order                     Currently, the zentorch release supports only pack_method order.

As Hugging Face currently does not support the AWQ format on CPU, an additional code block must be added to your inference script to load WOQ models.

import torch, zentorch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = zentorch.load_quantized_model(model, safetensor_path)  # populate the skeleton with quantized weights

Here, safetensor_path refers to the <output_dir> path of the quantized model. After these loading steps, the model can be executed in the same fashion as cases 1-3 listed in Recommendations (Hugging Face Generative LLM Models).
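As a minimal end-to-end sketch, the loaded model can be compiled with the zentorch backend and then used for generation. The prompt and generation parameters below are illustrative, and model_id and safetensor_path are assumed to be set as in the loading snippet above:

import torch
import zentorch  # importing zentorch registers the "zentorch" torch.compile backend
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = torch.compile(model, backend="zentorch")  # compile the loaded WOQ model
inputs = tokenizer("What is weight-only quantization?", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))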