Hugging Face models can be quantized using the AMD Quark tool. After downloading the Quark zip file, install Quark and follow these steps:
- Navigate to the examples/torch/language_modeling/llm_ptq/ directory.
- Install the necessary dependencies:
  pip install -r requirements.txt
  pip install -r ../llm_eval/requirements.txt
- Run the following command to quantize the model:
  - For per-channel quantization:
    OMP_NUM_THREADS=<physical-cores-num> numactl --physcpubind=<physical-cores-list> python quantize_quark.py --model_dir <hugging_face_model_id> --device cpu --data_type bfloat16 --model_export hf_format --quant_algo awq --quant_scheme w_int4_per_group_sym --group_size -1 --num_calib_data 128 --dataset pileval_for_awq_benchmark --seq_len 128 --output_dir <output_dir> --pack_method order
  - For per-group quantization:
    OMP_NUM_THREADS=<physical-cores-num> numactl --physcpubind=<physical-cores-list> python quantize_quark.py --model_dir <hugging_face_model_id> --device cpu --data_type bfloat16 --model_export hf_format --quant_algo awq --quant_scheme w_int4_per_group_sym --group_size <group_size> --num_calib_data 128 --dataset pileval_for_awq_benchmark --seq_len 128 --output_dir <output_dir> --pack_method order
When choosing a group_size for per-group quantization, note the out_features values of the linear layers to be quantized. For example:
- The Llama-3.2 model contains multiple linear layers subject to quantization, with out_features values of [2048, 512, 512, 2048, 8192, 8192, 2048, 128256].
- Similarly, the Llama-2 model has linear layers that can be quantized, with out_features values of [4096, 4096, 4096, 4096, 11008, 11008, 4096, 32000].
- The ChatGLM model includes linear layers with out_features values of [4608, 4096, 27392, 4096, 65024].
For effective quantization, the chosen group_size must be a factor of each channel/out_features value within the model.
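Before running quantize_quark.py, it can be helpful to confirm that a candidate group_size actually divides every quantizable out_features dimension. Below is a minimal, illustrative sketch of such a check; group_size_is_valid is a hypothetical helper (not part of Quark or zentorch) and it assumes the model fits in host memory.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

def group_size_is_valid(model_id: str, group_size: int) -> bool:
    """Return True if group_size divides the out_features of every nn.Linear in the model."""
    if group_size == -1:  # -1 selects per-channel quantization, which needs no divisibility check
        return True
    model = AutoModelForCausalLM.from_pretrained(model_id)
    dims = {m.out_features for m in model.modules() if isinstance(m, nn.Linear)}
    return all(d % group_size == 0 for d in dims)

# Example: 128 divides 4096, 11008, and 32000, so it is a valid group_size for Llama-2.
print(group_size_is_valid("meta-llama/Llama-2-7b-hf", 128))
```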
For example, for per-channel quantization:
OMP_NUM_THREADS=<physical-cores-num> numactl --physcpubind=<physical-cores-list> python quantize_quark.py --model_dir <hugging_face_model_id> --device cpu --data_type bfloat16 --model_export quark_safetensors --quant_algo awq --quant_scheme w_int4_per_group_sym --group_size -1 --num_calib_data 128 --dataset pileval_for_awq_benchmark --seq_len 128 --output_dir <output_dir> --pack_method order
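The <physical-cores-num> and <physical-cores-list> placeholders should be set to the number and list of physical (non-SMT) cores to pin the run to. A minimal sketch for deriving plausible values, assuming the optional psutil package is installed:

```python
import psutil

# Count physical (non-SMT) cores and build a contiguous core list for numactl --physcpubind.
physical_cores = psutil.cpu_count(logical=False)
core_list = f"0-{physical_cores - 1}"

print(f"OMP_NUM_THREADS={physical_cores} numactl --physcpubind={core_list} python quantize_quark.py ...")
```

The table below summarizes the constraints that apply to the quantization command options.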
| Constraint | Remarks |
|---|---|
| --device cpu | zentorch only supports the CPU device. |
| --data_type bfloat16 | Currently, zentorch only supports the BFloat16 model data type. |
| --group_size -1 | A group_size of -1 refers to per-channel quantization; for per-group quantization, the channel/out_features dimension must be divisible by the group_size value. |
| --quant_algo awq | Currently, the zentorch release supports only the AWQ quantization algorithm. |
| --quant_scheme w_int4_per_group_sym | Currently, the zentorch release supports only the w_int4_per_group_sym quantization scheme. |
| --pack_method order | Currently, the zentorch release supports only the order pack method. |
Because Hugging Face currently does not support the AWQ format on CPU, an additional code block must be added to your inference script to load the WOQ (weight-only quantized) models:
import torch
import zentorch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = zentorch.load_quantized_model(model, safetensor_path)
Here, safetensor_path refers to the <output_dir> path of the quantized model. After these loading steps, the model can be executed in the same way as cases #1-3 listed in Recommendations (Hugging Face Generative LLM Models).
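For illustration, a minimal generation sketch that continues from the loading snippet above. It assumes the zentorch torch.compile backend and the standard Hugging Face generate() flow; the prompt and generation arguments are placeholders.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)  # the same model_id used to build the config
model.eval()

# Compile the forward pass with the zentorch backend before generating.
model.forward = torch.compile(model.forward, backend="zentorch")

inputs = tokenizer("What is weight-only quantization?", return_tensors="pt")
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```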