Pre-processing transforms a float32 model to prepare it for quantization. It consists of the following three optional steps:
- Symbolic shape inference: It is best suited for transformer models.
- Model optimization: It uses the ONNX Runtime native library to rewrite the computation graph, merging computation nodes and eliminating redundancies to improve runtime efficiency.
- ONNX shape inference.
The primary objective of these steps is to enhance quantization quality. The ONNX Runtime quantization tool performs optimally when tensor shapes are known. Both symbolic shape inference and ONNX shape inference play a crucial role in determining tensor shapes. Symbolic shape inference is particularly effective for transformer-based models, whereas ONNX shape inference works well with other models.
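Where the shape-inference behavior needs to be understood or debugged, the two passes can also be run on their own. The sketch below is illustrative, not part of the pre-processing API; the model paths are placeholders, and it assumes the SymbolicShapeInference helper shipped in onnxruntime.tools.

```python
import onnx
from onnx import shape_inference as onnx_shape_inference
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("model.onnx")  # placeholder path

# Standard ONNX shape inference: works well for most (non-transformer) models.
inferred = onnx_shape_inference.infer_shapes(model)

# Symbolic shape inference: handles symbolic dimensions (batch size,
# sequence length) that transformer models typically carry.
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)

onnx.save(inferred, "model_with_shapes.onnx")
```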
Model optimization performs certain operator fusions that make the quantization tool’s job easier. For instance, a Convolution operator followed by BatchNormalization can be fused into a single node during optimization, which enables more effective quantization.
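This fusion behavior can be observed outside the quantization tool by asking ONNX Runtime to serialize its optimized graph. The following is a sketch using the standard InferenceSession options; the file names are placeholders.

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Basic-level optimizations include fusions such as Conv + BatchNormalization.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
# Write the rewritten graph to disk so the fused nodes can be inspected.
sess_options.optimized_model_filepath = "model_optimized.onnx"  # placeholder

# Creating the session runs the optimizer and saves the optimized model.
ort.InferenceSession("model.onnx", sess_options)
```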
ONNX Runtime has a known issue: model optimization cannot produce an output model larger than 2 GB. As a result, optimization must be skipped for large models; see the skip_optimization flag below and the usage sketch after the parameter list.
```python
from onnxruntime.quantization import shape_inference

shape_inference.quant_pre_process(
    input_model_path: str,
    output_model_path: str,
    skip_optimization: bool = False,
    skip_onnx_shape: bool = False,
    skip_symbolic_shape: bool = False,
    auto_merge: bool = False,
    int_max: int = 2**31 - 1,
    guess_output_rank: bool = False,
    verbose: int = 0,
    save_as_external_data: bool = False,
    all_tensors_to_one_file: bool = False,
    external_data_location: str = "./",
    external_data_size_threshold: int = 1024,
)
```
- `input_model_path`: Path to the input model file.
- `output_model_path`: Path to the output model file.
- `skip_optimization`: Skip the model optimization step if set to true. This might result in ONNX shape inference failure for some models.
- `skip_onnx_shape`: Skip the ONNX shape inference step. Symbolic shape inference is most effective with transformer-based models. Skipping all shape inferences might reduce the effectiveness of quantization, because a tensor with an unknown shape cannot be quantized.
- `skip_symbolic_shape`: Skip the symbolic shape inference step. Symbolic shape inference is most effective with transformer-based models. Skipping all shape inferences might reduce the effectiveness of quantization, because a tensor with an unknown shape cannot be quantized.
- `auto_merge`: For symbolic shape inference: automatically merge symbolic dimensions when a conflict occurs.
- `int_max`: For symbolic shape inference: the maximum integer value to be treated as boundless for ops such as Slice.
- `guess_output_rank`: Guess the output rank to be the same as input 0 for unknown ops.
- `verbose`: Logs detailed information about inference. 0: turn off, 1: warnings, 3: detailed.
- `save_as_external_data`: Save the ONNX model with tensor data stored externally.
- `all_tensors_to_one_file`: Save all external tensor data to a single file.
- `external_data_location`: The file location where the external data is saved.
- `external_data_size_threshold`: The size threshold for saving tensor data externally.
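Putting it together, a typical invocation might look like the sketch below; the model paths are placeholders, and the second call shows the workaround for models that exceed the 2 GB limit mentioned above.

```python
from onnxruntime.quantization import shape_inference

# Typical case: run all three pre-processing steps with default settings.
shape_inference.quant_pre_process(
    "model.onnx",
    "model_preprocessed.onnx",
)

# Large-model case (> 2 GB): skip optimization and store tensor data
# externally so the output model stays within the protobuf size limit.
shape_inference.quant_pre_process(
    "big_model.onnx",
    "big_model_preprocessed.onnx",
    skip_optimization=True,
    save_as_external_data=True,
    all_tensors_to_one_file=True,
)
```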