It is recommended that you use torch.no_grad() for optimal inference
performance with zentorch.
CNN
For torchvision CNN models, set dynamic=False when calling
torch.compile as follows:
model = torch.compile(model, backend='zentorch', dynamic=False)
with torch.no_grad():
    output = model(input)
NLP & RecSys
Optimize Hugging Face NLP models as follows.
model = torch.compile(model, backend='zentorch')
with torch.no_grad():
    output = model(input)
Hugging Face Generative LLM Models
For Hugging Face Generative LLM models, using zentorch.llm.optimize is recommended. All optimizations included in this
API are specifically targeted at Generative Large Language Models from Hugging Face. If
a model is not a valid Generative Large Language Model from Hugging Face, the following
warning will be displayed, and zentorch.llm.optimize
will act as a dummy with no optimizations applied to the model passed to the
method:
“Cannot detect the model transformers family by model.config.architectures. Please pass a valid Hugging Face LLM model to the zentorch.llm.optimize API.”
This check confirms the presence of the config and
architectures attributes of the model to get the model ID.
Considering this check, there are two scenarios in which zentorch.llm.optimize can still act as a dummy function:
- Hugging Face has a plethora of models, of which Generative LLMs are a subset.
So, even if the model has the config and architectures attributes, its model ID
might not yet be present in the supported models list from zentorch. In this case,
zentorch.llm.optimize will act as a dummy function.
- A model can be a valid Generative LLM from Hugging Face but may be missing the
config and architectures attributes. In this case as well, the zentorch.llm.optimize
API will act as a dummy function.
If the model passed is valid, all the supported optimizations will be applied, and
performant execution is ensured. To check the supported models, run the following
command:
python -c 'import zentorch; print("\n".join([f"{i+1:3}. {item}" for i, item in enumerate(zentorch.llm.SUPPORTED_MODELS)]))'
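For readability, the one-liner above expands to the following. The placeholder list stands in for zentorch.llm.SUPPORTED_MODELS, which is only available when zentorch is installed.

```python
# Expanded form of the one-liner above. The placeholder list stands in for
# zentorch.llm.SUPPORTED_MODELS, which requires zentorch to be installed.
supported_models = ["model-a", "model-b"]  # placeholder model IDs

# Number each entry, right-aligned in a 3-character column.
listing = "\n".join(f"{i+1:3}. {item}" for i, item in enumerate(supported_models))
print(listing)
# prints:
#   1. model-a
#   2. model-b
```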
If a model ID other than those listed above is passed, zentorch.llm.optimize will not apply the above model-specific optimizations, and the following warning will be displayed:
“Complete set of optimizations are currently unavailable for this model.”
Control will pass to the “zentorch” custom backend in torch.compile for
applying optimizations.
To use zentorch.llm.optimize, install the IPEX version corresponding to the PyTorch version
that is installed in the environment. The PyTorch version for performant execution of supported LLMs should be greater than or equal to 2.6.0; the recommended version for optimal performance is PyTorch 2.7.0.
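As a quick sanity check of the version requirement above, a comparison like the following can be used; meets_requirement is a hypothetical helper, and in practice you would pass torch.__version__.

```python
# Hypothetical helper to check the PyTorch >= 2.6.0 requirement stated above.
# In practice, pass torch.__version__; local build suffixes (e.g. "+cpu") are
# stripped before comparison.
def meets_requirement(version, minimum=(2, 6, 0)):
    parts = tuple(int(p) for p in version.split("+")[0].split(".")[:3])
    return parts >= minimum

print(meets_requirement("2.7.0"))  # True (recommended version)
print(meets_requirement("2.5.1"))  # False (older than 2.6.0)
```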
Case #1: If output is generated through a direct call to the model, optimize it as shown here:
model = zentorch.llm.optimize(model, dtype)
model = torch.compile(model, backend='zentorch')
with torch.no_grad():
    output = model(input)
Case #2: If output is generated through a call to model.forward, optimize it as shown here:
model = zentorch.llm.optimize(model, dtype)
model.forward = torch.compile(model.forward, backend='zentorch')
with torch.no_grad():
    output = model.forward(input)
Case #3: If output is generated through a call to model.generate, optimize it as shown here:
- Optimize model.forward with torch.compile instead of model.generate
- However, proceed to generate the output through a call to model.generate
model = zentorch.llm.optimize(model, dtype)
model.forward = torch.compile(model.forward, backend='zentorch')
with torch.no_grad():
    output = model.generate(input)