Recommendations

It is recommended that you use torch.no_grad() for optimal inference performance with zentorch.

CNN

For torchvision CNN models, set dynamic=False when calling torch.compile, as follows:

model = torch.compile(model, backend='zentorch', dynamic=False) 
with torch.no_grad():
    output = model(input)
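
The snippet above assumes model and input already exist. As a minimal end-to-end sketch (assuming zentorch and torchvision are installed; importing zentorch registers the 'zentorch' backend, and resnet50 with a 224x224 input is purely illustrative):

import torch
import torchvision.models as models
import zentorch  # importing zentorch registers the 'zentorch' backend

# Any torchvision CNN works the same way; resnet50 is an illustrative choice.
model = models.resnet50(weights=None).eval()
model = torch.compile(model, backend='zentorch', dynamic=False)

with torch.no_grad():
    output = model(torch.randn(1, 3, 224, 224))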

NLP & RecSys

Optimize Hugging Face NLP and recommender system (RecSys) models as follows.

model = torch.compile(model, backend='zentorch') 
with torch.no_grad():
    output = model(input)
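
For example, a Hugging Face encoder model can be compiled the same way (a sketch assuming the transformers package is installed; bert-base-uncased and the sample sentence are illustrative):

import torch
import zentorch  # registers the 'zentorch' backend
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased').eval()
model = torch.compile(model, backend='zentorch')

inputs = tokenizer('ZenDNN accelerates inference on AMD CPUs.', return_tensors='pt')
with torch.no_grad():
    output = model(**inputs)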

Hugging Face Generative LLM Models

For Hugging Face Generative LLM models, use of zentorch.llm.optimize is recommended. All optimizations included in this API specifically target Generative Large Language Models from Hugging Face. If a model is not a valid Generative Large Language Model from Hugging Face, the following warning will be displayed and zentorch.llm.optimize will act as a dummy function, applying no optimizations to the model passed to it:

“Cannot detect the model transformers family by model.config.architectures. Please pass a valid Hugging Face LLM model to the zentorch.llm.optimize API.”

This check confirms the presence of the config and architectures attributes of the model, which are used to get the model ID. Given this check, there are two scenarios in which zentorch.llm.optimize can still act as a dummy function:

  1. Hugging Face hosts a plethora of models, of which Generative LLMs are only a subset. Even if a model has the config and architectures attributes, its model ID might not yet be present in the supported models list from zentorch. In this case, zentorch.llm.optimize will act as a dummy function.

  2. A model can be a valid Generative LLM from Hugging Face but lack the config and architectures attributes. In this case as well, the zentorch.llm.optimize API will act as a dummy function.

If the model passed is valid, all the supported optimizations will be applied and performant execution is ensured. To check the supported models, run the following command:

    python -c 'import zentorch; print("\n".join([f"{i+1:3}. {item}" for i, item in enumerate(zentorch.llm.SUPPORTED_MODELS)]))'

If a model ID other than those listed above is passed, zentorch.llm.optimize will not apply the LLM-specific optimizations to the model and the following warning will be displayed:

“Complete set of optimizations are currently unavailable for this model.”

Control will pass to the “zentorch” custom backend in torch.compile for applying optimizations.

Note: To get the best performance from zentorch.llm.optimize, install the IPEX version corresponding to the PyTorch version installed in the environment.

For performant execution of supported LLMs, the PyTorch version should be greater than or equal to 2.6.0; PyTorch 2.7.0 is recommended for optimal performance.
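
A quick guard such as the following can verify this requirement at runtime (a sketch; it assumes the packaging module is available in the environment):

import torch
from packaging import version

# Fail early if the installed PyTorch is older than the required 2.6.0.
if version.parse(torch.__version__) < version.parse('2.6.0'):
    raise RuntimeError(
        f'PyTorch {torch.__version__} detected; >= 2.6.0 is required '
        'for performant execution of supported LLMs (2.7.0 recommended).'
    )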

Case #1: If the output is generated through a direct call to the model, optimize it as shown here:

model = zentorch.llm.optimize(model, dtype) 
model = torch.compile(model, backend='zentorch')
with torch.no_grad(): 
    output = model(input)

Case #2: If the output is generated through a call to model.forward, optimize it as shown here:

model = zentorch.llm.optimize(model, dtype) 
model.forward = torch.compile(model.forward, backend='zentorch')
with torch.no_grad(): 
    output = model.forward(input)

Case #3: If the output is generated through a call to model.generate, optimize it as shown here:

  • Optimize model.forward with torch.compile instead of model.generate.
  • However, proceed to generate the output through a call to model.generate, as in the following snippet:

model = zentorch.llm.optimize(model, dtype)
model.forward = torch.compile(model.forward, backend='zentorch')
with torch.no_grad():
    output = model.generate(input)
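
Putting Case #3 together as an end-to-end sketch (the model ID, dtype, prompt, and generation arguments are illustrative assumptions; choose a model ID from the SUPPORTED_MODELS list above):

import torch
import zentorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'gpt2'        # illustrative; pick a model from zentorch.llm.SUPPORTED_MODELS
dtype = torch.bfloat16   # illustrative dtype choice

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).eval()

model = zentorch.llm.optimize(model, dtype)
model.forward = torch.compile(model.forward, backend='zentorch')

with torch.no_grad():
    inputs = tokenizer('ZenDNN accelerates inference on', return_tensors='pt')
    output = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(output[0], skip_special_tokens=True))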