Using Warm-up Iterations for Optimal Performance

ZenDNN User Guide (57300)

Document ID: 57300
Release Date: 2026-04-13
Revision: 5.2.1 English

For optimal performance when using torch.compile with zentorch as the backend, run five warm-up iterations: execute the inference section of the model five times before measuring or relying on inference latency. The initial runs trigger graph compilation, operator fusion, and population of internal ZenDNN caches (for example, weight reordering). Runs after warm-up reflect the optimized, steady-state performance.
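To make the warm-up effect visible, it can help to record the latency of each call separately. The following is a minimal sketch using only the Python standard library. The names per_call_latencies_ms and fake_model_call are hypothetical and not part of zentorch; fake_model_call only simulates a slow first call and stands in for a real call such as compiled_model(sample_input).

```python
import time


def per_call_latencies_ms(run_once, num_calls=8):
    """Call run_once num_calls times and return each call's latency in milliseconds."""
    latencies = []
    for _ in range(num_calls):
        start = time.perf_counter()
        run_once()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Hypothetical stand-in workload (an assumption for illustration only):
# the first call pays a simulated one-time cost, later calls do not.
state = {"compiled": False}


def fake_model_call():
    if not state["compiled"]:
        time.sleep(0.05)  # simulated one-time compilation cost
        state["compiled"] = True
    time.sleep(0.001)  # simulated steady-state work


for i, ms in enumerate(per_call_latencies_ms(fake_model_call, num_calls=6), start=1):
    print(f"call {i}: {ms:.2f} ms")
```

With a real compiled model, the first few values would be expected to be much larger than the rest. The point at which they level off indicates whether the warm-up count is sufficient for that model.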

Here is a complete example using a ResNet-50 model.

import torch
import zentorch
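from torchvision.models import ResNet50_Weights, resnet50

# Step 1: Load a pretrained model and set it to eval mode.
# The weights argument replaces the deprecated pretrained=True (torchvision 0.13 and later).
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()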

# Step 2: Compile the model with zentorch backend
compiled_model = torch.compile(model, backend="zentorch")

# Step 3: Create a sample input
sample_input = torch.randn(1, 3, 224, 224)

# Step 4: Warm-up phase — run inference 5 times to trigger compilation and
#          populate internal caches for optimal steady-state performance
WARMUP_COUNT = 5
with torch.no_grad():
    for i in range(WARMUP_COUNT):
        output = compiled_model(sample_input)
        print(f"Warm-up iteration {i + 1}/{WARMUP_COUNT} complete")

# Step 5: Timed inference — measure performance after warm-up
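import time
with torch.no_grad():
    # perf_counter() is a monotonic, high-resolution clock suited to timing short intervals
    start = time.perf_counter()
    output = compiled_model(sample_input)
    end = time.perf_counter()
    print(f"Inference latency after warm-up: {(end - start) * 1000:.2f} ms")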

Sample Output (warm-up progress lines omitted)

Inference latency after warm-up: 4.11 ms
Note:
  • The first call to the compiled model triggers torch.compile graph capture, tracing, and zentorch backend optimizations (operator fusion, ZenDNN op replacement). This makes the first iteration significantly slower.
  • Subsequent warm-up iterations allow ZenDNN internal caches (such as repacked weights and selected matmul strategies) to reach a steady state.
  • After five warm-up iterations, the model is expected to run at its steady-state speed, so the measured latency is representative of sustained inference performance.
  • This warm-up pattern applies to all model types (CNN, NLP, LLM) when using the zentorch backend.
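
A single timed run can be noisy, so in practice it may be useful to time several post-warm-up iterations and report a summary statistic. The following is a minimal sketch under that assumption, using only the Python standard library. The benchmark helper and its parameters (run_once, warmup, iters) are hypothetical names and not part of zentorch or ZenDNN; the default of five warm-up calls follows the recommendation above.

```python
import statistics
import time


def benchmark(run_once, warmup=5, iters=20):
    """Run `warmup` untimed calls, then time `iters` calls and summarize them."""
    if warmup < 0 or iters < 1:
        raise ValueError("warmup must be >= 0 and iters must be >= 1")

    # Warm-up phase: results and timings are discarded.
    for _ in range(warmup):
        run_once()

    # Timed phase: record each call's latency in milliseconds.
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
        "iters": iters,
    }
```

With the ResNet-50 example, this could be called inside a torch.no_grad() block as benchmark(lambda: compiled_model(sample_input)). The median is less sensitive than a single measurement to occasional slow iterations caused by other activity on the system.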