Parameter / tensor / zeros
  XIR op: const
  Attributes: data → data. The XIR const op also carries shape and data_type.
  DPU implementation: Allocate memory for the input data.
Conv2d
  XIR op: conv2d (groups = 1) / depthwise-conv2d (groups = input channels)
  Attributes: kernel_size → kernel; stride → stride; padding → pad; padding_mode ('zeros') → pad_mode (FLOOR); dilation → dilation. in_channels, out_channels, and groups have no direct XIR counterpart.
  DPU implementation: If groups equals the number of input channels, the convolution is compiled to the DepthwiseConvolution Engine. If groups == 1, it is mapped to the Convolution Engine. Otherwise, it is mapped to the CPU.
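The groups-based dispatch described above can be sketched as a small rule. This is a hypothetical helper for illustration, not part of any compiler API; the engine names come from the entry above.

```python
def conv2d_target(groups: int, in_channels: int) -> str:
    """Pick the compilation target for a Conv2d, per the groups rule above."""
    if groups == in_channels:
        return "DepthwiseConvolution Engine"  # compiled as depthwise-conv2d
    if groups == 1:
        return "Convolution Engine"           # compiled as conv2d
    return "CPU"                              # grouped conv with groups not in {1, C}
```

For example, a standard 64-channel convolution (groups = 1) lands on the Convolution Engine, while the same layer with groups = 64 becomes depthwise.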
ConvTranspose2d
  XIR op: transposed-conv2d (groups = 1) / depthwise-transposed-conv2d (groups = input channels)
  Attributes: kernel_size → kernel; stride → stride; padding → pad; padding_mode ('zeros') → pad_mode (FLOOR); dilation → dilation. in_channels, out_channels, and groups have no direct XIR counterpart.
  DPU implementation: If groups equals the number of input channels, the convolution is compiled to the DepthwiseConvolution Engine. If groups == 1, it is mapped to the Convolution Engine. Otherwise, it is mapped to the CPU.
matmul
  XIR op: conv2d / matmul
  Attributes: the XIR matmul op carries transpose_a and transpose_b; there are no corresponding PyTorch attributes.
  DPU implementation: The matmul is transformed to conv2d and compiled to the Convolution Engine. If the transformation fails, it is implemented on the CPU.
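The matmul-to-conv2d transformation works because an (N, K) x (K, M) matrix product is the same computation as a 1x1 convolution with K input channels and M output channels. A minimal nested-list sketch (assumed equivalence, not compiler code) makes this concrete:

```python
def matmul(a, b):
    """Plain (N, K) x (K, M) matrix multiply over nested lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def conv1x1(x, w):
    """1x1 convolution: x holds K channels of N 'pixels'; w holds M filters of
    K weights each. Every output value is a dot product over input channels --
    exactly one matmul entry."""
    k, n, m = len(x), len(x[0]), len(w)
    return [[sum(w[o][c] * x[c][p] for c in range(k)) for p in range(n)]
            for o in range(m)]
```

Feeding the conv the channels-first transpose of `a` and the transpose of `b` reproduces the matmul result (transposed, since conv output is channels-first).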
MaxPool2d / AdaptiveMaxPool2d
  XIR op: maxpool2d
  Attributes: kernel_size → kernel; stride → stride; padding → pad; ceil_mode → pad_mode; output_size (adaptive) → global.
  DPU implementation: Pooling Engine.
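ceil_mode maps to pad_mode because it only changes how the output extent is rounded. A sketch of the standard pooling output-size formula (assumed here; it matches PyTorch's documented behavior, not compiler internals):

```python
import math

def pool_out_size(in_size: int, kernel: int, stride: int, pad: int,
                  ceil_mode: bool) -> int:
    """Output extent along one axis; ceil_mode selects CEIL vs FLOOR rounding."""
    span = in_size + 2 * pad - kernel
    steps = math.ceil(span / stride) if ceil_mode else span // stride
    return steps + 1
```

For a 5-wide input with a 2x2 window at stride 2, FLOOR rounding gives 2 outputs and CEIL rounding gives 3.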
AvgPool2d / AdaptiveAvgPool2d
  XIR op: avgpool2d
  Attributes: kernel_size → kernel; stride → stride; padding → pad; ceil_mode → pad_mode; count_include_pad → count_include_pad; output_size (adaptive) → global. The XIR op also carries count_include_invalid (true).
  DPU implementation: Pooling Engine.
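count_include_pad decides the divisor when an averaging window overlaps padding. A one-dimensional sketch of the assumed semantics (following PyTorch's AvgPool documentation):

```python
def avg_window(values, pad_flags, count_include_pad):
    """Average one pooling window; pad_flags marks padded (zero-valued) slots."""
    total = sum(v for v, p in zip(values, pad_flags) if not p)  # pads contribute 0
    divisor = (len(values) if count_include_pad
               else sum(1 for p in pad_flags if not p))
    return total / divisor
```

A window covering one padded slot and the real values 4 and 6 averages to 5.0 when pads are excluded, but 10/3 when they are counted.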
ReLU
  XIR op: relu
  DPU implementation: Activations are fused into adjacent operations such as convolution and add.

LeakyReLU
  XIR op: leakyrelu
  Attributes: negative_slope → alpha.

ReLU6 / Hardtanh (min_val = 0, max_val = 6)
  XIR op: relu6

Hardsigmoid
  XIR op: hardsigmoid

Hardswish
  XIR op: hardswish
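The ReLU6/Hardtanh entry relies on the identity relu6(x) = hardtanh(x, 0, 6), which a few lines of plain Python can confirm:

```python
def relu6(x: float) -> float:
    """min(max(x, 0), 6) -- the clamp the DPU relu6 op implements."""
    return min(max(x, 0.0), 6.0)

def hardtanh(x: float, min_val: float, max_val: float) -> float:
    """General clamp; with min_val = 0, max_val = 6 it coincides with relu6."""
    return min(max(x, min_val), max_val)
```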
ConstantPad2d / ZeroPad2d
  XIR op: pad
  Attributes: padding → paddings; value = 0 → mode ("CONSTANT").
  DPU implementation: "CONSTANT" padding is fused with adjacent operations.
add
  XIR op: add
  DPU implementation: If the add is an elementwise add, it is mapped to the DPU Elementwise Add Engine. If it is a channelwise add, the compiler searches for opportunities to fuse it with adjacent operations such as convolutions. Adds that are shape-related are removed during compilation; adds that are components of a coarse-grained operation are fused with adjacent operations. Otherwise, the add is compiled into a CPU implementation.

sub / rsub
  XIR op: sub

mul
  XIR op: mul

neg
  XIR op: neg
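A rough sketch of the elementwise/channelwise distinction the note draws, assuming NCHW layout and treating a per-channel operand as shape (C, 1, 1). The exact broadcast patterns the compiler recognizes are not spelled out above, so this classifier is illustrative only:

```python
def classify_add(shape_a, shape_b):
    """Classify an add by operand shapes (hypothetical helper, NCHW assumed)."""
    if shape_a == shape_b:
        return "elementwise"      # candidate for the DPU Elementwise Add Engine
    channels = shape_a[-3] if len(shape_a) >= 3 else None
    if len(shape_b) == 3 and shape_b == (channels, 1, 1):
        return "channelwise"      # candidate for fusion with adjacent conv
    return "other"                # shape-related, fused, or CPU fallback
```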
sum
  XIR op: reduction_sum
  Attributes: dim → axis; keepdim → keep_dims.

max
  XIR op: reduction_max
  Attributes: dim → axis; keepdim → keep_dims.

mean
  XIR op: reduction_mean
  Attributes: dim → axis; keepdim → keep_dims.
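keepdim maps directly to keep_dims; the only difference it makes is in the result shape of a reduction. A shape-level sketch:

```python
def reduced_shape(shape, axis, keep_dims):
    """Shape after reducing one axis (reduction_sum / reduction_max / reduction_mean)."""
    if keep_dims:
        # Reduced axis is retained with extent 1.
        return tuple(1 if i == axis else d for i, d in enumerate(shape))
    # Reduced axis is dropped entirely.
    return tuple(d for i, d in enumerate(shape) if i != axis)
```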
interpolate / upsample / upsample_bilinear / upsample_nearest
  XIR op: resize
  Attributes: size → size; mode → mode; align_corners → align_corners; scale_factor has no direct XIR counterpart. The XIR op also sets half_pixel_centers = !align_corners.
  DPU implementation: If the resize mode is 'BILINEAR', the cases align_corners = false, half_pixel_centers = false with size = 2, 4, 8, and align_corners = false, half_pixel_centers = true with size = 2, 4, can be transformed to DPU implementations (pad + depthwise-transposed-conv2d). If the resize mode is 'NEAREST' and the sizes are integers, the resize is mapped to a DPU implementation.
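half_pixel_centers = !align_corners switches which source-coordinate formula the resize uses. The two mappings below are the standard conventions (assumed here; they mirror common framework behavior rather than this compiler's source):

```python
def src_coord(dst: int, in_size: int, out_size: int, align_corners: bool) -> float:
    """Map an output index to a (fractional) input coordinate for resize."""
    if align_corners:
        # Grid endpoints of input and output coincide.
        return dst * (in_size - 1) / (out_size - 1)
    # half_pixel_centers: pixel centers sit at index + 0.5.
    return (dst + 0.5) * in_size / out_size - 0.5
```

Note that with half-pixel centers the first output pixel maps slightly outside the input (to -0.25 when upscaling 4 to 8), which is why interpolation kernels clamp at the borders.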
transpose / permute
  XIR op: transpose
  Attributes: dim0, dim1 (transpose) or dims (permute) → order.
  DPU implementation: These operations are transformed to the reshape operation in some cases. The compiler additionally searches for opportunities to fuse the dimension-transformation operations into special load or save instructions of adjacent operations to reduce the overhead. Otherwise, they are mapped to the CPU.
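PyTorch's transpose(dim0, dim1) names only the two swapped axes, while the XIR transpose op takes a full permutation in `order` (permute's dims already is one). The expansion is mechanical:

```python
def transpose_to_order(ndim: int, dim0: int, dim1: int):
    """Expand transpose(dim0, dim1) into a full XIR-style order permutation."""
    order = list(range(ndim))
    order[dim0], order[dim1] = order[dim1], order[dim0]
    return order
```

For a 4-D tensor, transpose(1, 2) becomes order = [0, 2, 1, 3].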
view / reshape
  XIR op: reshape
  Attributes: size → shape.

flatten
  XIR op: reshape / flatten
  Attributes: start_dim → start_axis; end_dim → end_axis.

squeeze
  XIR op: reshape / squeeze
  Attributes: dim → axis.
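flatten's start_dim/end_dim (start_axis/end_axis in XIR) collapse a contiguous run of axes into one. The resulting shape can be computed directly:

```python
from math import prod

def flatten_shape(shape, start_dim, end_dim):
    """Shape produced by flattening axes start_dim..end_dim (inclusive)."""
    merged = prod(shape[start_dim:end_dim + 1])
    return shape[:start_dim] + (merged,) + shape[end_dim + 1:]
```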
cat
  XIR op: concat
  Attributes: dim → axis.
  DPU implementation: The overhead resulting from the concat is reduced by special reading or writing strategies and by allocating the on-chip memory carefully.
aten::slice*
  XIR op: strided_slice
  Attributes: start → begin; end → end; step → strides. dim has no direct XIR counterpart.
  DPU implementation: If the strided_slice is shape-related or is a component of a coarse-grained operation, it is removed. Otherwise, the strided_slice is compiled into a CPU implementation.
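The begin/end/strides attributes correspond one-to-one to Python's start:stop:step slicing, which is also how aten::slice arises in the first place. A one-axis sketch:

```python
def strided_slice(seq, begin, end, strides):
    """Mimic the XIR strided_slice op on one axis via Python slice semantics."""
    return seq[begin:end:strides]
```

So t[1:7:2] in PyTorch source becomes aten::slice with start = 1, end = 7, step = 2, and then strided_slice with begin = 1, end = 7, strides = 2.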
BatchNorm2d
  XIR op: depthwise-conv2d / scale
  Attributes: eps → epsilon. The XIR op also carries axis, moving_mean, moving_var, gamma, and beta.
  DPU implementation: If the batch_norm is quantized and can be transformed to an equivalent depthwise-conv2d, it is transformed to depthwise-conv2d and the compiler searches for compilation opportunities to map the batch_norm into DPU implementations. Otherwise, the batch_norm is executed by the CPU.
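The batch_norm-to-depthwise-conv2d transformation rests on folding the normalization into a per-channel scale and bias. A per-channel numeric sketch of the standard folding (assumed math, not the compiler's code):

```python
import math

def fold_batch_norm(gamma, beta, moving_mean, moving_var, epsilon):
    """Fold BN parameters into the weight/bias of a 1x1 depthwise conv channel."""
    weight = gamma / math.sqrt(moving_var + epsilon)
    bias = beta - moving_mean * weight
    return weight, bias

def batch_norm(x, gamma, beta, moving_mean, moving_var, epsilon):
    """Reference inference-time batch norm for one channel value."""
    return gamma * (x - moving_mean) / math.sqrt(moving_var + epsilon) + beta
```

For any input, x * weight + bias reproduces the batch-norm output exactly, which is why a quantized BN can become a depthwise convolution.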
softmax
  XIR op: softmax
  Attributes: dim → axis.
  DPU implementation: Compiled into a CPU implementation only.

Tanh
  XIR op: tanh
  DPU implementation: Compiled into a CPU implementation only.

Sigmoid
  XIR op: sigmoid
  DPU implementation: Compiled into a CPU implementation only.
PixelShuffle
  XIR op: pixel_shuffle
  Attributes: upscale_factor → scale. The XIR op also sets upscale = true.
  DPU implementation: Transformed to tile if there is a convolution as its input.

PixelUnshuffle
  XIR op: pixel_shuffle
  Attributes: downscale_factor → scale. The XIR op also sets upscale = false.
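PixelShuffle rearranges a (C*r*r, H, W) tensor into (C, H*r, W*r); the upscale attribute selects the direction. A nested-list sketch of the forward (upscale = true) rearrangement, following PyTorch's documented index formula:

```python
def pixel_shuffle(x, r):
    """x: nested list with C*r*r channels of H x W; returns C channels of H*r x W*r."""
    c_out = len(x) // (r * r)
    h, w = len(x[0]), len(x[0][0])
    return [[[x[c * r * r + (i % r) * r + (j % r)][i // r][j // r]
              for j in range(w * r)]
             for i in range(h * r)]
            for c in range(c_out)]
```

Four 1x1 channels holding 0, 1, 2, 3 shuffle (r = 2) into a single 2x2 plane [[0, 1], [2, 3]].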
* If a tensor slice in PyTorch is written in Python slicing syntax, it is transformed into aten::slice.