Currently Supported Operators
Xilinx is continuously improving the DPU IP and the compiler to support more operators with better performance. The following table lists typical operations and the configurations, such as kernel size and stride, that the DPU can support. If an operation's configuration exceeds these limitations, the operator is assigned to the CPU. Additionally, the operators that the DPU can support depend on the DPU type, ISA version, and configuration.
To make the DPU adaptable to a variety of FPGA devices, some DPUs are configurable. You can choose the necessary engines, adjust intrinsic parameters, and create your own DPU IP with the TRD projects, which means the limitations can differ significantly between configurations. You can find more information about how those options affect the limitations in PG338. Alternatively, it is recommended that you try compiling the model with your own DPU configuration; the compiler will report which operators are assigned to the CPU and why. The following table shows a specific configuration of each DPU architecture.
| Typical Operation Type in CNN | Parameters | DPUCZDX8G_ISA0_B4096_MAX_BG2 (ZCU102/104) | DPUCAHX8L_ISA0 (U280) | DPUCAHX8H_ISA2 (U50LV9E, U50LV10E, U280), DPUCAHX8H_ISA2_ELP2 (U50) | DPUCVDX8G_ISA0_B8192C32B3 (VCK190) | DPUCVDX8H_ISA0 (VCK5000) |
|---|---|---|---|---|---|---|
| Intrinsic Parameter | | channel_parallel: 16<br>bank_depth: 2048 | channel_parallel: 32<br>bank_depth: 4096 | channel_parallel: 16<br>bank_depth: 2048 | channel_parallel: 16<br>bank_depth: 16384 | channel_parallel: 64<br>bank_depth: 256 |
| conv2d | Kernel size | w, h: [1, 16] | w, h: [1, 16] | w, h: [1, 16] | w, h: [1, 16]<br>w * h <= 64 | w, h: [1, 16] |
| | Strides | w, h: [1, 8] | w, h: [1, 4] | w, h: [1, 4] | w, h: [1, 4] | w, h: [1, 4] |
| | Dilation | dilation * input_channel <= 256 * channel_parallel | | | | |
| | Paddings | pad_left, pad_right: [0, (kernel_w - 1) * dilation_w + 1]<br>pad_top, pad_bottom: [0, (kernel_h - 1) * dilation_h + 1] | | | | |
| | In Size | kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth | | | | |
| | Out Size | output_channel <= 256 * channel_parallel | | | | |
| | Activation | ReLU, LeakyReLU, ReLU6 | ReLU, ReLU6 | ReLU, LeakyReLU, ReLU6 | ReLU, LeakyReLU, ReLU6 | ReLU, LeakyReLU |
| | Group* (Caffe) | group==1 | | | | |
| depthwise-conv2d | Kernel size | w, h: [1, 16] | w, h: [3] | Not supported | | |
| | Strides | w, h: [1, 8] | w, h: [1, 2] | | | |
| | Dilation | dilation * input_channel <= 256 * channel_parallel | | | | |
| | Paddings | pad_left, pad_right: [0, (kernel_w - 1) * dilation_w + 1]<br>pad_top, pad_bottom: [0, (kernel_h - 1) * dilation_h + 1] | | | | |
| | In Size | kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth | | | | |
| | Out Size | output_channel <= 256 * channel_parallel | | | | |
| | Activation | ReLU, ReLU6 | ReLU, ReLU6 | | | |
| | Group* (Caffe) | group==input_channel | | | | |
| transposed-conv2d | Kernel size | kernel_w/stride_w, kernel_h/stride_h: [1, 16] | | | | |
| | Strides | | | | | |
| | Paddings | pad_left, pad_right: [1, kernel_w-1]<br>pad_top, pad_bottom: [1, kernel_h-1] | | | | |
| | Out Size | output_channel <= 256 * channel_parallel | | | | |
| | Activation | ReLU, LeakyReLU, ReLU6 | ReLU, ReLU6 | ReLU, LeakyReLU, ReLU6 | ReLU, LeakyReLU, ReLU6 | ReLU, LeakyReLU |
| depthwise-transposed-conv2d | Kernel size | kernel_w/stride_w, kernel_h/stride_h: [1, 16] | kernel_w/stride_w, kernel_h/stride_h: [3] | Not supported | | |
| | Strides | | | | | |
| | Paddings | pad_left, pad_right: [1, kernel_w-1]<br>pad_top, pad_bottom: [1, kernel_h-1] | | | | |
| | Out Size | output_channel <= 256 * channel_parallel | | | | |
| | Activation | ReLU, ReLU6 | ReLU, ReLU6 | | | |
| max-pooling | Kernel size | w, h: [2, 8] | w, h: {2, 3, 5, 7, 8} | w, h: [1, 8] | w, h: [2, 8] | w, h: {1, 2, 3, 7} |
| | Strides | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 4] | w, h: [1, 8] |
| | Paddings | pad_left, pad_right: [1, kernel_w-1]<br>pad_top, pad_bottom: [1, kernel_h-1] | | | | |
| | Activation | ReLU | Not supported | ReLU | ReLU | Not supported |
| average-pooling | Kernel size | w, h: [2, 8]<br>w==h | w, h: {2, 3, 5, 7, 8}<br>w==h | w, h: [1, 8]<br>w==h | w, h: [2, 8]<br>w==h | w, h: {1, 2, 3, 7}<br>w==h |
| | Strides | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 4] | w, h: [1, 8] |
| | Paddings | pad_left, pad_right: [1, kernel_w-1]<br>pad_top, pad_bottom: [1, kernel_h-1] | | | | |
| | Activation | ReLU | Not supported | ReLU | ReLU | Not supported |
| eltwise-sum | Input Channel | input_channel <= 256 * channel_parallel | | | | |
| | Activation | ReLU | ReLU | ReLU | ReLU | ReLU |
| concat | | Network-specific limitation, which relates to the size of feature maps, quantization results, and compiler optimizations. | | | | |
| reorg | Strides | reverse==false: stride ^ 2 * input_channel <= 256 * channel_parallel<br>reverse==true: input_channel <= 256 * channel_parallel | | | | |
| pad | In Size | input_channel <= 256 * channel_parallel | | | | |
| | Mode | "SYMMETRIC" ("CONSTANT" padding is fused into adjacent operators during compiler optimization) | | | | |
| global pooling | | Global pooling is processed as general pooling with kernel size equal to the input tensor size. | | | | |
| InnerProduct, Fully Connected, Matmul | | These operators are transformed into conv2d operators with kernel size 1x1. | | | | |
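As a reading aid only (not part of the Vitis AI toolchain), the sketch below applies the conv2d limits from the DPUCZDX8G_ISA0_B4096_MAX_BG2 column above (channel_parallel = 16, bank_depth = 2048) to a hypothetical layer configuration. The function name and parameters are illustrative; the actual placement decision is always made by the compiler.

```python
import math

# Illustrative values taken from the DPUCZDX8G_ISA0_B4096_MAX_BG2 column above.
CHANNEL_PARALLEL = 16
BANK_DEPTH = 2048

def conv2d_fits_dpu(kernel_w, kernel_h, stride_w, stride_h,
                    dilation, input_channel, output_channel):
    """Return True if a conv2d layer satisfies the limits listed in the table
    for this particular DPU configuration."""
    checks = [
        1 <= kernel_w <= 16 and 1 <= kernel_h <= 16,            # Kernel size
        1 <= stride_w <= 8 and 1 <= stride_h <= 8,              # Strides
        dilation * input_channel <= 256 * CHANNEL_PARALLEL,     # Dilation
        kernel_w * kernel_h * math.ceil(input_channel / CHANNEL_PARALLEL)
            <= BANK_DEPTH,                                      # In Size
        output_channel <= 256 * CHANNEL_PARALLEL,               # Out Size
    ]
    return all(checks)

# A 3x3, stride-1 convolution with 512 input and 512 output channels fits;
# an oversized kernel would be assigned to the CPU instead.
print(conv2d_fits_dpu(3, 3, 1, 1, 1, 512, 512))    # True
print(conv2d_fits_dpu(17, 17, 1, 1, 1, 512, 512))  # False
```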
The following operators are defined as primitives in the different deep learning frameworks. The compiler can automatically parse these operators, transform them into the XIR format, and distribute them to the DPU or CPU. These operators are only partially supported by the tools; they are listed here for your reference.
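If you want to check the partitioning result of a compiled model yourself, the sketch below assumes the xir Python package shipped with Vitis AI and a hypothetical compiled file named resnet50.xmodel; it walks the subgraphs of the compiled graph and prints the device each one was assigned to. The exact API surface can differ between Vitis AI releases, so treat this as an illustration rather than a reference.

```python
import xir

# Hypothetical path to a compiled model; replace it with your own .xmodel file.
XMODEL = "resnet50.xmodel"

graph = xir.Graph.deserialize(XMODEL)
root = graph.get_root_subgraph()

# Walk the child subgraphs in topological order and report where each one was placed.
for subgraph in root.toposort_child_subgraph():
    device = subgraph.get_attr("device") if subgraph.has_attr("device") else "unknown"
    print(f"{subgraph.get_name():40s} -> {device}")
```

Subgraphs reported as CPU correspond to operators that fall outside the limits in the table above or that the DPU does not implement.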