XIR Based Flow for DPUv3

Vitis AI User Guide (UG1414)

Document ID: UG1414
Release Date: 2020-03-23
Version: 1.1 English

Xilinx Intermediate Representation (XIR) is a graph-based intermediate representation of AI algorithms, designed for compiling models and deploying them efficiently on the Domain-specific Processing Unit (DPU) on FPGA platforms. It is composed of the Op, Tensor, Graph, and Subgraph libraries. In the future, the Vitis™ AI quantizer, compiler, runtime, and many other tools will use XIR to exchange data. Advanced users can also achieve ML+X, unlocking more of the FPGA's capability, by extending XIR to support customized IP in the Vitis AI flow. Currently, the DPUv3 is enabled by the XIR based flow. This section describes the DPUv3 compiler and the steps for using the common VAI_C interface to create a compiled xmodel from the vai_quantizer outputs.

Figure 1. XIR Based Flow

The XIR based compiler for DPUv3 takes a quantized TensorFlow or Caffe model as input. It first transforms the input model into the XIR format, which is the foundation for all subsequent processing. Most differences between frameworks are eliminated at this stage, leaving a unified representation in XIR. The compiler then applies various optimizations to the graph and breaks it up into several subgraphs based on whether each op can be executed on the DPU. Further architecture-aware optimizations are applied to each subgraph. For each DPU subgraph, the compiler generates the instruction stream and attaches it to that subgraph. Finally, the optimized graph, together with the information and instructions that VART needs, is serialized to a compiled xmodel file.
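The partitioning step described above can be illustrated with a toy model. This is a minimal sketch, not the XIR API: the `DPU_OPS` set, the `Subgraph` class, and the `partition` function are hypothetical names, and the graph is simplified to a linear chain of op types, whereas a real XIR graph is a general directed graph.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical set of op types the DPU can execute (illustrative only).
DPU_OPS = {"conv2d", "maxpool", "eltwise_add", "concat"}


@dataclass
class Subgraph:
    device: str        # "DPU" or "CPU"
    ops: List[str]     # op types assigned to this subgraph, in order


def partition(ops: List[str]) -> List[Subgraph]:
    """Group an ordered op list into maximal runs that share a device."""
    subgraphs: List[Subgraph] = []
    for op in ops:
        device = "DPU" if op in DPU_OPS else "CPU"
        if subgraphs and subgraphs[-1].device == device:
            # Same device as the previous op: extend the current subgraph.
            subgraphs[-1].ops.append(op)
        else:
            # Device changed: start a new subgraph.
            subgraphs.append(Subgraph(device, [op]))
    return subgraphs


model = ["conv2d", "maxpool", "conv2d", "softmax"]
parts = partition(model)
for sg in parts:
    print(sg.device, sg.ops)
```

In this toy example the three leading ops form one DPU subgraph, and the trailing `softmax` falls into a CPU subgraph because it is not in the assumed `DPU_OPS` set.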

The steps to compile Caffe or TensorFlow models for DPUv3 with VAI_C are the same as for previous DPUs. It is assumed that you have successfully installed the Vitis AI package, including VAI_C, and have quantized your model with vai_quantizer.

Caffe

For Caffe, vai_q_caffe generates a PROTOTXT file (deploy.prototxt) and a MODEL file (deploy.caffemodel). Make sure you specify the -keep_fixed_neuron option for vai_q_caffe, because the DPUv3 compiler requires it. The following command is then essentially all you need to produce the compiled xmodel.

vai_c_caffe -p /PATH/TO/deploy.prototxt -c /PATH/TO/deploy.caffemodel -a /PATH/TO/arch/dpuv3e/arch.json -o /OUTPUTPATH -n netname

The compiler creates three files in the OUTPUTPATH directory. ‘netname_org.xmodel’ is the pre-compiled xmodel generated by the compiler frontend. ‘netname.xmodel’ is the compiled xmodel, which contains the instructions and other necessary information. ‘meta.json’ is used by the runtime.

See Model Deployment Overview for more information on deploying the network on the DPU with those files.
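The three output file names follow directly from the -o and -n arguments, so a build script can compute them ahead of time. The helper below is a small sketch; the function name `expected_outputs` and the dictionary keys are illustrative, not part of VAI_C.

```python
from pathlib import PurePosixPath
from typing import Dict


def expected_outputs(output_dir: str, net_name: str) -> Dict[str, str]:
    """Return the paths of the files the compiler is described as writing."""
    out = PurePosixPath(output_dir)
    return {
        # Pre-compiled xmodel produced by the compiler frontend.
        "frontend_xmodel": str(out / f"{net_name}_org.xmodel"),
        # Compiled xmodel with instructions and other necessary information.
        "compiled_xmodel": str(out / f"{net_name}.xmodel"),
        # Metadata file used by the runtime.
        "runtime_meta": str(out / "meta.json"),
    }


for role, path in expected_outputs("/OUTPUTPATH", "netname").items():
    print(f"{role}: {path}")
```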

TensorFlow

For TensorFlow, vai_q_tensorflow generates a pb file (quantize_eval_model.pb). Note that vai_q_tensorflow produces two pb files, and ‘quantize_eval_model.pb’ is the one to use with the DPUv3 compiler, which differs from DPUv2. The compilation command is similar to the Caffe one.

vai_c_tensorflow -f /PATH/TO/quantize_eval_model.pb -a /PATH/TO/arch/dpuv3e/arch.json -o /OUTPUTPATH -n netname

The outputs are the same as for Caffe.
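When compilation is driven from a script, the two commands can be assembled from one helper. The sketch below uses only the flags shown in this section (-p, -c, -f, -a, -o, -n). The function `vai_c_command` and its parameter names are hypothetical, and the snippet only prints the command rather than running it.

```python
import shlex
from typing import List, Optional


def vai_c_command(
    framework: str,
    arch_json: str,
    output_dir: str,
    net_name: str,
    *,
    prototxt: Optional[str] = None,
    caffemodel: Optional[str] = None,
    frozen_pb: Optional[str] = None,
) -> List[str]:
    """Build the argument list for vai_c_caffe or vai_c_tensorflow."""
    if framework == "caffe":
        if not (prototxt and caffemodel):
            raise ValueError("Caffe needs both prototxt and caffemodel")
        cmd = ["vai_c_caffe", "-p", prototxt, "-c", caffemodel]
    elif framework == "tensorflow":
        if not frozen_pb:
            raise ValueError("TensorFlow needs the quantize_eval_model.pb path")
        cmd = ["vai_c_tensorflow", "-f", frozen_pb]
    else:
        raise ValueError(f"unsupported framework: {framework}")
    # Options shared by both front ends.
    return cmd + ["-a", arch_json, "-o", output_dir, "-n", net_name]


cmd = vai_c_command(
    "tensorflow",
    "/PATH/TO/arch/dpuv3e/arch.json",
    "/OUTPUTPATH",
    "netname",
    frozen_pb="/PATH/TO/quantize_eval_model.pb",
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` in an environment where the Vitis AI tools are installed.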

Currently Supported Operators

Xilinx is continuously improving the DPUv3 IP and compiler to support more operators with better performance. DPUv3 currently supports the operators defined by Caffe and TensorFlow listed below, with some limitations.

Table 1. Currently Supported Operators

| Typical Layers in CNN | Parameters | DPU Support |
|---|---|---|
| Convolution (Caffe: Convolution) (TensorFlow: Conv2d, SeparableConv2D…) | Kernel size | W: [1, 8], H: [1, 8] |
| | Strides | W: [1, 4], H: [1, 4] |
| | Paddings | Left, Right: [1, kernel_w-1]; Top, Bottom: [1, kernel_h-1] |
| | In/Out Size | Arbitrary |
| | In/Out Channels | [1, 256 * channel_parallel] |
| | Activation | ReLU, LeakyReLU or ReLU6 |
| | Dilation | dilation * input_channel <= 256 * channel_parallel && stride == 1 |
| | Group* (Caffe) | Group == 1 |
| Deconvolution (Caffe: Deconvolution) (TensorFlow: Conv2DTranspose) | Kernel size | W: [1, 8], H: [1, 8] |
| | Strides | W: [1, 4], H: [1, 4] |
| | Paddings | Left, Right: [1, kernel_w-1]; Top, Bottom: [1, kernel_h-1] |
| | In/Out Size | Arbitrary |
| | In/Out Channels | [1, 256 * channel_parallel] |
| | Activation | ReLU, LeakyReLU or ReLU6 |
| Max Pooling (Caffe: Pooling) (TensorFlow: MaxPool2D) | Kernel size | W: [1, 8], H: [1, 8] |
| | Strides | W: [1, 4], H: [1, 4] |
| | Paddings | Left, Right: [1, kernel_w-1]; Top, Bottom: [1, kernel_h-1] |
| Average Pooling (Caffe: Pooling) (TensorFlow: AveragePooling2D, Mean) | Kernel size | W: [1, 8], H: [1, 8] |
| | Strides | W: [1, 4], H: [1, 4] |
| | Paddings | Left, Right: [1, kernel_w-1]; Top, Bottom: [1, kernel_h-1] |
| Element-wise Sum (Caffe: Eltwise) (TensorFlow: Add) | Input Size | Arbitrary |
| | Input Channel | [1, 256 * channel_parallel] |
| | Activation | ReLU or LeakyReLU |
| Concat (Caffe: Concat) (TensorFlow: Concatenate) | Number, Axis | Arbitrary |
| | Out Channel | [1, 256 * channel_parallel] |
| Reorg* (Caffe) | Strides* | stride ^ 2 * input_channel <= 256 * channel_parallel |
| | Scale*, Reverse* | Arbitrary |
| Fully Connected (Caffe: Inner Product) (TensorFlow: Matmul, Mul) | Input Channel | input_channel < 2048 * channel_parallel |
| | Output Channel | Arbitrary |

  1. Group* and Reorg* are specific to Caffe.
  2. The parameter channel_parallel is determined by the DPU configuration. For DPUv3, channel_parallel is 16.
  3. Both VALID and SAME pad_mode are supported for TensorFlow operators.
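With channel_parallel = 16, the channel limit 256 * channel_parallel in Table 1 works out to 4096. The following sketch checks a convolution configuration against some of the Table 1 limits. The names `ConvParams` and `conv_violations` are hypothetical, padding and activation checks are left out for brevity, and two points are assumptions made for this example: the dilation rule is applied only when dilation is greater than 1, and the Caffe-only Group rule is applied to every input.

```python
from dataclasses import dataclass
from typing import List

CHANNEL_PARALLEL = 16                   # DPUv3 value from note 2
MAX_CHANNELS = 256 * CHANNEL_PARALLEL   # 4096


@dataclass
class ConvParams:
    kernel_w: int
    kernel_h: int
    stride_w: int
    stride_h: int
    in_channels: int
    out_channels: int
    dilation: int = 1
    group: int = 1


def conv_violations(p: ConvParams) -> List[str]:
    """List the Table 1 convolution limits that the parameters break."""
    problems: List[str] = []
    if not (1 <= p.kernel_w <= 8 and 1 <= p.kernel_h <= 8):
        problems.append("kernel size outside [1, 8]")
    if not (1 <= p.stride_w <= 4 and 1 <= p.stride_h <= 4):
        problems.append("stride outside [1, 4]")
    if not (1 <= p.in_channels <= MAX_CHANNELS
            and 1 <= p.out_channels <= MAX_CHANNELS):
        problems.append(f"channels outside [1, {MAX_CHANNELS}]")
    # Assumption: the dilation row only applies to dilated convolutions.
    if p.dilation > 1:
        if p.dilation * p.in_channels > MAX_CHANNELS:
            problems.append(f"dilation * input_channel exceeds {MAX_CHANNELS}")
        if p.stride_w != 1 or p.stride_h != 1:
            problems.append("dilated convolution requires stride 1")
    # Assumption: the Caffe-only Group rule is checked for all inputs.
    if p.group != 1:
        problems.append("group must be 1")
    return problems


print(conv_violations(ConvParams(3, 3, 1, 1, 64, 128)))
print(conv_violations(ConvParams(3, 3, 2, 2, 2048, 64, dilation=4)))
```

The first configuration satisfies every checked limit. The second breaks both dilation conditions, because 4 * 2048 = 8192 exceeds 4096 and the stride is not 1.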

The operators listed above are commonly used in CNN models, and the DPU supports many configurations of them.

The operators below are primitives defined by the individual deep learning frameworks. The compiler automatically parses these operators and assigns them to the DPU or the CPU. They are only partially supported by the tools and are listed here for reference.

Table 2. Operators Information

| Operators | Framework | Parameters | DPU Support |
|---|---|---|---|
| Const | TensorFlow | - | Arbitrary |
| Shape | TensorFlow | - | Arbitrary |
| Identity | TensorFlow | - | Arbitrary |
| Batchnorm+ | Caffe | - | Arbitrary |
| Neg* | TensorFlow | - | Partially |
| Mul* | TensorFlow | - | Partially |
| Sub* | TensorFlow | - | Partially |
| Gstiling* | Caffe | reverse, stride | Partially |
| Permute* | Caffe | order | Partially |
| Flatten* | Caffe/TensorFlow | start_dim, end_dim | Partially |
| Squeeze* | TensorFlow | dims | Partially |
| Reshape* | TensorFlow | shape | Partially |
| Stack* | TensorFlow | axis | Partially |
| Matmul* | TensorFlow | transpose_a, transpose_b | Partially |
| Strided_Slice* | TensorFlow | begin, end, strides, begin_mask, end_mask, ellipsis_mask, new_axis_mask, shrink_axis_mask | Partially |
| Mean* | TensorFlow | dims, keep_dims | Avgpool-like configurations |
| Resize* | TensorFlow | scale, align_corners, mode | scale = 2, false, NEAREST |
| Pad* | TensorFlow | pad, pad_mode, constant_value | “Constant” and pad with 0, “SYMMETRIC” |
| Resize_nearest* | TensorFlow | align_corners | False |
| DeephiResize* | Caffe | scale, mode | Scale = 2, NEAREST |
| Upsample2D** | TensorFlow | align_corners | - |
| Resize_bilinear** | TensorFlow | align_corners | - |
| Space_to_batch** | TensorFlow | block_shape, Paddings | - |
| Batch_to_space** | TensorFlow | block_shape, Paddings | - |
| Prior_box** | Caffe | - | - |
| Softmax** | TensorFlow | axis | - |
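
Some rows in Table 2 restrict the DPU Support column to specific parameter values. As an illustration, the sketch below tests whether a Resize or Pad configuration matches the values listed for those two rows. The function names are hypothetical, and the reading of the Pad row (constant padding with value 0, or symmetric padding) is an interpretation of the table entry. A result of False means only that the configuration does not match the listed values.

```python
def resize_matches_table2(scale: int, align_corners: bool, mode: str) -> bool:
    """Resize* row: scale = 2, align_corners false, mode NEAREST."""
    return scale == 2 and not align_corners and mode.upper() == "NEAREST"


def pad_matches_table2(pad_mode: str, constant_value: float = 0) -> bool:
    """Pad* row: constant padding with 0, or symmetric padding."""
    mode = pad_mode.upper()
    if mode == "CONSTANT":
        return constant_value == 0
    return mode == "SYMMETRIC"


print(resize_matches_table2(2, False, "NEAREST"))
print(resize_matches_table2(2, True, "NEAREST"))
print(pad_matches_table2("CONSTANT", 0))
print(pad_matches_table2("REFLECT"))
```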