Hardware Architecture

DPUCVDX8G for Versal ACAPs Product Guide (PG389)

Document ID: PG389
Release Date: 2023-01-23
Version: 1.3 English

The DPUCVDX8G is composed of the programmable logic (PL) and the AI Engine array. The AI Engine array performs the convolution operations in neural networks. Data transfers, instruction scheduling, pooling, element-wise sum, and depth-wise convolution are executed in the PL.

The DPUCVDX8G can be configured with multiple batch handlers. Each batch handler has a corresponding AI Engine group and the related AI Engine interface tile resources. On the PL side, the DPUCVDX8G is split into two parts: the batch handlers and the shared logic.

A batch handler mainly processes feature maps (loading, saving, pooling, and so on). Its Arithmetic and Logic Unit (ALU) module performs the pooling, element-wise, and depth-wise convolution operations on the feature maps. Feature maps are stored in the IMG BANK, which is built from on-chip RAM. The image sender and weights sender modules prepare the data for the AI Engine array.

The shared logic in the PL comprises the Permuter module and the Scheduler module. The Scheduler fetches instructions from the NoC and dispatches them to the batch handlers and the Permuter module. The Permuter loads the weights and biases from the NoC and sends them to the AI Engine array for each calculation iteration. For more information, see Table 1.
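The structural split described above can be summarized in a small illustrative model. This is a documentation sketch only: the class and field names (`BatchHandler`, `img_bank_entries`, and so on) mirror the block diagram in the text and are not actual hardware parameters of the DPUCVDX8G.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BatchHandler:
    """Per-batch PL resources plus the paired AI Engine group."""
    aie_group_id: int       # each batch handler has a corresponding AIE group
    img_bank_entries: int   # IMG BANK: on-chip RAM holding feature maps
    has_alu: bool = True    # ALU: pooling, element-wise, depth-wise conv

@dataclass
class SharedLogic:
    """PL logic shared by all batch handlers."""
    scheduler: str = "fetches instructions from the NoC and dispatches them"
    permuter: str = "loads weights/bias from the NoC and feeds the AIE array"

@dataclass
class DpuConfig:
    batch_handlers: List[BatchHandler]
    shared: SharedLogic = field(default_factory=SharedLogic)

def make_dpu(batch: int, img_banks: int = 4) -> DpuConfig:
    # One AI Engine group (and its interface tiles) per batch handler;
    # the Scheduler and Permuter exist once and are shared.
    return DpuConfig(
        batch_handlers=[BatchHandler(aie_group_id=i, img_bank_entries=img_banks)
                        for i in range(batch)]
    )

dpu = make_dpu(batch=3)
print(len(dpu.batch_handlers))  # -> 3
```

The point of the sketch is the cardinality: per-batch resources scale with the configured batch size, while the Scheduler and Permuter remain single shared instances.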

The DPUCVDX8G can also be configured with multiple compute units, each using additional NoC NMU interfaces, to run different models simultaneously.
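From the software side, running different models on separate compute units amounts to independent per-unit dispatch. The sketch below illustrates that pattern with plain Python threads and queues; `run_model` and the compute-unit IDs are hypothetical placeholders, not a Vitis AI API.

```python
import queue
import threading

def run_model(cu_id, model_name, frame):
    # Placeholder for a DPU inference call on compute unit `cu_id`.
    return f"cu{cu_id}:{model_name}:{frame}"

def worker(cu_id, jobs, results):
    # Each compute unit drains its own job queue independently.
    while True:
        item = jobs.get()
        if item is None:          # sentinel: shut this compute unit down
            break
        model_name, frame = item
        results.append(run_model(cu_id, model_name, frame))

jobs = [queue.Queue() for _ in range(2)]   # one queue per compute unit
results = []
threads = [threading.Thread(target=worker, args=(i, q, results))
           for i, q in enumerate(jobs)]
for t in threads:
    t.start()

jobs[0].put(("resnet50", "frame0"))   # CU 0 runs one model...
jobs[1].put(("yolov3", "frame0"))     # ...while CU 1 runs a different one
for q in jobs:
    q.put(None)
for t in threads:
    t.join()
print(sorted(results))  # -> ['cu0:resnet50:frame0', 'cu1:yolov3:frame0']
```

Because each compute unit has its own NoC NMU interfaces, the two dispatch loops do not contend for a single memory path in the way a shared-port design would.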

After start-up, the DPUCVDX8G fetches DPU instructions from the NoC. The instructions are generated by the Vitis™ AI compiler, which performs substantial optimizations, and they control the operation of the various engines in the DPU.

On-chip memory is used to buffer input, intermediate, and output activations, resulting in high throughput and efficiency. Data in these local buffers is reused to reduce external memory bandwidth. The computing engines in the DPU use a deeply pipelined design.

Figure 1. Hardware Architecture of the DPUCVDX8G