The following figure shows the detailed hardware architecture of the DPUCADF8H. A design can contain one to four DPUCADF8H instances, depending on the available FPGA resources. Each DPUCADF8H has four batch engines, called Process Elements (PEs). The PE is the computing core of the accelerator. The parameters buffer (PB) is shared by all PEs and stores parameters such as weights, quantization parameters, and nonlinear parameters. The data rearrangement unit (DRU) is a pre-processing unit that operates on the original image in external storage, performing operations such as zero-centering, scaling, whitening, and data rearrangement. The DPUCADF8H also supports multitasking, which reduces the interaction overhead between the host and the accelerator.
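The instance and PE counts described above can be captured in a few lines of Python. This is only an illustrative sketch: the `DpuDesign` class and its field names are hypothetical and are not part of any AMD or Vitis AI API. The only facts it encodes are the ones stated in the text (one to four instances per design, four PEs per instance).

```python
from dataclasses import dataclass

# Limits stated in the architecture description.
MIN_INSTANCES = 1
MAX_INSTANCES = 4
PES_PER_INSTANCE = 4  # batch engines (PEs) in each DPUCADF8H


@dataclass(frozen=True)
class DpuDesign:
    """Hypothetical model of a design; not an AMD or Vitis AI type."""

    num_instances: int

    def __post_init__(self) -> None:
        # Reject instance counts outside the documented 1..4 range.
        if not MIN_INSTANCES <= self.num_instances <= MAX_INSTANCES:
            raise ValueError(
                f"num_instances must be between {MIN_INSTANCES} and "
                f"{MAX_INSTANCES}, got {self.num_instances}"
            )

    @property
    def total_pes(self) -> int:
        """Total number of PEs across all instances in the design."""
        return self.num_instances * PES_PER_INSTANCE


if __name__ == "__main__":
    for n in range(MIN_INSTANCES, MAX_INSTANCES + 1):
        print(n, "instance(s) ->", DpuDesign(n).total_pes, "PEs")
```

Under these figures a design has between 4 and 16 PEs in total. How many instances actually fit depends on the FPGA resources available, which this sketch does not model.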
After startup, the DPUCADF8H fetches model instructions from external storage to control the operation of the PEs. The AMD Vitis™ AI compiler generates these instructions and performs substantial optimizations in the process.