The detailed hardware architecture of the DPUCAHX8H is shown in the following figure.
Each implementation has one to three DPU cores, and each core has three to five processing engines (PEs). The number of cores and the number of PEs per core are chosen to balance throughput requirements against FPGA resource usage. After startup, the DPU fetches model instructions from system memory to control the operation of the computing engine. AI tools are used to optimize the model for efficient DPU usage and then to generate the runtime instructions needed to implement the model in the DPU core.
HBM is used to buffer weights, biases, intermediate data, and output data to achieve high throughput and efficiency.
Figure 1. DPU Hardware Architecture