The Xilinx® DPURADR16L IP is a programmable engine optimized for recurrent neural networks, targeting mainly low-latency applications. This IP is implemented on the Alveo U25 card in a single-thread configuration.
The design is composed of Scheduler, Load, and Save modules for data movement between the off-chip memory and the on-chip caches. It also includes a 32x32 systolic array of DSPs that performs matrix-vector multiplications, along with additional computation modules for miscellaneous operations such as element-wise multiplication, element-wise addition, and non-linear functions. The Scheduler is responsible for fetching instructions from the off-chip memory and distributing them to the different computation units according to dependency constraints. Figure 1 and Figure 2 show the architecture of the kernel and of the systolic array module, respectively.
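To make the division of work concrete, the following is a minimal software sketch of the kinds of operations described above: a matrix-vector multiplication processed in 32x32 tiles, followed by element-wise addition, element-wise multiplication, and a non-linear function. It is only an illustrative model, not the IP's actual dataflow or instruction set. The function names (`matvec_tiled`, `elementwise_add`, `elementwise_mul`, `apply_tanh`), the tiling order, and the choice of `tanh` as the non-linear function are all assumptions made for this example.

```python
import math

TILE = 32  # tile edge chosen to mirror the 32x32 array size stated in the text


def matvec_tiled(matrix, vector, tile=TILE):
    """Compute matrix @ vector by accumulating tile-by-tile partial sums.

    Assumption: the tile loop order here is illustrative only and does not
    describe how the hardware actually schedules its work.
    """
    rows, cols = len(matrix), len(vector)
    result = [0] * rows
    for r0 in range(0, rows, tile):          # tile of output rows
        for c0 in range(0, cols, tile):      # tile of input columns
            for r in range(r0, min(r0 + tile, rows)):
                partial = 0
                for c in range(c0, min(c0 + tile, cols)):
                    partial += matrix[r][c] * vector[c]
                result[r] += partial         # accumulate across column tiles
    return result


def elementwise_add(a, b):
    """Element-wise addition of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def elementwise_mul(a, b):
    """Element-wise multiplication of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


def apply_tanh(a):
    """Hypothetical non-linear function; tanh is a common choice in RNNs."""
    return [math.tanh(x) for x in a]


if __name__ == "__main__":
    # Small example: 3x4 weights, 4-element input, 3-element bias and gate.
    weights = [[1, 0, 2, -1], [0, 1, 0, 1], [2, 2, 0, 0]]
    x = [1, 2, 3, 4]
    bias = [0, 1, -1]
    gate = [1.0, 0.5, 0.0]
    pre_activation = elementwise_add(matvec_tiled(weights, x), bias)
    print(elementwise_mul(apply_tanh(pre_activation), gate))
```

In this sketch the matrix-vector product dominates the arithmetic, which is consistent with the text assigning it to the systolic array, while the cheaper element-wise and non-linear steps are handled separately.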