The following table compares a 1024 x 1024 x 1024 GeMM design implemented with AI Engines against one implemented with DSP Engines. It lists the throughput, resource utilization, power consumption, and performance in throughput per Watt for the cint16 implementations.
| Design Target | TOPS | Average Latency (in μs) | AIE Vector Cores | AIE Vector Load | Active Mem Banks / Mem R/W Rate | Active AIE Tiles | FF (Regs) / CLB LUTs | BRAMs | DSPs | Dynamic Power (in mW) | TOPS per Watt (in TOPS/Watt) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AIE | 1.575 | 3.315 | 24 | 84.60% | 252 / 14.245% | 43 | 26478 / 13548 | 66 | 0 | 4911 | 0.320 |
| DSP | 1.433 | 1497.971 | NA | NA | NA | NA | 74674 / 20700 | 64 | 1024 | 8709 | 0.164 |
It is important to understand that not all 46 of those AI Engine tiles are required for the GeMM compute itself: 24 AI Engines/vector cores perform the computation, while 22 AI Engines provide the memory to store the matrices and enable connectivity around the array. The average load on these additional 22 AI Engine tiles is 84.63%.
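As a quick sanity check on the efficiency column, TOPS per Watt can be recomputed from the TOPS and dynamic-power columns of the table above (converting milliwatts to watts); the results agree with the table's figures to within rounding:

```python
# Recompute TOPS/Watt from the table's TOPS and dynamic power (mW) columns.
designs = {
    "AIE": {"tops": 1.575, "dyn_power_mw": 4911},
    "DSP": {"tops": 1.433, "dyn_power_mw": 8709},
}

for name, d in designs.items():
    # Convert dynamic power from mW to W, then divide.
    tops_per_watt = d["tops"] / (d["dyn_power_mw"] / 1000.0)
    print(name, tops_per_watt)
```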
Measurement:

- AI Engine design resource utilization is measured using the Xilinx Power Estimator (XPE) and Vivado (report utilization under implementation, for FFs and CLB LUTs). For the HLS design, resource utilization is measured using Vivado.
- AI Engine power consumption is measured using XPE. HLS power consumption is measured using Vivado (report power under implementation).
- Throughput is measured by viewing the runtime-profiling trace texts generated for `vitis_analyzer`.

For detailed instructions on measuring these parameters, refer to the individual implementation section.
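To illustrate how a throughput figure in TOPS can be derived from trace data, the sketch below computes it from a pair of timestamps and a matrix count. The function name, the picosecond timestamp unit, and all numeric inputs are assumptions for this example, not values taken from an actual trace:

```python
def gemm_tops(t_start_ps: int, t_end_ps: int, num_matrices: int,
              m: int = 1024, k: int = 1024, n: int = 1024) -> float:
    """Estimate TOPS for a batch of m x k x n GeMMs from two trace timestamps.

    Each output element needs k multiply-accumulates, i.e. 2*k operations,
    so one GeMM costs 2 * m * k * n operations in total.
    """
    total_ops = num_matrices * 2 * m * k * n
    elapsed_s = (t_end_ps - t_start_ps) * 1e-12  # timestamps assumed in picoseconds
    return total_ops / elapsed_s / 1e12          # ops/s -> TOPS

# Illustrative numbers only: 100 matrices over 143 ms gives roughly 1.5 TOPS.
print(gemm_tops(0, 143_000_000_000, 100))
```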