AI Engine and DSP Implementation Comparison - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

The following table compares a 1024 x 1024 x 1024 GeMM design implemented with AI Engines and with DSP Engines (the HLS design), respectively. It lists the throughput, latency, resource utilization, power consumption, and performance in throughput per Watt (TOPS/Watt) for the cint16 implementations.

| Design Target | TOPS | Average Latency (in μs) | AIE Vector Cores | AIE Vector Load | Active Mem Banks / Mem R/W Rate | Active AIE Tiles | FF (Regs) / CLB LUTs | BRAMs | DSPs | Dynamic Power (in mW) | TOPS per Watt (in TOPS/Watt) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AIE | 1.575 | 3.315 | 24 | 84.60% | 252 / 14.245% | 43 | 26478 / 13548 | 66 | 0 | 4911 | 0.320 |
| DSP | 1.433 | 1497.971 | NA | NA | NA | NA | 74674 / 20700 | 64 | 1024 | 8709 | 0.164 |
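
The TOPS per Watt column can be cross-checked from the TOPS and Dynamic Power columns. The sketch below assumes the figure is simply TOPS divided by dynamic power converted from mW to W; the tutorial does not spell out the formula, so treat this as an illustration. The helper name `tops_per_watt` and the dictionary layout are made up for this example.

```python
# Values copied from the comparison table (TOPS and dynamic power in mW).
designs = {
    "AIE": {"tops": 1.575, "dynamic_power_mw": 4911},
    "DSP": {"tops": 1.433, "dynamic_power_mw": 8709},
}


def tops_per_watt(tops, dynamic_power_mw):
    """Assumed formula: throughput (TOPS) / dynamic power (W)."""
    return tops / (dynamic_power_mw / 1000.0)


# Efficiency per design, plus the AIE-to-DSP efficiency ratio.
efficiency = {name: tops_per_watt(**d) for name, d in designs.items()}
ratio = efficiency["AIE"] / efficiency["DSP"]

for name, value in efficiency.items():
    print(f"{name}: {value:.4f} TOPS/Watt")
print(f"AIE / DSP efficiency ratio: {ratio:.2f}")
```

Under this assumption the computed values agree with the table to within about 0.001 TOPS/Watt (the small difference is consistent with rounding of the published figures), and the AI Engine design comes out roughly twice as power-efficient as the DSP design.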

It is important to understand that not all of the 46 AI Engine tiles are required for the GeMM compute: 24 AI Engines (vector cores) are required for computation, and 22 AI Engines are required for the memory that stores the matrices and to enable connectivity around the array. The average load on these additional 22 AI Engine tiles is 84.63%.

Measurement:

  1. AI Engine design resource utilization is measured using Xilinx Power Estimator (XPE) and Vivado (report utilization under implementation for FFs and CLB LUTs). For the HLS design, resource utilization is measured using Vivado.

  2. AI Engine power consumption is measured using XPE. HLS power consumption is measured using Vivado (report power under implementation).

  3. Throughput is measured by viewing the trace text generated by runtime profiling in vitis_analyzer.

For detailed instructions on taking measurements of the parameters, refer to the individual implementation section.