AI Engine and HLS Implementation Comparison

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

The following table compares a 10-instance, 1024 x 2048 point FFT-2D design implemented with AI Engines against the same design implemented with HLS targeting DSP Engines. It lists throughput, latency, resource utilization, power consumption, and performance per Watt (throughput/Watt) for the cint16 implementations.

Design Target                             AIE             HLS
Aggregate Throughput (in MSPS)            6229.350        6277.483
Average Latency (in μs)                   3537.296        4211.296
AIE Vector Cores                          20              NA
AIE Vector Load                           78.47%          NA
Active Mem Banks / Mem R/W Rate           420 / 44%       NA
Active AIE Tiles                          60              NA
FF (Regs) / CLB LUTs                      11360 / 3647    88447 / 56429
BRAMs                                     0               250
DSPs                                      0               180
Dynamic Power (in W)                      5.542           6.819
Performance per Watt (in MSPS/Watt)       1134.740773     920.587051
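The performance-per-Watt column is throughput divided by dynamic power. The following is a minimal Python sketch of that arithmetic, using the throughput and power values from the table. The function and variable names are illustrative and do not come from the tutorial. Because the table inputs are rounded, the recomputed figures may not match the published column exactly.

```python
def perf_per_watt(throughput_msps: float, dynamic_power_w: float) -> float:
    """Return throughput per Watt in MSPS/W."""
    if dynamic_power_w <= 0:
        raise ValueError("dynamic power must be positive")
    return throughput_msps / dynamic_power_w


# (aggregate throughput in MSPS, dynamic power in W), copied from the table.
designs = {
    "AIE": (6229.350, 5.542),
    "HLS": (6277.483, 6.819),
}

results = {name: perf_per_watt(t, p) for name, (t, p) in designs.items()}

for name, value in results.items():
    print(f"{name}: {value:.3f} MSPS/W")
```

The same helper can be reused for other design variants by adding entries to the `designs` dictionary.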

These observations give a clear indication of where the AI Engines in Versal can offer improvements:

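  • Reduced latency: the average latency of the HLS design (4211.296 μs) is ~19.05% higher than that of the AI Engine design (3537.296 μs), which corresponds to a ~16.0% reduction relative to HLS.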

  • Moving to the AI Engine implementation reduces PL and DSP resource usage considerably: 180 DSPs, ~88K FFs, ~56K LUTs, and 250 BRAMs are replaced by 72 AI Engines, ~11K FFs, and ~3.6K LUTs.
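The percentages behind these observations can be recomputed from the table. The following is a minimal Python sketch, with illustrative names that are not part of the tutorial. It shows that the latency figure depends on the chosen baseline: measured against the HLS design, the AI Engine latency is about 16% lower, while measured against the AI Engine design, the HLS latency is about 19% higher.

```python
def percent_change(reference: float, value: float) -> float:
    """Percentage change of value relative to reference (negative = reduction)."""
    return (value - reference) / reference * 100.0


# Latency and PL/DSP resource figures copied from the comparison table.
hls = {"latency_us": 4211.296, "ffs": 88447, "luts": 56429, "brams": 250, "dsps": 180}
aie = {"latency_us": 3537.296, "ffs": 11360, "luts": 3647, "brams": 0, "dsps": 0}

# Change when moving from HLS to AIE, with HLS as the baseline.
changes = {key: percent_change(hls[key], aie[key]) for key in hls}

# Same latency gap, but with AIE as the baseline.
hls_vs_aie_latency = percent_change(aie["latency_us"], hls["latency_us"])

for key, pct in changes.items():
    print(f"{key}: {pct:+.2f}% relative to HLS")
print(f"HLS latency relative to AIE: {hls_vs_aie_latency:+.2f}%")
```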

It is important to understand that not all 72 AI Engines are required for the 2D-FFT compute: 20 AI Engines (vector cores) perform the computation, and the other 52 AI Engines are required for their memory, which stores the FFT twiddle factors, and to enable connectivity around the array. The average load on these additional 52 AI Engine tiles is only 79%. If your application needs it, these AI Engines can therefore be shared with other functions that run sequentially, or you can apply user constraints to map and route this function onto fewer AI Engine tiles (see this page for details on the AI Engine mapper/router).

Additionally, increasing the number of instances is easier in the AI Engine design than in the HLS design, which runs into timing closure issues, especially for larger FFT point sizes.

Measurement:

  1. AI Engine design resource utilization is measured using Xilinx Power Estimator (XPE) and AMD Vivado™ (report utilization under implementation for FFs and CLB LUTs). For the HLS design, resource utilization is measured using Vivado.

  2. AI Engine power consumption is measured using XPE. HLS power consumption is measured using Vivado (report power under implementation).

  3. Throughput is measured by viewing the trace text files generated by runtime profiling in vitis_analyzer.

For detailed instructions on measuring these parameters, refer to the individual implementation sections.