Design Throughput Calculations (Effective vs. Theoretical)

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2026-03-27
Version: 2025.2 English

The following table lists the number of floating-point operations (FLOP) in one iteration of a single nbody() AI Engine kernel:

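| Section of Code | mac       | mul       | add    | sub       | invsqr    | Total FLOP |
|-----------------|-----------|-----------|--------|-----------|-----------|------------|
| Step 1          | 0         | 0         | 0      | 0         | 0         | 0          |
| Step 2          | 96        | 0         | 0      | 0         | 0         | 192        |
| Step 3          | 2,470,400 | 1,228,800 | 51,200 | 1,228,800 | 3,276,800 | 10,726,400 |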

Note: Each section is clearly commented in the nbody.cc source file.

Note: To calculate the total, each mac is considered two operations (mul and add).

Thus, each nbody() kernel executes ~10.7 million FLOP per iteration. Because the design uses 400 AI Engine tiles (that is, 400 nbody() kernels) that execute simultaneously, the total for the entire AI Engine array is ~4.29 billion FLOP per iteration. Each iteration of the entire design (including data movement from DDR to the AI Engine) takes an average of 0.0072 seconds. Therefore, the effective throughput of the entire design is ~598.404 GFLOP/s.
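
The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the variable and function names are made up for this example, and the counts are copied from the FLOP table for a single nbody() kernel. With the rounded iteration time of 0.0072 s the result is roughly 596 GFLOP/s, slightly below the quoted figure. The gap presumably comes from rounding of the average iteration time, which is an assumption rather than something the tutorial states.

```python
# FLOP counts per code section for one iteration of one nbody() kernel,
# in the order (mac, mul, add, sub, invsqr). Values copied from the table.
FLOP_TABLE = {
    "Step 1": (0, 0, 0, 0, 0),
    "Step 2": (96, 0, 0, 0, 0),
    "Step 3": (2_470_400, 1_228_800, 51_200, 1_228_800, 3_276_800),
}

NUM_KERNELS = 400               # one nbody() kernel per AI Engine tile
SECONDS_PER_ITERATION = 0.0072  # rounded average quoted in the text


def section_flop(mac, mul, add, sub, invsqr):
    """Total FLOP for one section; each mac counts as two operations."""
    return 2 * mac + mul + add + sub + invsqr


flop_per_kernel = sum(section_flop(*row) for row in FLOP_TABLE.values())
flop_per_iteration = NUM_KERNELS * flop_per_kernel
effective_gflops = flop_per_iteration / SECONDS_PER_ITERATION / 1e9

print(f"FLOP per kernel per iteration: {flop_per_kernel:,}")
print(f"FLOP per iteration (array):    {flop_per_iteration:,}")
print(f"Effective throughput:          {effective_gflops:.1f} GFLOP/s")
```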

The theoretical peak throughput the AI Engine array alone can achieve is ~8 TFLOP/s. You are using less than 1/10th of its potential!

| Effective Throughput | Theoretical Peak Throughput |
|----------------------|-----------------------------|
| 0.598 TFLOP/s        | 8 TFLOP/s                   |
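
To put the two numbers side by side, the following minimal sketch computes the fraction of the theoretical peak that the design reaches and the remaining headroom. The variable names are illustrative, and the inputs are the rounded values from the table.

```python
effective_tflops = 0.598  # effective throughput of the whole design
peak_tflops = 8.0         # theoretical peak of the AI Engine array alone

# Fraction of the theoretical peak actually achieved.
utilization = effective_tflops / peak_tflops
# How many times faster the array could run in theory.
headroom = peak_tflops / effective_tflops

print(f"Utilization: {utilization:.3f} of peak")
print(f"Headroom:    {headroom:.1f}x")
```

The utilization is about 0.075, which is consistent with the "less than 1/10th" statement.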

This N-Body Simulator design on the AI Engine is a straightforward implementation without any major optimizations. To further increase the throughput of the entire design:

  • You can explore increasing the FMAX of the PL kernels from 200 MHz to closer to 500 MHz to reduce the latency of moving data from DDR to the AI Engine.

  • The PL kernels currently transmit data using a round-robin method. You can redesign them to cache and schedule data more effectively to increase data bandwidth.

  • You can refactor the nbody() kernel to reduce its reliance on the scalar processor and use only the vector processor in each AI Engine tile by approximating the inverse square root.
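
As an illustration of the last point, one well-known way to approximate an inverse square root using only multiplies and subtracts is a bit-level initial guess followed by Newton-Raphson refinement. The sketch below is a generic scalar Python example of that technique and is not taken from the tutorial's nbody.cc. The function name and iteration count are hypothetical, and a real AI Engine kernel would express the same steps with vector operations, which are not shown here.

```python
import math
import struct


def approx_inv_sqrt(x: float, iterations: int = 2) -> float:
    """Approximate 1/sqrt(x) for x > 0 without a square root or a divide."""
    # Reinterpret the float32 bit pattern as an unsigned integer and derive
    # a rough first guess from it (the classic "magic constant" trick).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = 0x5F3759DF - (bits >> 1)
    y = struct.unpack("<f", struct.pack("<I", bits))[0]

    # Newton-Raphson refinement: each step uses only multiplies and a
    # subtract, which map naturally onto multiply-accumulate hardware.
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y


for value in (0.25, 1.0, 4.0, 123.456):
    exact = 1.0 / math.sqrt(value)
    approx = approx_inv_sqrt(value)
    rel_err = abs(approx - exact) / exact
    print(f"x={value}: approx={approx:.6f}, relative error={rel_err:.2e}")
```

More refinement steps improve accuracy at the cost of extra multiplies, so the iteration count is a trade-off between precision and throughput.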