The following table describes the total number of floating-point operations (FLOP) for 1 iteration of a single nbody() AI Engine kernel:
| Section of Code | mac | mul | add | sub | invsqr | Total FLOP |
|---|---|---|---|---|---|---|
| Step 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Step 2 | 96 | 0 | 0 | 0 | 0 | 192 |
| Step 3 | 2,470,400 | 1,228,800 | 51,200 | 1,228,800 | 3,276,800 | 10,726,400 |
Note: Each section is clearly commented in the nbody.cc source file.
Note: To calculate the total, each mac is considered two operations (mul and add).
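As a quick check, the Total FLOP column can be reproduced from the per-operation counts. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the weight of one FLOP per `invsqr` is inferred from the table's totals rather than stated in the source.

```python
# Weights per operation type. A mac counts as two operations (mul and add).
# Counting invsqr as a single FLOP is an inference from the table's totals.
FLOP_PER_OP = {"mac": 2, "mul": 1, "add": 1, "sub": 1, "invsqr": 1}

# Operation counts copied from the table above.
OP_COUNTS = {
    "Step 1": {"mac": 0, "mul": 0, "add": 0, "sub": 0, "invsqr": 0},
    "Step 2": {"mac": 96, "mul": 0, "add": 0, "sub": 0, "invsqr": 0},
    "Step 3": {
        "mac": 2_470_400,
        "mul": 1_228_800,
        "add": 51_200,
        "sub": 1_228_800,
        "invsqr": 3_276_800,
    },
}


def step_flop(counts):
    """Weighted sum of the operation counts for one section of code."""
    return sum(FLOP_PER_OP[op] * n for op, n in counts.items())


step_totals = {step: step_flop(counts) for step, counts in OP_COUNTS.items()}
kernel_flop = sum(step_totals.values())

print(step_totals)  # Step 1: 0, Step 2: 192, Step 3: 10,726,400
print(kernel_flop)  # 10,726,592 FLOP per kernel per iteration
```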
Thus, each nbody() kernel executes ~10.7 million FLOP per iteration. Because 400 AI Engine tiles (that is, 400 nbody() kernels) execute simultaneously, the total for the entire AI Engine array is ~4.29 billion FLOP per iteration. We calculated that each iteration of the entire design (including data movement from DDR to the AI Engine) takes an average of 0.0072 seconds. Therefore, the effective throughput of the entire design is ~598.404 GFLOP/s.
The theoretical peak throughput the AI Engine array alone can achieve is ~8 TFLOP/s. You are using less than 1/10th of its potential!
| Effective Throughput | Theoretical Peak Throughput |
|---|---|
| 0.598 TFLOP/s | 8 TFLOP/s |
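The throughput arithmetic can be sketched as follows. The constant names are hypothetical, and the inputs are the rounded figures quoted in this section. With the rounded 0.0072 s iteration time the result is roughly 596 GFLOP/s, slightly below the quoted ~598.404 GFLOP/s; the gap is presumably due to rounding of the measured iteration time, which is an assumption here.

```python
# Inputs taken from this section (all rounded as quoted in the text).
KERNEL_FLOP = 10_726_592    # FLOP per nbody() kernel per iteration
NUM_KERNELS = 400           # one nbody() kernel per AI Engine tile
ITERATION_SECONDS = 0.0072  # average time per iteration, as quoted
PEAK_TFLOPS = 8.0           # theoretical peak of the AI Engine array

# Total work per iteration across the whole array.
array_flop = KERNEL_FLOP * NUM_KERNELS

# Effective throughput in GFLOP/s and the fraction of peak it represents.
effective_gflops = array_flop / ITERATION_SECONDS / 1e9
utilization = (effective_gflops / 1e3) / PEAK_TFLOPS

print(f"{array_flop:,} FLOP per iteration")
print(f"{effective_gflops:.1f} GFLOP/s effective")
print(f"{utilization:.1%} of theoretical peak")
```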
This design of an N-Body Simulator on the AI Engine is a straightforward implementation without any major optimizations. To further increase the throughput of the entire design:

* You can explore increasing the FMAX of the PL kernels from 200 MHz to closer to 500 MHz to reduce the latency of moving data from DDR to the AI Engine.
* The PL kernels currently implement a round-robin method of transmitting data. You can redesign them to cache and schedule data optimally to increase data bandwidth.
* You can refactor the nbody() kernel to reduce its reliance on the scalar processor and use only the vector processor in each AI Engine tile by approximating the inverse square root.
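To illustrate what approximating the inverse square root could look like, here is a minimal numeric sketch of one well-known approach: a bit-level initial guess followed by Newton-Raphson refinement, which needs only multiplies and subtractions. This is not taken from nbody.cc. The function name, the seed constant, and the number of refinement steps are assumptions chosen for illustration, and whether this maps efficiently onto the AI Engine vector processor or meets the simulator's accuracy needs is not established here.

```python
import math
import struct


def approx_inv_sqrt(x, newton_steps=2):
    """Approximate 1/sqrt(x) for a positive float32-range x (illustrative only)."""
    # Reinterpret the float32 bit pattern of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Classic bit-level initial guess (the constant is the widely known
    # "fast inverse square root" seed, used here as an assumption).
    bits = 0x5F3759DF - (bits >> 1)
    y = struct.unpack("<f", struct.pack("<I", bits))[0]
    # Newton-Raphson refinement: y <- y * (1.5 - 0.5 * x * y^2).
    # Each step uses only multiplies and a subtraction.
    for _ in range(newton_steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y


for value in (0.25, 1.0, 2.0, 100.0):
    exact = 1.0 / math.sqrt(value)
    approx = approx_inv_sqrt(value)
    print(f"x={value}: approx={approx:.6f}, exact={exact:.6f}")
```

More refinement steps improve accuracy at the cost of additional multiplies, so the step count is a tuning parameter rather than a fixed choice.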