The throughput of buffer-based AI Engine transforms can be improved, at the expense of additional latency, using "batch processing." This technique employs a buffer sized larger than a single transform, so the switching overhead from ping to pong buffers is incurred only once per batch instead of once per transform. The cost is additional latency, because multiple data sets must be buffered before processing begins.
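The amortization effect can be illustrated with a toy throughput model (the timing constants below are hypothetical placeholders, not measured AI Engine values):

```python
def effective_throughput(n_samples, t_transform, t_switch, repeat):
    """Samples/s when `repeat` transforms share one ping/pong buffer swap.

    t_transform: compute time per transform (s); t_switch: fixed per-buffer
    switching overhead (s). Both are illustrative, not measured values.
    """
    batch_time = repeat * t_transform + t_switch
    return repeat * n_samples / batch_time

# Amortizing the fixed switch cost over a larger batch raises throughput,
# approaching the compute-bound limit n_samples / t_transform.
low = effective_throughput(32, 120e-9, 60e-9, repeat=1)
high = effective_throughput(32, 120e-9, 60e-9, repeat=128)
```

As `repeat` grows, throughput asymptotically approaches the compute-bound rate, which is why the gains in the table below flatten out rather than scale indefinitely.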
The following table illustrates the impact of using REPEAT=128 instead of REPEAT=1 for the AIE API version of the FFT-32 design. The overall throughput increased from 209 Msps to 312 Msps, while the latency increased significantly due to the buffering of 128 transforms.
In practice, the fundamental 128 KB limit of neighboring AI Engine local tile memory caps the improvement offered by batch processing, particularly for larger transforms, and additional techniques are needed to improve throughput further.
| Design | # of AI Engines | REPEAT | Throughput (Msps) | Latency (us) |
|---|---|---|---|---|
| fft32_r2 | 1 | 1 | 209 | 0.446 |
| fft32_r2 | 1 | 128 | 312 | 26.2 |
| fft32_dsplib | 1 | 1 | 222 | 0.443 |
| fft32_dsplib | 1 | 128 | 367 | 22.29 |
| fft_dsplib_split | 3 | 1 | 363 | 0.408 |
| fft_dsplib_ssr | 4 | 128 | 474 | 9.52 |