Optimization Technique: Batch Processing - 2025.1 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2025-08-25
Version: 2025.1 English

The throughput of buffer-based AI Engine transforms can be improved, at the expense of additional latency, using "batch processing." This technique employs a buffer that is larger than the transform size, so the ping-pong buffer switching overhead is incurred only once per batch instead of once per transform. The cost is additional latency, because it takes longer to fill the buffer with multiple data sets.

The following table illustrates the impact of using REPEAT=128 instead of REPEAT=1 for the AIE API version of the FFT-32 design. The overall throughput increases from 209 Msps to 312 Msps, while the latency increases significantly due to the need to buffer 128 transforms.

In practice, the fundamental 128 KB limit of neighboring AI Engine local tile memory caps the improvement offered by batch processing, particularly for larger transforms, and additional techniques are needed to improve throughput further.

Design            # of AI Engines   REPEAT   Throughput (Msps)   Latency (us)
fft32_r2          1                 1        209                 0.446
fft32_r2          1                 128      312                 26.2
fft32_dsplib      1                 1        222                 0.443
fft32_dsplib      1                 128      367                 22.29
fft_dsplib_split  3                 1        363                 0.408
fft_dsplib_ssr    4                 128      474                 9.52