Throughput - 2022.2 English

Vitis Tutorials: AI Engine (XD100)

Document ID
XD100
Release Date
2022-12-01
Version
2022.2 English

The burst_cnt variable determines the total number of samples processed during each function call. The inner loop processes 8 samples per iteration, so each call processes burst_cnt * 8 samples.

The throughput is obtained as follows (see api_thruput.xlsx):

  • Build and run the design.

  • Open aiesimulator_output/default.aierun_summary.

  • Read the Total Function + Descendants Time (cycles) for the main function; this value is num_cycles.

  • Compute Throughput = clk_freq * (burst_cnt * 8) / num_cycles.
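The formula in the last step can be checked with a few lines of Python. This is only a sketch: the function name and its parameter names are illustrative and are not part of the tutorial sources. The sample values (burst_cnt = 1, num_cycles = 289) are taken from the results table in this section.

```python
def throughput_msa_per_sec(burst_cnt, num_cycles, clk_freq_hz=1e9, samples_per_iter=8):
    """Throughput in mega-samples per second for one kernel call.

    burst_cnt        -- number of inner-loop iterations per call
    num_cycles       -- Total Function + Descendants Time (cycles) of main
    clk_freq_hz      -- AI Engine clock frequency (1 GHz in this tutorial)
    samples_per_iter -- samples processed per inner-loop iteration (8 here)
    """
    num_samples = burst_cnt * samples_per_iter
    samples_per_sec = clk_freq_hz * num_samples / num_cycles
    return samples_per_sec / 1e6


if __name__ == "__main__":
    # burst_cnt = 1 and 289 cycles give roughly 27.68 Msa/sec.
    print(f"{throughput_msa_per_sec(1, 289):.2f} Msa/sec")
```

Keeping clk_freq_hz and samples_per_iter as parameters makes it easy to reuse the same helper if the clock or the loop body changes.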

The throughput with a 1GHz clock for different values of burst_cnt is shown below.

IIR Throughput (with API)

| burst_cnt                | 1     | 8     | 16    | 32    | 64    | 128   | 256   |
|--------------------------|-------|-------|-------|-------|-------|-------|-------|
| num_samples              | 8     | 64    | 128   | 256   | 512   | 1024  | 2048  |
| num_cycles (API)         | 289   | 799   | 1508  | 2925  | 5761  | 11431 | 22772 |
| API Throughput (Msa/sec) | 27.68 | 80.10 | 84.88 | 87.52 | 88.87 | 89.58 | 89.94 |

*clk_freq: 1GHz
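The table suggests that throughput levels off as burst_cnt grows. One way to see why is to assume, purely as an illustration and not as something the tutorial states, that each call costs a fixed overhead plus a constant number of cycles per sample. The sketch below fits that line through the two largest measurements; the function and variable names are hypothetical. The smallest bursts deviate from this line, so treat the result as a rough estimate.

```python
# (num_samples, num_cycles) pairs copied from the API throughput table.
SAMPLES = [8, 64, 128, 256, 512, 1024, 2048]
CYCLES = [289, 799, 1508, 2925, 5761, 11431, 22772]
CLK_FREQ_HZ = 1e9  # 1 GHz clock, as in the table footnote


def two_point_fit(s0, c0, s1, c1):
    """Fit cycles = overhead + slope * samples through two measurements."""
    slope = (c1 - c0) / (s1 - s0)   # cycles per additional sample
    overhead = c1 - slope * s1      # cycles left over at zero samples
    return slope, overhead


# Use the two largest bursts, where per-call effects matter least.
cycles_per_sample, overhead_cycles = two_point_fit(
    SAMPLES[-2], CYCLES[-2], SAMPLES[-1], CYCLES[-1]
)

# Throughput limit under the assumed model, as num_samples grows without bound.
asymptotic_msa = CLK_FREQ_HZ / cycles_per_sample / 1e6

if __name__ == "__main__":
    print(f"cycles per sample : {cycles_per_sample:.3f}")
    print(f"per-call overhead : {overhead_cycles:.1f} cycles")
    print(f"asymptotic limit  : {asymptotic_msa:.2f} Msa/sec")
```

Under this assumed model the per-sample cost, not the call overhead, sets the ceiling, which is consistent with the measured throughput flattening out near 90 Msa/sec.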

The AI Engine API is a header-only implementation that acts as a “buffer” between the user and the low-level intrinsics (LLI) to raise the level of abstraction. Is it possible that the API adds some overhead?

To find out, we modify the kernel code to use low-level intrinsics (LLI) instead of the API.