The `burst_cnt` variable determines the total number of samples processed during each function call. The inner loop processes 8 samples per iteration, so the total number of processed samples is `burst_cnt * 8`.
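As a plain scalar sketch (not the actual AI Engine kernel, which uses vector types), the loop structure described above can be illustrated as follows; the function name `process` and the pass-through body are placeholders, and only `burst_cnt` and the per-iteration width of 8 come from the text:

```cpp
#include <cassert>

// Scalar sketch of the loop nest: burst_cnt outer iterations,
// 8 samples handled per inner iteration.
void process(const float* in, float* out, unsigned burst_cnt) {
    unsigned total = 0;
    for (unsigned i = 0; i < burst_cnt; ++i) {   // one burst per outer iteration
        for (unsigned j = 0; j < 8; ++j) {       // 8 samples per iteration
            out[total] = in[total];              // placeholder for the IIR update
            ++total;
        }
    }
    // total == burst_cnt * 8 samples processed per call
}
```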
The throughput is obtained as follows (see `api_thruput.xlsx`):
1. Build and run the design.
2. Open `aiesimulator_output/default.aierun_summary`.
3. Get the *Total Function + Descendants Time (cycles)* for the `main` function (`num_cycles`).
4. Throughput = `clk_freq * (burst_cnt * 8) / num_cycles`
The throughput with a 1 GHz clock for different values of `burst_cnt` is shown below.
**IIR Throughput (with API)**

|                          |       |       |       |       |       |       |       |
|--------------------------|-------|-------|-------|-------|-------|-------|-------|
| burst_cnt                | 1     | 8     | 16    | 32    | 64    | 128   | 256   |
| num_samples              | 8     | 64    | 128   | 256   | 512   | 1024  | 2048  |
| num_cycles (API)         | 289   | 799   | 1508  | 2925  | 5761  | 11431 | 22772 |
| API Throughput (Msa/sec) | 27.68 | 80.10 | 84.88 | 87.52 | 88.87 | 89.58 | 89.94 |
*clk_freq: 1GHz
The AI Engine APIs are a header-only implementation that acts as a "buffer" between the user and the low-level intrinsics (LLI), raising the level of abstraction. Could this abstraction layer add overhead? To find out, we modify the kernel code to use the low-level intrinsics directly.