Filterbank Library Optimization

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2026-03-27
Version
2025.2 English

You can use the following approach to trade off throughput for storage, which reduces the number of AI Engine tiles used:

  • Apply the single_buffer constraint to the input. For more information, refer to the AI Engine Kernel and Graph Programming Guide (UG1076).

  • Add placement constraints so that the storage each kernel requires (stack, parameters, and buffers) is kept locally on its own tile.

The following code snippet, taken from <path-to-design>/aie/tdm_fir/firbank_app.cpp, shows an example of how this can be done.

// Use a single input buffer instead of a ping-pong pair to save storage.
single_buffer(dut.tdmfir.m_firKernels[ii].in[0]);
// Place the kernel, its stack, and its parameters on the same tile.
location<kernel>   (dut.tdmfir.m_firKernels[ii])          = tile(start_index+xoff,0);
location<stack>    (dut.tdmfir.m_firKernels[ii])          = bank(start_index+xoff,0,3);
location<parameter>(dut.tdmfir.m_firKernels[ii].param[0]) = bank(start_index+xoff,0,3);
location<parameter>(dut.tdmfir.m_firKernels[ii].param[1]) = address(start_index+xoff,0,0x4C00);
// The single input buffer needs one bank; the output remains a ping-pong pair and needs two.
location<buffer>   (dut.tdmfir.m_firKernels[ii].in[0])    = bank(start_index+xoff,0,0);
location<buffer>   (dut.tdmfir.m_firKernels[ii].out[0])   = { bank(start_index+xoff,0,1), bank(start_index+xoff,0,3) };

Compile and simulate the design to confirm it works as expected.

[shell]% cd <path-to-design>/aie/tdm_fir
[shell]% make clean all
[shell]% vitis_analyzer aiesimulator_output/default.aierun_summary

In vitis_analyzer, we observe that the resource count dropped to 32 tiles, with a throughput of 4096 / 1.837 us ≈ 2230 Msps.

figure8

figure9

To reduce the number of input and output PLIOs, we can use the newly introduced Vitis Libraries Packet Switching IP.

  using TT_FIR = xf::dsp::aie::fir::tdm::fir_tdm_graph<TT_DATA,TT_COEFF,TP_FIR_LEN,TP_SHIFT,TP_RND,TP_INPUT_WINDOW_VSIZE,
                                       TP_TDM_CHANNELS,TP_NUM_OUTPUTS,TP_DUAL_IP,TP_SSR,TP_SAT,TP_CASC_LEN,TT_OUT_DATA>;

  static constexpr unsigned           NPORT_I = 2;
  static constexpr unsigned           NPORT_O = 4;
  xf::dsp::aie::pkt_switch_graph<TP_SSR, NPORT_I, NPORT_O, TT_FIR> tdmfir;

Compile the design to see the updated resource usage:

[shell]% cd <path-to-design>/aie/tdm_fir
[shell]% make clean compile
[shell]% vitis_analyzer Work/firbank_app.aiecompile_summary

figure

Under the hood, the IP instantiates 2 x pktsplit<16> and 4 x pktmerge<8>.

figure
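The split and merge sizes follow directly from the port counts: 32 kernels shared by 2 input ports gives a fan-out of 16 per split, and 32 kernels shared by 4 output ports gives a fan-in of 8 per merge. The sketch below works through that arithmetic. The contiguous kernel-to-port mapping it uses is an assumption made for illustration (the constant and function names are hypothetical); the mapping the IP actually generates should be confirmed in the compiled graph view.

```cpp
#include <cassert>

// Values taken from the text: 32 TDM FIR kernels, 2 input PLIOs, 4 output PLIOs.
constexpr unsigned NUM_KERNELS = 32;
constexpr unsigned NPORT_I     = 2;
constexpr unsigned NPORT_O     = 4;

// Fan-out of each pktsplit and fan-in of each pktmerge.
constexpr unsigned SPLIT_FANOUT = NUM_KERNELS / NPORT_I;  // 16
constexpr unsigned MERGE_FANIN  = NUM_KERNELS / NPORT_O;  // 8

static_assert(NUM_KERNELS % NPORT_I == 0, "kernels must divide evenly across inputs");
static_assert(NUM_KERNELS % NPORT_O == 0, "kernels must divide evenly across outputs");

// ASSUMPTION: kernels are assigned to ports in contiguous groups.
inline unsigned input_port_of(unsigned kernel) {
    assert(kernel < NUM_KERNELS);
    return kernel / SPLIT_FANOUT;
}

inline unsigned output_port_of(unsigned kernel) {
    assert(kernel < NUM_KERNELS);
    return kernel / MERGE_FANIN;
}
```

Under this assumed mapping, kernels 0 to 15 share input port 0, and that same group feeds two different output ports (kernels 0 to 7 go to output 0, kernels 8 to 15 go to output 1).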

We have 4k samples that need to be distributed among 32 parallel filters, so each filter receives 128 samples. From here, we have two key decisions to make:

  • Size of the packet, that is, the number of samples per packet. A smaller packet requires less buffering in the PL but has a higher bandwidth overhead because the packet header must be sent more often. The opposite is true for a larger packet. The edge cases are:

    • Packet size = 1 can be simulated by commenting out lines 127-146 of gen_vectors.m and uncommenting lines 167-177. This should double the bandwidth requirement because a packet header has to be sent with every sample. When simulating the design, we also observe that the system incurs additional cycles of latency for packet arbitration. Therefore, this option is not a good one to proceed with.

    • Packet size = 128 results in minimal packet switching overhead but requires double-buffered storage of a full transform in the PL.

  • Order of the packets

    • Linear ordering: Because there are more output ports than input ports, distributing the packets in linear order means that output_0 starts producing samples well before output_1. Similarly, output_2 starts well before output_3. This can be simulated by commenting out lines 127-146 of gen_vectors.m and uncommenting lines 149-164. The large latency delta between output ports (~0.84 us, measured in aiesimulator) results in stalling and degraded throughput when the TDM is connected to the rest of the channelizer. This latency can be absorbed by adding large FIFOs on the TDM outputs in system.cfg, but doing so consumes PL resources. For more information on applying FIFOs to stream connections, refer to Specifying Streaming Connections in the Embedded Design Development Using Vitis User Guide (UG1701).

      figure

    • Interleaved ordering: Alternatively, we can distribute the packets so that each output port receives one packet at a time. In this scheme, input_0 receives the input data for tdm_0, tdm_8, tdm_1, tdm_9, and so on. Doing so comes at no cost and reduces the latency delta to ~0.091 us, measured in aiesimulator.
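The two decisions above can be sketched numerically. This is an illustration only, not a port of gen_vectors.m. It assumes one header word per packet and one stream word per sample (consistent with the statement that a packet size of 1 doubles the bandwidth), and it assumes that the 16 kernels behind one input port feed two merges of 8 kernels each. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t SAMPLES_PER_FILTER = 4096 / 32;  // 128, from the text
constexpr std::size_t KERNELS_PER_INPUT  = 16;         // one pktsplit<16>
constexpr std::size_t KERNELS_PER_OUTPUT = 8;          // one pktmerge<8>

// Stream words sent to one filter per transform.
// ASSUMPTION: one header word per packet, one word per sample.
inline std::size_t stream_words_per_filter(std::size_t packet_size) {
    assert(packet_size > 0 && SAMPLES_PER_FILTER % packet_size == 0);
    const std::size_t num_packets = SAMPLES_PER_FILTER / packet_size;
    return SAMPLES_PER_FILTER + num_packets;
}

// Linear ordering: visit the kernels behind one input port in index order.
inline std::vector<std::size_t> linear_order(std::size_t first_kernel) {
    std::vector<std::size_t> order;
    for (std::size_t k = 0; k < KERNELS_PER_INPUT; ++k)
        order.push_back(first_kernel + k);
    return order;
}

// Interleaved ordering: alternate between the two merge groups (0, 8, 1, 9, ...).
inline std::vector<std::size_t> interleaved_order(std::size_t first_kernel) {
    std::vector<std::size_t> order;
    for (std::size_t k = 0; k < KERNELS_PER_OUTPUT; ++k) {
        order.push_back(first_kernel + k);
        order.push_back(first_kernel + KERNELS_PER_OUTPUT + k);
    }
    return order;
}

// Slot at which a given output port first receives a packet; returns
// order.size() if the port is never served by this input.
inline std::size_t first_slot_for_output(const std::vector<std::size_t>& order,
                                         std::size_t output_port) {
    for (std::size_t slot = 0; slot < order.size(); ++slot)
        if (order[slot] / KERNELS_PER_OUTPUT == output_port) return slot;
    return order.size();
}
```

Under these assumptions, a packet size of 1 costs 256 words per filter against 129 words for a packet size of 128. With linear ordering, output 1 is first served in slot 8, whereas with interleaved ordering it is served in slot 1. This mirrors the smaller latency delta between output ports, although the sketch does not model actual cycle counts.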

Run gen_vectors.m and simulate the design to measure the achieved throughput:

[shell]% cd <path-to-design>/aie/tdm_fir
[shell]% make gen_vectors
[shell]% make profile
[shell]% vitis_analyzer aiesimulator_output/default.aierun_summary

Below, we measure the aiesimulator throughput for the optimized packet-switching-based TDM: 4096 / 1.833 us ≈ 2234 Msps.

figure