You can use the following approach to trade throughput for storage and reduce the number of AI Engine tiles used:
Apply the single_buffer constraint on the input. For more information, refer to the AI Engine Kernel and Graph Programming Guide (UG1076).
Add placement constraints to store each tile’s storage requirements locally.
The code snippet below, taken from <path-to-design>/aie/tdm_fir/firbank_app.cpp, shows an example of how this can be done.
single_buffer(dut.tdmfir.m_firKernels[ii+0].in[0]);
location<kernel> (dut.tdmfir.m_firKernels[ii]) = tile(start_index+xoff,0);
location<stack> (dut.tdmfir.m_firKernels[ii]) = bank(start_index+xoff,0,3);
location<parameter>(dut.tdmfir.m_firKernels[ii].param[0]) = bank(start_index+xoff,0,3);
location<parameter>(dut.tdmfir.m_firKernels[ii].param[1]) = address(start_index+xoff,0,0x4C00);
location<buffer> (dut.tdmfir.m_firKernels[ii].in[0]) = bank(start_index+xoff,0,0);
location<buffer> (dut.tdmfir.m_firKernels[ii].out[0]) = { bank(start_index+xoff,0,1), bank(start_index+xoff,0,3) };
Compile and simulate the design to confirm it works as expected.
[shell]% cd <path-to-design>/aie/tdm_fir
[shell]% make clean all
[shell]% vitis_analyzer aiesimulator_output/default.aierun_summary
In vitis_analyzer, we observe that the resource count dropped to 32 tiles, with a throughput of 4096 samples / 1.837 us = 2230 MSPS.
To reduce the number of input and output PLIOs, we can use the newly introduced Vitis Libraries Packet Switching IP.
using TT_FIR = xf::dsp::aie::fir::tdm::fir_tdm_graph<TT_DATA,TT_COEFF,TP_FIR_LEN,TP_SHIFT,TP_RND,TP_INPUT_WINDOW_VSIZE,
TP_TDM_CHANNELS,TP_NUM_OUTPUTS,TP_DUAL_IP,TP_SSR,TP_SAT,TP_CASC_LEN,TT_OUT_DATA>;
static constexpr unsigned NPORT_I = 2;
static constexpr unsigned NPORT_O = 4;
xf::dsp::aie::pkt_switch_graph<TP_SSR, NPORT_I, NPORT_O, TT_FIR> tdmfir;
Compile the design to understand the updated resource usage:
[shell]% cd <path-to-design>/aie/tdm_fir
[shell]% make clean compile
[shell]% vitis_analyzer Work/firbank_app.aiecompile_summary
Under the hood, the IP instantiates 2 x pktsplit<16> and 4 x pktmerge<8>.
We have 4096 samples that need to be distributed among 32 parallel filters, so each filter receives 128 samples. From here, we have two key decisions to make:
Size of the packet, i.e., the number of samples per packet. A smaller packet requires less buffering in the PL but has a higher bandwidth overhead because the packet header must be sent more often; the opposite is true for a larger packet. The edge cases are:
Packet size = 1 can be simulated by commenting out lines 127-146 of gen_vectors.m and uncommenting lines 167-177. This should double the bandwidth requirement, since a packet header has to be sent with every sample. When simulating the design, we also observe that the system incurs additional cycles of latency for packet arbitration. Therefore, this option is not ideal.
Packet size = 128 results in minimal packet switching overhead but requires double-buffered storage for a full transform in the PL.
Order of the packets.
Linear ordering: Since the number of output ports is greater than the number of input ports, distributing the packets in linear order means output_0 starts producing samples well before output_1. Similarly, output_2 starts before output_3. This can be simulated by commenting out lines 127-146 of gen_vectors.m and uncommenting lines 149-164. The large latency delta between output ports (~0.84 us, measured in aiesimulator) results in stalling and degraded throughput when connecting the TDM to the rest of the channelizer. This latency can be absorbed by adding large FIFOs on the TDM output in system.cfg, but this consumes PL resources. More on applying FIFOs to stream connections can be found in Specifying Streaming Connections in the Embedded Design Development Using Vitis User Guide (UG1701).
Interleaved ordering: Alternatively, we can distribute the packets such that each output port receives one packet at a time. Therefore, input_0 receives input data for tdm_0, tdm_8, tdm_1, tdm_9, etc. Doing so comes at no cost and reduces the latency delta to ~0.091 us, measured in aiesimulator.
Run gen_vectors.m and simulate the design to understand the achieved throughput:
[shell]% cd <path-to-design>/aie/tdm_fir
[shell]% make gen_vectors
[shell]% make profile
[shell]% vitis_analyzer aiesimulator_output/default.aierun_summary
Below, we measure the aiesimulator throughput for the optimized packet switching-based TDM: 4096 samples / 1.833 us = 2234 MSPS.