Design Planning

Arbitrary Resampling Filter Design (XAPP1373)

1.0 English

According to the equation 3·L·K·Input_Sample_Rate, the multiplication rate required to meet the specifications in Table 1 is 3 × 16 taps × 2.0 × 350 MSPS = 33.6 GMAC/s, which exceeds the 32 GMAC/s capability of one AI Engine running at 1 GHz. At least two AI Engines are therefore required, one for Figure 1 and one for Figure 2. In addition, the implementation of Figure 2 involves large look-up tables {Fk} and {Gk}, from which all 16 coefficients for one output sample must be read out simultaneously, whereas the vectorized implementation of Figure 1 needs the coefficients of four output samples to be interleaved for parallel computation. A third AI Engine is therefore inserted to interleave the coefficients.
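The compute budget above can be checked with a few lines of Python. This is only an arithmetic sketch of the equation in the text. The function name `required_macs_per_second` and the constant names are hypothetical, and the result is a lower bound on the engine count from MAC throughput alone, not a statement about the final partitioning.

```python
import math

# Figures taken from the text: 16 taps, K = 2.0 outputs per input,
# 350 MSPS input rate, and 32 GMAC/s per AI Engine at 1 GHz.
AIE_MACS_PER_SECOND = 32e9


def required_macs_per_second(taps, outputs_per_input, input_rate_sps, factor=3):
    """Evaluate 3 * L * K * Input_Sample_Rate (names are illustrative)."""
    return factor * taps * outputs_per_input * input_rate_sps


load = required_macs_per_second(taps=16, outputs_per_input=2.0, input_rate_sps=350e6)

# Lower bound from MAC throughput only; memory and data movement are ignored.
min_engines = math.ceil(load / AIE_MACS_PER_SECOND)

print(f"required: {load / 1e9:.1f} GMAC/s, minimum AI Engines: {min_engines}")
```

The bound of two engines matches the FILT and INTP split. The third engine is needed for coefficient interleaving, which this MAC count does not capture.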

Figure 4 shows the mapping of ARF to the Versal AI Core device. The AI Engines implementing Figure 1 and Figure 2 are labeled FILT and INTP, respectively. The third AI Engine, INLV, performs the coefficient interleaving. The number of output samples computed from every input is fixed at K = 2 in the AI Engine. The OutIF module in the PL removes the invalid data according to a flag generated by the CTRL block, which also computes the phase information {s, α} for coefficient interpolation.

The input and output interfaces of the AI Engine strictly follow the AXI protocol, in which the Ready and Valid signals might go Low at any time, creating idle cycles. The CTRL module in the PL offers a simple FIFO-like interface for inputs, and the output FIFO removes all the idle cycles to form a continuous data stream in the output clock domain. The AI Engine output sample rate must be K = 2 times the input sample rate to meet the throughput requirement, so the output AXI bus is 64 bits wide while the input bus is 32 bits wide. For a 350 MSPS input sample rate, a clock of 375 MHz is selected to ensure enough throughput despite the idle cycles. Also, the input phase information might change instantaneously with the data, so {s, α} must be computed on the fly for every input in the PL.
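As a rough check of the clock choice, the fraction of cycles that can be idle without losing throughput follows from the bus capacity. The sketch below assumes one sample occupies 32 bits and that the same 375 MHz clock drives both AXI interfaces. Neither assumption is stated explicitly in the text, and the helper name `idle_cycle_headroom` is hypothetical.

```python
def idle_cycle_headroom(clock_hz, bus_bits, sample_rate_sps, bits_per_sample=32):
    """Fraction of bus cycles that may be idle while still sustaining the rate.

    Assumes one sample per `bits_per_sample` bits of bus width (an assumption).
    """
    samples_per_cycle = bus_bits // bits_per_sample
    capacity_sps = clock_hz * samples_per_cycle
    return 1.0 - sample_rate_sps / capacity_sps


PL_CLOCK_HZ = 375e6      # clock selected in the text
INPUT_RATE_SPS = 350e6   # input sample rate from the text
K = 2                    # output samples computed per input

input_headroom = idle_cycle_headroom(PL_CLOCK_HZ, 32, INPUT_RATE_SPS)
output_headroom = idle_cycle_headroom(PL_CLOCK_HZ, 64, K * INPUT_RATE_SPS)

print(f"input headroom:  {input_headroom:.1%}")
print(f"output headroom: {output_headroom:.1%}")
```

Under these assumptions both interfaces tolerate roughly one idle cycle in fifteen, because doubling the bus width exactly offsets the doubled output rate.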

Figure 1. Partitioning of Arbitrary Fractional SRC Filter onto Versal Device

The input delays of every AI Engine kernel should be carefully balanced to avoid memory stalls. For example, the following figure shows that the coefficient and data inputs to the FILT kernel differ widely in latency, which leads to memory stalls and throughput degradation. To solve this problem, a direct memory access (DMA) FIFO is constructed inside the AI Engine array to absorb the delay difference.

Figure 2. Balance FILT Input Delays with DMA FIFO

It takes the FILT kernel one clock cycle to compute one output sample, so the peak sample rate can be up to 1 GSPS. Because the target throughput is only 700 MSPS, a margin of 30% can be traded for latency. Figure 2 shows that the total AI Engine processing delay is slightly more than twice the time needed to process one window of data. A window of 128 samples translates into 128 × (1/375 MHz) ≈ 340 ns of latency for the INTP kernel, plus another 340 ns for INLV and 170 ns for FILT. The total latency is estimated to be 850 ns, leaving a 150 ns margin in the 1 μs budget to fill the output FIFO before reading starts. The FIFO should be deep enough to accept all prefilled data with some margin, to prevent FIFO underflow and overflow when the output is active.
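The latency budget above can be reproduced as follows. This is a minimal sketch of the arithmetic only. Treating the FILT latency as half a window is an inference from the 170 ns figure rather than something the text states, and all variable names are illustrative. The exact values differ slightly from the rounded figures quoted in the text.

```python
WINDOW_SAMPLES = 128       # window size from the text
CLOCK_HZ = 375e6           # clock used for the window time in the text
BUDGET_NS = 1000.0         # 1 us latency budget from the text

# Time to move one window of samples at one sample per clock cycle.
window_ns = WINDOW_SAMPLES / CLOCK_HZ * 1e9

# Per-kernel latency; FILT as half a window is an assumption (see lead-in).
latency_ns = {
    "INTP": window_ns,
    "INLV": window_ns,
    "FILT": window_ns / 2,
}

total_ns = sum(latency_ns.values())
margin_ns = BUDGET_NS - total_ns

# Throughput margin of FILT: 1 GSPS peak versus the 700 MSPS target.
throughput_margin = 1.0 - 700e6 / 1e9

for kernel, ns in latency_ns.items():
    print(f"{kernel}: {ns:.0f} ns")
print(f"total: {total_ns:.0f} ns, margin: {margin_ns:.0f} ns")
print(f"FILT throughput margin: {throughput_margin:.0%}")
```

The unrounded total is slightly above 850 ns, so the remaining margin for prefilling the output FIFO is slightly below 150 ns.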