AI Engine Utilization Estimation

Digital Down-conversion Chain Implementation on AI Engine (XAPP1351)

The AI Engine processes data block by block and uses a data structure called a window to describe one block of input or output data. The window size, which is the number of samples in each block of input or output data, represents a tradeoff between efficiency and processing latency. Longer windows yield higher efficiency, but latency increases in proportion to the window size. In some designs, a shorter latency is preferable even at the cost of 5-10% of AI Engine processing efficiency.

For example, in this application note, the input window size is set to 512 samples, which at the 245.76 MSPS input rate spans about 2.08 μs and keeps the latency within 2.1 μs. The window sizes and sample rates of the DDC filters are listed in the following table.

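Table 1. Window Size and Sample Rate of DDC Filters

| Filter | Input Sample Rate (MSPS) | Output Sample Rate (MSPS) | Input Window (samples) | Output Window (samples) |
|--------|--------------------------|---------------------------|------------------------|-------------------------|
| HB47   | 245.76                   | 122.88                    | 512                    | 256                     |
| HB11   | 122.88                   | 61.44                     | 256                    | 128                     |
| FIR199 | 122.88                   | 122.88                    | 256                    | 256                     |
| HB23   | 61.44                    | 30.72                     | 128                    | 64                      |
| FIR89  | 30.72                    | 30.72                     | 64                     | 64                      |
| Mixer  | 122.88                   | 122.88                    | 256                    | 1280 (5 carriers)       |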
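As a quick check of Table 1, the following Python sketch computes the time span of one input window for each stage. The dictionary `STAGES` and the helper `window_duration_us` are hypothetical names introduced here for illustration; the numbers are taken from the table. Under these assumptions, every stage's window covers the same time span, which is what keeps the chain rate-matched.

```python
# Input window sizes (samples) and input sample rates (MSPS) from Table 1.
# STAGES and window_duration_us are illustrative names, not part of the design.
STAGES = {
    "HB47": (512, 245.76),
    "HB11": (256, 122.88),
    "FIR199": (256, 122.88),
    "HB23": (128, 61.44),
    "FIR89": (64, 30.72),
    "Mixer": (256, 122.88),
}


def window_duration_us(window_samples, sample_rate_msps):
    """Time span of one window in microseconds (samples / MSPS = us)."""
    return window_samples / sample_rate_msps


for name, (window, rate) in STAGES.items():
    print(f"{name}: {window_duration_us(window, rate):.3f} us")
```

Each line prints a duration of about 2.083 μs, consistent with the 2.1 μs latency limit stated above.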

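A cycle budget is the number of instruction cycles a function can take to compute one block of output data. With the window size and the sample rate taken at the same port of the function, it is given by:

$$\text{Cycle budget} = \text{window size} \times \frac{\text{AI Engine clock frequency}}{\text{sample rate}}$$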

At a 1 GHz AI Engine clock in the lowest speed-grade device, the processing of 512 samples at 245.76 MSPS has a cycle budget of 2083 cycles.
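The cycle budget arithmetic can be reproduced with a few lines of Python. This is a minimal sketch; the function name `cycle_budget` and its parameters are assumptions made for illustration, and the 1 GHz clock is the value quoted above.

```python
def cycle_budget(window_samples, sample_rate_msps, clock_mhz=1000.0):
    """Cycles available to process one window.

    MHz divided by MSPS gives clock cycles per sample, so the budget is
    window_samples * clock_mhz / sample_rate_msps.
    """
    return window_samples * clock_mhz / sample_rate_msps


budget = cycle_budget(512, 245.76)
print(int(budget))  # 2083 (truncated from about 2083.33)
```

Because every stage in the chain uses a window that spans the same time interval, calling the helper with any input window and input rate pair from Table 1 gives the same budget.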

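Suppose every output sample needs P 16-bit-real by 16-bit-real multiplications. The AI Engine can compute 32 such real-by-real multiplications every cycle. For an ideal implementation, the utilization lower bound is given by:

$$\text{Utilization lower bound} = \frac{P \times \text{output window size}}{32 \times \text{cycle budget}}$$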

Take FIR199 as an example. FIR199 has 199 real symmetric filter taps, so it takes 100 16-bit-complex by 16-bit-real multiplications to compute each output. Each output of FIR199 at 122.88 MSPS therefore needs 200 16-bit-real by 16-bit-real multiplications, and the utilization lower bound is 200 multiplications × 256 samples / (32 multiplications per cycle × 2083 cycles) = 76.8%. The utilization lower bounds of the other DDC filters are calculated in the same way and listed in the following table.
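The FIR199 calculation can be sketched in Python as follows. The names `MULTS_PER_CYCLE`, `CYCLE_BUDGET`, and `utilization_lower_bound` are hypothetical and introduced only for this illustration; the constants are the values quoted in the text.

```python
MULTS_PER_CYCLE = 32   # 16-bit real-by-real multiplications per cycle
CYCLE_BUDGET = 2083    # cycles per window at a 1 GHz AI Engine clock


def utilization_lower_bound(mults_per_output, output_window,
                            cycle_budget=CYCLE_BUDGET):
    """Fraction of the cycle budget an ideal implementation would need."""
    return mults_per_output * output_window / (MULTS_PER_CYCLE * cycle_budget)


# FIR199: 200 real multiplications per output, 256 outputs per window.
print(f"{utilization_lower_bound(200, 256):.1%}")  # 76.8%
```

The bound counts multiplications only. It does not account for data movement, loop overhead, or other cycles a real kernel would spend, which is why it is a lower bound.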

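Table 2. AI Engine Utilization Lower Bound Analysis

| Filter         | Input Window Size | Output Window Size | Number of Taps | Number of MACs/Output | Utilization/Instance | Number of Instances | Utilization Lower Bound |
|----------------|-------------------|--------------------|----------------|-----------------------|----------------------|---------------------|-------------------------|
| FIR199         | 256               | 256                | 199            | 200                   | 76.8%                | 1                   | 76.8%                   |
| FIR89          | 64                | 64                 | 89             | 96                    | 9.3%                 | 5                   | 46.5%                   |
| HB47           | 512               | 256                | 47             | 32                    | 12.3%                | 1                   | 12.3%                   |
| HB11           | 256               | 128                | 11             | 8                     | 1.6%                 | 5                   | 8%                      |
| HB23           | 128               | 64                 | 23             | 16                    | 1.6%                 | 5                   | 8%                      |
| Mixer (Note 1) | 256               | 1280               | -              | 8                     | 23%                  | 1                   | 23%                     |
| Total          |                   |                    |                |                       |                      |                     | 174.6%                  |

Note 1: To support a configurable carrier frequency at run time, an on-line DDS calculation consumes an extra 180 cycles in each mixer kernel execution.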

Although in theory this DDC can be implemented on two AI Engines at 87.3% utilization each, such high utilization requires very long windows, which lead to undesirable latency. One way to reduce the utilization is to exploit the fact that 5G NR and 4G LTE carriers do not coexist in this case, so the filters for unused carriers can be disabled at run time, depending on the carrier configuration. A detailed analysis is provided in the following sections.
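The 174.6% total and the two-engine split quoted above follow directly from the per-instance figures in Table 2. The Python sketch below reproduces that arithmetic; the list `TABLE2` and the variable names are assumptions made for illustration, and the percentages are copied from the table as published rather than recomputed.

```python
import math

# (filter, utilization per instance in %, number of instances) from Table 2.
# TABLE2 is an illustrative name; values are the published table entries.
TABLE2 = [
    ("FIR199", 76.8, 1),
    ("FIR89", 9.3, 5),
    ("HB47", 12.3, 1),
    ("HB11", 1.6, 5),
    ("HB23", 1.6, 5),
    ("Mixer", 23.0, 1),
]

# Sum of per-instance utilization times instance count.
total = sum(util * count for _, util, count in TABLE2)

# Smallest number of AI Engines whose combined capacity covers the total,
# assuming the load could be split evenly between them.
min_engines = math.ceil(total / 100.0)
per_engine = total / min_engines

print(f"total = {total:.1f}%")
print(f"minimum engines = {min_engines}")
print(f"per engine = {per_engine:.1f}%")
```

The even split is an idealization: kernels cannot be divided arbitrarily between engines, so a practical mapping may balance the load less evenly.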