1 Gsps Implementation with Cascade Stream

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2025-11-26
Version: 2025.2 English

The AI Engine vector unit supports 8 MACs per cycle for cint16 by cint16 multiply-accumulate operations. With a four-lane implementation based on the mul4/mac4 intrinsics, each mac4() call performs two complex MACs on each of the four lanes, for 8 complex MACs in total.

Each output requires 32 complex MACs, so computing four outputs takes 4 * 32 = 128 complex MACs, or 16 mac4() calls at 8 MACs per call. Computing four outputs therefore takes 16 cycles on one AI Engine. Assuming the AI Engine runs at 1 GHz, its sample rate is as follows.

(4 samples / 16 cycles) * 1 GHz = 0.25 Gsps = 250 Msps
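The compute-bound arithmetic can be checked with a short host-side sketch. This is plain C++, not AI Engine kernel code, and the helper names `cycles_per_batch` and `compute_bound_msps` are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Cycles needed to produce one batch of outputs when the vector unit
// completes `macs_per_cycle` complex MACs every cycle (rounded up).
inline std::uint64_t cycles_per_batch(std::uint64_t outputs,
                                      std::uint64_t macs_per_output,
                                      std::uint64_t macs_per_cycle) {
    assert(macs_per_cycle > 0);
    const std::uint64_t total_macs = outputs * macs_per_output;
    return (total_macs + macs_per_cycle - 1) / macs_per_cycle;
}

// Compute-bound sample rate in Msps for a clock frequency given in MHz:
// outputs produced per batch divided by the cycles that batch takes.
inline std::uint64_t compute_bound_msps(std::uint64_t outputs,
                                        std::uint64_t macs_per_output,
                                        std::uint64_t macs_per_cycle,
                                        std::uint64_t clock_mhz) {
    const std::uint64_t cycles =
        cycles_per_batch(outputs, macs_per_output, macs_per_cycle);
    assert(cycles > 0);
    return outputs * clock_mhz / cycles;
}
```

With the figures used here, `cycles_per_batch(4, 32, 8)` evaluates to 16 and `compute_bound_msps(4, 32, 8, 1000)` evaluates to 250.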

This is the compute bound of an AI Engine. However, you must also consider the memory bound to see whether that sample rate can be sustained. Assume that one stream input and one stream output are used for data transfer, and that the coefficients are stored in the AI Engine internal memory. The stream interface of an AI Engine supports 32 bits per cycle. Because a cint16 sample is 32 bits wide (16-bit real and 16-bit imaginary parts), the interface can transfer one sample every cycle.

Thus, the sample rate from the data transfer point of view is as follows.

1 sample/cycle * 1 GHz = 1 Gsps

This is larger than the compute bound of 250 Msps. Therefore, the single AI Engine implementation is compute bound and operates at 250 Msps.
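The comparison between the two bounds can be sketched on the host as follows. The helper names `stream_bound_msps` and `achievable_msps` are hypothetical, and the 32-bit sample width assumes cint16 data (16-bit real and 16-bit imaginary parts).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Stream-bound sample rate in Msps: bits moved per microsecond
// (bits per cycle times the clock in MHz) divided by bits per sample.
inline std::uint64_t stream_bound_msps(std::uint64_t stream_bits_per_cycle,
                                       std::uint64_t bits_per_sample,
                                       std::uint64_t clock_mhz) {
    assert(bits_per_sample > 0);
    return stream_bits_per_cycle * clock_mhz / bits_per_sample;
}

// The achievable rate is limited by the smaller of the two bounds.
inline std::uint64_t achievable_msps(std::uint64_t compute_msps,
                                     std::uint64_t stream_msps) {
    return std::min(compute_msps, stream_msps);
}
```

For a 32-bit stream, 32-bit samples, and a 1 GHz clock, `stream_bound_msps(32, 32, 1000)` evaluates to 1000, and `achievable_msps(250, 1000)` evaluates to 250, so the compute bound is the limiting factor.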

Figure 1. One AI Engine FIR Filter Realization

Based on these calculations, one stream input and one stream output can sustain 1 Gsps. If you split the MAC operations of the single-kernel implementation across four kernels, the aggregate compute throughput becomes 4 * 250 Msps = 1 Gsps. The four kernels are connected through cascade streams, which pass partial accumulation results from one kernel to the next. As a result, the AI Engine compute bound matches the AI Engine stream interface throughput.

Figure 2. 1 Gsps Implementation with Four Cascaded Kernels
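The split across four kernels can be modeled on the host with scalar code. The sketch below is not AI Engine kernel code and does not use the cascade stream API. It assumes a 32-tap direct-form FIR (which is what 32 complex MACs per output implies), and the type `Cplx` and the functions `cascade_stage`, `fir_output_cascaded`, and `fir_output_single` are hypothetical names. Each stage adds its 8 tap products to the partial sum received from the previous stage, which is the role the cascade stream plays between kernels.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Minimal complex integer type standing in for cint16 data and a wider
// accumulator (illustrative only).
struct Cplx {
    long long re;
    long long im;
};

// acc + a * b for complex integers.
inline Cplx mac(Cplx acc, Cplx a, Cplx b) {
    return {acc.re + a.re * b.re - a.im * b.im,
            acc.im + a.re * b.im + a.im * b.re};
}

constexpr std::size_t kTaps = 32;                          // assumed filter length
constexpr std::size_t kKernels = 4;                        // cascaded kernels
constexpr std::size_t kTapsPerKernel = kTaps / kKernels;   // 8 taps per kernel
static_assert(kTaps % kKernels == 0, "taps must split evenly across kernels");

// One cascade stage: accumulate this kernel's share of the taps on top of
// the partial sum handed over by the previous stage.
// window[k] is the input sample aligned with coefficient k.
inline Cplx cascade_stage(std::size_t stage, Cplx partial_in,
                          const std::array<Cplx, kTaps>& coeff,
                          const std::array<Cplx, kTaps>& window) {
    assert(stage < kKernels);
    Cplx acc = partial_in;
    const std::size_t first = stage * kTapsPerKernel;
    for (std::size_t k = first; k < first + kTapsPerKernel; ++k) {
        acc = mac(acc, coeff[k], window[k]);
    }
    return acc;
}

// Four stages chained: the output of each stage feeds the next one.
inline Cplx fir_output_cascaded(const std::array<Cplx, kTaps>& coeff,
                                const std::array<Cplx, kTaps>& window) {
    Cplx acc{0, 0};
    for (std::size_t stage = 0; stage < kKernels; ++stage) {
        acc = cascade_stage(stage, acc, coeff, window);
    }
    return acc;
}

// Reference: all 32 MACs performed by a single kernel.
inline Cplx fir_output_single(const std::array<Cplx, kTaps>& coeff,
                              const std::array<Cplx, kTaps>& window) {
    Cplx acc{0, 0};
    for (std::size_t k = 0; k < kTaps; ++k) {
        acc = mac(acc, coeff[k], window[k]);
    }
    return acc;
}
```

Because each stage performs 8 of the 32 MACs per output, four outputs cost 32 MACs per kernel, or 4 cycles at 8 MACs per cycle. That corresponds to 4 samples every 4 cycles, or 1 Gsps at 1 GHz, while the chained result stays identical to the single-kernel result.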