The AI Engine vector unit supports 8 complex MACs per cycle when both operands are of type cint16. If a four-lane implementation using the mul4/mac4 intrinsics is adopted, each lane performs two complex operations per intrinsic call.
Each output requires 32 complex MACs, so four outputs require 4 * 32 = 128 complex MACs. At 8 complex MACs per mac4(), computing four outputs requires 16 mac4() calls, which means 16 cycles on an AI Engine. So the sample rate of an AI Engine (assuming it runs at 1 GHz) is as follows.
4 outputs / 16 cycles * 1 GHz = 250 Msps
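The compute-bound arithmetic can be checked with a short script. This is only a sketch of the calculation, not AI Engine kernel code. The constant and function names are illustrative, and the 1 GHz clock is the assumption stated in the text.

```python
# Compute-bound estimate for one AI Engine, using the figures from the text.
TAPS = 32                 # complex MACs needed per output sample
MACS_PER_CYCLE = 8        # cint16 x cint16 complex MACs per cycle
LANES = 4                 # outputs produced together by mul4/mac4
CLOCK_HZ = 1_000_000_000  # assumed 1 GHz AI Engine clock


def compute_bound_sps(taps, macs_per_cycle, lanes, clock_hz):
    """Return (vector operations per group of outputs, samples per second)."""
    total_macs = lanes * taps            # 4 * 32 = 128 complex MACs
    ops = total_macs // macs_per_cycle   # 16 mac4() calls, one per cycle
    return ops, lanes * clock_hz // ops


ops, rate = compute_bound_sps(TAPS, MACS_PER_CYCLE, LANES, CLOCK_HZ)
print(f"{ops} cycles per {LANES} outputs -> {rate / 1e6:.0f} Msps")
```

The estimate assumes one vector operation issues every cycle with no stalls, so it is an upper bound on what a real kernel would reach.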
This calculates the compute bound of an AI Engine. However, you still need to consider the memory bound to see if that sample rate can be met. Assume that one stream input and one stream output are used for data transfer and that the coefficients are stored in the AI Engine internal memory. The stream interface of an AI Engine supports 32 bits per cycle, so it can transfer one cint16 sample (32 bits) every cycle.
Thus, the sample rate from the data transfer point of view is as follows.
1 sample / cycle * 1 GHz = 1 Gsps
This is larger than the compute bound, which is 250 Msps. Therefore, a single-kernel AI Engine implementation is compute bound and operates at 250 Msps.
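The comparison between the two bounds can be sketched as follows. The names are illustrative. The 32-bit sample size follows from cint16 having a 16-bit real part and a 16-bit imaginary part, and the 1 GHz clock is the same assumption as before.

```python
# Compare the stream (data transfer) bound with the compute bound.
CLOCK_HZ = 1_000_000_000         # assumed 1 GHz AI Engine clock
STREAM_BITS_PER_CYCLE = 32       # stream interface width per cycle
SAMPLE_BITS = 32                 # cint16: 16-bit real + 16-bit imaginary
COMPUTE_BOUND_SPS = 250_000_000  # single-kernel compute bound from the text


def stream_bound_sps(bits_per_cycle, sample_bits, clock_hz):
    """Samples per second that one stream can carry."""
    return bits_per_cycle * clock_hz // sample_bits


stream_rate = stream_bound_sps(STREAM_BITS_PER_CYCLE, SAMPLE_BITS, CLOCK_HZ)
achievable = min(stream_rate, COMPUTE_BOUND_SPS)  # the tighter bound wins
limiter = "compute" if COMPUTE_BOUND_SPS < stream_rate else "stream"
print(f"stream: {stream_rate / 1e6:.0f} Msps, "
      f"achievable: {achievable / 1e6:.0f} Msps ({limiter} bound)")
```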
Based on these calculations, the stream input and output interfaces can sustain 1 Gsps. If you split the MAC operations of the single-kernel implementation across four kernels, the combined compute throughput is 4 * 250 Msps = 1 Gsps. The four kernels are connected through cascade streams. With this partitioning, the AI Engine compute bound matches the AI Engine interface throughput.
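The kernel count can be derived from the two bounds. The sketch below assumes the 32 taps are divided evenly across the kernels, which the text does not state explicitly. It also ignores any cascade or pipeline overhead. All names are illustrative.

```python
# Number of kernels needed for compute throughput to match the stream rate.
CLOCK_HZ = 1_000_000_000          # assumed 1 GHz AI Engine clock
TARGET_SPS = 1_000_000_000        # stream interface bound (1 Gsps)
SINGLE_KERNEL_SPS = 250_000_000   # compute bound of one kernel
TAPS = 32                         # complex MACs per output sample
MACS_PER_CYCLE = 8                # cint16 complex MACs per cycle
LANES = 4                         # outputs per mul4/mac4 call

# Integer ceiling division: kernels needed to reach the target rate.
kernels = -(-TARGET_SPS // SINGLE_KERNEL_SPS)

# Assumed even split: each kernel handles an equal share of the taps.
taps_per_kernel = TAPS // kernels
cycles_per_group = LANES * taps_per_kernel // MACS_PER_CYCLE
per_kernel_sps = LANES * CLOCK_HZ // cycles_per_group

print(f"{kernels} kernels, {taps_per_kernel} taps each, "
      f"{cycles_per_group} cycles per {LANES} outputs, "
      f"{per_kernel_sps / 1e6:.0f} Msps")
```

Under these assumptions each kernel processes every sample but only a quarter of the taps, so each one keeps pace with the 1 Gsps stream.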