The last two sections (Multi-kernel and Single-Stream SSR) showed that, with a single input stream, stream bandwidth and compute performance for cint16 x cint16 are balanced by an 8-tap filter implementation in one AI Engine. This is easy to compute. For the device on the VCK190 speed grade of the AMD Versal™ AI Core Series, the entire AI Engine array (processors, AXI-Stream connections, memory modules) is clocked at 1.25 GHz. The input stream can transfer 32 bits per clock and a cint16 variable is 32 bits wide, so one stream delivers 1.25 GSPS (giga samples per second). The processor itself is capable of eight cint16 x cint16 operations per clock cycle. The processor can therefore compute eight filter taps for every input sample, that is, one 8-tap filter output per clock cycle.
If two input streams are used efficiently, the input sample rate can increase to 2.5 GSPS (1.25 GSPS per stream). Because the processor performance does not change, its eight operations per clock cycle are now shared between two input samples, so it can sustain only four taps at this input sample rate.
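The balance described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the tutorial sources; the constant and function names are hypothetical and only restate the figures quoted in the text.

```python
# Figures quoted in the text; names are illustrative only.
CLOCK_GHZ = 1.25             # AI Engine array clock
STREAM_BITS_PER_CLOCK = 32   # one input stream moves 32 bits per clock
CINT16_BITS = 32             # 16-bit real part + 16-bit imaginary part
OPS_PER_CLOCK = 8            # cint16 x cint16 operations per clock cycle


def samples_per_clock(num_streams):
    """Number of cint16 samples entering the kernel on every clock cycle."""
    return num_streams * STREAM_BITS_PER_CLOCK // CINT16_BITS


def input_rate_gsps(num_streams):
    """Aggregate input sample rate in giga samples per second."""
    return samples_per_clock(num_streams) * CLOCK_GHZ


def balanced_taps(num_streams):
    """Taps one kernel can sustain when it must keep up with the input rate."""
    return OPS_PER_CLOCK // samples_per_clock(num_streams)


for n in (1, 2):
    print(f"{n} stream(s): {input_rate_gsps(n)} GSPS, {balanced_taps(n)} taps per kernel")
```

With one stream the kernel is balanced at eight taps; with two streams the same compute budget covers four taps.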
This means that in a single-stream implementation, the filter length should be a multiple of eight to extract maximum performance from the AI Engine array. In a dual-stream implementation, a multiple of four is enough to reach this maximum performance. This finer granularity allows more freedom in the filter length. Take a 12-tap filter as an example, with an input sample rate of 2.5 GSPS.
Single-stream implementation: The input sample rate (2.5 GSPS) requires splitting the coefficients and input data into two phases (1.25 GSPS each). With two phases, this implementation requires four kernels in a 2 x 2 grid. 12 taps divided into two phases gives six taps per phase, so each kernel handles six taps out of a possible eight. The single-stream implementation therefore uses four AI Engines at 75 percent of their maximum performance.
Dual-stream implementation: This case is simpler. Each kernel can accept the full 2.5 GSPS input sample rate, but can process only four taps at that rate. This implementation requires three kernels (3 kernels x 4 taps = 12 taps) running at 100 percent of their compute performance.
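The 12-tap comparison can be reproduced numerically. The sketch below uses hypothetical helper names and assumes, as in the example, that each coefficient phase of the single-stream design fits in one kernel (at most eight taps per phase); it is not a general sizing tool.

```python
import math


def single_stream_plan(num_taps, phases=2, max_taps_per_kernel=8):
    """Return (kernel count, utilization) for a phases x phases kernel grid.

    Assumption: every phase holds at most max_taps_per_kernel coefficients,
    so one kernel per grid cell is enough.
    """
    taps_per_phase = num_taps / phases
    kernels = phases * phases
    return kernels, taps_per_phase / max_taps_per_kernel


def dual_stream_plan(num_taps, max_taps_per_kernel=4):
    """Return (kernel count, utilization) when each kernel handles four taps."""
    kernels = math.ceil(num_taps / max_taps_per_kernel)
    return kernels, num_taps / (kernels * max_taps_per_kernel)


print(single_stream_plan(12))  # 2 x 2 grid, six taps per kernel
print(dual_stream_plan(12))    # three kernels, four taps each
```

For 12 taps this gives four kernels at 75 percent for the single-stream case and three kernels at 100 percent for the dual-stream case, matching the figures above.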
A major impact is the way the data flows to the AI Engine. The AI Engine alternately reads four samples from each stream (or eight samples at the same time). The resulting sequence must be equivalent to a single 2.5 GSPS data stream. Suppose you have the following 2.5 GSPS data stream: d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15, d16, d17, d18, d19, ...
The AI Engine read sequence must be:
- Read Stream 0 : `d0, d1, d2, d3`
- Read Stream 1 : `d4, d5, d6, d7`
- Read Stream 0 : `d8, d9, d10, d11`
- Read Stream 1 : `d12, d13, d14, d15`
- Read Stream 0 : `d16, d17, d18, d19`
- ...
So the content of each stream must be:
- Stream 0: `d0, d1, d2, d3, d8, d9, d10, d11, d16, d17, d18, d19, ...`
- Stream 1: `d4, d5, d6, d7, d12, d13, d14, d15, ...`
The stream content therefore depends on the number of samples (bits) that the system reads as one block from each stream.
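The distribution of samples over the two streams can be modeled on the host side as follows. This is an illustrative sketch only: `split_into_streams` and `read_alternately` are hypothetical names, and the block size of four samples is the one used in the read sequence above.

```python
def split_into_streams(samples, block=4, num_streams=2):
    """Deal consecutive blocks of `block` samples to the streams in turn."""
    streams = [[] for _ in range(num_streams)]
    for start in range(0, len(samples), block):
        target = (start // block) % num_streams
        streams[target].extend(samples[start:start + block])
    return streams


def read_alternately(streams, block=4):
    """Model the kernel side: read one block from each stream in turn."""
    out = []
    positions = [0] * len(streams)
    s = 0
    while positions[s] < len(streams[s]):
        out.extend(streams[s][positions[s]:positions[s] + block])
        positions[s] += block
        s = (s + 1) % len(streams)
    return out


samples = [f"d{i}" for i in range(20)]
stream0, stream1 = split_into_streams(samples)
print("Stream 0:", stream0)
print("Stream 1:", stream1)
# Reading the streams alternately restores the original order.
print(read_alternately([stream0, stream1]) == samples)
```

Changing `block` changes which samples land on which stream, which is why the producer and the kernel must agree on the block size.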
In the single-stream implementation, some kernels had to discard one sample before the first invocation. The initialization function accomplished this easily, discarding one sample, or blocks of eight samples if the coefficient phases were longer than eight coefficients. In the dual-stream implementation, this is more complex: if one sample is consumed from Stream 0 beforehand, the order of the combined stream is completely disrupted. To avoid changing the stream content, reorganize the computation to start one sample later (change the Start parameter of the mul4/mac4 intrinsic).
On top of this, if the coefficient phase is longer than four, you must discard four elements; if it is longer than eight, discard eight. Stream 0 provides the first four elements, Stream 1 provides the next four, and then Stream 0 again. If more blocks of four elements are discarded from Stream 0 than from Stream 1, the first stream to read within the kernel is Stream 1.
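The effect of discarding whole blocks on the read order can be illustrated with a small model. This is a sketch under stated assumptions, not the tutorial's initialization code: `discard_blocks` is a hypothetical helper, blocks are four samples long, and blocks are discarded in arrival order starting with Stream 0.

```python
BLOCK = 4  # samples read as one block from a stream


def discard_blocks(stream0, stream1, num_blocks):
    """Drop `num_blocks` blocks in arrival order (Stream 0, Stream 1, Stream 0, ...).

    Returns the remaining content of both streams and the index of the
    stream the kernel must read first afterwards.
    """
    dropped0 = (num_blocks + 1) // 2  # Stream 0 supplies blocks 0, 2, 4, ...
    dropped1 = num_blocks // 2        # Stream 1 supplies blocks 1, 3, 5, ...
    first_stream = 1 if dropped0 > dropped1 else 0
    return stream0[dropped0 * BLOCK:], stream1[dropped1 * BLOCK:], first_stream


s0 = [0, 1, 2, 3, 8, 9, 10, 11]
s1 = [4, 5, 6, 7, 12, 13, 14, 15]
print(discard_blocks(s0, s1, 1))  # one block dropped: reading resumes on Stream 1
print(discard_blocks(s0, s1, 2))  # two blocks dropped: reading resumes on Stream 0
```

An odd number of discarded blocks leaves Stream 1 holding the oldest remaining samples, so the kernel has to start its alternating reads there.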