Dual-Stream Input Impact - 2023.2 English

Vitis Tutorials: AI Engine (XD100)


The last two sections (Multi-kernel and Single-Stream SSR) showed that, with a single input stream, stream bandwidth and compute performance for cint16 x cint16 are balanced by an 8-tap filter implementation in one AI Engine. This is easy to verify. For the speed grade of the AMD Versal™ AI Core device on the VCK190, the entire AI Engine array (processors, AXI-Stream connections, memory modules, and so on) is clocked at 1.25 GHz. An input stream transfers 32 bits per clock and a cint16 value is 32 bits wide, which gives a rate of 1.25 Gsps (gigasamples per second). The processor itself can perform eight cint16 x cint16 operations per clock cycle. As a result, the processor can compute eight filter taps for each input sample, that is, an 8-tap filter at the full stream rate.

If the two input streams are used efficiently, the input sample rate can increase to 2.5 Gsps (1.25 Gsps per stream). Because the processor performance does not change, it can process only four taps per input sample at this rate.
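The arithmetic behind these figures can be checked with a short sketch. The constants come from the text above; the function names `input_rate_gsps` and `taps_per_kernel` are hypothetical and only serve this illustration.

```python
CLOCK_GHZ = 1.25             # AI Engine array clock, from the text
STREAM_BITS_PER_CLOCK = 32   # one input stream transfers 32 bits per clock
CINT16_BITS = 32             # 16-bit real part + 16-bit imaginary part
MACS_PER_CLOCK = 8           # cint16 x cint16 operations per clock cycle


def samples_per_clock(num_streams):
    """cint16 samples entering the kernel on every clock cycle."""
    return num_streams * STREAM_BITS_PER_CLOCK // CINT16_BITS


def input_rate_gsps(num_streams):
    """Input sample rate in Gsps for the given number of input streams."""
    return samples_per_clock(num_streams) * CLOCK_GHZ


def taps_per_kernel(num_streams):
    """Filter taps one kernel can compute for each input sample."""
    return MACS_PER_CLOCK // samples_per_clock(num_streams)


for n in (1, 2):
    print(f"{n} stream(s): {input_rate_gsps(n)} Gsps, "
          f"{taps_per_kernel(n)} taps per kernel")
```

With one stream the kernel has eight operations available per sample; with two streams the same eight operations are shared between two samples per clock, which leaves four per sample.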

This means that for a single-stream implementation, the filter length should be a multiple of eight to extract the maximum performance from the AI Engine array. For a dual-stream implementation, the filter length should be a multiple of four to achieve this maximum performance. This finer granularity allows more freedom in the choice of filter length. Take a 12-tap filter as an example, with an input sample rate of 2.5 Gsps.

  1. Single-stream implementation: The input sample rate (2.5 Gsps) requires the coefficients and the input data to be split into two phases (1.25 Gsps each). With two phases, this implementation requires four kernels arranged in a 2 x 2 grid. Dividing 12 taps into two phases gives six taps per phase. Each kernel handles six taps, although it could handle up to eight. The single-stream implementation therefore uses four AI Engines at 75 percent of their maximum performance.

  2. Dual-stream implementation: This case is much simpler. The input interface can handle a 2.5 Gsps input sample rate, but each kernel can process only four taps at that rate. This implementation requires three kernels (3 kernels x 4 taps = 12 taps) running at 100 percent of their compute performance.
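The comparison above can be reproduced with a minimal sketch. The helper names `single_stream_plan` and `dual_stream_plan` are hypothetical, and the single-stream model assumes that each coefficient phase fits in one kernel (at most eight taps per phase), as in the 12-tap example.

```python
import math

STREAM_RATE_GSPS = 1.25   # sample rate of one stream, from the text
MAX_TAPS_SINGLE = 8       # taps per kernel with one input stream
MAX_TAPS_DUAL = 4         # taps per kernel with two input streams


def single_stream_plan(num_taps, input_rate_gsps):
    """Return (kernel count, utilization) for the single-stream SSR grid.

    Assumption: every coefficient phase fits in a single kernel.
    """
    phases = math.ceil(input_rate_gsps / STREAM_RATE_GSPS)
    taps_per_phase = math.ceil(num_taps / phases)
    if taps_per_phase > MAX_TAPS_SINGLE:
        raise ValueError("phase too long for this simplified model")
    kernels = phases * phases            # phases x phases grid
    return kernels, taps_per_phase / MAX_TAPS_SINGLE


def dual_stream_plan(num_taps):
    """Return (kernel count, utilization) for a dual-stream input at 2.5 Gsps."""
    kernels = math.ceil(num_taps / MAX_TAPS_DUAL)
    return kernels, num_taps / (kernels * MAX_TAPS_DUAL)


print(single_stream_plan(12, 2.5))
print(dual_stream_plan(12))
```

For the 12-tap filter at 2.5 Gsps, this yields four kernels at 75 percent for the single-stream case and three kernels at 100 percent for the dual-stream case, matching the two items above.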

A major impact is the way the data is provided to the AI Engine. The AI Engine alternately reads four samples from each of the two streams (or eight samples at the same time). The resulting stream should be equivalent to a 2.5 Gsps data stream. Suppose we have the following 2.5 Gsps data stream: d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15, d16, d17, d18, d19, ...

The AI Engine read sequence should be:

  • Read Stream 0 : d0, d1, d2, d3

  • Read Stream 1 : d4, d5, d6, d7

  • Read Stream 0 : d8, d9, d10, d11

  • Read Stream 1 : d12, d13, d14, d15

  • Read Stream 0 : d16, d17, d18, d19

So the content of each stream should be:

  • Stream 0: d0, d1, d2, d3, d8, d9, d10, d11, d16, d17, d18, d19, ...

  • Stream 1: d4, d5, d6, d7, d12, d13, d14, d15, ...

The stream content depends on the number of samples (and therefore bits) that are read as a block from each stream.
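The block-wise distribution described above can be sketched as follows. This is an illustration of how the two stream contents could be prepared on the host or testbench side, not the tutorial's own code; `split_dual_stream` and `merge_dual_stream` are hypothetical names, and the block size of four samples is the one used in the read sequence above.

```python
def split_dual_stream(samples, block=4):
    """Distribute samples over two streams in alternating blocks."""
    streams = ([], [])
    for start in range(0, len(samples), block):
        # Even-numbered blocks go to Stream 0, odd-numbered to Stream 1.
        streams[(start // block) % 2].extend(samples[start:start + block])
    return streams


def merge_dual_stream(stream0, stream1, block=4):
    """Rebuild the original order, reading blocks alternately as the kernel does."""
    merged = []
    for start in range(0, max(len(stream0), len(stream1)), block):
        merged.extend(stream0[start:start + block])
        merged.extend(stream1[start:start + block])
    return merged


data = [f"d{i}" for i in range(20)]
stream0, stream1 = split_dual_stream(data)
print("Stream 0:", ", ".join(stream0))
print("Stream 1:", ", ".join(stream1))
```

Changing `block` changes the content of both streams, which is why the data producer and the kernel must agree on the block size.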

In the single-stream implementation, some of the kernels had to discard one sample before the first invocation of the kernel. This was easily done by the initialization function, which could discard one sample, and also blocks of eight samples if the coefficient phases were longer than eight coefficients. In the dual-stream implementation, this is slightly more complex: if one sample is read from Stream 0 beforehand, the combination of the two streams becomes completely disorganized. To avoid changing the stream content in this case, the computation is reorganized to start one sample later (by changing the Start parameter of the mul4/mac4 intrinsic).

On top of this, if the coefficient phase is longer than four, four elements need to be discarded. If it is longer than eight, eight elements need to be discarded. The first four elements must be read from Stream 0, the next four from Stream 1, and then again from Stream 0. If more blocks of four elements are read from Stream 0 than from Stream 1, the first stream to read within the kernel is Stream 1.
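The alternation rule for discarded blocks can be modeled with a small sketch. It takes the number of four-sample blocks to discard as an input and does not derive that number from the coefficient phase length; `discard_plan` is a hypothetical helper, not part of the tutorial code.

```python
def discard_plan(num_blocks):
    """Model the discard order for blocks of four samples.

    Blocks are discarded alternately, starting with Stream 0.
    Returns (blocks taken from Stream 0, blocks taken from Stream 1,
    index of the first stream the kernel reads afterwards).
    """
    from_stream0 = (num_blocks + 1) // 2   # blocks 0, 2, 4, ...
    from_stream1 = num_blocks // 2         # blocks 1, 3, 5, ...
    # If Stream 0 is one block ahead, the kernel must start with Stream 1.
    first_stream = 1 if from_stream0 > from_stream1 else 0
    return from_stream0, from_stream1, first_stream


for blocks in range(4):
    s0, s1, first = discard_plan(blocks)
    print(f"discard {blocks} block(s): {s0} from Stream 0, "
          f"{s1} from Stream 1, kernel starts on Stream {first}")
```

Discarding one block (four elements) takes it from Stream 0, so the kernel starts reading on Stream 1. Discarding two blocks (eight elements) takes one from each stream, so the kernel starts on Stream 0 as usual.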