Designing the Graph

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2026-03-27
Version: 2025.2 English

This section reuses the kernels created in the previous section; the only difference is how they are connected. The preceding images show 16 (data phase, coefficient phase) associations. They also show that some data streams discard data before computation starts:

  • Output phase 0: No input data phase has discarded samples.

  • Output phase 1: Input data phase 0 has one discarded sample.

  • Output phase 2: Input data phases 0 and 1 each have one discarded sample.

  • Output phase 3: Input data phases 0, 1, and 2 each have one discarded sample.
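The list above follows a single rule: for output phase p, every input data phase below p drops one sample. The following is a minimal sketch of that rule in plain C++ (it is not ADF graph code). The names `discardCount` and `discardTable` are hypothetical and chosen only for illustration.

```cpp
#include <array>

constexpr int kPhases = 4;  // four input data phases and four output phases

// Samples discarded at start-up for one (output phase, input data phase) pair.
// Rule read from the list above: data phases below the output phase drop one sample.
constexpr int discardCount(int outputPhase, int dataPhase) {
    return dataPhase < outputPhase ? 1 : 0;
}

// Full 4x4 table: first index = output phase, second index = input data phase.
inline std::array<std::array<int, kPhases>, kPhases> discardTable() {
    std::array<std::array<int, kPhases>, kPhases> table{};
    for (int out = 0; out < kPhases; ++out)
        for (int in = 0; in < kPhases; ++in)
            table[out][in] = discardCount(out, in);
    return table;
}
```

Under this rule, output phase 0 has no discarding inputs and output phase 3 has three, so 6 of the 16 associations discard a sample.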

To minimize the data routing, place all blocks using the same data stream in the same column. This leads to the following architecture:

Figure: FourPhasesSingleStream

In the AI Engine array, the cascade stream direction flips from one row to the next.

Figure: CascadeDir2

Take this feature into account when placing the kernels to get the cascade connections correct in the graph:

Figure: FourPhasesSingleStreamPlaced
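The effect of the alternating cascade direction on placement can be sketched as a small helper that lists the columns a cascade chain visits in a given row. This is plain C++ for illustration, not ADF placement code. The name `cascadeColumnOrder` is hypothetical, and the orientation (even rows run from column 0 upward, odd rows run the other way) is an assumption inferred from the latency table later in this section.

```cpp
#include <vector>

// Columns visited by the cascade chain in one row, in visiting order.
// ASSUMPTION: even rows start at column 0; odd rows start at the last column.
inline std::vector<int> cascadeColumnOrder(int row, int numColumns) {
    std::vector<int> order;
    order.reserve(numColumns);
    for (int step = 0; step < numColumns; ++step) {
        // Flip the traversal direction on every odd row.
        order.push_back(row % 2 == 0 ? step : numColumns - 1 - step);
    }
    return order;
}
```

For a four-column design, `cascadeColumnOrder(0, 4)` yields columns 0, 1, 2, 3 and `cascadeColumnOrder(1, 4)` yields columns 3, 2, 1, 0.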

The kernels highlighted in the following figure must discard one sample within the initialization function:

Figure: FourPhasesSingleStreamDiscard

At this point, consider latencies within the kernels. The operation scheduling performs the following sequence:

  1. Reads data from the stream

  2. Performs one mul4 and three mac4 operations

  3. Sends the accumulator to the cascade stream

Overall, the latency from ‘read’ to ‘write’ is approximately 20-25 clock cycles (call it L, with L ≈ 25). In the left-hand column, the data input to row one needs a FIFO of length ~75 (3L). The input to row two needs approximately the same delay as row zero, and row three must be fed at the same time as row one. The following table shows the required input delays as multiples of L:

| | Column 0 | Column 1 | Column 2 | Column 3 |
| ---: | :---: | :---: | :---: | :---: |
| Row 3 | 3L | 2L | L | 0 |
| Row 2 | 0 | L | 2L | 3L |
| Row 1 | 3L | 2L | L | 0 |
| Row 0 | 0 | L | 2L | 3L |

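The latency pattern reverses from one row to the next: even rows need delays that grow from column 0 to column 3, while odd rows need delays that grow from column 3 to column 0.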
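The table entries can be reproduced with one line of arithmetic: the delay is the kernel's position in its cascade chain multiplied by L. The sketch below is plain C++, not ADF code. The name `inputDelayCycles` is hypothetical, L = 25 is the approximate upper estimate quoted above, and the row orientation is read off the table.

```cpp
// Required input delay, in clock cycles, for the kernel at (row, column).
// ASSUMPTIONS: L = 25 cycles per kernel; even rows cascade from column 0
// upward and odd rows cascade from the last column downward (as in the table).
inline int inputDelayCycles(int row, int column, int numColumns = 4, int L = 25) {
    // Position of this kernel in its row's cascade chain (0 = first kernel).
    const int step = (row % 2 == 0) ? column : numColumns - 1 - column;
    return step * L;
}
```

With these assumptions, `inputDelayCycles(1, 0)` gives 75 cycles (3L), matching the FIFO length quoted for row one in the left-hand column, and `inputDelayCycles(0, 0)` gives 0.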

One possibility is to implement these FIFOs in the PL, with two streams coming from the PL for each column: one serving the even rows and the other serving the odd rows. The first and last columns require a single FIFO, but the inner columns need two.

Figure: FourPhasesDualStreams
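The FIFO count per column follows from the latency table: a stream whose required delay is zero needs no FIFO. The sketch below checks that count in plain C++. The name `plFifosForColumn` is hypothetical, and the model assumes one PL stream for the even rows and one for the odd rows of each column, as described above.

```cpp
// Number of PL FIFOs needed for one column when the even rows and the odd
// rows are each fed by their own stream.
// ASSUMPTION: a stream needs a FIFO only if its required delay is non-zero.
inline int plFifosForColumn(int column, int numColumns = 4) {
    const int evenRowSteps = column;                  // delay of even rows, in units of L
    const int oddRowSteps = numColumns - 1 - column;  // delay of odd rows, in units of L
    return (evenRowSteps > 0 ? 1 : 0) + (oddRowSteps > 0 ? 1 : 0);
}
```

For four columns this gives 1, 2, 2, and 1 FIFOs for columns 0 through 3, which agrees with the statement that only the inner columns need two.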

Another possibility is to place the FIFOs inside the AI Engine array. Latencies under 32 clock cycles are implemented in the FIFOs built into the AXI-Stream interconnect. Beyond that threshold, the FIFO is implemented in a memory module as a DMA FIFO. You can either share one DMA FIFO for the odd rows and another for the even rows, or dedicate one FIFO to each AI Engine. This design uses the latter choice and constrains each FIFO's placement to be right beside its kernel.
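The choice between the two FIFO resources can be expressed as a threshold test on the required delay. The following is a minimal sketch in plain C++, not the ADF constraint syntax. The names `FifoKind` and `chooseFifo` are hypothetical, and treating a delay of exactly 32 cycles as a DMA FIFO is an assumption, because the text only says "under 32" for the interconnect FIFOs.

```cpp
// Which resource implements a given input delay inside the AI Engine array.
enum class FifoKind { None, StreamSwitch, Dma };

// ASSUMPTION: delays below 32 cycles fit in the AXI-Stream interconnect FIFOs;
// delays of 32 cycles or more use a DMA FIFO in a memory module.
inline FifoKind chooseFifo(int latencyCycles) {
    if (latencyCycles <= 0) return FifoKind::None;  // no delay, no FIFO
    return latencyCycles < 32 ? FifoKind::StreamSwitch : FifoKind::Dma;
}
```

With L ≈ 25, a delay of L fits in the interconnect FIFOs, while delays of 2L (50 cycles) and 3L (75 cycles) fall in the DMA FIFO range.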