Designing the Graph - 2025.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2025-12-05
Version: 2025.2 English

The kernels created in the previous section can be reused here; the only difference is the way they are connected together. The preceding images show 16 (Data Phase, Coefficient Phase) associations. They also show that some data streams must discard samples before the computation starts:

  • Output phase 0: No input data phase will have discarded samples.

  • Output phase 1: Input data phase 0 will have 1 discarded sample.

  • Output phase 2: Input data phases 0 and 1 will each have 1 discarded sample.

  • Output phase 3: Input data phases 0, 1, and 2 will each have 1 discarded sample.
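
The discard rule in the list above can be written as a single comparison. The following is a minimal sketch, not code from the tutorial: the names `kNumPhases`, `discarded_samples`, and `kernels_with_discard` are hypothetical, and the code assumes a C++14 compiler.

```cpp
// Illustrative model of the discard rule listed above (not tutorial code).
// Number of output phases, which is also the number of input data phases.
constexpr int kNumPhases = 4;

// Samples discarded on input data phase `data_phase` when computing
// output phase `output_phase`: data phases below the output phase drop one.
constexpr int discarded_samples(int output_phase, int data_phase) {
    return (data_phase < output_phase) ? 1 : 0;
}

// Count how many of the 16 (output phase, data phase) pairs discard a sample.
constexpr int kernels_with_discard() {
    int count = 0;
    for (int out = 0; out < kNumPhases; ++out) {
        for (int in = 0; in < kNumPhases; ++in) {
            count += discarded_samples(out, in);
        }
    }
    return count;
}

// Compile-time checks against the list above.
static_assert(discarded_samples(0, 3) == 0, "output phase 0 discards nothing");
static_assert(discarded_samples(1, 0) == 1, "output phase 1 discards on data phase 0");
static_assert(discarded_samples(3, 2) == 1, "output phase 3 discards on data phase 2");
static_assert(kernels_with_discard() == 6, "0 + 1 + 2 + 3 pairs discard one sample");
```

Under this rule, 6 of the 16 associations discard one sample, and the remaining 10 start on the first sample.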

To minimize the data routing, all blocks using the same data stream should be placed in the same column. This leads to the following architecture:

FourPhasesSingleStream

In the AI Engine array, the direction of the cascade stream is flipped from one row to the next.

CascadeDir2

This feature needs to be taken into account when placing the kernels to get the cascade connections correct in the graph:

FourPhasesSingleStreamPlaced

The kernels highlighted in the following figure need to discard one sample within the initialization function:

FourPhasesSingleStreamDiscard

At this point, consider the latencies within the kernels. In the operation schedule, data is first read from the stream, one mul4 and three mac4 operations are performed, and the accumulator is then sent to the cascade stream. Overall, the latency from ‘read’ to ‘write’ is approximately 20-25 clock cycles (call it L, with L ≈ 25). Because the cascade direction reverses from one row to the next, the kernel in the left-hand column of row 1 is the last one in its cascade chain, so its input data should pass through a FIFO of depth approximately 75 (3L). The input to row 2 has approximately the same timing as the input to row 0, and row 3 should be fed at the same time as row 1. The following table shows the latencies in multiples of L:

         Column 0   Column 1   Column 2   Column 3
Row 3    3L         2L         L          0
Row 2    0          L          2L         3L
Row 1    3L         2L         L          0
Row 0    0          L          2L         3L

Within a given column, the required latency therefore depends on the row: even rows and odd rows need different delays.
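
The latency table follows directly from the alternating cascade direction. The sketch below reproduces it under stated assumptions: even rows cascade from column 0 to column 3 and odd rows from column 3 to column 0 (as the table implies), and L is taken as 25 cycles. The names `kNumColumns`, `kL`, `cascade_step`, and `input_delay_cycles` are hypothetical, not part of any AMD API.

```cpp
// Illustrative model of the latency table (not tutorial code).
constexpr int kNumColumns = 4;  // kernels per cascade chain (one per column)
constexpr int kL = 25;          // assumed read-to-write latency, in clock cycles

// Position of the kernel at (row, col) within its cascade chain.
// Assumption taken from the table: even rows run left to right,
// odd rows run right to left.
constexpr int cascade_step(int row, int col) {
    return (row % 2 == 0) ? col : (kNumColumns - 1 - col);
}

// Delay to apply to the input stream of the kernel at (row, col) so that
// its data arrives when the upstream cascade value does.
constexpr int input_delay_cycles(int row, int col) {
    return cascade_step(row, col) * kL;
}

// Compile-time checks against the table.
static_assert(input_delay_cycles(0, 0) == 0, "row 0, column 0: no delay");
static_assert(input_delay_cycles(1, 0) == 75, "row 1, column 0: 3L");
static_assert(input_delay_cycles(2, 3) == 75, "row 2, column 3: 3L");
static_assert(input_delay_cycles(3, 2) == 25, "row 3, column 2: L");
```

With L = 25, the delays are 0, 25, 50, or 75 cycles depending on the kernel position in its cascade chain.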

One possibility is to implement these FIFOs in the PL, with two streams coming from the PL for each column: one serving the even rows and the other serving the odd rows. A single FIFO is required on the first and last columns, because one of their two streams needs no delay, but two are necessary for the inner columns.

FourPhasesDualStreams
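
The FIFO count for the PL option can be checked from the latency table. The sketch below is illustrative only: it assumes one PL stream per column for the even rows and one for the odd rows, and the names `kDelayInL`, `pl_fifos_for_column`, and `total_pl_fifos` are hypothetical.

```cpp
// Illustrative count of PL FIFOs per column (not tutorial code).
// Delays from the latency table, in multiples of L, indexed [row][column].
constexpr int kDelayInL[4][4] = {
    {0, 1, 2, 3},  // row 0
    {3, 2, 1, 0},  // row 1
    {0, 1, 2, 3},  // row 2 (same as row 0)
    {3, 2, 1, 0},  // row 3 (same as row 1)
};

// One stream serves the even rows and one serves the odd rows, so a column
// needs a FIFO for each of the two streams whose delay is non-zero.
constexpr int pl_fifos_for_column(int col) {
    return (kDelayInL[0][col] != 0 ? 1 : 0) + (kDelayInL[1][col] != 0 ? 1 : 0);
}

constexpr int total_pl_fifos() {
    return pl_fifos_for_column(0) + pl_fifos_for_column(1) +
           pl_fifos_for_column(2) + pl_fifos_for_column(3);
}

// Outer columns need one FIFO each, inner columns need two.
static_assert(pl_fifos_for_column(0) == 1, "first column: one FIFO");
static_assert(pl_fifos_for_column(1) == 2, "inner column: two FIFOs");
static_assert(pl_fifos_for_column(3) == 1, "last column: one FIFO");
static_assert(total_pl_fifos() == 6, "six PL FIFOs in total");
```

Under these assumptions the design needs six PL FIFOs in total for eight input streams.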

Another possibility is to implement them inside the AI Engine array. A latency of less than 32 clock cycles is usually implemented in the FIFOs included in the AXI-Stream interconnect. Above that number, it is implemented in a memory module as a DMA FIFO. You can either share one DMA FIFO for the odd rows and another for the even rows, or dedicate one FIFO to each AI Engine. The latter option is used here, and each FIFO is constrained to be placed right beside its kernel.
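
The rule of thumb above can be expressed as a small classification function. This is a sketch of the decision as described in the text, not of what the AI Engine compiler actually does: the threshold of 32 cycles is the "usual" value quoted above, treating exactly 32 cycles as a DMA FIFO is an assumption, and `FifoKind`, `kStreamFifoLimit`, and `fifo_kind` are hypothetical names.

```cpp
// Illustrative classification of where a delay is likely implemented
// (not tutorial code and not a model of the compiler).
enum class FifoKind {
    None,          // no delay required
    StreamSwitch,  // FIFOs included in the AXI-Stream interconnect
    DmaFifo,       // FIFO implemented in a memory module through DMA
};

// Threshold quoted in the text; the exact boundary behavior is an assumption.
constexpr int kStreamFifoLimit = 32;

constexpr FifoKind fifo_kind(int delay_cycles) {
    return (delay_cycles <= 0)               ? FifoKind::None
         : (delay_cycles < kStreamFifoLimit) ? FifoKind::StreamSwitch
                                             : FifoKind::DmaFifo;
}

// With L of about 25 cycles, only the L delay fits in the interconnect FIFOs.
static_assert(fifo_kind(0) == FifoKind::None, "0: no FIFO");
static_assert(fifo_kind(25) == FifoKind::StreamSwitch, "L: interconnect FIFO");
static_assert(fifo_kind(50) == FifoKind::DmaFifo, "2L: DMA FIFO");
static_assert(fifo_kind(75) == FifoKind::DmaFifo, "3L: DMA FIFO");
```

With L of about 25 cycles, the 2L and 3L delays exceed the threshold, which is consistent with the use of DMA FIFOs described above.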