Compilation and Analysis

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2026-03-27
Version
2025.2 English

Navigate to the MultiKernel directory. The Makefile defines three targets:

  • aie

    • Compiles the graph and the kernels

  • aiesim

    • Runs the AI Engine SystemC simulator

  • aieviz

    • Runs vitis_analyzer on the output summary

Look at the source code (kernel and graph) to familiarize yourself with the C++ instantiation of kernels. In graph.cpp, the code declares the connections between the PL and the AI Engine array using 64-bit interfaces running at 500 MHz, allowing for maximum bandwidth on the AI Engine array AXI-Stream network.

Before running the simulation, generate the input data in one of two ways:

  1. Type make data.

  2. Change directory to data and type GenerateStreams. Set the following parameters for this example:

generateSingleStreamSSR4

Click Generate and then Exit. The generated files, PhaseIn_0.txt to PhaseIn_3.txt, should contain mostly 0s, with a few 1s and 2s.
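A quick sanity check of the generated files can be sketched as follows. This is a minimal illustration, not part of the tutorial's utilities: it assumes the input files hold whitespace-separated integers, and the helper names value_histogram and mostly_zeros, the 0.5 threshold, and the sample text are all hypothetical.

```python
from collections import Counter


def value_histogram(text):
    """Count how often each integer value appears in whitespace-separated text."""
    return Counter(int(token) for token in text.split())


def mostly_zeros(text, threshold=0.5):
    """Return True when more than `threshold` of the values are 0 (assumed criterion)."""
    hist = value_histogram(text)
    total = sum(hist.values())
    return total > 0 and hist[0] / total > threshold


# Made-up sample standing in for a few lines of PhaseIn_0.txt.
sample = "0 0 0 0\n1 0 0 0\n0 0 2 0\n0 0 0 0\n"
hist = value_histogram(sample)
print(sorted(hist.items()))
print(mostly_zeros(sample))
```

To check a real file, read its contents with open(path).read() and pass the text to the same functions.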

Type make all and wait for the vitis_analyzer GUI to display. The AMD Vitis™ Analyzer shows the graph, its device implementation, and the complete simulation timeline. In this case, the graph contains 16 kernels, and the implementation spans multiple AI Engines connected by cascade streams.

Click Graph to visualize the graph of the application:

Graph4Phases

The 16 kernels and their eight independent input streams are clearly visible. The top graph is for the output phases 0 and 2, the phases where the cascade stream is from left to right on the physical device. The bottom graph is for phases 1 and 3 where the cascade stream is from right to left.

Click Array to visualize where the placer positioned the kernels, and how the PL feeds them:

Array4Phases

In this view, the cascade streams connecting neighboring AI Engines are key to the performance of this graph. With the four location constraints added, the placer had only one solution for the kernel placement: this square. The router had an easy job to feed all these kernels by simply using the south-north AXI-Stream. The path back to the PL from the extremities also uses only the vertical AXI-Streams.

Finally, click Trace to see how the entire simulation progressed. This view helps you track where the AI Engines stall if performance is not as expected:

Timeline4Phases

Now you can display the filter output. Because the input is a set of Dirac impulses, you should recognize the impulse response of the filter throughout the waveform. Navigate to aiesimulator_output/data and look at output_0.txt. You can see two complex outputs per line, prepended with a time stamp. To display the outputs, type ProcessAIEOutput output_*:

GraphOutput4Phases

The top graph shows the real part of the output. The bottom graph shows the imaginary part. On both, the filter impulse response is recognizable.
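To inspect the same data outside the provided utility, the output text can be split into real and imaginary parts. The sketch below is not the ProcessAIEOutput implementation. It assumes that each timestamp sits on its own line starting with T (for example "T 100 ns"), followed by a data line of four integers read as two (real, imaginary) pairs. The function name parse_complex_samples and the sample text are hypothetical.

```python
def parse_complex_samples(text):
    """Return the complex samples found in simulator-style output text.

    Assumed layout: marker lines start with 'T' (timestamps, TLAST);
    every other non-blank line holds integers as re, im, re, im, ...
    """
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("T"):
            continue  # skip blank lines and timestamp/TLAST marker lines
        values = [int(field) for field in fields]
        # Pair consecutive integers as (real, imaginary).
        for real_part, imag_part in zip(values[0::2], values[1::2]):
            samples.append(complex(real_part, imag_part))
    return samples


# Made-up sample, not actual simulator output.
sample = """T 100 ns
0 0 3 -1
T 102 ns
7 2 3 -1
"""
out = parse_complex_samples(sample)
real = [s.real for s in out]  # values for the top (real) plot
imag = [s.imag for s in out]  # values for the bottom (imaginary) plot
print(real)
print(imag)
```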

After the simulation, the simulator displays the raw throughput at the input and output ports:

--------------------------------------------------------------------------------
| Intf Type   | Port Name                          | Type  | Throughput(MBps)  |
--------------------------------------------------------------------------------
| plio        | Phase 0                            | IN    | 4950.996958       |
|             | Phase 1                            | IN    | 4957.698816       |
|             | Phase 2                            | IN    | 4950.439288       |
|             | Phase 3                            | IN    | 4943.757030       |
|             | 64 bits out 0                      | OUT   | 4691.867125       |
|             | 64 bits out 1                      | OUT   | 4691.867125       |
|             | 64 bits out 2                      | OUT   | 4691.867125       |
|             | 64 bits out 3                      | OUT   | 4691.867125       |

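Each cint16 sample occupies 4 bytes, so the four output ports in this table aggregate to about 4691.9 Msps (4 × 4691.87 MBps / 4 bytes per sample). The timestamp-based measurement that follows reports a slightly higher aggregate of 4753.9 Msps.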

You can measure the performance of this architecture using the timestamped output. In the same directory (aiesimulator_output/data), type StreamThroughput output_*:

output_0.txt -->  1188.49 Msps
output_1.txt -->  1188.49 Msps
output_2.txt -->  1188.49 Msps
output_3.txt -->  1188.49 Msps

-----------------------


Total Throughput -->    4753.95 Msps
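The idea behind a timestamp-based measurement can be sketched as follows. This is an illustration under stated assumptions, not the StreamThroughput implementation: it assumes timestamp lines of the form "T <value> <unit>", each followed by a data line carrying two complex samples, and it divides the number of samples received after the first timestamp by the elapsed time. The function name, the unit table, and the sample text are hypothetical.

```python
UNIT_TO_SECONDS = {"ps": 1e-12, "ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def stream_throughput_msps(text, samples_per_line=2):
    """Estimate throughput in Msps from timestamped output text."""
    times = []
    n_samples = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "T":
            # Timestamp line: convert to seconds.
            times.append(float(fields[1]) * UNIT_TO_SECONDS[fields[2]])
        elif not fields[0].startswith("T"):
            # Data line (other markers such as TLAST are ignored).
            n_samples += samples_per_line
    if len(times) < 2 or times[-1] == times[0]:
        raise ValueError("need at least two distinct timestamps")
    # Samples on the first line mark the start of the interval, so they
    # are not counted as received during it.
    elapsed = times[-1] - times[0]
    return (n_samples - samples_per_line) / elapsed / 1e6


# Made-up sample: two samples every 2 ns, i.e. roughly 1000 Msps.
sample = "\n".join(f"T {t} ns\n1 0 0 1" for t in range(0, 10, 2))
rate = stream_throughput_msps(sample)
print(f"{rate:.2f} Msps")
```

Summing this estimate over the four output files would give an aggregate figure comparable to the total reported above.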

This architecture achieves close to 5 GSPS. The system spends some cycles on initialization each time the kernels are called, which keeps the measured throughput slightly below that figure. Increasing the frame length amortizes this overhead over more samples, so the throughput increases.
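The effect of frame length can be illustrated with a simple amortization model. This is a sketch under assumptions, not a measurement: it treats the per-call initialization as a fixed cost expressed in sample periods, and the peak rate, the overhead value, and the frame lengths below are hypothetical numbers chosen only to show the trend.

```python
def effective_throughput(peak_msps, frame_len, overhead_samples):
    """Throughput when each frame pays a fixed overhead.

    peak_msps        -- rate sustained while samples are processed (assumed)
    frame_len        -- samples processed per kernel call
    overhead_samples -- per-call overhead, in sample periods (assumed)
    """
    return peak_msps * frame_len / (frame_len + overhead_samples)


# Hypothetical values, for illustration only.
PEAK_MSPS = 5000.0
OVERHEAD_SAMPLES = 16
frame_lengths = [128, 256, 512, 1024]
rates = [effective_throughput(PEAK_MSPS, n, OVERHEAD_SAMPLES) for n in frame_lengths]

for n, rate in zip(frame_lengths, rates):
    print(f"frame length {n:5d}: {rate:8.1f} Msps")
```

In this model the throughput approaches the peak rate as the frame length grows but never reaches it, which matches the qualitative behavior described above.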