Navigate to the MultiKernel directory. In the Makefile, three targets are defined:

- aie: compiles the graph and the kernels
- aiesim: runs the AI Engine SystemC simulator
- aieviz: runs vitis_analyzer on the output summary
Take a look at the source code (kernel and graph) to familiarize yourself with the C++ instantiation of kernels. In graph.cpp, the PL/AI Engine connections are declared using 64-bit interfaces running at 500 MHz, allowing for maximum bandwidth on the AI Engine array AXI-Stream network.
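As an illustration, such a 64-bit, 500 MHz PL connection is declared in an ADF graph with adf::input_plio / adf::output_plio. The following is a hedged sketch, not an excerpt from graph.cpp: the port and file names are invented, and it requires the Vitis ADF headers to compile.

```cpp
#include <adf.h>

// Hypothetical PLIO declarations: 64-bit interfaces clocked at 500 MHz
// give the maximum bandwidth on the AI Engine array AXI-Stream network.
adf::input_plio  p_in  = adf::input_plio::create("PhaseIn_0",
                             adf::plio_64_bits, "data/PhaseIn_0.txt", 500);
adf::output_plio p_out = adf::output_plio::create("DataOut_0",
                             adf::plio_64_bits, "data/output_0.txt", 500);
```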
To run the simulation, input data must be generated. There are two possibilities:

- Just type make data.
- Change directory to data and type GenerateStreams. The following parameters should be set for this example:
Click Generate and then Exit. The generated files PhaseIn_0.txt … PhaseIn_3.txt should contain mainly 0’s, with a few 1’s and 2’s.
Type make all and wait for the vitis_analyzer GUI to display. The AMD Vitis™ analyzer is able to show the graph, how it has been implemented in the device, and the complete timeline of the simulation. In this specific case, the graph contains 16 kernels, each implemented on its own AI Engine.
Click Graph to visualize the graph of the application:
The 16 kernels and their eight independent input streams are clearly visible. The top graph is for the output phases 0 and 2, the phases where the cascade stream is from left to right on the physical device. The bottom graph is for phases 1 and 3 where the cascade stream is from right to left.
Click Array to visualize where the kernels have been placed, and how they are fed from the PL:
In this view, the cascade streams connecting neighboring AI Engines are key to the performance of this graph. With the four location constraints that were added, the placer had only one possible solution for the kernel placement: this square. The router then had an easy job, feeding all these kernels simply over the south-north AXI-Streams. The path back to the PL from the extremities also uses only the vertical AXI-Streams.
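Location constraints of this kind are typically expressed in the graph with adf::location. This is a hypothetical sketch, not the tutorial's actual constraints: the kernel array name and tile coordinates are invented.

```cpp
// Hypothetical constraint sketch: pin each kernel of one cascade chain to
// a specific AI Engine tile so that neighbors can use the cascade stream.
for (int i = 0; i < 4; i++)
    adf::location<adf::kernel>(k[i]) = adf::tile(18 + i, 0);
```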
Finally, click Trace to look at how the entire simulation proceeded. This may be useful to track where your AI Engine stalls if performance is not as expected:
Now the output of the filter can be displayed. The input being a set of Dirac impulses, the impulse response of the filter should be recognizable throughout the waveform. Navigate to aiesimulator_output/data and look at output_0.txt. Each line is prepended with a timestamp and contains two complex outputs. Type ProcessAIEOutput output_* to display the result.
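This timestamped format is straightforward to post-process. As a minimal sketch (the exact aiesimulator line layout is assumed here, and this is not the ProcessAIEOutput script itself), a parser for a line of the assumed form "T <time> ns <re0> <im0> <re1> <im1>" could look like:

```cpp
#include <cassert>
#include <complex>
#include <sstream>
#include <string>
#include <vector>

// Parse one assumed aiesimulator output line, e.g. "T 1234 ns 1 -2 3 4":
// skip the timestamp fields, then read the complex (cint16) samples.
std::vector<std::complex<int>> parse_line(const std::string& line) {
    std::istringstream iss(line);
    std::string tag, unit;
    long t;
    iss >> tag >> t >> unit;        // "T", time value, "ns"
    std::vector<std::complex<int>> samples;
    int re, im;
    while (iss >> re >> im)         // two complex samples per data line
        samples.emplace_back(re, im);
    return samples;
}
```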
The top graph reflects the real part of the output; the bottom graph, the imaginary part. On both, the filter impulse response is recognizable.
After the simulation completes, the simulator displays the raw throughput of the input and output ports:
--------------------------------------------------------------------------------
| Intf Type | Port Name | Type | Throughput(MBps) |
--------------------------------------------------------------------------------
| plio | Phase 0 | IN | 4950.996958 |
| | Phase 1 | IN | 4957.698816 |
| | Phase 2 | IN | 4950.439288 |
| | Phase 3 | IN | 4943.757030 |
| | 64 bits out 0 | OUT | 4691.867125 |
| | 64 bits out 1 | OUT | 4691.867125 |
| | 64 bits out 2 | OUT | 4691.867125 |
| | 64 bits out 3 | OUT | 4691.867125 |
The aggregated output port throughput (cint16 samples) is 4753.9 Msps.
The performance of this architecture can be measured using the timestamped output. In the same directory (aiesimulator_output/data), type StreamThroughput output_*:
output_0.txt --> 1188.49 Msps
output_1.txt --> 1188.49 Msps
output_2.txt --> 1188.49 Msps
output_3.txt --> 1188.49 Msps
-----------------------
Total Throughput --> 4753.95 Msps
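The per-port figures can be reproduced from the timestamped output files. A minimal sketch of the rate computation (the exact method used by StreamThroughput is assumed, not known): the number of samples delivered between the first and last timestamps, scaled to Msps.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical rate computation: nsamples delivered between the first and
// last timestamps (in ns) gives samples/ns, i.e. Gsps; x1000 yields Msps.
double msps(long first_ns, long last_ns, long nsamples) {
    return static_cast<double>(nsamples)
           / static_cast<double>(last_ns - first_ns) * 1000.0;
}
```

Summing the four per-port rates gives the total reported above.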
This architecture achieves close to 5 Gsps performance. It falls slightly short because of the cycles spent on initialization each time the kernels are called. This performance improves as the frame length is increased.