Performance validation can be challenging, but it is the most important step in verifying that you are meeting the AI Engine bandwidth. Consider the following example of a complex system design with multiple components in the AI Engines as well as the programmable logic.
Any backpressure between Kernels 1, 2, and 3 results in underperformance of Kernel 4, even if Kernel 4 itself is designed to operate optimally. This is an additional challenge that was not always present in traditional FPGA-based designs.
For the entire system to operate efficiently and correctly, verify the functionality and performance of each of the kernels described above before integrating them, using the methods described in this and the preceding sections. Only then can the integrated design achieve the required system performance.
AI Engine performance validation requires generators that can produce data at the required throughput and checkers that can sink data at the required throughput. Each stream interface between the AI Engine and programmable logic can operate at 4 GB/s, so if the performance validation metrics are not met, you can widen the bus and reduce the frequency of the interface. For example, a 64-bit/500 MHz interface between the AI Engines and PL can be changed to a 128-bit/250 MHz interface without jeopardizing the performance of the entire system, because both configurations carry the same 4 GB/s. Such decisions can be made only after performance validation of Kernels 2 and 4 independently. For more information, see the Performance Analysis of AI Engine Graph Application during Simulation in the AI Engine User Guide (UG1076).
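The width-for-frequency trade described above is simple arithmetic, and it can help to check it before committing to an interface change. The following is a minimal sketch, assuming an interface that transfers one full-width word on every clock cycle; the function name is hypothetical and is not part of any AI Engine tool.

```python
def stream_bandwidth_gbytes_per_s(width_bits, clock_mhz):
    """Peak bandwidth in GB/s, assuming one word is transferred every clock cycle."""
    bits_per_second = width_bits * clock_mhz * 1e6
    return bits_per_second / 8 / 1e9  # bits -> bytes, then bytes -> gigabytes


# The two configurations from the text: same bandwidth, half the clock rate.
narrow_fast = stream_bandwidth_gbytes_per_s(64, 500)
wide_slow = stream_bandwidth_gbytes_per_s(128, 250)
print(narrow_fast, wide_slow)  # prints "4.0 4.0"
```

Both configurations move 32 Gb/s (4 GB/s), which is why the wider, slower interface does not reduce system throughput. The figure is a peak value; any backpressure on the stream lowers the sustained rate.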
The AI Engine tools include APIs to measure throughput of each interface in the AI Engine-PL shim interface without any additional hardware. This is the easiest method to get an idea of the performance numbers between the AI Engine and PL. For more information, see the Performance Analysis of AI Engine Graph Application during Simulation in the AI Engine User Guide (UG1076).
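Whichever measurement API is used, the raw result is typically a byte count and a cycle count that still have to be converted into a throughput figure. The sketch below shows that conversion only. The function name, the sample numbers, and the 1 GHz clock are illustrative assumptions; use the actual clock frequency of your device and the counts reported by your own run.

```python
def throughput_mbytes_per_s(bytes_transferred, cycles, clock_hz):
    """Convert a byte count and an elapsed cycle count into MB/s.

    clock_hz is the frequency of the clock that the cycle counter runs on.
    """
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    bytes_per_second = bytes_transferred * clock_hz / cycles
    return bytes_per_second / 1e6


# Hypothetical measurement: 4096 bytes observed over 1024 cycles of a 1 GHz clock.
measured = throughput_mbytes_per_s(4096, 1024, 1e9)
required = 4000.0  # MB/s, the target for this interface (assumed for illustration)
print("meets target" if measured >= required else "below target")
```

Comparing the measured value against the target for each interface in isolation makes it easier to identify which kernel is the source of backpressure.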
There are two approaches to generating enough bandwidth for the AI Engine:
- Fine-tune the test harness generators/checkers, such as AXI4 DMA/S2MM-MM2S kernels, to source and sink as much bandwidth as the AI Engines need.
- Design custom RTL-based kernels that generate LFSR or BRAM/URAM based vectors at the required clock frequency in the fabric.
For high-performance designs, the latter method provides consistent bandwidth because there is no data flow between the DDR memory and programmable logic. For AI Engine-based designs with higher bandwidth requirements, it is therefore the more efficient way to create a test harness. However, it does require some additional work.
Additional hardware debug can be performed using the Vitis tools, as described in the Vitis Unified Software Platform Documentation.
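One advantage of an LFSR-based generator is that the expected data can be reproduced exactly in software, so the checker needs no stored reference vectors. The following is a minimal software reference model, assuming a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, and 11 and a seed of 0xACE1. The polynomial, seed, and function names are illustrative choices, not values taken from this guide; the model must match whatever the RTL generator actually implements.

```python
from itertools import islice


def lfsr16(seed=0xACE1):
    """Yield successive states of a 16-bit Fibonacci LFSR (x^16 + x^14 + x^13 + x^11 + 1)."""
    if not 0 < seed <= 0xFFFF:
        raise ValueError("seed must be a nonzero 16-bit value")
    state = seed
    while True:
        # XOR the tapped bits to form the feedback bit.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        # Shift right and insert the feedback bit at the top.
        state = (state >> 1) | (bit << 15)
        yield state


def first_mismatch(received, seed=0xACE1):
    """Return the index of the first word that differs from the model, or -1 if all match."""
    for index, (got, want) in enumerate(zip(received, lfsr16(seed))):
        if got != want:
            return index
    return -1


# Usage: the first eight words a generator with this seed should produce.
expected = list(islice(lfsr16(), 8))
print([hex(word) for word in expected])
print(first_mismatch(expected))  # prints -1, since the data matches the model
```

A checker built on the same model can report the position of the first corrupted word, which helps separate data integrity problems from throughput problems.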