There are several techniques to profile and improve the performance of AI Engine graphs and kernels.
You can use the Xilinx Runtime (XRT) APIs in the host application code, together with the AI Engine graph object, to measure performance metrics such as platform I/O port bandwidth, graph throughput, and graph latency. The graph object is used to initialize, run, update, and exit graphs, and the same APIs let you profile it to collect these metrics. For more information, see this link in the AI Engine Tools and Flows User Guide (UG1076).
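As an illustration, the following minimal sketch uses the event profiling API from the graph/event framework described in UG1076 to estimate graph output throughput. The graph header name project.h, the graph instance gr, its output PLIO member out, the byte count, and the 1 GHz AI Engine clock are placeholders and assumptions for this example; platform and XRT setup for a hardware run is elided.

```cpp
// Minimal sketch: measure graph output throughput with the event profiling API.
// 'project.h', 'gr', and 'gr.out' are hypothetical names for this example.
#include <cstdio>
#include "project.h"

int main()
{
    const long long out_bytes = 1024;  // assumed number of bytes produced per graph run

    gr.init();

    // Count cycles from the first output sample until 'out_bytes' have been transferred.
    event::handle h = event::start_profiling(
        gr.out, event::io_stream_start_to_bytes_transferred_cycles, out_bytes);

    gr.run(1);
    gr.wait();

    long long cycles = event::read_profiling(h);
    event::stop_profiling(h);

    // Assuming a 1 GHz AI Engine clock (1 ns per cycle): bytes per microsecond == MB/s.
    double mbps = (double)out_bytes / (cycles * 1e-3);
    std::printf("Output throughput: %.2f MB/s over %lld cycles\n", mbps, cycles);

    gr.end();
    return 0;
}
```

Other profiling options, such as total stream running-to-idle cycles, follow the same start/read/stop pattern; scale the cycle count by your device's actual AI Engine clock period when converting to throughput.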
AI Engine performance analysis typically involves system performance issues such as missing or mismatching locks, buffer overruns, and incorrect programming of direct memory access (DMA) buffers. It also includes memory/core stalls, deadlocks, and hot spot analysis. The AI Engine architecture has direct support for generation, collection, and streaming of events as trace data during simulation, hardware emulation, or hardware execution. This data can then be analyzed for functional issues and latency problems between kernels, memory stalls, deadlocks, etc. For more information, see the following:
- AI Engine Performance and Deadlock Analysis Tutorial available from the GitHub repository
- This link and this link in the AI Engine Tools and Flows User Guide (UG1076)
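As a rough illustration of how event trace might be enabled for a hardware run, the fragment below sketches an xrt.ini placed next to the host executable. The section and option names vary across XRT releases (newer releases use an AIE_trace_settings section), so treat these settings as assumptions and confirm the current names in UG1076.

```ini
; Hedged sketch of an xrt.ini for AI Engine event trace; option names are version dependent.
[Debug]
aie_trace = true      ; generate AI Engine event trace data during execution
aie_profile = true    ; collect AI Engine profile counters
```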
AI Engine APIs versus Intrinsics
The AI Engine API is a portable programming interface for AI Engine accelerators. It is implemented as a C++ header-only library that provides types and operations that are translated into efficient low-level intrinsics. AMD strongly recommends using the AI Engine API for your designs. Consider intrinsics only when the stringent performance needs of the design require capabilities that the AI Engine API does not cover. For example, the AI Engine API does not currently support the functionality provided by some intrinsics, such as fft_data_incr and cyclic_add. Similarly, while the AI Engine API supports and abstracts the main permute use cases, not all permute capabilities are covered. Using intrinsics in these cases might allow you to close the performance gap required by your design.
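For illustration, here is a minimal sketch of a kernel written with the AI Engine API instead of intrinsics. The kernel name, buffer sizes, and data type are placeholders, and the buffer-based kernel interface is assumed; the API calls are lowered to the corresponding vector intrinsics by the compiler.

```cpp
// Minimal sketch of an element-wise add kernel using the AI Engine API
// (buffer-based kernel interface assumed; names and sizes are placeholders).
#include <adf.h>
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>

void vector_add(adf::input_buffer<int32>& in_a,
                adf::input_buffer<int32>& in_b,
                adf::output_buffer<int32>& out)
{
    auto pa = aie::begin_vector<8>(in_a);
    auto pb = aie::begin_vector<8>(in_b);
    auto po = aie::begin_vector<8>(out);

    // Process 256 samples per invocation, 8 lanes at a time.
    for (unsigned i = 0; i < 256 / 8; ++i) {
        aie::vector<int32, 8> a = *pa++;
        aie::vector<int32, 8> b = *pb++;
        *po++ = aie::add(a, b);
    }
}
```

An intrinsic-based version of the same kernel would manipulate device-specific vector types directly, whereas the API form remains portable across AI Engine and AI Engine-ML devices.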
For more information on the usage of AI Engine APIs and intrinsics, see AI Engine Kernel and Graph Programming Guide (UG1079) and AI Engine-ML Kernel and Graph Programming Guide (UG1603).