Design Analysis and Programming using Intrinsics - 2025.2 English - UG1079

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2025-11-26
Version: 2025.2 English
CAUTION:
AMD strongly recommends that you use AI Engine APIs for your designs. Use intrinsics only when the design's strict performance requirements exceed what the AI Engine API supports. For example, the AI Engine API does not currently support functionality provided by some intrinsics, such as fft_data_incr and cyclic_add. AI Engine APIs support and abstract the main permute use cases, but they do not cover all permute capabilities. In these cases, intrinsics can help you close the performance gap for your design.

AI Engines provide high compute density through large numbers of VLIW and SIMD compute units, connected through innovative memory and AXI4-Stream networks. When targeting an application on the AI Engine, it is important to evaluate both the compute needs and the data throughput requirements, for example, how the AI Engine interacts with PL kernels and external DDR memory. After you confirm that the AI Engine can meet the compute and data throughput requirements, the next step is to apply divide and conquer methods to map the algorithm into the AI Engine array.
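The evaluation described above can be approximated with a back-of-the-envelope budget before any kernel is written. The following is a minimal sketch, not part of the AI Engine tools: the helper names `kernels_for_compute` and `ports_for_throughput`, the `efficiency` derating factor, and every numeric input are hypothetical. Take real per-cycle MAC rates, interface widths, and clock frequencies from the device documentation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// NOTE: illustrative helpers only. All inputs are placeholders that the
// caller must replace with figures for the actual device and algorithm.

// Lower bound on the number of kernels needed to sustain the required
// multiply-accumulate (MAC) rate.
//   macs_per_cycle : peak MACs one kernel issues per cycle for the data type
//   efficiency     : assumed fraction of peak actually achieved, in (0, 1]
inline std::uint32_t kernels_for_compute(double required_macs_per_sec,
                                         double macs_per_cycle,
                                         double clock_hz,
                                         double efficiency) {
    assert(macs_per_cycle > 0.0 && clock_hz > 0.0);
    assert(efficiency > 0.0 && efficiency <= 1.0);
    const double per_kernel = macs_per_cycle * clock_hz * efficiency;
    return static_cast<std::uint32_t>(
        std::ceil(required_macs_per_sec / per_kernel));
}

// Lower bound on the number of I/O ports (for example, streams) needed to
// sustain the required data rate into or out of the array.
inline std::uint32_t ports_for_throughput(double required_bytes_per_sec,
                                          double bytes_per_cycle_per_port,
                                          double clock_hz) {
    assert(bytes_per_cycle_per_port > 0.0 && clock_hz > 0.0);
    const double per_port = bytes_per_cycle_per_port * clock_hz;
    return static_cast<std::uint32_t>(
        std::ceil(required_bytes_per_sec / per_port));
}
```

For instance, with the made-up inputs `kernels_for_compute(9e9, 8, 1e9, 0.5)`, each kernel sustains 4e9 MAC/s, so at least three kernels are required. Both results are lower bounds; whichever is larger indicates whether compute or data movement is likely to drive the partitioning.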

For the divide and conquer step, you must understand the vector processor architecture, the memory structure, and the AXI4-Stream and cascade stream interfaces. This step is often iterative. During each iteration, you optimize individual AI Engine kernels and construct and refine the graph. The AI Engine tools let you simulate and debug the kernels and the graph. The graph is then integrated with PL kernels, GMIO, and the PS for system-level verification and performance tuning.

This chapter introduces the divide and conquer method for mapping an algorithm into data flow diagrams (DFDs). Single-kernel and multiple-kernel programming examples illustrate kernel partitioning based on compute and memory bounds, single-kernel vectorization and optimization, and stream balancing between kernels.
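As a rough illustration of stream balancing, a DFD can be treated as a chain of stages in which the slowest stage limits the throughput of the whole chain. The sketch below is a simplified model under that assumption; it ignores FIFO depths, stalls, and memory conflicts, and it is not the method used by the examples in this chapter. The `Stage` type, the helper names, and the cycle counts in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// One node of a simplified, linear data flow diagram.
struct Stage {
    std::string name;
    double cycles_per_sample;  // assumed measured or estimated cost
};

// Index of the stage with the highest cycles-per-sample cost. In a
// streaming chain this stage sets the sustained throughput.
inline std::size_t bottleneck(const std::vector<Stage>& stages) {
    assert(!stages.empty());
    std::size_t worst = 0;
    for (std::size_t i = 1; i < stages.size(); ++i) {
        if (stages[i].cycles_per_sample > stages[worst].cycles_per_sample) {
            worst = i;
        }
    }
    return worst;
}

// Number of parallel copies of a stage needed so that, together, they
// keep up with a target budget of cycles per sample.
inline unsigned copies_needed(const Stage& stage,
                              double target_cycles_per_sample) {
    assert(target_cycles_per_sample > 0.0);
    return static_cast<unsigned>(
        std::ceil(stage.cycles_per_sample / target_cycles_per_sample));
}
```

Given three hypothetical stages costing 2, 6, and 1 cycles per sample and a target of 2 cycles per sample, `bottleneck` returns index 1, and `copies_needed` suggests splitting that stage across three kernels while the other stages need only one each.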

Note: The references to input_window and output_window in this appendix are being deprecated. Use input_buffer and output_buffer instead.