Design Analysis and Programming using Intrinsics - 2024.1 English

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2024-06-05
Version: 2024.1 English
CAUTION:
It is strongly recommended that you use the AI Engine API for your designs. Consider intrinsics only when the stringent performance needs of the design require capabilities that the AI Engine API does not cover. For example, the AI Engine API does not currently support the functionality provided by some intrinsics, such as fft_data_incr and cyclic_add. Although the AI Engine API supports and abstracts the main permute use cases, it does not cover all permute capabilities. In such cases, using intrinsics may allow you to close the performance gap in your design.

AI Engines provide high compute density through a large number of VLIW and SIMD compute units that are connected to each other through memory and AXI4-Stream networks. When targeting an application to the AI Engine, it is important to first evaluate its compute needs and data throughput requirements, for example, how the AI Engine interacts with PL kernels and external DDR memory. Once the compute and data throughput requirements are shown to be achievable on the AI Engine, the next step is to use divide and conquer methods to map the algorithm onto the AI Engine array. This step requires an understanding of the vector processor architecture, the memory structure, and the AXI4-Stream and cascade stream interfaces, and it is usually iterated multiple times. At the same time, each individual AI Engine kernel is optimized, and the graph is constructed and optimized iteratively. The AI Engine tools are used to simulate and debug the AI Engine kernels and the graph. The graph is then integrated with PL kernels, GMIO, and the PS for system-level verification and performance tuning.

This chapter briefly introduces the divide and conquer method for mapping an algorithm into data flow diagrams (DFD). Single-kernel and multiple-kernel programming examples illustrate how to partition kernels according to compute and memory bounds, how to vectorize and optimize a single kernel, and how to balance streaming between kernels.
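
As a rough illustration of the compute-bound versus memory-bound check that drives kernel partitioning, the following plain C++ sketch estimates how many kernels a workload might need. It is not AI Engine code and does not come from this guide: the type names, fields, and the estimate_partition function are hypothetical, and the per-kernel capacity figures are placeholders that must be replaced with values for your device and data type.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical description of the algorithm's requirements.
struct Workload {
    double samples_per_sec;   // required sample rate
    double macs_per_sample;   // multiply-accumulates needed per sample
    double bytes_per_sample;  // bytes moved in and out per sample
};

// Hypothetical capacity of one kernel; these are assumptions, not device data.
struct KernelCapacity {
    double clock_hz;          // AI Engine clock frequency
    double macs_per_cycle;    // vector MACs per cycle for the chosen data type
    double bytes_per_cycle;   // sustained interface bandwidth of one kernel
};

struct Partition {
    int compute_kernels;      // kernels needed to satisfy the compute bound
    int io_kernels;           // kernels needed to satisfy the data throughput bound
    int kernels;              // the larger of the two, and at least one
    bool compute_bound;       // true when compute sets the kernel count
};

// First-order estimate: divide each requirement by the per-kernel capacity
// and round up. Real designs also pay for overhead that this ignores.
Partition estimate_partition(const Workload& w, const KernelCapacity& c) {
    const double macs_needed  = w.samples_per_sec * w.macs_per_sample;
    const double bytes_needed = w.samples_per_sec * w.bytes_per_sample;
    const double macs_avail   = c.clock_hz * c.macs_per_cycle;
    const double bytes_avail  = c.clock_hz * c.bytes_per_cycle;

    Partition p{};
    p.compute_kernels = static_cast<int>(std::ceil(macs_needed / macs_avail));
    p.io_kernels      = static_cast<int>(std::ceil(bytes_needed / bytes_avail));
    p.kernels         = std::max({1, p.compute_kernels, p.io_kernels});
    p.compute_bound   = p.compute_kernels >= p.io_kernels;
    return p;
}
```

For example, with the made-up inputs Workload{1e9, 30.0, 8.0} and KernelCapacity{1e9, 8.0, 16.0}, the compute requirement is 3.75 times what one kernel provides and the data requirement is 0.5 times, so the estimate is four kernels and the design is compute bound. An estimate like this is only a starting point for the iterative partitioning described above.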

Note: The references to input_window and output_window in this appendix are being deprecated. It is recommended that you use input_buffer and output_buffer instead.