Design Approach

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2025-08-25
Version: 2025.1 English

This tutorial adopts a simple design approach to building the MNIST ConvNet classifier in AIE-ML, geared towards establishing workable data flows and identifying efficient AIE API coding strategies for common convolutional and pooling layers. The focus is not on throughput performance or resource utilization, but on identifying a suite of design techniques that can be applied to more advanced designs. The design approach adopts the following concepts:

  • The bfloat16 data type is chosen for both layer I/O data and for weights & biases. This simplifies the quantization of trained network parameters; no special tools or quantization strategies are required (a conversion sketch follows this list).

  • No specific throughput target is chosen. The design is a toy example and so its performance is not of practical interest.

  • The design generally partitions each network layer onto its own AIE-ML tile where feasible. This simplifies system partitioning and gives each AIE-ML kernel a well-defined scope.

  • Memory tile zero-padding capability is leveraged to expand input tensor shapes from (28,28,1) to (28,32,1) to satisfy AI Engine memory alignment and PLIO bit width requirements (a layout illustration follows this list).

  • Memory tile multi-dimensional addressing capabilities are leveraged to deliver I/O data in the order the compute kernels consume it, so that few core cycles are spent on data shuffling or lane adjustments within the core (an access-pattern sketch follows this list).

  • Compute workloads for 2D convolutional layers leverage the efficient mac_4x8_8x4() intrinsic for bfloat16 data types, achieving a peak of 128 MAC operations per cycle where a layer's geometry permits.

  • Compute workloads fall back to the less efficient mac_elem_16_2() intrinsic for bfloat16 data types, with a peak of 64 MAC operations per cycle, in cases where mac_4x8_8x4() is not feasible (for example, the conv2d_w1() layer, which receives data from only a single input channel). Both intrinsics are sketched after this list.

  • The design combines the flatten_w6() and dense_w7() layers onto the same AIE-ML tile (a placement sketch follows this list).

  • Weights & biases are stored in local tile memory rather than in memory tiles, because local memory admits a read-only access scheme, based on asynchronous buffers, in which the weights are read only once at startup (sketched at the end of this section). Larger ML networks with millions of weights require streaming solutions based on memory tiles, but such complexity is excessive for the small MNIST problem considered here, where all weights fit easily within the array. Extending the programming model to support read-only operation in memory tiles is under development.
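As a concrete illustration of the bfloat16 choice above, the following stand-alone C++ sketch shows one common way to convert trained float32 parameters to bfloat16 by rounding to the upper 16 bits of the IEEE-754 encoding. It is not part of the tutorial sources, and the function names are illustrative only.

```cpp
// Illustrative bfloat16 quantization of trained float32 parameters: bfloat16 keeps
// the float32 sign, exponent, and top 7 mantissa bits, so conversion amounts to
// rounding and truncating to the upper 16 bits. Names are not from the tutorial.
#include <cstdint>
#include <cstring>
#include <vector>

static std::uint16_t float_to_bfloat16(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    std::uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);   // round to nearest even
    return static_cast<std::uint16_t>((bits + rounding) >> 16);
}

std::vector<std::uint16_t> quantize_parameters(const std::vector<float>& params)
{
    std::vector<std::uint16_t> out(params.size());
    for (std::size_t i = 0; i < params.size(); ++i)
        out[i] = float_to_bfloat16(params[i]);
    return out;
}
```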
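The zero-padding described above is performed in hardware by the memory tile DMA; the following stand-alone C++ sketch only illustrates the resulting layout, copying each 28-element row into a zero-initialized buffer with a 32-element row pitch. The bf16 alias is a placeholder for the bfloat16 bit pattern and is not a tutorial type.

```cpp
// Conceptual illustration of the (28,28,1) -> (28,32,1) expansion: each 28-element
// row is placed at a 32-element pitch and the remaining four columns stay zero,
// giving rows that satisfy the alignment and PLIO width requirements noted above.
#include <array>
#include <cstdint>

using bf16 = std::uint16_t;   // stand-in for a bfloat16 bit pattern

std::array<bf16, 28 * 32> pad_rows(const std::array<bf16, 28 * 28>& img)
{
    std::array<bf16, 28 * 32> out{};               // zero-initialized: columns 28..31 stay 0
    for (int r = 0; r < 28; ++r)
        for (int c = 0; c < 28; ++c)
            out[r * 32 + c] = img[r * 28 + c];     // copy the real 28x28 data
    return out;
}
```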
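For the multi-dimensional addressing point, the fragment below is a hedged sketch of how a memory-tile access pattern can be expressed in an ADF graph with adf::shared_buffer and adf::tiling_parameters. The buffer shape, the 4-row strip size, and the traversal values are illustrative assumptions rather than the tutorial's actual graph code; consult UG1079 and the tutorial sources for the exact patterns.

```cpp
// Hedged ADF graph fragment: a memory-tile buffer holding a padded 32x28 bfloat16
// image, written whole and read back in 4-row strips so the downstream kernel
// receives data in consumption order. All dimensions are illustrative assumptions.
#include <adf.h>

// In a real design this code lives in the graph class constructor; it is written
// as a helper here only to keep the sketch compact.
void configure_img_buffer(adf::shared_buffer<bfloat16>& img)
{
    img = adf::shared_buffer<bfloat16>::create({32, 28}, /*inputs*/ 1, /*outputs*/ 1);

    // Write the whole padded image as one tile.
    adf::write_access(img.in[0]) = adf::tiling_parameters{
        .buffer_dimension = {32, 28},
        .tiling_dimension = {32, 28},
        .offset           = {0, 0}
    };

    // Read it back as seven 32x4 strips (dimension 1 is the row dimension here),
    // matching the order in which the downstream kernel consumes the data.
    adf::read_access(img.out[0]) = adf::tiling_parameters{
        .buffer_dimension = {32, 28},
        .tiling_dimension = {32, 4},
        .offset           = {0, 0},
        .tile_traversal   = {{.dimension = 1, .stride = 4, .wrap = 7}}
    };
}
```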
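The two bfloat16 MAC intrinsics named in the list each take 32-lane bfloat16 vector operands and a 16-lane floating-point accumulator. The fragment below is a minimal sketch of how they are invoked, assuming activations and weights are already arranged in the tile layouts the intrinsics expect; the function names, pointer arithmetic, and loop structure are illustrative and are not the tutorial's kernel code.

```cpp
// Minimal sketch of the two bfloat16 MAC intrinsics used in the design (AIE-ML).
// Assumes operands are pre-arranged in the required layouts and ktiles >= 1;
// names and loop structure are illustrative, not the tutorial kernels.
#include <adf.h>   // AIE-ML vector types and intrinsics are available to kernel code

// 4x8 * 8x4 matrix MAC: up to 128 bfloat16 MACs per cycle.
v16accfloat matrix_mac_sketch(const bfloat16* __restrict act,
                              const bfloat16* __restrict wts,
                              int ktiles)
{
    v16accfloat acc;
    for (int k = 0; k < ktiles; ++k) {
        v32bfloat16 a = *reinterpret_cast<const v32bfloat16*>(act + 32 * k);  // 4x8 tile
        v32bfloat16 w = *reinterpret_cast<const v32bfloat16*>(wts + 32 * k);  // 8x4 tile
        acc = (k == 0) ? mul_4x8_8x4(a, w) : mac_4x8_8x4(a, w, acc);
    }
    return acc;
}

// Element-wise 16-lane, 2-column MAC: up to 64 bfloat16 MACs per cycle, used where
// the matrix form is not feasible (for example, a single input channel).
v16accfloat elementwise_mac_sketch(const bfloat16* __restrict act,
                                   const bfloat16* __restrict wts,
                                   int ktiles)
{
    v16accfloat acc;
    for (int k = 0; k < ktiles; ++k) {
        v32bfloat16 a = *reinterpret_cast<const v32bfloat16*>(act + 32 * k);
        v32bfloat16 w = *reinterpret_cast<const v32bfloat16*>(wts + 32 * k);
        acc = (k == 0) ? mul_elem_16_2(a, w) : mac_elem_16_2(a, w, acc);
    }
    return acc;
}
```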
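Layer placement, including putting the flatten and dense kernels on one tile, can be expressed with ADF location constraints. The sketch below shows one way this might look; the graph class name, source file names, and runtime ratios are illustrative assumptions, and the kernel prototypes are assumed to come from a tutorial kernel header.

```cpp
// Hedged ADF graph fragment: co-locating two kernels on the same AIE-ML tile with a
// relative location constraint, so flatten_w6() and dense_w7() share one tile while
// other layers keep a tile each. Graph/file names and ratios are illustrative.
#include <adf.h>
#include "kernels.h"   // assumed header declaring the flatten_w6 / dense_w7 prototypes

class flatten_dense_graph : public adf::graph {
public:
    adf::kernel flatten_k, dense_k;

    flatten_dense_graph() {
        flatten_k = adf::kernel::create(flatten_w6);
        dense_k   = adf::kernel::create(dense_w7);

        adf::source(flatten_k) = "flatten_w6.cc";   // illustrative source file names
        adf::source(dense_k)   = "dense_w7.cc";
        adf::runtime<adf::ratio>(flatten_k) = 0.5;  // both kernels share the core's cycles
        adf::runtime<adf::ratio>(dense_k)   = 0.5;

        // Relative location constraint: place dense_w7() on the same tile as flatten_w6().
        adf::location<adf::kernel>(dense_k) = adf::location<adf::kernel>(flatten_k);
    }
};
```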
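Finally, here is a minimal sketch of the read-once weight scheme, assuming the ADF asynchronous buffer port API (input_async_buffer with acquire() and data()): the kernel acquires the weight buffer on its first invocation and never releases it, so the weights remain resident and effectively read-only in local tile memory. The kernel name and port shapes are illustrative, not the tutorial's signatures.

```cpp
// Hedged sketch of the read-once weight scheme using an asynchronous buffer port.
// Declaring the weight port as input_async_buffer puts synchronization under kernel
// control, so the weights can be acquired once at startup and never released.
// Kernel and variable names are illustrative only.
#include <adf.h>

void conv2d_layer_sketch(adf::input_buffer<bfloat16>& act_in,
                         adf::input_async_buffer<bfloat16>& wts_in,
                         adf::output_buffer<bfloat16>& act_out)
{
    static bool weights_loaded = false;
    if (!weights_loaded) {
        wts_in.acquire();          // performed once; never released, so the buffer
        weights_loaded = true;     // stays resident for the kernel's lifetime
    }
    const bfloat16* wts = wts_in.data();

    // ... convolution compute over act_in using wts, writing results to act_out ...
    (void)act_in; (void)act_out; (void)wts;
}
```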