Design Approach

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2026-03-27
Version
2025.2 English
  • This design uses the bfloat16 data type for layer I/O data as well as for weights and biases. This simplifies the quantization of trained network parameters, because bfloat16 requires no special tools or quantization strategies.

  • This design does not set a specific throughput target.

  • The design maps each network layer to its own AIE-ML v2 tile where feasible. This simplifies system partitioning and gives each kernel a well-defined scope.

  • The design uses the memory tile pre/post zero-padding capability to expand the input tensors of the 1D convolutional layers, as required by model layers that use padding="same". The model uses kernel_size=7, which requires the samples dimension of the input to be padded with three zeros before and three zeros after, since (7 - 1) / 2 = 3.

  • The design uses the memory tile multi-dimensional addressing capabilities to deliver I/O data in the order that the compute kernels consume it, so the core spends minimal cycles on data shuffling or lane adjustments.

  • Where the shape of a layer permits, compute workloads for 1D convolutional layers use the efficient mac_4x8_8x8() intrinsic for bfloat16 data types, which reaches a peak of 256 MAC operations per cycle (4 x 8 x 8 = 256).

  • Where mac_4x8_8x8() is not feasible, compute workloads fall back to the less efficient mac_elem_64() intrinsic for bfloat16 data types, which reaches a peak of 64 MAC operations per cycle. One example is the conv1d_w1() layer, which receives data from only two input nodes.

  • The host sends weights and biases at run time as asynchronous run-time parameters (RTPs), and the design stores them in local tile memory. Larger ML networks with millions or billions of weights require streaming solutions based on memory tiles or DDR. That complexity is unnecessary for the small Radio-ML Modulation Classifier considered here, whose weights all fit easily within the array.

  • The design does not achieve a perfect functional bit-match against the Python model. The dense layers are the main contributors to the mismatch; the corresponding sections discuss the details. To obtain a closer match, build Python models that mirror the implementation, then train those models and extract updated weights and biases.
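
The zero-padding requirement for padding="same" with kernel_size=7 can be checked with a short Python sketch. This is an illustration only, not the tutorial's code: the helper names (same_padding, conv1d_valid, conv1d_same) and the 16-sample input are hypothetical, and a stride of 1 is assumed. In the actual design the memory tile inserts the zeros during the data transfer, so the kernel never performs this step.

```python
def same_padding(kernel_size):
    """Return (pre, post) zero counts for stride-1 'same' padding."""
    total = kernel_size - 1      # zeros needed to keep output length == input length
    pre = total // 2
    post = total - pre           # any odd remainder goes on the trailing side
    return pre, post


def conv1d_valid(samples, taps):
    """Unpadded ('valid') 1D correlation over a single channel."""
    k = len(taps)
    return [
        sum(samples[i + j] * taps[j] for j in range(k))
        for i in range(len(samples) - k + 1)
    ]


def conv1d_same(samples, taps):
    """Zero-pad the samples dimension, then run the unpadded convolution."""
    pre, post = same_padding(len(taps))
    padded = [0.0] * pre + list(samples) + [0.0] * post
    return conv1d_valid(padded, taps)


pre, post = same_padding(7)
x = [float(n) for n in range(16)]   # hypothetical 16-sample input
y = conv1d_same(x, [1.0] * 7)       # hypothetical all-ones kernel
print(pre, post, len(x), len(y))
```

With three zeros on each side, the output has the same number of samples as the input, which is the property that padding="same" guarantees.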
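
To illustrate why bfloat16 needs no dedicated quantization tooling, the following sketch converts a float32 value to bfloat16 by keeping its upper 16 bits. A bfloat16 value shares the sign and 8-bit exponent of float32 and keeps only 7 mantissa bits. The function names are hypothetical, and round-to-nearest-even is an assumption; the rounding mode used by the tutorial's flow is not stated here. NaN inputs are not handled.

```python
import struct


def float_to_bf16_bits(value):
    """Convert a Python float to 16 bfloat16 bits (assumed round-to-nearest-even)."""
    # Reinterpret the value as the 32 bits of an IEEE-754 float32.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Add a bias so that dropping the low 16 bits rounds to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(bits16):
    """Expand bfloat16 bits back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


weight = 0.1                                  # hypothetical trained weight
bf16 = float_to_bf16_bits(weight)
restored = bf16_bits_to_float(bf16)
print(hex(bf16), restored, abs(restored - weight))
```

The small residual printed at the end is the kind of per-value rounding error that accumulates across a layer and can contribute to a mismatch against a float32 reference model.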
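
The difference between the two peak MAC rates quoted above (256 and 64 MAC operations per cycle) can be put in perspective with a rough lower bound on compute cycles. The layer shape below is hypothetical and is not taken from the Radio-ML model. The bound is ideal: it ignores loads, stores, pipeline latency, and any padding needed to fit the intrinsic shapes.

```python
def conv1d_macs(out_samples, in_channels, out_channels, kernel_size):
    """Total multiply-accumulate operations for one 1D convolutional layer."""
    return out_samples * in_channels * out_channels * kernel_size


def ideal_cycles(macs, macs_per_cycle):
    """Lower bound on cycles at a given peak MAC rate (ceiling division)."""
    return -(-macs // macs_per_cycle)


# Hypothetical layer shape, for illustration only.
macs = conv1d_macs(out_samples=1024, in_channels=16, out_channels=16, kernel_size=7)

cycles_256 = ideal_cycles(macs, 256)   # peak rate quoted for mac_4x8_8x8()
cycles_64 = ideal_cycles(macs, 64)     # peak rate quoted for mac_elem_64()
print(macs, cycles_256, cycles_64, cycles_64 / cycles_256)
```

For any layer that can use either intrinsic, the ideal cycle count differs by a factor of four, which is why the design prefers mac_4x8_8x8() wherever a layer's shape permits.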