Design Approach - 2025.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2025-12-05
Version: 2025.2 English
  • The bfloat16 data type is chosen for both layer I/O data and for weights and biases. This simplifies quantization of the trained network parameters: no special tools or quantization strategies are required.
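
    To illustrate why bfloat16 keeps quantization trivial, here is a minimal sketch in plain Python. It models the conversion as truncation of the low 16 bits of a float32 (hardware typically uses round-to-nearest-even; truncation is assumed here for simplicity):

    ```python
    import struct

    def to_bfloat16(x: float) -> float:
        """Quantize a float32 value to bfloat16 by truncating the low 16 bits.

        bfloat16 keeps float32's sign bit and all 8 exponent bits, so the
        dynamic range is unchanged; only mantissa precision drops (23 -> 7
        bits). Trained float32 parameters can therefore be converted directly,
        with no calibration data or scale-factor search.
        """
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
        return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

    # float32 pi is 0x40490FDB; keeping only the top 16 bits gives 0x40490000
    print(to_bfloat16(3.14159265))  # -> 3.140625
    ```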

  • No specific throughput target is chosen.

  • The design maps each network layer to its own AIE-ML v2 tile where feasible. This simplifies system partitioning and gives each kernel a well-defined scope.

  • Memory tile pre/post zero-padding capability is leveraged for 1D convolutional layers to expand input tensor shapes to satisfy the padding="same" requirement of the model. The model uses kernel_size=7, which requires the input sample dimension to be padded with three zeros on each side.
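
    The padding arithmetic can be checked with a short plain-Python sketch (stride 1 and a single channel assumed; the function names are illustrative, not part of the design):

    ```python
    def same_pad(kernel_size: int) -> tuple:
        """Zero-padding on each side for padding="same" with stride 1.
        An odd kernel pads symmetrically: (k - 1) // 2 zeros per side."""
        total = kernel_size - 1
        return (total // 2, total - total // 2)

    def conv1d_same(x, w):
        """1D convolution with 'same' zero padding (single channel)."""
        lo, hi = same_pad(len(w))
        xp = [0.0] * lo + list(x) + [0.0] * hi  # memory-tile-style pre/post pad
        return [sum(xp[i + j] * w[j] for j in range(len(w)))
                for i in range(len(x))]

    print(same_pad(7))            # -> (3, 3): three zeros before and after
    y = conv1d_same([1.0] * 16, [1.0] * 7)
    print(len(y))                 # -> 16: output length matches input length
    ```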

  • Memory tile multi-dimensional addressing capabilities are leveraged to transfer I/O data efficiently for compute consumption, requiring minimal core cycles for data shuffling or lane adjustments within the core.
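
    Conceptually, the memory tile DMA walks a buffer through nested (step, wrap) dimensions, so data reordering happens during the transfer rather than in the core. A simplified model in plain Python (the descriptor tuple is illustrative only, not the actual buffer-descriptor register layout):

    ```python
    def dma_addresses(dims):
        """Generate the linear address sequence produced by nested
        (step, wrap) dimensions, innermost dimension first -- a simplified
        model of a multi-dimensional DMA buffer descriptor."""
        addrs = [0]
        for step, wrap in dims:
            # Each new dimension iterates around every address produced so far.
            addrs = [a + i * step for a in addrs for i in range(wrap)]
        return addrs

    # Read a 4x8 row-major buffer column by column: the first dimension steps
    # by 8 (down a column), the second by 1 (across to the next column).
    # The consuming core receives transposed data with zero shuffle cycles.
    print(dma_addresses([(8, 4), (1, 8)]))
    ```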

  • Compute workloads for 1D convolutional layers leverage the efficient mac_4x8_8x8() intrinsic for bfloat16 data types, achieving a peak efficiency of 256 MAC operations per cycle where a layer's dimensions permit.

  • Compute workloads fall back to the less efficient mac_elem_64() intrinsic for bfloat16 data types, with a peak efficiency of 64 MAC operations per cycle, in cases where mac_4x8_8x8() is not feasible (for example, the conv1d_w1() layer, which receives data from only two input nodes).
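
    The per-call MAC counts above follow directly from the operand shapes, as this plain-Python sketch shows. The function names mirror the intrinsics, but the models are purely illustrative counters, not emulations of the hardware:

    ```python
    def macs_mac_4x8_8x8() -> int:
        """mac_4x8_8x8 multiplies a 4x8 matrix by an 8x8 matrix: each of the
        4*8 outputs accumulates 8 products in a single cycle."""
        M, K, N = 4, 8, 8
        return M * K * N          # 256 MACs per invocation

    def macs_mac_elem_64() -> int:
        """mac_elem_64 is elementwise: 64 lanes, one product each."""
        return 64

    # With only two input nodes, a layer such as conv1d_w1() can populate
    # just 2 of the 8 rows of the K dimension, so the matrix intrinsic would
    # run mostly empty; the elementwise form wastes less of the datapath.
    print(macs_mac_4x8_8x8())  # -> 256
    print(macs_mac_elem_64())  # -> 64
    ```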

  • Weights and biases are sent from the host at run time as async RTPs and are stored in local tile memory. Larger ML networks with millions or billions of weights require streaming solutions based on memory tiles or DDR; such complexity is unnecessary for the small Radio-ML Modulation Classifier considered here, whose weights all fit easily within the array.

  • A perfect functional bit-match against the Python model is not achieved. The dense layers are the main contributors; more details are given in the corresponding sections. A closer match can be obtained by building Python models that mirror the implementation and retraining them to extract updated weights and biases.
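
    One concrete source of mismatch is that bfloat16 arithmetic is not associative, so a different accumulation order in the dense layers produces different bits. A minimal plain-Python sketch (truncation rounding assumed for determinism; hardware rounding differs in detail but exhibits the same order sensitivity):

    ```python
    import struct

    def bf16(x: float) -> float:
        """Round to bfloat16 by truncating the low 16 bits of float32."""
        b = struct.unpack("<I", struct.pack("<f", x))[0]
        return struct.unpack("<f", struct.pack("<I", b & 0xFFFF0000))[0]

    def bf16_sum(values):
        """Accumulate with rounding after every add, as limited-precision
        hardware would."""
        acc = 0.0
        for v in values:
            acc = bf16(acc + v)
        return acc

    small = 2.0 ** -8                      # below 1 ulp of 1.0 in bfloat16
    print(bf16_sum([1.0, small, small]))   # -> 1.0        (small terms lost)
    print(bf16_sum([small, small, 1.0]))   # -> 1.0078125  (they combine first)
    ```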