AI Engine Implementation - 2025.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2025-12-05
Version: 2025.2 English

One of the key parameters determining the amount of computation required to evaluate the softmax function is the number of classes. In the example presented here, 2048 classes represent the output nodes of a neural network. Because the data is in bfloat16 floating-point format, the floating-point vector unit of the AI Engine, shown in the following figure, is required. The floating-point multiply unit processes vectors with 16 lanes, so the softmax computation is designed around a SIMD factor of 16.
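With 2048 classes processed 16 lanes at a time, each computational loop runs 2048 / 16 = 128 vector iterations. A minimal sketch of that trip-count arithmetic (plain C++ standing in for the AIE vector intrinsics; the constant names are illustrative, not the tutorial's actual code):

```cpp
#include <cstddef>

constexpr std::size_t NUM_CLASSES = 2048; // softmax output size (from the text)
constexpr std::size_t SIMD_LANES  = 16;   // lanes in the floating-point vector unit

// Each iteration of a kernel loop stands in for one 16-lane vector operation.
std::size_t vector_trip_count()
{
    static_assert(NUM_CLASSES % SIMD_LANES == 0,
                  "classes must pack into whole 16-lane vectors");
    return NUM_CLASSES / SIMD_LANES; // 128 iterations per computational loop
}
```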


Figure 5 - AI Engine Floating-Point Vector Unit

From the preceding figure, you can observe that the floating-point vector processor has a pipeline depth of six. To improve compute efficiency, kernel processing should be designed to keep the pipeline full; the pipeline stalls whenever a computation must wait for an intermediate result before it can proceed. To take full advantage of software pipelining, the computation is therefore broken into independent components, with intermediate results stored in data memory. Each loop in the kernel performs a specific computation over the entire number of classes in the softmax function, 16 elements at a time. Each invocation of the kernel computes a single softmax vector comprising the values for all outputs, according to the following processing order:

  1. Read and store all input values while searching for the maximum value, then subtract the maximum value from all inputs. (two computational loops)

  2. Compute exponential function of all values. (two computational loops)

  3. Sum all exponentials and invert sum to obtain scaling factor. (single computational loop plus scalar processor inverse operation)

  4. Multiply all exponentials by scaling factor and send result to output. (single computational loop)
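The four passes above can be sketched as a scalar C++ reference model. This is only a functional sketch: the actual AIE kernel vectorizes each loop 16 lanes at a time in bfloat16 and keeps the intermediate buffers in data memory, and the function and buffer names below are illustrative assumptions, not the tutorial's kernel code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference model of the four-pass softmax described above.
void softmax_reference(const std::vector<float>& in, std::vector<float>& out)
{
    // Pass 1: find the maximum, then subtract it from every input.
    // Subtracting the maximum keeps the exponentials from overflowing.
    float max_val = in[0];
    for (std::size_t i = 1; i < in.size(); ++i)
        if (in[i] > max_val) max_val = in[i];

    std::vector<float> shifted(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        shifted[i] = in[i] - max_val;

    // Pass 2: compute the exponential of every shifted value.
    std::vector<float> exps(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        exps[i] = std::exp(shifted[i]);

    // Pass 3: sum the exponentials and invert the sum to get the
    // scaling factor (the inversion runs on the scalar processor).
    float sum = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i)
        sum += exps[i];
    const float scale = 1.0f / sum;

    // Pass 4: multiply each exponential by the scaling factor.
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = exps[i] * scale;
}
```

Storing each pass's result in a full intermediate buffer, rather than chaining the operations per element, is what lets the compiler software-pipeline each loop without waiting on cross-iteration dependencies.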