One of the key parameters impacting the amount of computation required for evaluating the softmax function is the number of classes. For the example presented here, 2048 classes are used to represent the output nodes of a neural network. Since the data is in single-precision floating-point format, the floating-point vector unit of the AI Engine, shown in the following figure, is required. The floating-point multiply unit processes vectors with eight lanes, so the softmax computation is structured around a SIMD factor of eight.
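With 2048 classes and eight lanes, each pass over the data takes 2048 / 8 = 256 vector iterations. The following is a behavioral sketch of that chunking in Python (not AIE kernel code); the array contents and the summation operation are illustrative only, and NumPy is used to stand in for the eight-lane vector unit:

```python
import numpy as np

NUM_CLASSES = 2048   # softmax output size used in this example
SIMD = 8             # lanes in the floating-point multiply unit

# Illustrative input data; a real kernel would read these from a stream
# or window interface.
x = np.arange(NUM_CLASSES, dtype=np.float32)

iterations = NUM_CLASSES // SIMD   # 256 vector iterations per loop

# One eight-lane accumulator; each loop iteration models a single
# vector operation over eight elements.
acc = np.zeros(SIMD, dtype=np.float32)
for i in range(iterations):
    acc += x[i * SIMD:(i + 1) * SIMD]

# Final cross-lane reduction to a scalar.
total = float(acc.sum())
```

Keeping each loop body to a single independent vector operation per iteration is what lets the compiler software-pipeline the loop, as discussed next.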
Figure 5 - AI Engine Floating-Point Vector Unit
From the preceding figure, you can observe that the floating-point vector processor has a pipeline depth of seven. To improve compute efficiency, kernel processing should be designed to keep the pipeline full, which is not possible when a computation must stall waiting for intermediate results. To take full advantage of software pipelining, the computation is broken into components whose intermediate results are stored in data memory. Each loop in the kernel performs one component of the computation across all classes of the softmax function, eight elements at a time. Each invocation of the kernel computes a single softmax vector comprising the values for all outputs, in the following processing order:
1. Read and store all input values while searching for the maximum value (single loop).
2. Compute the exponential function of all values (10 computational loops, including subtraction of the maximum from each input and evaluation of the correction polynomial).
3. Sum all exponentials and invert the sum to obtain the scaling factor (single loop plus a scalar-processor inverse operation).
4. Multiply all exponentials by the scaling factor and send the result to the output (single loop).
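The processing order above can be sketched as a four-pass function. This is a behavioral sketch in Python, not the vectorized AIE kernel: each line here corresponds to one or more eight-lane loops in the kernel, with the intermediate results (shifted inputs, exponentials) held in data memory between passes, and the function name is illustrative:

```python
import numpy as np

def softmax_four_pass(x: np.ndarray) -> np.ndarray:
    """Behavioral sketch of the kernel's processing order for one
    softmax vector. Subtracting the maximum before exponentiation
    keeps exp() within single-precision range."""
    x = x.astype(np.float32)
    m = x.max()                       # pass 1: read inputs, track maximum
    e = np.exp(x - m)                 # pass 2: exponentials of shifted inputs
    inv = np.float32(1.0) / e.sum()   # pass 3: sum, then scalar inverse
    return e * inv                    # pass 4: scale and send to output

y = softmax_four_pass(np.array([1.0, 2.0, 3.0], dtype=np.float32))
```

Because each pass depends only on results that the previous pass has already committed to memory, no loop iteration has to wait on an in-flight pipeline result, which keeps the seven-stage floating-point pipeline full within each loop.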