The softmax function is defined for a vector of real values \(\mathbf{z} = \left( z_1, z_2, \ldots , z_M \right)\) by the equation
\[
\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{M} e^{z_j}}, \qquad i = 1, \ldots, M,
\]
where \(z_i\) are the individual outputs of the layer. Softmax differs from other popular activation functions in that it takes the entire layer into account and scales the outputs so that they sum to 1. Each individual output can then be interpreted as a probability. In classification problems, the softmax output may therefore be interpreted as the probability that the input belongs to a particular class.
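The definition above translates directly into a few lines of NumPy. This is a minimal sketch (the function name `softmax_naive` is mine), evaluating the formula exactly as written, with no overflow protection:

```python
import numpy as np

def softmax_naive(z):
    """Direct evaluation of the softmax definition: exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z)
    return e / e.sum()

# The outputs are positive and sum to 1, so they read as class probabilities;
# the largest logit maps to the largest probability.
probs = softmax_naive(np.array([1.0, 2.0, 3.0]))
```

Because the exponential preserves ordering, the class with the largest raw output \(z_i\) always receives the largest probability.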
When computing the softmax function, there is a risk of overflow during evaluation of the individual exponential functions that comprise the formula. For single-precision floating-point numbers (and for bfloat16, which shares the same exponent range), the exponential function overflows when its input exceeds approximately 88.7. To avoid overflow, the softmax function is often evaluated using the equivalent formula
\[
\sigma(\mathbf{z})_i = \frac{e^{z_i - \alpha}}{\sum_{j=1}^{M} e^{z_j - \alpha}},
\]
where \(\alpha\) is a real-valued constant. In particular, \(\alpha\) is often chosen to be the maximum of all \(z_i\) values comprising the input vector. By subtracting the maximum value from all others, inputs to the exponential functions are constrained to the range \((-\infty, 0]\), which in turn limits the exponential function values to the range \([0, 1]\).
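The max-subtraction trick can be sketched as follows (a minimal illustration, with the name `softmax_stable` my own). Subtracting \(\alpha = \max_i z_i\) before exponentiating keeps every exponent at or below zero, so no individual `exp` call can overflow:

```python
import numpy as np

def softmax_stable(z):
    """Softmax with the maximum subtracted first, so every exponent is <= 0."""
    alpha = np.max(z)          # shift so the largest exponent becomes exactly 0
    e = np.exp(z - alpha)      # all values now lie in (0, 1]
    return e / e.sum()

# Inputs this large would overflow exp() in single precision if used directly,
# but the shifted form handles them safely.
probs = softmax_stable(np.array([1000.0, 1001.0, 1002.0]))
```

Note that the shift leaves the result unchanged: the factor \(e^{-\alpha}\) appears in both the numerator and the denominator and cancels.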
Another alternative is to evaluate the softmax function using the equivalent formula
\[
\sigma(\mathbf{z})_i = \exp\!\left( z_i - \alpha - \log \sum_{j=1}^{M} e^{z_j - \alpha} \right),
\]
which is attractive because no division is required. However, it has been shown that in practice this formula tends to produce larger computational errors [1].
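The division-free variant folds the denominator into the exponent via a log-sum-exp term. A minimal sketch (the name `softmax_logsumexp` is mine; a production code would more likely call `scipy.special.logsumexp`):

```python
import numpy as np

def softmax_logsumexp(z):
    """Division-free softmax: exp(z_i - alpha - log(sum_j exp(z_j - alpha)))."""
    alpha = np.max(z)
    lse = alpha + np.log(np.sum(np.exp(z - alpha)))  # stable log-sum-exp
    return np.exp(z - lse)   # single exponentiation per element, no division
```

Mathematically this agrees with the division-based form, since \(e^{z_i - \alpha} / \sum_j e^{z_j - \alpha} = \exp(z_i - \alpha - \log \sum_j e^{z_j - \alpha})\); the error analysis in [1] concerns how the two variants round differently in floating point.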