The first two processing loops of the kernel are shown below. Iterators are defined over work buffers in data memory that hold intermediate results. The first loop reads the softmax input values and stores them to memory while searching for the maximum input value. Once the maximum is known, a second loop reads the values back from memory, subtracts the maximum from each, and stores the results back to memory. After each loop completes, the iterators are reset to prepare them for the next loop.
// set rounding to match MATLAB generated test vectors
aie::set_rounding(aie::rounding_mode::symmetric_inf);
aie::set_saturation(aie::saturation_mode::saturate);
// work buffers in data memory
auto pWbufA = aie::begin_restrict_vector<16>(wbufA);
auto pWbufB = aie::begin_restrict_vector<16>(wbufB);
// read and store data while searching for maximum value
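// start the search at -2^127, written as a literal because '^' is bitwise XOR in C++ (-2 ^ 127 evaluates to -127)
float max_val = -1.7014118e38f;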
for (unsigned i=0; i < BUFFSZ/16; i++)
    chess_prepare_for_pipelining
    chess_loop_count(BUFFSZ/16)
{
    aie::vector<bfloat16,16> vin = v16bfloat16(readincr_v<16>(in));
    float vmax = aie::reduce_max(vin);
    if (vmax > max_val) {
        max_val = vmax;
    }
    *pWbufA++ = v16int16(vin);
}
pWbufA -= (BUFFSZ/16);
chess_separator();
// subtract maximum value from all input values
aie::accum<accfloat,16> accM;
accM.from_vector(aie::broadcast<float,16>(max_val));
for (unsigned i=0; i < BUFFSZ/16; i++)
    chess_prepare_for_pipelining
    chess_loop_count(BUFFSZ/16)
{
    aie::vector<bfloat16,16> vecA = v16bfloat16(*pWbufA++);
    aie::accum<accfloat,16> accA;
    accA.from_vector(vecA);
    *pWbufB++ = v16int16(aie::to_vector<bfloat16>(aie::sub(accA, accM)));
}
pWbufA -= (BUFFSZ/16);
pWbufB -= (BUFFSZ/16);
chess_separator();
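The two passes above can be summarized by a scalar reference model. This is a minimal sketch in standard C++, not AI Engine code: the function name `subtract_max` is hypothetical, and it uses `float` where the kernel uses 16-lane bfloat16 vectors and work buffers.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical scalar reference for the first two kernel loops (illustration only).
std::vector<float> subtract_max(const std::vector<float>& in) {
    // Pass 1: search for the maximum, starting from the lowest finite float.
    float max_val = std::numeric_limits<float>::lowest();
    for (float x : in) {
        max_val = std::max(max_val, x);
    }
    // Pass 2: subtract the maximum from every input value.
    std::vector<float> out;
    out.reserve(in.size());
    for (float x : in) {
        out.push_back(x - max_val);
    }
    return out;
}
```

Every shifted value is less than or equal to zero, which is why the exponentials computed in the next stage cannot exceed 1.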
The next segment of kernel code, shown below, consists of two computational loops that evaluate the exponential function for all inputs. The first loop scales all inputs by \(\frac{1}{\log(2)}\) and adds the exponent offset. This computation also includes scaling by a factor of \(2^7\), which aligns the result with the exponent field of the bfloat16 format. Constants used in this computation are defined in the kernel header file. The second loop takes these values and extracts the 16-bit integer part, whose bit pattern represents a bfloat16 number. Because the maximum value is subtracted from each input, the exponential function values should all lie in the range \([0, 1]\). Additional instructions check for values outside this range. Any such values are assumed to be the result of overflow and are set to zero.
/****** Start of computation of exponential functions of all input values ******/
// convert results to IEEE 754 format - use 2 loops
aie::accum<accfloat,16> b_acc;
b_acc.from_vector(aie::broadcast<float,16>(exp_B));
for (unsigned i=0; i < BUFFSZ/16; i++)
    chess_prepare_for_pipelining
    chess_loop_count(BUFFSZ/16)
{
    aie::vector<bfloat16,16> vecB = v16bfloat16(*pWbufB++);
    aie::accum<accfloat,16> aout = aie::mac(b_acc, vecB, exp_S);
    *pWbufA++ = v16int16(aie::to_vector<bfloat16>(aout));
}
pWbufA -= (BUFFSZ/16);
pWbufB -= (BUFFSZ/16);
chess_separator();
for (unsigned i=0; i < BUFFSZ/16; i++)
    chess_prepare_for_pipelining
    chess_loop_count(BUFFSZ/16)
{
    aie::vector<bfloat16,16> vecA = v16bfloat16(*pWbufA++);
    aie::vector<int16,16> exp_i = aie::to_fixed<int16>(vecA,0);
    // integer values should be in the range [0, 16256]; find outliers and set them to zero
    aie::mask<16> msk_neg = aie::lt(exp_i,int16(0));
    aie::vector<int16,16> exp_bnd = aie::select(exp_i, aie::zeros<int16,16>(), msk_neg);
    aie::mask<16> msk_pos = aie::gt(exp_bnd, int16(16256));
    exp_bnd = aie::select(exp_bnd, aie::zeros<int16,16>(), msk_pos);
    *pWbufB++ = exp_bnd;
}
pWbufA -= (BUFFSZ/16);
pWbufB -= (BUFFSZ/16);
/****** End of computation of exponential functions of all input values ******/
chess_separator();
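The exponent-field technique used by these two loops can be illustrated in scalar form. The following is a minimal sketch in standard C++, not the kernel code. The names `approx_exp` and `bf16_bits_to_float` are hypothetical, and the scale \(2^7/\log(2)\) and offset \(127 \cdot 2^7 = 16256\) are assumed values chosen to match the description and the bound checked in the listing; the actual constants `exp_S` and `exp_B` come from the kernel header file. The sketch works in single precision with round-to-nearest, so it does not model the bfloat16 intermediate rounding of the kernel.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Widen a 16-bit bfloat16 bit pattern to float: bfloat16 is the upper half of an IEEE 754 float.
float bf16_bits_to_float(std::uint16_t bits) {
    std::uint32_t word = static_cast<std::uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &word, sizeof f);
    return f;
}

// Hypothetical scalar model of the two exponential loops (illustration only).
float approx_exp(float x) {
    const float scale  = 128.0f / std::log(2.0f);  // assumed: 2^7 / ln(2)
    const float offset = 127.0f * 128.0f;          // assumed: exponent bias * 2^7 = 16256
    // Loop 1: scale the input and add the exponent offset.
    const float t = x * scale + offset;
    // Loop 2: values outside [0, 16256] are treated as overflow and mapped to zero.
    if (!(t >= 0.0f && t <= 16256.0f)) {
        return bf16_bits_to_float(0);
    }
    // The rounded integer, read as a bfloat16 bit pattern, approximates exp(x).
    return bf16_bits_to_float(static_cast<std::uint16_t>(std::lround(t)));
}
```

For example, an input of 0 maps to the integer 16256, which is the bfloat16 bit pattern of 1.0, and an input of \(-\log(2)\) maps to 16128, the bit pattern of 0.5. Between powers of two the mantissa bits interpolate linearly, so the result is an approximation rather than an exact exponential.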
With the exponential function of all inputs computed, the softmax function is evaluated by the kernel code shown below. The first loop accumulates the exponential values separately in each vector lane. The lane sums are then added together, and the scalar processor is invoked to compute a scale factor, which is the inverse of the total. The final loop multiplies all the exponential values by the scale factor and sends the results to the output.
// accumulate all vectors to determine scale factor
aie::accum<accfloat,16> accsum;
accsum.from_vector(aie::zeros<bfloat16,16>());
for (unsigned i=0; i < BUFFSZ/16; i++)
    chess_prepare_for_pipelining
    chess_loop_count(BUFFSZ/16)
{
    aie::vector<bfloat16,16> vecB = v16bfloat16(*pWbufB++);
    accsum = aie::add(accsum, vecB);
}
pWbufB -= (BUFFSZ/16);
chess_separator();
// compute inverse
bfloat16 scl_fctr = aie::inv(aie::reduce_add(aie::to_vector<bfloat16>(accsum)));
// scale values and write to output
for (unsigned i=0; i < BUFFSZ/16; i++)
    chess_prepare_for_pipelining
    chess_loop_count(BUFFSZ/16)
{
    aie::vector<bfloat16,16> vecB = v16bfloat16(*pWbufB++);
    aie::vector<int16,16> vout = v16int16(aie::to_vector<bfloat16>(aie::mul(vecB, scl_fctr)));
    writeincr(out, vout);
}
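The normalization stage above follows a common pattern: accumulate per lane, reduce across lanes once, then multiply by the reciprocal of the total. A minimal scalar sketch in standard C++ is given here for illustration. The names `sum_by_lanes`, `scale_by_inverse_sum`, and `kLanes` are hypothetical, the input length is assumed to be a multiple of 16, and the sum is assumed to be nonzero.

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t kLanes = 16;  // mirrors the 16-lane vectors in the kernel

// Accumulate each lane separately, then reduce across lanes once at the end.
// Assumes v.size() is a multiple of kLanes; any trailing elements are ignored.
float sum_by_lanes(const std::vector<float>& v) {
    std::array<float, kLanes> lane_acc{};  // zero-initialized lane accumulators
    for (std::size_t i = 0; i + kLanes <= v.size(); i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            lane_acc[lane] += v[i + lane];
        }
    }
    return std::accumulate(lane_acc.begin(), lane_acc.end(), 0.0f);
}

// Multiply every exponential value by the inverse of the total (assumes a nonzero sum).
std::vector<float> scale_by_inverse_sum(const std::vector<float>& v) {
    const float scale = 1.0f / sum_by_lanes(v);
    std::vector<float> out;
    out.reserve(v.size());
    for (float x : v) {
        out.push_back(x * scale);
    }
    return out;
}
```

Computing the reciprocal once and multiplying in the loop keeps the division out of the per-element work, which matches the structure of the kernel, where the inverse is computed a single time on the scalar processor.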