aie::linear_approx supports linear approximation by fetching data in parallel from an aie::lut and estimating the output with its compute method. The slope and offset values are stored in memory in the following way:
constexpr unsigned size = 8;
const int16 lut_ab[size*2*2] = {
  slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3,
  slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3, //note 128b duplication
  slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7,
  slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7
};
const int16 lut_cd[size*2*2] = {
  slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3,
  slope0, offset0, slope1, offset1, slope2, offset2, slope3, offset3,
  slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7,
  slope4, offset4, slope5, offset5, slope6, offset6, slope7, offset7
};
aie::lut<4, int16, int16> lookup_table(size, lut_ab, lut_cd);
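The duplicated layout above is easy to get wrong when written by hand. The following is a minimal host-side sketch in plain C++ of how such an array could be generated. The names SlopeOffset and make_duplicated_lut are hypothetical and are not part of the AI Engine API. The sketch assumes that larger tables follow the same pattern as the 8-entry example: each group of four slope/offset pairs (128 bits of int16 data) is written twice in a row.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: one LUT entry as it appears in the listing above.
struct SlopeOffset {
    std::int16_t slope;
    std::int16_t offset;
};

// Hypothetical helper, not part of the AI Engine API.
// Expands Size slope/offset pairs into Size*2*2 int16 values, writing each
// group of four pairs (8 x int16 = 128 bits) twice, as in the example layout.
template <std::size_t Size>
std::array<std::int16_t, Size * 2 * 2>
make_duplicated_lut(const std::array<SlopeOffset, Size>& pairs) {
    static_assert(Size % 4 == 0, "sketch assumes whole groups of four pairs");
    std::array<std::int16_t, Size * 2 * 2> out{};
    std::size_t w = 0;  // write position in the output array
    for (std::size_t group = 0; group < Size; group += 4) {
        for (int copy = 0; copy < 2; ++copy) {  // emit the same group twice
            for (std::size_t k = 0; k < 4; ++k) {
                out[w++] = pairs[group + k].slope;
                out[w++] = pairs[group + k].offset;
            }
        }
    }
    assert(w == out.size());
    return out;
}
```

The same generated contents can be used for both the lut_ab and lut_cd arrays, since the two listings above hold identical data.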
aie::linear_approx has the following parameters besides the aie::lut data:
- step_bits: Can be zero or larger, depending on the data types. If it is larger than zero, the lower step_bits bits of the input are used with the slope to do the estimation. The higher part is used with bias to do the indexing.
- bias: Added to the higher part of the input to do the indexing.
- shift_offset: Optional scaling factor applied to the offset.
The steps for an integer based linear approximation are:
index = (input >> step_bits) + bias
slope/offset pair read from LUT based on index
output = slope * input[step_bits-1:0] + (offset << shift_offset)
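The integer steps above can be modeled for a single sample in plain C++. This is a sketch for checking LUT contents on the host, not the AI Engine implementation. The names SlopeOffset and linear_approx_ref are hypothetical, the table is a simple array of pairs without the 128-bit duplication, and accumulator width, rounding, and saturation are not modeled. It also assumes an arithmetic right shift for negative inputs, which mainstream compilers provide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: one slope/offset entry of the table.
struct SlopeOffset {
    std::int16_t slope;
    std::int16_t offset;
};

// Hypothetical scalar reference model of the integer steps.
inline std::int32_t linear_approx_ref(std::int16_t input,
                                      const SlopeOffset* table,
                                      std::size_t size,
                                      unsigned step_bits,
                                      int bias,
                                      unsigned shift_offset) {
    // Step 1: the higher part of the input plus the bias selects the entry.
    const int index = (input >> step_bits) + bias;
    assert(index >= 0 && static_cast<std::size_t>(index) < size);

    // Step 2: read the slope/offset pair.
    const SlopeOffset entry = table[index];

    // Step 3: only the lower step_bits bits of the input are multiplied by
    // the slope. The offset is scaled by 2^shift_offset (written as a
    // multiplication to avoid left-shifting a negative value).
    const std::int32_t low = input & ((1 << step_bits) - 1);
    return entry.slope * low + entry.offset * (1 << shift_offset);
}
```

For example, to approximate y = 3x for inputs in [-16, 15] with step_bits = 3 and bias = 2, a four-entry table {3, -48}, {3, -24}, {3, 0}, {3, 24} holds the value of 3x at the start of each 8-wide segment as the offset. With that table, an input of -3 selects index 1, leaves low bits of 5, and yields 3 * 5 - 24 = -9.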
The steps for a floating point based linear approximation are:
index = (int(floor(input)) >> step_bits) + bias
slope/offset pair read from LUT based on index
output = slope * input + offset
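The floating point steps differ from the integer ones in that the slope multiplies the full input, so the offset acts as the y-intercept of each line segment. A minimal scalar sketch in plain C++ follows. The names SlopeOffsetF and linear_approx_ref_float are hypothetical and not part of the AI Engine API, and the table is a simple array of pairs without the 128-bit duplication.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper type: one floating point slope/offset entry.
struct SlopeOffsetF {
    float slope;
    float offset;
};

// Hypothetical scalar reference model of the floating point steps.
inline float linear_approx_ref_float(float input,
                                     const SlopeOffsetF* table,
                                     std::size_t size,
                                     unsigned step_bits,
                                     int bias) {
    // Step 1: index = (int(floor(input)) >> step_bits) + bias
    const int index = (static_cast<int>(std::floor(input)) >> step_bits) + bias;
    assert(index >= 0 && static_cast<std::size_t>(index) < size);

    // Steps 2 and 3: read the pair and evaluate the line on the full input.
    return table[index].slope * input + table[index].offset;
}
```

As an illustration, y = x^2 on [0, 4) with step_bits = 0 and bias = 0 can use the chord of each unit interval: entry i has slope 2i + 1 and offset -(i^2 + i), giving the table {1, 0}, {3, -2}, {5, -6}, {7, -12}. An input of 2.5 then selects index 2 and evaluates to 5 * 2.5 - 6 = 6.5, against an exact value of 6.25.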
Figure 1. Linear Approximation with aie::linear_approx
The following is an example of kernel code:
#include <adf.h>
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>
using namespace adf;

const int size=1024;
// Both tables are initialized from the same slope/offset data.
int16 lnr_lutab[size*2*2]={
#include "data/LUT_SLOPE.h"
};
int16 lnr_lutcd[size*2*2]={
#include "data/LUT_SLOPE.h"
};
__attribute__((noinline)) void linear_approx(input_buffer<int16>& __restrict index, output_buffer<int16>& __restrict out){
  const aie::lut<4, int16, int16> my_lut(size,lnr_lutab,lnr_lutcd);
  //calling linear_approx with my_lut, step_bits=3, bias=0, shift_offset=0
  aie::linear_approx<int16, aie::lut<4, int16, int16>> linear_ap(my_lut, 3, 0, 0);
  auto it=aie::begin_vector<16>(index);
  auto ot=aie::begin_vector<16>(out);
  for(int i=0;i<size/16;i++){
    aie::vector<int16,16> vin=*it++;
    //compute returns an accumulator, converted here to int16 with a shift of 0
    *ot++ = linear_ap.compute(vin).to_vector<int16>(0);
  }
}
For the supported data types and the step_bits requirements, see aie::linear_approx in the AI Engine API User Guide (UG1529).
To achieve full parallelism, the LUTs must be placed in different banks by constraining the LUTs in the graph. For details, see Global Graph-Scoped Tables.