AI Engine API User Guide (AIE) 2023.2
Overview
- Note
- Lookup table functionality is only available from AIE-ML
Two abstractions are provided to represent lookup tables on AIE architectures:
- aie::parallel_lookup which provides a direct lookup
- aie::linear_approx which provides a linear approximation for non-linear functions
The primary purpose of these abstractions is to leverage hardware support for parallel accesses on certain AIE architectures.
Both of these abstractions are built upon the aie::lut type that is used to encapsulate the raw LUT data. This encapsulation is implemented in an attempt to ensure correct data layout for a given lookup type. Specifically, to achieve a given level of access parallelism, the LUT values are required to have a specific layout in memory, which is dependent on the required number of parallel loads. For details on the memory layout requirements, see the aie::lut documentation.
Example implementations of parallel lookup and linear approximation functions are given below:
Classes | |
struct | aie::linear_approx< T, MyLUT > |
struct | aie::lut< ParallelAccesses, OffsetType, SlopeType > |
Abstraction to represent a LUT that is stored in memory, instantiated with pointer(s) to the already appropriately populated memory and the number of elements. | |
struct | aie::parallel_lookup< T, MyLUT, oor_policy > |
Class Documentation
◆ aie::linear_approx
struct aie::linear_approx |
requires (arch::is(arch::AIE_ML))
struct aie::linear_approx< T, MyLUT >
- Note
- Linear approximation functionality is only available from AIE-ML
Type to support a linear approximation via interpolation with slope/offset values stored in a lookup table.
The offset values are simply the samples of the function to be approximated. The slope values, which are the slopes of the function at the corresponding sample, are used in conjunction with the input to more accurately estimate the function value between sample points.
The logical steps of the computation for an integer based linear approximation are:
- index = (input >> step_bits) + bias
- slope/offset pair read from LUT based on index
- output = slope * (input & ((1 << step_bits) - 1)) + (offset << shift_offset)
while the steps for a floating point based approximation are:
- index = (int(floor(input)) >> step_bits) + bias
- slope/offset pair read from LUT based on index
- output = slope * input + offset
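The two step sequences above can be modelled in scalar C++, which may help when checking LUT contents off-target. This is a minimal sketch and not the AIE API implementation: `Entry`, `approx_int` and `approx_float` are hypothetical names, the table is a plain `std::vector` of slope/offset pairs, and out-of-range indices are simply rejected by an assertion.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of one slope/offset LUT entry.
template <typename SlopeT, typename OffsetT>
struct Entry {
    SlopeT  slope;
    OffsetT offset;
};

// Integer steps: the upper bits select the entry, the lower step_bits bits
// form the remainder that is multiplied by the slope.
inline std::int64_t approx_int(const std::vector<Entry<std::int32_t, std::int32_t>>& lut,
                               std::int32_t input, unsigned step_bits,
                               int bias = 0, int shift_offset = 0)
{
    assert(step_bits < 31 && shift_offset >= 0 && shift_offset < 31);
    const std::int32_t index     = (input >> step_bits) + bias;
    const std::int32_t remainder = input & ((std::int32_t{1} << step_bits) - 1);
    assert(index >= 0 && static_cast<std::size_t>(index) < lut.size());
    const auto& e = lut[static_cast<std::size_t>(index)];
    // Multiplying by a power of two stands in for (offset << shift_offset).
    return std::int64_t{e.slope} * remainder
         + std::int64_t{e.offset} * (std::int64_t{1} << shift_offset);
}

// Floating point steps: the index comes from floor(input), and the input
// itself (not a remainder) is multiplied by the slope.
inline float approx_float(const std::vector<Entry<float, float>>& lut,
                          float input, unsigned step_bits, int bias = 0)
{
    const int index = (static_cast<int>(std::floor(input)) >> step_bits) + bias;
    assert(index >= 0 && static_cast<std::size_t>(index) < lut.size());
    const auto& e = lut[static_cast<std::size_t>(index)];
    return e.slope * input + e.offset;
}
```

Because the floating point path multiplies the slope by the full input, each offset in such a table acts as the intercept of its segment, whereas the integer path multiplies by the remainder only.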
Note that for integer based linear approximations, the slope is multiplied by an integer value in the range [0, 1 << step_bits) and therefore tweaking of the LUT values or linear_approx parameters may be required to ensure that offset[i] + slope[i] * ((1 << step_bits) - 1) approximately equals offset[i+1].
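The continuity condition in the note above can be checked on the host before a table is deployed. The helper below is a sketch under assumptions: `max_segment_gap` is a hypothetical name, slopes and offsets are held in two separate plain vectors rather than in the device layout, and `shift_offset` is taken to be 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the largest absolute difference between the
// end of segment i, offset[i] + slope[i] * ((1 << step_bits) - 1), and the
// start of segment i + 1, offset[i + 1].
inline std::int64_t max_segment_gap(const std::vector<std::int32_t>& slope,
                                    const std::vector<std::int32_t>& offset,
                                    unsigned step_bits)
{
    assert(slope.size() == offset.size());
    assert(step_bits < 31);
    const std::int64_t last = (std::int64_t{1} << step_bits) - 1;
    std::int64_t worst = 0;
    for (std::size_t i = 0; i + 1 < offset.size(); ++i) {
        const std::int64_t end_of_segment = offset[i] + slope[i] * last;
        std::int64_t gap = end_of_segment - offset[i + 1];
        if (gap < 0) gap = -gap;
        if (gap > worst) worst = gap;
    }
    return worst;
}
```

A large return value suggests that the slopes, the offsets or `step_bits` need adjusting; what counts as "approximately equal" depends on the accuracy the application needs.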
The slope and offset values are expected to be placed adjacent in memory. The number of achieved lookups per cycle is determined by the aie::lut object that encapsulates the contents of the lookup table. The following example shows the memory layout of a 128b bank width lookup table with 16b values and slopes, which achieves 4 lookups per cycle:
Input | Offset | Slope | Accumulator type | Lanes | Minimum step_bits required |
---|---|---|---|---|---|
int8 | int8 | int8 | acc32 | 32 | 2 |
int16 | int16 | int16 | acc64 | 16 | 3 |
int16 | int32 | int32 | acc64 | 16 | 4 |
bfloat16 | float | bfloat16 | accfloat | 16 | 0 |
Note that while the floating point linear approx requires the offset data to be 32b floats, the slope data is required to be bfloat16. However, it is required that all values in the LUT be 32b to ensure the LUT is correctly aligned. While it is safe to use floats as the storage type for the lookup table, it is required that the low 16 mantissa bits of the floating point slope value be zero.
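One way to satisfy the requirement on the low 16 mantissa bits when slopes are stored as 32b floats is to clear those bits on the host while generating the table. The sketch below assumes IEEE-754 binary32 floats; the function names are hypothetical, and the conversion truncates rather than rounds to nearest, which may not be the best choice for every table.

```cpp
#include <cstdint>
#include <cstring>

// Clears the low 16 bits of a 32b float so that the value is exactly
// representable as bfloat16 (truncation, not round-to-nearest).
inline float truncate_to_bfloat16_precision(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32b float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    bits &= 0xFFFF0000u;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Reports whether a 32b float already has its low 16 bits cleared.
inline bool low_16_bits_are_zero(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return (bits & 0x0000FFFFu) == 0;
}
```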
- Template Parameters
-
T Type of the input vector, containing values used to index the lookup table.
MyLUT Definition of the LUT type, using the lut type.
Public Member Functions | |
linear_approx (const MyLUT &l, unsigned step_bits, int bias=0, int shift_offset=0) | |
Constructor, configures aspects of how the approximation is performed. | |
template<Vector Vec> | |
auto | compute (const Vec &input) |
Performs a linear approximation for the input values with the configured lookup table. | |
Constructor & Destructor Documentation
◆ linear_approx()
linear_approx (const MyLUT &l, unsigned step_bits, int bias=0, int shift_offset=0) | inline |
Constructor, configures aspects of how the approximation is performed.
- Parameters
-
l LUT containing the stored slope/offset value pairs used for the linear approximation. Each LUT entry holds the slope in the least significant bits and the offset in the most significant bits.
step_bits Number of low-order input bits that are not used to index the LUT. For integer input, these bits form the remainder that is multiplied by the slope value at each point. For float values, the input values are used directly in the multiplication.
bias Optional offset added to the input values used to index, for example to center on 0 by adding half the number of LUT elements.
shift_offset Optional scaling factor applied to the offset before adding it (to avoid loss of precision).
Member Function Documentation
◆ compute()
template<Vector Vec> |
auto compute (const Vec &input) | inline |
Performs a linear approximation for the input values with the configured lookup table.
An accumulator of the same number of elements as the input is returned.
Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits
- Parameters
-
input Vector of input values that are used to index the look-up table.
◆ aie::lut
struct aie::lut |
requires (arch::is(arch::AIE_ML))
struct aie::lut< ParallelAccesses, OffsetType, SlopeType >
Abstraction to represent a LUT that is stored in memory, instantiated with pointer(s) to the already appropriately populated memory and the number of elements.
The requirement on memory layout is that for degree N parallel accesses, N copies of the LUT data are required; i.e.
- For a single load without parallelism, the values are required to be stored linearly in memory.
- For 2 loads in parallel, the LUT needs to have 2 copies of the LUT values with repetition every bank width. For example with 32b values and a 128b bank width, in memory we would have the first 4 values (128b), then the same 4 again, then the next 4, which then repeat, etc.
- For 4 loads in parallel, we require the same layout as for 2 loads, but two distinct copies in this layout, placed in different memory banks.
Currently the only supported implementation on this architecture is for 4 parallel accesses.
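The repetition described for 2 parallel loads can be produced on the host from a linear table. The following is a sketch, not part of the AIE API: `interleave_two_copies` is a hypothetical helper, and `elems_per_bank` is assumed to be the bank width divided by the element size (for example 4 for 32b values with a 128b bank width). For 4 parallel accesses, two buffers with this layout would be needed, and placing them in different memory banks is left to the build flow, which this sketch does not cover.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emits each bank-width chunk of `values` twice in a row:
// chunk0, chunk0, chunk1, chunk1, ...
template <typename T>
std::vector<T> interleave_two_copies(const std::vector<T>& values,
                                     std::size_t elems_per_bank)
{
    assert(elems_per_bank > 0 && values.size() % elems_per_bank == 0);
    std::vector<T> out;
    out.reserve(values.size() * 2);
    for (std::size_t base = 0; base < values.size(); base += elems_per_bank) {
        const auto first = values.begin() + static_cast<std::ptrdiff_t>(base);
        const auto last  = first + static_cast<std::ptrdiff_t>(elems_per_bank);
        for (int copy = 0; copy < 2; ++copy) {
            out.insert(out.end(), first, last);
        }
    }
    return out;
}
```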
- Template Parameters
-
ParallelAccesses Defines how many parallel accesses will be done in a single LUT access; the possibilities depend on the hardware available for the given architecture.
OffsetType Type of values stored within the lookup table.
SlopeType Optional template parameter, only needed in certain cases of linear approximation where the offset/slope value pair uses two different types.
Public Types | |
using | lut_impl = detail::lut< ParallelAccesses, OffsetType, SlopeType > |
using | offset_type = OffsetType |
using | slope_type = SlopeType |
Public Member Functions | |
lut (unsigned LUT_elems, const void *LUT_a) | |
Constructor for singular access. | |
lut (unsigned LUT_elems, const void *LUT_ab) | |
Constructor for two parallel accesses. | |
lut (unsigned LUT_elems, const void *LUT_ab, const void *LUT_cd) | |
Constructor for 4 parallel accesses. | |
Member Typedef Documentation
◆ lut_impl
using aie::lut< ParallelAccesses, OffsetType, SlopeType >::lut_impl = detail::lut<ParallelAccesses, OffsetType, SlopeType> |
◆ offset_type
using aie::lut< ParallelAccesses, OffsetType, SlopeType >::offset_type = OffsetType |
◆ slope_type
using aie::lut< ParallelAccesses, OffsetType, SlopeType >::slope_type = SlopeType |
Constructor & Destructor Documentation
◆ lut() [1/3]
lut (unsigned LUT_elems, const void *LUT_ab, const void *LUT_cd) | inline |
Constructor for 4 parallel accesses.
Each pointer refers to an equivalent copy of the LUT in which the values are repeated twice, interleaved at bank-width granularity. In total, the same values must be present 4 times in memory to allow for the 4 parallel accesses.
For example, with a 128b bank width:
- Parameters
-
LUT_elems Number of elements in the LUT (not accounting for repetition).
LUT_ab First two copies of the data, with the values repeated and interleaved at bank width granularity.
LUT_cd Next two copies of the data, with the values repeated and interleaved at bank width granularity.
◆ lut() [2/3]
lut (unsigned LUT_elems, const void *LUT_ab) | inline |
Constructor for two parallel accesses.
For example, with a 128b bank width:
- Parameters
-
LUT_elems Number of elements in the LUT (not accounting for repetition).
LUT_ab Two copies of the data, with the values interleaved at bank width granularity.
◆ lut() [3/3]
lut (unsigned LUT_elems, const void *LUT_a) | inline |
Constructor for singular access.
For example,
- Parameters
-
LUT_elems Number of elements in the LUT.
LUT_a Pointer to the LUT values.
◆ aie::parallel_lookup
struct aie::parallel_lookup |
requires (arch::is(arch::AIE_ML))
struct aie::parallel_lookup< T, MyLUT, oor_policy >
- Note
- Parallel lookup functionality is only available from AIE-ML
Type with functionality to directly index a LUT based on input vector of values. The number of achieved lookups per cycle is determined by the aie::lut object that encapsulates the contents of the lookup table. Refer to aie::lut for more details.
Real signed and unsigned integer types (>=8b) are supported as indices. All types (>=8b) are supported as value types, including bfloat16, real, and complex types.
- Note
- 8b value type lookups require the data to be stored in the lookup tables as 16b values due to the granularity of the memory accesses.
- Template Parameters
-
T Type of the input vector, containing values used to index the lookup table.
MyLUT Definition of the LUT type, using the lut type.
oor_policy Defines the "out of range policy" for when index values on the input go beyond the size of the LUT. It can either saturate, taking on the min/max valid index, or truncate, retaining the lower bits for unsigned indices or wrapping in the interval [-bias,lut_size-bias) for signed indices. Saturating is the default behaviour, but for certain non-linear functions which repeat after an interval, truncation may be required.
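The difference between the two out of range policies can be illustrated with a scalar model. This is a sketch under assumptions and not the library's implementation: `oor_model`, `resolve_unsigned` and `resolve_signed` are hypothetical names, the LUT size is assumed to be a power of two so that "retaining the lower bits" is well defined, and `step_bits` is taken to be 0. Both functions return the position in the table that would be read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

enum class oor_model { saturate, truncate };  // hypothetical stand-in for oor_policy

// Unsigned indices: clamp to the last entry, or keep only the lower bits.
inline std::uint32_t resolve_unsigned(std::uint32_t index, std::uint32_t lut_size,
                                      oor_model policy)
{
    assert(lut_size != 0 && (lut_size & (lut_size - 1)) == 0);  // power of two assumed
    if (policy == oor_model::saturate) {
        return std::min(index, lut_size - 1);
    }
    return index & (lut_size - 1);
}

// Signed indices: valid inputs lie in [-bias, lut_size - bias). Saturation
// clamps to that interval, truncation wraps around within it.
inline std::uint32_t resolve_signed(std::int32_t value, std::int32_t lut_size,
                                    std::int32_t bias, oor_model policy)
{
    assert(lut_size > 0 && (lut_size & (lut_size - 1)) == 0);   // power of two assumed
    const std::int64_t shifted = std::int64_t{value} + bias;    // position before policy
    if (policy == oor_model::saturate) {
        const std::int64_t clamped =
            std::max<std::int64_t>(0, std::min<std::int64_t>(shifted, lut_size - 1));
        return static_cast<std::uint32_t>(clamped);
    }
    const std::int64_t wrapped = ((shifted % lut_size) + lut_size) % lut_size;
    return static_cast<std::uint32_t>(wrapped);
}
```

With an 8 entry table and a bias of 4, an input of 5 lies just past the valid interval [-4, 4): saturation reads the last entry, while truncation wraps it to -3 and reads position 1.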
Public Member Functions | |
template<typename U = T> requires (std::is_unsigned_v<T>) | |
parallel_lookup (const MyLUT &l, unsigned step_bits=0) | |
Constructor for unsigned input types, configures aspects of how the lookup is performed. | |
template<typename U = T> requires (std::is_signed_v<T>) | |
parallel_lookup (const MyLUT &l, unsigned step_bits=0, unsigned bias=0) | |
Constructor for signed input types, configures aspects of how the lookup is performed. | |
template<Vector Vec, unsigned N = Vec::size()> | |
vector< typename MyLUT::offset_type, N > | fetch (const Vec &input) |
Accesses the lookup table based on the provided input values, will return a vector of the same number of elements as the input vector. | |
template<unsigned N, Vector Vec> | |
vector< typename MyLUT::offset_type, N > | fetch (const Vec &input) |
Accesses the lookup table based on the provided input values. | |
Constructor & Destructor Documentation
◆ parallel_lookup() [1/2]
requires (std::is_signed_v<T>)
parallel_lookup (const MyLUT &l, unsigned step_bits=0, unsigned bias=0) | inline |
Constructor for signed input types, configures aspects of how the lookup is performed.
Note that usage of step_bits requires either:
- The rounding mode is set to the default, aie::rounding_mode::floor
- The lowest step_bits of the index are zero
- Parameters
-
l LUT containing the stored values used for the lookup.
step_bits Optional lower bits that will be ignored for indexing the LUT.
bias Optional offset added to the input values used to index, for example to center on 0 by adding half the number of LUT elements. This value, if supplied, must be a power of 2.
◆ parallel_lookup() [2/2]
requires (std::is_unsigned_v<T>)
parallel_lookup (const MyLUT &l, unsigned step_bits=0) | inline |
Constructor for unsigned input types, configures aspects of how the lookup is performed.
Note that usage of step_bits requires either:
- The rounding mode is set to the default, aie::rounding_mode::floor
- The lowest step_bits of the index are zero
- Parameters
-
l LUT containing the stored values used for the lookup.
step_bits Optional lower bits that will be ignored for indexing the LUT.
Member Function Documentation
◆ fetch() [1/2]
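template<Vector Vec, unsigned N = Vec::size()> |
vector< typename MyLUT::offset_type, N > fetch (const Vec &input) | inline |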
Accesses the lookup table based on the provided input values, will return a vector of the same number of elements as the input vector.
Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits
Note the step_bits are required to be zeroed if the rounding mode is set to anything other than aie::rounding_mode::floor.
- Parameters
-
input Vector of input values that are used to index the look-up table.
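The MSB to LSB interpretation of an input value can be made concrete with a small scalar decomposition. This sketch is an illustration only: `IndexFields` and `split_index` are hypothetical names, `element_bits` is assumed to be log2 of the number of LUT elements, and the input is treated as unsigned. How non-zero headroom bits affect the result depends on the out of range policy, which is not modelled here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of an input value as: headroom | LUT elements | step_bits
struct IndexFields {
    std::uint32_t headroom;  // bits above the LUT index
    std::uint32_t element;   // bits that select the LUT entry
    std::uint32_t step;      // low bits ignored for indexing
};

inline IndexFields split_index(std::uint32_t input, unsigned element_bits,
                               unsigned step_bits)
{
    assert(element_bits + step_bits < 32);
    IndexFields f;
    f.step     = input & ((std::uint32_t{1} << step_bits) - 1);
    f.element  = (input >> step_bits) & ((std::uint32_t{1} << element_bits) - 1);
    f.headroom = input >> (step_bits + element_bits);
    return f;
}
```

With `step_bits` set to 2, the inputs 20 and 23 select the same element because they differ only in the two low bits, which matches the behaviour expected under floor rounding.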
◆ fetch() [2/2]
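template<unsigned N, Vector Vec> |
vector< typename MyLUT::offset_type, N > fetch (const Vec &input) | inline |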
Accesses the lookup table based on the provided input values.
This overload allows the size of the returned vector to be specified as a template parameter. This may be required when mapping small index types to large value types, as a direct mapping may not be valid. For example, mapping int8 to cint32 on a given architecture may require input to be 16 elements. fetch(input) would therefore deduce a return type of aie::vector<cint32, 16>, which may be unsupported. However, returning aie::vector<cint32, 8> by calling fetch<8>(input) may be valid.
Input values are interpreted from MSB to LSB: headroom | LUT elements | step_bits
Note the step_bits are required to be zeroed if the rounding mode is set to anything other than aie::rounding_mode::floor.
- Template Parameters
-
N The number of elements to lookup, which may be less than the input vector size
- Parameters
-
input Vector of input values that are used to index the look-up table.