8.1.2. Data Types - 5.2 English - 57404

AOCL User Guide (57404)

Document ID: 57404
Release Date: 2025-12-29
Version: 5.2 English

The library supports multiple data types and mixed-precision flows:

Standard Precision Types:

  • float (f32): 32-bit single-precision floating-point format

  • bfloat16 (bf16): 16-bit brain floating-point format that keeps the 8-bit exponent of f32 with a 7-bit mantissa, halving memory usage relative to f32

  • float16 (f16): 16-bit half-precision floating-point format

  • int8 (s8): Signed 8-bit integer format

  • uint8 (u8): Unsigned 8-bit integer format
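
To make the relationship between f32 and bf16 concrete, the following is a minimal sketch in plain Python (standard library only). It relies on the fact that a bf16 value has the same layout as the upper 16 bits of an f32 value. The helper names (f32_to_bf16_bits, bf16_bits_to_f32) are hypothetical and are not part of the library's API; round-to-nearest-even is an assumption chosen for illustration, and NaN inputs are not handled.

```python
import struct


def f32_to_bits(x):
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits):
    """Interpret a 32-bit unsigned int as an IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def f32_to_bf16_bits(x):
    """Round x to bf16 and return the 16-bit pattern (hypothetical helper).

    Assumption: round-to-nearest-even; NaN inputs are not handled.
    """
    bits = f32_to_bits(x)
    # Adding 0x7FFF plus the lowest kept bit makes exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Widen a bf16 bit pattern back to f32 by zero-filling the low 16 bits."""
    return bits_to_f32((b & 0xFFFF) << 16)


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1.00390625):
        b = f32_to_bf16_bits(value)
        print(f"{value!r} -> 0x{b:04X} -> {bf16_bits_to_f32(b)!r}")
```

Because only 7 mantissa bits survive, a value such as 3.14159 comes back as 3.140625, while the dynamic range stays the same as f32.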

Quantized Integer Types:

  • int4 (s4): Signed 4-bit integer format for highly quantized models

  • uint4 (u4): Unsigned 4-bit integer format for highly quantized models

  • int16 (s16): Signed 16-bit integer format

  • uint16 (u16): Unsigned 16-bit integer format

  • int32 (s32): Signed 32-bit integer format

  • uint32 (u32): Unsigned 32-bit integer format
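
Since 4-bit types are narrower than a byte, two values are normally stored per byte. The sketch below shows one way to pack and unpack s4 values in plain Python. The function names and the nibble order (first value in the low nibble) are assumptions made for illustration; the library's actual storage layout may differ, so consult the Types API Reference before relying on it.

```python
def pack_s4(values):
    """Pack signed 4-bit integers two per byte (hypothetical helper).

    Assumption: the first value of each pair goes into the low nibble.
    An odd-length input is padded with a trailing zero.
    """
    values = list(values)
    for v in values:
        if not -8 <= v <= 7:
            raise ValueError(f"{v} is outside the s4 range [-8, 7]")
    padded = values + [0] * (len(values) % 2)
    out = bytearray()
    for lo, hi in zip(padded[0::2], padded[1::2]):
        # Masking with 0xF keeps the two's-complement low 4 bits.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_s4(packed, count):
    """Unpack `count` signed 4-bit integers from bytes made by pack_s4."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


if __name__ == "__main__":
    data = [-8, 7, -1]
    packed = pack_s4(data)
    print(packed.hex(), unpack_s4(packed, len(data)))
```

The u4 case is the same apart from the value range (0 to 15) and the absence of sign extension.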

Specialized Mixed-Precision Flows:

  • bf16s4f32of32: bfloat16 and int4 (s4) mixed-precision inputs with float32 accumulation and float32 output

  • u8s8s32os32: uint8 and int8 mixed-integer inputs with int32 accumulation and int32 output

  • Other quantized combinations: additional type pairings that support quantized neural network workloads
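
The arithmetic behind a flow such as u8s8s32os32 can be sketched as a reference matrix multiply in plain Python. This is an illustration only, not the library's implementation or API: the function name gemm_u8s8s32os32 is hypothetical, the trailing "os32" is read here as an int32 output (an assumption based on the flow name), and raising an error on int32 overflow is a choice made for this sketch rather than documented library behavior.

```python
S32_MIN, S32_MAX = -(2 ** 31), 2 ** 31 - 1


def gemm_u8s8s32os32(a, b):
    """Reference C = A x B for u8 x s8 inputs (hypothetical helper).

    a: M x K list of lists with values in [0, 255]    (u8)
    b: K x N list of lists with values in [-128, 127] (s8)
    Returns an M x N list of lists of s32 values.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    if any(not 0 <= v <= 255 for row in a for v in row):
        raise ValueError("A contains a value outside the u8 range")
    if any(not -128 <= v <= 127 for row in b for v in row):
        raise ValueError("B contains a value outside the s8 range")

    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # stands in for the 32-bit accumulator
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Assumption for this sketch: treat s32 overflow as an error.
            if not S32_MIN <= acc <= S32_MAX:
                raise OverflowError("accumulator exceeded the s32 range")
            c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[255, 255], [0, 1]]
    b = [[127, -128], [127, 1]]
    print(gemm_u8s8s32os32(a, b))
```

With only two terms per dot product, the first entry already reaches 64770, which is beyond the int16 maximum of 32767; this is why the products of 8-bit inputs are accumulated in a wider type.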

For detailed information about data types, see the Types API Reference and the Library Overview Wiki.