The library supports multiple data types and mixed-precision flows:
Standard Precision Types:
float (f32): 32-bit IEEE 754 single-precision floating-point
bfloat16 (bf16): 16-bit brain floating-point format; keeps the f32 exponent range with a shorter mantissa, halving memory use
float16 (f16): 16-bit IEEE 754 half-precision floating-point format
int8 (s8): Signed 8-bit integer format
uint8 (u8): Unsigned 8-bit integer format
Quantized Integer Types:
int4 (s4): Signed 4-bit integer format for highly quantized models
uint4 (u4): Unsigned 4-bit integer format for highly quantized models
int16 (s16): Signed 16-bit integer format
uint16 (u16): Unsigned 16-bit integer format
int32 (s32): Signed 32-bit integer format
uint32 (u32): Unsigned 32-bit integer format
Specialized Mixed-Precision Flows:
bf16s4f32of32: bfloat16 and int4 (s4) mixed-precision inputs, computed in float32 with float32 output
u8s8s32os32: uint8 and int8 integer inputs with int32 accumulation and int32 output
Various quantized combinations: additional input/accumulation/output pairings covering common quantized neural network inference needs
For detailed information about data types, see the Types API Reference and the Library Overview Wiki.