8. AOCL-LibMem
The default LibMem build generates an optimized shared or static library tuned for the underlying AMD “Zen” micro-architecture. LibMem also provides a tunable build option that lets users choose the instruction type or tune the threshold values. With a tunable build, users can run different configurations for a given workload and choose the options that best fit their workload and system configuration.
Build a tunable binary to use the supported tunables.
Refer to the user guide for the tunable build options.
Note
The tunable build is intended for experimentation only.
8.1. Running an Application with Tunables
LibMem built with tunables enabled exposes two tunable parameters that will help you select the implementation of your choice:
LIBMEM_OPERATION: Instruction based on alignment and cacheability
LIBMEM_THRESHOLD: The threshold for ERMS and Non-Temporal instructions
The following two states are possible with this library, depending on the tunable settings:
Default State: None of the parameters is tuned.
Tuned State: One of the parameters is tuned with a valid option.
8.1.1. Default State
In this state, none of the parameters are tuned; the library will pick up the best implementation based on the underlying AMD “Zen” micro-architecture.
Run the application by preloading the tunables-enabled libaocl-libmem.so:
$ LD_PRELOAD=<path to build/lib/libaocl-libmem.so> <executable> <params>
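The dynamic loader only warns about a preload path it cannot open and then runs the program anyway, so a mistyped path can leave the application running without LibMem. The following is a minimal sketch of a guard, assuming a POSIX shell; `run_with_libmem` is a hypothetical helper name and is not part of LibMem.

```shell
# Hypothetical wrapper (not part of LibMem): refuse to run if the
# library file does not exist, then preload it for one command only.
run_with_libmem() {
    lib=$1
    shift
    if [ ! -f "$lib" ]; then
        echo "LibMem library not found: $lib" >&2
        return 1
    fi
    # LD_PRELOAD is set only in the environment of the launched command.
    LD_PRELOAD="$lib" "$@"
}

# Usage (library path and executable are placeholders):
#   run_with_libmem build/lib/libaocl-libmem.so ./my_app arg1
```

Setting `LD_PRELOAD` as a command prefix rather than exporting it keeps the preload from leaking into other commands started from the same shell.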
8.1.2. Tuned State
In this state, one of the parameters is tuned by the application at run time, and the library chooses the implementation based on the valid tuned parameter. Only one of the tunables can be set at a time, and it must use a valid format and options as described in Application Implementations.
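Since only one tunable is meant to be set at a time, it can help to check the shell environment before launching. The sketch below is a hypothetical helper, not part of LibMem. It reports `default`, `tuned`, or `conflict`. Treating both variables being set as a conflict is an assumption made here, because this section does not say how the library resolves that case.

```shell
# Hypothetical helper (not part of LibMem): report which state the
# current shell variables would request.
libmem_state() {
    if [ -n "${LIBMEM_OPERATION:-}" ] && [ -n "${LIBMEM_THRESHOLD:-}" ]; then
        # Assumption: setting both tunables at once is treated as an error here.
        echo "conflict"
        return 1
    elif [ -n "${LIBMEM_OPERATION:-}" ] || [ -n "${LIBMEM_THRESHOLD:-}" ]; then
        echo "tuned"
    else
        echo "default"
    fi
}

# Usage:
#   libmem_state || echo "unset one of the tunables before running" >&2
```

The helper only checks whether the variables are non-empty. A variable holding an invalid value still reports `tuned`, although the library would fall back to its own choice.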
8.1.2.1. LIBMEM_OPERATION
You can set the tunable LIBMEM_OPERATION as follows:
LIBMEM_OPERATION=<operations>,<source_alignment>,<destination_alignment>
With this option, the library chooses the implementation that matches the requested combination of move instructions and alignment of the source and destination addresses.
Valid Options
<operations> = [avx2|avx512|erms]
<source_alignment> = [b|w|d|q|x|y|n]
<destination_alignment> = [b|w|d|q|x|y|n]
Use the following table to select the right implementation for your application:
Application Requirement | LIBMEM_OPERATION | Instructions | Side-effects |
---|---|---|---|
Vector unaligned source and destination | [avx2\|avx512],b,b | Load: VMOVDQU, Store: VMOVDQU | None |
Vector aligned source and destination | [avx2\|avx512],y,y | Load: VMOVDQA, Store: VMOVDQA | Unaligned source and/or destination address will lead to a crash |
Vector aligned source and unaligned destination | [avx2\|avx512],y,[b\|w\|d\|q\|x] | Load: VMOVDQA, Store: VMOVDQU | None |
Vector unaligned source and aligned destination | [avx2\|avx512],[b\|w\|d\|q\|x],y | Load: VMOVDQU, Store: VMOVDQA | None |
Vector non-temporal load and store | [avx2\|avx512],n,n | Load: VMOVNTDQA, Store: VMOVNTDQ | Unaligned source and/or destination address will lead to a crash |
Vector non-temporal load | [avx2\|avx512],n,[b\|w\|d\|q\|x\|y] | Load: VMOVNTDQA, Store: VMOVDQU | None |
Vector non-temporal store | [avx2\|avx512],[b\|w\|d\|q\|x\|y],n | Load: VMOVDQU, Store: VMOVNTDQ | None |
Rep movs unaligned source or destination | erms,b,b | REP MOVSB | None |
Rep movs word aligned source and destination | erms,w,w | REP MOVSW | Data corruption or crash if the length is not a multiple of 2 |
Rep movs double word aligned source and destination | erms,d,d | REP MOVSD | Data corruption or crash if the length is not a multiple of 4 |
Rep movs quad word aligned source and destination | erms,q,q | REP MOVSQ | Data corruption or crash if the length is not a multiple of 8 |
Note
A best-fit solution for the underlying micro-architecture will be chosen if the tunable is in an invalid format.
For example, to use only avx2-based move operations with an unaligned source address and an aligned destination address:
$ LD_PRELOAD=<build/lib/libaocl-libmem.so> LIBMEM_OPERATION=avx2,b,y <executable>
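Because an invalid value falls back to the best-fit implementation without changing how the command is run, a typo in the tunable can go unnoticed during an experiment. The following sketch checks a value against the documented `<operations>,<source_alignment>,<destination_alignment>` format before launching. `libmem_check_operation` is a hypothetical helper, not part of LibMem, and it checks the format only: it does not verify that the combination appears in the table above or that the CPU supports the chosen instruction set.

```shell
# Hypothetical helper (not part of LibMem): check that a value follows
# <operations>,<source_alignment>,<destination_alignment>.
libmem_check_operation() {
    case "$1" in
        avx2,[bwdqxyn],[bwdqxyn] | avx512,[bwdqxyn],[bwdqxyn] | erms,[bwdqxyn],[bwdqxyn])
            echo valid ;;
        *)
            echo invalid
            return 1 ;;
    esac
}

# Usage:
#   libmem_check_operation "avx2,b,y" >/dev/null || echo "check LIBMEM_OPERATION" >&2
```

The `case` patterns must match the whole string, so values with extra fields, missing fields, or embedded spaces are reported as invalid.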
8.1.2.2. LIBMEM_THRESHOLD
You can set the tunable LIBMEM_THRESHOLD as follows:
LIBMEM_THRESHOLD=<repmov_start_threshold>,<repmov_stop_threshold>,<nt_start_threshold>,<nt_stop_threshold>
Based on this option, the library will choose the implementation with tuned threshold settings for supported instruction sets: {vector, rep mov, non-temporal}.
Valid Options
<repmov_start_threshold> = [0, +ve integers]
<repmov_stop_threshold> = [0, +ve integers, -1]
<nt_start_threshold> = [0, +ve integers]
<nt_stop_threshold> = [0, +ve integers, -1]
Here, -1 refers to the maximum length.
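The valid options above can be checked mechanically before a run. The sketch below is a hypothetical helper, not part of LibMem. It accepts exactly four comma-separated fields, requires non-negative integers, and allows `-1` only in the two stop positions. It does not check that each start threshold is below its stop threshold, since this section does not state that requirement.

```shell
# Hypothetical helper (not part of LibMem): validate the format
# <repmov_start>,<repmov_stop>,<nt_start>,<nt_stop>.
# The body runs in a subshell so the IFS and globbing changes stay local.
libmem_check_threshold() (
    # A trailing comma would otherwise be dropped silently by field splitting.
    case "$1" in
        *,) echo invalid; exit 1 ;;
    esac
    IFS=,
    set -f              # disable globbing while splitting
    set -- $1           # split the value on commas
    if [ "$#" -ne 4 ]; then
        echo invalid
        exit 1
    fi
    i=0
    for field in "$@"; do
        i=$((i + 1))
        case "$field" in
            -1)
                # -1 (maximum length) is listed only for the stop thresholds.
                if [ "$i" -ne 2 ] && [ "$i" -ne 4 ]; then
                    echo invalid
                    exit 1
                fi ;;
            '' | *[!0-9]*)
                echo invalid
                exit 1 ;;
        esac
    done
    echo valid
)

# Usage:
#   libmem_check_threshold "0,2048,1048576,-1" >/dev/null || echo "check LIBMEM_THRESHOLD" >&2
```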
Refer to the following table for sample threshold settings:
LIBMEM_THRESHOLD | Vector Range | RepMov Range | Non-Temporal Range |
---|---|---|---|
0,2048,1048576,-1 | (2049, 1048576) | [0,2048] | [1048576, max value of unsigned long long) |
0,0,1048576,-1 | [0,1048576) | [0,0] | [1048576, max value of unsigned long long) |
Note
A system-configured threshold will be chosen if the tunable is in an invalid format.
For example, to use REP MOVS instructions for lengths from 1 KB to 2 KB and non-temporal instructions for lengths of 512 KB and above:
$ LD_PRELOAD=<build/lib/libaocl-libmem.so> LIBMEM_THRESHOLD=1024,2048,524288,-1 <executable>
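The values in this example are byte counts: 1 KB is 1024 bytes, 2 KB is 2048 bytes, and 512 KB is 512 × 1024 = 524288 bytes. As a small sketch of that arithmetic, the hypothetical helpers below (not part of LibMem) build a `LIBMEM_THRESHOLD` value from sizes given in KB and pass `-1` through unchanged.

```shell
# Hypothetical helpers (not part of LibMem).
# Convert a size in KB to bytes; -1 (maximum length) is passed through.
kb_to_bytes() {
    if [ "$1" = "-1" ]; then
        echo -1
    else
        echo $(( $1 * 1024 ))
    fi
}

# Build <repmov_start>,<repmov_stop>,<nt_start>,<nt_stop> from KB values.
libmem_threshold_kb() {
    echo "$(kb_to_bytes "$1"),$(kb_to_bytes "$2"),$(kb_to_bytes "$3"),$(kb_to_bytes "$4")"
}

libmem_threshold_kb 1 2 512 -1   # prints 1024,2048,524288,-1
```

The printed string can then be assigned to `LIBMEM_THRESHOLD` on the command line in the same way as the example above.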