Depending on the tunable settings, two states are possible:
Default State: None of the parameters is tuned.
Tuned State: One of the parameters is tuned with a valid option.
Default State
In this state, none of the parameters is tuned; the library selects the best implementation for the underlying AMD “Zen” micro-architecture.
Run the application by preloading the tunables-enabled libaocl-libmem.so:
$ LD_PRELOAD=<path to build/lib/libaocl-libmem.so> <executable> <params>
Tuned State
In this state, the application tunes one of the parameters at run time, and the library chooses the implementation based on that valid tuned parameter. Only one of the tunables can be set at a time, using a valid format and options as described in the Application Implementations table.
LIBMEM_OPERATION
You can set the tunable LIBMEM_OPERATION as follows:
LIBMEM_OPERATION=<operations>,<source_alignment>,<destination_alignment>
With this option, the library chooses the best implementation based on the combination of move instructions and the alignment of the source and destination addresses.
Valid Options
<operations> = [avx2|avx512|erms]
<source_alignment> = [b|w|d|q|x|y|n]
<destination_alignment> = [b|w|d|q|x|y|n]
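The format rules above can be checked before launching an application. The following is a minimal sketch, not the library's own parser: the function name `parse_libmem_operation` is hypothetical, and the strict handling of case and whitespace is an assumption. It checks syntax only, so a syntactically valid combination may still be absent from the table that follows.

```python
# Illustrative check of the documented LIBMEM_OPERATION format.
# Assumption: values are lowercase with no surrounding whitespace.
VALID_OPERATIONS = {"avx2", "avx512", "erms"}
VALID_ALIGNMENTS = set("bwdqxyn")


def parse_libmem_operation(value):
    """Return (operation, source_alignment, destination_alignment),
    or None if the value does not match the documented format."""
    fields = value.split(",")
    if len(fields) != 3:
        return None
    operation, src, dst = fields
    if operation not in VALID_OPERATIONS:
        return None
    if src not in VALID_ALIGNMENTS or dst not in VALID_ALIGNMENTS:
        return None
    return operation, src, dst


print(parse_libmem_operation("avx2,b,y"))  # ('avx2', 'b', 'y')
print(parse_libmem_operation("sse2,b,y"))  # None
```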
Use the following table to select the right implementation for your application:
| Application Requirement | LIBMEM_OPERATION | Instructions | Side-effects |
|---|---|---|---|
| Vector unaligned source and destination | [avx2\|avx512],b,b | Load: VMOVDQU; Store: VMOVDQU | None |
| Vector aligned source and destination | [avx2\|avx512],y,y | Load: VMOVDQA; Store: VMOVDQA | Unaligned source and/or destination address will lead to crash |
| Vector aligned source and unaligned destination | [avx2\|avx512],y,[b\|w\|d\|q\|x] | Load: VMOVDQA; Store: VMOVDQU | None |
| Vector unaligned source and aligned destination | [avx2\|avx512],[b\|w\|d\|q\|x],y | Load: VMOVDQU; Store: VMOVDQA | None |
| Vector non temporal load and store | [avx2\|avx512],n,n | Load: VMOVNTDQA; Store: VMOVNTDQ | Unaligned source and/or destination address will lead to crash |
| Vector non temporal load | [avx2\|avx512],n,[b\|w\|d\|q\|x\|y] | Load: VMOVNTDQA; Store: VMOVDQU | None |
| Vector non temporal store | [avx2\|avx512],[b\|w\|d\|q\|x\|y],n | Load: VMOVDQU; Store: VMOVNTDQ | None |
| Rep movs unaligned source or destination | erms,b,b | REP MOVSB | None |
| Rep movs word aligned source and destination | erms,w,w | REP MOVSW | Data corruption or crash if the length is not a multiple of 2 |
| Rep movs double word aligned source and destination | erms,d,d | REP MOVSD | Data corruption or crash if the length is not a multiple of 4 |
| Rep movs quad word aligned source and destination | erms,q,q | REP MOVSQ | Data corruption or crash if the length is not a multiple of 8 |
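To make the table easier to consult programmatically, the sketch below transcribes its rows into a small lookup. It is only a restatement of the table, not code from the library: the names `ROWS` and `lookup` are hypothetical, and combinations that the table does not list return `None` rather than guessing what the library would do.

```python
# Transcription of the documented table; a side effect of None means "None".
VECTOR = {"avx2", "avx512"}
ERMS = {"erms"}
UNALIGNED = set("bwdqx")
CRASH = "Unaligned source and/or destination address will lead to crash"

# (operations, source alignments, destination alignments, instructions, side effect)
ROWS = [
    (VECTOR, {"b"}, {"b"}, "Load: VMOVDQU, Store: VMOVDQU", None),
    (VECTOR, {"y"}, {"y"}, "Load: VMOVDQA, Store: VMOVDQA", CRASH),
    (VECTOR, {"y"}, UNALIGNED, "Load: VMOVDQA, Store: VMOVDQU", None),
    (VECTOR, UNALIGNED, {"y"}, "Load: VMOVDQU, Store: VMOVDQA", None),
    (VECTOR, {"n"}, {"n"}, "Load: VMOVNTDQA, Store: VMOVNTDQ", CRASH),
    (VECTOR, {"n"}, UNALIGNED | {"y"}, "Load: VMOVNTDQA, Store: VMOVDQU", None),
    (VECTOR, UNALIGNED | {"y"}, {"n"}, "Load: VMOVDQU, Store: VMOVNTDQ", None),
    (ERMS, {"b"}, {"b"}, "REP MOVSB", None),
    (ERMS, {"w"}, {"w"}, "REP MOVSW",
     "Data corruption or crash if the length is not a multiple of 2"),
    (ERMS, {"d"}, {"d"}, "REP MOVSD",
     "Data corruption or crash if the length is not a multiple of 4"),
    (ERMS, {"q"}, {"q"}, "REP MOVSQ",
     "Data corruption or crash if the length is not a multiple of 8"),
]


def lookup(operation, src, dst):
    """Return (instructions, side_effect) for a listed combination, else None."""
    for operations, sources, destinations, instructions, side_effect in ROWS:
        if operation in operations and src in sources and dst in destinations:
            return instructions, side_effect
    return None


print(lookup("avx2", "b", "y"))  # ('Load: VMOVDQU, Store: VMOVDQA', None)
```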
Note
A best-fit solution for the underlying micro-architecture will be chosen if the tunable is in an invalid format.
For example, to use only avx2-based move operations with an unaligned source address and an aligned destination address:
$ LD_PRELOAD=<build/lib/libaocl-libmem.so> LIBMEM_OPERATION=avx2,b,y <executable>
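When the application is launched from a script rather than a shell, the same environment can be assembled in code. The sketch below is one possible way to do that with Python's standard `subprocess` module. The helper names `libmem_env` and `run_with_libmem` are hypothetical, the library path is the same placeholder used in the command above, and the check that rejects two tunables at once reflects the rule that only one tunable can be set (LIBMEM_THRESHOLD being the other tunable covered in this document).

```python
import os
import subprocess


def libmem_env(lib_path, operation=None, threshold=None, base_env=None):
    """Build an environment that preloads the library with at most one tunable."""
    if operation is not None and threshold is not None:
        raise ValueError("set only one of LIBMEM_OPERATION and LIBMEM_THRESHOLD")
    env = dict(os.environ if base_env is None else base_env)
    # Note: this replaces any LD_PRELOAD value already present.
    env["LD_PRELOAD"] = lib_path
    # Drop tunables inherited from the parent so only the requested one is set.
    env.pop("LIBMEM_OPERATION", None)
    env.pop("LIBMEM_THRESHOLD", None)
    if operation is not None:
        env["LIBMEM_OPERATION"] = operation
    if threshold is not None:
        env["LIBMEM_THRESHOLD"] = threshold
    return env


def run_with_libmem(command, lib_path, **tunables):
    """Run `command` (a list of arguments) with the library preloaded."""
    return subprocess.run(command, env=libmem_env(lib_path, **tunables), check=True)


if __name__ == "__main__":
    # Placeholder path; substitute the location of your own build.
    print(libmem_env("build/lib/libaocl-libmem.so", operation="avx2,b,y", base_env={}))
```

Passing neither `operation` nor `threshold` corresponds to the Default State, in which the library is preloaded without any tunable.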
LIBMEM_THRESHOLD
You can set the tunable LIBMEM_THRESHOLD as follows:
LIBMEM_THRESHOLD=<repmov_start_threshold>,<repmov_stop_threshold>,<nt_start_threshold>,<nt_stop_threshold>
With this option, the library will choose the implementation with tuned threshold settings for supported instruction sets: {vector, rep mov, non-temporal}.
Valid Options
<repmov_start_threshold> = [0, +ve integers]
<repmov_stop_threshold> = [0, +ve integers, -1]
<nt_start_threshold> = [0, +ve integers]
<nt_stop_threshold> = [0, +ve integers, -1]
Here, -1 denotes the maximum length.
Refer to the following table for sample threshold settings:
| LIBMEM_THRESHOLD | Vector Range | RepMov Range | Non-Temporal Range |
|---|---|---|---|
| 0,2048,1048576,-1 | (2049, 1048576) | [0,2048] | [1048576, max value of unsigned long long) |
| 0,0,1048576,-1 | [0,1048576) | [0,0] | [1048576, max value of unsigned long long) |
Note
A system-configured threshold will be chosen if the tunable is in an invalid format.
For example, to use REP MOVS instructions for lengths from 1 KB to 2 KB and non-temporal instructions for lengths of 512 KB and above:
$ LD_PRELOAD=<build/lib/libaocl-libmem.so> LIBMEM_THRESHOLD=1024,2048,524288,-1 <executable>
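To see how a threshold string partitions copy lengths, the sketch below parses the four fields and reports which instruction family a given length would fall into. It is an illustration only, not the library's selection logic. The function names are hypothetical, and two details are assumptions inferred from the sample table: start and stop bounds are treated as inclusive, and the rep-movs range is checked before the non-temporal range if the two overlap.

```python
def parse_libmem_threshold(value):
    """Return the four thresholds as integers, or None if the format is invalid."""
    fields = value.split(",")
    if len(fields) != 4:
        return None
    try:
        rep_start, rep_stop, nt_start, nt_stop = (int(field) for field in fields)
    except ValueError:
        return None
    # Start thresholds: 0 or positive. Stop thresholds may also be -1.
    if rep_start < 0 or nt_start < 0 or rep_stop < -1 or nt_stop < -1:
        return None
    return rep_start, rep_stop, nt_start, nt_stop


def classify(length, thresholds):
    """Name the instruction family for `length` (assumed inclusive bounds)."""
    rep_start, rep_stop, nt_start, nt_stop = thresholds
    # -1 refers to the maximum length, so it is treated as "no upper bound".
    if rep_start <= length and (rep_stop == -1 or length <= rep_stop):
        return "rep_movs"
    if nt_start <= length and (nt_stop == -1 or length <= nt_stop):
        return "non_temporal"
    return "vector"


thresholds = parse_libmem_threshold("1024,2048,524288,-1")
for length in (100, 1500, 4096, 600000):
    print(length, classify(length, thresholds))
```

With the setting from the command above, lengths of 1024 to 2048 bytes map to rep movs, lengths of 524288 bytes and above map to non-temporal, and everything else falls back to vector instructions under these assumptions.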