AOCL-LibMem - 5.0 English

AOCL Performance Tuning Guide (63859)

Document ID: 63859
Release Date: 2024-10-10
Version: 5.0 English

8. AOCL-LibMem

The default LibMem build generates an optimized shared or static library tuned for the underlying AMD “Zen” micro-architecture. LibMem also provides a tunable build option that lets users choose the instructions or tune the threshold values. With the tunable build, users can run different configurations for a given workload and choose the options that best fit their workload and system configuration.

To use the supported tunables, you must build a tunable binary.

For the tunable build options, refer to the user guide.

Note

The tunable build is intended for experimentation purposes only.

8.1. Running an Application with Tunables

LibMem built with tunables enabled exposes two tunable parameters that will help you select the implementation of your choice:

  • LIBMEM_OPERATION: Selects the instructions to use, based on alignment and cacheability

  • LIBMEM_THRESHOLD: Sets the thresholds for ERMS and non-temporal instructions

Depending on the tunable settings, the library is in one of the following two states:

  • Default State: None of the parameters is tuned.

  • Tuned State: One of the parameters is tuned with a valid option.

8.1.1. Default State

In this state, none of the parameters is tuned; the library picks the best implementation for the underlying AMD “Zen” micro-architecture.

Run the application by preloading the tunables-enabled libaocl-libmem.so:

$ LD_PRELOAD=<path to build/lib/libaocl-libmem.so> <executable> <params>
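The preload step above can be wrapped in a small helper. The following is a minimal sketch, assuming a POSIX shell; the function name run_libmem_default is hypothetical and not part of LibMem. It refuses to run when the library file is not readable, and it unsets both tunables in a subshell so that values inherited from the environment do not move the library out of the default state.

```shell
# Hypothetical helper (not part of LibMem): run a command in the default state.
# $1 is the path to the tunables-enabled libaocl-libmem.so; the rest is the command.
run_libmem_default() {
    lib=$1
    shift
    if [ ! -r "$lib" ]; then
        echo "library not readable: $lib" >&2
        return 1
    fi
    (
        # Clear both tunables so that neither parameter is tuned.
        unset LIBMEM_OPERATION LIBMEM_THRESHOLD
        LD_PRELOAD="$lib" "$@"
    )
}

# Usage (the path and executable are placeholders):
#   run_libmem_default /path/to/build/lib/libaocl-libmem.so ./my_app arg1
```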

8.1.2. Tuned State

In this state, one of the parameters is tuned at run time, and the library chooses the implementation based on the valid tuned parameter. Only one of the tunables can be set, and its value must follow a valid format and use valid options, as described in Table 8.1 (Application Implementations).

8.1.2.1. LIBMEM_OPERATION

You can set the tunable LIBMEM_OPERATION as follows:

LIBMEM_OPERATION=<operations>,<source_alignment>,<destination_alignment>

Based on this option, the library chooses the implementation that matches the combination of move instructions and the alignment of the source and destination addresses.

Valid Options

  • <operations> = [avx2|avx512|erms]

  • <source_alignment> = [b|w|d|q|x|y|n]

  • <destination_alignment> = [b|w|d|q|x|y|n]
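The option lists above can be checked mechanically before launching an application. The following is a minimal sketch, assuming a POSIX shell; the function name libmem_operation_is_valid is hypothetical and not part of LibMem. It checks only the three-field format and the documented option values, not whether a particular combination is safe for a given workload.

```shell
# Hypothetical helper (not part of LibMem): succeeds if $1 has the form
# <operations>,<source_alignment>,<destination_alignment> with documented values.
libmem_operation_is_valid() {
    case $1 in
        avx2,[bwdqxyn],[bwdqxyn]) return 0 ;;
        avx512,[bwdqxyn],[bwdqxyn]) return 0 ;;
        erms,[bwdqxyn],[bwdqxyn]) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: report whether a candidate setting is well formed.
for candidate in avx2,b,y sse2,b,b; do
    if libmem_operation_is_valid "$candidate"; then
        echo "$candidate: valid format"
    else
        echo "$candidate: invalid format"
    fi
done
```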

Use the following table to select the right implementation for your application:

Table 8.1 Application Implementations

Application Requirement                              LIBMEM_OPERATION               Instructions                      Side-effects
---------------------------------------------------  -----------------------------  --------------------------------  --------------------------------------------------------------
Vector unaligned source and destination              [avx2|avx512],b,b              Load: VMOVDQU, Store: VMOVDQU     None
Vector aligned source and destination                [avx2|avx512],y,y              Load: VMOVDQA, Store: VMOVDQA     Unaligned source and/or destination address will lead to crash
Vector aligned source and unaligned destination      [avx2|avx512],y,[b|w|d|q|x]    Load: VMOVDQA, Store: VMOVDQU     None
Vector unaligned source and aligned destination      [avx2|avx512],[b|w|d|q|x],y    Load: VMOVDQU, Store: VMOVDQA     None
Vector non temporal load and store                   [avx2|avx512],n,n              Load: VMOVNTDQA, Store: VMOVNTDQ  Unaligned source and/or destination address will lead to crash
Vector non temporal load                             [avx2|avx512],n,[b|w|d|q|x|y]  Load: VMOVNTDQA, Store: VMOVDQU   None
Vector non temporal store                            [avx2|avx512],[b|w|d|q|x|y],n  Load: VMOVDQU, Store: VMOVNTDQ    None
Rep movs unaligned source or destination             erms,b,b                       REP MOVSB                         None
Rep movs word aligned source and destination         erms,w,w                       REP MOVSW                         Data corruption or crash if the length is not a multiple of 2
Rep movs double word aligned source and destination  erms,d,d                       REP MOVSD                         Data corruption or crash if the length is not a multiple of 4
Rep movs quad word aligned source and destination    erms,q,q                       REP MOVSQ                         Data corruption or crash if the length is not a multiple of 8

Note

A best-fit solution for the underlying micro-architecture will be chosen if the tunable is in an invalid format.

For example, to use only avx2-based move operations with an unaligned source address and an aligned destination address:

$ LD_PRELOAD=<path to build/lib/libaocl-libmem.so> LIBMEM_OPERATION=avx2,b,y <executable>
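To compare several settings for one workload, the command lines can be generated in a loop. The following is a minimal dry-run sketch, assuming a POSIX shell; the function name libmem_sweep_commands, the library path, and the executable are placeholders. It only prints the commands, and it covers just the three Table 8.1 rows that accept unaligned source and destination addresses and list no side effects.

```shell
# Hypothetical helper (not part of LibMem): print one command line per setting.
# $1 is the library path; the remaining arguments are the command to run.
libmem_sweep_commands() {
    lib=$1
    shift
    # Settings documented for unaligned addresses with side effects "None".
    for op in avx2,b,b avx512,b,b erms,b,b; do
        echo "LD_PRELOAD=$lib LIBMEM_OPERATION=$op $*"
    done
}

# Example with placeholder paths; nothing is executed, only printed.
libmem_sweep_commands /path/to/build/lib/libaocl-libmem.so ./my_app
```

Each printed line can then be run and timed with the tool of your choice.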

8.1.2.2. LIBMEM_THRESHOLD

You can set the tunable LIBMEM_THRESHOLD as follows:

LIBMEM_THRESHOLD=<repmov_start_threshold>,<repmov_stop_threshold>,<nt_start_threshold>,<nt_stop_threshold>

Based on this option, the library will choose the implementation with tuned threshold settings for supported instruction sets: {vector, rep mov, non-temporal}.

Valid Options

  • <repmov_start_threshold> = [0, positive integers]

  • <repmov_stop_threshold> = [0, positive integers, -1]

  • <nt_start_threshold> = [0, positive integers]

  • <nt_stop_threshold> = [0, positive integers, -1]

Here, -1 refers to the maximum length.

Refer to the following table for sample threshold settings:
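The format and value rules above can be checked before launching an application. The following is a minimal sketch, assuming a POSIX shell; all function names are hypothetical and not part of LibMem. It verifies that the string has exactly four comma-separated fields, that the start thresholds are non-negative integers, and that the stop thresholds are non-negative integers or -1. It does not check how the values relate to each other.

```shell
# Succeeds if $1 is 0 or a positive decimal integer.
is_non_negative_int() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Succeeds if $1 is a non-negative integer or -1 (the maximum length).
is_stop_value() {
    [ "$1" = "-1" ] || is_non_negative_int "$1"
}

# Hypothetical helper (not part of LibMem): validate a LIBMEM_THRESHOLD string.
libmem_threshold_is_valid() {
    case $1 in
        *,*,*,*,*) return 1 ;;   # more than four fields
        *,*,*,*) ;;              # exactly four fields
        *) return 1 ;;           # fewer than four fields
    esac
    rest=$1
    repmov_start=${rest%%,*}; rest=${rest#*,}
    repmov_stop=${rest%%,*}; rest=${rest#*,}
    nt_start=${rest%%,*}
    nt_stop=${rest#*,}
    is_non_negative_int "$repmov_start" &&
        is_stop_value "$repmov_stop" &&
        is_non_negative_int "$nt_start" &&
        is_stop_value "$nt_stop"
}
```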

Table 8.2 Sample Threshold Settings

LIBMEM_THRESHOLD   Vector Range     RepMov Range  Non-Temporal Range
-----------------  ---------------  ------------  ------------------------------------------
0,2048,1048576,-1  (2049, 1048576)  [0,2048]      [1048576, max value of unsigned long long)
0,0,1048576,-1     [0,1048576)      [0,0]         [1048576, max value of unsigned long long)

Note

A system-configured threshold will be chosen if the tunable is in an invalid format.

For example, to use REP MOVS instructions for lengths from 1 KB to 2 KB and non-temporal instructions for lengths of 512 KB and above:

$ LD_PRELOAD=<path to build/lib/libaocl-libmem.so> LIBMEM_THRESHOLD=1024,2048,524288,-1 <executable>
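The byte values in the example above follow from 1 KB = 1024 bytes. The following is a minimal sketch, assuming a POSIX shell, that derives them with shell arithmetic; the function name kb_to_bytes and the variable names are illustrative only.

```shell
# Convert a size in KB to bytes (1 KB = 1024 bytes).
kb_to_bytes() {
    echo $(( $1 * 1024 ))
}

repmov_start=$(kb_to_bytes 1)    # REP MOVS range starts at 1 KB
repmov_stop=$(kb_to_bytes 2)     # REP MOVS range stops at 2 KB
nt_start=$(kb_to_bytes 512)      # non-temporal range starts at 512 KB
nt_stop=-1                       # -1 refers to the maximum length

threshold="$repmov_start,$repmov_stop,$nt_start,$nt_stop"
echo "LIBMEM_THRESHOLD=$threshold"
# prints: LIBMEM_THRESHOLD=1024,2048,524288,-1
```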