10. AOCL-Compression
AOCL-Compression provides options to configure the library to best suit your use case. Options are available to enable/disable various optimizations at both compile time and run time. These options can significantly impact run time performance.
10.1. Compile Time Tuning
Compile time tuning is available through CMake options.
Do not enable the AOCL_ENABLE_LOG_FEATURE option when building the library for performance.
The following options are available to enable or disable specific optimizations in the respective compression methods:
| Flag | Description | Use case |
|---|---|---|
| AOCL_DECOMPRESS_FAST (LZ4, ZSTD & Snappy) | Enable fast decompression modes that may compromise compression speed or ratio to produce streams that decompress faster. Supported values: ZSTD (levels 1-3), Snappy (levels 1-2), and LZ4 (level 1). | Applications that focus on faster decompression speeds, e.g. applications that compress once and decompress multiple times. |
| AOCL_LZ4_MATCH_SKIP_OPT_LDS_STRAT1 (LZ4)<br>AOCL_LZ4_MATCH_SKIP_OPT_LDS_STRAT2 (LZ4)<br>SNAPPY_MATCH_SKIP_OPT (Snappy)<br>AOCL_ZSTD_SEARCH_SKIP_OPT_DFAST_FAST (ZSTD) | If no matches are found for N bytes while parsing the input data, increase the parsing step size from 1 to M and look for matches only at these points. This speeds up compression, at the expense of compression ratio, in scenarios where matches are hard to find. For LZ4, AOCL_LZ4_MATCH_SKIP_OPT_LDS_STRAT2 skips more aggressively than AOCL_LZ4_MATCH_SKIP_OPT_LDS_STRAT1. [Values: ON / OFF. Defaults: AOCL_LZ4_MATCH_SKIP_OPT_LDS_STRAT1 (OFF), AOCL_LZ4_MATCH_SKIP_OPT_LDS_STRAT2 (OFF), SNAPPY_MATCH_SKIP_OPT (ON), AOCL_ZSTD_SEARCH_SKIP_OPT_DFAST_FAST (ON)] | Files that are hard to compress. |
| AOCL_LZ4_NEW_PRIME_NUMBER (LZ4)<br>AOCL_LZ4_EXTRA_HASH_TABLE_UPDATES (LZ4)<br>AOCL_LZ4_HASH_BITS_USED (LZ4)<br>SNAPPY_HIGH_COMPRESSION (Snappy) | A hash table keeps a dictionary of matches found earlier for different byte patterns in the input. LZ4 uses a multiplicative hash function that takes 5 bytes of input and multiplies them by a hard-coded prime number to get the hash. Snappy generates the hash value using the CRC32 algorithm.<br>AOCL_LZ4_NEW_PRIME_NUMBER: Use an alternate prime number found through empirical studies. [Values: ON / OFF (default)]<br>AOCL_LZ4_EXTRA_HASH_TABLE_UPDATES: When a match of length N is found, the next comparison starts from src+N. By default, the skipped bytes are not added to the hash table. This flag inserts some of these skipped bytes into the hash table, which provides better compression. [Values: ON / OFF (default)]<br>AOCL_LZ4_HASH_BITS_USED: Use more than 5 bytes (40 bits) when computing the hash. LOW: 41 bits, HIGH: 44 bits. [Values: LOW (default) / HIGH]<br>SNAPPY_HIGH_COMPRESSION: Double the hash table size if the input size is more than 32 KB. [Values: ON / OFF (default)] | AOCL_LZ4_NEW_PRIME_NUMBER: Determine experimentally whether it is useful for your data set.<br>AOCL_LZ4_EXTRA_HASH_TABLE_UPDATES: Better compression desired.<br>AOCL_LZ4_HASH_BITS_USED: Better speed desired.<br>SNAPPY_HIGH_COMPRESSION: Better compression desired. |
| AOCL_LZ4_OPT_PREFETCH_BACKWARDS (LZ4) | Prefetch match candidates in advance. [Values: ON / OFF (default)] | Data with a higher likelihood of finding matches. |
| AOCL_LZ4HC_DISABLE_PATTERN_ANALYSIS (LZ4HC) | Disable the fast code path that handles repeated byte patterns such as “000000”. Compression is faster when the data does not have such patterns. [Values: ON (default) / OFF] | Enable if data contains such patterns. |
| ENABLE_FAST_MATH | Enable fast math optimizations. [Values: ON / OFF (default)] | Enable if the application is not sensitive to floating-point numerical accuracy. |
| SNAPPY_ENABLE_DECOMPRESS_BRANCHLESS (Snappy) | Enable the Snappy branchless decompression optimization. [Values: ON / OFF (default: ON for non-GCC compilers, OFF for GCC)] | For non-GCC compilers, disable this option if performance is lower with it enabled than without it. |
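
As an illustration of how the options above are passed to CMake, the following sketch assembles a configure command line using the standard `-D<NAME>=<VALUE>` cache syntax. The helper name `cmake_configure_args`, the source directory `.` and the build directory `build` are assumptions made for this example and are not part of the library; the flag names and values are taken from the table. The sketch only builds and prints the command, it does not run CMake.

```python
from typing import Dict, List


def cmake_configure_args(source_dir: str, build_dir: str,
                         options: Dict[str, str]) -> List[str]:
    """Build a `cmake` configure command line from a dict of cache options.

    Hypothetical helper for illustration only; it is not shipped with the library.
    """
    args = ["cmake", "-S", source_dir, "-B", build_dir]
    # Each option becomes a standard CMake cache definition: -D<NAME>=<VALUE>.
    for name, value in sorted(options.items()):
        args.append(f"-D{name}={value}")
    return args


# Example selection that favors compression ratio for LZ4 and Snappy.
# Flag names and ON/OFF values come from the table; paths are placeholders.
ratio_focused = {
    "AOCL_LZ4_EXTRA_HASH_TABLE_UPDATES": "ON",
    "SNAPPY_HIGH_COMPRESSION": "ON",
}

command = cmake_configure_args(".", "build", ratio_focused)
print(" ".join(command))
```

Keeping the options in a dictionary makes it easier to try several configurations against the same data set, which is what the table recommends for flags such as AOCL_LZ4_NEW_PRIME_NUMBER.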
10.2. Run Time Tuning
Run time tuning is available through environment variables.
The following options are available to control library functionality at run time:
| Flag | Description | Use case |
|---|---|---|
| AOCL_ZLIB_QUICK_MODE (ZLIB) | Improves compression speed at the expense of compression ratio. Primarily intended for level 1; improvements can also be observed for levels 2, 3 and 5. [Set environment variable: AOCL_ZLIB_QUICK_MODE. Values: ON / OFF (default)] | Applications that need faster compression speeds at lower levels. |
| AOCL_DISABLE_OPT | Disable AOCL optimizations and run the reference implementation. [Set environment variable: AOCL_DISABLE_OPT. Values: ON / OFF (default)] | Benchmarking the performance improvements obtained by AOCL optimizations over the reference implementation. |
| OMP_NUM_THREADS | Environment-variable-based thread control provided by OpenMP. The library must be built with AOCL_ENABLE_THREADS=ON for this to be useful. [Set environment variable: OMP_NUM_THREADS. Values: >= 1. Default: all threads that the implementation supports.] | To limit the number of threads used for compression and decompression in multi-threaded mode. Note: setting OMP_NUM_THREADS is not required, because by default the algorithm automatically determines the number of threads to use based on the hardware and the input file size. |
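
The run time options above are ordinary environment variables, so they can be set per process rather than globally. The sketch below shows one way to do that from Python. The helper name `tuned_env` is an assumption for this example, and the child process is a stand-in that only echoes the variables back; in practice it would be the application or benchmark linked against the library.

```python
import os
import subprocess
import sys
from typing import Dict, Optional


def tuned_env(quick_mode: bool = False, disable_opt: bool = False,
              num_threads: Optional[int] = None) -> Dict[str, str]:
    """Return a copy of the current environment with the run time options set.

    Hypothetical helper for illustration; variable names and values follow the table.
    """
    env = dict(os.environ)
    env["AOCL_ZLIB_QUICK_MODE"] = "ON" if quick_mode else "OFF"
    env["AOCL_DISABLE_OPT"] = "ON" if disable_opt else "OFF"
    if num_threads is not None:
        if num_threads < 1:
            raise ValueError("OMP_NUM_THREADS must be >= 1")
        env["OMP_NUM_THREADS"] = str(num_threads)
    return env


env = tuned_env(quick_mode=True, num_threads=4)

# Stand-in child process: it only prints the variables it received.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['AOCL_ZLIB_QUICK_MODE'], "
     "os.environ['OMP_NUM_THREADS'])"],
    env=env, capture_output=True, text=True, check=True,
)
print(child.stdout.strip())
```

Running the same workload once with `disable_opt=True` and once with `disable_opt=False` is a simple way to compare the optimized code paths against the reference implementation, as the AOCL_DISABLE_OPT use case suggests.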
10.3. Reducing Run-to-Run Variation (Hardware Settings for Optimal Performance Benchmarking)
A test bench is included in the source code to benchmark the library’s performance, with instructions for running it provided here. Third-party benchmarks that can link to the library may also be used. Some fluctuation in compression and decompression times between benchmark runs is normal. This variance is not caused by non-deterministic elements in the algorithms or the benchmark; it is mainly due to the hardware environment.
To reduce these variations, consider the following techniques:
Clear the Caches: Clear the caches of the machine before running benchmarks to ensure consistent starting conditions.
Isolate the Workload: Avoid running multiple workloads on the machine during benchmarking to prevent resource contention.
Run Multiple Iterations: Perform ~50 iterations for single-threaded and ~100 for multi-threaded benchmarks, taking the best result to minimize anomalies.
Disable SMT/Hyperthreading: Disable SMT/Hyperthreading to reduce variability caused by shared resources.
Bind Processes to Cores: Use numactl to bind the benchmarking process to specific cores and memory nodes. For multi-threaded benchmarks, OpenMP affinity settings such as OMP_PROC_BIND can be used.
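
The "run multiple iterations and take the best result" technique can be sketched as follows. This is a minimal illustration, not the library's test bench: the helper name `best_of` is hypothetical, and Python's built-in `zlib` module is used purely as a stand-in workload so that the example runs without the library installed.

```python
import time
import zlib
from typing import Callable


def best_of(run: Callable[[], object], iterations: int) -> float:
    """Return the fastest wall-clock time in seconds over `iterations` calls."""
    if iterations < 1:
        raise ValueError("iterations must be >= 1")
    best = float("inf")
    for _ in range(iterations):
        start = time.perf_counter()
        run()
        elapsed = time.perf_counter() - start
        # Keeping the minimum discards runs disturbed by other system activity.
        best = min(best, elapsed)
    return best


# Stand-in workload: Python's bundled zlib, not AOCL-Compression.
payload = b"AOCL-Compression benchmark payload " * 4096
compressed = zlib.compress(payload, 1)

# ~50 iterations, as suggested above for single-threaded benchmarks.
compress_s = best_of(lambda: zlib.compress(payload, 1), iterations=50)
decompress_s = best_of(lambda: zlib.decompress(compressed), iterations=50)
print(f"best compress: {compress_s * 1e3:.3f} ms, "
      f"best decompress: {decompress_s * 1e3:.3f} ms")
```

Reporting the minimum rather than the mean reflects the reasoning in this section: the algorithms are deterministic, so slower runs mostly capture interference from the hardware environment rather than the cost of the work itself.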