It is based upon lz4 and uses nearly the same block format (github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md). The number of bytes to encode the offset is four instead of two in lz4 to reflect the longer distance matching. The block format is described in `ldm.h`.
Decompression and verification can be enabled by defining `DECOMPRESS_AND_VERIFY` in `main.c`.
The output file names are as follows:
-`<filename>.ldm` : compressed file
-`<filename>.ldm.dec` : decompressed file
### Parameters
There are various parameters that can be tuned. These parameters can be tuned in `ldm.h` or, alternatively if `ldm_params.h` is included, in `ldm_params.h` (for easier configuration).
The parameters are as follows and must all be defined:
-`LDM_MEMORY_USAGE` : the memory usage of the underlying hash table in bytes.
-`HASH_BUCKET_SIZE_LOG` : the log size of each bucket in the hash table (used in collision resolution).
-`LDM_LAG` : the lag (in bytes) in inserting entries into the hash table.
-`LDM_WINDOW_SIZE_LOG` : the log maximum window size when searching for matches.
-`LDM_MIN_MATCH_LENGTH` : the minimum match length.
-`INSERT_BY_TAG` : insert entries into the hash table as a function of the hash. This increases speed by reducing the number of hash table lookups and match comparisons. Certain hashes will never be inserted.
-`USE_CHECKSUM` : store a checksum with the hash table entries for faster comparison. This halves the number of entries the hash table can contain.
The optional parameter `HASH_ONLY_EVERY_LOG` is the log inverse frequency of insertion into the hash table. That is, an entry is inserted approximately every `1 << HASH_ONLY_EVERY_LOG` times. If this parameter is not defined, the value is computed as a function of the window size and memory usage to approximate an even coverage of the window.
Below is a comparison of various compression methods on a tar of four versions of llvm (versions `3.9.0`, `3.9.1`, `4.0.0`, `4.0.1`) with a total size of `727900160` B.
A large bottleneck in terms of speed is finding the matches and comparing to see if they are greater than the minimum match length. Generally:
- The fewer matches found (or the lower the percentage of the literals matched), the slower the algorithm will behave.
- Increasing `HASH_ONLY_EVERY_LOG` results in fewer inserts and, if `INSERT_BY_TAG` is set, fewer lookups in the table. This has a large effect on speed, as well as compression ratio.
- If `HASH_ONLY_EVERY_LOG` is not set, its value is calculated based on `LDM_WINDOW_SIZE_LOG` and `LDM_MEMORY_USAGE`. Increasing `LDM_WINDOW_SIZE_LOG` has the effect of increasing `HASH_ONLY_EVERY_LOG` and increasing `LDM_MEMORY_USAGE` decreases `HASH_ONLY_EVERY_LOG`.
-`USE_CHECKSUM` generally improves speed with hash table lookups.
The compression ratio is highly correlated with the coverage of matches. As a long distance matcher, the algorithm was designed to "optimize" for long distance matches outside the zstd compression window. The compression ratio after recompressing the output of the long-distance matcher with zstd was a more important signal in development than the raw compression ratio itself.
Generally, increasing `LDM_MEMORY_USAGE` will improve the compression ratio. However when using the default computed value of `HASH_ONLY_EVERY_LOG`, this increases the frequency of insertion and lookup in the table and thus may result in a decrease in speed.
Below is a table showing the speed and compression ratio when compressing the llvm tar (as described above) using different settings for `LDM_MEMORY_USAGE`. The other parameters were the same as used in the benchmark above.
| `LDM_MEMORY_USAGE` | Ratio | Speed (MB/s) | Ratio after zstd -6 |
|---:| ---: | ---: | ---: |
| `18` | `1.85` | `232.4` | `10.92` |
| `21` | `2.79` | `233.9` | `15.92` |
| `23` | `3.48` | `220.5` | `18.29` |
| `25` | `4.56` | `140.8` | `19.21` |
### Compression statistics
Compression statistics (and the configuration) can be enabled/disabled via `COMPUTE_STATS` and `OUTPUT_CONFIGURATION` in `ldm.h`.