* [ldm] Fix bug in overflow correction with large job size
* [zstdmt] Respect ZSTDMT_JOBSIZE_MAX (1G in 64-bit mode)
* [test] Add test that exposes the bug
Sadly the test fails on our CI because it uses too much memory, so
I had to comment it out.
When we wrote one byte beyond the end of the buffer for RLE
blocks back in 1.3.7, we would then have `op > oend`. That is
a problem when we use `oend - op` for the size of the destination
buffer, and allows further writes beyond the end of the buffer for
the rest of the function. Lets assert that it doesn't happen.
Also : minor speed optimization :
shortcut to ZSTD_reset_matchState() rather than the full reset process.
It still needs to be completed with ZSTD_continueCCtx() for proper initialization.
Also : changed position of LDM hash tables in the context,
so that the "regular" hash tables can be at a predictable position,
hence allowing the shortcut to ZSTD_reset_matchState() without complex conditions.
* Extract the overflow correction into a helper function.
* Load the dictionary `ZSTD_CHUNKSIZE_MAX = 512 MB` bytes at a time
and overflow correct between each chunk.
Data corruption could happen when all these conditions are true:
* You are using multithreading mode
* Your overlap size is >= 512 MB (implies window size >= 512 MB)
* You are using a strategy >= ZSTD_btlazy
* You are compressing more than 4 GB
The problem is that when loading a large dictionary we don't do
overflow correction. We can only load 512 MB at a time, and may
need to do overflow correction before each chunk.
We would only skip at most 192 bytes at a time before this diff.
This was added to optimize long matches and skip the middle of the
match. However, it doesn't handle the case of repetitive data.
This patch keeps the optimization, but also handles repetitive data
by taking the max of the two return values.
```
> for n in $(seq 9); do echo strategy=$n; dd status=none if=/dev/zero bs=1024k count=1000 | command time -f %U ./zstd --zstd=strategy=$n >/dev/null; done
strategy=1
0.27
strategy=2
0.23
strategy=3
0.27
strategy=4
0.43
strategy=5
0.56
strategy=6
0.43
strategy=7
0.34
strategy=8
0.34
strategy=9
0.35
```
At level 19 with multithreading the compressed size of `silesia.tar` regresses 300 bytes, and `enwik8` regresses 100 bytes.
In single threaded mode `enwik8` is also within 100 bytes, and I didn't test `silesia.tar`.
Fixes Issue #1634.
fast mode does the same thing as before :
it pre-emptively invalidates any index that could lead to offset > maxDistance.
It's supposed to help speed.
But this logic is performed inside zstd_fast,
so that other strategies can select a different behavior.
It's re-synchronized with nextToUpdate at beginning of each block.
It only needs to be tracked from within zstd_opt block parser.
Made the logic clear, so that no code tried to maintain this variable.
An even better solution would be to make nextToUpdate3
an internal variable of ZSTD_compressBlock_opt_generic().
That would make it possible to remove it from ZSTD_matchState_t,
thus restricting its visibility to only where it's actually useful.
This would require deeper changes though,
since the matchState is the natural structure to transport parameters into and inside the parser.
ZSTDMT was broken when compiled without ZSTD_MULTITHREAD defined,
because `ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nbWorkerss)`
failed. It was detected by the MSVC test which runs the fuzzer with
multithreading disabled.
This is a very niche use case of a deprecated API, because the API is
inefficient and synchronous, since `threading.h` will be synchronous.
Users almost certainly don't want this, and anyone who tested their code
should realize that it is broken. Therefore, I think it is safe to
require `ZSTD_MULTITHREAD` to be defined to use ZSTDMT.
`ZSTD_compress2()` wouldn't wait for multithreaded compression to
finish. We didn't find this because ZSTDMT will block when it can
compress all in one go, but it can't do that if it doesn't have enough
output space, or if `ZSTD_c_rsyncable` is enabled.
Since we will already sometimes block when using `ZSTD_e_end`, I've
changed `ZSTD_e_end` and `ZSTD_e_flush` to guarantee maximum forward
progress. This simplifies the API, and helps users avoid the easy bug
that was made in `ZSTD_compress2()`
* Found by the libfuzzer fuzzers.
* Added a test case that catches the problem.
* I will make the fuzzers sometimes allocate less than
`ZSTD_compressBound()` output space.
This PR is based on top of PR #1563.
The optimization is to process two input pointers per loop.
It is based on ideas from [igzip] level 1, and talking to @gbtucker.
| Platform | Silesia | Enwik8 |
|-------------------------|-------------|--------|
| OSX clang-10 | +5.3% | +5.4% |
| i9 5 GHz gcc-8 | +6.6% | +6.6% |
| i9 5 GHz clang-7 | +8.0% | +8.0% |
| Skylake 2.4 GHz gcc-4.8 | +6.3% | +7.9% |
| Skylake 2.4 GHz clang-7 | +6.2% | +7.5% |
Testing on all Silesia files on my Intel i9-9900k with gcc-8
| Silesia File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| silesia.tar | +0.17% | +6.6% |
| dickens | +0.25% | +7.0% |
| mozilla | +0.02% | +6.8% |
| mr | -0.30% | +10.9% |
| nci | +1.28% | +4.5% |
| ooffice | -0.35% | +10.7% |
| osdb | +0.75% | +9.8% |
| reymont | +0.65% | +4.6% |
| samba | +0.70% | +5.9% |
| sao | -0.01% | +14.0% |
| webster | +0.30% | +5.5% |
| xml | +0.92% | +5.3% |
| x-ray | -0.00% | +1.4% |
Same tests on Calgary. For brevity, I've only included files
where compression ratio regressed or was much better.
| Calgary File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| calgary.tar | +0.30% | +7.1% |
| geo | -0.14% | +25.0% |
| obj1 | -0.46% | +15.2% |
| obj2 | -0.18% | +6.0% |
| pic | +1.80% | +9.3% |
| trans | -0.35% | +5.5% |
We gain 0.1% of compression ratio on Silesia.
We gain 0.3% of compression ratio on enwik8.
I also tested on the GitHub and hg-commands datasets without a dictionary,
and we gain a small amount of compression ratio on each, as well as speed.
I tested the negative compression levels on Silesia on my
Intel i9-9900k with gcc-8:
| Level | Ratio Change | Speed Change |
|-------|--------------|--------------|
| -1 | +0.13% | +6.4% |
| -2 | +4.6% | -1.5% |
| -3 | +7.5% | -4.8% |
| -4 | +8.5% | -6.9% |
| -5 | +9.1% | -9.1% |
Roughly, the negative levels now scale half as quickly. E.g. the new
level 16 is roughly equivalent to the old level 8, but a bit quicker
and smaller. If you don't think this is the right trade off, we can
change it to multiply the step size by 2, instead of adding 1. I think
this makes sense, because it gives a bit slower ratio decay.
[igzip]: https://github.com/01org/isa-l/tree/master/igzip
It wasn't using the ZSTD_CCtx_params correctly. It must actualize
the compression parameters by calling ZSTD_getCParamsFromCCtxParams()
to get the real window log.
Tested by updating the streaming memory usage example in the next
commit. The CHECK() failed before this patch, and passes after.
I also added a unit test to zstreamtest.c that failed before this
patch, and passes after.
* After loading a dictionary only create the cdict once we've started the
compression job. This allows the user to pass the dictionary before they
set other settings, and is in line with the rest of the API.
* Add tests that mix the 3 dictionary loading APIs.
* Add extra tests for `ZSTD_CCtx_loadDictionary()`.
* The first 2 tests added fail before this patch.
* Run the regression test suite.