* Bump `WILDCOPY_OVERLENGTH` to 16 to fix the wildcopy overread.
* Optimize `ZSTD_wildcopy()` by removing unnecessary branches and
unrolling the loop.
* Extract `ZSTD_overlapCopy8()` into its own function.
* Add `ZSTD_safecopy()` for `ZSTD_execSequenceEnd()`. It is
optimized for single long sequences, since that is the important
case that can end up in `ZSTD_execSequenceEnd()`. Without this
optimization, decompressing a block with 1 long match goes
from 5.7 GB/s to 800 MB/s.
* Refactor `ZSTD_execSequenceEnd()`.
* Increase the literal copy shortcut to 16.
* Add a shortcut for offset >= 16.
* Simplify `ZSTD_execSequence()` by pushing more cases into
`ZSTD_execSequenceEnd()`.
* Delete `ZSTD_execSequenceLong()` since it is exactly the
same as `ZSTD_execSequence()`.
clang-8 seeds +17.5% on silesia and +21.8% on enwik8.
gcc-9 sees +12% on silesia and +15.5% on enwik8.
TODO: More detailed measurements, and on more datasets.
Crdit to OSS-Fuzz for finding the wildcopy overread.
This led to a nasty edgecase, where index reduction for modes that don't use
the h3 table would have a degenerate table (size 4) allocated and marked clean,
but which would not be re-indexed.
The source matchState is potentially at a lower current index, which means
that any extra table space not overwritten by the copy may now contain
invalid indices. The simple solution is to unconditionally shrink the valid
table area to just the area overwritten.
Summary: The idea behind wildcopy is that it can be cheaper to copy more bytes (say 8) than it is to copy less (say, 3). This change takes that further by exploiting some properties:
1. it's almost always OK to copy 16 bytes instead of 8, which means fewer copy instructions, and fewer branches
2. A 16 byte chunk size means that ~90% of wildcopy invocations will have a trip count of 1, so branch prediction will be improved.
Speedup on Xeon E5-2680v4 is in the range of 3-5%.
Measured wildcopy length distributions on silesia.tar:
level <=8 <=16 <=24 >24
1 78.05% 11.49% 3.52% 6.94%
3 82.14% 8.99% 2.44% 6.43%
6 85.81% 6.51% 2.92% 4.76%
8 83.02% 7.31% 3.64% 6.03%
10 84.13% 6.67% 3.29% 5.91%
15 77.58% 7.55% 5.21% 9.66%
16 80.07% 7.20% 3.98% 8.75%
Test Plan: benchmark silesia, make check
The COVER and FASTCOVER dictionary builders can deadlock when
dictionary construction errors, likely because there are too few
samples, or too few distinct dmers. The deadlock only occurs when
there are errors.
Fixes#1746.
Extends the fix in PR#1722 to v0.2 and v0.4. These aren't built into
zstd by default, and v0.5 onward are not affected.
I only add the `srcSize > BLOCKSIZE` check to v0.4 because the comments
say that it must hold, but the equivalent comment isn't present in v0.2.
Credit to OSS-Fuzz.
The nbSeq "short" format (1-byte)
is compatible with any value < 128.
However, the code would cautiously only accept values < 127.
This is not an error, because the general 2-bytes format
is compatible with small values < 128.
Hence the inefficiency never triggered any warning.
Spotted by Intel's Smita Kumar.
The ancient GCC 4.x doesn't understand the "optimize" attribute until 4.4.
Fix the build on platforms with GCC 4.x < 4.4 by limiting the DONT_VECTORIZE
definition to GCC 5 and greater.
Noticed and patch proposed by Warner Losh <imp@FreeBSD.org>.
The match length and literal length extra bytes could either
by 2 bytes or 3 bytes in version 0.5. All earlier verions were
always 3 bytes, and later version didn't have dumps.
The bug, introduced by commit 0fd322f812,
was triggered when the last dump was a 2-byte dump, because we didn't
separate that case from a 3-byte dump, and thought we were over-reading.
I've tested this fix with every zstd version < 1.0.0 on the buggy file,
and we are now always successfully decompressing with the right
checksum.
Fixes#1693.
* [ldm] Fix bug in overflow correction with large job size
* [zstdmt] Respect ZSTDMT_JOBSIZE_MAX (1G in 64-bit mode)
* [test] Add test that exposes the bug
Sadly the test fails on our CI because it uses too much memory, so
I had to comment it out.
When we wrote one byte beyond the end of the buffer for RLE
blocks back in 1.3.7, we would then have `op > oend`. That is
a problem when we use `oend - op` for the size of the destination
buffer, and allows further writes beyond the end of the buffer for
the rest of the function. Lets assert that it doesn't happen.
Also : minor speed optimization :
shortcut to ZSTD_reset_matchState() rather than the full reset process.
It still needs to be completed with ZSTD_continueCCtx() for proper initialization.
Also : changed position of LDM hash tables in the context,
so that the "regular" hash tables can be at a predictable position,
hence allowing the shortcut to ZSTD_reset_matchState() without complex conditions.
* Extract the overflow correction into a helper function.
* Load the dictionary `ZSTD_CHUNKSIZE_MAX = 512 MB` bytes at a time
and overflow correct between each chunk.
Data corruption could happen when all these conditions are true:
* You are using multithreading mode
* Your overlap size is >= 512 MB (implies window size >= 512 MB)
* You are using a strategy >= ZSTD_btlazy
* You are compressing more than 4 GB
The problem is that when loading a large dictionary we don't do
overflow correction. We can only load 512 MB at a time, and may
need to do overflow correction before each chunk.
We would only skip at most 192 bytes at a time before this diff.
This was added to optimize long matches and skip the middle of the
match. However, it doesn't handle the case of repetitive data.
This patch keeps the optimization, but also handles repetitive data
by taking the max of the two return values.
```
> for n in $(seq 9); do echo strategy=$n; dd status=none if=/dev/zero bs=1024k count=1000 | command time -f %U ./zstd --zstd=strategy=$n >/dev/null; done
strategy=1
0.27
strategy=2
0.23
strategy=3
0.27
strategy=4
0.43
strategy=5
0.56
strategy=6
0.43
strategy=7
0.34
strategy=8
0.34
strategy=9
0.35
```
At level 19 with multithreading the compressed size of `silesia.tar` regresses 300 bytes, and `enwik8` regresses 100 bytes.
In single threaded mode `enwik8` is also within 100 bytes, and I didn't test `silesia.tar`.
Fixes Issue #1634.
fast mode does the same thing as before :
it pre-emptively invalidates any index that could lead to offset > maxDistance.
It's supposed to help speed.
But this logic is performed inside zstd_fast,
so that other strategies can select a different behavior.
It's re-synchronized with nextToUpdate at beginning of each block.
It only needs to be tracked from within zstd_opt block parser.
Made the logic clear, so that no code tried to maintain this variable.
An even better solution would be to make nextToUpdate3
an internal variable of ZSTD_compressBlock_opt_generic().
That would make it possible to remove it from ZSTD_matchState_t,
thus restricting its visibility to only where it's actually useful.
This would require deeper changes though,
since the matchState is the natural structure to transport parameters into and inside the parser.
* Version <= 0.5 could read beyond the end of `dumps`, which points into
the input buffer.
* Check the validity of `dumps` before using it, if it is out of bounds
return garbage values. There is no return code for this function.
* Introduce `MEM_readLE24()` for simplicity, since I don't want to trust
that there is an extra byte after `dumps`.
ZSTDMT was broken when compiled without ZSTD_MULTITHREAD defined,
because `ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nbWorkerss)`
failed. It was detected by the MSVC test which runs the fuzzer with
multithreading disabled.
This is a very niche use case of a deprecated API, because the API is
inefficient and synchronous, since `threading.h` will be synchronous.
Users almost certainly don't want this, and anyone who tested their code
should realize that it is broken. Therefore, I think it is safe to
require `ZSTD_MULTITHREAD` to be defined to use ZSTDMT.
Bugs:
* `ZSTD_DCtx_refPrefix()` didn't clear the dictionary after the first
use. Fix and add a test case.
* `ZSTD_DCtx_reset()` always cleared the dictionary. Fix and add a test
case.
* After calling `ZSTD_resetDStream()` you could no longer load a
dictionary, since the stage was set to `zdss_loadHeader`. Fix and add
a test case.
Cleanup:
* Make `ZSTD_initDStream*()` and `ZSTD_resetDStream()` wrap the new
advanced API, and add test cases.
* Document the equivalent of these functions in the advanced API and
document the unstable functions as deprecated.
* `ZSTD_decompressDCtx()` did not use the dictionary loaded by
`ZSTD_DCtx_loadDictionary()`.
* Add a unit test.
* A stacked diff uses `ZSTD_decompressDCtx()` in the
`dictionary_round_trip` and `dictionary_decompress` fuzzers.
`ZSTD_compress2()` wouldn't wait for multithreaded compression to
finish. We didn't find this because ZSTDMT will block when it can
compress all in one go, but it can't do that if it doesn't have enough
output space, or if `ZSTD_c_rsyncable` is enabled.
Since we will already sometimes block when using `ZSTD_e_end`, I've
changed `ZSTD_e_end` and `ZSTD_e_flush` to guarantee maximum forward
progress. This simplifies the API, and helps users avoid the easy bug
that was made in `ZSTD_compress2()`
* Found by the libfuzzer fuzzers.
* Added a test case that catches the problem.
* I will make the fuzzers sometimes allocate less than
`ZSTD_compressBound()` output space.
When `FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION` is defined don't check
the dictID. This check makes the fuzzers job harder, and it is at the
very beginning.
This commit moves the candidate advanced API to the stable section.
It makes some minor whitespace changes, but it doesn't change any
of the wording of the documentation.
I'll put up a separate PR that tweaks some of the documentation
once this lands, so that it is easier to review.
NOTE: Even though these functions are now in stable, they aren't
stable until the next release (in under 1 month). It is possible
that they change until then.
This PR is based on top of PR #1563.
The optimization is to process two input pointers per loop.
It is based on ideas from [igzip] level 1, and talking to @gbtucker.
| Platform | Silesia | Enwik8 |
|-------------------------|-------------|--------|
| OSX clang-10 | +5.3% | +5.4% |
| i9 5 GHz gcc-8 | +6.6% | +6.6% |
| i9 5 GHz clang-7 | +8.0% | +8.0% |
| Skylake 2.4 GHz gcc-4.8 | +6.3% | +7.9% |
| Skylake 2.4 GHz clang-7 | +6.2% | +7.5% |
Testing on all Silesia files on my Intel i9-9900k with gcc-8
| Silesia File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| silesia.tar | +0.17% | +6.6% |
| dickens | +0.25% | +7.0% |
| mozilla | +0.02% | +6.8% |
| mr | -0.30% | +10.9% |
| nci | +1.28% | +4.5% |
| ooffice | -0.35% | +10.7% |
| osdb | +0.75% | +9.8% |
| reymont | +0.65% | +4.6% |
| samba | +0.70% | +5.9% |
| sao | -0.01% | +14.0% |
| webster | +0.30% | +5.5% |
| xml | +0.92% | +5.3% |
| x-ray | -0.00% | +1.4% |
Same tests on Calgary. For brevity, I've only included files
where compression ratio regressed or was much better.
| Calgary File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| calgary.tar | +0.30% | +7.1% |
| geo | -0.14% | +25.0% |
| obj1 | -0.46% | +15.2% |
| obj2 | -0.18% | +6.0% |
| pic | +1.80% | +9.3% |
| trans | -0.35% | +5.5% |
We gain 0.1% of compression ratio on Silesia.
We gain 0.3% of compression ratio on enwik8.
I also tested on the GitHub and hg-commands datasets without a dictionary,
and we gain a small amount of compression ratio on each, as well as speed.
I tested the negative compression levels on Silesia on my
Intel i9-9900k with gcc-8:
| Level | Ratio Change | Speed Change |
|-------|--------------|--------------|
| -1 | +0.13% | +6.4% |
| -2 | +4.6% | -1.5% |
| -3 | +7.5% | -4.8% |
| -4 | +8.5% | -6.9% |
| -5 | +9.1% | -9.1% |
Roughly, the negative levels now scale half as quickly. E.g. the new
level 16 is roughly equivalent to the old level 8, but a bit quicker
and smaller. If you don't think this is the right trade off, we can
change it to multiply the step size by 2, instead of adding 1. I think
this makes sense, because it gives a bit slower ratio decay.
[igzip]: https://github.com/01org/isa-l/tree/master/igzip
It wasn't using the ZSTD_CCtx_params correctly. It must actualize
the compression parameters by calling ZSTD_getCParamsFromCCtxParams()
to get the real window log.
Tested by updating the streaming memory usage example in the next
commit. The CHECK() failed before this patch, and passes after.
I also added a unit test to zstreamtest.c that failed before this
patch, and passes after.