* All copyright lines now have -2020 instead of -present
* All copyright lines include "Facebook, Inc"
* All licenses are now standardized
The copyright in `threading.{h,c}` is not changed because it comes from
zstdmt.
The copyright and license of `divsufsort.{h,c}` are not changed.
* Adding fail logging for superblock flow
* Dividing by targetCBlockSize instead of blockSize
* Adding a new const and using a more accurate formula for nbBlocks (sketched after this list)
* Only do dstCapacity check if using superblock
* Removing disabling logic
* Updating test to catch a more extreme case of the previous bug
* Also updating comment
* Only taking compressEnd shortcut on non-superblock
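A minimal sketch of the ceiling-division bound implied by the list above; `estimateNbBlocks` and its parameters are illustrative names, not zstd's actual identifiers:

```c
#include <stddef.h>

/* Sketch of the nbBlocks bound: divide by targetCBlockSize (not the
 * full blockSize), rounding up so every sub-block emitted by the
 * superblock path is counted. Illustrative names only. */
static size_t estimateNbBlocks(size_t srcSize, size_t targetCBlockSize)
{
    if (targetCBlockSize == 0) return 0;   /* superblock mode disabled */
    /* ceil(srcSize / targetCBlockSize) */
    return (srcSize + targetCBlockSize - 1) / targetCBlockSize;
}
```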
Fixes a fuzz issue where dictionary_round_trip failed because the compressor was generating corrupt files due to zero weights in the Huffman table (a sketch of the guard follows the list below).
* Only setting the loaded dict huf table to valid when weights are non-zero
* Adding hasNoZeroWeights test to fse tables
* Forbidding nbBits != 0 when weight == 0
* Reverting the last commit
* Setting table log to 0 when weight == 0
* Small (invalid) zero weight dict test
* Small (valid) zero weight dict test
* Initializing repeatMode vars to `check` before the zero-weight check
* Removing FSE changes to a separate PR
* Reverting accidentally changed file
* Negating bool, using unsigned, optimization nit
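A hedged sketch of the zero-weight guard described above, assuming the loaded table's weights are available as a plain array; `RepeatMode_e` and `repeatModeForLoadedTable` are illustrative, not zstd's real identifiers:

```c
#include <stddef.h>

/* If any symbol in the dictionary's Huffman table has weight 0, the
 * table must not be marked unconditionally valid: reusing it as-is for
 * a block that contains that symbol can produce a corrupt stream. */
typedef enum { REPEAT_CHECK, REPEAT_VALID } RepeatMode_e;

static RepeatMode_e repeatModeForLoadedTable(const unsigned* weights,
                                             size_t nbSymbols)
{
    size_t s;
    for (s = 0; s < nbSymbols; ++s) {
        if (weights[s] == 0)
            return REPEAT_CHECK;   /* re-validate against each block's symbols */
    }
    return REPEAT_VALID;           /* every symbol is representable: safe to reuse */
}
```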
* Bump `WILDCOPY_OVERLENGTH` to 16 to fix the wildcopy overread.
* Optimize `ZSTD_wildcopy()` by removing unnecessary branches and
unrolling the loop.
* Extract `ZSTD_overlapCopy8()` into its own function (a simplified sketch of the overlap-safe copy follows this list).
* Add `ZSTD_safecopy()` for `ZSTD_execSequenceEnd()`. It is
optimized for single long sequences, since that is the important
case that can end up in `ZSTD_execSequenceEnd()`. Without this
optimization, decompressing a block with 1 long match goes
from 5.7 GB/s to 800 MB/s.
* Refactor `ZSTD_execSequenceEnd()`.
* Increase the literal copy shortcut to 16.
* Add a shortcut for offset >= 16.
* Simplify `ZSTD_execSequence()` by pushing more cases into
`ZSTD_execSequenceEnd()`.
* Delete `ZSTD_execSequenceLong()` since it is exactly the
same as `ZSTD_execSequence()`.
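A simplified sketch of the overlap-safe copy idea behind `ZSTD_overlapCopy8()` and `ZSTD_safecopy()`; this illustrates the technique only, and the real code uses offset-dependent substeps rather than a byte loop:

```c
#include <string.h>
#include <stddef.h>

typedef unsigned char BYTE;

/* When the match offset is smaller than the copy width, a wide copy
 * would read bytes it has not written yet, so the copy must fall back
 * to byte granularity. */
static void overlapSafeCopy(BYTE* op, const BYTE* ip,
                            size_t length, size_t offset)
{
    if (offset >= 8) {
        /* Source and destination never overlap within one 8-byte chunk. */
        while (length >= 8) { memcpy(op, ip, 8); op += 8; ip += 8; length -= 8; }
    }
    while (length--) *op++ = *ip++;   /* overlap-safe tail (or whole copy if offset < 8) */
}
```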
clang-8 sees +17.5% on silesia and +21.8% on enwik8.
gcc-9 sees +12% on silesia and +15.5% on enwik8.
TODO: More detailed measurements, and on more datasets.
Credit to OSS-Fuzz for finding the wildcopy overread.
Summary: The idea behind wildcopy is that it can be cheaper to copy more bytes (say, 8) than fewer (say, 3). This change takes that further by exploiting two properties:
1. It is almost always OK to copy 16 bytes instead of 8, which means fewer copy instructions and fewer branches.
2. A 16-byte chunk size means that ~90% of wildcopy invocations have a trip count of 1, improving branch prediction (sketched below).
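A minimal sketch of the 16-byte wildcopy described above, assuming length > 0, a non-overlapping copy, and a destination with `WILDCOPY_OVERLENGTH` (16) bytes of slack past the end; illustrative only, since the real `ZSTD_wildcopy()` also handles overlapping matches:

```c
#include <string.h>
#include <stddef.h>

/* Always copies whole 16-byte chunks, deliberately writing up to
 * 15 bytes past op + length; the caller must reserve that slack. */
static void wildcopy16(void* dst, const void* src, ptrdiff_t length)
{
    char* op = (char*)dst;
    const char* ip = (const char*)src;
    char* const oend = op + length;
    do {
        memcpy(op, ip, 16);   /* compiles to one or two wide loads/stores */
        op += 16; ip += 16;
    } while (op < oend);      /* per the table below, ~80% of calls copy
                                 <= 16 bytes, so this usually runs once */
}
```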
Speedup on Xeon E5-2680v4 is in the range of 3-5%.
Measured wildcopy length distributions on silesia.tar:
level   <=8      <=16     <=24    >24
    1   78.05%   11.49%   3.52%   6.94%
    3   82.14%    8.99%   2.44%   6.43%
    6   85.81%    6.51%   2.92%   4.76%
    8   83.02%    7.31%   3.64%   6.03%
   10   84.13%    6.67%   3.29%   5.91%
   15   77.58%    7.55%   5.21%   9.66%
   16   80.07%    7.20%   3.98%   8.75%
Test Plan: benchmark silesia, make check