in this version, literal compression is always disabled for ZSTD_fast strategy.
Performance parity between ZSTD_compress_advanced() and ZSTD_compress_generic()
result of ZSTD_compress_advanced()
is different from ZSTD_compress_generic()
when using negative compression levels
because the disabling of huffman compression is not passed in parameters.
The (pretty old) code inside ZSTD_compress()
was making some pretty bold assumptions
on what's inside a CCtx and how to init it.
This is pretty fragile by design.
CCtx content evolve.
Knowledge of how to handle that should be concentrate in one place.
A side effect of this strategy
is that ZSTD_compress() wouldn't check for BMI2 capability,
and is therefore missing out some potential speed opportunity.
This patch makes ZSTD_compress() use
the same initialization and release functions
as the normal creator / destructor ones.
Measured on my laptop, with a custom version of bench
manually modified to use ZSTD_compress() (instead of the advanced API) :
This patch :
1#silesia.tar : 211984896 -> 73651053 (2.878), 312.2 MB/s , 723.8 MB/s
2#silesia.tar : 211984896 -> 70163650 (3.021), 226.2 MB/s , 649.8 MB/s
3#silesia.tar : 211984896 -> 66996749 (3.164), 169.4 MB/s , 636.7 MB/s
4#silesia.tar : 211984896 -> 65998319 (3.212), 136.7 MB/s , 619.2 MB/s
dev branch :
1#silesia.tar : 211984896 -> 73651053 (2.878), 291.7 MB/s , 727.5 MB/s
2#silesia.tar : 211984896 -> 70163650 (3.021), 216.2 MB/s , 655.7 MB/s
3#silesia.tar : 211984896 -> 66996749 (3.164), 162.2 MB/s , 633.1 MB/s
4#silesia.tar : 211984896 -> 65998319 (3.212), 130.6 MB/s , 618.6 MB/s
when parameters are "equivalent",
the context is re-used in continue mode,
hence needed workspace size is not recalculated.
This incidentally also evades the size-down check and action.
This patch intercepts the "continue mode"
so that the size-down check and action is actually triggered.
recently introduce into the new dictionary mode.
The bug could be reproduced with this command :
./zstreamtest -v --opaqueapi --no-big-tests -s4092 -t639
error was in function ZSTD_count_2segments() :
the beginning of the 2nd segment corresponds to prefixStart
and not the beginning of the current block (istart == src).
This would result in comparing the wrong byte.
removed "cached" structure.
prices are now saved in the optimal table.
Primarily done for simplification.
Might improve speed by a little.
But actually, and surprisingly, also improves ratio in some circumstances.
recent experienced showed that
default distribution table for offset
can get it wrong pretty quickly with the nb of symbols,
while it remains a reasonable choice much longer for lengths symbols.
Changed the formula,
so that dynamic threshold is now 32 symbols for offsets.
It remains at 64 symbols for lengths.
Detection based on defaultNormLog
zstd rejects blocks which do not compress by at least a certain amount.
In which case, such block is simply emitted uncompressed (even if a little bit of compression could be achieved).
This is better for decompression speed, hence for energy.
The logic is controlled by ZSTD_minGain().
The rule is applied uniformly, at all compression levels.
This change makes btultra accepts blocks with poor compression ratios.
We presume that users of btultra mode prefers compression ratio over some decompress speed gains.
The threshold for minimum gain is lowered for btultra
from s>>6 (~1.5% minimum gain)
to s>>7 (~0.8% minimum gain).
This is a prudent change.
Not sure if it's large enough.
ensure that, when frequency[symbol]==0,
result is (tableLog + 1) bits
with both upper-bit and fractional-bit estimates.
Also : enable BIT_DEBUG in /tests
Work around bug in zstd decoder
Pull request #1144 exercised a new path in the zstd decoder that proved to
be buggy. Avoid the extremely rare bug by emitting an uncompressed block.
Estimate the cost for using FSE modes `set_basic`, `set_compressed`, and
`set_repeat`, and select the one with the lowest cost.
* The cost of `set_basic` is computed using the cross-entropy cost
function `ZSTD_crossEntropyCost()`, using the normalized default count
and the count.
* The cost of `set_repeat` is computed using `FSE_bitCost()`. We check the
previous table to see if it is able to represent the distribution.
* The cost of `set_compressed` is computed with the entropy cost function
`ZSTD_entropyCost()`, together with the cost of writing the normalized
count `ZSTD_NCountCost()`.
this patch makes btultra do 2 passes on the first block,
the first one being dedicated to collecting statistics
so that the 2nd pass is more accurate.
It translates into a very small compression ratio gain :
enwik7, level 20:
blocks 4K : 2.142 -> 2.153
blocks 16K : 2.447 -> 2.457
blocks 64K : 2.716 -> 2.726
On the other hand, the cpu cost is doubled.
The trade off looks bad.
Though, that's ultimately a price to pay to reach better compression ratio.
So it's only enabled when setting btultra.
this improves compression ratio by a *tiny* amount.
It also reduces speed by a small amount.
Consequently, bit-fractional evaluation is only turned on for btultra.
for FSE symbols.
While it seems to work, the gains are negligible compared to rough maxNbBits evaluation.
There are even a few losses sometimes, that still need to be explained.
Furthermode, there are still cases where btlazy2 does a better job than btopt,
which seems rather strange too.
for proper estimation of symbol's weights
when using dictionary compression.
Note : using only huffman costs is not good enough,
presumably because sequence symbol costs are incorrect.
reported by @let-def.
It's actually a bug in ZSTD_compressBegin_usingCDict()
which would pass a wrong pledgedSrcSize value (0 instead of ZSTD_CONTENTSIZE_UNKNOWN)
resulting in wrong window size, resulting in downsized seqStore,
resulting in segfault when writing into the seqStore later in the process.
Added a test in fuzzer to cover this use case (fails before the patch).
The new advanced API basically set `requestedParams = appliedParams` when
using a dictionary. This halted all parameter adjustment, which can hurt
compression ratio if, for example, the window log is small for the first
call, but the rest of the files are large.
This patch fixes the bug, and checks that the `requestedParams` don't change
in the new advanced API when using a dictionary, and generally in the fuzzer.
Zstdmt uses prefixes to load the overlap between segments. Loading extra
positions makes compression non-deterministic, depending on the previous
job the context was used for. Since loading extra position takes extra
time as well, only do it when creating a `ZSTD_CDict`.
Fixes#1077.
The `avgJobSize` must not be lower than 256 KB for single-pass mode.
In `zstd.h` we say the minimum value for `ZSTD_p_jobSize` is 1 MB,
so ensure that we always pick a size >= 1 MB.
Found by libFuzzer fuzzer tests with large input limits.
this makes it possible to specify extremely large negative compression levels,
achieving the side effect as "no compression".
It will also be possible to define larger targetlength for ultra compression mode.
There is no adverse side effect due to removing this limit.
Integrate ldm into zstdmt by running it in serial and in order in the first
step of each job, in the same place as the hash gets updated. The input
buffer is sized to fit the whole LDM window and 2 full buffers of slack.
Input buffers cannot be reused until the LDM step is done with them.
After the LDM step is finished, the jobs don't actually have access to the
full window, only the overlap.
Tested on a few different multi-GB files with and without sanitizers,
and with different numbers of threads.
* Computes the XXH hash in the worker threads.
* Workers get a sequence number and wait until ther number shows up. On
error, ensures that its sequence is finished, so future threads don't
get blocked.
* Sets up for ldm integration, which will go in the same spot.
Setting `loadedDictEnd` was accidently removed from `ZSTD_loadDictionaryContent()`,
which means that dictionary compression will only be able to reference the parts of
the dictionary within the window. The spec allows us to reference the entire
dictionary so long as even one byte is in the window.
`ZSTD_enforceMaxDist()` incorrectly always allowed offsets up to `loadedDictEnd`
beyond the window, even once the dictionary was out of range.
When overflow protection kicked in, the check `current > loadedDictEnd + maxDist`
is incorrect if `loadedDictEnd` isn't reset back to zero. `current` could be reset
below the value, which would incorrectly allow references beyond the window. This
bug is present in `master`, but is very hard to trigger, since it requires both
dictionaries and data which triggers overflow correction.
Summary:
Allocate a single input buffer large enough to house each job, as well as
enough space for the IO thread to write 2 extra buffers. One goes in the
`POOL` queue, and one to fill, and then block on a full `POOL` queue.
Since we can't overlap with the prefix, we allocate space for 3 extra
input buffers.
Test Plan:
* CI
* With and without ASAN/UBSAN run zstdmt with different number of threads
on two large binaries, and verify that their checksums match.
* Test on the tip of the zstdmt ldm integration.
Reviewers: cyan
Differential Revision: https://phabricator.intern.facebook.com/D7284007
Tasks: T25664120
Summary:
* Expose the reference external sequences API for zstdmt.
Allows external sequences of any length, which get split when necessary.
* Reset the LDM window when the context is reset.
* Store the maximum number of LDM sequences.
* Sequence generation now returns the number of last literals.
* Fix sequence generation to not throw out the last literals when blocks of
more than 1 MB are encountered.
Expose reference external sequence API
* Expose the reference external sequences API for zstdmt.
* Allows external sequences of any length, which get split when necessary.
* Reset the LDM window when the context is reset.
* Store the maximum number of LDM sequences.
* Sequence generation now returns the number of last literals.
* Fix sequence generation to not throw out the last literals when blocks of
more than 1 MB are encountered.
Test Plan:
* CI
* Test the zstdmt ldm integration stacked on top of this diff
Reviewers: cyan
Differential Revision: https://phabricator.intern.facebook.com/D7283968
Tasks: T25664120
* Expose the reference external sequences API for zstdmt.
Allows external sequences of any length, which get split when necessary.
* Reset the LDM window when the context is reset.
* Store the maximum number of LDM sequences.
* Sequence generation now returns the number of last literals.
* Fix sequence generation to not throw out the last literals when blocks of
more than 1 MB are encountered.
The overflow protection is broken when the window log is `> (3U << 29)`, so 31.
It doesn't work when `current` isn't around `1U << windowLog` ahead of `lowLimit`,
and the the assertion `current > newCurrent` fails. This happens when the same
context is used many times over, but with a large window log, like in zstdmt.
Fix it by triggering correction based on `nextSrc - base` instead of `lowLimit`.
The added test fails before the patch, and passes after.
access negative compression levels from command line
for both compression and benchmark modes.
also : ensure proper propagation of parameters
through ZSTD_compress_generic() interface.
added relevant cli tests.
negative compression level trade compression ratio for more compression speed.
They turn off huffman compression of literals,
and use row 0 as baseline with a stepSize = -cLevel.
added associated test in fuzzer
also added : new advanced parameter ZSTD_p_literalCompression
Update code documentation, and properly names a few "magic constants".
Also, HUF_compress_internal() gets a cleaner way
to determine size of tables inside workspace.
* `ZSTD_ldm_generateSequences()` generates the LDM sequences and
stores them in a table. It should work with any chunk size, but
is currently only called one block at a time.
* `ZSTD_ldm_blockCompress()` emits the pre-defined sequences, and
instead of encoding the literals directly, it passes them to a
secondary block compressor. The code to handle chunk sizes greater
than the block size is currently commented out, since it is unused.
The next PR will uncomment exercise this code.
* During optimal parsing, ensure LDM `minMatchLength` is at least
`targetLength`. Also don't emit repcode matches in the LDM block
compressor. Enabling the LDM with the optimal parser now actually improves
the compression ratio.
* The compression ratio is very similar to before. It is very slightly
different, because the repcode handling is slightly different. If I remove
immediate repcode checking in both branches the compressed size is exactly
the same.
* The speed looks to be the same or better than before.
Up Next (in a separate PR)
--------------------------
Allow sequence generation to happen prior to compression, and produce more
than a block worth of sequences. Expose some API for zstdmt to consume.
This will test out some currently untested code in
`ZSTD_ldm_blockCompress()`.
as it's faster, due to one memory scan instead of two
(confirmed by microbenchmark).
Note : as ZSTD_reduceIndex() is rarely invoked,
it does not translate into a visible gain.
Consider it an exercise in auto-vectorization and micro-benchmarking.
This makes it easier to explain that nbWorkers=0 --> single-threaded mode,
while nbWorkers=1 --> asynchronous mode (one mode thread on top of the "main" caller thread).
No need for an additional asynchronous mode flag.
nbWorkers>=2 works the same as nbThreads>=2 previously.
to avoid confusion with blocks.
also:
- jobs are cut into chunks of 512KB now, to reduce nb of mutex calls.
- fix function declaration ZSTD_getBlockSizeMax()
- fix outdated comment
Other job members are accessed directly.
This avoids a full job copy, which would access everything,
including a few members that are supposed to be used by worker only,
uselessly requiring additional locks to avoid race conditions.
writeLastEmptyBlock() must release srcBuffer
as mtctx assumes it's done by job worker.
minor : changed 2 job member names (src->srcBuffer, srcStart->prefixStart) for clarity
replaced by equivalent signal job->consumer == job->srcSize.
created additional functions
ZSTD_writeLastEmptyBlock()
and
ZSTDMT_writeLastEmptyBlock()
required when it's necessary to finish a frame with a last empty job, to create an "end of frame" marker.
It avoids creating a job with srcSize==0.
When ZSTD_e_end directive is provided,
the question is not only "are internal buffers completely flushed",
it is also "is current frame completed".
In some rare cases,
it was possible for internal buffers to be completely flushed,
triggering a @return == 0,
but frame was not completed as it needed a last null-size block to mark the end,
resulting in an unfinished frame.
no real consequence, but pollute tsan tests :
job->dstBuff is being modified inside worker,
while main thread might read it accidentally
because it copies whole job.
But since it doesn't used dstBuff, there is no real consequence.
Other potential solution : only copy useful data, instead of whole job
When the dictionary is <= 8 bytes, no data is loaded from the dictionary.
In this case the repcodes weren't set, because they were inserted after the
size check. Fix this problem in general by first setting the cdict state to
a clean state of an empty dictionary, then filling the state from there.
Produces 3 statistics for ongoing frame compression :
- ingested
- consumed (effectively compressed)
- produced
Ingested can be larger than consumed due to buffering effect.
For the time being, this patch mostly fixes the % ratio issue,
since it computes consumed / produced,
instead of ingested / produced.
That being said, update is not "smooth",
because on a slow enough setting,
fileio spends most of its time waiting for a worker to complete its job.
This could be improved thanks to more granular flushing
i.e. start flushing before ongoing job is fully completed.
ZSTD_create?Dict() is required to produce a ?Dict* return type
because `free()` does not accept a `const type*` argument.
If it wasn't for this restriction, I would have preferred to create a `const ?Dict*` object
to emphasize the fact that, once created, a dictionary never changes
(hence can be shared concurrently until the end of its lifetime).
There is no such limitation with initStatic?Dict() :
as stated in the doc, there is no corresponding free() function,
since `workspace` is provided, hence allocated, externally,
it can only be free() externally.
Which means, ZSTD_initStatic?Dict() can return a `const ZSTD_?Dict*` pointer.
Tested with `make all`, to catch initStatic's users,
which, incidentally, also updated zstd.h documentation.
would create too large buffers,
since default job size == window size * 4.
This would crash on 32-bit systems.
Also : jobSize being a 32-bit unsigned, it cannot be >= 4 GB,
so the formula was failing for large window sizes >= 1 GB.
Fixed now : max job Size is 2 GB, whatever the window size.
this happened on 32-bits build when requiring a too large input buffer,
typically on wlog=29, creating jobs of 2 GB size.
also : zstd32 now compiles with multithread support enabled by default
(can be disabled with HAVE_THREAD=0)
Shaves 492,076 B off of the `ZSTD_CDict`.
The size of a `ZSTD_CDict` created from a 112,640 B dictionary is:
| Level | Before (B) | After (B) |
|-------|------------|-----------|
| 1 | 648,448 | 156,412 |
| 3 | 1,140,008 | 647,932 |
This new parameter makes it possible to call
streaming ZSTDMT with a single thread set
which is non blocking.
It makes it possible for the main thread to do other tasks in parallel
while the worker thread does compression.
Typically, for zstd cli, it means it can do I/O stuff.
Applied within fileio.c, this patch provides non-negligible gains during compression.
Tested on my laptop, with enwik9 (1000000000 bytes) : time zstd -f enwik9
With traditional single-thread blocking mode :
real 0m9.557s
user 0m8.861s
sys 0m0.538s
With new single-worker non blocking mode :
real 0m7.938s
user 0m8.049s
sys 0m0.514s
=> 20% faster
it still fallbacks to single-thread blocking invocation
when input is small (<1job)
or when invoking ZSTDMT_compress(), which is blocking.
Also : fixed a bug in new block-granular compression routine.
Pathological samples may result in literal section being incompressible.
This case is now detected,
and literal distribution is replaced by one that can be written into the dictionary.