forgot to add the dictionary content
(tests were not failing, just compressing less).
Also : added size protections when adding dict content
since hc/bt table filling would fail if size < 8
The compressor always reuses the existing Huffman table if the literals
size is at most 1 KiB. If the compression strategy is `ZSTD_lazy` or
stronger always check to see if reusing the previous table or creating
a new table is better.
This doesn't yet weigh in decompression speed. I don't want to add any
heuristics there until I have real data to work with to ensure that the
heuristic works for at least one use case, preferably more.
* Compressor saves most recently used Huffman table and reuses it
if it produces better results.
* I attempted to preserve CPU usage profile.
I intentionally left all of the existing heuristics in place.
There is only a speed difference on the second block and later.
When compressing large enough blocks (say >= 4 KiB) there is
no significant difference in compression speed.
Dictionary compression of one block is the same speed for blocks
with literals <= 1 KiB, and after that the difference is not
very significant.
* In the synthetic data, with blocks 10 KB or smaller, most blocks
can't use repeated tables because the previous block did not
contain a symbol that the current block contains.
Once blocks are about 12 KB or more, most previous blocks have
valid Huffman tables for the current block, and the compression
ratio and decompression speed jumped.
* In silesia blocks as small as 4KB can frequently reuse the
previous Huffman table (85%), but it isn't as profitable, and
the previous Huffman table only gets used about 3% of the time.
* Microbenchmarks show that `HUF_validateCTable()` takes ~55 ns
and `HUF_estimateCompressedSize()` takes ~35 ns.
They are decently well optimized, the first versions took 90 ns
and 120 ns respectively. `HUF_validateCTable()` could be twice as
fast, if we cast the `HUF_CElt*` to a `U32*` and compare to 0.
However, `U32` has an alignment of 4 instead of 2, so I think that
might be undefined behavior.
* I've ran `zstreamtest` compiled normally, with UASAN and with MSAN
for 4 hours each.
The worst case for the speed difference is a bunch of small blocks
in the same frame. I modified `bench.c` to compress the input in a
single frame but with blocks of the given block size, set by `-B`.
Benchmarks on level 1:
| Program | Block size | Corpus | Ratio | Compression MB/s | Decompression MB/s |
|-----------|------------|-----------|-------|------------------|--------------------|
| zstd.base | 256 | synthetic | 2.364 | 110.0 | 297.0 |
| zstd | 256 | synthetic | 2.367 | 108.9 | 297.0 |
| zstd.base | 256 | silesia | 2.204 | 93.8 | 415.7 |
| zstd | 256 | silesia | 2.204 | 93.4 | 415.7 |
| zstd.base | 512 | synthetic | 2.594 | 144.2 | 420.0 |
| zstd | 512 | synthetic | 2.599 | 141.5 | 425.7 |
| zstd.base | 512 | silesia | 2.358 | 118.4 | 432.6 |
| zstd | 512 | silesia | 2.358 | 119.8 | 432.6 |
| zstd.base | 1024 | synthetic | 2.790 | 192.3 | 594.1 |
| zstd | 1024 | synthetic | 2.794 | 192.3 | 600.0 |
| zstd.base | 1024 | silesia | 2.524 | 148.2 | 464.2 |
| zstd | 1024 | silesia | 2.525 | 148.2 | 467.6 |
| zstd.base | 4096 | synthetic | 3.023 | 300.0 | 1000.0 |
| zstd | 4096 | synthetic | 3.024 | 300.0 | 1010.1 |
| zstd.base | 4096 | silesia | 2.779 | 223.1 | 623.5 |
| zstd | 4096 | silesia | 2.779 | 223.1 | 636.0 |
| zstd.base | 16384 | synthetic | 3.131 | 350.0 | 1150.1 |
| zstd | 16384 | synthetic | 3.152 | 350.0 | 1630.3 |
| zstd.base | 16384 | silesia | 2.871 | 296.5 | 883.3 |
| zstd | 16384 | silesia | 2.872 | 294.4 | 898.3 |
XXH_STATIC_LINKING_ONLY protection macro is intended to be triggered just before the include.
The main idea is to keep this setting local :
user module shall explicitly understand and accept the static linking restriction
which becomes transparent when triggering the macro at project level.
Global definition also triggers redefinition warnings for user modules which do locally define the macro.
This new version compiles lib and cli without warning when the macro is set globally.
That's not a scenario to be recommended, since it trades a local effect for a global one,
but it was easy enough to provide from zstd side.
There used to be a (very small) chance that
loading prefix from previous segment
would be confused with a real zstd dictionary.
For that to happen, the prefix needs to start
with the same value as dictionary magic.
That's 1 chance in 4 billions if all values have equal probability.
But in fact, since some values are more common (0x00000000 for example)
others are less common, and dictionary magic was selected to be one of them,
so probabilities are likely even lower.
Anyway, this risk is no down to zero
by adding a new CCtx parameter : ZSTD_p_forceRawDict
Current parameter policy : the parameter "stick" to its CCtx,
so any dictionary loading after ZSTD_p_forceRawDict is set
will be loaded in "raw" ("content only") mode,
even if CCtx is re-used multiple times with multiple different dictionary.
It's up to the user to reset this value differently if it needs so.
Reproduction steps:
```
make zstreamtest CC=clang CFLAGS="-O3 -g -fsanitize=memory -fsanitize-memory-track-origins"
./zstreamtest -vv -t4178 -i4178 -s4531
```
How to get to the error in gdb (may be a more efficient way):
* 2 breaks at zstd_compress.c:2418 -- in ZSTD_compressContinue_internal()
* 2 breaks at zstd_compress.c:2276 -- in ZSTD_compressBlock_internal()
* 1 break at zstd_compress.c:1547
Why the error occurred:
When `zc->forceWindow == 1`, after calling `ZSTD_loadDictionaryContent()` we
have `zc->loadedDictEnd == zc->nextToUpdate == 0`. But, we've really loaded up
to `iend` into the dictionary. Then in `ZSTD_compressBlock_internal()` we see
that `current > zc->nextToUpdate + 384`, so we load the last 192 bytes a second
time. In this case the bytes we are loading are a block of all 0s, starting in
the previous block. So when we are loading the last 192 bytes, we find a `match`
in the future, 183 bytes beyond `ip`. Since the block is all 0s, the match
extends to the end of the block. But in `ZSTD_count()` we only check that
`pIn < pInLoopLimit`, but since `pMatch > pIn`, `pMatch` eventually points past
the end of the buffer, causing the MSAN failure.
The fix:
The line changed sets sets `zc->nextToUpdate` to the end of the dictionary.
This is the behavior that existed before `ZSTD_p_forceWindow` was introduced.
This fixes the exposing test case. Since the code doesn't fail without
`zc->forceWindow`, it makes sense that this works. I've run the command
`./zstreamtest -T2mn` 64 times without failures. CI should also verify nothing
obvious broke.
the minimum size condition size is applied transparently (no warning, no error)
like previous minimum section size condition (1 KB) which still applies.
Previous version was requiring a fairly large initial amount of input data
before starting to create compression jobs.
This new version starts the process much sooner.
fileio.c was continually pushing more content without giving a chance to flush compressed one.
It would block the job queue when input data was accumulated too fast (requiring to define many threads).
Fixed : fileio flushes whatever it can after each input attempt.
Sections 2+ read a bit of data from previous section
in order to improve compression ratio.
This also costs some CPU, to reference read data.
Read data is currently fixed to window>>3 size
By default, section sizes are 4x window size.
This new setting allow manual selection of section sizes.
The larger they are, the (slightly) better the compression ratio,
but also the higher the memory allocation cost,
and eventually the lesser the nb of possible threads,
since each section is compressed by a single thread.
It also introduces a prototype to set generic parameters,
ZSTDMT_setMTCtxParameter()
The idea is that it's possible to add enums
to extend the list of parameters that can be set this way.
This is more long-term oriented than a fixed-size struct.
Consider it as a test.
In some (rare) cases, job list could be blocked by a first job still being processed,
while all following ones are completed, waiting to be flushed.
In such case, the current job-table implementation is unable to accept new job.
As a consequence, a call to ZSTDMT_compressStream() can be useless (nothing read, nothing flushed),
with the risk to trigger a busy-wait on the caller side
(needlessly loop over ZSTDMT_compressStream() ).
In such a case, ZSTDMT_compressStream() will block until the first job is completed and ready to flush.
It ensures some forward progress by guaranteeing it will flush at least a part of the completed job.
Energy-wasting busy-wait is avoided.
Like ZSTD_initCStream_usingDict(),
ZSTDMT_initCStream_usingDict() now keep a copy of dict internally.
This way, dict can be released :
it does not longer have to outlive all future compression sessions.
Correctly compress with custom params and dictionary
Added relevant fuzzer test in zstreamtest
Also :
new macro ZSTDMT_SECTION_LOGSIZE_MIN, which sets a minimum size for a full job
(note : a flush() command can still generate a partial job anytime)
Also : fixed corner case, where nb of jobs completed becomes > jobQueueSize
which is possible when many flushes are issued
while there is not enough dst buffer to flush completed ones.
MT compression generates a single frame.
Multi-threading operates by breaking the frames into independent sections.
But from a decoder perspective, there is no difference :
it's just a suite of blocks.
Problem is, decoder preserves repCodes from previous block to start decoding next block.
This is also valid between sections, since they are no different than changing block.
Previous version would incorrectly initialize repcodes to their default value at the beginning of each section.
When using them, there was a mismatch between encoder (default values) and decoder (values from previous block).
This change ensures that repcodes won't be used at the beginning of a new section.
It works by setting them to 0.
This only works with regular (single segment) variants : extDict variants will fail !
Fortunately, sections beyond the 1st one belong to this category.
To be checked : btopt strategy.
This change was only validated from fast to btlazy2 strategies.
In some complex scenarios (free() without finishing compression),
it is possible that some resources are still into jobs
and not collected back into pools.
In which case, previous version of free() would miss them.
This would be equivalent to a leak.
New version ensures that it even foes after such resource.
It requires job consumers to properly mark resources as released,
by replacing entries by NULL after releasing back to the pool.
Obviously, it's not recommended to free() zstdmt context mid-term,
still that's now a supported scenario.
The same methodology is also used to ensure proper resource collection
after an error is detected.
Still to do :
- detect compression errors (not just allocation ones)
- properly manage resource when init() is called without finishing previous compression.
The main issue was to avoid a caller to continually loop on {flush,end}Stream()
when there was nothing ready to be flushed but still some compression work ongoing in a worker thread.
The continuous loop would have resulted in wasted energy.
The new version makes call to {flush,end}Stream blocking when there is nothing ready to be flushed.
Of course, if all worker threads have exhausted job, it will return zero (all flush completed).
Note : There are still some remaining issues to report error codes
and properly collect back resources into pools when an error is triggered.
When porting python-zstandard to use ZSTD_initCStream_usingCDict()
so compression dictionaries could be reused, an automated test
failed due to compressed content changing.
I tracked this down to ZSTD_initCStream_usingCDict() not
setting the dictID field of the ZSTD_CCtx attached to the
ZSTD_CStream instance.
I'm not 100% convinced this is the correct or full solution,
as I'm still seeing one automated test failing with this change.
In previous version, main function would return early when detecting a job error.
Late threads resources were therefore not collected back into pools.
New version just register the error, but continue the collecting process.
All buffers and context should be released back to pool before leaving main function.
Result from getBuffer and getCCtx could be NULL when allocation fails.
Now correctly checks : job creation stop and last job reports an allocation error.
releaseBuffer and releaseCCtx are now also compatible with NULL input.
Identified a new potential issue :
when early job fails, later jobs are not collected for resource retrieval.
Since the result of mt compression is a single frame,
changed naming, which implied the concatenation of multiple frames.
minor : ensures that content size is written in header
The new strategy involves cutting frame at block level.
The result is a single frame, preserving ZSTD_getDecompressedSize()
As a consequence, bench can now make a full round-trip,
since the result is compatible with ZSTD_decompress().
This strategy will not make it possible to decode the frame with multiple threads
since the exact cut between independent blocks is not known.
MT decoding needs further discussions.
use ZSTD_freeCCtxPool() to release the partially created pool.
avoids to duplicate logic.
Also : identified a new difficult corner case :
when freeing the Pool, all CCtx should be previously released back to the pool.
Otherwise, it means some CCtx are still in use.
There is currently no clear policy on what to do in such a case.
Note : it's supposed to never happen.
Since pool creation/usage is static, it has no external user,
which limits risks.
When the overflow protection kicks in, it makes sure that ip - ctx->base
isn't too large. However, it didn't ensure that saved offsets are
still valid. This change ensures that any valid offsets (<= windowLog)
are still representable after the update.
The bug would shop up on line 1056, when `offset_1 > current + 1`, which
causes an underflow. This in turn, would cause a segfault on line 1063.
The input must necessarily be longer than 1 GB for this issue to occur.
Even then, it only occurs if one of the last 3 matches is larger than
the chain size and block size.
* upstream/dev:
added doc\zstd_manual.html
added contrib\gen_html
zstd_compression_format.md moved to doc/
Fix small bug in ZSTD_execSequence()
improved ZSTD_compressBlock_opt_extDict_generic
protect ZSTD_decodeFrameHeader() from invalid usage, as suggested by @spaskob
zstd_opt.h: small improvement in compression ratio
improved dicitonary segment merge
use implicit rules to compile zstd_decompress.c
detect early impossible decompression scenario in legacy decoder v0.5
no repeat mode in legacy v0.5
fixed invalid invocation of dictionary in legacy decoder v0.5
fix edge case
fix command line interpretation
fixed minor corner case
zstd.h: added the Introduction section
fixed clang 3.5 warnings
zstd.h: updated comments
If a dictionary specifies that a symbol has probability zero in its
`matchLength`, `literalLength`, or `offset` FSE table, but the symbol
appears when compressing input, the compressor fails.
Ensure that dictionaries support all `matchLength`, and `literalLength`
codes. They must also support all of the `offset` codes required to
represent every possible offset that can appear in the first block.
If `w ==0` on line 153, then `CTable[n].nbBits == tableLog + 1`.
Then `nbPerRank[CTable[n].nbBits]` and `valPerRank[CTable[n].nbBits]`
are stack buffer overflows.