initially uses calloc() instead of memset().
Performance improvement is unlikely measurable,
since ZSTD_CCtx is now very small,
with all tables transferred into workSpace.
now ZSTD_customCMem is promoted as new default.
Advantages : ZSTD_customCMem = { NULL, NULL, NULL},
so it's natural default after a memset.
ZSTD_customCMem is public constant
(defaultCustomMem was private only).
Also : makes it possible to introduce ZSTD_calloc(),
which can now default to stdlib's calloc()
when it detects system default.
Fixed zlibwrapper which depended on defaultCustomMem.
Everything converge towards ZSTD_compressBegin_internal
which delegated to ZSTD_copyCCtx_internal if cdict!=NULL.
This simplifies routing which was previously depending on cdict.
The following warning appears during build.
../lib/compress/huf_compress.c: In function ‘HUF_compress1X_usingCTable’:
../lib/compress/huf_compress.c:444:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (sizeof((stream)->bitContainer)*8 < HUF_TABLELOG_MAX*4+7) HUF_FLUSHBITS(stream)
^
../lib/compress/huf_compress.c:465:18: note: in expansion of macro ‘HUF_FLUSHBITS_2’
HUF_FLUSHBITS_2(&bitC);
^~~~~~~~~~~~~~~
../lib/compress/huf_compress.c:466:9: note: here
case 2 : HUF_encodeSymbol(&bitC, ip[n+ 1], CTable);
../lib/compress/zstd_compress.c: In function ‘ZSTD_compressStream_generic’:
../lib/compress/zstd_compress.c:3366:34: warning: this statement may fall through [-Wimplicit-fallthrough=]
zcs->streamStage = zcss_flush; /* pass-through to flush stage */
~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
../lib/compress/zstd_compress.c:3369:9: note: here
case zcss_flush:
Signed-off-by: Jos Collin <jcollin@redhat.com>
* upstream/dev: (305 commits)
added test for ZSTD_estimateCStreamSize()
changed variable name, for clarity
fixed ZSTD_estimateCStreamSize()
shortened ZSTD_createCStream_Advanced()
fixed symbols test
added ZSTD_estimateDStreamSize()
changed name frameParams into frameHeader
regroup memory usage function declarations
separated ZSTD_estimateCStreamSize() from ZSTD_estimateCCtxSize()
bumped version number
added ZSTD_estimateCDictSize() and ZSTD_estimateDDictSize()
Updated ZSTD_freeCCtx()
updated ZSTD_estimateCCtxSize()
Updated ZSTD_sizeof_CCtx()
merged CCtx and CStream as a single same object
cli : -d and -t do not stop after a failed decompression
added dev branch CircleCI badge
added dev branch Appveyor badge
keep dev branch status only
creates a binary archive without the `programs` directory
...
added a parameter streaming,
to estimate memory allocation size
when the CCtx is used for streaming (CStream).
Note : this function is not able to estimate
memory cost of a potential internal CDict
which can only happen when starting with ZSTD_initCStream_usingDict()
Previous -18 : 4.7 MB/s, R:3.833
New -18 : 5.1 MB/s. R:3.825
It's a better fit within -17 (6.8 MB/s) and -19 (4.0 MB/s)
The new level 18 also uses significantly less memory.
And, it makes a good transition between level 17 (mml5)
and level 19 (mml3).
Up to now, there was no level with mml4.
(note : minmatch setting can have a large impact on some (specific) datasets)
It now only uses compressionParameters as argument.
It produces many changes throughout user code,
though hopefully they tend to be simple :
just provide the cParams part from existing ZSTD_parameters.
Some programs might depend on ZSTD_createCDict_advanced() to pass frame parameters.
This change will force them to revisit this strategy and fix it,
since frame parameters are effectively silently ignored in current version.
clearer separation between variables and buffers
clearer buffers category
kept static buffers at the beginning, favoring cache locality
(it will be easier to add FSE tables there later)
This break a few assumptions that hashTable was always at the beginning.
This is fixed.
And remaining assumptions (namely that tables stand next to each other in memory)
are now tested with assert.
because by definition srcSize is not known when using this prototype.
added relevant test
Note : this use was already working, because at a later stage
(both ZSTD_compressBegin_usingCDict() and ZSTD_copyCCtx())
pledgedSrcSize=0 is translated into "unknown", no matter the frame parameter.
This is not correct, but of little importance,
as the medium term plan is to no longer set fParams within CDict
This is now the regroup point for ZSTD_initCStream*() functions
ZSTD_initCStream_advanced() now properly checks for parameters validity.
Also : added <assert.h> usage inside zstd_compress.c
Needs ZSTD_DEBUG=1 macro to be triggered.
Will be triggered by default from `tests` directory
does no longer allocate temporary buffers
when there is enough room in dstBuffer to decompress directly there.
(previous method would skip that for 1st chunk only).
Also : fix ZSTD_compressBound() for small srcSize
required so that if Total = A+B
compressBound(Total) <= compressBound(A) + compressBound(B)
under condition of a minimum size for A and B
Will help for ZSTDMT_compress() memory allocation
forgot to add the dictionary content
(tests were not failing, just compressing less).
Also : added size protections when adding dict content
since hc/bt table filling would fail if size < 8
The compressor always reuses the existing Huffman table if the literals
size is at most 1 KiB. If the compression strategy is `ZSTD_lazy` or
stronger always check to see if reusing the previous table or creating
a new table is better.
This doesn't yet weigh in decompression speed. I don't want to add any
heuristics there until I have real data to work with to ensure that the
heuristic works for at least one use case, preferably more.
* Compressor saves most recently used Huffman table and reuses it
if it produces better results.
* I attempted to preserve CPU usage profile.
I intentionally left all of the existing heuristics in place.
There is only a speed difference on the second block and later.
When compressing large enough blocks (say >= 4 KiB) there is
no significant difference in compression speed.
Dictionary compression of one block is the same speed for blocks
with literals <= 1 KiB, and after that the difference is not
very significant.
* In the synthetic data, with blocks 10 KB or smaller, most blocks
can't use repeated tables because the previous block did not
contain a symbol that the current block contains.
Once blocks are about 12 KB or more, most previous blocks have
valid Huffman tables for the current block, and the compression
ratio and decompression speed jumped.
* In silesia blocks as small as 4KB can frequently reuse the
previous Huffman table (85%), but it isn't as profitable, and
the previous Huffman table only gets used about 3% of the time.
* Microbenchmarks show that `HUF_validateCTable()` takes ~55 ns
and `HUF_estimateCompressedSize()` takes ~35 ns.
They are decently well optimized, the first versions took 90 ns
and 120 ns respectively. `HUF_validateCTable()` could be twice as
fast, if we cast the `HUF_CElt*` to a `U32*` and compare to 0.
However, `U32` has an alignment of 4 instead of 2, so I think that
might be undefined behavior.
* I've ran `zstreamtest` compiled normally, with UASAN and with MSAN
for 4 hours each.
The worst case for the speed difference is a bunch of small blocks
in the same frame. I modified `bench.c` to compress the input in a
single frame but with blocks of the given block size, set by `-B`.
Benchmarks on level 1:
| Program | Block size | Corpus | Ratio | Compression MB/s | Decompression MB/s |
|-----------|------------|-----------|-------|------------------|--------------------|
| zstd.base | 256 | synthetic | 2.364 | 110.0 | 297.0 |
| zstd | 256 | synthetic | 2.367 | 108.9 | 297.0 |
| zstd.base | 256 | silesia | 2.204 | 93.8 | 415.7 |
| zstd | 256 | silesia | 2.204 | 93.4 | 415.7 |
| zstd.base | 512 | synthetic | 2.594 | 144.2 | 420.0 |
| zstd | 512 | synthetic | 2.599 | 141.5 | 425.7 |
| zstd.base | 512 | silesia | 2.358 | 118.4 | 432.6 |
| zstd | 512 | silesia | 2.358 | 119.8 | 432.6 |
| zstd.base | 1024 | synthetic | 2.790 | 192.3 | 594.1 |
| zstd | 1024 | synthetic | 2.794 | 192.3 | 600.0 |
| zstd.base | 1024 | silesia | 2.524 | 148.2 | 464.2 |
| zstd | 1024 | silesia | 2.525 | 148.2 | 467.6 |
| zstd.base | 4096 | synthetic | 3.023 | 300.0 | 1000.0 |
| zstd | 4096 | synthetic | 3.024 | 300.0 | 1010.1 |
| zstd.base | 4096 | silesia | 2.779 | 223.1 | 623.5 |
| zstd | 4096 | silesia | 2.779 | 223.1 | 636.0 |
| zstd.base | 16384 | synthetic | 3.131 | 350.0 | 1150.1 |
| zstd | 16384 | synthetic | 3.152 | 350.0 | 1630.3 |
| zstd.base | 16384 | silesia | 2.871 | 296.5 | 883.3 |
| zstd | 16384 | silesia | 2.872 | 294.4 | 898.3 |
XXH_STATIC_LINKING_ONLY protection macro is intended to be triggered just before the include.
The main idea is to keep this setting local :
user module shall explicitly understand and accept the static linking restriction
which becomes transparent when triggering the macro at project level.
Global definition also triggers redefinition warnings for user modules which do locally define the macro.
This new version compiles lib and cli without warning when the macro is set globally.
That's not a scenario to be recommended, since it trades a local effect for a global one,
but it was easy enough to provide from zstd side.
There used to be a (very small) chance that
loading prefix from previous segment
would be confused with a real zstd dictionary.
For that to happen, the prefix needs to start
with the same value as dictionary magic.
That's 1 chance in 4 billions if all values have equal probability.
But in fact, since some values are more common (0x00000000 for example)
others are less common, and dictionary magic was selected to be one of them,
so probabilities are likely even lower.
Anyway, this risk is no down to zero
by adding a new CCtx parameter : ZSTD_p_forceRawDict
Current parameter policy : the parameter "stick" to its CCtx,
so any dictionary loading after ZSTD_p_forceRawDict is set
will be loaded in "raw" ("content only") mode,
even if CCtx is re-used multiple times with multiple different dictionary.
It's up to the user to reset this value differently if it needs so.
Reproduction steps:
```
make zstreamtest CC=clang CFLAGS="-O3 -g -fsanitize=memory -fsanitize-memory-track-origins"
./zstreamtest -vv -t4178 -i4178 -s4531
```
How to get to the error in gdb (may be a more efficient way):
* 2 breaks at zstd_compress.c:2418 -- in ZSTD_compressContinue_internal()
* 2 breaks at zstd_compress.c:2276 -- in ZSTD_compressBlock_internal()
* 1 break at zstd_compress.c:1547
Why the error occurred:
When `zc->forceWindow == 1`, after calling `ZSTD_loadDictionaryContent()` we
have `zc->loadedDictEnd == zc->nextToUpdate == 0`. But, we've really loaded up
to `iend` into the dictionary. Then in `ZSTD_compressBlock_internal()` we see
that `current > zc->nextToUpdate + 384`, so we load the last 192 bytes a second
time. In this case the bytes we are loading are a block of all 0s, starting in
the previous block. So when we are loading the last 192 bytes, we find a `match`
in the future, 183 bytes beyond `ip`. Since the block is all 0s, the match
extends to the end of the block. But in `ZSTD_count()` we only check that
`pIn < pInLoopLimit`, but since `pMatch > pIn`, `pMatch` eventually points past
the end of the buffer, causing the MSAN failure.
The fix:
The line changed sets sets `zc->nextToUpdate` to the end of the dictionary.
This is the behavior that existed before `ZSTD_p_forceWindow` was introduced.
This fixes the exposing test case. Since the code doesn't fail without
`zc->forceWindow`, it makes sense that this works. I've run the command
`./zstreamtest -T2mn` 64 times without failures. CI should also verify nothing
obvious broke.
the minimum size condition size is applied transparently (no warning, no error)
like previous minimum section size condition (1 KB) which still applies.
Previous version was requiring a fairly large initial amount of input data
before starting to create compression jobs.
This new version starts the process much sooner.
fileio.c was continually pushing more content without giving a chance to flush compressed one.
It would block the job queue when input data was accumulated too fast (requiring to define many threads).
Fixed : fileio flushes whatever it can after each input attempt.
Sections 2+ read a bit of data from previous section
in order to improve compression ratio.
This also costs some CPU, to reference read data.
Read data is currently fixed to window>>3 size
By default, section sizes are 4x window size.
This new setting allow manual selection of section sizes.
The larger they are, the (slightly) better the compression ratio,
but also the higher the memory allocation cost,
and eventually the lesser the nb of possible threads,
since each section is compressed by a single thread.
It also introduces a prototype to set generic parameters,
ZSTDMT_setMTCtxParameter()
The idea is that it's possible to add enums
to extend the list of parameters that can be set this way.
This is more long-term oriented than a fixed-size struct.
Consider it as a test.
In some (rare) cases, job list could be blocked by a first job still being processed,
while all following ones are completed, waiting to be flushed.
In such case, the current job-table implementation is unable to accept new job.
As a consequence, a call to ZSTDMT_compressStream() can be useless (nothing read, nothing flushed),
with the risk to trigger a busy-wait on the caller side
(needlessly loop over ZSTDMT_compressStream() ).
In such a case, ZSTDMT_compressStream() will block until the first job is completed and ready to flush.
It ensures some forward progress by guaranteeing it will flush at least a part of the completed job.
Energy-wasting busy-wait is avoided.
Like ZSTD_initCStream_usingDict(),
ZSTDMT_initCStream_usingDict() now keep a copy of dict internally.
This way, dict can be released :
it does not longer have to outlive all future compression sessions.