* `ZSTD_decompressDCtx()` did not use the dictionary loaded by
`ZSTD_DCtx_loadDictionary()`.
* Add a unit test.
* A stacked diff uses `ZSTD_decompressDCtx()` in the
`dictionary_round_trip` and `dictionary_decompress` fuzzers.
Zstd compression sometimes does different stuff when it has at least
`ZSTD_compressBound()` output bytes, or not. Half of the time fuzz with
`ZSTD_compressBound() - 1` output bytes. Ensure that we have at least
one byte of overhead by disabling either the dictionary ID or checksum.
`ZSTD_compress2()` wouldn't wait for multithreaded compression to
finish. We didn't find this because ZSTDMT will block when it can
compress all in one go, but it can't do that if it doesn't have enough
output space, or if `ZSTD_c_rsyncable` is enabled.
Since we will already sometimes block when using `ZSTD_e_end`, I've
changed `ZSTD_e_end` and `ZSTD_e_flush` to guarantee maximum forward
progress. This simplifies the API, and helps users avoid the easy bug
that was made in `ZSTD_compress2()`
* Found by the libfuzzer fuzzers.
* Added a test case that catches the problem.
* I will make the fuzzers sometimes allocate less than
`ZSTD_compressBound()` output space.
It wasn't using the ZSTD_CCtx_params correctly. It must actualize
the compression parameters by calling ZSTD_getCParamsFromCCtxParams()
to get the real window log.
Tested by updating the streaming memory usage example in the next
commit. The CHECK() failed before this patch, and passes after.
I also added a unit test to zstreamtest.c that failed before this
patch, and passes after.
* After loading a dictionary only create the cdict once we've started the
compression job. This allows the user to pass the dictionary before they
set other settings, and is in line with the rest of the API.
* Add tests that mix the 3 dictionary loading APIs.
* Add extra tests for `ZSTD_CCtx_loadDictionary()`.
* The first 2 tests added fail before this patch.
* Run the regression test suite.
The order you set parameters in the advanced API is not supposed to matter.
However, once you call `ZSTD_CCtx_refCDict()` the compression parameters
cannot be changed. Remove that restriction, and document what parameters
are used when using a CDict.
If the CCtx is in dictionary mode, then the CDict's parameters are used.
If the CCtx is not in dictionary mode, then its requested parameters are
used.
Introduces a new utility function `ZSTD_findFrameCompressedSize_internal` which
is equivalent to `ZSTD_findFrameCompressSize`, but accepts an additional output
parameter `bound` that computes an upper-bound for the compressed data in the frame.
The new API function is named `ZSTD_decompressBound` to be consistent with
`zstd_compressBound` (the inverse operation). Clients will now be able to compute an upper-bound for
their compressed payloads instead of guessing a large size.
Implements https://github.com/facebook/zstd/issues/1536.
* Move all ZSTDMT parameter setting code to ZSTD_CCtxParams_*Parameter().
ZSTDMT now calls these functions, so we can keep all the logic in the
same place.
* Clean up `ZSTD_CCtx_setParameter()` to only add extra checks where needed.
* Clean up `ZSTDMT_initJobCCtxParams()` by copying all parameters by default,
and then zeroing the ones that need to be zeroed. We've missed adding several
parameters here, and it makes more sense to only have to update it if you
change something in ZSTDMT.
* Add `ZSTDMT_cParam_clampBounds()` to clamp a parameter into its valid
range. Use this to keep backwards compatibility when setting ZSTDMT parameters,
which clamp into the valid range.
Test a positive compression level with uncompressed literals,
and a negative compression level with compressed literals.
I double checked the `results.csv` and made sure that the compressed
sizes make sense.
Pull request #1499 added a new test, which uses 'head -c'. The '-c'
option is non-portable (not in POSIX). Instead use 'dd'. Similar issue
has been resolved in the past (#1321).
fseek() doesn't indicate when it moves past the end of a file.
Consequently, if a file is truncated within its last block, the error would't be detected.
This PR adds a test scenario that induces this situation using a small compressed file of only one block in size.
This test is added to tests/playTests.sh
Check is implemented by ensuring that the filehandle position is equal to the filesize upon exit.
On Windows, the equivalent of `/dev/null` is `NUL`.
When tests are run under msys2/minGW,
the environment identifies itself as Windows,
hence the script uses `NUL` instead of `/dev/null`
but the environment will consider `NUL` to be a regular file name.
Consequently, `NUL` will be overwritten during tests,
triggering an error.
This patch uses flag `-f` to force such overwrite
passing the test.
as suggested in #1441.
generally U32 and unsigned are the same thing,
except when they are not ...
case : 32-bit compilation for MIPS (uint32_t == unsigned long)
A vast majority of transformation consists in transforming U32 into unsigned.
In rare cases, it's the other way around (typically for internal code, such as seeds).
Among a few issues this patches solves :
- some parameters were declared with type `unsigned` in *.h,
but with type `U32` in their implementation *.c .
- some parameters have type unsigned*,
but the caller user a pointer to U32 instead.
These fixes are useful.
However, the bulk of changes is about %u formating,
which requires unsigned type,
but generally receives U32 values instead,
often just for brevity (U32 is shorter than unsigned).
These changes are generally minor, or even annoying.
As a consequence, the amount of code changed is larger than I would expect for such a patch.
Testing is also a pain :
it requires manually modifying `mem.h`,
in order to lie about `U32`
and force it to be an `unsigned long` typically.
On a 64-bit system, this will break the equivalence unsigned == U32.
Unfortunately, it will also break a few static_assert(), controlling structure sizes.
So it also requires modifying `debug.h` to make `static_assert()` a noop.
And then reverting these changes.
So it's inconvenient, and as a consequence,
this property is currently not checked during CI tests.
Therefore, these problems can emerge again in the future.
I wonder if it is worth ensuring proper distinction of U32 != unsigned in CI tests.
It's another restriction for coding, adding more frustration during merge tests,
since most platforms don't need this distinction (hence contributor will not see it),
and while this can matter in theory, the number of platforms impacted seems minimal.
Thoughts ?
The problem was already masked,
due to no longer accepting tiny blocks for statistics.
But in case it could still happen with not-so-tiny blocks,
there is a stricter control which ensures that
nothing was already loaded prior to statistics collection.
* Add configs that test multithreading, LDM, and setting explicit
parameters.
* Update the `compress cctx` method to accept `ZSTD_parameters`.
* Compile against the multithreaded `libzstd.a`.
* Update `results.csv` for the new configs.
Unless you think there are more configs/methods I should test, I think
we have a fairly wide set of configs/methods, so I'll pause adding
more for now.
Compare the input and output files by their inode number and
refuse to open the output file if the input file is the same.
This doesn't work when (de)compressing multiple files to a single
file, but that is a very uncommon use case, mostly used for
benchmarking by me.
Fixes#1422.
* Fix `ZSTD_estimateCCtxSize()` with negative levels.
* Fix `ZSTD_estimateCStreamSize()` with negative levels.
* Add a unit test to test for this error.
The `--no-progress` flag disables zstd's progress bars, but leaves
the summary.
I've added simple tests to `playTests.sh` to make sure the parsing
works.
When we switched `ZSTD_SKIPPABLEHEADERSIZE` to a macro, the places where we do:
MEM_readLE32(ptr) + ZSTD_SKIPPABLEHEADERSIZE
can now overflow `(unsigned)-8` to `0` and we infinite loop. We now check
the frame size and reject sizes that overflow a U32.
Note that this bug never made it into a release, and was only in the dev branch
for a few days.
Credit to OSS-Fuzz
from overlapSizeLog.
Reasoning :
`overlapLog` is already used everwhere, in the code, command line and documentation.
`ZSTD_c_overlapSizeLog` feels unnecessarily different.
Dictionaries are prebuilt and saved as part of the data object.
The config decides whether or not to use the dictionary if it is
available. Configs that require dictionaries are only run with
data that have dictionaries. The method will skip configs that are
irrelevant, so for example ZSTD_compress() will skip configs with
dictionaries.
I've also trimmed the silesia source to 1MB per file (12 MB total),
and added 500 samples from the github data set with a dictionary.
I've intentionally added an extra line to the `results.csv` to make
the nightly build fail, so that we can see how CircleCI reports it.
Full list of changes:
* Add pre-built dictionaries to the data.
* Add `use_dictionary` and `no_pledged_src_size` flags to the config.
* Add a config using a dictionary for every level.
* Add a config that specifies no pledged source size.
* Support dictionaries and streaming in the `zstdcli` method.
* Add a context-reuse method using `ZSTD_compressCCtx()`.
* Clean up the formatting of the `results.csv` file to align columns.
* Add `--data`, `--config`, and `--method` flags to constrain each
to a particular value. This is useful for debugging a failure
or debugging a particular config/method/data.
ZSTD_compress_generic() is renamed ZSTD_compressStream2().
Note that, for the time being,
the "stable" API and advanced one use different parameter planes :
setting parameters using the advanced API does not influence ZSTD_compressStream()
and using ZSTD_initCStream() does not influence parameters for ZSTD_compressStream2().
The regression tests run nightly or on the `regression`
branch for convenience. The results get uploaded as the
artifacts of the job. If they change, check the diff
printed in the job. If all is well, download the new
results and commit them to the repo.
This code will only run on a UNIX like platform. It
could be made to run on Windows, but I don't think that
it is necessary. It also uses C99.
* data: This module defines the data to run tests on.
It downloads data from a URL into a cache directory,
checks it against a checksum, and unpacks it. It also
provides helpers for accessing the data.
* config: This module defines the configs to run tests
with. A config is a set of API parameters and a set of
CLI flags.
* result: This module is a helper for method that defines
the result type.
* method: This module defines the compression methods
to test. It is what runs the regression test using the
data and the config. It reports the total compressed
size, or an error/skip.
* test: This is the test binary that runs the tests for
every (data, config, method) tuple, and prints the
results to the output file and stderr.
* results.csv: The results that the current commit is
expected to produce.
answering #1407.
Also : removed obsolete function ZSTD_setDStreamParameter()
which could only be used with one parameter (DStream_p_maxWindowSize).
Now replaced by ZSTD_DCtx_setWindowSize() (which exists since a few revisions)
changed workspace parameter convention
to always provide workspaceSize,
so that size can be explicitly checked.
Also, use more enum to make the meaning of some parameters more explicit.
fix#1385
decompressing into NULL was an automatic error.
It is now allowed, as long as the content of the frame is empty.
Seems to simplify things for `arrow`.
Maybe some other projects rely on this behavior ?
fix#1379
decodecorpus was generating one extraneous byte when `nbSeq==0`.
This is disallowed by the specification.
The reference decoder was just skipping the extraneous byte.
It is now stricter, and flag such situation as an error.
and slightly refactored affected function.
Honestly, the formula calculating variance should get a second reviewing round,
it's not clear if it's correct.