Dictionaries are prebuilt and saved as part of the data object.
The config decides whether or not to use the dictionary when one is
available. Configs that require dictionaries are only run with
data that has a dictionary. Each method skips configs that don't
apply to it; for example, the ZSTD_compress() method skips configs
with dictionaries.
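For context, a minimal sketch of why a dictionary-less method has to skip dictionary configs: the simple one-shot API takes no dictionary, so a dictionary config has to go through a dictionary-aware call such as `ZSTD_compress_usingDict()`. The helper below is illustrative, not the harness code itself.

```c
#include <zstd.h>

/* Illustrative helper: a config that requires a dictionary needs a
 * dictionary-aware entry point; the plain one-shot API has no way to take one. */
static size_t compress_with_optional_dict(ZSTD_CCtx* cctx,
                                          void* dst, size_t dstCapacity,
                                          const void* src, size_t srcSize,
                                          const void* dict, size_t dictSize,
                                          int level)
{
    if (dict == NULL)  /* no dictionary: the simple one-shot API suffices */
        return ZSTD_compressCCtx(cctx, dst, dstCapacity, src, srcSize, level);
    return ZSTD_compress_usingDict(cctx, dst, dstCapacity, src, srcSize,
                                   dict, dictSize, level);
}
```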
I've also trimmed the silesia source to 1 MB per file (12 MB total),
and added 500 samples from the github data set with a dictionary.
I've intentionally added an extra line to the `results.csv` to make
the nightly build fail, so that we can see how CircleCI reports it.
Full list of changes:
* Add pre-built dictionaries to the data.
* Add `use_dictionary` and `no_pledged_src_size` flags to the config.
* Add a config using a dictionary for every level.
* Add a config that specifies no pledged source size.
* Support dictionaries and streaming in the `zstdcli` method.
* Add a context-reuse method using `ZSTD_compressCCtx()`.
* Clean up the formatting of the `results.csv` file to align columns.
* Add `--data`, `--config`, and `--method` flags to constrain each
to a particular value. This is useful for debugging a failure
or debugging a particular config/method/data.
ZSTD_compress_generic() is renamed ZSTD_compressStream2().
Note that, for the time being,
the "stable" API and the advanced one use different parameter planes:
setting parameters using the advanced API does not influence ZSTD_compressStream(),
and using ZSTD_initCStream() does not influence parameters for ZSTD_compressStream2().
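A hedged sketch of the two parameter planes, written with the parameter names from current zstd.h (`ZSTD_c_compressionLevel`, `ZSTD_e_end`); the enum spellings were different at the time of this change, but the separation is the point:

```c
#include <zstd.h>

/* Advanced plane: parameters set on the context drive ZSTD_compressStream2(). */
static size_t stream_advanced(ZSTD_CCtx* cctx, ZSTD_outBuffer* out, ZSTD_inBuffer* in)
{
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19);
    return ZSTD_compressStream2(cctx, out, in, ZSTD_e_end);
}

/* "Stable" plane: ZSTD_initCStream() configures ZSTD_compressStream() only;
 * it does not affect parameters seen by ZSTD_compressStream2(), and vice versa. */
static size_t stream_stable(ZSTD_CStream* zcs, ZSTD_outBuffer* out, ZSTD_inBuffer* in)
{
    size_t r = ZSTD_initCStream(zcs, 19);
    if (ZSTD_isError(r)) return r;
    r = ZSTD_compressStream(zcs, out, in);
    if (ZSTD_isError(r)) return r;
    return ZSTD_endStream(zcs, out);
}
```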
The regression tests run nightly or on the `regression`
branch for convenience. The results get uploaded as the
artifacts of the job. If they change, check the diff
printed in the job. If all is well, download the new
results and commit them to the repo.
This code only runs on UNIX-like platforms. It
could be made to run on Windows, but I don't think
that is necessary. It also requires C99.
* data: This module defines the data to run tests on.
It downloads data from a URL into a cache directory,
checks it against a checksum, and unpacks it. It also
provides helpers for accessing the data.
* config: This module defines the configs to run tests
with. A config is a set of API parameters and a set of
CLI flags.
* result: This module is a helper for the method module;
  it defines the result type.
* method: This module defines the compression methods
to test. It is what runs the regression test using the
data and the config. It reports the total compressed
size, or an error/skip.
* test: This is the test binary that runs the tests for
every (data, config, method) tuple, and prints the
results to the output file and stderr.
* results.csv: The results that the current commit is
expected to produce.
answering #1407.
Also: removed the obsolete function ZSTD_setDStreamParameter(),
which could only be used with one parameter (DStream_p_maxWindowSize).
It is now replaced by ZSTD_DCtx_setWindowSize() (which has existed for a few revisions).
Changed the workspace parameter convention
to always provide workspaceSize,
so that the size can be explicitly checked.
Also, use more enums to make the meaning of some parameters more explicit.
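The same workspace + explicit workspaceSize pattern is visible in the public static-allocation entry points; a minimal sketch (assuming the caller simply mallocs the workspace, and with ownership handling simplified):

```c
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdlib.h>

/* Workspace with explicit workspaceSize: the callee checks the size
 * and returns NULL if the workspace is too small. */
static ZSTD_CCtx* make_static_cctx(int compressionLevel)
{
    size_t const workspaceSize = ZSTD_estimateCCtxSize(compressionLevel);
    void* const workspace = malloc(workspaceSize);  /* must outlive the CCtx */
    if (workspace == NULL) return NULL;
    ZSTD_CCtx* const cctx = ZSTD_initStaticCCtx(workspace, workspaceSize);
    if (cctx == NULL) free(workspace);  /* size check failed */
    return cctx;
}
```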
fix #1385
decompressing into NULL was an automatic error.
It is now allowed, as long as the content of the frame is empty.
Seems to simplify things for `arrow`.
Maybe some other projects rely on this behavior?
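A minimal sketch of the new behavior (an empty frame decompressed into a NULL, zero-capacity buffer):

```c
#include <zstd.h>
#include <assert.h>

/* Decompressing an empty frame into (NULL, 0) now succeeds and yields 0 bytes. */
static void check_empty_frame_into_null(void)
{
    char frame[64];
    char const empty = 0;
    size_t const frameSize = ZSTD_compress(frame, sizeof(frame), &empty, 0, 1);
    assert(!ZSTD_isError(frameSize));
    size_t const dSize = ZSTD_decompress(NULL, 0, frame, frameSize);
    assert(!ZSTD_isError(dSize) && dSize == 0);
}
```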
fix #1379
decodecorpus was generating one extraneous byte when `nbSeq==0`.
This is disallowed by the specification.
The reference decoder was just skipping the extraneous byte.
It is now stricter, and flags such a situation as an error.
Also slightly refactored the affected function.
Honestly, the formula calculating the variance should get a second review round;
it's not clear whether it's correct.
We could allocate up to 2^28 bytes of memory when using 2 threads with
window log = 24. Now, we limit it to 2^26 bytes of memory when not running
big tests.
I chose a max window log of 22 since 2^22 bytes is the maximum source size when
big tests are disabled. Hopefully this will be enough to reduce or
eliminate the test failures.
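Not the harness change itself, just an illustration of the idea: capping the window log through the advanced API bounds the per-thread buffers in multi-threaded mode (return codes ignored for brevity).

```c
#include <zstd.h>

/* Cap memory by bounding the window: windowLog 22 = 4 MiB window, 2 worker threads. */
static void cap_test_memory(ZSTD_CCtx* cctx)
{
    /* error checks omitted in this sketch */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 2);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 22);
}
```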
The experimental function ZSTD_compressBlock() is designed with very small data in mind,
for situations where saving the ~12 bytes of frame header can actually make a difference.
Some systems, though, may have to deal with small and large data intermixed.
If the data is larger than a block (> 128 KB), ZSTD_compressBlock() cannot compress it in one round.
That's why it's possible to compress in multiple rounds,
producing a chain of compressed blocks.
Some users push this capability to the limit, encoding gigantic chains of blocks.
On crossing the 4 GB limit, an internal overflow occurs.
This fix moves the overflow correction mechanism higher in the call chain,
so that it's also applied to gigantic chains of blocks.
Added a test case in fuzzer.c, which crashes before the fix and passes now.
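A minimal sketch of the block-chain pattern the fix targets, using the experimental block API declared under `ZSTD_STATIC_LINKING_ONLY`; the caller-side framing needed to decompress the chain (per-block sizes, raw-block markers) is omitted.

```c
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>

/* Compress a large buffer as a chain of blocks, one ZSTD_compressBlock() per round.
 * Note: a return value of 0 means "not compressible"; the caller must then store
 * that block raw and record it in its own framing (not shown). */
static size_t compress_block_chain(ZSTD_CCtx* cctx,
                                   void* dst, size_t dstCapacity,
                                   const void* src, size_t srcSize)
{
    size_t written = 0;
    size_t const begin = ZSTD_compressBegin(cctx, 3 /* compression level */);
    if (ZSTD_isError(begin)) return begin;
    size_t const blockSize = ZSTD_getBlockSize(cctx);  /* <= 128 KB */
    while (srcSize > 0) {
        size_t const chunk = srcSize < blockSize ? srcSize : blockSize;
        size_t const cSize = ZSTD_compressBlock(cctx,
                                 (char*)dst + written, dstCapacity - written,
                                 src, chunk);
        if (ZSTD_isError(cSize)) return cSize;
        written += cSize;
        src = (const char*)src + chunk;
        srcSize -= chunk;
    }
    return written;
}
```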
* Update CircleCI to use workflows.
We can now specify any number of test jobs to run in parallel.
* Switch the image to `buildpack-deps:trusty`, which is only 500 MB
instead of 7 GB; that saves 7 minutes of download time when the image
isn't already cached on the host.
* Publish the source tarball and sha256sum as artifacts.
* If the `GITHUB_TOKEN` environment variable is set, we will also
add the tarball + sha256sum to the tagged release, after manual
approval.
Sometimes, it's necessary to test that a certain command fails, as expected.
Such a failure is actually a success, and must not stop the flow of tests.
Several tests were prefixed with `!` to invert the return code.
This does not work: it effectively makes the tests pass no matter what.
Use the function die() instead, which is meant to trap successes and transform them into errors.
The minimum (most negative) compression level can be probed using the new function ZSTD_minCLevel().
Also: redefined ZSTD_TARGETLENGTH_MIN/MAX for consistency,
and used the opportunity to bump the version number to v1.3.6.
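A minimal sketch of probing the new lower bound; the helper `compress_fastest` is illustrative, not part of the library.

```c
#include <zstd.h>

/* Use the fastest (most negative) level that the library reports. */
static size_t compress_fastest(void* dst, size_t dstCapacity,
                               const void* src, size_t srcSize)
{
    int const fastest = ZSTD_minCLevel();  /* a negative value */
    return ZSTD_compress(dst, dstCapacity, src, srcSize, fastest);
}
```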
tests/playTests.sh uses 'head -c' in a couple of tests to truncate the
last byte of a file. The '-c' option is non-portable (not in POSIX).
Instead use a wrapper around dd (truncateLastByte).
They were pretty easy to trigger, by the way:
just start an extended paramgrill session
to find a compression table based on any sample,
and it would necessarily happen at some point.
The assert() calls in paramgrill are not in the benchmark path.
They should remain active, as they don't impact measurements, and their runtime is insignificant.
* Minor fix
* Run non-optimize FASTCOVER 5 times in benchmark
* Merge fastCover into dictBuilder
* Fix mixed declaration issue
* Add fastcover to symbol.c
* Add fastCover.c and cover.h to build
* Change fastCover.c to fastcover.c
* Update benchmark to run FASTCOVER in dictBuilder
* Undo splitting fastcover_param into cover_param and f
* Remove convert param functions
* Assign f to parameter
* Add zdict.h to Makefile in lib
* Add cover.h to BUCK
* Cast 1 to U64 before shifting
* Remove trimming of zero freq head and tail in selectSegment and rebenchmark
* Remove f as a separate parameter of tryParam
* Read 8 bytes when d is 6
* Add trimming off zero frequency head and tail
* Use best functions from COVER and remove the trimming part (which leads to a worse compression ratio after previous bugs were fixed)
* Add finalize= argument to FASTCOVER to specify percentage of training samples passed to ZDICT_finalizeDictionary
* Change nbDmer to always read 8 bytes even when d=6
* Add skip=# argument to allow skipping dmers in computeFrequency in FASTCOVER
* Update comments and benchmarking result
* Change default method of ZDICT_trainFromBuffer to ZDICT_optimizeTrainFromBuffer_fastCover
* Add dictType enum and fix bug about passing zParam when converting to coverParam
* Combine finalize and skip into a single parameter
* Update acceleration parameters and benchmark on 3 sample sets
* Change default splitPoint of FASTCOVER to 0.75 and benchmark first 3 sample sets
* Initialize variables outside of for loop in benchmark.c
* Update benchmark result for hg-manifest
* Remove cover.h from install-includes
* Add explanation of f
* Set default compression level for trainFromBuffer to 3
* Add assertion of fastCoverParams in DiB_trainFromFiles
* Add checkTotalCompressedSize function + some minor fixes
* Add test for multithreading fastCover
* Initialize segmentFreqs in every FASTCOVER_selectSegment and move mutex_unlock to end of COVER_best_finish
* Free segmentFreqs
* Initialize segmentFreqs before calling FASTCOVER_buildDictionary instead of in FASTCOVER_selectSegment
* Add FASTCOVER_MEMMULT
* Minor fix
* Update benchmarking result
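With fastcover merged into dictBuilder and made the default optimizer behind ZDICT_trainFromBuffer (per the list above), a minimal sketch of training a dictionary from concatenated samples, following the buffer layout documented in zdict.h:

```c
#include <zdict.h>

/* Train a dictionary; since this change, ZDICT_trainFromBuffer() routes through
 * the fastcover-based optimizer by default. `samples` holds all samples
 * concatenated back to back; `sampleSizes[i]` is the size of sample i. */
static size_t train_dict(void* dictBuffer, size_t dictCapacity,
                         const void* samples, const size_t* sampleSizes,
                         unsigned nbSamples)
{
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                  samples, sampleSizes, nbSamples);
    if (ZDICT_isError(dictSize)) return dictSize;  /* error code */
    return dictSize;  /* size of the dictionary written into dictBuffer */
}
```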
Per warnings from flawfinder: "Does not check for buffer overflows when
copying to destination [MS-banned] (CWE-120). Consider using snprintf,
strcpy_s, or strlcpy (warning: strncpy easily misused).".
Replaced calls to strcpy and strcat in `fileio.c` with calls that take an
explicit size (`strncpy` and `strncat`).
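The pattern, roughly (an illustrative helper, not the exact fileio.c code); note that strncpy() does not NUL-terminate on truncation, so the terminator is written explicitly:

```c
#include <string.h>

/* Bounded copy: always NUL-terminate, even when src is longer than the buffer.
 * Assumes dstCapacity > 0. */
static void bounded_copy(char* dst, size_t dstCapacity, const char* src)
{
    strncpy(dst, src, dstCapacity - 1);
    dst[dstCapacity - 1] = '\0';
}
```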
Tested the changes on OSX, Linux, Windows.
On OSX + Linux, changes were tested with ASAN. The following flags were
used: 'check_initialization_order=1:strict_init_order=1:detect_odr_violation=1:detect_stack_use_after_return=1'
To reproduce the warning:
./flawfinder.py ./programs/fileio.c