Estimate the cost for using FSE modes `set_basic`, `set_compressed`, and
`set_repeat`, and select the one with the lowest cost.
* The cost of `set_basic` is computed using the cross-entropy cost
function `ZSTD_crossEntropyCost()`, using the normalized default count
and the count.
* The cost of `set_repeat` is computed using `FSE_bitCost()`. We check the
previous table to see if it is able to represent the distribution.
* The cost of `set_compressed` is computed with the entropy cost function
`ZSTD_entropyCost()`, together with the cost of writing the normalized
count `ZSTD_NCountCost()`.
The cover algorithm selects one segment per epoch, and it selects the epoch
size such that `epochs * segmentSize ~= dictSize`. Selecting less epochs
gives the algorithm more candidates to choose from for each segment it
selects, and then it will loop back to the first epoch when it hits the
last one.
The trade off is that now it takes longer to select each segment, since it
has to look at more data before making a choice.
I benchmarked on the following data sets using this command:
```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```
| Data set | k (approx) | Before | After | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub | ~1000 | 738138 | 746610 | +1.14% |
| hg-changelog | ~90 | 4295156 | 4285336 | -0.23% |
| hg-commands | ~500 | 1095580 | 1079814 | -1.44% |
| hg-manifest | ~400 | 16559892 | 16504346 | -0.34% |
There is some noise in the measurements, since small changes to `k` can
have large differences, which is why I'm using `steps=256`, to try to
minimize the noise. However, the GitHub data set still has some noise.
If I run the GitHub data set on my Mac, which presumably lists directory
entries in a different order, so the dictionary builder sees the files in
a different order, or I use `steps=1024` I see these results.
| Run | Before | After | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 | -0.50% |
| MacBook | 738451 | 737132 | -0.18% |
Question: Should we expose this as a parameter? I don't think it is
necessary. Someone might want to turn it up to exchange a much longer
dictionary building time in exchange for a slightly better dictionary.
I tested `2`, `4`, and `16`, and `4` got most of the benefit of `16`
with a faster running time.
this patch makes btultra do 2 passes on the first block,
the first one being dedicated to collecting statistics
so that the 2nd pass is more accurate.
It translates into a very small compression ratio gain :
enwik7, level 20:
blocks 4K : 2.142 -> 2.153
blocks 16K : 2.447 -> 2.457
blocks 64K : 2.716 -> 2.726
On the other hand, the cpu cost is doubled.
The trade off looks bad.
Though, that's ultimately a price to pay to reach better compression ratio.
So it's only enabled when setting btultra.
this improves compression ratio by a *tiny* amount.
It also reduces speed by a small amount.
Consequently, bit-fractional evaluation is only turned on for btultra.
ZSTD_decompress() can decompress multiple frames sent as a single input.
But the input size must be the exact sum of all compressed frames, no more.
In the case of a mistake on srcSize, being larger than required,
ZSTD_decompress() will try to decompress a new frame after current one, and fail.
As a consequence, it will issue an error code, ERROR(prefix_unknown).
While the error is technically correct
(the decoder could not recognise the header of _next_ frame),
it's confusing, as users will believe that the first header of the first frame is wrong,
which is not the case (it's correct).
It makes it more difficult to understand that the error is in the source size, which is too large.
This patch changes the error code provided in such a scenario.
If (at least) a first frame was successfully decoded,
and then following bytes are garbage values,
the decoder assumes the provided input size is wrong (too large),
and issue the error code ERROR(srcSize_wrong).
for FSE symbols.
While it seems to work, the gains are negligible compared to rough maxNbBits evaluation.
There are even a few losses sometimes, that still need to be explained.
Furthermode, there are still cases where btlazy2 does a better job than btopt,
which seems rather strange too.
for proper estimation of symbol's weights
when using dictionary compression.
Note : using only huffman costs is not good enough,
presumably because sequence symbol costs are incorrect.
Make sure that $(INCLUDEDIR) exists before copying the headers there.
Otherwise, the contest of header files is copied over
$(DESTDIR)$(INCLUDEDIR), making it a regular file.
While at it, remove $(DESTDIR)$(INCLUDEDIR) from the list of directories
to create in the install-pc target. The install-pc target does not need
this directory.
reported by @let-def.
It's actually a bug in ZSTD_compressBegin_usingCDict()
which would pass a wrong pledgedSrcSize value (0 instead of ZSTD_CONTENTSIZE_UNKNOWN)
resulting in wrong window size, resulting in downsized seqStore,
resulting in segfault when writing into the seqStore later in the process.
Added a test in fuzzer to cover this use case (fails before the patch).
The new advanced API basically set `requestedParams = appliedParams` when
using a dictionary. This halted all parameter adjustment, which can hurt
compression ratio if, for example, the window log is small for the first
call, but the rest of the files are large.
This patch fixes the bug, and checks that the `requestedParams` don't change
in the new advanced API when using a dictionary, and generally in the fuzzer.
Zstdmt uses prefixes to load the overlap between segments. Loading extra
positions makes compression non-deterministic, depending on the previous
job the context was used for. Since loading extra position takes extra
time as well, only do it when creating a `ZSTD_CDict`.
Fixes#1077.
The `avgJobSize` must not be lower than 256 KB for single-pass mode.
In `zstd.h` we say the minimum value for `ZSTD_p_jobSize` is 1 MB,
so ensure that we always pick a size >= 1 MB.
Found by libFuzzer fuzzer tests with large input limits.
this makes it possible to specify extremely large negative compression levels,
achieving the side effect as "no compression".
It will also be possible to define larger targetlength for ultra compression mode.
There is no adverse side effect due to removing this limit.
Integrate ldm into zstdmt by running it in serial and in order in the first
step of each job, in the same place as the hash gets updated. The input
buffer is sized to fit the whole LDM window and 2 full buffers of slack.
Input buffers cannot be reused until the LDM step is done with them.
After the LDM step is finished, the jobs don't actually have access to the
full window, only the overlap.
Tested on a few different multi-GB files with and without sanitizers,
and with different numbers of threads.
* Computes the XXH hash in the worker threads.
* Workers get a sequence number and wait until ther number shows up. On
error, ensures that its sequence is finished, so future threads don't
get blocked.
* Sets up for ldm integration, which will go in the same spot.
Setting `loadedDictEnd` was accidently removed from `ZSTD_loadDictionaryContent()`,
which means that dictionary compression will only be able to reference the parts of
the dictionary within the window. The spec allows us to reference the entire
dictionary so long as even one byte is in the window.
`ZSTD_enforceMaxDist()` incorrectly always allowed offsets up to `loadedDictEnd`
beyond the window, even once the dictionary was out of range.
When overflow protection kicked in, the check `current > loadedDictEnd + maxDist`
is incorrect if `loadedDictEnd` isn't reset back to zero. `current` could be reset
below the value, which would incorrectly allow references beyond the window. This
bug is present in `master`, but is very hard to trigger, since it requires both
dictionaries and data which triggers overflow correction.