Jennifer Liu
015a00af0f
Change cover_sum back to 2 parameters and fix splitPoint issues
2018-07-06 14:24:18 -07:00
Jennifer Liu
0bbff01211
Fix testing parameter
2018-07-05 22:40:32 -07:00
Jennifer Liu
a085d1aae1
Allow splitPoint==1.0 (using all samples for both training and testing)
2018-07-05 10:38:45 -07:00
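A minimal sketch of the train/test split these commits describe, assuming the split is a simple prefix/suffix partition of the sample list. The type and function names are hypothetical and are not taken from the zstd sources. When `splitPoint` is 1.0, every sample is used for both training and testing. The double-to-unsigned conversion is written as an explicit cast.

```c
/* Hypothetical result type: how many samples go to each set. */
typedef struct {
    unsigned nbTrain;
    unsigned nbTest;
} sketch_split;

/* Hypothetical helper: split nbSamples according to splitPoint in (0, 1]. */
static sketch_split sketch_split_samples(unsigned nbSamples, double splitPoint)
{
    sketch_split s;
    if (splitPoint < 1.0) {
        /* Explicit cast: truncate the fractional sample count. */
        s.nbTrain = (unsigned)((double)nbSamples * splitPoint);
        s.nbTest  = nbSamples - s.nbTrain;
    } else {
        /* splitPoint == 1.0: all samples serve for training and testing. */
        s.nbTrain = nbSamples;
        s.nbTest  = nbSamples;
    }
    return s;
}
```

For example, 100 samples with a split point of 0.75 would give 75 training samples and 25 test samples under this sketch.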
Jennifer Liu
0881184c89
Some edits based on pull request comments
2018-07-03 17:53:27 -07:00
Jennifer Liu
16e75e8804
Update minimal training sample size
2018-07-03 12:07:06 -07:00
Jennifer Liu
348e5f77a9
Add split=# to cli
2018-06-29 17:54:41 -07:00
Jennifer Liu
52fbbbcb6b
Explicitly cast double to unsigned
2018-06-29 16:17:20 -07:00
Jennifer Liu
f9d19b83fb
Fix variable declaration problem
2018-06-29 15:46:56 -07:00
Jennifer Liu
e061d84016
Another fix to comparator
2018-06-29 15:38:08 -07:00
Jennifer Liu
59797d3328
Fix splitPoint floating point comparison problem
2018-06-29 12:47:03 -07:00
Jennifer Liu
0ef06f2e8a
Split samples into train and test sets
2018-06-29 12:33:34 -07:00
Yann Collet
fa41bcc2c2
grouped debug functions into debug.h
...
There were 2 competing sets of debug functions
within zstd_internal.h and bitstream.h.
They were mostly duplicates, and required care to avoid interfering with each other.
There is now a single implementation, shared by both.
Significant change :
The macro variable ZSTD_DEBUG no longer exists;
it has been replaced by DEBUGLEVEL,
which required modifying several source files.
2018-06-13 15:43:09 -04:00
Nick Terrell
7cbb8bbbbf
[cover] Small compression ratio improvement
...
The cover algorithm selects one segment per epoch, and it selects the epoch
size such that `epochs * segmentSize ~= dictSize`. Selecting fewer epochs
gives the algorithm more candidates to choose from for each segment it
selects, and then it will loop back to the first epoch when it hits the
last one.
The trade off is that now it takes longer to select each segment, since it
has to look at more data before making a choice.
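As a rough illustration of the epoch arithmetic described above (a sketch only, not the code from this commit; the function names and the `factor` parameter are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of epochs for a dictionary of dictSize bytes
 * built from segments of segmentSize bytes. factor == 1 corresponds to
 * epochs * segmentSize ~= dictSize; a larger factor means fewer, larger
 * epochs, so each selection sees more candidates. */
static size_t sketch_epoch_count(size_t dictSize, size_t segmentSize, size_t factor)
{
    size_t epochs;
    assert(segmentSize > 0 && factor > 0);
    epochs = dictSize / segmentSize / factor;
    return epochs > 0 ? epochs : 1;  /* always keep at least one epoch */
}

/* Hypothetical helper: epoch examined for the i-th selected segment.
 * Selection loops back to the first epoch after the last one. */
static size_t sketch_epoch_for_segment(size_t segmentIndex, size_t epochs)
{
    assert(epochs > 0);
    return segmentIndex % epochs;
}
```

With a 112640-byte dictionary and 1000-byte segments, this sketch gives 112 epochs at `factor == 1` and 28 epochs at `factor == 4`, so the 29th segment would come from the first epoch again.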
I benchmarked on the following data sets using this command:
```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```
| Data set | k (approx) | Before | After | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub | ~1000 | 738138 | 746610 | +1.14% |
| hg-changelog | ~90 | 4295156 | 4285336 | -0.23% |
| hg-commands | ~500 | 1095580 | 1079814 | -1.44% |
| hg-manifest | ~400 | 16559892 | 16504346 | -0.34% |
There is some noise in the measurements, since small changes to `k` can
cause large differences, which is why I'm using `steps=256` to try to
minimize the noise. However, the GitHub data set still has some noise.
If I run the GitHub data set on my Mac (which presumably lists directory
entries in a different order, so the dictionary builder sees the files in
a different order), or if I use `steps=1024`, I see these results:
| Run | Before | After | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 | -0.50% |
| MacBook | 738451 | 737132 | -0.18% |
Question: Should we expose this as a parameter? I don't think it is
necessary. Someone might want to turn it up to trade a much longer
dictionary building time for a slightly better dictionary.
I tested `2`, `4`, and `16`, and `4` got most of the benefit of `16`
with a faster running time.
2018-05-18 16:15:27 -07:00
Yann Collet
1da629f2ad
Merge pull request #1104 from terrelln/fast-train
...
Allow negative compression levels in training
2018-04-09 14:16:20 -07:00
Nick Terrell
569e2abccd
Allow negative compression levels in training
...
* Set `dictCLevel` in `zstdcli.c`.
* Only set to default level if the compression level `== 0`, not `<= 0`.
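The second bullet can be sketched as follows. This is an illustration under assumptions: the names are hypothetical, and the default of 3 is taken from the "[dictBuilder] Set default compression level to 3" entry elsewhere in this log.

```c
/* Assumed default training level (see the 2017-08-24 entry in this log). */
#define SKETCH_DEFAULT_CLEVEL 3

/* Hypothetical helper: only a level of exactly 0 means "use the default".
 * Negative levels are valid requests and are passed through unchanged. */
static int sketch_resolve_training_level(int requestedLevel)
{
    return (requestedLevel == 0) ? SKETCH_DEFAULT_CLEVEL : requestedLevel;
}
```

A `<= 0` test in place of `== 0` would silently replace every negative level with the default, which is the behavior this commit removes.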
2018-04-09 12:12:03 -07:00
Björn Ketelaars
462aed6811
zstd requires a stable sort.
...
On OpenBSD qsort() is not guaranteed to be stable, their mergesort() is.
This fixes issue #1088 . All the hard work has been done by @terrelln.
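For readers unfamiliar with the issue: the C standard does not require `qsort()` to preserve the relative order of elements that compare equal. The sketch below shows one generic, portable way to get stable results from `qsort()`, by recording each element's original position and using it as a tie-breaker. This is not what the commit does (it relies on OpenBSD's `mergesort()`), and all names here are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical element type: a sort key plus its position before sorting. */
typedef struct {
    int    key;
    size_t originalIndex;
} sketch_item;

/* Comparator that never reports two distinct elements as equal:
 * ties on key are broken by original position. */
static int sketch_stable_cmp(const void* lp, const void* rp)
{
    const sketch_item* l = (const sketch_item*)lp;
    const sketch_item* r = (const sketch_item*)rp;
    if (l->key != r->key) return (l->key < r->key) ? -1 : 1;
    if (l->originalIndex != r->originalIndex)
        return (l->originalIndex < r->originalIndex) ? -1 : 1;
    return 0;
}

/* Records original positions, then sorts; the result is the same
 * regardless of whether the underlying qsort() is stable. */
static void sketch_stable_sort(sketch_item* items, size_t count)
{
    size_t i;
    for (i = 0; i < count; ++i) items[i].originalIndex = i;
    qsort(items, count, sizeof(items[0]), sketch_stable_cmp);
}
```

The cost of this approach is one extra index per element, which may matter for large inputs.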
2018-04-05 07:59:16 +02:00
Yann Collet
9f8ed23b5b
bumped version number to v1.3.4
...
also added a paragraph on using compression level with training mode
as this is a recurrent question (see for example #1004 )
2018-01-27 22:23:26 -08:00
Yann Collet
752bae4a48
added warning message
...
when pathological dataset is detected
(note : cover_optimize needs -v to display the warning)
2018-01-11 11:29:28 -08:00
Yann Collet
e8093dde09
fixed #304
...
Pathological samples may result in literal section being incompressible.
This case is now detected,
and literal distribution is replaced by one that can be written into the dictionary.
2018-01-11 11:16:32 -08:00
Yann Collet
218e9fe0fc
added a test case for dictBuilder failure
...
cyclic data set makes the entropy stage fail
now, onto a fix for #304 ...
2018-01-11 09:42:38 -08:00
Yann Collet
c173dbd6e7
no longer supported starting C++17
2017-12-04 18:00:53 -08:00
Nick Terrell
6c41adfb28
[libzstd] pthread function prefixed with ZSTD_
...
* `sed -i 's/pthread_/ZSTD_pthread_/g' lib/{,common,compress,decompress,dictBuilder}/*.[hc]`
* Fix up `lib/common/threading.[hc]`
* `sed -i s/PTHREAD_MUTEX_LOCK/ZSTD_PTHREAD_MUTEX_LOCK/g lib/compress/zstdmt_compress.c`
2017-09-27 11:48:48 -07:00
Yann Collet
77c137b3ae
minor comment refactor
2017-09-14 15:12:57 -07:00
Yann Collet
3128e03be6
updated license header
...
to clarify dual-license meaning as "or"
2017-09-08 00:09:23 -07:00
Nick Terrell
376f435914
[dictBuilder] Set default compression level to 3
2017-08-24 16:21:05 -07:00
Dmitriy Titarenko
20f715d709
Fix displayLevel overflow
2017-08-23 15:56:15 +05:00
Yann Collet
bd9c8ca146
Merge pull request #811 from terrelln/segmentSize
...
[cover] Fix end condition for small dictionary
2017-08-22 14:36:30 -07:00
Nick Terrell
29c2d9a4d0
[cover] Turn down notification for ZDICT subroutines
2017-08-21 14:28:31 -07:00
Nick Terrell
98de3f6847
[cover] Add dictionary size to compressed size
2017-08-21 14:23:17 -07:00
Nick Terrell
9a54a315aa
[cover] Convert score to U32 and check for zero
2017-08-21 13:30:07 -07:00
Nick Terrell
d49eb40c03
[cover] Stop when segmentSize is less than d
2017-08-21 13:10:03 -07:00
Nick Terrell
f306d400c0
[cover] Fix divide by zero
2017-08-21 11:12:11 -07:00
Yann Collet
32fb407c9d
updated a bunch of headers
...
for the new license
2017-08-18 16:52:05 -07:00
Yann Collet
b71363b967
check pthread_*_init() success condition
2017-07-19 01:05:40 -07:00
Yann Collet
2bd6440be0
pinned down error code enum values
...
Note : all error codes are changed by this new version,
but it's expected to be the last change for existing codes.
Codes are now grouped by category, and receive a manually attributed value.
The objective is to guarantee that
error code values will not change in the future
when introducing new codes.
Intentional empty spaces and ranges are defined
in order to keep room for potential new codes.
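The general technique looks roughly like this. The enumerator names and values below are invented for illustration and are not the ones in zstd_errors.h.

```c
/* Hypothetical error enum with manually pinned values.
 * Codes are grouped by category, and gaps are left between groups so that
 * new codes can be added later without renumbering existing ones. */
typedef enum {
    SKETCH_error_no_error      = 0,
    SKETCH_error_generic       = 1,
    /* 2-9: reserved for future general errors */
    SKETCH_error_bad_header    = 10,
    SKETCH_error_bad_checksum  = 12,
    /* 13-19: reserved for future format errors */
    SKETCH_error_out_of_memory = 20,
    SKETCH_error_maxCode       = 120  /* upper bound, not a real error */
} sketch_error_code;
```

Without explicit values, inserting a new enumerator in the middle would shift every later code, which breaks callers that stored or compared numeric values.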
2017-07-13 17:12:16 -07:00
Yann Collet
590937df20
Merge pull request #739 from facebook/refPrefix
...
ZSTD_refPrefix
2017-06-29 04:36:03 -07:00
Yann Collet
7d3816183f
exposed ZSTD_MAGIC_DICTIONARY in zstd.h
...
makes it easier to explain ZSTD_dictMode
2017-06-27 13:50:34 -07:00
Nick Terrell
5b7fd7c422
[zdict] Make COVER the default algorithm
2017-06-26 21:09:22 -07:00
Yann Collet
ee970398b2
Merge branch 'dev' into advancedAPI2
2017-05-22 12:33:56 -07:00
Nick Terrell
a1280406b0
[libzstd] Allow users to define custom visibility
2017-05-19 18:01:59 -07:00
Yann Collet
fa3671eac7
changed ZSTD_BLOCKSIZE_ABSOLUTEMAX into ZSTD_BLOCKSIZE_MAX
...
Also :
change ZSTD_getBlockSizeMax() into ZSTD_getBlockSize()
created ZSTD_BLOCKSIZELOG_MAX
2017-05-19 10:51:30 -07:00
Nick Terrell
f376d47c11
[CLI] Switch dictionary builder on CLI to cover
2017-05-02 11:18:27 -07:00
Nick Terrell
020b960e13
[cover] Make optimization faster
2017-05-02 11:02:48 -07:00
Nick Terrell
f2d9ef1dc0
[cover] Optimize case where d <= 8
2017-05-02 11:02:43 -07:00
Nick Terrell
865918dd04
Fix typo in zdict.h
2017-05-02 11:02:37 -07:00
Nick Terrell
5152fb2cb2
Convert all tabs to spaces
2017-03-29 18:51:58 -07:00
Yann Collet
4cf0093571
restored bonus rule
2017-03-26 14:51:00 -07:00
Yann Collet
69017bf253
Merge branch 'dev' into LegacyDictBuilder
2017-03-26 14:39:13 -07:00
Yann Collet
582760818f
minor refactor
...
add const
changed `if` to make it easier to add new conditions
2017-03-26 03:04:56 -07:00
Yann Collet
858f72eeb8
fixed dictBuilder issue
...
dictionary loading would fail during entropy analysis
2017-03-26 02:50:00 -07:00
Yann Collet
ecee9f2ef8
fixed conversion warnings
2017-03-26 00:59:14 -07:00
Yann Collet
4c41d37fcc
changed test for new syntax
...
--dictID= and --maxdict=
2017-03-24 18:36:56 -07:00
Yann Collet
d41f707e88
minor improvement : remove duplicates with 1 char prefix difference
2017-03-24 17:56:45 -07:00
Yann Collet
96aa3019b2
changed advanced commands --maxdict= and --dictID=
...
now work with the `=` variant, which is the recommended one.
Old variant `--dictID #` still works, for compatibility with existing scripts.
Long term objective is to remove the old variant.
2017-03-24 16:04:29 -07:00
Yann Collet
9da3b215ec
Ensure all limits derived from same constants
...
Now uses ZDICT_DICTSIZE_MIN and ZDICT_CONTENTSIZE_MIN
from zdict.h.
Also : reduced values to 256 and 128 respectively
2017-03-24 15:02:09 -07:00
Yann Collet
f332ece468
dictBuilder fails to create dictionary on certain input
...
Properly expressed with an error code (see zstd_errors.h)
and a cli return code != 0
2017-03-23 16:24:02 -07:00
Sean Purcell
042ba122ae
Change g_displayLevel to int and fix DISPLAYUPDATE flush
2017-03-23 11:21:59 -07:00
Nick Terrell
976e325b2e
Fix COVER_optimizeTrainFromBuffer() resource leaks
...
Thanks to @nemequ for reporting the resource leaks.
2017-03-02 15:54:39 -08:00
Nick Terrell
545987996a
Fix deprecation warnings for clang with C++14
2017-02-08 17:38:17 -08:00
Nick Terrell
71c5263c00
Attribute cover dictionary code
2017-02-07 11:35:07 -08:00
Nick Terrell
43474313f8
Fix documentation about memory usage
2017-01-27 18:43:05 -08:00
Nick Terrell
2fe9126591
Add multithread support to COVER
2017-01-27 11:56:02 -08:00
Nick Terrell
8d984699db
Document memory requirements for COVER algorithm
2017-01-09 18:20:10 -08:00
Nick Terrell
555e281637
Handle large input size in 32-bit mode correctly
2017-01-09 18:20:06 -08:00
Nick Terrell
3a1fefcf00
Simplify COVER parameters
2017-01-02 17:51:38 -08:00
Nick Terrell
96b39f65fa
Add COVER dictionary builder
2017-01-02 13:22:51 -08:00
Yann Collet
aca113f4f5
fixed ZSTD_sizeof_?Dict()
2016-12-23 22:25:03 +01:00
Nick Terrell
1b5d4a7d53
ZDICT_finalizeDictionary() flipped comparison
2016-12-22 18:14:57 -08:00
Nick Terrell
bcbe77e994
ZDICT_finalizeDictionary() flipped comparison
...
`ZDICT_finalizeDictionary()` had a flipped comparison.
I also allowed `dictBufferCapacity == dictContentSize`.
It might be the case that the user wants to fill the dictionary
completely up, and then let zstd take exactly the space it needs
for the entropy tables.
2016-12-22 18:01:14 -08:00
Nick Terrell
78a0072d5a
Fix failing test due to deprecation warning
2016-12-22 17:36:16 -08:00
Yann Collet
d76d1a9ef0
added ZDICT_finalizeDictionary()
2016-12-22 20:18:43 +01:00
Yann Collet
0819abe3c1
added ZSTD_createDDict_byReference() body
2016-12-21 19:25:15 +01:00
Yann Collet
1496c3dc47
Fix : size estimation when some samples are very large
2016-12-18 11:58:23 +01:00
Yann Collet
d46ecb58a5
added dll compilation tests
2016-12-17 16:28:12 +01:00
Nick Terrell
8de46ab51a
Export all API functions
2016-12-16 13:27:30 -08:00
Yann Collet
0a5a5fb7fd
Fix #418 : printing selected segments in zdict debug mode can segfault with certain pathological patterns
2016-11-02 13:57:55 -07:00
Yann Collet
52c1bf93fe
improved dictionary segment merge
2016-10-18 16:34:58 -07:00
Yann Collet
2b361cf2f1
minor opt
2016-10-14 16:09:07 -07:00
Yann Collet
df6797447f
update dictionary builder warning comments
2016-09-27 15:14:32 +02:00
Yann Collet
47094ea66b
added comment on filePos
2016-09-26 18:03:33 +02:00
Yann Collet
97b378a6f8
Streaming : dictionary compression on multiple files / segments can correctly provide srcSize into header (when provided) using pledgedSrcSize.
2016-09-21 17:20:19 +02:00
Yann Collet
d56dbc02d3
removed g_displayLevel
2016-09-02 17:28:41 -07:00
Yann Collet
855766d73d
clarified dictionary in format description
2016-09-02 17:04:49 -07:00
Yann Collet
d725427a3c
g_time => local displayTime
2016-09-02 15:32:39 -07:00
Yann Collet
4ded9e591c
added boilerplate
2016-08-30 11:06:28 -07:00
Yann Collet
3b15f1f10f
minor refactor
2016-08-30 09:58:50 -07:00
Yann Collet
87c18b2ebd
fixed multiple minor warnings for XCode
2016-08-26 01:43:47 +02:00
Yann Collet
da3fbcb302
Added ZDICT_getDictID()
2016-08-19 14:23:58 +02:00
Yann Collet
a5dbf9f629
Merge pull request #297 from borzunov/dev
...
Export functions related to dictionary compression from DLL
2016-08-18 15:05:01 +02:00
Yann Collet
49d105cfcf
better warning and error messages in case of dictionary training failure ( #292 )
2016-08-18 15:02:11 +02:00
Alexander Borzunov
0f6f17a14f
Rename ZSTDLIB_API to ZDICTLIB_API in zdict.h
2016-08-18 16:47:06 +05:00
Alexander Borzunov
1f48382b1a
Export functions related to dictionary compression from DLL
2016-08-18 16:12:49 +05:00
Yann Collet
e9b414d825
fixed msan warning ( #281 )
2016-08-11 22:09:09 +02:00
Yann Collet
e0b4a2d40f
fixed dictionary generation, reported by Bartosz Taudul
2016-08-03 03:36:03 +02:00
Yann Collet
bf2bc112bb
bench : controlled display update when loading a lot of files
2016-08-02 23:48:13 +02:00
Yann Collet
f116e87f59
fixed analyzer warning
2016-08-01 19:15:18 +02:00
Yann Collet
3ca750372d
updated doc ( #269 )
2016-08-01 02:26:20 +02:00
Yann Collet
70a9ff4af3
fixed too large selectivity level, reported by Ilona Papava
2016-07-30 01:09:14 +02:00
Yann Collet
c0ce4f1211
slightly improved compression speed
2016-07-30 00:55:13 +02:00
Yann Collet
ed57d8530a
new seqStore
2016-07-29 21:22:17 +02:00