Commit Graph

40 Commits

Author SHA1 Message Date
Jennifer Liu
5021441d86 Change default splitPoint to 100 2018-07-10 11:19:33 -07:00
Jennifer Liu
456f290e31 Change back to splitPoint<=0 2018-07-09 13:53:25 -07:00
Jennifer Liu
7efabb2cf6 Only make 0.0 default splitPoint 2018-07-09 12:26:53 -07:00
Jennifer Liu
015a00af0f Change cover_sum back to 2 parameters and fix splitPoint issues 2018-07-06 14:24:18 -07:00
Jennifer Liu
0bbff01211 Fix testing parameter 2018-07-05 22:40:32 -07:00
Jennifer Liu
a085d1aae1 Allow splitPoint==1.0 (using all samples for both training and testing) 2018-07-05 10:38:45 -07:00
Jennifer Liu
0881184c89 Some edits based on pull request comments 2018-07-03 17:53:27 -07:00
Jennifer Liu
16e75e8804 Update minimal training sample size 2018-07-03 12:07:06 -07:00
Jennifer Liu
348e5f77a9 Add split=# to cli 2018-06-29 17:54:41 -07:00
Jennifer Liu
52fbbbcb6b Explicitly cast double to unsigned 2018-06-29 16:17:20 -07:00
Jennifer Liu
f9d19b83fb Fix variable declaration problem 2018-06-29 15:46:56 -07:00
Jennifer Liu
e061d84016 Another fix to comparator 2018-06-29 15:38:08 -07:00
Jennifer Liu
59797d3328 Fix splitPoint floating point comparison problem 2018-06-29 12:47:03 -07:00
Jennifer Liu
0ef06f2e8a Split samples into train and test sets 2018-06-29 12:33:34 -07:00
Nick Terrell
7cbb8bbbbf [cover] Small compression ratio improvement
The cover algorithm selects one segment per epoch, and it selects the epoch
size such that `epochs * segmentSize ~= dictSize`. Selecting less epochs
gives the algorithm more candidates to choose from for each segment it
selects, and then it will loop back to the first epoch when it hits the
last one.

The trade off is that now it takes longer to select each segment, since it
has to look at more data before making a choice.

I benchmarked on the following data sets using this command:

```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```

| Data set     | k (approx) |  Before  |  After   | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub       | ~1000      |   738138 |   746610 |       +1.14% |
| hg-changelog | ~90        |  4295156 |  4285336 |       -0.23% |
| hg-commands  | ~500       |  1095580 |  1079814 |       -1.44% |
| hg-manifest  | ~400       | 16559892 | 16504346 |       -0.34% |

There is some noise in the measurements, since small changes to `k` can
have large differences, which is why I'm using `steps=256`, to try to
minimize the noise. However, the GitHub data set still has some noise.

If I run the GitHub data set on my Mac, which presumably lists directory
entries in a different order, so the dictionary builder sees the files in
a different order, or I use `steps=1024` I see these results.

| Run        | Before | After  | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 |       -0.50% |
| MacBook    | 738451 | 737132 |       -0.18% |

Question: Should we expose this as a parameter? I don't think it is
necessary. Someone might want to turn it up to exchange a much longer
dictionary building time in exchange for a slightly better dictionary.
I tested `2`, `4`, and `16`, and `4` got most of the benefit of `16`
with a faster running time.
2018-05-18 16:15:27 -07:00
Björn Ketelaars
462aed6811 zstd requires a stable sort.
On OpenBSD qsort() is not guaranteed to be stable, their mergesort() is.
This fixes issue #1088. All the hard work has been done by @terrelln.
2018-04-05 07:59:16 +02:00
Yann Collet
752bae4a48 added warning message
when pathological dataset is detected
(note : cover_optimize needs -v to display the warning)
2018-01-11 11:29:28 -08:00
Yann Collet
218e9fe0fc added a test case for dictBuilder failure
cyclic data set makes the entropy stage fails
now, onto a fix for #304 ...
2018-01-11 09:42:38 -08:00
Nick Terrell
6c41adfb28 [libzstd] pthread function prefixed with ZSTD_
* `sed -i 's/pthread_/ZSTD_pthread_/g' lib/{,common,compress,decompress,dictBuilder}/*.[hc]`
* Fix up `lib/common/threading.[hc]`
* `sed -i s/PTHREAD_MUTEX_LOCK/ZSTD_PTHREAD_MUTEX_LOCK/g lib/compress/zstdmt_compress.c`
2017-09-27 11:48:48 -07:00
Yann Collet
3128e03be6 updated license header
to clarify dual-license meaning as "or"
2017-09-08 00:09:23 -07:00
Dmitriy Titarenko
20f715d709 Fix displayLevel overflow 2017-08-23 15:56:15 +05:00
Yann Collet
bd9c8ca146 Merge pull request #811 from terrelln/segmentSize
[cover] Fix end condition for small dictionary
2017-08-22 14:36:30 -07:00
Nick Terrell
29c2d9a4d0 [cover] Turn down notification for ZDICT subroutines 2017-08-21 14:28:31 -07:00
Nick Terrell
98de3f6847 [cover] Add dictionary size to compressed size 2017-08-21 14:23:17 -07:00
Nick Terrell
9a54a315aa [cover] Convert score to U32 and check for zero 2017-08-21 13:30:07 -07:00
Nick Terrell
d49eb40c03 [cover] Stop when segmentSize is less than d 2017-08-21 13:10:03 -07:00
Nick Terrell
f306d400c0 [cover] Fix divide by zero 2017-08-21 11:12:11 -07:00
Yann Collet
32fb407c9d updated a bunch of headers
for the new license
2017-08-18 16:52:05 -07:00
Yann Collet
b71363b967 check pthread_*_init() success condition 2017-07-19 01:05:40 -07:00
Nick Terrell
5b7fd7c422 [zdict] Make COVER the default algorithm 2017-06-26 21:09:22 -07:00
Nick Terrell
f376d47c11 [CLI] Switch dictionary builder on CLI to cover 2017-05-02 11:18:27 -07:00
Nick Terrell
020b960e13 [cover] Make optimization faster 2017-05-02 11:02:48 -07:00
Nick Terrell
f2d9ef1dc0 [cover] Optimize case where d <= 8 2017-05-02 11:02:43 -07:00
Sean Purcell
042ba122ae Change g_displayLevel to int and fix DISPLAYUPDATE flush 2017-03-23 11:21:59 -07:00
Nick Terrell
976e325b2e Fix COVER_optimizeTrainFromBuffer() resource leaks
Thanks to @nemequ for reporting the resource leaks.
2017-03-02 15:54:39 -08:00
Nick Terrell
71c5263c00 Attribute cover dictionary code 2017-02-07 11:35:07 -08:00
Nick Terrell
2fe9126591 Add multithread support to COVER 2017-01-27 11:56:02 -08:00
Nick Terrell
555e281637 Handle large input size in 32-bit mode correctly 2017-01-09 18:20:06 -08:00
Nick Terrell
3a1fefcf00 Simplify COVER parameters 2017-01-02 17:51:38 -08:00
Nick Terrell
96b39f65fa Add COVER dictionary builder 2017-01-02 13:22:51 -08:00