2017-09-02 01:28:35 +00:00
/*
* Copyright ( c ) 2016 - present , Yann Collet , Facebook , Inc .
* All rights reserved .
*
* This source code is licensed under both the BSD - style license ( found in the
* LICENSE file in the root directory of this source tree ) and the GPLv2 ( found
* in the COPYING file in the root directory of this source tree ) .
2017-09-08 07:09:23 +00:00
* You may select , at your option , one of the above - listed licenses .
2017-09-02 01:28:35 +00:00
*/
2017-11-08 00:15:23 +00:00
# include "zstd_compress_internal.h"
2017-09-02 01:28:35 +00:00
# include "zstd_lazy.h"
/*-*************************************
* Binary Tree search
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
# define ZSTD_DUBT_UNSORTED ((U32)(-1))
void ZSTD_updateDUBT ( ZSTD_CCtx * zc ,
const BYTE * ip , const BYTE * iend ,
U32 mls )
{
U32 * const hashTable = zc - > hashTable ;
U32 const hashLog = zc - > appliedParams . cParams . hashLog ;
U32 * const bt = zc - > chainTable ;
U32 const btLog = zc - > appliedParams . cParams . chainLog - 1 ;
U32 const btMask = ( 1 < < btLog ) - 1 ;
const BYTE * const base = zc - > base ;
U32 const target = ( U32 ) ( ip - base ) ;
U32 idx = zc - > nextToUpdate ;
2017-12-29 16:04:37 +00:00
if ( idx ! = target )
DEBUGLOG ( 2 , " ZSTD_updateDUBT, from %u to %u (dictLimit:%u) " ,
idx , target , zc - > dictLimit ) ;
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
assert ( ip + 8 < = iend ) ; /* condition for ZSTD_hashPtr */
( void ) iend ;
assert ( idx > = zc - > dictLimit ) ; /* condition for valid base+idx */
for ( ; idx < target ; idx + + ) {
size_t const h = ZSTD_hashPtr ( base + idx , hashLog , mls ) ; /* assumption : ip + 8 <= iend */
U32 const matchIndex = hashTable [ h ] ;
U32 * const nextCandidatePtr = bt + 2 * ( idx & btMask ) ;
U32 * const sortMarkPtr = nextCandidatePtr + 1 ;
hashTable [ h ] = idx ; /* Update Hash Table */
* nextCandidatePtr = matchIndex ; /* update BT like a chain */
* sortMarkPtr = ZSTD_DUBT_UNSORTED ;
}
zc - > nextToUpdate = target ;
}
/** ZSTD_insertDUBT1() :
* sort one already inserted but unsorted position
* assumption : current > = btlow = = ( current - btmask )
* doesn ' t fail */
static void ZSTD_insertDUBT1 ( ZSTD_CCtx * zc ,
U32 current , const BYTE * iend ,
U32 nbCompares , U32 btLow , int extDict )
2017-09-02 01:28:35 +00:00
{
U32 * const bt = zc - > chainTable ;
U32 const btLog = zc - > appliedParams . cParams . chainLog - 1 ;
U32 const btMask = ( 1 < < btLog ) - 1 ;
size_t commonLengthSmaller = 0 , commonLengthLarger = 0 ;
const BYTE * const base = zc - > base ;
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
const BYTE * const ip = base + current ;
2017-09-02 01:28:35 +00:00
const BYTE * const dictBase = zc - > dictBase ;
const U32 dictLimit = zc - > dictLimit ;
const BYTE * const dictEnd = dictBase + dictLimit ;
const BYTE * const prefixStart = base + dictLimit ;
const BYTE * match ;
U32 * smallerPtr = bt + 2 * ( current & btMask ) ;
U32 * largerPtr = smallerPtr + 1 ;
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
U32 matchIndex = * smallerPtr ;
2017-09-02 01:28:35 +00:00
U32 dummy32 ; /* to be nullified at the end */
U32 const windowLow = zc - > lowLimit ;
2017-12-29 16:04:37 +00:00
DEBUGLOG ( 8 , " ZSTD_insertDUBT1(%u) (dictLimit=%u, lowLimit=%u) " ,
current , dictLimit , windowLow ) ;
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
assert ( current > = btLow ) ;
2017-11-15 19:29:24 +00:00
2017-12-29 16:04:37 +00:00
if ( extDict & & ( current < dictLimit ) ) { /* candidates in _extDict are not sorted (simplification, for easier ZSTD_count, detrimental to compression ratio in streaming mode) */
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
* largerPtr = * smallerPtr = 0 ;
return ;
}
assert ( current > = dictLimit ) ; /* ip=base+current within current memory segment */
2017-09-02 01:28:35 +00:00
while ( nbCompares - - & & ( matchIndex > windowLow ) ) {
U32 * const nextPtr = bt + 2 * ( matchIndex & btMask ) ;
size_t matchLength = MIN ( commonLengthSmaller , commonLengthLarger ) ; /* guaranteed minimum nb of common bytes */
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
DEBUGLOG ( 8 , " ZSTD_insertDUBT1: comparing %u with %u " , current , matchIndex ) ;
Fixed Btree update
ZSTD_updateTree() expected to be followed by a Bt match finder, which would update zc->nextToUpdate.
With the new optimal match finder, it's not necessarily the case : a match might be found during repcode or hash3, and stops there because it reaches sufficient_len, without even entering the binary tree.
Previous policy was to nonetheless update zc->nextToUpdate, but the current position would not be inserted, creating "holes" in the btree, aka positions that will no longer be searched.
Now, when current position is not inserted, zc->nextToUpdate is not update, expecting ZSTD_updateTree() to fill the tree later on.
Solution selected is that ZSTD_updateTree() takes care of properly setting zc->nextToUpdate,
so that it no longer depends on a future function to do this job.
It took time to get there, as the issue started with a memory sanitizer error.
The pb would have been easier to spot with a proper `assert()`.
So this patch add a few of them.
Additionnally, I discovered that `make test` does not enable `assert()` during CLI tests.
This patch enables them.
Unfortunately, these `assert()` triggered other (unrelated) bugs during CLI tests, mostly within zstdmt.
So this patch also fixes them.
- Changed packed structure for gcc memory access : memory sanitizer would complain that a read "might" reach out-of-bound position on the ground that the `union` is larger than the type accessed.
Now, to avoid this issue, each type is independent.
- ZSTD_CCtxParams_setParameter() : @return provides the value of parameter, clamped/fixed appropriately.
- ZSTDMT : changed constant name to ZSTDMT_JOBSIZE_MIN
- ZSTDMT : multithreading is automatically disabled when srcSize <= ZSTDMT_JOBSIZE_MIN, since only one thread will be used in this case (saves memory and runtime).
- ZSTDMT : nbThreads is automatically clamped on setting the value.
2017-11-16 20:18:56 +00:00
assert ( matchIndex < current ) ;
2017-09-02 01:28:35 +00:00
if ( ( ! extDict ) | | ( matchIndex + matchLength > = dictLimit ) ) {
2017-11-15 21:44:24 +00:00
assert ( matchIndex + matchLength > = dictLimit ) ; /* might be wrong if extDict is incorrectly set to 0 */
2017-09-02 01:28:35 +00:00
match = base + matchIndex ;
2017-11-19 22:40:21 +00:00
matchLength + = ZSTD_count ( ip + matchLength , match + matchLength , iend ) ;
2017-09-02 01:28:35 +00:00
} else {
match = dictBase + matchIndex ;
matchLength + = ZSTD_count_2segments ( ip + matchLength , match + matchLength , iend , dictEnd , prefixStart ) ;
if ( matchIndex + matchLength > = dictLimit )
match = base + matchIndex ; /* to prepare for next usage of match[matchLength] */
}
2017-11-15 19:29:24 +00:00
if ( ip + matchLength = = iend ) { /* equal : no way to know if inf or sup */
2017-09-17 06:40:14 +00:00
break ; /* drop , to guarantee consistency ; miss a bit of compression, but other solutions can corrupt tree */
2017-11-15 19:29:24 +00:00
}
2017-09-02 01:28:35 +00:00
2017-09-17 06:40:14 +00:00
if ( match [ matchLength ] < ip [ matchLength ] ) { /* necessarily within buffer */
2017-11-15 21:44:24 +00:00
/* match is smaller than current */
2017-09-02 01:28:35 +00:00
* smallerPtr = matchIndex ; /* update smaller idx */
commonLengthSmaller = matchLength ; /* all smaller will now have at least this guaranteed common length */
2017-09-17 06:40:14 +00:00
if ( matchIndex < = btLow ) { smallerPtr = & dummy32 ; break ; } /* beyond tree size, stop searching */
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
DEBUGLOG ( 8 , " ZSTD_insertDUBT1: selecting next candidate from %u (>btLow=%u) => %u " ,
matchIndex , btLow , nextPtr [ 1 ] ) ;
2017-11-15 21:44:24 +00:00
smallerPtr = nextPtr + 1 ; /* new "candidate" => larger than match, which was smaller than target */
matchIndex = nextPtr [ 1 ] ; /* new matchIndex, larger than previous and closer to current */
2017-09-02 01:28:35 +00:00
} else {
/* match is larger than current */
* largerPtr = matchIndex ;
commonLengthLarger = matchLength ;
2017-09-17 06:40:14 +00:00
if ( matchIndex < = btLow ) { largerPtr = & dummy32 ; break ; } /* beyond tree size, stop searching */
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
DEBUGLOG ( 8 , " ZSTD_insertDUBT1: selecting next candidate from %u (>btLow=%u) => %u " ,
matchIndex , btLow , nextPtr [ 0 ] ) ;
2017-09-02 01:28:35 +00:00
largerPtr = nextPtr ;
matchIndex = nextPtr [ 0 ] ;
} }
* smallerPtr = * largerPtr = 0 ;
Fixed Btree update
ZSTD_updateTree() expected to be followed by a Bt match finder, which would update zc->nextToUpdate.
With the new optimal match finder, it's not necessarily the case : a match might be found during repcode or hash3, and stops there because it reaches sufficient_len, without even entering the binary tree.
Previous policy was to nonetheless update zc->nextToUpdate, but the current position would not be inserted, creating "holes" in the btree, aka positions that will no longer be searched.
Now, when current position is not inserted, zc->nextToUpdate is not update, expecting ZSTD_updateTree() to fill the tree later on.
Solution selected is that ZSTD_updateTree() takes care of properly setting zc->nextToUpdate,
so that it no longer depends on a future function to do this job.
It took time to get there, as the issue started with a memory sanitizer error.
The pb would have been easier to spot with a proper `assert()`.
So this patch add a few of them.
Additionnally, I discovered that `make test` does not enable `assert()` during CLI tests.
This patch enables them.
Unfortunately, these `assert()` triggered other (unrelated) bugs during CLI tests, mostly within zstdmt.
So this patch also fixes them.
- Changed packed structure for gcc memory access : memory sanitizer would complain that a read "might" reach out-of-bound position on the ground that the `union` is larger than the type accessed.
Now, to avoid this issue, each type is independent.
- ZSTD_CCtxParams_setParameter() : @return provides the value of parameter, clamped/fixed appropriately.
- ZSTDMT : changed constant name to ZSTDMT_JOBSIZE_MIN
- ZSTDMT : multithreading is automatically disabled when srcSize <= ZSTDMT_JOBSIZE_MIN, since only one thread will be used in this case (saves memory and runtime).
- ZSTDMT : nbThreads is automatically clamped on setting the value.
2017-11-16 20:18:56 +00:00
}
2017-09-02 01:28:35 +00:00
static size_t ZSTD_insertBtAndFindBestMatch (
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
ZSTD_CCtx * zc ,
const BYTE * const ip , const BYTE * const iend ,
size_t * offsetPtr ,
U32 nbCompares , const U32 mls ,
U32 extDict )
2017-09-02 01:28:35 +00:00
{
U32 * const hashTable = zc - > hashTable ;
U32 const hashLog = zc - > appliedParams . cParams . hashLog ;
size_t const h = ZSTD_hashPtr ( ip , hashLog , mls ) ;
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
U32 matchIndex = hashTable [ h ] ;
const BYTE * const base = zc - > base ;
U32 const current = ( U32 ) ( ip - base ) ;
2017-12-29 16:04:37 +00:00
U32 const windowLow = zc - > lowLimit ;
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
2017-09-02 01:28:35 +00:00
U32 * const bt = zc - > chainTable ;
U32 const btLog = zc - > appliedParams . cParams . chainLog - 1 ;
U32 const btMask = ( 1 < < btLog ) - 1 ;
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
U32 const btLow = ( btMask > = current ) ? 0 : current - btMask ;
2017-12-29 16:04:37 +00:00
U32 const unsortLimit = MAX ( btLow , windowLow ) ;
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
U32 * nextCandidate = bt + 2 * ( matchIndex & btMask ) ;
U32 * unsortedMark = bt + 2 * ( matchIndex & btMask ) + 1 ;
U32 nbCandidates = nbCompares ;
U32 previousCandidate = 0 ;
2017-09-02 01:28:35 +00:00
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
DEBUGLOG ( 7 , " ZSTD_insertBtAndFindBestMatch (%u) " , current ) ;
2017-09-17 06:40:14 +00:00
assert ( ip < = iend - 8 ) ; /* required for h calculation */
2017-09-02 01:28:35 +00:00
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
/* reach end of unsorted candidates list */
2017-12-29 16:04:37 +00:00
while ( ( matchIndex > unsortLimit )
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
& & ( * unsortedMark = = ZSTD_DUBT_UNSORTED )
& & ( nbCandidates > 1 ) ) {
DEBUGLOG ( 8 , " ZSTD_insertBtAndFindBestMatch: candidate %u is unsorted " ,
matchIndex ) ;
* unsortedMark = previousCandidate ;
previousCandidate = matchIndex ;
matchIndex = * nextCandidate ;
nextCandidate = bt + 2 * ( matchIndex & btMask ) ;
unsortedMark = bt + 2 * ( matchIndex & btMask ) + 1 ;
nbCandidates - - ;
}
2017-09-02 01:28:35 +00:00
2017-12-29 16:04:37 +00:00
if ( ( matchIndex > unsortLimit )
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
& & ( * unsortedMark = = ZSTD_DUBT_UNSORTED ) ) {
DEBUGLOG ( 8 , " ZSTD_insertBtAndFindBestMatch: nullify last unsorted candidate %u " ,
matchIndex ) ;
2017-12-29 16:04:37 +00:00
* nextCandidate = * unsortedMark = 0 ; /* nullify next candidate if it's still unsorted (note : simplification, detrimental to compression ratio, beneficial for speed) */
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
}
2017-09-02 01:28:35 +00:00
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
/* batch sort stacked candidates */
matchIndex = previousCandidate ;
while ( matchIndex ) { /* will end on matchIndex == 0 */
U32 * const nextCandidateIdxPtr = bt + 2 * ( matchIndex & btMask ) + 1 ;
U32 const nextCandidateIdx = * nextCandidateIdxPtr ;
ZSTD_insertDUBT1 ( zc , matchIndex , iend ,
2017-12-29 16:04:37 +00:00
nbCandidates , unsortLimit , extDict ) ;
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
matchIndex = nextCandidateIdx ;
nbCandidates + + ;
}
2017-09-02 01:28:35 +00:00
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
/* find longest match */
{ size_t commonLengthSmaller = 0 , commonLengthLarger = 0 ;
const BYTE * const dictBase = zc - > dictBase ;
const U32 dictLimit = zc - > dictLimit ;
const BYTE * const dictEnd = dictBase + dictLimit ;
const BYTE * const prefixStart = base + dictLimit ;
U32 * smallerPtr = bt + 2 * ( current & btMask ) ;
U32 * largerPtr = bt + 2 * ( current & btMask ) + 1 ;
U32 matchEndIdx = current + 8 + 1 ;
U32 dummy32 ; /* to be nullified at the end */
size_t bestLength = 0 ;
matchIndex = hashTable [ h ] ;
hashTable [ h ] = current ; /* Update Hash Table */
while ( nbCompares - - & & ( matchIndex > windowLow ) ) {
U32 * const nextPtr = bt + 2 * ( matchIndex & btMask ) ;
size_t matchLength = MIN ( commonLengthSmaller , commonLengthLarger ) ; /* guaranteed minimum nb of common bytes */
const BYTE * match ;
if ( ( ! extDict ) | | ( matchIndex + matchLength > = dictLimit ) ) {
match = base + matchIndex ;
matchLength + = ZSTD_count ( ip + matchLength , match + matchLength , iend ) ;
} else {
match = dictBase + matchIndex ;
matchLength + = ZSTD_count_2segments ( ip + matchLength , match + matchLength , iend , dictEnd , prefixStart ) ;
if ( matchIndex + matchLength > = dictLimit )
match = base + matchIndex ; /* to prepare for next usage of match[matchLength] */
}
2017-09-02 01:28:35 +00:00
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
if ( matchLength > bestLength ) {
if ( matchLength > matchEndIdx - matchIndex )
matchEndIdx = matchIndex + ( U32 ) matchLength ;
if ( ( 4 * ( int ) ( matchLength - bestLength ) ) > ( int ) ( ZSTD_highbit32 ( current - matchIndex + 1 ) - ZSTD_highbit32 ( ( U32 ) offsetPtr [ 0 ] + 1 ) ) )
bestLength = matchLength , * offsetPtr = ZSTD_REP_MOVE + current - matchIndex ;
if ( ip + matchLength = = iend ) { /* equal : no way to know if inf or sup */
break ; /* drop, to guarantee consistency (miss a little bit of compression) */
}
}
2017-09-02 01:28:35 +00:00
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
if ( match [ matchLength ] < ip [ matchLength ] ) {
/* match is smaller than current */
* smallerPtr = matchIndex ; /* update smaller idx */
commonLengthSmaller = matchLength ; /* all smaller will now have at least this guaranteed common length */
if ( matchIndex < = btLow ) { smallerPtr = & dummy32 ; break ; } /* beyond tree size, stop the search */
smallerPtr = nextPtr + 1 ; /* new "smaller" => larger of match */
matchIndex = nextPtr [ 1 ] ; /* new matchIndex larger than previous (closer to current) */
} else {
/* match is larger than current */
* largerPtr = matchIndex ;
commonLengthLarger = matchLength ;
if ( matchIndex < = btLow ) { largerPtr = & dummy32 ; break ; } /* beyond tree size, stop the search */
largerPtr = nextPtr ;
matchIndex = nextPtr [ 0 ] ;
} }
* smallerPtr = * largerPtr = 0 ;
assert ( matchEndIdx > current + 8 ) ; /* ensure nextToUpdate is increased */
zc - > nextToUpdate = matchEndIdx - 8 ; /* skip repetitive patterns */
if ( bestLength )
DEBUGLOG ( 7 , " ZSTD_insertBtAndFindBestMatch(%u) : found match of length %u " ,
current , ( U32 ) bestLength ) ;
return bestLength ;
}
2017-09-02 01:28:35 +00:00
}
/** ZSTD_BtFindBestMatch() : Tree updater, providing best match */
static size_t ZSTD_BtFindBestMatch (
ZSTD_CCtx * zc ,
const BYTE * const ip , const BYTE * const iLimit ,
size_t * offsetPtr ,
const U32 maxNbAttempts , const U32 mls )
{
if ( ip < zc - > base + zc - > nextToUpdate ) return 0 ; /* skipped area */
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
ZSTD_updateDUBT ( zc , ip , iLimit , mls ) ;
2017-09-02 01:28:35 +00:00
return ZSTD_insertBtAndFindBestMatch ( zc , ip , iLimit , offsetPtr , maxNbAttempts , mls , 0 ) ;
}
static size_t ZSTD_BtFindBestMatch_selectMLS (
ZSTD_CCtx * zc , /* Index table will be updated */
const BYTE * ip , const BYTE * const iLimit ,
size_t * offsetPtr ,
const U32 maxNbAttempts , const U32 matchLengthSearch )
{
switch ( matchLengthSearch )
{
default : /* includes case 3 */
case 4 : return ZSTD_BtFindBestMatch ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 4 ) ;
case 5 : return ZSTD_BtFindBestMatch ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 5 ) ;
case 7 :
case 6 : return ZSTD_BtFindBestMatch ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 6 ) ;
}
}
/** Tree updater, providing best match */
static size_t ZSTD_BtFindBestMatch_extDict (
ZSTD_CCtx * zc ,
const BYTE * const ip , const BYTE * const iLimit ,
size_t * offsetPtr ,
const U32 maxNbAttempts , const U32 mls )
{
if ( ip < zc - > base + zc - > nextToUpdate ) return 0 ; /* skipped area */
first implementation of delayed update for btlazy2
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it was a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary, it represents the maximum depth of searches.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
2017-12-28 15:58:57 +00:00
ZSTD_updateDUBT ( zc , ip , iLimit , mls ) ;
2017-09-02 01:28:35 +00:00
return ZSTD_insertBtAndFindBestMatch ( zc , ip , iLimit , offsetPtr , maxNbAttempts , mls , 1 ) ;
}
static size_t ZSTD_BtFindBestMatch_selectMLS_extDict (
ZSTD_CCtx * zc , /* Index table will be updated */
const BYTE * ip , const BYTE * const iLimit ,
size_t * offsetPtr ,
const U32 maxNbAttempts , const U32 matchLengthSearch )
{
switch ( matchLengthSearch )
{
default : /* includes case 3 */
case 4 : return ZSTD_BtFindBestMatch_extDict ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 4 ) ;
case 5 : return ZSTD_BtFindBestMatch_extDict ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 5 ) ;
case 7 :
case 6 : return ZSTD_BtFindBestMatch_extDict ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 6 ) ;
}
}
/* *********************************
* Hash Chain
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
# define NEXT_IN_CHAIN(d, mask) chainTable[(d) & mask]
/* Update chains up to ip (excluded)
Assumption : always within prefix ( i . e . not within extDict ) */
U32 ZSTD_insertAndFindFirstIndex ( ZSTD_CCtx * zc , const BYTE * ip , U32 mls )
{
U32 * const hashTable = zc - > hashTable ;
const U32 hashLog = zc - > appliedParams . cParams . hashLog ;
U32 * const chainTable = zc - > chainTable ;
const U32 chainMask = ( 1 < < zc - > appliedParams . cParams . chainLog ) - 1 ;
const BYTE * const base = zc - > base ;
const U32 target = ( U32 ) ( ip - base ) ;
U32 idx = zc - > nextToUpdate ;
while ( idx < target ) { /* catch up */
size_t const h = ZSTD_hashPtr ( base + idx , hashLog , mls ) ;
NEXT_IN_CHAIN ( idx , chainMask ) = hashTable [ h ] ;
hashTable [ h ] = idx ;
idx + + ;
}
zc - > nextToUpdate = target ;
return hashTable [ ZSTD_hashPtr ( ip , hashLog , mls ) ] ;
}
/* inlining is important to hardwire a hot branch (template emulation) */
FORCE_INLINE_TEMPLATE
size_t ZSTD_HcFindBestMatch_generic (
ZSTD_CCtx * zc , /* Index table will be updated */
const BYTE * const ip , const BYTE * const iLimit ,
size_t * offsetPtr ,
const U32 maxNbAttempts , const U32 mls , const U32 extDict )
{
U32 * const chainTable = zc - > chainTable ;
const U32 chainSize = ( 1 < < zc - > appliedParams . cParams . chainLog ) ;
const U32 chainMask = chainSize - 1 ;
const BYTE * const base = zc - > base ;
const BYTE * const dictBase = zc - > dictBase ;
const U32 dictLimit = zc - > dictLimit ;
const BYTE * const prefixStart = base + dictLimit ;
const BYTE * const dictEnd = dictBase + dictLimit ;
const U32 lowLimit = zc - > lowLimit ;
const U32 current = ( U32 ) ( ip - base ) ;
const U32 minChain = current > chainSize ? current - chainSize : 0 ;
int nbAttempts = maxNbAttempts ;
size_t ml = 4 - 1 ;
/* HC4 match finder */
U32 matchIndex = ZSTD_insertAndFindFirstIndex ( zc , ip , mls ) ;
for ( ; ( matchIndex > lowLimit ) & ( nbAttempts > 0 ) ; nbAttempts - - ) {
size_t currentMl = 0 ;
if ( ( ! extDict ) | | matchIndex > = dictLimit ) {
2017-11-19 22:40:21 +00:00
const BYTE * const match = base + matchIndex ;
2017-09-02 01:28:35 +00:00
if ( match [ ml ] = = ip [ ml ] ) /* potentially better */
currentMl = ZSTD_count ( ip , match , iLimit ) ;
} else {
2017-11-19 22:40:21 +00:00
const BYTE * const match = dictBase + matchIndex ;
assert ( match + 4 < = dictEnd ) ;
2017-09-02 01:28:35 +00:00
if ( MEM_read32 ( match ) = = MEM_read32 ( ip ) ) /* assumption : matchIndex <= dictLimit-4 (by table construction) */
currentMl = ZSTD_count_2segments ( ip + 4 , match + 4 , iLimit , dictEnd , prefixStart ) + 4 ;
}
/* save best solution */
if ( currentMl > ml ) {
ml = currentMl ;
* offsetPtr = current - matchIndex + ZSTD_REP_MOVE ;
if ( ip + currentMl = = iLimit ) break ; /* best possible, avoids read overflow on next attempt */
}
if ( matchIndex < = minChain ) break ;
matchIndex = NEXT_IN_CHAIN ( matchIndex , chainMask ) ;
}
return ml ;
}
FORCE_INLINE_TEMPLATE size_t ZSTD_HcFindBestMatch_selectMLS (
ZSTD_CCtx * zc ,
const BYTE * ip , const BYTE * const iLimit ,
size_t * offsetPtr ,
const U32 maxNbAttempts , const U32 matchLengthSearch )
{
switch ( matchLengthSearch )
{
default : /* includes case 3 */
case 4 : return ZSTD_HcFindBestMatch_generic ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 4 , 0 ) ;
case 5 : return ZSTD_HcFindBestMatch_generic ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 5 , 0 ) ;
case 7 :
case 6 : return ZSTD_HcFindBestMatch_generic ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 6 , 0 ) ;
}
}
FORCE_INLINE_TEMPLATE size_t ZSTD_HcFindBestMatch_extDict_selectMLS (
2017-11-19 22:40:21 +00:00
ZSTD_CCtx * const zc ,
2017-09-02 01:28:35 +00:00
const BYTE * ip , const BYTE * const iLimit ,
2017-11-19 22:40:21 +00:00
size_t * const offsetPtr ,
U32 const maxNbAttempts , U32 const matchLengthSearch )
2017-09-02 01:28:35 +00:00
{
switch ( matchLengthSearch )
{
default : /* includes case 3 */
case 4 : return ZSTD_HcFindBestMatch_generic ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 4 , 1 ) ;
case 5 : return ZSTD_HcFindBestMatch_generic ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 5 , 1 ) ;
case 7 :
case 6 : return ZSTD_HcFindBestMatch_generic ( zc , ip , iLimit , offsetPtr , maxNbAttempts , 6 , 1 ) ;
}
}
/* *******************************
* Common parser - lazy strategy
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
FORCE_INLINE_TEMPLATE
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_lazy_generic ( ZSTD_CCtx * ctx ,
const void * src , size_t srcSize ,
const U32 searchMethod , const U32 depth )
2017-09-02 01:28:35 +00:00
{
seqStore_t * seqStorePtr = & ( ctx - > seqStore ) ;
const BYTE * const istart = ( const BYTE * ) src ;
const BYTE * ip = istart ;
const BYTE * anchor = istart ;
const BYTE * const iend = istart + srcSize ;
const BYTE * const ilimit = iend - 8 ;
const BYTE * const base = ctx - > base + ctx - > dictLimit ;
U32 const maxSearches = 1 < < ctx - > appliedParams . cParams . searchLog ;
U32 const mls = ctx - > appliedParams . cParams . searchLength ;
typedef size_t ( * searchMax_f ) ( ZSTD_CCtx * zc , const BYTE * ip , const BYTE * iLimit ,
size_t * offsetPtr ,
U32 maxNbAttempts , U32 matchLengthSearch ) ;
searchMax_f const searchMax = searchMethod ? ZSTD_BtFindBestMatch_selectMLS : ZSTD_HcFindBestMatch_selectMLS ;
U32 offset_1 = seqStorePtr - > rep [ 0 ] , offset_2 = seqStorePtr - > rep [ 1 ] , savedOffset = 0 ;
/* init */
ip + = ( ip = = base ) ;
ctx - > nextToUpdate3 = ctx - > nextToUpdate ;
{ U32 const maxRep = ( U32 ) ( ip - base ) ;
if ( offset_2 > maxRep ) savedOffset = offset_2 , offset_2 = 0 ;
if ( offset_1 > maxRep ) savedOffset = offset_1 , offset_1 = 0 ;
}
/* Match Loop */
while ( ip < ilimit ) {
size_t matchLength = 0 ;
size_t offset = 0 ;
const BYTE * start = ip + 1 ;
/* check repCode */
if ( ( offset_1 > 0 ) & ( MEM_read32 ( ip + 1 ) = = MEM_read32 ( ip + 1 - offset_1 ) ) ) {
/* repcode : we take it */
matchLength = ZSTD_count ( ip + 1 + 4 , ip + 1 + 4 - offset_1 , iend ) + 4 ;
if ( depth = = 0 ) goto _storeSequence ;
}
/* first search (depth 0) */
{ size_t offsetFound = 99999999 ;
size_t const ml2 = searchMax ( ctx , ip , iend , & offsetFound , maxSearches , mls ) ;
if ( ml2 > matchLength )
matchLength = ml2 , start = ip , offset = offsetFound ;
}
if ( matchLength < 4 ) {
ip + = ( ( ip - anchor ) > > g_searchStrength ) + 1 ; /* jump faster over incompressible sections */
continue ;
}
/* let's try to find a better solution */
if ( depth > = 1 )
while ( ip < ilimit ) {
ip + + ;
if ( ( offset ) & & ( ( offset_1 > 0 ) & ( MEM_read32 ( ip ) = = MEM_read32 ( ip - offset_1 ) ) ) ) {
size_t const mlRep = ZSTD_count ( ip + 4 , ip + 4 - offset_1 , iend ) + 4 ;
int const gain2 = ( int ) ( mlRep * 3 ) ;
int const gain1 = ( int ) ( matchLength * 3 - ZSTD_highbit32 ( ( U32 ) offset + 1 ) + 1 ) ;
if ( ( mlRep > = 4 ) & & ( gain2 > gain1 ) )
matchLength = mlRep , offset = 0 , start = ip ;
}
{ size_t offset2 = 99999999 ;
size_t const ml2 = searchMax ( ctx , ip , iend , & offset2 , maxSearches , mls ) ;
int const gain2 = ( int ) ( ml2 * 4 - ZSTD_highbit32 ( ( U32 ) offset2 + 1 ) ) ; /* raw approx */
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offset + 1 ) + 4 ) ;
if ( ( ml2 > = 4 ) & & ( gain2 > gain1 ) ) {
matchLength = ml2 , offset = offset2 , start = ip ;
continue ; /* search a better one */
} }
/* let's find an even better one */
if ( ( depth = = 2 ) & & ( ip < ilimit ) ) {
ip + + ;
if ( ( offset ) & & ( ( offset_1 > 0 ) & ( MEM_read32 ( ip ) = = MEM_read32 ( ip - offset_1 ) ) ) ) {
size_t const ml2 = ZSTD_count ( ip + 4 , ip + 4 - offset_1 , iend ) + 4 ;
int const gain2 = ( int ) ( ml2 * 4 ) ;
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offset + 1 ) + 1 ) ;
if ( ( ml2 > = 4 ) & & ( gain2 > gain1 ) )
matchLength = ml2 , offset = 0 , start = ip ;
}
{ size_t offset2 = 99999999 ;
size_t const ml2 = searchMax ( ctx , ip , iend , & offset2 , maxSearches , mls ) ;
int const gain2 = ( int ) ( ml2 * 4 - ZSTD_highbit32 ( ( U32 ) offset2 + 1 ) ) ; /* raw approx */
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offset + 1 ) + 7 ) ;
if ( ( ml2 > = 4 ) & & ( gain2 > gain1 ) ) {
matchLength = ml2 , offset = offset2 , start = ip ;
continue ;
} } }
break ; /* nothing found : store previous solution */
}
/* NOTE:
* start [ - offset + ZSTD_REP_MOVE - 1 ] is undefined behavior .
* ( - offset + ZSTD_REP_MOVE - 1 ) is unsigned , and is added to start , which
* overflows the pointer , which is undefined behavior .
*/
/* catch up */
if ( offset ) {
2017-11-19 22:40:21 +00:00
while ( ( ( start > anchor ) & ( start - ( offset - ZSTD_REP_MOVE ) > base ) )
& & ( start [ - 1 ] = = ( start - ( offset - ZSTD_REP_MOVE ) ) [ - 1 ] ) ) /* only search for offset within prefix */
2017-09-02 01:28:35 +00:00
{ start - - ; matchLength + + ; }
offset_2 = offset_1 ; offset_1 = ( U32 ) ( offset - ZSTD_REP_MOVE ) ;
}
/* store sequence */
_storeSequence :
{ size_t const litLength = start - anchor ;
ZSTD_storeSeq ( seqStorePtr , litLength , anchor , ( U32 ) offset , matchLength - MINMATCH ) ;
anchor = ip = start + matchLength ;
}
/* check immediate repcode */
2017-11-19 22:40:21 +00:00
while ( ( ( ip < = ilimit ) & ( offset_2 > 0 ) )
& & ( MEM_read32 ( ip ) = = MEM_read32 ( ip - offset_2 ) ) ) {
2017-09-02 01:28:35 +00:00
/* store sequence */
matchLength = ZSTD_count ( ip + 4 , ip + 4 - offset_2 , iend ) + 4 ;
offset = offset_2 ; offset_2 = offset_1 ; offset_1 = ( U32 ) offset ; /* swap repcodes */
ZSTD_storeSeq ( seqStorePtr , 0 , anchor , 0 , matchLength - MINMATCH ) ;
ip + = matchLength ;
anchor = ip ;
continue ; /* faster when present ... (?) */
} }
/* Save reps for next block */
seqStorePtr - > repToConfirm [ 0 ] = offset_1 ? offset_1 : savedOffset ;
seqStorePtr - > repToConfirm [ 1 ] = offset_2 ? offset_2 : savedOffset ;
2017-09-06 22:56:32 +00:00
/* Return the last literals size */
return iend - anchor ;
2017-09-02 01:28:35 +00:00
}
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_btlazy2 ( ZSTD_CCtx * ctx , const void * src , size_t srcSize )
2017-09-02 01:28:35 +00:00
{
2017-09-06 22:56:32 +00:00
return ZSTD_compressBlock_lazy_generic ( ctx , src , srcSize , 1 , 2 ) ;
2017-09-02 01:28:35 +00:00
}
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_lazy2 ( ZSTD_CCtx * ctx , const void * src , size_t srcSize )
2017-09-02 01:28:35 +00:00
{
2017-09-06 22:56:32 +00:00
return ZSTD_compressBlock_lazy_generic ( ctx , src , srcSize , 0 , 2 ) ;
2017-09-02 01:28:35 +00:00
}
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_lazy ( ZSTD_CCtx * ctx , const void * src , size_t srcSize )
2017-09-02 01:28:35 +00:00
{
2017-09-06 22:56:32 +00:00
return ZSTD_compressBlock_lazy_generic ( ctx , src , srcSize , 0 , 1 ) ;
2017-09-02 01:28:35 +00:00
}
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_greedy ( ZSTD_CCtx * ctx , const void * src , size_t srcSize )
2017-09-02 01:28:35 +00:00
{
2017-09-06 22:56:32 +00:00
return ZSTD_compressBlock_lazy_generic ( ctx , src , srcSize , 0 , 0 ) ;
2017-09-02 01:28:35 +00:00
}
FORCE_INLINE_TEMPLATE
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_lazy_extDict_generic ( ZSTD_CCtx * ctx ,
2017-09-02 01:28:35 +00:00
const void * src , size_t srcSize ,
const U32 searchMethod , const U32 depth )
{
seqStore_t * seqStorePtr = & ( ctx - > seqStore ) ;
const BYTE * const istart = ( const BYTE * ) src ;
const BYTE * ip = istart ;
const BYTE * anchor = istart ;
const BYTE * const iend = istart + srcSize ;
const BYTE * const ilimit = iend - 8 ;
const BYTE * const base = ctx - > base ;
const U32 dictLimit = ctx - > dictLimit ;
const U32 lowestIndex = ctx - > lowLimit ;
const BYTE * const prefixStart = base + dictLimit ;
const BYTE * const dictBase = ctx - > dictBase ;
const BYTE * const dictEnd = dictBase + dictLimit ;
const BYTE * const dictStart = dictBase + ctx - > lowLimit ;
const U32 maxSearches = 1 < < ctx - > appliedParams . cParams . searchLog ;
const U32 mls = ctx - > appliedParams . cParams . searchLength ;
typedef size_t ( * searchMax_f ) ( ZSTD_CCtx * zc , const BYTE * ip , const BYTE * iLimit ,
size_t * offsetPtr ,
U32 maxNbAttempts , U32 matchLengthSearch ) ;
searchMax_f searchMax = searchMethod ? ZSTD_BtFindBestMatch_selectMLS_extDict : ZSTD_HcFindBestMatch_extDict_selectMLS ;
U32 offset_1 = seqStorePtr - > rep [ 0 ] , offset_2 = seqStorePtr - > rep [ 1 ] ;
/* init */
ctx - > nextToUpdate3 = ctx - > nextToUpdate ;
ip + = ( ip = = prefixStart ) ;
/* Match Loop */
while ( ip < ilimit ) {
size_t matchLength = 0 ;
size_t offset = 0 ;
const BYTE * start = ip + 1 ;
U32 current = ( U32 ) ( ip - base ) ;
/* check repCode */
{ const U32 repIndex = ( U32 ) ( current + 1 - offset_1 ) ;
const BYTE * const repBase = repIndex < dictLimit ? dictBase : base ;
const BYTE * const repMatch = repBase + repIndex ;
if ( ( ( U32 ) ( ( dictLimit - 1 ) - repIndex ) > = 3 ) & ( repIndex > lowestIndex ) ) /* intentional overflow */
if ( MEM_read32 ( ip + 1 ) = = MEM_read32 ( repMatch ) ) {
/* repcode detected we should take it */
const BYTE * const repEnd = repIndex < dictLimit ? dictEnd : iend ;
matchLength = ZSTD_count_2segments ( ip + 1 + 4 , repMatch + 4 , iend , repEnd , prefixStart ) + 4 ;
if ( depth = = 0 ) goto _storeSequence ;
} }
/* first search (depth 0) */
{ size_t offsetFound = 99999999 ;
size_t const ml2 = searchMax ( ctx , ip , iend , & offsetFound , maxSearches , mls ) ;
if ( ml2 > matchLength )
matchLength = ml2 , start = ip , offset = offsetFound ;
}
if ( matchLength < 4 ) {
ip + = ( ( ip - anchor ) > > g_searchStrength ) + 1 ; /* jump faster over incompressible sections */
continue ;
}
/* let's try to find a better solution */
if ( depth > = 1 )
while ( ip < ilimit ) {
ip + + ;
current + + ;
/* check repCode */
if ( offset ) {
const U32 repIndex = ( U32 ) ( current - offset_1 ) ;
const BYTE * const repBase = repIndex < dictLimit ? dictBase : base ;
const BYTE * const repMatch = repBase + repIndex ;
if ( ( ( U32 ) ( ( dictLimit - 1 ) - repIndex ) > = 3 ) & ( repIndex > lowestIndex ) ) /* intentional overflow */
if ( MEM_read32 ( ip ) = = MEM_read32 ( repMatch ) ) {
/* repcode detected */
const BYTE * const repEnd = repIndex < dictLimit ? dictEnd : iend ;
size_t const repLength = ZSTD_count_2segments ( ip + 4 , repMatch + 4 , iend , repEnd , prefixStart ) + 4 ;
int const gain2 = ( int ) ( repLength * 3 ) ;
int const gain1 = ( int ) ( matchLength * 3 - ZSTD_highbit32 ( ( U32 ) offset + 1 ) + 1 ) ;
if ( ( repLength > = 4 ) & & ( gain2 > gain1 ) )
matchLength = repLength , offset = 0 , start = ip ;
} }
/* search match, depth 1 */
{ size_t offset2 = 99999999 ;
size_t const ml2 = searchMax ( ctx , ip , iend , & offset2 , maxSearches , mls ) ;
int const gain2 = ( int ) ( ml2 * 4 - ZSTD_highbit32 ( ( U32 ) offset2 + 1 ) ) ; /* raw approx */
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offset + 1 ) + 4 ) ;
if ( ( ml2 > = 4 ) & & ( gain2 > gain1 ) ) {
matchLength = ml2 , offset = offset2 , start = ip ;
continue ; /* search a better one */
} }
/* let's find an even better one */
if ( ( depth = = 2 ) & & ( ip < ilimit ) ) {
ip + + ;
current + + ;
/* check repCode */
if ( offset ) {
const U32 repIndex = ( U32 ) ( current - offset_1 ) ;
const BYTE * const repBase = repIndex < dictLimit ? dictBase : base ;
const BYTE * const repMatch = repBase + repIndex ;
if ( ( ( U32 ) ( ( dictLimit - 1 ) - repIndex ) > = 3 ) & ( repIndex > lowestIndex ) ) /* intentional overflow */
if ( MEM_read32 ( ip ) = = MEM_read32 ( repMatch ) ) {
/* repcode detected */
const BYTE * const repEnd = repIndex < dictLimit ? dictEnd : iend ;
size_t const repLength = ZSTD_count_2segments ( ip + 4 , repMatch + 4 , iend , repEnd , prefixStart ) + 4 ;
int const gain2 = ( int ) ( repLength * 4 ) ;
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offset + 1 ) + 1 ) ;
if ( ( repLength > = 4 ) & & ( gain2 > gain1 ) )
matchLength = repLength , offset = 0 , start = ip ;
} }
/* search match, depth 2 */
{ size_t offset2 = 99999999 ;
size_t const ml2 = searchMax ( ctx , ip , iend , & offset2 , maxSearches , mls ) ;
int const gain2 = ( int ) ( ml2 * 4 - ZSTD_highbit32 ( ( U32 ) offset2 + 1 ) ) ; /* raw approx */
int const gain1 = ( int ) ( matchLength * 4 - ZSTD_highbit32 ( ( U32 ) offset + 1 ) + 7 ) ;
if ( ( ml2 > = 4 ) & & ( gain2 > gain1 ) ) {
matchLength = ml2 , offset = offset2 , start = ip ;
continue ;
} } }
break ; /* nothing found : store previous solution */
}
/* catch up */
if ( offset ) {
U32 const matchIndex = ( U32 ) ( ( start - base ) - ( offset - ZSTD_REP_MOVE ) ) ;
const BYTE * match = ( matchIndex < dictLimit ) ? dictBase + matchIndex : base + matchIndex ;
const BYTE * const mStart = ( matchIndex < dictLimit ) ? dictStart : prefixStart ;
while ( ( start > anchor ) & & ( match > mStart ) & & ( start [ - 1 ] = = match [ - 1 ] ) ) { start - - ; match - - ; matchLength + + ; } /* catch up */
offset_2 = offset_1 ; offset_1 = ( U32 ) ( offset - ZSTD_REP_MOVE ) ;
}
/* store sequence */
_storeSequence :
{ size_t const litLength = start - anchor ;
ZSTD_storeSeq ( seqStorePtr , litLength , anchor , ( U32 ) offset , matchLength - MINMATCH ) ;
anchor = ip = start + matchLength ;
}
/* check immediate repcode */
while ( ip < = ilimit ) {
const U32 repIndex = ( U32 ) ( ( ip - base ) - offset_2 ) ;
const BYTE * const repBase = repIndex < dictLimit ? dictBase : base ;
const BYTE * const repMatch = repBase + repIndex ;
if ( ( ( U32 ) ( ( dictLimit - 1 ) - repIndex ) > = 3 ) & ( repIndex > lowestIndex ) ) /* intentional overflow */
if ( MEM_read32 ( ip ) = = MEM_read32 ( repMatch ) ) {
/* repcode detected we should take it */
const BYTE * const repEnd = repIndex < dictLimit ? dictEnd : iend ;
matchLength = ZSTD_count_2segments ( ip + 4 , repMatch + 4 , iend , repEnd , prefixStart ) + 4 ;
offset = offset_2 ; offset_2 = offset_1 ; offset_1 = ( U32 ) offset ; /* swap offset history */
ZSTD_storeSeq ( seqStorePtr , 0 , anchor , 0 , matchLength - MINMATCH ) ;
ip + = matchLength ;
anchor = ip ;
continue ; /* faster when present ... (?) */
}
break ;
} }
/* Save reps for next block */
seqStorePtr - > repToConfirm [ 0 ] = offset_1 ; seqStorePtr - > repToConfirm [ 1 ] = offset_2 ;
2017-09-06 22:56:32 +00:00
/* Return the last literals size */
return iend - anchor ;
2017-09-02 01:28:35 +00:00
}
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_greedy_extDict ( ZSTD_CCtx * ctx , const void * src , size_t srcSize )
2017-09-02 01:28:35 +00:00
{
2017-09-06 22:56:32 +00:00
return ZSTD_compressBlock_lazy_extDict_generic ( ctx , src , srcSize , 0 , 0 ) ;
2017-09-02 01:28:35 +00:00
}
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_lazy_extDict ( ZSTD_CCtx * ctx , const void * src , size_t srcSize )
2017-09-02 01:28:35 +00:00
{
2017-09-06 22:56:32 +00:00
return ZSTD_compressBlock_lazy_extDict_generic ( ctx , src , srcSize , 0 , 1 ) ;
2017-09-02 01:28:35 +00:00
}
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_lazy2_extDict ( ZSTD_CCtx * ctx , const void * src , size_t srcSize )
2017-09-02 01:28:35 +00:00
{
2017-09-06 22:56:32 +00:00
return ZSTD_compressBlock_lazy_extDict_generic ( ctx , src , srcSize , 0 , 2 ) ;
2017-09-02 01:28:35 +00:00
}
2017-09-06 22:56:32 +00:00
size_t ZSTD_compressBlock_btlazy2_extDict ( ZSTD_CCtx * ctx , const void * src , size_t srcSize )
2017-09-02 01:28:35 +00:00
{
2017-09-06 22:56:32 +00:00
return ZSTD_compressBlock_lazy_extDict_generic ( ctx , src , srcSize , 1 , 2 ) ;
2017-09-02 01:28:35 +00:00
}