as it's faster, due to one memory scan instead of two
(confirmed by microbenchmark).
Note : as ZSTD_reduceIndex() is rarely invoked,
it does not translate into a visible gain.
Consider it an exercise in auto-vectorization and micro-benchmarking.
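For illustration, a minimal sketch of the fused pass (illustrative names, simplified from the actual ZSTD_reduceIndex() internals) :

```c
/* Sketch : the clamp-to-zero test and the subtraction are fused into a single
 * branchless scan over the index table, which compilers can auto-vectorize. */
typedef unsigned int U32;

static void reduceTable(U32* const table, U32 const size, U32 const reducerValue)
{
    U32 cellNb;
    for (cellNb = 0; cellNb < size; cellNb++) {
        U32 const idx = table[cellNb];
        /* one scan : indices below reducerValue collapse to 0, the others are rebased */
        table[cellNb] = (idx < reducerValue) ? 0 : idx - reducerValue;
    }
}
```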
On my laptop:
Before:
./zstd32 -b --zstd=wlog=27 silesia.tar enwik8 -S
3#silesia.tar : 211984896 -> 66683478 (3.179), 97.6 MB/s , 400.7 MB/s
3#enwik8 : 100000000 -> 35643153 (2.806), 76.5 MB/s , 303.2 MB/s
After:
./zstd32 -b --zstd=wlog=27 silesia.tar enwik8 -S
3#silesia.tar : 211984896 -> 66683478 (3.179), 97.4 MB/s , 435.0 MB/s
3#enwik8 : 100000000 -> 35643153 (2.806), 76.2 MB/s , 338.1 MB/s
Mileage varies, depending on file and cpu type.
But a generic rule is : x86 benefits less from "long-offset mode" than x64,
maybe due to register pressure.
On "entropy", long-mode is _never_ a win for x86.
On my laptop though, it may be, depending on file and compression level
(enwik8 benefits more from "long-mode" than silesia).
This makes it easier to explain that nbWorkers=0 --> single-threaded mode,
while nbWorkers=1 --> asynchronous mode (one worker thread on top of the "main" caller thread).
No need for an additional asynchronous mode flag.
nbWorkers>=2 works the same as nbThreads>=2 previously.
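For illustration, a sketch using today's public API (the parameter has since stabilized as ZSTD_c_nbWorkers; its exact name at the time of this commit may differ) :

```c
#include <zstd.h>

/* each call below illustrates one mode; real code would pick just one */
static void setWorkerMode(ZSTD_CCtx* cctx)
{
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 0); /* single-threaded, blocking */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 1); /* asynchronous : 1 worker thread */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 2); /* parallel, like nbThreads>=2 before */
}
```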
to avoid confusion with blocks.
also:
- jobs are cut into chunks of 512KB now, to reduce nb of mutex calls.
- fix function declaration ZSTD_getBlockSizeMax()
- fix outdated comment
Other job members are accessed directly.
This avoids a full job copy, which would access everything,
including a few members that are supposed to be used by the worker only,
and would uselessly require additional locks to avoid race conditions.
writeLastEmptyBlock() must release srcBuffer
as mtctx assumes it's done by job worker.
minor : changed 2 job member names (src->srcBuffer, srcStart->prefixStart) for clarity
replaced by the equivalent signal job->consumed == job->srcSize.
created additional functions
ZSTD_writeLastEmptyBlock()
and
ZSTDMT_writeLastEmptyBlock()
required when it's necessary to finish a frame with a last empty job, to create an "end of frame" marker.
It avoids creating a job with srcSize==0.
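As a sketch of what this marker looks like on the wire (simplified, with minimal error handling; not the actual implementation) : a zstd block header is 3 little-endian bytes, and an empty raw block with the last-block flag set is all the marker requires.

```c
#include <stddef.h>

/* Sketch : append the "end of frame" marker,
 * i.e. an empty raw block with the last-block flag set. */
static size_t writeLastEmptyBlock(void* dst, size_t dstCapacity)
{
    /* 3-byte block header, little-endian :
     * bit 0 : lastBlock=1 ; bits 1-2 : blockType=0 (raw) ; bits 3-23 : blockSize=0 */
    unsigned const cBlockHeader = 1;
    if (dstCapacity < 3) return 0;   /* real code would return an error code */
    ((unsigned char*)dst)[0] = (unsigned char)(cBlockHeader);
    ((unsigned char*)dst)[1] = (unsigned char)(cBlockHeader >> 8);
    ((unsigned char*)dst)[2] = (unsigned char)(cBlockHeader >> 16);
    return 3;
}
```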
When ZSTD_e_end directive is provided,
the question is not only "are internal buffers completely flushed",
it is also "is current frame completed".
In some rare cases,
it was possible for internal buffers to be completely flushed,
triggering a @return == 0,
but the frame was not completed, as it needed a last null-size block to mark the end,
resulting in an unfinished frame.
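In terms of the current API (ZSTD_compress_generic() at the time, since renamed ZSTD_compressStream2()), the contract callers should rely on is the return value; a sketch :

```c
#include <zstd.h>

/* Sketch : finishing a frame correctly. remaining==0 is the only reliable
 * "frame completed" signal; empty internal buffers alone are not enough. */
static int finishFrame(ZSTD_CCtx* cctx, ZSTD_outBuffer* output, ZSTD_inBuffer* input)
{
    size_t remaining;
    do {
        remaining = ZSTD_compressStream2(cctx, output, input, ZSTD_e_end);
        if (ZSTD_isError(remaining)) return 1;
        /* drain *output to its destination here when it fills up */
    } while (remaining != 0);  /* 0 => buffers flushed AND frame completed */
    return 0;
}
```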
no real consequence, but it pollutes tsan tests :
job->dstBuff is being modified inside the worker,
while the main thread might read it accidentally,
because it copies the whole job.
But since the main thread doesn't actually use dstBuff, there is no real consequence.
Other potential solution : only copy useful data, instead of the whole job
When the dictionary is <= 8 bytes, no data is loaded from the dictionary.
In this case the repcodes weren't set, because they were inserted after the
size check. Fix this problem in general by first setting the cdict state to
a clean state of an empty dictionary, then filling the state from there.
Produces 3 statistics for ongoing frame compression :
- ingested
- consumed (effectively compressed)
- produced
Ingested can be larger than consumed due to buffering effect.
For the time being, this patch mostly fixes the % ratio issue,
since it computes consumed / produced,
instead of ingested / produced.
That being said, the update is not "smooth",
because on a slow enough setting,
fileio spends most of its time waiting for a worker to complete its job.
This could be improved with more granular flushing,
i.e. start flushing before the ongoing job is fully completed.
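A sketch of how a caller such as fileio.c can poll these statistics, using the experimental API as it exists today (the structure may have gained fields since this commit) :

```c
#define ZSTD_STATIC_LINKING_ONLY   /* ZSTD_getFrameProgression() is experimental API */
#include <zstd.h>
#include <stdio.h>

static void printProgress(const ZSTD_CCtx* cctx)
{
    ZSTD_frameProgression const fp = ZSTD_getFrameProgression(cctx);
    /* ingested >= consumed, due to buffering */
    printf("\ringested:%u MB  consumed:%u MB  produced:%u MB ",
           (unsigned)(fp.ingested >> 20),
           (unsigned)(fp.consumed >> 20),
           (unsigned)(fp.produced >> 20));
}
```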
ZSTD_create?Dict() is required to produce a ?Dict* return type
because `free()` does not accept a `const type*` argument.
If it wasn't for this restriction, I would have preferred to create a `const ?Dict*` object
to emphasize the fact that, once created, a dictionary never changes
(hence can be shared concurrently until the end of its lifetime).
There is no such limitation with initStatic?Dict() :
as stated in the doc, there is no corresponding free() function :
since `workspace` is provided, hence allocated, externally,
it can only be freed externally.
Which means, ZSTD_initStatic?Dict() can return a `const ZSTD_?Dict*` pointer.
Tested with `make all` to catch initStatic's users.
Incidentally, also updated zstd.h documentation.
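The contrast, sketched against today's experimental API (enum names such as ZSTD_dlm_byCopy and ZSTD_dct_auto postdate this commit) :

```c
#define ZSTD_STATIC_LINKING_ONLY   /* initStatic / estimate / getCParams are experimental API */
#include <zstd.h>
#include <stdlib.h>

static void demo(const void* dict, size_t dictSize)
{
    /* heap route : must stay non-const, because ZSTD_freeCDict() takes ZSTD_CDict* */
    ZSTD_CDict* const heapDict = ZSTD_createCDict(dict, dictSize, 3);
    ZSTD_freeCDict(heapDict);      /* would not compile on a const ZSTD_CDict* */

    /* static route : no free function, so the result can be const */
    {   size_t const wkspSize = ZSTD_estimateCDictSize(dictSize, 3);
        void* const wksp = malloc(wkspSize);
        ZSTD_compressionParameters const cParams = ZSTD_getCParams(3, 0, dictSize);
        const ZSTD_CDict* const staticDict = ZSTD_initStaticCDict(
                wksp, wkspSize, dict, dictSize,
                ZSTD_dlm_byCopy, ZSTD_dct_auto, cParams);
        (void)staticDict;
        free(wksp);   /* the workspace is freed externally, never the CDict itself */
    }
}
```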
would create too large buffers,
since default job size == window size * 4.
This would crash on 32-bit systems.
Also : jobSize being a 32-bit unsigned, it cannot be >= 4 GB,
so the formula was failing for large window sizes >= 1 GB.
Fixed now : max jobSize is 2 GB, whatever the window size.
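The arithmetic, as a sketch (illustrative names, not the actual code) :

```c
/* naive 32-bit formula : (unsigned)(windowSize * 4) wraps to 0
 * for a 1 GB window, since 4 GB does not fit in a 32-bit unsigned */
static unsigned long long computeJobSize(unsigned long long windowSize)
{
    unsigned long long jobSize = windowSize * 4;         /* compute in 64 bits */
    if (jobSize > (1ULL << 31)) jobSize = (1ULL << 31);  /* cap at 2 GB */
    return jobSize;
}
```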
this happened on 32-bit builds when requiring a too large input buffer,
typically on wlog=29, creating jobs of 2 GB size.
also : zstd32 now compiles with multithread support enabled by default
(can be disabled with HAVE_THREAD=0)
Shaves 492,076 B off the `ZSTD_CDict`.
The size of a `ZSTD_CDict` created from a 112,640 B dictionary is:
| Level | Before (B) | After (B) |
|-------|------------|-----------|
| 1 | 648,448 | 156,412 |
| 3 | 1,140,008 | 647,932 |
This new parameter makes it possible to call
streaming ZSTDMT with a single worker thread,
in a non-blocking fashion.
The main thread can then do other tasks in parallel
while the worker thread does compression.
Typically, for zstd cli, it means it can do I/O stuff.
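A sketch of the resulting pattern (modern function names; at the time of this commit the entry point was ZSTD_compress_generic()) :

```c
#include <stdio.h>
#include <zstd.h>

/* With nbWorkers==1, ZSTD_compressStream2() ingests input and returns quickly,
 * so the next fread() overlaps with compression running in the worker thread. */
static void compressFile(ZSTD_CCtx* cctx, FILE* fin, FILE* fout,
                         void* inBuf, size_t inSize, void* outBuf, size_t outSize)
{
    size_t readSize;
    while ((readSize = fread(inBuf, 1, inSize, fin)) > 0) {
        ZSTD_inBuffer input = { inBuf, readSize, 0 };
        while (input.pos < input.size) {
            ZSTD_outBuffer output = { outBuf, outSize, 0 };
            ZSTD_compressStream2(cctx, &output, &input, ZSTD_e_continue);
            fwrite(outBuf, 1, output.pos, fout);  /* caller thread does the I/O */
        }
    }
    /* ... then finish the frame with ZSTD_e_end, checking @return==0 ... */
}
```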
Applied within fileio.c, this patch provides non-negligible gains during compression.
Tested on my laptop, with enwik9 (1000000000 bytes) : time zstd -f enwik9
With traditional single-thread blocking mode :
real 0m9.557s
user 0m8.861s
sys 0m0.538s
With new single-worker non blocking mode :
real 0m7.938s
user 0m8.049s
sys 0m0.514s
=> 20% faster
it still falls back to single-threaded blocking invocation
when input is small (< 1 job)
or when invoking ZSTDMT_compress(), which is blocking.
Also : fixed a bug in new block-granular compression routine.
Pathological samples may result in the literal section being incompressible.
This case is now detected,
and literal distribution is replaced by one that can be written into the dictionary.
constants in zstd.h should not depend on the MIN() macro, whose existence is not guaranteed.
Added a test to check the specific constants.
The test is a bit too specific.
But I have found no way to express a more generic "are all macros already defined" condition,
especially as this is a valid construction (the missing macro might be defined later, intentionally).
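A sketch of the kind of check involved (the actual test targets specific zstd.h constants; ZSTD_COMPRESSBOUND below is just one example) :

```c
/* Compile this translation unit with no MIN() in scope :
 * if a public macro silently relied on MIN(), compilation would fail. */
#include <zstd.h>
/* deliberately no #define MIN(a,b) here */
typedef char checkCompressBound[(ZSTD_COMPRESSBOUND(100) > 100) ? 1 : -1];
```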
in a new "custom memory allocator" paragraph
which is itself part of the "memory management" category.
This makes it simpler to see the relation between the type and its usages.
It used to stop on reaching extDict, for simplification.
As a consequence, there was a small compression ratio loss each time the round buffer would restart from the beginning.
It's not a large difference though, just several hundred bytes on silesia.
This patch fixes it.
now selected for levels 13, 14 and 15.
Also : dropped the requirement for a monotonic memory budget increase of compression levels,
which was required for ZSTD_estimateCCtxSize()
in order to ensure that a memory budget for level L is large enough for any level <= L.
This condition is now ensured at run time inside ZSTD_estimateCCtxSize().
we want the dictionary table to be fully sorted,
not just lazily filled.
Dictionary loading is a bit more intensive,
but it saves cpu cycles for match search during compression.
This is a pretty nice speed win.
The new strategy consists in stacking new candidates as if it were a hash chain.
Then, only if there is a need to actually consult the chain, they are batch-updated,
before starting the match search itself.
This is supposed to be beneficial when skipping positions,
which happens a lot when using lazy strategy.
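A conceptual sketch of the idea (illustrative names, not the actual btlazy2 code) :

```c
#include <stddef.h>
#define MAX_PENDING 64

typedef struct {
    size_t pending[MAX_PENDING];  /* positions stacked like a hash chain */
    size_t nbPending;
} DeferredTree;

/* cheap path, used when skipping positions : just record the candidate */
static void stackPosition(DeferredTree* dt, size_t pos)
{
    if (dt->nbPending < MAX_PENDING) dt->pending[dt->nbPending++] = pos;
}

static void insertIntoBTree(size_t pos) { (void)pos; /* real binary-tree insertion */ }

/* expensive path, run only when a match search actually consults the tree */
static size_t findBestMatch(DeferredTree* dt, size_t pos)
{
    size_t i;
    for (i = 0; i < dt->nbPending; i++)  /* batch-update first */
        insertIntoBTree(dt->pending[i]);
    dt->nbPending = 0;
    /* ... then run the binary-tree search from pos ... */
    (void)pos;
    return 0;
}
```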
The baseline performance for btlazy2 on my laptop is :
15#calgary.tar : 3265536 -> 955985 (3.416), 7.06 MB/s , 618.0 MB/s
15#enwik7 : 10000000 -> 3067341 (3.260), 4.65 MB/s , 521.2 MB/s
15#silesia.tar : 211984896 -> 58095131 (3.649), 6.20 MB/s , 682.4 MB/s
(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)
After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit :
15#calgary.tar : 3265536 -> 958060 (3.408), 9.12 MB/s , 621.1 MB/s
15#enwik7 : 10000000 -> 3078318 (3.249), 6.37 MB/s , 525.1 MB/s
15#silesia.tar : 211984896 -> 58444111 (3.627), 9.89 MB/s , 680.4 MB/s
That's because I kept `1<<searchLog` as a maximum number of candidates to update.
But for a hash chain, this represents the total number of candidates in the chain,
while for the binary tree, it represents the maximum depth of the search.
Keep in mind that a lot of candidates won't even be visited in the btree,
since they are filtered out by the binary sort.
As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.
To compensate, it's enough to increase `searchLog` value.
Here is the result after adding just +1 to searchLog (level 15 setting in this patch):
15#calgary.tar : 3265536 -> 956311 (3.415), 8.32 MB/s , 611.4 MB/s
15#enwik7 : 10000000 -> 3067655 (3.260), 5.43 MB/s , 535.5 MB/s
15#silesia.tar : 211984896 -> 58113144 (3.648), 8.35 MB/s , 679.3 MB/s
aka, almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).
This modification makes btlazy2 more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
params1 was swapped with params2.
This used to be a non-issue when testing for strict equality,
but now that some tests look for "sufficient size" `<=`, order matters.
The deep fuzzer tests caught a subtle bug that was probably there for a long time.
The impact of the bug is not a crash, nor any other clear error signal;
rather, it reduces performance, by cutting data into smaller blocks than necessary.
Eventually, the following test would fail because it produces too many 1-byte blocks,
requiring more space than the buffer can provide :
`./zstreamtest_asan --mt -s3514 -t1678312 -i1678314`
The root scenario is as follows :
- Create context, initialize it using explicit parameters or a `cdict` to pin them down, set `pledgedSrcSize=1`
- The compression parameters will not be adapted, but `windowSize` and `blockSize` will be automatically set to `1`.
`windowSize` and `blockSize` are dynamic values, set within `ZSTD_resetCCtx_internal()`.
The automatic adaptation makes it possible to generate smaller contexts for smaller input sizes.
- Complete compression
- New compression with the same context, using the same parameters, but `pledgedSrcSize=ZSTD_CONTENTSIZE_UNKNOWN`,
triggering "continue mode"
- Continue mode doesn't modify blockSize, because it used to depend on `windowLog` only,
but in fact, it also depends on `pledgedSrcSize`.
- The "old" blocksize (1) is still there,
next compression will use this value to cut input into blocks,
resulting in more blocks and worse performance than necessary.
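The scenario, sketched against the public API (modern function names; simplified, not the exact failing test) :

```c
#include <zstd.h>

static void scenario(ZSTD_CCtx* cctx, ZSTD_outBuffer* out,
                     ZSTD_inBuffer* oneByte, ZSTD_inBuffer* bigInput)
{
    /* frame 1 : pledgedSrcSize=1 auto-sets windowSize and blockSize to 1 */
    ZSTD_CCtx_setPledgedSrcSize(cctx, 1);
    ZSTD_compressStream2(cctx, out, oneByte, ZSTD_e_end);

    /* frame 2 : same context, same parameters => "continue mode" */
    ZSTD_CCtx_setPledgedSrcSize(cctx, ZSTD_CONTENTSIZE_UNKNOWN);
    /* before the fix : blockSize stayed at 1,
     * so this input was cut into 1-byte blocks */
    ZSTD_compressStream2(cctx, out, bigInput, ZSTD_e_end);
}
```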
Given the scenario, and its possible variants, I'm surprised it did not show up before.
But I suspect it did show up, it's just that it never triggered an error, because "worse performance" is not a trigger.
The above test is a special corner case, where performance is so impacted that it reaches an error case.
The fix works, but I'm not completely pleased.
I think the current code relies too much on implied relations between variables.
This will likely break again in the future when some related part of the code changes.
Unfortunately, no time to make larger changes if we want to keep the release target for zstd v1.3.3.
So a longer term fix will have to be considered after the release.
To do : create a reliable test case which triggers this scenario for CI tests.
`zstreamtest --newapi` (and `--opaqueapi`) create and destroy way too many threads
resulting in failures of tsan tests,
and potentially connected to the flaky qemu tests.
This is because, at each test, the nb of threads can be changed (random).
The `--no-big-tests` directive reduces this choice to 1 or 2 threads,
in order to limit memory usage, especially for qemu and 32-bits builds.
Unfortunately, swapping between 1 and 2 threads is enough to constantly create/destroy new mtctx.
This patch takes advantage of the following property :
via compress_generic, no internal mtctx is needed for nbThreads < 2.
As a consequence, when nbThreads == 2, the currently active mtctx is necessarily good.
This dramatically reduces the nb of thread creations when invoking `zstreamtest --newapi --no-big-tests`
(only when parent cctx itself is created, which is randomized to 1/256 tests).
Expected outcome :
- at a minimum : tsan tests shall now work continuously without exploding the thread counter
- at best : flaky qemu tests on `zstreamtest --newapi --no-big-tests` may stop being flaky, due to less stress from constant thread creation/destruction
Real world impact :
minimal, I don't expect users to constantly change `nbThreads` between each invocation.
If `nbThreads` remains stable, existing implementation re-uses existing mtctx.
Also : `zstreamtest --newapi` but without `--no-big-tests` doesn't benefit as much,
since this test can select a random `nbThreads` value between 1 and 4.
The current patch only reduces opportunity to free/create mtctx (for example : 2->1->2 doesn't need a new mtctx)
but doesn't completely eliminate it, since `nbThreads` can still change between 2/3/4.
A more complete solution could be to only use 2 out of 4 allocated threads, thus keeping the pool at a constant size.
This would require a larger change to `POOL_*` api though.
taking advantage of `btopt`'s improved speed to tune parameters.
Levels 16-19 are stronger than in the previous release, making the graph more favorable.
In theory, I should also update small-size tables,
but I got lazy on that one ...