Method 1 __packed is always as good or better than memcpy().
But it's not portable, as it depends on compiler extension.
For gcc, __pakced directive works fine.
Furthermore, gcc has serious performance issues with memcpy() on ARM 32 bits.
See #620
The compressor always reuses the existing Huffman table if the literals
size is at most 1 KiB. If the compression strategy is `ZSTD_lazy` or
stronger always check to see if reusing the previous table or creating
a new table is better.
This doesn't yet weigh in decompression speed. I don't want to add any
heuristics there until I have real data to work with to ensure that the
heuristic works for at least one use case, preferably more.
* Compressor saves most recently used Huffman table and reuses it
if it produces better results.
* I attempted to preserve CPU usage profile.
I intentionally left all of the existing heuristics in place.
There is only a speed difference on the second block and later.
When compressing large enough blocks (say >= 4 KiB) there is
no significant difference in compression speed.
Dictionary compression of one block is the same speed for blocks
with literals <= 1 KiB, and after that the difference is not
very significant.
* In the synthetic data, with blocks 10 KB or smaller, most blocks
can't use repeated tables because the previous block did not
contain a symbol that the current block contains.
Once blocks are about 12 KB or more, most previous blocks have
valid Huffman tables for the current block, and the compression
ratio and decompression speed jumped.
* In silesia blocks as small as 4KB can frequently reuse the
previous Huffman table (85%), but it isn't as profitable, and
the previous Huffman table only gets used about 3% of the time.
* Microbenchmarks show that `HUF_validateCTable()` takes ~55 ns
and `HUF_estimateCompressedSize()` takes ~35 ns.
They are decently well optimized, the first versions took 90 ns
and 120 ns respectively. `HUF_validateCTable()` could be twice as
fast, if we cast the `HUF_CElt*` to a `U32*` and compare to 0.
However, `U32` has an alignment of 4 instead of 2, so I think that
might be undefined behavior.
* I've ran `zstreamtest` compiled normally, with UASAN and with MSAN
for 4 hours each.
The worst case for the speed difference is a bunch of small blocks
in the same frame. I modified `bench.c` to compress the input in a
single frame but with blocks of the given block size, set by `-B`.
Benchmarks on level 1:
| Program | Block size | Corpus | Ratio | Compression MB/s | Decompression MB/s |
|-----------|------------|-----------|-------|------------------|--------------------|
| zstd.base | 256 | synthetic | 2.364 | 110.0 | 297.0 |
| zstd | 256 | synthetic | 2.367 | 108.9 | 297.0 |
| zstd.base | 256 | silesia | 2.204 | 93.8 | 415.7 |
| zstd | 256 | silesia | 2.204 | 93.4 | 415.7 |
| zstd.base | 512 | synthetic | 2.594 | 144.2 | 420.0 |
| zstd | 512 | synthetic | 2.599 | 141.5 | 425.7 |
| zstd.base | 512 | silesia | 2.358 | 118.4 | 432.6 |
| zstd | 512 | silesia | 2.358 | 119.8 | 432.6 |
| zstd.base | 1024 | synthetic | 2.790 | 192.3 | 594.1 |
| zstd | 1024 | synthetic | 2.794 | 192.3 | 600.0 |
| zstd.base | 1024 | silesia | 2.524 | 148.2 | 464.2 |
| zstd | 1024 | silesia | 2.525 | 148.2 | 467.6 |
| zstd.base | 4096 | synthetic | 3.023 | 300.0 | 1000.0 |
| zstd | 4096 | synthetic | 3.024 | 300.0 | 1010.1 |
| zstd.base | 4096 | silesia | 2.779 | 223.1 | 623.5 |
| zstd | 4096 | silesia | 2.779 | 223.1 | 636.0 |
| zstd.base | 16384 | synthetic | 3.131 | 350.0 | 1150.1 |
| zstd | 16384 | synthetic | 3.152 | 350.0 | 1630.3 |
| zstd.base | 16384 | silesia | 2.871 | 296.5 | 883.3 |
| zstd | 16384 | silesia | 2.872 | 294.4 | 898.3 |
Previously,
followed by :
would fail to include the static definitions,
because the second include was simply skipped by guard macro.
Now it works as intended :
the missing static part is included during the second include.
XXH_STATIC_LINKING_ONLY protection macro is intended to be triggered just before the include.
The main idea is to keep this setting local :
user module shall explicitly understand and accept the static linking restriction
which becomes transparent when triggering the macro at project level.
Global definition also triggers redefinition warnings for user modules which do locally define the macro.
This new version compiles lib and cli without warning when the macro is set globally.
That's not a scenario to be recommended, since it trades a local effect for a global one,
but it was easy enough to provide from zstd side.
MT compression generates a single frame.
Multi-threading operates by breaking the frames into independent sections.
But from a decoder perspective, there is no difference :
it's just a suite of blocks.
Problem is, decoder preserves repCodes from previous block to start decoding next block.
This is also valid between sections, since they are no different than changing block.
Previous version would incorrectly initialize repcodes to their default value at the beginning of each section.
When using them, there was a mismatch between encoder (default values) and decoder (values from previous block).
This change ensures that repcodes won't be used at the beginning of a new section.
It works by setting them to 0.
This only works with regular (single segment) variants : extDict variants will fail !
Fortunately, sections beyond the 1st one belong to this category.
To be checked : btopt strategy.
This change was only validated from fast to btlazy2 strategies.
execSequence relied on pointer overflow to handle cases where
`sequence.matchLength < 8`. Instead of passing an `size_t` to
wildcopy, pass a `ptrdiff_t`.
The zstd reference compressor will not emit a match in the last 7
bytes of a block. The decompressor will also not accept a match
in the last 7 bytes. This patch makes the decompressor accept a
match in the last 7 bytes.
If `weightTotal == 0`, then `BIT_highbit32(weightTotal)` is
undefined behavior in the case that it calls `__builtin_clz()`.
If `tableLog == HUF_TABLELOG_ABSOLUTEMAX` then we will access one
byte beyond the end of the buffer.