Having a dedicated file for optimal parser
made sense during its creation,
it allowed Przemyslaw to work more freely on lz4opt, with less dependency on lz4hc,
moreover, the optimal parser was more complex, with its own search functions.
Since the optimal was rewritten last year, it's now a lot lighter.
It makes more sense now to integrate it directly inside lz4hc.c,
making it easier to edit (editors are a bit "lost" inside a `*.h` dependent on its #include position),
it also reduces the number of files in the project,
which fits pretty well with lz4 objectives.
(adding lz4hc requires "just" lz4hc.h and lz4hc.c).
The LZ4 block format specification
states that the last match must start
at a minimum distance of 12 bytes from the end of the block.
However, out of an abundance of caution,
the reference implementation would actually stop searching matches
at 13 bytes from the end of the block.
This patch fixes this small detail.
The new version is now able to properly compress a limit case
such as `aaaaaaaabaaa\n`
as reported by Gao Xiang (@hsiangkao).
Obviously, it doesn't change a lot of things.
This is just one additional match candidate per block, with a maximum match length of 7 (since last 5 bytes must remain literals).
With default policy, blocks are 4 MB long, so it doesn't happen too often
Compressing silesia.tar at default level 1 saves 5 bytes (100930101 -> 100930096).
At max level 12, it saves a grand 16 bytes (77389871 -> 77389855).
The impact is a bit more visible when blocks are smaller, hence more numerous.
For example, compressing silesia with blocks of 64 KB (using -12 -B4D) saves 543 bytes (77304583 -> 77304040).
So the smaller the packet size, the more visible the impact.
And it happens we have a ton of scenarios with little blocks using LZ4 compression ...
And a useless "hooray" sidenote :
the patch improves the LZ4 compression record of silesia (using -12 -B7D --no-frame-crc) by 16 bytes (77270672 -> 77270656)
and the record on enwik9 by 44 bytes (371680396 -> 371680352) (previously claimed by [smallz4](http://create.stephan-brumme.com/smallz4/) ).
Pattern analysis (currently limited to long ranges of identical bytes)
is actually detrimental to performance
when `nbSearches` is low.
Reason is : `nbSearches` provides a built-in protection for these cases.
The problem with patterns is that they dramatically increase the number of candidates to visit.
But with a low nbSearches, the match finder just aborts early.
In such cases, pattern analysis adds some complexity without reducing total nb of candidates.
It actually increases compression ratio a little bit, by filtering only "good" candidates,
but at a measurable speed cost, so it's not a good trade-off.
This patch makes pattern analysis optional.
It's enabled for levels 8+ only.
lz4opt is only competitive vs lz4hc level 10.
Below that level, it doesn't match the speed / compression effectiveness of regular hc parser.
This patch propose to extend lz4opt to levels 10-12.
The new level 10 tend to compress a bit better and a bit faster than previous one (mileage vary depending on file)
The only downside is that `limitedDestSize` mode is now limited to max level 9 (vs 10),
since it's only compatible with regular HC parser.
(Note : I suspect it's possible to convert lz4opt to support it too, but haven't spent time into it).
The first byte used to be skipped
to avoid a infinite self-comparison.
This is no longer necessary, since init() ensures that index starts at 64K.
The first byte is also useless to search when each block is independent,
but it's no longer the case when blocks are linked.
Removing the first-byte-skip saves
about 10 bytes / MB on files compressed with -BD4 (linked blocks 64Kb),
which feels correct as each MB has 16 blocks of 64KB.
- works with byte values other than `0`
- works for any repetitive pattern of length 1, 2 or 4 (but not 3!)
- works for little and big endian systems
- preserve speed of previous implementation
previous strategy would leave a few "bad choices"
on the ground they would be fixed later,
but that requires passing through each position to make the fix
and cannot give the end position of the last useful match.
I noticed that, while 'lz4 -12' works fine, 'lz4 -13' does not compress
at all.
$ cat </etc/passwd |wc -c
2565
$ lz4 -12 </etc/passwd |wc -c
1456
$ lz4 -13 </etc/passwd |wc -c
2584
Perhaps the best way to fix this is to clamp the compression level
in the LZ4HC_compress_generic routine.