updated LZ4 block format

rewording the end of block conditions
for clarity and answering related questions.
Yann Collet 2019-01-02 14:36:12 -08:00
parent 6e24ef902a
commit 7a4e04e6a6


@@ -1,6 +1,6 @@
LZ4 Block Format Description
============================
Last revised: 2018-04-25.
Last revised: 2018-12-30.
Author : Yann Collet
@@ -10,7 +10,8 @@ using any programming language.
LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
There is no entropy encoder back-end nor framing layer.
The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
The latter is assumed to be handled by other parts of the system
(see [LZ4 Frame format]).
This design is assumed to favor simplicity and speed.
It helps later on for optimizations, compactness, and features.
@@ -104,45 +105,41 @@ A common case is an offset of 1,
meaning the last byte is repeated `matchlength` times.
Parsing restrictions
End of block restrictions
-----------------------
There are specific parsing rules to respect in order to remain compatible
with assumptions made by the decoder :
There are specific rules required to terminate a block.
1. The last 5 bytes are always literals. In other words, the last five bytes
from the uncompressed input (or all bytes, if the input has less than five
bytes) must be encoded as literals on behalf of the last sequence.
The last sequence is incomplete, and stops right after the literals.
2. The last match must start at least 12 bytes before end of block.
The last match is part of the penultimate sequence,
since the last sequence stops right after literals.
1. The last sequence only contains literals. The block ends right after them.
1. The last 5 bytes of input are always literals.
Therefore, the last sequence contains at least 5 bytes,
or all input bytes if input is smaller than 5 bytes
(empty input can be represented with a zero byte,
interpreted as a token without literal and without a match).
2. The last match must start at least 12 bytes before the end of block.
The last match is part of the penultimate sequence.
It is followed by the last sequence, which only contains literals.
Note that, as a consequence, blocks < 13 bytes cannot be compressed.
These rules are in place to ensure that the decoder
can speculatively execute copy instructions
without ever reading nor writing beyond provided I/O buffers.
1. To copy literals from a non-last sequence, an 8-byte copy instruction
can always be safely issued (without reading past the input),
because literals are followed by a 2-byte offset,
and last sequence is at least 1+5 bytes long.
2. Similarly, a match operation can speculatively copy up to 12 bytes
while remaining within output buffer boundaries.
Empty inputs can be represented with a zero byte,
interpreted as a token without literals and without a match.
These rules are in place to ensure that a compatible decoder
can be designed for speed, speculatively issuing instructions,
while never reading or writing beyond the provided I/O buffers.
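
The end-of-block rules can be checked mechanically. The following is a minimal sketch in Python, not part of the commit: the names `_read_length` and `end_of_block_violations` are hypothetical. It relies on the sequence layout given in the rest of the format description (token, optional extra length bytes, literals, 2-byte offset, minimum match length of 4), and it assumes the block is not truncated.

```python
def _read_length(block, i, nibble):
    """Extend a 4-bit length field with extra bytes when the nibble is 15."""
    length = nibble
    if nibble == 15:
        while True:
            extra = block[i]
            i += 1
            length += extra
            if extra != 255:
                break
    return length, i


def end_of_block_violations(block):
    """Return a list of violated end-of-block rules (empty if none).

    Illustrative only: assumes a well-formed, non-truncated block.
    """
    if not block:
        return ["a block needs at least one token byte"]
    problems = []
    i = 0                    # read position in the compressed block
    produced = 0             # number of decompressed bytes so far
    last_match_start = None  # decompressed position of the last match
    while True:
        token = block[i]
        i += 1
        lit_len, i = _read_length(block, i, token >> 4)
        i += lit_len         # skip the literals themselves
        produced += lit_len
        if i >= len(block):
            break            # last sequence: stops right after literals
        i += 2               # skip the 2-byte offset
        match_len, i = _read_length(block, i, token & 0x0F)
        match_len += 4       # minimum match length is 4
        last_match_start = produced
        produced += match_len
        if i >= len(block):
            problems.append("block ends with a match instead of literals")
            lit_len = 0
            break
    if last_match_start is not None:
        if lit_len < 5:
            problems.append("last 5 bytes are not all literals")
        if produced - last_match_start < 12:
            problems.append("last match starts less than 12 bytes before end")
    return problems
```

For instance, `end_of_block_violations(b"\x00")` returns an empty list, since a single zero byte is the valid encoding of an empty input.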
Additional notes
-----------------------
There is no assumption nor limits to the way the compressor
If the decoder will decompress data from an external source,
it is recommended to ensure that the decoder will not be vulnerable to
buffer overflow manipulations.
Always ensure that read and write operations
remain within the limits of provided buffers.
Test the decoder with fuzzers
to ensure it's resilient to improbable combinations.
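
To illustrate the advice above, here is a hedged sketch of a bounds-checked block decoder in Python. It is not the reference implementation, and the function name `lz4_block_decode` and its `max_output` parameter are assumptions made for this example. Every read from the input and every write to the output is validated first, at the cost of the speculative copies a fast decoder would use. It does not enforce the end-of-block rules.

```python
def lz4_block_decode(src, max_output):
    """Decode one LZ4 block, rejecting any out-of-bounds access."""
    out = bytearray()
    i = 0
    n = len(src)

    def read_length(nibble):
        # Extra length bytes follow when the 4-bit field is 15.
        nonlocal i
        length = nibble
        if nibble == 15:
            while True:
                if i >= n:
                    raise ValueError("truncated length field")
                extra = src[i]
                i += 1
                length += extra
                if extra != 255:
                    break
        return length

    while True:
        if i >= n:
            raise ValueError("truncated block: missing token")
        token = src[i]
        i += 1

        lit_len = read_length(token >> 4)
        if i + lit_len > n:
            raise ValueError("literals run past end of input")
        if len(out) + lit_len > max_output:
            raise ValueError("output limit exceeded")
        out += src[i:i + lit_len]
        i += lit_len

        if i == n:
            return bytes(out)  # last sequence stops right after literals

        if i + 2 > n:
            raise ValueError("truncated offset")
        offset = src[i] | (src[i + 1] << 8)  # little-endian
        i += 2
        if offset == 0 or offset > len(out):
            raise ValueError("offset points outside decoded data")

        match_len = read_length(token & 0x0F) + 4  # minimum match is 4
        if len(out) + match_len > max_output:
            raise ValueError("output limit exceeded")
        start = len(out) - offset
        # Byte-by-byte copy so overlapping matches (offset < length) work.
        for k in range(match_len):
            out.append(out[start + k])
```

A block built from one literal `a`, a match with offset 1 and length 14, then five literals decodes to twenty `a` bytes: `lz4_block_decode(b"\x1aa\x01\x00\x50aaaaa", 64)`. Feeding the function random or mutated inputs is a simple way to start the fuzz testing recommended above.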
The format places no assumption or limit on the way the compressor
searches and selects matches within the source data block.
It could be a fast scan, a multi-probe, a full search using BST,
standard hash chains or MMC, well whatever.
Advanced parsing strategies can also be implemented, such as lazy match,
or full optimal parsing.
All these trade-off offer distinctive speed/memory/compression advantages.
Whatever the method used by the compressor, its result will be decodable
by any LZ4 decoder if it follows the format specification described above.
Multiple techniques can be considered,
featuring distinct time / performance trade-offs.
As long as the format is respected,
the result will be compatible and decodable by any compliant decoder.
An upper compression limit can be reached,
using a technique called "full optimal parsing", at high CPU cost.