updated LZ4 block format
rewording the end of block conditions for clarity and answering related questions.
parent
6e24ef902a
commit
7a4e04e6a6
@ -1,6 +1,6 @@
|
||||
LZ4 Block Format Description
|
||||
============================
|
||||
Last revised: 2018-04-25.
|
||||
Last revised: 2018-12-30.
|
||||
Author : Yann Collet
|
||||
|
||||
|
||||
@ -10,7 +10,8 @@ using any programming language.
|
||||
|
||||
LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
|
||||
There is no entropy encoder back-end nor framing layer.
|
||||
The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
|
||||
The latter is assumed to be handled by other parts of the system
|
||||
(see [LZ4 Frame format]).
|
||||
This design is assumed to favor simplicity and speed.
|
||||
It helps later on for optimizations, compactness, and features.
|
||||
|
||||
@ -104,45 +105,41 @@ A common case is an offset of 1,
|
||||
meaning the last byte is repeated `matchlength` times.
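
As an illustration of the offset-of-1 case above, here is a minimal Python sketch of an overlapping match copy. The helper name `copy_match` and its parameters are hypothetical and not part of the format; the only point is that the copy proceeds byte by byte, so bytes written earlier in the same match can be read again.

```python
def copy_match(out: bytearray, offset: int, match_length: int) -> None:
    """Append match_length bytes, read starting `offset` bytes back in `out`.

    Illustrative helper (name and signature are assumptions, not part of the spec).
    """
    if offset < 1 or offset > len(out):
        raise ValueError("offset points outside the already-decoded data")
    start = len(out) - offset
    # Copy one byte at a time: when offset < match_length the source region
    # overlaps the bytes being written, so freshly written bytes get re-read.
    for i in range(match_length):
        out.append(out[start + i])


buf = bytearray(b"ab")
copy_match(buf, 1, 4)  # offset 1: the last byte 'b' is repeated 4 times
print(buf)             # bytearray(b'abbbbb')
```

A block-wise copy such as a slice assignment would not give the same result when the regions overlap, which is why the sketch copies byte by byte.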
|
||||
|
||||
|
||||
Parsing restrictions
|
||||
End of block restrictions
|
||||
-----------------------
|
||||
There are specific parsing rules to respect in order to remain compatible
|
||||
with assumptions made by the decoder :
|
||||
There are specific rules required to terminate a block.
|
||||
|
||||
1. The last 5 bytes are always literals. In other words, the last five bytes
|
||||
from the uncompressed input (or all bytes, if the input has fewer than five
|
||||
bytes) must be encoded as literals on behalf of the last sequence.
|
||||
The last sequence is incomplete, and stops right after the literals.
|
||||
2. The last match must start at least 12 bytes before end of block.
|
||||
The last match is part of the penultimate sequence,
|
||||
since the last sequence stops right after literals.
|
||||
1. The last sequence only contains literals. The block ends right after them.
|
||||
2. The last 5 bytes of input are always literals.
|
||||
Therefore, the last sequence contains at least 5 bytes,
|
||||
or all input bytes if input is smaller than 5 bytes
|
||||
(empty input can be represented with a zero byte,
|
||||
interpreted as a token without literal and without a match).
|
||||
3. The last match must start at least 12 bytes before the end of block.
|
||||
The last match is part of the penultimate sequence.
|
||||
It is followed by the last sequence, which only contains literals.
|
||||
Note that, as a consequence, blocks < 13 bytes cannot be compressed.
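
The end-of-block rules can be sketched in Python for the simplest case: a block made of a single, literal-only last sequence. The names `encode_literals_only`, `may_contain_match` and the two constants are hypothetical. The token layout (literal length in the high 4 bits, extended with extra length bytes once it reaches 15) is taken from the earlier sections of this format description.

```python
LAST_LITERALS = 5        # the last 5 input bytes are always literals
MATCH_START_MARGIN = 12  # the last match starts at least 12 bytes before the end


def encode_literals_only(data: bytes) -> bytes:
    """Encode `data` as one block holding a single literal-only sequence."""
    n = len(data)
    if n < 15:
        # Token: literal length in the high nibble, match nibble left at 0.
        header = bytearray([n << 4])
    else:
        header = bytearray([15 << 4])  # 15 means "more length bytes follow"
        rest = n - 15
        while rest >= 255:
            header.append(255)
            rest -= 255
        header.append(rest)            # the final length byte is always < 255
    return bytes(header) + data


def may_contain_match(input_size: int) -> bool:
    """True if an input of this size is large enough to hold any match."""
    return input_size >= MATCH_START_MARGIN + 1


print(encode_literals_only(b""))       # b'\x00'  (empty input: one zero token)
print(encode_literals_only(b"hello"))  # b'Phello' (token 0x50, then 5 literals)
print(may_contain_match(12), may_contain_match(13))  # False True
```

Such a block is always valid, whatever the input size, because it never emits a match; it simply does not compress anything.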
|
||||
|
||||
These rules are in place to ensure that the decoder
|
||||
can speculatively execute copy instructions
|
||||
without ever reading nor writing beyond provided I/O buffers.
|
||||
|
||||
1. To copy literals from a non-last sequence, an 8-byte copy instruction
|
||||
can always be safely issued (without reading past the input),
|
||||
because literals are followed by a 2-byte offset,
|
||||
and last sequence is at least 1+5 bytes long.
|
||||
2. Similarly, a match operation can speculatively copy up to 12 bytes
|
||||
while remaining within output buffer boundaries.
|
||||
|
||||
Empty inputs can be represented with a zero byte,
|
||||
interpreted as a token without literals and without a match.
|
||||
These rules are in place to ensure that a compatible decoder
|
||||
can be designed for speed, speculatively issuing instructions,
|
||||
while never reading nor writing beyond provided I/O buffers.
|
||||
|
||||
|
||||
Additional notes
|
||||
-----------------------
|
||||
There is no assumption about, nor any limit on, the way the compressor
|
||||
If the decoder will decompress data from an external source,
|
||||
it is recommended to ensure that the decoder will not be vulnerable to
|
||||
buffer overflow manipulations.
|
||||
Always ensure that read and write operations
|
||||
remain within the limits of provided buffers.
|
||||
Test the decoder with fuzzers
|
||||
to ensure it's resilient to improbable combinations.
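
A minimal sketch of what such a defensive decoder could look like, in Python. The function `decode_block` and its `max_output` parameter are hypothetical. It assumes the sequence layout described earlier in this document (token, optional length bytes, literals, 2-byte little-endian offset, minimum match length of 4). It checks every read and write, but does not verify the "last 5 literals" or "12 bytes" rules.

```python
import random


def decode_block(src: bytes, max_output: int) -> bytes:
    """Decode one block, raising ValueError on any out-of-bounds access."""
    out = bytearray()
    pos, n = 0, len(src)

    def read_length(base: int) -> int:
        nonlocal pos
        length = base
        if base == 15:  # extended length: add bytes until one is below 255
            while True:
                if pos >= n:
                    raise ValueError("truncated length field")
                b = src[pos]
                pos += 1
                length += b
                if b != 255:
                    break
        return length

    while True:
        if pos >= n:
            raise ValueError("block does not end with a literal-only sequence")
        token = src[pos]
        pos += 1
        lit_len = read_length(token >> 4)
        if pos + lit_len > n:
            raise ValueError("literals run past the end of input")
        if len(out) + lit_len > max_output:
            raise ValueError("output limit exceeded")
        out += src[pos:pos + lit_len]
        pos += lit_len
        if pos == n:
            return bytes(out)  # last sequence: the block ends after literals
        if pos + 2 > n:
            raise ValueError("truncated offset")
        offset = src[pos] | (src[pos + 1] << 8)
        pos += 2
        if offset == 0 or offset > len(out):
            raise ValueError("offset outside decoded data")
        match_len = read_length(token & 0x0F) + 4
        if len(out) + match_len > max_output:
            raise ValueError("output limit exceeded")
        start = len(out) - offset
        for i in range(match_len):  # byte-wise copy handles overlap
            out.append(out[start + i])


# Tiny fuzz loop: random input must either decode or raise ValueError.
rng = random.Random(0)
for _ in range(1000):
    junk = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
    try:
        decode_block(junk, max_output=1 << 16)
    except ValueError:
        pass
```

The loop above is only a smoke test. A coverage-guided fuzzer exercises far more input combinations than uniformly random bytes.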
|
||||
|
||||
The format makes no assumption about, nor sets any limit on, the way the compressor
|
||||
searches and selects matches within the source data block.
|
||||
It could be a fast scan, a multi-probe, a full search using BST,
|
||||
standard hash chains or MMC, well whatever.
|
||||
|
||||
Advanced parsing strategies can also be implemented, such as lazy match,
|
||||
or full optimal parsing.
|
||||
|
||||
All these trade-offs offer distinctive speed/memory/compression advantages.
|
||||
Whatever the method used by the compressor, its result will be decodable
|
||||
by any LZ4 decoder if it follows the format specification described above.
|
||||
Multiple techniques can be considered,
|
||||
each offering a distinct speed / compression ratio trade-off.
|
||||
As long as the format is respected,
|
||||
the result will be compatible and decodable by any compliant decoder.
|
||||
An upper compression limit can be reached,
|
||||
using a technique called "full optimal parsing", at high CPU cost.
|
||||
|