minor edit of block format
clarifying parsing restrictions near end of block.
This commit is contained in:
parent
c67cc0e8dd
commit
bd92689798
@ -1,6 +1,6 @@
|
||||
LZ4 Block Format Description
|
||||
============================
|
||||
Last revised: 2015-05-07.
|
||||
Last revised: 2018-04-25.
|
||||
Author : Yann Collet
|
||||
|
||||
|
||||
@ -29,8 +29,8 @@ An LZ4 compressed block is composed of sequences.
|
||||
A sequence is a suite of literals (not-compressed bytes),
|
||||
followed by a match copy.
|
||||
|
||||
Each sequence starts with a token.
|
||||
The token is a one byte value, separated into two 4-bits fields.
|
||||
Each sequence starts with a `token`.
|
||||
The `token` is a one byte value, separated into two 4-bits fields.
|
||||
Therefore each field ranges from 0 to 15.
|
||||
|
||||
|
||||
@ -42,46 +42,46 @@ If it is 15, then we need to add some more bytes to indicate the full length.
|
||||
Each additional byte then represent a value from 0 to 255,
|
||||
which is added to the previous value to produce a total length.
|
||||
When the byte value is 255, another byte is output.
|
||||
There can be any number of bytes following the token. There is no "size limit".
|
||||
There can be any number of bytes following `token`. There is no "size limit".
|
||||
(Side note : this is why a not-compressible input block is expanded by 0.4%).
|
||||
|
||||
Example 1 : A length of 48 will be represented as :
|
||||
Example 1 : A literal length of 48 will be represented as :
|
||||
|
||||
- 15 : value for the 4-bits High field
|
||||
- 33 : (=48-15) remaining length to reach 48
|
||||
|
||||
Example 2 : A length of 280 will be represented as :
|
||||
Example 2 : A literal length of 280 will be represented as :
|
||||
|
||||
- 15 : value for the 4-bits High field
|
||||
- 255 : following byte is maxed, since 280-15 >= 255
|
||||
- 10 : (=280 - 15 - 255) ) remaining length to reach 280
|
||||
|
||||
Example 3 : A length of 15 will be represented as :
|
||||
Example 3 : A literal length of 15 will be represented as :
|
||||
|
||||
- 15 : value for the 4-bits High field
|
||||
- 0 : (=15-15) yes, the zero must be output
|
||||
|
||||
Following the token and optional length bytes, are the literals themselves.
|
||||
Following `token` and optional length bytes, are the literals themselves.
|
||||
They are exactly as numerous as previously decoded (length of literals).
|
||||
It's possible that there are zero literal.
|
||||
|
||||
|
||||
Following the literals is the match copy operation.
|
||||
|
||||
It starts by the offset.
|
||||
It starts by the `offset`.
|
||||
This is a 2 bytes value, in little endian format
|
||||
(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
|
||||
|
||||
The offset represents the position of the match to be copied from.
|
||||
The `offset` represents the position of the match to be copied from.
|
||||
1 means "current position - 1 byte".
|
||||
The maximum offset value is 65535, 65536 cannot be coded.
|
||||
The maximum `offset` value is 65535, 65536 cannot be coded.
|
||||
Note that 0 is an invalid value, not used.
|
||||
|
||||
Then we need to extract the match length.
|
||||
Then we need to extract the `matchlength`.
|
||||
For this, we use the second token field, the low 4-bits.
|
||||
Value, obviously, ranges from 0 to 15.
|
||||
However here, 0 means that the copy operation will be minimal.
|
||||
The minimum length of a match, called minmatch, is 4.
|
||||
The minimum length of a match, called `minmatch`, is 4.
|
||||
As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
|
||||
Similar to literal length, on reaching the highest possible value (15),
|
||||
we output additional bytes, one at a time, with values ranging from 0 to 255.
|
||||
@ -90,18 +90,18 @@ A 255 value means there is another byte to read and add.
|
||||
There is no limit to the number of optional bytes that can be output this way.
|
||||
(This points towards a maximum achievable compression ratio of about 250).
|
||||
|
||||
Decoding the matchlength reaches the end of current sequence.
|
||||
Decoding the `matchlength` reaches the end of current sequence.
|
||||
Next byte will be the start of another sequence.
|
||||
But before moving to next sequence,
|
||||
it's time to use the decoded match position and length.
|
||||
The decoder copies matchlength bytes from match position to current position.
|
||||
The decoder copies `matchlength` bytes from match position to current position.
|
||||
|
||||
In some cases, matchlength is larger than offset.
|
||||
Therefore, match pos + match length > current pos,
|
||||
In some cases, `matchlength` is larger than `offset`.
|
||||
Therefore, `match_pos + matchlength > current_pos`,
|
||||
which means that later bytes to copy are not yet decoded.
|
||||
This is called an "overlap match", and must be handled with special care.
|
||||
The most common case is an offset of 1,
|
||||
meaning the last byte is repeated matchlength times.
|
||||
A common case is an offset of 1,
|
||||
meaning the last byte is repeated `matchlength` times.
|
||||
|
||||
|
||||
Parsing restrictions
|
||||
@ -114,19 +114,23 @@ with assumptions made by the decoder :
|
||||
bytes) must be encoded as literals on behalf of the last sequence.
|
||||
The last sequence is incomplete, and stops right after the literals.
|
||||
2. The last match must start at least 12 bytes before end of block.
|
||||
Consequently, a block with less than 13 bytes cannot be compressed.
|
||||
The last match is part of the penultimate sequence,
|
||||
since the last sequence stops right after literals.
|
||||
Note that, as a consequence, blocks < 13 bytes cannot be compressed.
|
||||
|
||||
These rules are in place to ensure that the decoder
|
||||
will never read beyond the input buffer, nor write beyond the output buffer.
|
||||
can speculatively execute copy instructions
|
||||
without ever reading nor writing beyond provided I/O buffers.
|
||||
|
||||
1. To copy literals from a non-last sequence, an 8-byte copy instruction
|
||||
can always be safely issued (without reading past the input), because
|
||||
the literals are followed by a 2-byte offset, and the last sequence
|
||||
is at least 1+5 bytes long.
|
||||
2. TODO: explain the benefits of the second restriction.
|
||||
can always be safely issued (without reading past the input),
|
||||
because literals are followed by a 2-byte offset,
|
||||
and last sequence is at least 1+5 bytes long.
|
||||
2. Similarly, a match operation can speculatively copy up to 12 bytes
|
||||
while remaining within output buffer boundaries.
|
||||
|
||||
Empty inputs are either unrepresentable or can be represented with a null byte,
|
||||
which can be interpreted as a token without literals and without a match.
|
||||
Empty inputs can be represented with a zero byte,
|
||||
interpreted as a token without literals and without a match.
|
||||
|
||||
|
||||
Additional notes
|
||||
|
Loading…
Reference in New Issue
Block a user