changed format of lz4 block doc

Yann Collet 2015-03-30 13:06:43 +01:00
parent d8bb7a5bc3
commit 7b463b63c3
2 changed files with 14 additions and 11 deletions

@@ -60,7 +60,7 @@ Benchmark evaluates the compression of reference [Silesia Corpus](http://sun.aei
 </tr>
 </table>
-The LZ4 block compression format is detailed within [lz4_block_format.txt](lz4_block_format.txt).
+The LZ4 block compression format is detailed within [lz4_Block_format](lz4_Block_format.md).
 For streaming unknown amount of data, and compress files of any size, a frame format has been published, and can be consulted within the file [LZ4_Frame_Format.html](LZ4_Frame_Format.html).

@@ -4,7 +4,6 @@ Last revised: 2015-03-26.
 Author : Yann Collet
 This small specification intents to provide enough information
 to anyone willing to produce LZ4-compatible compressed data blocks
 using any programming language.
@@ -26,7 +25,8 @@ on implementation details of the compressor, and vice versa.
 Compressed block format
 -----------------------
 An LZ4 compressed block is composed of sequences.
-Schematically, a sequence is a suite of literals, followed by a match copy.
+A sequence is a suite of literals (not-compressed bytes),
+followed by a match copy.
 Each sequence starts with a token.
 The token is a one byte value, separated into two 4-bits fields.
@@ -35,14 +35,14 @@ Therefore each field ranges from 0 to 15.
 The first field uses the 4 high-bits of the token.
 It provides the length of literals to follow.
-(Note : a literal is a not-compressed byte).
 If the field value is 0, then there is no literal.
 If it is 15, then we need to add some more bytes to indicate the full length.
-Each additionnal byte then represent a value from 0 to 255,
+Each additional byte then represent a value from 0 to 255,
 which is added to the previous value to produce a total length.
 When the byte value is 255, another byte is output.
 There can be any number of bytes following the token. There is no "size limit".
-(Sidenote this is why a not-compressible input block is expanded by 0.4%).
+(Side note : this is why a not-compressible input block is expanded by 0.4%).
 Example 1 : A length of 48 will be represented as :
 - 15 : value for the 4-bits High field
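
The literal-length rule in this hunk (4 high bits of the token, then extra bytes while the running field is saturated) can be sketched as follows. This is a minimal illustration and not code from the commit; the function names `read_literal_length` and `encode_literal_length` are hypothetical, and the input is assumed to be well-formed (no bounds checking).

```python
def read_literal_length(block, pos):
    """Decode the literal length of the sequence whose token is at block[pos].

    Returns (literal_length, position_of_first_byte_after_the_length_bytes).
    Assumes `block` is a well-formed bytes object (no bounds checks here).
    """
    token = block[pos]
    pos += 1
    length = token >> 4  # first field: the 4 high bits of the token
    if length == 15:
        # Field is saturated: keep adding bytes until one is below 255.
        while True:
            extra = block[pos]
            pos += 1
            length += extra
            if extra != 255:
                break
    return length, pos


def encode_literal_length(length):
    """Encode a literal length as (high_field_value, extra_bytes)."""
    if length < 15:
        return length, b""
    remaining = length - 15
    extra = bytearray()
    while remaining >= 255:
        extra.append(255)
        remaining -= 255
    extra.append(remaining)  # final byte is always below 255 (may be 0)
    return 15, bytes(extra)
```

Under these assumptions, a length of 48 encodes as a high field of 15 followed by one extra byte of 33, and a length of exactly 15 still needs one extra byte of 0 to terminate the run.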
@@ -65,7 +65,8 @@ It's possible that there are zero literal.
 Following the literals is the match copy operation.
 It starts by the offset.
-This is a 2 bytes value, in little endian format.
+This is a 2 bytes value, in little endian format
+(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
 The offset represents the position of the match to be copied from.
 1 means "current position - 1 byte".
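
The little-endian offset and its meaning as a backward distance can be illustrated with a short sketch. The helper names `read_offset` and `copy_match` are hypothetical, `match_length` is assumed to have been decoded already (its encoding is outside this hunk), and rejecting an offset of 0 is an assumption taken from the full block specification rather than from the lines shown here.

```python
def read_offset(block, pos):
    """Read the 2-byte little-endian offset stored at block[pos:pos + 2].

    The 1st byte is the low byte, the 2nd one is the high byte.
    Returns (offset, position_after_the_offset).
    """
    offset = block[pos] | (block[pos + 1] << 8)
    return offset, pos + 2


def copy_match(output, offset, match_length):
    """Append `match_length` bytes copied from `offset` bytes back in `output`.

    `output` is a bytearray holding everything decoded so far.
    An offset of 1 means "current position - 1 byte".
    """
    if offset == 0 or offset > len(output):
        # Assumption: 0 is treated as invalid, and a match cannot start
        # before the beginning of the data decoded so far.
        raise ValueError("invalid match offset")
    start = len(output) - offset
    # Copy byte by byte so that a match overlapping the bytes being
    # written (offset < match_length) repeats the pattern.
    for i in range(match_length):
        output.append(output[start + i])
```

Copying one byte at a time is slower than a slice copy, but it keeps the overlapping case correct without special handling, for example an offset of 1 repeating the previous byte.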
@@ -95,9 +96,12 @@ Parsing restrictions
 -----------------------
 There are specific parsing rules to respect in order to remain compatible
 with assumptions made by the decoder :
-1) The last 5 bytes are always literals
-2) The last match must start at least 12 bytes before end of block
-Consequently, a block with less than 13 bytes cannot be compressed.
+1. The last 5 bytes are always literals
+2. The last match must start at least 12 bytes before end of block.
+Consequently, a block with less than 13 bytes cannot be compressed.
 These rules are in place to ensure that the decoder
 will never read beyond the input buffer, nor write beyond the output buffer.
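
On the compressor side, the two parsing rules could be enforced with checks along these lines. This is a sketch only: the constant and function names are hypothetical, positions are byte indices into the uncompressed block, and reading "at least 12 bytes before end of block" as `match_start + 12 <= block_size` is an interpretation of the rule as worded in this hunk.

```python
LAST_LITERALS = 5          # the last 5 bytes are always literals
MATCH_START_MARGIN = 12    # last match starts at least 12 bytes before the end
MIN_COMPRESSIBLE_SIZE = 13 # smaller blocks cannot be compressed


def can_contain_match(block_size):
    """True if a block of this size is large enough to hold any match."""
    return block_size >= MIN_COMPRESSIBLE_SIZE


def match_start_allowed(match_start, block_size):
    """True if a match may begin at index `match_start` in the block."""
    return match_start + MATCH_START_MARGIN <= block_size


def clamp_match_length(match_start, match_length, block_size):
    """Shorten a match so that the last 5 bytes remain literals."""
    max_end = block_size - LAST_LITERALS
    return max(0, min(match_length, max_end - match_start))
```

With these definitions, a 13-byte block allows a match to start at index 1 at the latest, and any match is cut so that it ends no later than index 8.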
@@ -118,4 +122,3 @@ or full optimal parsing.
 All these trade-off offer distinctive speed/memory/compression advantages.
 Whatever the method used by the compressor, its result will be decodable
 by any LZ4 decoder if it follows the format specification described above.