lz4/lz4_Block_format.md

LZ4 Block Format Description
============================
Last revised: 2015-03-26.
Author : Yann Collet


This small specification intents to provide enough information
to anyone willing to produce LZ4-compatible compressed data blocks
using any programming language.

LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
The most important design principle behind LZ4 is simplicity.
It helps to create an easy to read and maintain source code.
It also helps later on for optimizations, compactness, and speed.
There is no entropy encoder back-end nor framing layer.
The latter is assumed to be handled by other parts of the system.

This document only describes the block format,
not how the LZ4 compressor nor decompressor actually work.
The correctness of the decompressor should not depend
on implementation details of the compressor, and vice versa.


Compressed block format
-----------------------
An LZ4 compressed block is composed of sequences.
A sequence is a suite of literals (not-compressed bytes),
followed by a match copy.

Each sequence starts with a token.
The token is a one byte value, separated into two 4-bits fields.
Therefore each field ranges from 0 to 15.


The first field uses the 4 high-bits of the token.
It provides the length of literals to follow.

If the field value is 0, then there is no literal.
If it is 15, then we need to add some more bytes to indicate the full length.
Each additional byte then represent a value from 0 to 255,
which is added to the previous value to produce a total length.
When the byte value is 255, another byte is output.
There can be any number of bytes following the token. There is no "size limit".
(Side note : this is why a not-compressible input block is expanded by 0.4%).

Example 1 : A length of 48 will be represented as :
- 15 : value for the 4-bits High field
- 33 : (=48-15) remaining length to reach 48

Example 2 : A length of 280 will be represented as :
- 15  : value for the 4-bits High field
- 255 : following byte is maxed, since 280-15 >= 255
- 10  : (=280 - 15 - 255) ) remaining length to reach 280

Example 3 : A length of 15 will be represented as :
- 15 : value for the 4-bits High field
- 0  : (=15-15) yes, the zero must be output

Following the token and optional length bytes, are the literals themselves.
They are exactly as numerous as previously decoded (length of literals).
It's possible that there are zero literal.


Following the literals is the match copy operation.

It starts by the offset.
This is a 2 bytes value, in little endian format
(the 1st byte is the "low" byte, the 2nd one is the "high" byte).

The offset represents the position of the match to be copied from.
1 means "current position - 1 byte".
The maximum offset value is 65535, 65536 cannot be coded.
Note that 0 is an invalid value, not used. 

Then we need to extract the match length.
For this, we use the second token field, the low 4-bits.
Value, obviously, ranges from 0 to 15.
However here, 0 means that the copy operation will be minimal.
The minimum length of a match, called minmatch, is 4. 
As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
Similar to literal length, on reaching the highest possible value (15), 
we output additional bytes, one at a time, with values ranging from 0 to 255.
They are added to total to provide the final match length.
A 255 value means there is another byte to read and add.
There is no limit to the number of optional bytes that can be output this way.
(This points towards a maximum achievable compression ratio of ~250).

With the offset and the matchlength,
the decoder can now proceed to copy the data from the already decoded buffer.
On decoding the matchlength, we reach the end of the compressed sequence,
and therefore start another one.


Parsing restrictions
-----------------------
There are specific parsing rules to respect in order to remain compatible
with assumptions made by the decoder :

1. The last 5 bytes are always literals
2. The last match must start at least 12 bytes before end of block.
   
   Consequently, a block with less than 13 bytes cannot be compressed.

These rules are in place to ensure that the decoder
will never read beyond the input buffer, nor write beyond the output buffer.

Note that the last sequence is also incomplete,
and stops right after literals.


Additional notes
-----------------------
There is no assumption nor limits to the way the compressor
searches and selects matches within the source data block.
It could be a fast scan, a multi-probe, a full search using BST,
standard hash chains or MMC, well whatever.

Advanced parsing strategies can also be implemented, such as lazy match,
or full optimal parsing.

All these trade-off offer distinctive speed/memory/compression advantages.
Whatever the method used by the compressor, its result will be decodable
by any LZ4 decoder if it follows the format specification described above.
converted to markdown friendly syntax 2015-03-26 18:58:19 +00:00			`LZ4 Block Format Description`
			`============================`
Updated documentation 2015-03-30 17:32:21 +00:00			`Last revised: 2015-03-26.`
converted to markdown friendly syntax 2015-03-26 18:58:19 +00:00			`Author : Yann Collet`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00

			`This small specification intents to provide enough information`
			`to anyone willing to produce LZ4-compatible compressed data blocks`
			`using any programming language.`

			`LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.`
			`The most important design principle behind LZ4 is simplicity.`
			`It helps to create an easy to read and maintain source code.`
converted to markdown friendly syntax 2015-03-26 18:58:19 +00:00			`It also helps later on for optimizations, compactness, and speed.`
			`There is no entropy encoder back-end nor framing layer.`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00			`The latter is assumed to be handled by other parts of the system.`

converted to markdown friendly syntax 2015-03-26 18:58:19 +00:00			`This document only describes the block format,`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00			`not how the LZ4 compressor nor decompressor actually work.`
			`The correctness of the decompressor should not depend`
			`on implementation details of the compressor, and vice versa.`



converted to markdown friendly syntax 2015-03-26 18:58:19 +00:00			`Compressed block format`
			`-----------------------`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00			`An LZ4 compressed block is composed of sequences.`
Updated documentation 2015-03-30 17:32:21 +00:00			`A sequence is a suite of literals (not-compressed bytes),`
			`followed by a match copy.`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00
			`Each sequence starts with a token.`
			`The token is a one byte value, separated into two 4-bits fields.`
			`Therefore each field ranges from 0 to 15.`


			`The first field uses the 4 high-bits of the token.`
			`It provides the length of literals to follow.`
Updated documentation 2015-03-30 17:32:21 +00:00
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00			`If the field value is 0, then there is no literal.`
			`If it is 15, then we need to add some more bytes to indicate the full length.`
Updated documentation 2015-03-30 17:32:21 +00:00			`Each additional byte then represent a value from 0 to 255,`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00			`which is added to the previous value to produce a total length.`
			`When the byte value is 255, another byte is output.`
			`There can be any number of bytes following the token. There is no "size limit".`
Updated documentation 2015-03-30 17:32:21 +00:00			`(Side note : this is why a not-compressible input block is expanded by 0.4%).`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00
			`Example 1 : A length of 48 will be represented as :`
			`- 15 : value for the 4-bits High field`
			`- 33 : (=48-15) remaining length to reach 48`

			`Example 2 : A length of 280 will be represented as :`
			`- 15 : value for the 4-bits High field`
			`- 255 : following byte is maxed, since 280-15 >= 255`
			`- 10 : (=280 - 15 - 255) ) remaining length to reach 280`

			`Example 3 : A length of 15 will be represented as :`
			`- 15 : value for the 4-bits High field`
			`- 0 : (=15-15) yes, the zero must be output`

			`Following the token and optional length bytes, are the literals themselves.`
			`They are exactly as numerous as previously decoded (length of literals).`
			`It's possible that there are zero literal.`


			`Following the literals is the match copy operation.`

			`It starts by the offset.`
Updated documentation 2015-03-30 17:32:21 +00:00			`This is a 2 bytes value, in little endian format`
			`(the 1st byte is the "low" byte, the 2nd one is the "high" byte).`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00
			`The offset represents the position of the match to be copied from.`
			`1 means "current position - 1 byte".`
			`The maximum offset value is 65535, 65536 cannot be coded.`
			`Note that 0 is an invalid value, not used.`

			`Then we need to extract the match length.`
			`For this, we use the second token field, the low 4-bits.`
			`Value, obviously, ranges from 0 to 15.`
			`However here, 0 means that the copy operation will be minimal.`
			`The minimum length of a match, called minmatch, is 4.`
			`As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.`
			`Similar to literal length, on reaching the highest possible value (15),`
			`we output additional bytes, one at a time, with values ranging from 0 to 255.`
			`They are added to total to provide the final match length.`
			`A 255 value means there is another byte to read and add.`
			`There is no limit to the number of optional bytes that can be output this way.`
			`(This points towards a maximum achievable compression ratio of ~250).`

			`With the offset and the matchlength,`
			`the decoder can now proceed to copy the data from the already decoded buffer.`
			`On decoding the matchlength, we reach the end of the compressed sequence,`
			`and therefore start another one.`


converted to markdown friendly syntax 2015-03-26 18:58:19 +00:00			`Parsing restrictions`
			`-----------------------`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00			`There are specific parsing rules to respect in order to remain compatible`
			`with assumptions made by the decoder :`
Updated documentation 2015-03-30 17:32:21 +00:00
			`1. The last 5 bytes are always literals`
			`2. The last match must start at least 12 bytes before end of block.`

			`Consequently, a block with less than 13 bytes cannot be compressed.`

Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00			`These rules are in place to ensure that the decoder`
			`will never read beyond the input buffer, nor write beyond the output buffer.`

			`Note that the last sequence is also incomplete,`
			`and stops right after literals.`


converted to markdown friendly syntax 2015-03-26 18:58:19 +00:00			`Additional notes`
			`-----------------------`
Makefile : added capability to install libraries Modified Directory tree, to better separate libraries from programs. git-svn-id: https://lz4.googlecode.com/svn/trunk@111 650e7d94-2a16-8b24-b05c-7c0b3f6821cd 2014-01-07 18:47:50 +00:00			`There is no assumption nor limits to the way the compressor`
			`searches and selects matches within the source data block.`
			`It could be a fast scan, a multi-probe, a full search using BST,`
			`standard hash chains or MMC, well whatever.`

			`Advanced parsing strategies can also be implemented, such as lazy match,`
			`or full optimal parsing.`

			`All these trade-off offer distinctive speed/memory/compression advantages.`
			`Whatever the method used by the compressor, its result will be decodable`
			`by any LZ4 decoder if it follows the format specification described above.`