89767cc280
Fixed : LZ4_uncompress_unknownOutputSize(), now protected against a malicious attack type, where a crafted input deliberately attempts to make it read outside of input buffer. Thanks Steinar H. Gunderson for report. git-svn-id: https://lz4.googlecode.com/svn/trunk@58 650e7d94-2a16-8b24-b05c-7c0b3f6821cd
122 lines
4.9 KiB
Plaintext
122 lines
4.9 KiB
Plaintext
LZ4 Format Description
|
|
Last revised: 2012-02-27
|
|
Author : Y. Collet
|
|
|
|
|
|
|
|
This small specification intents to provide enough information
|
|
to anyone willing to produce LZ4-compatible compressed streams
|
|
using any programming language.
|
|
|
|
LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
|
|
The most important design principle behind LZ4 is simplicity.
|
|
It helps to create an easy to read and maintain source code.
|
|
It also helps later on for optimisations, compactness, and speed.
|
|
There is no entropy encoder backend nor framing layer.
|
|
The latter is assumed to be handled by other parts of the system.
|
|
|
|
This document only describes the format,
|
|
not how the LZ4 compressor nor decompressor actually work.
|
|
The correctness of the decompressor should not depend
|
|
on implementation details of the compressor, and vice versa.
|
|
|
|
|
|
|
|
-- Compressed stream format --
|
|
|
|
An LZ4 compressed stream is composed of sequences.
|
|
Schematically, a sequence is a suite of literals, followed by a match copy.
|
|
|
|
Each sequence starts with a token.
|
|
The token is a one byte value, separated into two 4-bits fields.
|
|
Therefore each field ranges from 0 to 15.
|
|
|
|
|
|
The first field uses the 4 high-bits of the token.
|
|
It provides the length of literals to follow.
|
|
(Note : a literal is a not-compressed byte).
|
|
If the field value is 0, then there is no literal.
|
|
If it is 15, then we need to add some more bytes to indicate the full length.
|
|
Each additionnal byte then represent a value from 0 to 255,
|
|
which is added to the previous value to produce a total length.
|
|
When the byte value is 255, another byte is output.
|
|
There can be any number of bytes following the token. There is no "size limit".
|
|
(Sidenote this is why a not-compressible input stream is expanded by 0.4%).
|
|
|
|
Example 1 : A length of 48 will be represented as :
|
|
- 15 : value for the 4-bits High field
|
|
- 33 : (=48-15) remaining length to reach 48
|
|
|
|
Example 2 : A length of 280 will be represented as :
|
|
- 15 : value for the 4-bits High field
|
|
- 255 : following byte is maxed, since 280-15 >= 255
|
|
- 10 : (=280 - 15 - 255) ) remaining length to reach 280
|
|
|
|
Example 3 : A length of 15 will be represented as :
|
|
- 15 : value for the 4-bits High field
|
|
- 0 : (=15-15) yes, the zero must be output
|
|
|
|
Following the token and optional length bytes, are the literals themselves.
|
|
They are exactly as numerous as previously decoded (length of literals).
|
|
It's possible that there are zero literal.
|
|
|
|
|
|
Following the literals is the match copy operation.
|
|
|
|
It starts by the offset.
|
|
This is a 2 bytes value, in little endian format :
|
|
the lower byte is the first one in the stream.
|
|
|
|
The offset represents the position of the match to be copied from.
|
|
1 means "current position - 1 byte".
|
|
The maximum offset value is 65535, 65536 cannot be coded.
|
|
Note that 0 is an invalid value, not used.
|
|
|
|
Then we need to extract the match length.
|
|
For this, we use the second token field, the low 4-bits.
|
|
Value, obviously, ranges from 0 to 15.
|
|
However here, 0 means that the copy operation will be minimal.
|
|
The minimum length of a match, called minmatch, is 4.
|
|
As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
|
|
Similar to literal length, on reaching the highest possible value (15),
|
|
we output additional bytes, one at a time, with values ranging from 0 to 255.
|
|
They are added to total to provide the final match length.
|
|
A 255 value means there is another byte to read and add.
|
|
There is no limit to the number of optional bytes that can be output this way.
|
|
(This points towards a maximum achievable compression ratio of ~250).
|
|
|
|
With the offset and the matchlength,
|
|
the decoder can now proceed to copy the data from the already decoded buffer.
|
|
On decoding the matchlength, we reach the end of the compressed sequence,
|
|
and therefore start another one.
|
|
|
|
|
|
-- Parsing restrictions --
|
|
|
|
There are specific parsing rules to respect in order to remain compatible
|
|
with assumptions made by the decoder :
|
|
1) The last 5 bytes are always literals
|
|
2) The last match must start at least 12 bytes before end of stream
|
|
Consequently, a file with less than 13 bytes cannot be compressed.
|
|
These rules are in place to ensure that the decoder
|
|
will never read beyond the input buffer, nor write beyond the output buffer.
|
|
|
|
Note that the last sequence is also incomplete,
|
|
and stops right after literals.
|
|
|
|
|
|
-- Additional notes --
|
|
|
|
There is no assumption nor limits to the way the compressor
|
|
searches and selects matches within the source stream.
|
|
It could be a fast scan, a multi-probe, a full search using BST,
|
|
standard hash chains or MMC, well whatever.
|
|
|
|
Advanced parsing strategies can also be implemented, such as lazy match,
|
|
or full optimal parsing.
|
|
|
|
All these trade-off offer distinctive speed/memory/compression advantages.
|
|
Whatever the method used by the compressor, its result will be decodable
|
|
by any LZ4 decoder if it follows the format specification described above.
|
|
|