Merge pull request #288 from inikep/Other

improved zstd_compression_format.md
This commit is contained in:
Yann Collet 2016-08-13 13:59:29 +02:00 committed by GitHub
commit d6ad8fed80


@ -60,8 +60,9 @@ explaining which parameter is unsupported.
Overall conventions
-----------
In this document:
- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
- a naming convention for identifiers is `Mixed_Case_With_Underscores`

Definitions
-----------
@ -106,7 +107,7 @@ The structure of a single Zstandard frame is following:
__`Magic_Number`__

4 Bytes, little-endian format.
Value : 0xFD2FB527

__`Frame_Header`__
@ -122,7 +123,7 @@ __`Content_Checksum`__
An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
The content checksum is the result
of [xxh64() hash function](http://www.xxhash.org)
digesting the original (decoded) data as input, and a seed of zero.
The low 4 bytes of the checksum are stored in little-endian format.
@ -214,11 +215,11 @@ __`Dictionary_ID_flag`__
This is a 2-bits flag (`= FHD & 3`),
telling if a dictionary ID is provided within the header.
It also specifies the size of this field as `Field_Size`.

|`Flag_Value`| 0   | 1   | 2   | 3   |
| ---------- | --- | --- | --- | --- |
|`Field_Size`| 0   | 1   | 2   | 4   |

### `Window_Descriptor`
@ -230,22 +231,22 @@ The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag
In this case, the maximum back-reference distance is the content size itself,
which can be any value from 1 to 2^64-1 bytes (16 EB).

| Bit numbers | 7-3        | 0-2        |
| ----------- | ---------- | ---------- |
| Field name  | `Exponent` | `Mantissa` |

Maximum distance is given by the following formulae :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
Window_Size = windowBase + windowAdd;
```
The minimum window size is 1 KB.
The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
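As a sanity check, the computation above can be sketched as follows (the function name is illustrative, not part of the format):

```python
def window_size(window_descriptor: int) -> int:
    """Compute Window_Size from the Window_Descriptor byte:
    Exponent in bits 7-3, Mantissa in bits 0-2."""
    exponent = window_descriptor >> 3      # 5 high bits
    mantissa = window_descriptor & 7       # 3 low bits
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add

# Descriptor 0x00 gives the 1 KB minimum window;
# descriptor 0xFF gives the 15*(1<<38) bytes maximum.
```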
To properly decode compressed data,
a decoder will need to allocate a buffer of at least `Window_Size` bytes.
To protect decoders from unreasonable memory requirements,
a decoder can refuse a compressed frame
@ -287,7 +288,7 @@ the following ranges are reserved for future use and should not be used :
This is the original (uncompressed) size. This information is optional.
The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
Format is little-endian.

| `Field_Size` | Range      |
| ------------ | ---------- |
@ -309,7 +310,7 @@ The structure of `Data_Block` is following:
|:------------:|:------------:|:------------:|:---------------:|
|    1 bit     |    2 bits    |   21 bits    |     n bytes     |

The block header (`Last_Block`, `Block_Type`, and `Block_Size`) uses 3 bytes.

__`Last_Block`__
@ -379,7 +380,7 @@ Skippable frames defined in this specification are compatible with [LZ4] ones.
__`Magic_Number`__

4 Bytes, little-endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.
@ -387,7 +388,7 @@ __`Frame_Size`__
This is the size, in bytes, of the following `User_Data`
(without including the magic number nor the size field itself).
This field is represented using 4 Bytes, little-endian format, unsigned 32-bits.
This means `User_Data` can't be bigger than (2^32-1) bytes.

__`User_Data`__
@ -408,7 +409,7 @@ A compressed block consists of 2 sections :
### Prerequisites

To decode a compressed block, the following elements are necessary :
- Previous decoded blocks, up to a distance of `Window_Size`,
  or all previous blocks when `Single_Segment_flag` is set.
- List of "recent offsets" from previous compressed block.
- Decoding tables of previous compressed block for each symbol type
@ -467,19 +468,19 @@ __`Size_Format`__
and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
- For `Raw_Block` and `RLE_Block` it's enough to decode `Regenerated_Size`.

For values spanning several bytes, convention is little-endian.

__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :

- Value : x0 : `Regenerated_Size` uses 5 bits (0-31).
  `Literals_Section_Header` has 1 byte.
  `Regenerated_Size = Header[0]>>3`
- Value : 01 : `Regenerated_Size` uses 12 bits (0-4095).
  `Literals_Section_Header` has 2 bytes.
  `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)`
- Value : 11 : `Regenerated_Size` uses 20 bits (0-1048575).
  `Literals_Section_Header` has 3 bytes.
  `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)`

Note : it's allowed to represent a short value (ex : `13`)
using a long format, accepting the reduced compactness.
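The three header layouts above can be sketched as a decoder helper. This is a sketch, assuming `Literals_Block_Type` occupies bits 0-1 and `Size_Format` bits 2-3 of `Header[0]`:

```python
def raw_literals_regenerated_size(header: bytes):
    """Decode Regenerated_Size for Raw/RLE literals blocks.
    Returns (Regenerated_Size, header size in bytes)."""
    size_format = (header[0] >> 2) & 3
    if size_format & 1 == 0:
        # x0 : 5 bits, 1-byte header
        return header[0] >> 3, 1
    if size_format == 1:
        # 01 : 12 bits, 2-byte header
        return (header[0] >> 4) + (header[1] << 4), 2
    # 11 : 20 bits, 3-byte header
    return (header[0] >> 4) + (header[1] << 4) + (header[2] << 12), 3
```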
@ -487,19 +488,19 @@ using a long format, accepting the reduced compacity.
__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :

- Value : 00 : _Single stream_.
  Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
  `Literals_Section_Header` has 3 bytes.
- Value : 01 : 4 streams.
  Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
  `Literals_Section_Header` has 3 bytes.
- Value : 10 : 4 streams.
  Both `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
  `Literals_Section_Header` has 4 bytes.
- Value : 11 : 4 streams.
  Both `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
  `Literals_Section_Header` has 5 bytes.

Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.

#### `Huffman_Tree_Description`
@ -526,53 +527,53 @@ This specification limits maximum code length to 11 bits.
##### Representation

All literal values from zero (included) to last present one (excluded)
are represented by `Weight` values, from 0 to `Max_Number_of_Bits`.
Transformation from `Weight` to `Number_of_Bits` follows this formula :
`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` .
The last symbol's `Weight` is deduced from previously decoded ones,
by completing to the nearest power of 2.
This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.

__Example__ :
Let's presume the following Huffman tree must be described :

| literal          | 0   | 1   | 2   | 3   | 4   | 5   |
| ---------------- | --- | --- | --- | --- | --- | --- |
| `Number_of_Bits` | 1   | 2   | 3   | 0   | 4   | 4   |

The tree depth is 4, since its smallest element uses 4 bits.
Value `5` will not be listed, nor will values above `5`.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is : `Weight = Number_of_Bits ? Max_Number_of_Bits + 1 - Number_of_Bits : 0`.
It gives the following series of weights :

| `Weight` | 4   | 3   | 2   | 0   | 1   |
| -------- | --- | --- | --- | --- | --- |
| literal  | 0   | 1   | 2   | 3   | 4   |

The decoder will do the inverse operation :
having collected weights of literals from `0` to `4`,
it knows the last literal, `5`, is present with a non-zero weight.
The weight of `5` can be deduced by completing to the nearest power of 2.
Sum of `2^(Weight-1)` (excluding 0) is :
`8 + 4 + 2 + 0 + 1 = 15`.
Nearest power of 2 is 16.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`.
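The deduction above can be sketched as follows (an illustrative helper, not part of the format):

```python
def complete_weights(weights):
    """Deduce Max_Number_of_Bits and the last symbol's Weight from the
    decoded weights, by completing the sum of 2^(Weight-1) to the
    nearest power of 2 above it."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    max_number_of_bits = total.bit_length()       # next power of 2 is 1 << this
    missing = (1 << max_number_of_bits) - total   # must itself be 2^(Weight-1)
    last_weight = missing.bit_length()
    return max_number_of_bits, last_weight

# Weights 4, 3, 2, 0, 1 : the sum is 15, completed to 16,
# giving Max_Number_of_Bits = 4 and a last Weight of 1.
```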
##### Huffman Tree header

This is a single byte value (0-255),
which tells how to decode the list of weights.

- if `headerByte` >= 128 : this is a direct representation,
  where each `Weight` is written directly as a 4 bits field (0-15).
  The full representation occupies `((Number_of_Symbols+1)/2)` bytes,
  meaning it uses a last full byte even if `Number_of_Symbols` is odd.
  `Number_of_Symbols = headerByte - 127`.
  Note that maximum `Number_of_Symbols` is 255-127 = 128.
  A larger series must necessarily use FSE compression.

- if `headerByte` < 128 :
  the series of weights is compressed by FSE.
  The length of the FSE-compressed series is `headerByte` (0-127).
@ -605,29 +606,29 @@ When both states have overflowed the bitstream, end is reached.
##### Conversion from weights to Huffman prefix codes

All present symbols shall now have a `Weight` value.
It is possible to transform weights into `Number_of_Bits`, using this formula :
`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` .
Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
Symbols with a `Weight` of zero are removed.
Then, starting from lowest weight, prefix codes are distributed in order.

__Example__ :
Let's presume the following list of weights has been decoded :

| Literal  | 0   | 1   | 2   | 3   | 4   | 5   |
| -------- | --- | --- | --- | --- | --- | --- |
| `Weight` | 4   | 3   | 2   | 0   | 1   | 1   |

Sorted by weight and then natural order,
it gives the following distribution :

| Literal          | 3   | 4   | 5   | 2   | 1   | 0    |
| ---------------- | --- | --- | --- | --- | --- | ---- |
| `Weight`         | 0   | 1   | 1   | 2   | 3   | 4    |
| `Number_of_Bits` | 0   | 4   | 4   | 3   | 2   | 1    |
| prefix codes     | N/A | 0000| 0001| 001 | 01  | 1    |
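The distribution above can be sketched as a canonical code assignment. This is a sketch, assuming the weight list is already completed (so the tree depth equals the largest weight):

```python
def huffman_prefix_codes(weights):
    """Assign prefix codes from weights: zero-weight symbols are dropped,
    the rest are sorted by weight (ties in natural order), and codes are
    distributed starting from the lowest weight.
    Returns {literal: code string}."""
    max_bits = max(weights)  # tree depth, once weights are completed
    order = sorted((s for s, w in enumerate(weights) if w > 0),
                   key=lambda s: (weights[s], s))
    codes, code, prev_bits = {}, 0, None
    for s in order:
        n_bits = max_bits + 1 - weights[s]
        if prev_bits is not None:
            # next code, truncated when moving to a shorter code length
            code = (code + 1) >> (prev_bits - n_bits)
        codes[s] = format(code, '0{}b'.format(n_bits))
        prev_bits = n_bits
    return codes

# The weights from the example reproduce the prefix-code table above.
```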
#### Literals bitstreams
@ -638,7 +639,7 @@ As seen in a previous paragraph,
there are 2 flavors of Huffman-compressed literals :
single stream, and 4-streams.
4-streams is useful for CPUs with multiple execution units and out-of-order operations.
Since each stream can be decoded independently,
it's possible to decode them up to 4x faster than a single stream,
presuming the CPU has enough parallelism available.
@ -653,11 +654,11 @@ Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.

Compressed size is provided explicitly : in the 4-streams variant,
bitstreams are preceded by 3 unsigned little-endian 16-bits values.
Each value represents the compressed size of one stream, in order.
The last stream size is deducted from total compressed size
and from previously decoded stream sizes :
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
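Both size rules can be sketched together (names are illustrative):

```python
def four_stream_sizes(total_size, total_csize, csize1, csize2, csize3):
    """Regenerated size per stream is (totalSize+3)/4, except the last
    stream, which completes totalSize; the last compressed size is
    deduced from the total minus the 6 size bytes and streams 1-3."""
    per_stream = (total_size + 3) // 4
    regenerated = [per_stream] * 3 + [total_size - 3 * per_stream]
    compressed = [csize1, csize2, csize3,
                  total_csize - 6 - csize1 - csize2 - csize3]
    return regenerated, compressed
```

For a `totalSize` that is not a multiple of 4, the last stream comes out up to 3 bytes smaller, as stated above.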
##### Bitstreams read and decode
@ -676,7 +677,7 @@ Starting from the end,
it's possible to read the bitstream in a little-endian fashion,
keeping track of already used bits.

Reading the last `Max_Number_of_Bits` bits,
it's then possible to compare extracted value to decoding table,
determining the symbol to decode and number of bits to discard.
@ -696,7 +697,7 @@ A match copy command specifies an offset and a length.
The offset gives the position to copy from,
which can be within a previous block.

There are 3 symbol types, literals lengths, offsets and match lengths,
which are encoded together, interleaved in a single _bitstream_.

Each symbol is a _code_ in its own context,
@ -708,126 +709,125 @@ The Sequences section starts by a header,
followed by optional Probability tables for each symbol type,
followed by the bitstream.

| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |

To decode the Sequence section, it's required to know its size.
This size is deducted from `blockSize - literalSectionSize`.

#### `Sequences_Section_Header`

Consists of 2 items :
- `Number_of_Sequences`
- Symbol compression modes

__`Number_of_Sequences`__

This is a variable size field using between 1 and 3 bytes.
Let's call its first byte `byte0`.
- `if (byte0 == 0)` : there are no sequences.
  The sequence section stops there.
  Regenerated content is defined entirely by literals section.
- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
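The three cases above can be sketched as:

```python
def number_of_sequences(data: bytes):
    """Decode the variable-size Number_of_Sequences field.
    Returns (Number_of_Sequences, bytes consumed)."""
    byte0 = data[0]
    if byte0 == 0:
        return 0, 1          # no sequences; the section stops here
    if byte0 < 128:
        return byte0, 1
    if byte0 < 255:
        return ((byte0 - 128) << 8) + data[1], 2
    return data[1] + (data[2] << 8) + 0x7F00, 3
```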
__Symbol compression modes__

This is a single byte, defining the compression mode of each symbol type.

|Bit number| 7-6                     | 5-4            | 3-2                  | 1-0        |
| -------- | ----------------------- | -------------- | -------------------- | ---------- |
|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |

The last field, `Reserved`, must be all-zeroes.

`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the compression mode of
literals lengths, offsets and match lengths respectively.
They follow the same enumeration :

| Value              | 0                 | 1          | 2                     | 3             |
| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |

- `Predefined_Mode` : uses a predefined distribution table.
- `RLE_Mode` : it's a single code, repeated `Number_of_Sequences` times.
- `Repeat_Mode` : re-use distribution table from previous compressed block.
- `FSE_Compressed_Mode` : standard FSE compression.
  A distribution table will be present.
  It will be described in [next part](#distribution-tables).
#### Symbols decoding

##### Literals Length codes

Literals length codes are values ranging from `0` to `35` included.
They define lengths from 0 to 131071 bytes.

| `Literals_Length_Code` | 0-15                   |
| ---------------------- | ---------------------- |
| length                 | `Literals_Length_Code` |
| `Number_of_Bits`       | 0                      |

| `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |

| `Literals_Length_Code` | 24   | 25   | 26   | 27   | 28   | 29   | 30   | 31   |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline               | 48   | 64   | 128  | 256  | 512  | 1024 | 2048 | 4096 |
| `Number_of_Bits`       | 4    | 6    | 7    | 8    | 9    | 10   | 11   | 12   |

| `Literals_Length_Code` | 32   | 33   | 34   | 35   |
| ---------------------- | ---- | ---- | ---- | ---- |
| Baseline               | 8192 |16384 |32768 |65536 |
| `Number_of_Bits`       | 13   | 14   | 15   | 16   |
__Default distribution__

When `Compression_Mode` is `Predefined_Mode`,
a predefined distribution is used for FSE compression.
Below is its definition. It uses an accuracy of 6 bits (64 states).
```
short literalsLength_defaultDistribution[36] =
        { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
          2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
         -1,-1,-1,-1 };
```
##### Match Length codes

Match length codes are values ranging from `0` to `52` included.
They define lengths from 3 to 131074 bytes.

| `Match_Length_Code` | 0-31                    |
| ------------------- | ----------------------- |
| value               | `Match_Length_Code` + 3 |
| `Number_of_Bits`    | 0                       |

| `Match_Length_Code` | 32   | 33   | 34   | 35   | 36   | 37   | 38   | 39   |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline            | 35   | 37   | 39   | 41   | 43   | 47   | 51   | 59   |
| `Number_of_Bits`    | 1    | 1    | 1    | 1    | 2    | 2    | 3    | 3    |

| `Match_Length_Code` | 40   | 41   | 42   | 43   | 44   | 45   | 46   | 47   |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline            | 67   | 83   | 99   | 131  | 258  | 514  | 1026 | 2050 |
| `Number_of_Bits`    | 4    | 4    | 5    | 7    | 8    | 9    | 10   | 11   |

| `Match_Length_Code` | 48   | 49   | 50   | 51   | 52   |
| ------------------- | ---- | ---- | ---- | ---- | ---- |
| Baseline            | 4098 | 8194 |16486 |32770 |65538 |
| `Number_of_Bits`    | 12   | 13   | 14   | 15   | 16   |
__Default distribution__

When `Compression_Mode` is defined as `Predefined_Mode`,
a predefined distribution is used for FSE compression.
Here is its definition. It uses an accuracy of 6 bits (64 states).
```
@ -840,31 +840,32 @@ short matchLengths_defaultDistribution[53] =
##### Offset codes

Offset codes are values ranging from `0` to `N`.

A decoder is free to limit its maximum `N` supported.
Recommendation is to support at least up to `22`.
For information, at the time of this writing,
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
An offset code is also the number of additional bits to read,
and can be translated into an `Offset_Value` using the following formula :
```
Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
if (Offset_Value > 3) offset = Offset_Value - 3;
```
It means that maximum `Offset_Value` is `2^(N+1)-1`, supporting back-reference distances up to `2^(N+1)-4`,
but is limited by the [maximum back-reference distance](#window_descriptor).

`Offset_Value` from 1 to 3 are special : they define "repeat codes",
which means one of the previous offsets will be repeated.
They are sorted in recency order, with 1 meaning the most recent one.
See [Repeat offsets](#repeat-offsets) paragraph.
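The translation above can be sketched as follows (`extra_bits` stands in for the result of `readNBits(offsetCode)`):

```python
def decode_offset(offset_code: int, extra_bits: int):
    """Translate an offset code into Offset_Value, per the formula above.
    Offset_Value 1-3 are repeat codes rather than literal distances."""
    offset_value = (1 << offset_code) + extra_bits
    if offset_value > 3:
        return ('offset', offset_value - 3)
    return ('repeat', offset_value)
```

With `extra_bits` at its maximum `2^offsetCode - 1`, `Offset_Value` indeed reaches `2^(N+1)-1`.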
__Default distribution__

When `Compression_Mode` is defined as `Predefined_Mode`,
a predefined distribution is used for FSE compression.
Here is its definition. It uses an accuracy of 5 bits (32 states),
and supports a maximum `N` of 28, allowing offset values up to 536,870,908.
@ -882,38 +883,38 @@ short offsetCodes_defaultDistribution[53] =
Following the header, up to 3 distribution tables can be described.
When present, they are in this order :
- Literals lengths
- Offsets
- Match Lengths

The content to decode depends on their respective encoding mode :
- `Predefined_Mode` : no content. Use predefined distribution table.
- `RLE_Mode` : 1 byte. This is the only code to use across the whole compressed block.
- `FSE_Compressed_Mode` : A distribution table is present.
- `Repeat_Mode` : no content. Re-use distribution from previous compressed block.
##### FSE distribution table : condensed format

An FSE distribution table describes the probabilities of all symbols
from `0` to the last present one (included)
on a normalized scale of `1 << Accuracy_Log`.
It's a bitstream which is read forward, in little-endian fashion.
It's not necessary to know its exact size,
since it will be discovered and reported by the decoding process.

The bitstream starts by reporting on which scale it operates.
`Accuracy_Log = low4bits + 5`.
Note that the maximum `Accuracy_Log` for literals and match lengths is `9`,
and for offsets it is `8`. Higher values are considered errors.
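As a sketch of this first step (the function name is illustrative; `low4bits` are the low 4 bits of the first byte of the bitstream, since it is read forward in little-endian fashion) :

```python
def read_accuracy_log(first_byte, max_accuracy_log):
    """Decode Accuracy_Log from the low 4 bits of the table's first byte.
    max_accuracy_log: 9 for literals/match lengths tables, 8 for offsets."""
    accuracy_log = (first_byte & 0x0F) + 5
    if accuracy_log > max_accuracy_log:
        raise ValueError("corrupted distribution: Accuracy_Log above maximum")
    return accuracy_log
```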
Then follows each symbol value, from `0` to the last present one.
The number of bits used by each field is variable.
It depends on :

- Remaining probabilities + 1 :
  __example__ :
  Presuming an `Accuracy_Log` of 8,
  and presuming 100 probability points have already been distributed,
  the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
  Therefore, it must read `log2sup(156) == 8` bits.
@@ -927,17 +928,17 @@ It depends on :
values from 99 to 156 use 8 bits.
This is achieved through this scheme :

| Value read | Value decoded | Number of bits used |
| ---------- | ------------- | ------------------- |
| 0 - 98     | 0 - 98        | 7                   |
| 99 - 127   | 99 - 127      | 8                   |
| 128 - 226  | 0 - 98        | 7                   |
| 227 - 255  | 128 - 156     | 8                   |
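The scheme above can be sketched in Python as follows (names are illustrative; `peek8` holds the next 8 bits of the stream, low bit first, without consuming them) :

```python
def read_probability_value(peek8, max_value):
    """Decode one probability field as (value_decoded, bits_consumed).
    max_value is the largest decodable value, i.e. remaining
    probabilities + 1 (156 in the example above)."""
    nb_bits = max_value.bit_length()          # log2sup(156) == 8
    small_mask = (1 << (nb_bits - 1)) - 1     # 127
    cutoff = (1 << nb_bits) - 1 - max_value   # 99 values fit in the short code
    small = peek8 & small_mask
    if small < cutoff:                        # reads 0-98 and 128-226: 7 bits
        return small, nb_bits - 1
    full = peek8 & ((1 << nb_bits) - 1)
    if full <= small_mask:                    # reads 99-127: 8 bits, as-is
        return full, nb_bits
    return full - cutoff, nb_bits             # reads 227-255: 8 bits, 128-156
```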
Symbol probabilities are read one by one, in order.
The probability is obtained from the decoded value by the following formula :
`Proba = value - 1`

It means value `0` becomes negative probability `-1`.
`-1` is a special probability, which means `less than 1`.
@@ -952,9 +953,9 @@ This repeat flag tells how many probabilities of zeroes follow the current one.
It provides a number ranging from 0 to 3.
If it is 3, another 2-bit repeat flag follows, and so on.
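Chaining these repeat flags can be sketched as below (the function name is illustrative; `next_2bits` stands for whatever bitstream reader is in use) :

```python
def count_zero_repeats(next_2bits):
    """Sum a chain of 2-bit repeat flags; a flag of 3 means another follows.
    next_2bits: callable returning the next 2-bit field from the stream."""
    total = 0
    while True:
        flag = next_2bits()
        total += flag
        if flag != 3:
            return total
```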
When the last symbol reaches a cumulated total of `1 << Accuracy_Log`,
decoding is complete.
If the last symbol makes the cumulated total go above `1 << Accuracy_Log`,
the distribution is considered corrupted.

Then the decoder can tell how many bytes were used in this process,
@@ -969,14 +970,14 @@ to create a unique decoding table.
It follows this build rule :

The table has a size of `tableSize = 1 << Accuracy_Log`.
Each cell describes the symbol decoded,
and instructions to get the next state.

Symbols are scanned in their natural order for `less than 1` probabilities.
Symbols with this probability are each attributed a single cell,
starting from the end of the table.
These symbols define a full state reset, reading `Accuracy_Log` bits.

All remaining symbols are sorted in their natural order.
Starting from symbol `0` and table position `0`,
@@ -1005,7 +1006,7 @@ It receives 5 state values. States are sorted in natural order.
Next power of 2 is 8.
Space of probabilities is divided into 8 equal parts.
Presuming the `Accuracy_Log` is 7, it defines 128 states.
Divided by 8, each share is 16 large.

In order to reach 8, 8-5=3 lowest states will count "double",
@@ -1014,13 +1015,13 @@ requiring one more bit in the process.
Numbering starts from the higher states, which use fewer bits.

| state order      | 0     | 1     | 2      | 3    | 4     |
| ---------------- | ----- | ----- | ------ | ---- | ----- |
| width            | 32    | 32    | 32     | 16   | 16    |
| `Number_of_Bits` | 5     | 5     | 5      | 4    | 4     |
| range number     | 2     | 4     | 6      | 0    | 1     |
| baseline         | 32    | 64    | 96     | 0    | 16    |
| range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |

The next state is determined from the current state
by reading the required number of bits, and adding the specified baseline.
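The worked example generalizes; here is a sketch (names illustrative) that reproduces the table above for a symbol of probability 5 at `Accuracy_Log` 7 :

```python
def symbol_states(prob, accuracy_log):
    """(Number_of_Bits, baseline) for each state of one symbol, in state
    order. A sketch of the rule above; assumes prob >= 1."""
    next_pow2 = 1 << (prob - 1).bit_length()                  # 8 for prob 5
    nb_double = next_pow2 - prob                              # 3 "double" states
    small_bits = accuracy_log - (next_pow2.bit_length() - 1)  # 4 bits here
    out = [None] * prob
    baseline = 0
    for i in range(prob):
        state = (nb_double + i) % prob    # numbering starts past the doubles
        bits = small_bits + (1 if state < nb_double else 0)
        out[state] = (bits, baseline)
        baseline += 1 << bits             # each state covers 1 << bits values
    return out
```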
@@ -1044,9 +1045,9 @@ The bitstream starts with initial state values,
each using the required number of bits in their respective _accuracy_,
decoded previously from their normalized distribution.

It starts with `Literals_Length_State`,
followed by `Offset_State`,
and finally `Match_Length_State`.

Reminder : always keep in mind that all values are read _backward_.
@@ -1056,9 +1057,9 @@ A state gives a code.
A code provides a baseline and a number of bits to add.
See the [Symbol Decoding] section for details on each symbol.

Decoding starts by reading the number of bits required to decode the offset.
It then does the same for the match length,
and then for the literals length.

Offset / matchLength / litLength define a sequence.
It starts by inserting the number of literals defined by `litLength`,
@@ -1066,11 +1067,11 @@ then continue by copying `matchLength` bytes from `currentPos - offset`.
The next operation is to update states.
Using rules pre-calculated in the decoding tables,
`Literals_Length_State` is updated,
followed by `Match_Length_State`,
and then `Offset_State`.

This operation will be repeated `Number_of_Sequences` times.
At the end, the bitstream shall be entirely consumed,
otherwise the bitstream is considered corrupted.
@@ -1081,7 +1082,7 @@ otherwise bitstream is considered corrupted.
As seen in [Offset Codes], the first 3 values define a repeated offset.
They are sorted in recency order, with 1 meaning "most recent one".

There is an exception though, when the current sequence's literals length is `0`.
In that case, repcodes are "pushed by one",
so 1 becomes 2, 2 becomes 3,
and 3 becomes "offset_1 - 1_byte".
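A sketch of this selection logic (function and variable names are illustrative; offset values above 3 decode as `value - 3`, per the Offset Codes section, and the repcode history update itself is omitted) :

```python
def resolve_offset(offset_value, rep, literals_length):
    """rep: the 3 repeated offsets, most recent first.
    Returns the actual offset for this sequence."""
    if offset_value > 3:
        return offset_value - 3          # not a repcode
    if literals_length == 0:             # repcodes are "pushed by one"
        if offset_value == 3:
            return rep[0] - 1            # "offset_1 - 1_byte"
        offset_value += 1
    return rep[offset_value - 1]
```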
@@ -1112,31 +1113,32 @@ But dictionaries created by `zstd --train` follow a format, described here.
__Pre-requisites__ : a dictionary has a known length,
defined either by a buffer limit, or a file size.

| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
| -------------- | --------------- | ---------------- | --------- |

__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, little-endian format

__`Dictionary_ID`__ : 4 bytes, stored in little-endian format.
`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
It's used by decoders to check if they use the correct dictionary.
_Reserved ranges :_
If the frame is going to be distributed in a private environment,
any `Dictionary_ID` can be used.
However, for public distribution of compressed frames,
the following ranges are reserved for future use and should not be used :
- low range : 1 - 32767
- high range : >= (2^31)
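A minimal sketch of reading these two header fields (the function name is illustrative; entropy tables and content are not parsed here) :

```python
import struct

ZSTD_DICT_MAGIC = 0xEC30A437

def read_dictionary_header(buf):
    """Return (Dictionary_ID, is_reserved) from a dictionary's first 8 bytes."""
    magic, dict_id = struct.unpack_from("<II", buf, 0)  # both little-endian
    if magic != ZSTD_DICT_MAGIC:
        raise ValueError("not a dictionary: bad Magic_Number")
    # ranges reserved for future use in publicly distributed frames
    is_reserved = (1 <= dict_id <= 32767) or dict_id >= 1 << 31
    return dict_id, is_reserved
```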
__`Entropy_Tables`__ : following the same format as [compressed blocks].
They are stored in the following order :
Huffman tables for literals, FSE table for offset,
FSE table for matchLength, and FSE table for litLength.
It's finally followed by 3 offset values, populating recent offsets,
stored in order, 4-bytes little-endian each, for a total of 12 bytes.
__`Content`__ : Where the actual dictionary content is.
Content size depends on Dictionary size.

[compressed blocks]: #the-format-of-compressed_block