Merge pull request #288 from inikep/Other
improved zstd_compression_format.md
commit
d6ad8fed80
@ -60,8 +60,9 @@ explaining which parameter is unsupported.
|
|||||||
|
|
||||||
Overall conventions
|
Overall conventions
|
||||||
-----------
|
-----------
|
||||||
In this document square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
|
In this document:
|
||||||
|
- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
|
||||||
|
- a naming convention for identifiers is `Mixed_Case_With_Underscores`
|
||||||
|
|
||||||
Definitions
|
Definitions
|
||||||
-----------
|
-----------
|
||||||
@ -106,7 +107,7 @@ The structure of a single Zstandard frame is following:
|
|||||||
|
|
||||||
__`Magic_Number`__
|
__`Magic_Number`__
|
||||||
|
|
||||||
4 Bytes, Little-endian format.
|
4 Bytes, little-endian format.
|
||||||
Value : 0xFD2FB527
|
Value : 0xFD2FB527
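As an illustration of the little-endian convention, here is a minimal Python sketch that checks whether a buffer begins with the magic number quoted above. The function name `starts_with_zstd_magic` is hypothetical and not part of the specification.

```python
ZSTD_MAGIC_NUMBER = 0xFD2FB527  # value quoted in this document


def starts_with_zstd_magic(data: bytes) -> bool:
    """Return True if the first 4 bytes, read little-endian, equal the magic number."""
    if len(data) < 4:
        return False
    return int.from_bytes(data[:4], "little") == ZSTD_MAGIC_NUMBER
```

Because the field is little-endian, the least significant byte `0x27` comes first in the stream.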
|
||||||
|
|
||||||
__`Frame_Header`__
|
__`Frame_Header`__
|
||||||
@ -122,7 +123,7 @@ __`Content_Checksum`__
|
|||||||
|
|
||||||
An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
|
An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
|
||||||
The content checksum is the result
|
The content checksum is the result
|
||||||
of [xxh64() hash function](https://www.xxHash.com)
|
of [xxh64() hash function](http://www.xxhash.org)
|
||||||
digesting the original (decoded) data as input, and a seed of zero.
|
digesting the original (decoded) data as input, and a seed of zero.
|
||||||
The low 4 bytes of the checksum are stored in little endian format.
|
The low 4 bytes of the checksum are stored in little endian format.
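The storage step can be sketched as follows, assuming the 64-bit xxh64 digest has already been computed elsewhere and is available as an integer. The helper name `stored_content_checksum` is hypothetical; computing the hash itself is outside the scope of this sketch.

```python
def stored_content_checksum(xxh64_digest: int) -> bytes:
    """Keep the low 32 bits of a 64-bit digest and serialize them little-endian."""
    low_32_bits = xxh64_digest & 0xFFFFFFFF
    return low_32_bits.to_bytes(4, "little")
```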
|
||||||
|
|
||||||
@ -214,11 +215,11 @@ __`Dictionary_ID_flag`__
|
|||||||
|
|
||||||
This is a 2-bits flag (`= FHD & 3`),
|
This is a 2-bits flag (`= FHD & 3`),
|
||||||
telling if a dictionary ID is provided within the header.
|
telling if a dictionary ID is provided within the header.
|
||||||
It also specifies the size of this field.
|
It also specifies the size of this field as `Field_Size`.
|
||||||
|
|
||||||
| Value | 0 | 1 | 2 | 3 |
|
|`Flag_Value`| 0 | 1 | 2 | 3 |
|
||||||
| -------- | --- | --- | --- | --- |
|
| ---------- | --- | --- | --- | --- |
|
||||||
|Field size| 0 | 1 | 2 | 4 |
|
|`Field_Size`| 0 | 1 | 2 | 4 |
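The table above maps directly onto a lookup. A minimal sketch, where `fhd` stands for the frame header descriptor byte and the function name is hypothetical:

```python
# Index is Flag_Value (0..3), entry is Field_Size in bytes.
DICTIONARY_ID_FIELD_SIZES = (0, 1, 2, 4)


def dictionary_id_field_size(fhd: int) -> int:
    """Return the size in bytes of the dictionary ID field for a given FHD byte."""
    flag_value = fhd & 3  # the two lowest bits of FHD
    return DICTIONARY_ID_FIELD_SIZES[flag_value]
```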
|
||||||
|
|
||||||
### `Window_Descriptor`
|
### `Window_Descriptor`
|
||||||
|
|
||||||
@ -230,22 +231,22 @@ The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag
|
|||||||
In this case, the maximum back-reference distance is the content size itself,
|
In this case, the maximum back-reference distance is the content size itself,
|
||||||
which can be any value from 1 to 2^64-1 bytes (16 EB).
|
which can be any value from 1 to 2^64-1 bytes (16 EB).
|
||||||
|
|
||||||
| Bit numbers | 7-3 | 0-2 |
|
| Bit numbers | 7-3 | 0-2 |
|
||||||
| ----------- | -------- | -------- |
|
| ----------- | ---------- | ---------- |
|
||||||
| Field name | Exponent | Mantissa |
|
| Field name | `Exponent` | `Mantissa` |
|
||||||
|
|
||||||
Maximum distance is given by the following formulae :
|
Maximum distance is given by the following formulae :
|
||||||
```
|
```
|
||||||
windowLog = 10 + Exponent;
|
windowLog = 10 + Exponent;
|
||||||
windowBase = 1 << windowLog;
|
windowBase = 1 << windowLog;
|
||||||
windowAdd = (windowBase / 8) * Mantissa;
|
windowAdd = (windowBase / 8) * Mantissa;
|
||||||
windowSize = windowBase + windowAdd;
|
Window_Size = windowBase + windowAdd;
|
||||||
```
|
```
|
||||||
The minimum window size is 1 KB.
|
The minimum window size is 1 KB.
|
||||||
The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
|
The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
|
||||||
|
|
||||||
To properly decode compressed data,
|
To properly decode compressed data,
|
||||||
a decoder will need to allocate a buffer of at least `windowSize` bytes.
|
a decoder will need to allocate a buffer of at least `Window_Size` bytes.
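The formulas above translate into the following Python sketch. The function name `window_size` is hypothetical; the bit layout (`Exponent` in bits 7-3, `Mantissa` in bits 2-0) is taken from the table.

```python
def window_size(window_descriptor: int) -> int:
    """Compute Window_Size in bytes from the Window_Descriptor byte."""
    if not 0 <= window_descriptor <= 0xFF:
        raise ValueError("Window_Descriptor must fit in one byte")
    exponent = window_descriptor >> 3   # bits 7-3
    mantissa = window_descriptor & 7    # bits 2-0
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add
```

A descriptor of `0x00` gives the 1 KB minimum, and `0xFF` gives the `15*(1<<38)` maximum stated above.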
|
||||||
|
|
||||||
In order to preserve decoder from unreasonable memory requirements,
|
In order to preserve decoder from unreasonable memory requirements,
|
||||||
a decoder can refuse a compressed frame
|
a decoder can refuse a compressed frame
|
||||||
@ -287,7 +288,7 @@ the following ranges are reserved for future use and should not be used :
|
|||||||
This is the original (uncompressed) size. This information is optional.
|
This is the original (uncompressed) size. This information is optional.
|
||||||
The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
|
The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
|
||||||
The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
|
The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
|
||||||
Format is Little-endian.
|
Format is little-endian.
|
||||||
|
|
||||||
| `Field_Size` | Range |
|
| `Field_Size` | Range |
|
||||||
| ------------ | ---------- |
|
| ------------ | ---------- |
|
||||||
@ -309,7 +310,7 @@ The structure of `Data_Block` is following:
|
|||||||
|:------------:|:------------:|:------------:|:---------------:|
|
|:------------:|:------------:|:------------:|:---------------:|
|
||||||
| 1 bit | 2 bits | 21 bits | n bytes |
|
| 1 bit | 2 bits | 21 bits | n bytes |
|
||||||
|
|
||||||
The block header uses 3-bytes.
|
The block header (`Last_Block`, `Block_Type`, and `Block_Size`) uses 3-bytes.
|
||||||
|
|
||||||
__`Last_Block`__
|
__`Last_Block`__
|
||||||
|
|
||||||
@ -379,7 +380,7 @@ Skippable frames defined in this specification are compatible with [LZ4] ones.
|
|||||||
|
|
||||||
__`Magic_Number`__
|
__`Magic_Number`__
|
||||||
|
|
||||||
4 Bytes, Little-endian format.
|
4 Bytes, little-endian format.
|
||||||
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
|
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
|
||||||
All 16 values are valid to identify a skippable frame.
|
All 16 values are valid to identify a skippable frame.
|
||||||
|
|
||||||
@ -387,7 +388,7 @@ __`Frame_Size`__
|
|||||||
|
|
||||||
This is the size, in bytes, of the following `User_Data`
|
This is the size, in bytes, of the following `User_Data`
|
||||||
(without including the magic number nor the size field itself).
|
(without including the magic number nor the size field itself).
|
||||||
This field is represented using 4 Bytes, Little-endian format, unsigned 32-bits.
|
This field is represented using 4 Bytes, little-endian format, unsigned 32-bits.
|
||||||
This means `User_Data` can’t be bigger than (2^32-1) bytes.
|
This means `User_Data` can’t be bigger than (2^32-1) bytes.
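A decoder that only wants to skip such a frame needs the magic number test and the `Frame_Size` field. A minimal sketch, with a hypothetical helper name:

```python
def skippable_frame_size(data: bytes):
    """Return Frame_Size if data starts with a skippable frame header, else None."""
    if len(data) < 8:
        return None
    magic = int.from_bytes(data[0:4], "little")
    # Any value from 0x184D2A50 to 0x184D2A5F identifies a skippable frame.
    if (magic & 0xFFFFFFF0) != 0x184D2A50:
        return None
    # Unsigned 32-bit little-endian size of the User_Data that follows.
    return int.from_bytes(data[4:8], "little")
```

The whole frame then occupies `8 + Frame_Size` bytes: 4 for the magic number, 4 for the size field, and the user data.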
|
||||||
|
|
||||||
__`User_Data`__
|
__`User_Data`__
|
||||||
@ -408,7 +409,7 @@ A compressed block consists of 2 sections :
|
|||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
To decode a compressed block, the following elements are necessary :
|
To decode a compressed block, the following elements are necessary :
|
||||||
- Previous decoded blocks, up to a distance of `windowSize`,
|
- Previous decoded blocks, up to a distance of `Window_Size`,
|
||||||
or all previous blocks when `Single_Segment_flag` is set.
|
or all previous blocks when `Single_Segment_flag` is set.
|
||||||
- List of "recent offsets" from previous compressed block.
|
- List of "recent offsets" from previous compressed block.
|
||||||
- Decoding tables of previous compressed block for each symbol type
|
- Decoding tables of previous compressed block for each symbol type
|
||||||
@ -467,19 +468,19 @@ __`Size_Format`__
|
|||||||
and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
|
and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
|
||||||
- For `Raw_Block` and `RLE_Block` it's enough to decode `Regenerated_Size`.
|
- For `Raw_Block` and `RLE_Block` it's enough to decode `Regenerated_Size`.
|
||||||
|
|
||||||
For values spanning several bytes, convention is Little-endian.
|
For values spanning several bytes, convention is little-endian.
|
||||||
|
|
||||||
__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
|
__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
|
||||||
|
|
||||||
- Value : x0 : `Regenerated_Size` uses 5 bits (0-31).
|
- Value : x0 : `Regenerated_Size` uses 5 bits (0-31).
|
||||||
Total literal header size is 1 byte.
|
`Literals_Section_Header` has 1 byte.
|
||||||
`size = h[0]>>3;`
|
`Regenerated_Size = Header[0]>>3`
|
||||||
- Value : 01 : `Regenerated_Size` uses 12 bits (0-4095).
|
- Value : 01 : `Regenerated_Size` uses 12 bits (0-4095).
|
||||||
Total literal header size is 2 bytes.
|
`Literals_Section_Header` has 2 bytes.
|
||||||
`size = (h[0]>>4) + (h[1]<<4);`
|
`Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)`
|
||||||
- Value : 11 : `Regenerated_Size` uses 20 bits (0-1048575).
|
- Value : 11 : `Regenerated_Size` uses 20 bits (0-1048575).
|
||||||
Total literal header size is 3 bytes.
|
`Literals_Section_Header` has 3 bytes.
|
||||||
`size = (h[0]>>4) + (h[1]<<4) + (h[2]<<12);`
|
`Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)`
|
||||||
|
|
||||||
Note : it's allowed to represent a short value (ex : `13`)
|
Note : it's allowed to represent a short value (ex : `13`)
|
||||||
using a long format, at the cost of a less compact encoding.
|
using a long format, at the cost of a less compact encoding.
|
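The three `Regenerated_Size` formulas can be sketched as below. This is an illustration only: `size_format` is passed in as an already extracted 2-bit value, and the reading of `x0` as "lowest bit is zero" is an assumption of this sketch. The function name is hypothetical.

```python
def raw_rle_regenerated_size(header: bytes, size_format: int):
    """Return (Regenerated_Size, Literals_Section_Header length in bytes)."""
    if not 0 <= size_format <= 3:
        raise ValueError("Size_Format is a 2-bit value")
    if size_format & 1 == 0:
        # "x0": 5-bit size, 1-byte header (assumes x0 means the low bit is 0).
        return header[0] >> 3, 1
    if size_format == 0b01:
        # 12-bit size, 2-byte header.
        return (header[0] >> 4) + (header[1] << 4), 2
    # 0b11: 20-bit size, 3-byte header.
    return (header[0] >> 4) + (header[1] << 4) + (header[2] << 12), 3
```

For example, the value `13` can be carried by the 1-byte form (`0x68 >> 3`) or, less compactly, by the 2-byte form (`0xD0 >> 4`).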
||||||
@ -487,19 +488,19 @@ using a long format, accepting the reduced compacity.
|
|||||||
__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :
|
__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :
|
||||||
|
|
||||||
- Value : 00 : _Single stream_.
|
- Value : 00 : _Single stream_.
|
||||||
`Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
|
Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
|
||||||
Total literal header size is 3 bytes.
|
`Literals_Section_Header` has 3 bytes.
|
||||||
- Value : 01 : 4 streams.
|
- Value : 01 : 4 streams.
|
||||||
`Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
|
Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
|
||||||
Total literal header size is 3 bytes.
|
`Literals_Section_Header` has 3 bytes.
|
||||||
- Value : 10 : 4 streams.
|
- Value : 10 : 4 streams.
|
||||||
`Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
|
Both `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
|
||||||
Total literal header size is 4 bytes.
|
`Literals_Section_Header` has 4 bytes.
|
||||||
- Value : 11 : 4 streams.
|
- Value : 11 : 4 streams.
|
||||||
`Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
|
Both `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
|
||||||
Total literal header size is 5 bytes.
|
`Literals_Section_Header` has 5 bytes.
|
||||||
|
|
||||||
`Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
|
Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
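The four cases can be summarized as a lookup table. The sketch below only records what the list states (stream count, field width, header length); it does not decode the fields themselves, and its names are hypothetical.

```python
# Size_Format -> (number of streams, bits per size field, header length in bytes)
COMPRESSED_LITERALS_LAYOUT = {
    0b00: (1, 10, 3),
    0b01: (4, 10, 3),
    0b10: (4, 14, 4),
    0b11: (4, 18, 5),
}


def max_literals_size(size_format: int) -> int:
    """Largest value representable by Compressed_Size or Regenerated_Size."""
    _streams, bits, _header_bytes = COMPRESSED_LITERALS_LAYOUT[size_format]
    return (1 << bits) - 1
```

As a sanity check, two size fields plus 4 leading bits (assumed here to hold the block type and `Size_Format`) fill each header exactly: 4 + 2*10 = 24 bits = 3 bytes, 4 + 2*14 = 32 bits = 4 bytes, and 4 + 2*18 = 40 bits = 5 bytes.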
|
||||||
|
|
||||||
|
|
||||||
#### `Huffman_Tree_Description`
|
#### `Huffman_Tree_Description`
|
||||||
@ -526,53 +527,53 @@ This specification limits maximum code length to 11 bits.
|
|||||||
##### Representation
|
##### Representation
|
||||||
|
|
||||||
All literal values from zero (included) to last present one (excluded)
|
All literal values from zero (included) to last present one (excluded)
|
||||||
are represented by `weight` values, from 0 to `maxBits`.
|
are represented by `Weight` values, from 0 to `Max_Number_of_Bits`.
|
||||||
Transformation from `weight` to `nbBits` follows this formula :
|
Transformation from `Weight` to `Number_of_Bits` follows this formula :
|
||||||
`nbBits = weight ? maxBits + 1 - weight : 0;` .
|
`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` .
|
||||||
The last symbol's weight is deduced from previously decoded ones,
|
The last symbol's `Weight` is deduced from previously decoded ones,
|
||||||
by completing to the nearest power of 2.
|
by completing to the nearest power of 2.
|
||||||
This power of 2 gives `maxBits`, the depth of the current tree.
|
This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
|
||||||
|
|
||||||
__Example__ :
|
__Example__ :
|
||||||
Let's presume the following Huffman tree must be described :
|
Let's presume the following Huffman tree must be described :
|
||||||
|
|
||||||
| literal | 0 | 1 | 2 | 3 | 4 | 5 |
|
| literal | 0 | 1 | 2 | 3 | 4 | 5 |
|
||||||
| ------- | --- | --- | --- | --- | --- | --- |
|
| ---------------- | --- | --- | --- | --- | --- | --- |
|
||||||
| nbBits | 1 | 2 | 3 | 0 | 4 | 4 |
|
| `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 |
|
||||||
|
|
||||||
The tree depth is 4, since its longest element uses 4 bits.
|
The tree depth is 4, since its longest element uses 4 bits.
|
||||||
Value `5` will not be listed, nor will values above `5`.
|
Value `5` will not be listed, nor will values above `5`.
|
||||||
Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
|
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
|
||||||
Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
|
Weight formula is : `Weight = Number_of_Bits ? Max_Number_of_Bits + 1 - Number_of_Bits : 0`.
|
||||||
It gives the following series of weights :
|
It gives the following series of weights :
|
||||||
|
|
||||||
| weights | 4 | 3 | 2 | 0 | 1 |
|
| `Weight` | 4 | 3 | 2 | 0 | 1 |
|
||||||
| ------- | --- | --- | --- | --- | --- |
|
| -------- | --- | --- | --- | --- | --- |
|
||||||
| literal | 0 | 1 | 2 | 3 | 4 |
|
| literal | 0 | 1 | 2 | 3 | 4 |
|
||||||
|
|
||||||
The decoder will do the inverse operation :
|
The decoder will do the inverse operation :
|
||||||
having collected weights of literals from `0` to `4`,
|
having collected weights of literals from `0` to `4`,
|
||||||
it knows the last literal, `5`, is present with a non-zero weight.
|
it knows the last literal, `5`, is present with a non-zero weight.
|
||||||
The weight of `5` can be deduced by completing to the nearest power of 2.
|
The weight of `5` can be deduced by completing to the nearest power of 2.
|
||||||
Sum of 2^(weight-1) (excluding 0) is :
|
Sum of `2^(Weight-1)` (excluding 0) is :
|
||||||
`8 + 4 + 2 + 0 + 1 = 15`
|
`8 + 4 + 2 + 0 + 1 = 15`.
|
||||||
Nearest power of 2 is 16.
|
Nearest power of 2 is 16.
|
||||||
Therefore, `maxBits = 4` and `weight[5] = 1`.
|
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`.
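The example above can be reproduced with a short Python sketch. The function names are hypothetical, and "nearest power of 2" is interpreted here as the smallest power of 2 strictly greater than the sum, so that the last symbol always receives a non-zero weight.

```python
def number_of_bits(weight: int, max_number_of_bits: int) -> int:
    """Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0"""
    return max_number_of_bits + 1 - weight if weight else 0


def deduce_last_weight(weights: list[int]) -> tuple[int, int]:
    """Return (Max_Number_of_Bits, Weight of the last, unlisted symbol)."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    if total == 0:
        raise ValueError("at least one listed weight must be non-zero")
    # 2**bit_length() is the smallest power of 2 strictly greater than total.
    max_number_of_bits = total.bit_length()
    remainder = (1 << max_number_of_bits) - total
    if remainder & (remainder - 1):
        raise ValueError("remainder is not a power of 2: invalid weights")
    # remainder == 2**(Weight - 1), hence Weight == log2(remainder) + 1.
    return max_number_of_bits, remainder.bit_length()
```

With the listed weights `[4, 3, 2, 0, 1]`, the sum is 15, the completed total is 16, and the sketch returns `(4, 1)`, matching the text.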
|
||||||
|
|
||||||
##### Huffman Tree header
|
##### Huffman Tree header
|
||||||
|
|
||||||
This is a single byte value (0-255),
|
This is a single byte value (0-255),
|
||||||
which tells how to decode the list of weights.
|
which tells how to decode the list of weights.
|
||||||
|
|
||||||
- if headerByte >= 128 : this is a direct representation,
|
- if `headerByte` >= 128 : this is a direct representation,
|
||||||
where each weight is written directly as a 4 bits field (0-15).
|
where each `Weight` is written directly as a 4 bits field (0-15).
|
||||||
The full representation occupies `((nbSymbols+1)/2)` bytes,
|
The full representation occupies `((Number_of_Symbols+1)/2)` bytes,
|
||||||
meaning it uses a last full byte even if nbSymbols is odd.
|
meaning it uses a last full byte even if `Number_of_Symbols` is odd.
|
||||||
`nbSymbols = headerByte - 127;`.
|
`Number_of_Symbols = headerByte - 127`.
|
||||||
Note that maximum nbSymbols is 255-127 = 128.
|
Note that maximum `Number_of_Symbols` is 255-127 = 128.
|
||||||
A larger series must necessarily use FSE compression.
|
A larger series must necessarily use FSE compression.
|
||||||
|
|
||||||
- if headerByte < 128 :
|
- if `headerByte` < 128 :
|
||||||
the series of weights is compressed by FSE.
|
the series of weights is compressed by FSE.
|
||||||
The length of the FSE-compressed series is `headerByte` (0-127).
|
The length of the FSE-compressed series is `headerByte` (0-127).
|
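The interpretation of `headerByte` can be sketched as follows. The function name and the returned tuples are hypothetical conveniences; only the arithmetic comes from the text above.

```python
def describe_huffman_header(header_byte: int):
    """Classify the Huffman tree header byte.

    Returns ("direct", Number_of_Symbols, bytes used by the 4-bit weights)
    or ("fse", length in bytes of the FSE-compressed series).
    """
    if not 0 <= header_byte <= 255:
        raise ValueError("headerByte must fit in one byte")
    if header_byte >= 128:
        number_of_symbols = header_byte - 127
        # Two 4-bit weights per byte, rounded up to a full last byte.
        return ("direct", number_of_symbols, (number_of_symbols + 1) // 2)
    return ("fse", header_byte)
```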
||||||
|
|
||||||
@ -605,29 +606,29 @@ When both states have overflowed the bitstream, end is reached.
|
|||||||
|
|
||||||
##### Conversion from weights to Huffman prefix codes
|
##### Conversion from weights to Huffman prefix codes
|
||||||
|
|
||||||
All present symbols shall now have a `weight` value.
|
All present symbols shall now have a `Weight` value.
|
||||||
It is possible to transform weights into nbBits, using this formula :
|
It is possible to transform weights into Number_of_Bits, using this formula :
|
||||||
`nbBits = weight ? maxBits + 1 - weight : 0;` .
|
`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` .
|
||||||
|
|
||||||
Symbols are sorted by weight. Within same weight, symbols keep natural order.
|
Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
|
||||||
Symbols with a weight of zero are removed.
|
Symbols with a `Weight` of zero are removed.
|
||||||
Then, starting from lowest weight, prefix codes are distributed in order.
|
Then, starting from lowest weight, prefix codes are distributed in order.
|
||||||
|
|
||||||
__Example__ :
|
__Example__ :
|
||||||
Let's presume the following list of weights has been decoded :
|
Let's presume the following list of weights has been decoded :
|
||||||
|
|
||||||
| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
|
| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
|
||||||
| ------- | --- | --- | --- | --- | --- | --- |
|
| -------- | --- | --- | --- | --- | --- | --- |
|
||||||
| weight | 4 | 3 | 2 | 0 | 1 | 1 |
|
| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |
|
||||||
|
|
||||||
Sorted by weight and then natural order,
|
Sorted by weight and then natural order,
|
||||||
it gives the following distribution :
|
it gives the following distribution :
|
||||||
|
|
||||||
| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
|
| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
|
||||||
| ------------ | --- | --- | --- | --- | --- | ---- |
|
| ---------------- | --- | --- | --- | --- | --- | ---- |
|
||||||
| weight | 0 | 1 | 1 | 2 | 3 | 4 |
|
| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 |
|
||||||
| nb bits | 0 | 4 | 4 | 3 | 2 | 1 |
|
| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 |
|
||||||
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
|
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
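One possible way to reproduce the table above is sketched below in Python. It is an illustration under the rules just stated (sort by weight, keep natural order within a weight, drop zero weights, distribute codes from the lowest weight upward), not the reference implementation; the function name is hypothetical.

```python
def huffman_prefix_codes(weights: list[int]) -> dict[int, str]:
    """Map each present literal to its prefix code, written as a bit string."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    if total == 0 or total & (total - 1):
        raise ValueError("sum of 2^(Weight-1) must be a power of 2")
    max_number_of_bits = total.bit_length() - 1

    # sorted() is stable, so symbols with equal weight keep their natural order.
    present = [s for s, w in enumerate(weights) if w > 0]
    ordered = sorted(present, key=lambda s: weights[s])

    codes = {}
    position = 0  # counted in units of the longest code (Max_Number_of_Bits bits)
    for symbol in ordered:
        weight = weights[symbol]
        nb_bits = max_number_of_bits + 1 - weight
        # A symbol of this weight spans 2^(weight-1) longest-code slots.
        codes[symbol] = format(position >> (weight - 1), "0{}b".format(nb_bits))
        position += 1 << (weight - 1)
    return codes
```

Applied to the weights `[4, 3, 2, 0, 1, 1]`, it yields `0000` and `0001` for literals 4 and 5, `001` for literal 2, `01` for literal 1 and `1` for literal 0; literal 3 is absent.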
|
||||||
|
|
||||||
|
|
||||||
#### Literals bitstreams
|
#### Literals bitstreams
|
||||||
@ -638,7 +639,7 @@ As seen in a previous paragraph,
|
|||||||
there are 2 flavors of Huffman-compressed literals :
|
there are 2 flavors of Huffman-compressed literals :
|
||||||
single stream, and 4-streams.
|
single stream, and 4-streams.
|
||||||
|
|
||||||
4-streams is useful for CPU with multiple execution units and OoO operations.
|
4-streams is useful for CPU with multiple execution units and out-of-order operations.
|
||||||
Since each stream can be decoded independently,
|
Since each stream can be decoded independently,
|
||||||
it's possible to decode them up to 4x faster than a single stream,
|
it's possible to decode them up to 4x faster than a single stream,
|
||||||
presuming the CPU has enough parallelism available.
|
presuming the CPU has enough parallelism available.
|
||||||
@ -653,11 +654,11 @@ Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
|
|||||||
except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
|
except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.
|
||||||
|
|
||||||
Compressed size is provided explicitly : in the 4-streams variant,
|
Compressed size is provided explicitly : in the 4-streams variant,
|
||||||
bitstreams are preceded by 3 unsigned Little-Endian 16-bits values.
|
bitstreams are preceded by 3 unsigned little-endian 16-bits values.
|
||||||
Each value represents the compressed size of one stream, in order.
|
Each value represents the compressed size of one stream, in order.
|
||||||
The last stream size is deduced from total compressed size
|
The last stream size is deduced from total compressed size
|
||||||
and from previously decoded stream sizes :
|
and from previously decoded stream sizes :
|
||||||
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
|
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
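A minimal sketch of this computation, assuming `payload` starts at the three 16-bit values and `total_csize` is the total compressed size including those 6 bytes. Both parameter names and the function name are hypothetical.

```python
import struct


def stream_compressed_sizes(payload: bytes, total_csize: int):
    """Return the compressed sizes of the 4 streams, in order."""
    # Three unsigned little-endian 16-bit values precede the bitstreams.
    size1, size2, size3 = struct.unpack_from("<HHH", payload, 0)
    size4 = total_csize - 6 - size1 - size2 - size3
    if size4 < 0:
        raise ValueError("stream sizes exceed the total compressed size")
    return size1, size2, size3, size4
```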
|
||||||
|
|
||||||
##### Bitstreams read and decode
|
##### Bitstreams read and decode
|
||||||
|
|
||||||
@ -676,7 +677,7 @@ Starting from the end,
|
|||||||
it's possible to read the bitstream in a little-endian fashion,
|
it's possible to read the bitstream in a little-endian fashion,
|
||||||
keeping track of already used bits.
|
keeping track of already used bits.
|
||||||
|
|
||||||
Reading the last `maxBits` bits,
|
Reading the last `Max_Number_of_Bits` bits,
|
||||||
it's then possible to compare extracted value to decoding table,
|
it's then possible to compare extracted value to decoding table,
|
||||||
determining the symbol to decode and number of bits to discard.
|
determining the symbol to decode and number of bits to discard.
|
||||||
|
|
||||||
@ -696,7 +697,7 @@ A match copy command specifies an offset and a length.
|
|||||||
The offset gives the position to copy from,
|
The offset gives the position to copy from,
|
||||||
which can be within a previous block.
|
which can be within a previous block.
|
||||||
|
|
||||||
There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
|
There are 3 symbol types, literals lengths, offsets and match lengths,
|
||||||
which are encoded together, interleaved in a single _bitstream_.
|
which are encoded together, interleaved in a single _bitstream_.
|
||||||
|
|
||||||
Each symbol is a _code_ in its own context,
|
Each symbol is a _code_ in its own context,
|
||||||
@ -708,126 +709,125 @@ The Sequences section starts by a header,
|
|||||||
followed by optional Probability tables for each symbol type,
|
followed by optional Probability tables for each symbol type,
|
||||||
followed by the bitstream.
|
followed by the bitstream.
|
||||||
|
|
||||||
| Header | [LitLengthTable] | [OffsetTable] | [MatchLengthTable] | bitStream |
|
| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
|
||||||
| ------ | ---------------- | ------------- | ------------------ | --------- |
|
| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
|
||||||
|
|
||||||
To decode the Sequence section, it's required to know its size.
|
To decode the Sequence section, it's required to know its size.
|
||||||
This size is deduced from `blockSize - literalSectionSize`.
|
This size is deduced from `blockSize - literalSectionSize`.
|
||||||
|
|
||||||
|
|
||||||
#### Sequences section header
|
#### `Sequences_Section_Header`
|
||||||
|
|
||||||
It consists of 2 items :
|
It consists of 2 items :
|
||||||
- Nb of Sequences
|
- `Number_of_Sequences`
|
||||||
- Flags providing Symbol compression types
|
- Symbol compression modes
|
||||||
|
|
||||||
__Nb of Sequences__
|
__`Number_of_Sequences`__
|
||||||
|
|
||||||
This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
|
This is a variable size field using between 1 and 3 bytes.
|
||||||
Let's call its first byte `byte0`.
|
Let's call its first byte `byte0`.
|
||||||
- `if (byte0 == 0)` : there are no sequences.
|
- `if (byte0 == 0)` : there are no sequences.
|
||||||
The sequence section stops there.
|
The sequence section stops there.
|
||||||
Regenerated content is defined entirely by literals section.
|
Regenerated content is defined entirely by literals section.
|
||||||
- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
|
- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
|
||||||
- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
|
- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
|
||||||
- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
|
- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
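The four cases above can be sketched as a small Python function that returns both the decoded count and the number of bytes consumed. The function name is hypothetical.

```python
def decode_number_of_sequences(data: bytes):
    """Return (Number_of_Sequences, number of bytes used by the field)."""
    byte0 = data[0]
    if byte0 == 0:
        return 0, 1  # no sequences: the sequence section stops here
    if byte0 < 128:
        return byte0, 1
    if byte0 < 255:
        return ((byte0 - 128) << 8) + data[1], 2
    return data[1] + (data[2] << 8) + 0x7F00, 3
```

The ranges join without a gap: the largest 2-byte value is `((254-128) << 8) + 255 = 32511`, and the smallest 3-byte value is `0x7F00 = 32512`.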
|
||||||
|
|
||||||
__Symbol encoding modes__
|
__Symbol compression modes__
|
||||||
|
|
||||||
This is a single byte, defining the compression mode of each symbol type.
|
This is a single byte, defining the compression mode of each symbol type.
|
||||||
|
|
||||||
| BitNb | 7-6 | 5-4 | 3-2 | 1-0 |
|
|Bit number| 7-6 | 5-4 | 3-2 | 1-0 |
|
||||||
| ------- | ------ | ------ | ------ | -------- |
|
| -------- | ----------------------- | -------------- | -------------------- | ---------- |
|
||||||
|FieldName| LLType | OFType | MLType | Reserved |
|
|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |
|
||||||
|
|
||||||
The last field, `Reserved`, must be all-zeroes.
|
The last field, `Reserved`, must be all-zeroes.
|
||||||
|
|
||||||
`LLType`, `OFType` and `MLType` define the compression mode of
|
`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the compression mode of
|
||||||
Literal Lengths, Offsets and Match Lengths respectively.
|
literals lengths, offsets and match lengths respectively.
|
||||||
|
|
||||||
They follow the same enumeration :
|
They follow the same enumeration :
|
||||||
|
|
||||||
| Value | 0 | 1 | 2 | 3 |
|
| Value | 0 | 1 | 2 | 3 |
|
||||||
| ---------------- | ------ | --- | ---------- | ------ |
|
| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
|
||||||
| Compression Mode | predef | RLE | Compressed | Repeat |
|
| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |
|
||||||
|
|
||||||
- "predef" : uses a pre-defined distribution table.
|
- `Predefined_Mode` : uses a predefined distribution table.
|
||||||
- "RLE" : it's a single code, repeated `nbSeqs` times.
|
- `RLE_Mode` : it's a single code, repeated `Number_of_Sequences` times.
|
||||||
- "Repeat" : re-use distribution table from previous compressed block.
|
- `Repeat_Mode` : re-use distribution table from previous compressed block.
|
||||||
- "Compressed" : standard FSE compression.
|
- `FSE_Compressed_Mode` : standard FSE compression.
|
||||||
A distribution table will be present.
|
A distribution table will be present.
|
||||||
It will be described in [next part](#distribution-tables).
|
It will be described in [next part](#distribution-tables).
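Splitting the mode byte into its three fields can be sketched as below. The enum and function names are hypothetical; the bit positions and the enumeration values come from the two tables above.

```python
from enum import IntEnum


class CompressionMode(IntEnum):
    PREDEFINED = 0
    RLE = 1
    FSE_COMPRESSED = 2
    REPEAT = 3


def decode_compression_modes(mode_byte: int):
    """Return (Literals_Lengths_Mode, Offsets_Mode, Match_Lengths_Mode)."""
    if mode_byte & 0b11:
        raise ValueError("Reserved bits 1-0 must be all-zeroes")
    return (
        CompressionMode((mode_byte >> 6) & 3),  # bits 7-6
        CompressionMode((mode_byte >> 4) & 3),  # bits 5-4
        CompressionMode((mode_byte >> 2) & 3),  # bits 3-2
    )
```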
|
||||||
|
|
||||||
#### Symbols decoding
|
#### Symbols decoding
|
||||||
|
|
||||||
##### Literal Lengths codes
|
##### Literals Length codes
|
||||||
|
|
||||||
Literal lengths codes are values ranging from `0` to `35` included.
|
Literals length codes are values ranging from `0` to `35` included.
|
||||||
They define lengths from 0 to 131071 bytes.
|
They define lengths from 0 to 131071 bytes.
|
||||||
|
|
||||||
| Code | 0-15 |
|
| `Literals_Length_Code` | 0-15 |
|
||||||
| ------ | ---- |
|
| ---------------------- | ---------------------- |
|
||||||
| length | Code |
|
| length | `Literals_Length_Code` |
|
||||||
| nbBits | 0 |
|
| `Number_of_Bits` | 0 |
|
||||||
|
|
||||||
|
| `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|
||||||
|
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
||||||
|
| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
|
||||||
|
| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
|
||||||
|
|
||||||
| Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|
| `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|
||||||
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
||||||
| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
|
| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
|
||||||
| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
|
| `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|
||||||
|
|
||||||
| Code | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|
| `Literals_Length_Code` | 32 | 33 | 34 | 35 |
|
||||||
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
| ---------------------- | ---- | ---- | ---- | ---- |
|
||||||
| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
|
| Baseline | 8192 |16384 |32768 |65536 |
|
||||||
| nb Bits | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|
| `Number_of_Bits` | 13 | 14 | 15 | 16 |
|
||||||
|
|
||||||
| Code | 32 | 33 | 34 | 35 |
|
|
||||||
| -------- | ---- | ---- | ---- | ---- |
|
|
||||||
| Baseline | 8192 |16384 |32768 |65536 |
|
|
||||||
| nb Bits | 13 | 14 | 15 | 16 |
|
|
||||||
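The tables above can be turned into a lookup as sketched below. It assumes that the decoded length is the baseline plus the value of the additional bits, which is consistent with the stated range of 0 to 131071; the names are hypothetical.

```python
# Entries for Literals_Length_Code 16..35, copied from the tables above.
LL_BASELINES = (16, 18, 20, 22, 24, 28, 32, 40,
                48, 64, 128, 256, 512, 1024, 2048, 4096,
                8192, 16384, 32768, 65536)
LL_NUMBER_OF_BITS = (1, 1, 1, 1, 2, 2, 3, 3,
                     4, 6, 7, 8, 9, 10, 11, 12,
                     13, 14, 15, 16)


def literals_length(code: int, extra_bits_value: int = 0) -> int:
    """Translate a Literals_Length_Code and its additional bits into a length."""
    if not 0 <= code <= 35:
        raise ValueError("Literals_Length_Code ranges from 0 to 35")
    if code <= 15:
        return code  # length equals the code, no additional bits
    index = code - 16
    if not 0 <= extra_bits_value < (1 << LL_NUMBER_OF_BITS[index]):
        raise ValueError("additional bits value out of range for this code")
    return LL_BASELINES[index] + extra_bits_value
```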
|
|
||||||
__Default distribution__
|
__Default distribution__
|
||||||
|
|
||||||
When "compression mode" is "predef"",
|
When `Compression_Mode` is `Predefined_Mode`,
|
||||||
a pre-defined distribution is used for FSE compression.
|
a predefined distribution is used for FSE compression.
|
||||||
|
|
||||||
Below is its definition. It uses an accuracy of 6 bits (64 states).
|
Below is its definition. It uses an accuracy of 6 bits (64 states).
|
||||||
```
|
```
|
||||||
short literalLengths_defaultDistribution[36] =
|
short literalsLength_defaultDistribution[36] =
|
||||||
{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
|
{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
|
||||||
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
|
||||||
-1,-1,-1,-1 };
|
-1,-1,-1,-1 };
|
||||||
```
|
```
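A quick consistency check of this table against the stated accuracy of 6 bits can be sketched as follows. It assumes that each `-1` entry occupies exactly one state, which is an interpretation made for this sketch; the names are hypothetical.

```python
LITERALS_LENGTH_DEFAULT_DISTRIBUTION = [
    4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
    -1, -1, -1, -1,
]
ACCURACY_LOG = 6


def states_used(distribution: list[int]) -> int:
    """Count the states covered, treating each -1 entry as a single state."""
    return sum(1 if p == -1 else p for p in distribution)
```

Under that assumption the 36 entries cover `1 << 6 = 64` states exactly.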
|
||||||
|
|
||||||
##### Match Lengths codes
|
##### Match Length codes
|
||||||
|
|
||||||
Match lengths codes are values ranging from `0` to `52` included.
|
Match length codes are values ranging from `0` to `52` included.
|
||||||
They define lengths from 3 to 131074 bytes.
|
They define lengths from 3 to 131074 bytes.
|
||||||
|
|
||||||
| Code | 0-31 |
|
| `Match_Length_Code` | 0-31 |
|
||||||
| ------ | -------- |
|
| ------------------- | ----------------------- |
|
||||||
| value | Code + 3 |
|
| value | `Match_Length_Code` + 3 |
|
||||||
| nbBits | 0 |
|
| `Number_of_Bits` | 0 |
|
||||||
|
|
||||||
| Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|
| `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|
||||||
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
||||||
| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
|
| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
|
||||||
| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
|
| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
|
||||||
|
|
||||||
| Code | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
|
| `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
|
||||||
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|
||||||
| Baseline | 67 | 83 | 99 | 131 | 259 | 515 | 1027 | 2051 |
|
| Baseline | 67 | 83 | 99 | 131 | 259 | 515 | 1027 | 2051 |
|
||||||
| nb Bits | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
|
| `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
|
||||||
|
|
||||||
| Code | 48 | 49 | 50 | 51 | 52 |
|
| `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 |
|
||||||
| -------- | ---- | ---- | ---- | ---- | ---- |
|
| ------------------- | ---- | ---- | ---- | ---- | ---- |
|
||||||
| Baseline | 4099 | 8195 |16387 |32771 |65539 |
|
| Baseline | 4099 | 8195 |16387 |32771 |65539 |
|
||||||
| nb Bits | 12 | 13 | 14 | 15 | 16 |
|
| `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 |
|
||||||
|
|
||||||
__Default distribution__
|
__Default distribution__
|
||||||
|
|
||||||
When "compression mode" is defined as "predef",
|
When `Compression_Mode` is defined as `Predefined_Mode`,
|
||||||
a pre-defined distribution is used for FSE compression.
|
a predefined distribution is used for FSE compression.
|
||||||
|
|
||||||
Here is its definition. It uses an accuracy of 6 bits (64 states).
|
Here is its definition. It uses an accuracy of 6 bits (64 states).
|
||||||
```
|
```
|
||||||
@ -840,31 +840,32 @@ short matchLengths_defaultDistribution[53] =
|
|||||||
|
|
||||||
##### Offset codes
|
##### Offset codes
|
||||||
|
|
||||||
Offset codes are values ranging from `0` to `N`,
|
Offset codes are values ranging from `0` to `N`.
|
||||||
with `N` being limited by maximum backreference distance.
|
|
||||||
|
|
||||||
A decoder is free to limit its maximum `N` supported.
|
A decoder is free to limit its maximum `N` supported.
|
||||||
Recommendation is to support at least up to `22`.
|
Recommendation is to support at least up to `22`.
|
||||||
For information, at the time of this writing.
|
For information, at the time of this writing,
|
||||||
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
|
the reference decoder supports a maximum `N` value of `28` in 64-bit mode.
|
||||||
|
|
||||||
An offset code is also the nb of additional bits to read,
|
An offset code is also the number of additional bits to read,
|
||||||
and can be translated into an `OFValue` using the following formulae :
|
and can be translated into an `Offset_Value` using the following formulae :
|
||||||
|
|
||||||
```
|
```
|
||||||
OFValue = (1 << offsetCode) + readNBits(offsetCode);
|
Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
|
||||||
if (OFValue > 3) offset = OFValue - 3;
|
if (Offset_Value > 3) offset = Offset_Value - 3;
|
||||||
```
|
```
|
||||||
|
It means that the maximum `Offset_Value` is `(2^(N+1))-1`, which supports back-reference distances up to `(2^(N+1))-4`,
|
||||||
|
but is limited by [maximum back-reference distance](#window_descriptor).
|
||||||
|
|
||||||
OFValue from 1 to 3 are special : they define "repeat codes",
|
`Offset_Value` values from 1 to 3 are special : they define "repeat codes",
|
||||||
which means one of the previous offsets will be repeated.
|
which means one of the previous offsets will be repeated.
|
||||||
They are sorted in recency order, with 1 meaning the most recent one.
|
They are sorted in recency order, with 1 meaning the most recent one.
|
||||||
See [Repeat offsets](#repeat-offsets) paragraph.
|
See [Repeat offsets](#repeat-offsets) paragraph.
|
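The offset formula and the resulting limits can be checked with a few lines of C. This is a hedged sketch: `offset_value`, `max_offset_for` and `is_repeat_code` are hypothetical names, and the additional bits are assumed to have been read from the bitstream already, so no bit reader is shown.

```c
#include <assert.h>
#include <stdint.h>

/* Offset_Value = (1 << offsetCode) + the offsetCode additional bits. */
uint32_t offset_value(unsigned offset_code, uint32_t extra_bits)
{
    assert(offset_code <= 28);        /* largest N mentioned in the text */
    assert(extra_bits < ((uint32_t)1 << offset_code));
    return ((uint32_t)1 << offset_code) + extra_bits;
}

/* Largest back-reference distance reachable with offset codes 0..n:
 * the maximum Offset_Value (2^(n+1) - 1) minus 3. */
uint32_t max_offset_for(unsigned n)
{
    assert(n >= 1 && n <= 28);
    return ((uint32_t)1 << (n + 1)) - 4;
}

/* Offset_Value 1, 2 and 3 are repeat codes rather than real distances. */
int is_repeat_code(uint32_t ov)
{
    return ov >= 1 && ov <= 3;
}
```

With `n = 28` the second helper gives 536,870,908, the same limit quoted for the default distribution.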
||||||
|
|
||||||
__Default distribution__
|
__Default distribution__
|
||||||
|
|
||||||
When "compression mode" is defined as "predef",
|
When `Compression_Mode` is defined as `Predefined_Mode`,
|
||||||
a pre-defined distribution is used for FSE compression.
|
a predefined distribution is used for FSE compression.
|
||||||
|
|
||||||
Here is its definition. It uses an accuracy of 5 bits (32 states),
|
Here is its definition. It uses an accuracy of 5 bits (32 states),
|
||||||
and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
|
and supports a maximum `N` of 28, allowing offset values up to 536,870,908.
|
||||||
@ -882,38 +883,38 @@ short offsetCodes_defaultDistribution[53] =
|
|||||||
|
|
||||||
Following the header, up to 3 distribution tables can be described.
|
Following the header, up to 3 distribution tables can be described.
|
||||||
When present, they are in this order :
|
When present, they are in this order :
|
||||||
- Literal lengthes
|
- Literals lengths
|
||||||
- Offsets
|
- Offsets
|
||||||
- Match Lengthes
|
- Match Lengths
|
||||||
|
|
||||||
The content to decode depends on their respective encoding mode :
|
The content to decode depends on their respective encoding mode :
|
||||||
- Predef : no content. Use pre-defined distribution table.
|
- `Predefined_Mode` : no content. Use predefined distribution table.
|
||||||
- RLE : 1 byte. This is the only code to use across the whole compressed block.
|
- `RLE_Mode` : 1 byte. This is the only code to use across the whole compressed block.
|
||||||
- FSE : A distribution table is present.
|
- `FSE_Compressed_Mode` : A distribution table is present.
|
||||||
- Repeat mode : no content. Re-use distribution from previous compressed block.
|
- `Repeat_Mode` : no content. Re-use distribution from previous compressed block.
|
||||||
|
|
||||||
##### FSE distribution table : condensed format
|
##### FSE distribution table : condensed format
|
||||||
|
|
||||||
An FSE distribution table describes the probabilities of all symbols
|
An FSE distribution table describes the probabilities of all symbols
|
||||||
from `0` to the last present one (included)
|
from `0` to the last present one (included)
|
||||||
on a normalized scale of `1 << AccuracyLog` .
|
on a normalized scale of `1 << Accuracy_Log` .
|
||||||
|
|
||||||
It's a bitstream which is read forward, in little-endian fashion.
|
It's a bitstream which is read forward, in little-endian fashion.
|
||||||
It's not necessary to know its exact size,
|
It's not necessary to know its exact size,
|
||||||
since it will be discovered and reported by the decoding process.
|
since it will be discovered and reported by the decoding process.
|
||||||
|
|
||||||
The bitstream starts by reporting on which scale it operates.
|
The bitstream starts by reporting on which scale it operates.
|
||||||
`AccuracyLog = low4bits + 5;`
|
`Accuracy_Log = low4bits + 5`.
|
||||||
Note that maximum `AccuracyLog` for literal and match lengthes is `9`,
|
Note that the maximum `Accuracy_Log` for literals length and match length is `9`,
|
||||||
and for offsets it is `8`. Higher values are considered errors.
|
and for offsets it is `8`. Higher values are considered errors.
|
||||||
|
|
||||||
Then follow each symbol value, from `0` to last present one.
|
Then follows each symbol value, from `0` to the last present one.
|
||||||
The nb of bits used by each field is variable.
|
The number of bits used by each field is variable.
|
||||||
It depends on :
|
It depends on :
|
||||||
|
|
||||||
- Remaining probabilities + 1 :
|
- Remaining probabilities + 1 :
|
||||||
__example__ :
|
__example__ :
|
||||||
Presuming an AccuracyLog of 8,
|
Presuming an `Accuracy_Log` of 8,
|
||||||
and presuming 100 probabilities points have already been distributed,
|
and presuming 100 probability points have already been distributed,
|
||||||
the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
|
the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
|
||||||
Therefore, it must read `log2sup(156) == 8` bits.
|
Therefore, it must read `log2sup(156) == 8` bits.
|
||||||
@ -927,17 +928,17 @@ It depends on :
|
|||||||
values from 99 to 156 use 8 bits.
|
values from 99 to 156 use 8 bits.
|
||||||
This is achieved through this scheme :
|
This is achieved through this scheme :
|
||||||
|
|
||||||
| Value read | Value decoded | nb Bits used |
|
| Value read | Value decoded | Number of bits used |
|
||||||
| ---------- | ------------- | ------------ |
|
| ---------- | ------------- | ------------------- |
|
||||||
| 0 - 98 | 0 - 98 | 7 |
|
| 0 - 98 | 0 - 98 | 7 |
|
||||||
| 99 - 127 | 99 - 127 | 8 |
|
| 99 - 127 | 99 - 127 | 8 |
|
||||||
| 128 - 226 | 0 - 98 | 7 |
|
| 128 - 226 | 0 - 98 | 7 |
|
||||||
| 227 - 255 | 128 - 156 | 8 |
|
| 227 - 255 | 128 - 156 | 8 |
|
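The mapping in the table above can be expressed as a small decoding routine. This is a sketch under stated assumptions rather than the reference decoder: `bits_needed` stands in for the document's `log2sup`, `decode_small_value` is a hypothetical name, and `raw` is assumed to hold the next `bits_needed(max_value)` bits of the stream with the first bit read in bit 0.

```c
#include <assert.h>

/* Number of bits needed to represent v (the document's log2sup). */
unsigned bits_needed(unsigned v)
{
    unsigned n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Decode one field whose value lies in 0..max_value (inclusive).
 * Small values use one bit less; *bits_used reports how many bits
 * of raw were actually consumed. */
unsigned decode_small_value(unsigned raw, unsigned max_value,
                            unsigned *bits_used)
{
    unsigned nb_bits, threshold, low_mask;

    assert(max_value >= 1);
    nb_bits = bits_needed(max_value);
    assert(raw < (1u << nb_bits));

    /* Count of values that fit in the short form, e.g. 255 - 156 = 99. */
    threshold = (1u << nb_bits) - 1 - max_value;
    low_mask = (1u << (nb_bits - 1)) - 1;

    if ((raw & low_mask) < threshold) {
        *bits_used = nb_bits - 1;     /* short form */
        return raw & low_mask;
    }
    *bits_used = nb_bits;             /* long form */
    return raw > low_mask ? raw - threshold : raw;
}
```

For `max_value = 156`, a raw field of 200 decodes to 72 using 7 bits, and a raw field of 227 decodes to 128 using 8 bits, matching the third and fourth rows of the table.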
||||||
|
|
||||||
Symbols probabilities are read one by one, in order.
|
Symbol probabilities are read one by one, in order.
|
||||||
|
|
||||||
Probability is obtained from Value decoded by following formulae :
|
Probability is obtained from the decoded value using the following formula :
|
||||||
`Proba = value - 1;`
|
`Proba = value - 1`
|
||||||
|
|
||||||
It means value `0` becomes negative probability `-1`.
|
It means value `0` becomes negative probability `-1`.
|
||||||
`-1` is a special probability, which means `less than 1`.
|
`-1` is a special probability, which means `less than 1`.
|
||||||
@ -952,9 +953,9 @@ This repeat flag tells how many probabilities of zeroes follow the current one.
|
|||||||
It provides a number ranging from 0 to 3.
|
It provides a number ranging from 0 to 3.
|
||||||
If it is a 3, another 2-bits repeat flag follows, and so on.
|
If it is a 3, another 2-bits repeat flag follows, and so on.
|
||||||
|
|
||||||
When last symbol reaches cumulated total of `1 << AccuracyLog`,
|
When last symbol reaches cumulated total of `1 << Accuracy_Log`,
|
||||||
decoding is complete.
|
decoding is complete.
|
||||||
If the last symbol makes cumulated total go above `1 << AccuracyLog`,
|
If the last symbol makes cumulated total go above `1 << Accuracy_Log`,
|
||||||
distribution is considered corrupted.
|
distribution is considered corrupted.
|
||||||
|
|
||||||
Then the decoder can tell how many bytes were used in this process,
|
Then the decoder can tell how many bytes were used in this process,
|
||||||
@ -969,14 +970,14 @@ to create a unique decoding table.
|
|||||||
|
|
||||||
It follows the following build rule :
|
It is built according to the following rule :
|
||||||
|
|
||||||
The table has a size of `tableSize = 1 << AccuracyLog;`.
|
The table has a size of `tableSize = 1 << Accuracy_Log`.
|
||||||
Each cell describes the symbol decoded,
|
Each cell describes the symbol decoded,
|
||||||
and instructions to get the next state.
|
and instructions to get the next state.
|
||||||
|
|
||||||
Symbols are scanned in their natural order for `less than 1` probabilities.
|
Symbols are scanned in their natural order for `less than 1` probabilities.
|
||||||
Symbols with this probability are being attributed a single cell,
|
Each symbol with this probability is attributed a single cell,
|
||||||
starting from the end of the table.
|
starting from the end of the table.
|
||||||
These symbols define a full state reset, reading `AccuracyLog` bits.
|
These symbols define a full state reset, reading `Accuracy_Log` bits.
|
||||||
|
|
||||||
All remaining symbols are sorted in their natural order.
|
All remaining symbols are sorted in their natural order.
|
||||||
Starting from symbol `0` and table position `0`,
|
Starting from symbol `0` and table position `0`,
|
||||||
@ -1005,7 +1006,7 @@ It receives 5 state values. States are sorted in natural order.
|
|||||||
|
|
||||||
Next power of 2 is 8.
|
Next power of 2 is 8.
|
||||||
Space of probabilities is divided into 8 equal parts.
|
Space of probabilities is divided into 8 equal parts.
|
||||||
Presuming the AccuracyLog is 7, it defines 128 states.
|
Presuming the `Accuracy_Log` is 7, it defines 128 states.
|
||||||
Divided by 8, each share is 16 large.
|
Divided by 8, each share is 16 large.
|
||||||
|
|
||||||
In order to reach 8, 8-5=3 lowest states will count "double",
|
In order to reach 8, 8-5=3 lowest states will count "double",
|
||||||
@ -1014,13 +1015,13 @@ requiring one more bit in the process.
|
|||||||
|
|
||||||
Numbering starts from higher states using less bits.
|
Numbering starts from higher states using fewer bits.
|
||||||
|
|
||||||
| state order | 0 | 1 | 2 | 3 | 4 |
|
| state order | 0 | 1 | 2 | 3 | 4 |
|
||||||
| ----------- | ----- | ----- | ------ | ---- | ----- |
|
| ---------------- | ----- | ----- | ------ | ---- | ----- |
|
||||||
| width | 32 | 32 | 32 | 16 | 16 |
|
| width | 32 | 32 | 32 | 16 | 16 |
|
||||||
| nb Bits | 5 | 5 | 5 | 4 | 4 |
|
| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 |
|
||||||
| range nb | 2 | 4 | 6 | 0 | 1 |
|
| range number | 2 | 4 | 6 | 0 | 1 |
|
||||||
| baseline | 32 | 64 | 96 | 0 | 16 |
|
| baseline | 32 | 64 | 96 | 0 | 16 |
|
||||||
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
|
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
|
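The worked example above (5 states, `Accuracy_Log` of 7) can be reproduced with a short routine. This is a minimal sketch, assuming "next power of 2" means the smallest power of 2 not below the symbol's probability; `StateInfo` and `state_info` are hypothetical names.

```c
#include <assert.h>

typedef struct {
    unsigned nb_bits;   /* Number_of_Bits to read for the next state */
    unsigned baseline;  /* value added to the bits read */
} StateInfo;

/* state_order: 0-based rank of the state among the symbol's states,
 * in natural order. proba: number of states given to the symbol. */
StateInfo state_info(unsigned state_order, unsigned proba,
                     unsigned accuracy_log)
{
    StateInfo info;
    unsigned pow2 = 1;
    unsigned share_log = accuracy_log;
    unsigned nb_double, share;

    assert(proba >= 1 && proba <= (1u << accuracy_log));
    assert(state_order < proba);

    /* Find the next power of 2 and the log2 size of one share. */
    while (pow2 < proba) {
        pow2 <<= 1;
        share_log--;
    }
    nb_double = pow2 - proba;     /* lowest states that count "double" */
    share = 1u << share_log;

    if (state_order < nb_double) {
        /* Double-width states sit above the single-width ones. */
        info.nb_bits = share_log + 1;
        info.baseline = (proba - nb_double) * share
                      + state_order * 2 * share;
    } else {
        info.nb_bits = share_log;
        info.baseline = (state_order - nb_double) * share;
    }
    return info;
}
```

Calling `state_info(i, 5, 7)` for `i` from 0 to 4 yields bit counts 5, 5, 5, 4, 4 and baselines 32, 64, 96, 0, 16, the same values as the `Number_of_Bits` and baseline rows of the table.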
||||||
|
|
||||||
Next state is determined from current state
|
Next state is determined from current state
|
||||||
by reading the required number of bits, and adding the specified baseline.
|
by reading the required number of bits, and adding the specified baseline.
|
||||||
@ -1044,9 +1045,9 @@ The bitstream starts with initial state values,
|
|||||||
each using the required number of bits in their respective _accuracy_,
|
each using the required number of bits in their respective _accuracy_,
|
||||||
decoded previously from their normalized distribution.
|
decoded previously from their normalized distribution.
|
||||||
|
|
||||||
It starts by `Literal Length State`,
|
It starts with `Literals_Length_State`,
|
||||||
followed by `Offset State`,
|
followed by `Offset_State`,
|
||||||
and finally `Match Length State`.
|
and finally `Match_Length_State`.
|
||||||
|
|
||||||
Reminder : always keep in mind that all values are read _backward_.
|
Reminder : always keep in mind that all values are read _backward_.
|
||||||
|
|
||||||
@ -1056,9 +1057,9 @@ A state gives a code.
|
|||||||
A code provides a baseline and number of bits to add.
|
A code provides a baseline and number of bits to add.
|
||||||
See [Symbol Decoding] section for details on each symbol.
|
See [Symbol Decoding] section for details on each symbol.
|
||||||
|
|
||||||
Decoding starts by reading the nb of bits required to decode offset.
|
Decoding starts by reading the number of bits required to decode offset.
|
||||||
It then does the same for match length,
|
It then does the same for match length,
|
||||||
and then for literal length.
|
and then for literals length.
|
||||||
|
|
||||||
Offset / matchLength / litLength define a sequence.
|
Offset / matchLength / litLength define a sequence.
|
||||||
It starts by inserting the number of literals defined by `litLength`,
|
It starts by inserting the number of literals defined by `litLength`,
|
||||||
@ -1066,11 +1067,11 @@ then continue by copying `matchLength` bytes from `currentPos - offset`.
|
|||||||
|
|
||||||
The next operation is to update states.
|
The next operation is to update states.
|
||||||
Using rules pre-calculated in the decoding tables,
|
Using rules pre-calculated in the decoding tables,
|
||||||
`Literal Length State` is updated,
|
`Literals_Length_State` is updated,
|
||||||
followed by `Match Length State`,
|
followed by `Match_Length_State`,
|
||||||
and then `Offset State`.
|
and then `Offset_State`.
|
||||||
|
|
||||||
This operation will be repeated `NbSeqs` times.
|
This operation will be repeated `Number_of_Sequences` times.
|
||||||
At the end, the bitstream shall be entirely consumed,
|
At the end, the bitstream shall be entirely consumed,
|
||||||
otherwise bitstream is considered corrupted.
|
otherwise bitstream is considered corrupted.
|
||||||
|
|
||||||
@ -1081,7 +1082,7 @@ otherwise bitstream is considered corrupted.
|
|||||||
As seen in [Offset Codes], the first 3 values define a repeated offset.
|
As seen in [Offset Codes], the first 3 values define a repeated offset.
|
||||||
They are sorted in recency order, with 1 meaning "most recent one".
|
They are sorted in recency order, with 1 meaning "most recent one".
|
||||||
|
|
||||||
There is an exception though, when current sequence's literal length is `0`.
|
There is an exception though, when current sequence's literals length is `0`.
|
||||||
In which case, repcodes are "pushed by one",
|
In which case, repcodes are "pushed by one",
|
||||||
so 1 becomes 2, 2 becomes 3,
|
so 1 becomes 2, 2 becomes 3,
|
||||||
and 3 becomes "offset_1 - 1_byte".
|
and 3 becomes "offset_1 - 1_byte".
|
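The repeat-code rule and its zero-literals exception can be sketched as a single lookup. This is an illustration only: `resolve_offset` and the `rep` array are hypothetical names, `rep[0]` is assumed to hold the most recent offset, and updating the offset history after each sequence is deliberately left out because it is not covered by the lines above.

```c
#include <assert.h>
#include <stdint.h>

/* Turn an Offset_Value into an actual back-reference distance.
 * rep[0] is the most recent offset (repeat code 1), rep[2] the oldest. */
uint32_t resolve_offset(uint32_t offset_value, uint32_t literals_length,
                        const uint32_t rep[3])
{
    assert(offset_value >= 1);

    if (offset_value > 3) {
        return offset_value - 3;          /* regular offset */
    }
    if (literals_length != 0) {
        return rep[offset_value - 1];     /* plain repeat code */
    }
    /* Literals length is 0: repeat codes are pushed by one. */
    if (offset_value < 3) {
        return rep[offset_value];         /* 1 becomes 2, 2 becomes 3 */
    }
    return rep[0] - 1;                    /* 3 becomes offset_1 - 1 byte */
}
```

For a history of 10, 20, 30, repeat code 1 resolves to 10 when literals are present and to 20 when the literals length is 0.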
||||||
@ -1112,31 +1113,32 @@ But dictionaries created by `zstd --train` follow a format, described here.
|
|||||||
__Pre-requisites__ : a dictionary has a known length,
|
__Pre-requisites__ : a dictionary has a known length,
|
||||||
defined either by a buffer limit, or a file size.
|
defined either by a buffer limit, or a file size.
|
||||||
|
|
||||||
| Header | DictID | Stats | Content |
|
| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
|
||||||
| ------ | ------ | ----- | ------- |
|
| -------------- | --------------- | ---------------- | --------- |
|
||||||
|
|
||||||
__Header__ : 4 bytes ID, value 0xEC30A437, Little-Endian format
|
__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, little-endian format
|
||||||
|
|
||||||
__Dict_ID__ : 4 bytes, stored in Little-Endian format.
|
__`Dictionary_ID`__ : 4 bytes, stored in little-endian format.
|
||||||
DictID can be any value, except 0 (which means no DictID).
|
`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
|
||||||
It's used by decoders to check if they use the correct dictionary.
|
It's used by decoders to check if they use the correct dictionary.
|
||||||
_Reserved ranges :_
|
|
||||||
|
_Reserved ranges :_
|
||||||
If the frame is going to be distributed in a private environment,
|
If the frame is going to be distributed in a private environment,
|
||||||
any dictionary ID can be used.
|
any `Dictionary_ID` can be used.
|
||||||
However, for public distribution of compressed frames,
|
However, for public distribution of compressed frames,
|
||||||
some ranges are reserved for future use :
|
the following ranges are reserved for future use and should not be used :
|
||||||
|
|
||||||
- low range : 1 - 32767 : reserved
|
- low range : 1 - 32767
|
||||||
- high range : >= (2^31) : reserved
|
- high range : >= (2^31)
|
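Reading the first two dictionary fields and checking the reserved ranges might look like the following. This is a hedged sketch, not the library's API: `read_le32`, `parse_dict_header` and `dict_id_is_reserved` are hypothetical helpers, and treating a `Dictionary_ID` of 0 as a parse failure is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DICT_MAGIC 0xEC30A437u

/* Read a 4-byte little-endian value. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Returns 1 and stores the Dictionary_ID on success, 0 otherwise. */
int parse_dict_header(const uint8_t *buf, size_t len, uint32_t *dict_id)
{
    assert(buf != NULL && dict_id != NULL);
    if (len < 8) {
        return 0;                         /* too short for both fields */
    }
    if (read_le32(buf) != DICT_MAGIC) {
        return 0;                         /* wrong Magic_Number */
    }
    *dict_id = read_le32(buf + 4);
    return *dict_id != 0;                 /* 0 means no Dictionary_ID */
}

/* Ranges reserved for future use in publicly distributed frames. */
int dict_id_is_reserved(uint32_t id)
{
    return (id >= 1 && id <= 32767) || id >= 0x80000000u;
}
```

A tool producing dictionaries for public distribution could call `dict_id_is_reserved` before writing the header and pick another ID when it returns non-zero.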
||||||
|
|
||||||
__Stats__ : Entropy tables, following the same format as a [compressed blocks].
|
__`Entropy_Tables`__ : following the same format as [compressed blocks].
|
||||||
They are stored in following order :
|
They are stored in the following order :
|
||||||
Huffman tables for literals, FSE table for offset,
|
Huffman tables for literals, FSE table for offset,
|
||||||
FSE table for matchLenth, and FSE table for litLength.
|
FSE table for matchLength, and FSE table for litLength.
|
||||||
It's finally followed by 3 offset values, populating recent offsets,
|
It's finally followed by 3 offset values, populating recent offsets,
|
||||||
stored in order, 4-bytes little-endian each, for a total of 12 bytes.
|
stored in order, 4-bytes little-endian each, for a total of 12 bytes.
|
||||||
|
|
||||||
__Content__ : Where the actual dictionary content is.
|
__`Content`__ : Where the actual dictionary content is.
|
||||||
Content size depends on Dictionary size.
|
Content size depends on Dictionary size.
|
||||||
|
|
||||||
[compressed blocks]: #the-format-of-compressed_block
|
[compressed blocks]: #the-format-of-compressed_block
|
||||||
|