updated zstd_compression_format.md
parent 0132375ba6
commit de9d13068f
@@ -31,8 +31,8 @@ will rely more and more on previously decoded content to compress the rest of the file.
 Usage of the dictionary builder and created dictionaries with CLI:
 
 1. Create the dictionary : `zstd --train FullPathToTrainingSet/* -o dictionaryName`
-2. Compress with dictionary: `zstd FILE -D dictionaryName`
-3. Decompress with dictionary: `zstd --decompress FILE.zst -D dictionaryName`
+2. Compress with the dictionary: `zstd FILE -D dictionaryName`
+3. Decompress with the dictionary: `zstd --decompress FILE.zst -D dictionaryName`
 
 
 
@@ -271,7 +271,7 @@ which can be any value from 1 to 2^64-1 bytes (16 EB).
 | ----------- | ---------- | ---------- |
 | Field name | `Exponent` | `Mantissa` |
 
-Maximum distance is given by the following formulae :
+Maximum distance is given by the following formulas :
 ```
 windowLog = 10 + Exponent;
 windowBase = 1 << windowLog;
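The hunk stops after `windowBase`; for orientation, below is a minimal C sketch of how the two fields might be combined into a window size. The `windowAdd = (windowBase / 8) * Mantissa` step is an assumption taken from the reference specification and does not appear in the hunk above.

```
#include <stdint.h>

/* Sketch: derive the maximum back-reference distance from the Exponent and
 * Mantissa fields of the Window_Descriptor byte. The windowAdd term is an
 * assumption taken from the reference specification. */
static uint64_t window_size(uint8_t exponent, uint8_t mantissa)
{
    uint64_t windowLog  = 10 + exponent;
    uint64_t windowBase = (uint64_t)1 << windowLog;
    uint64_t windowAdd  = (windowBase / 8) * mantissa;
    return windowBase + windowAdd;
}
```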
@@ -415,7 +415,7 @@ To decode a compressed block, the following elements are necessary :
 or all previous blocks when `Single_Segment_flag` is set.
 - List of "recent offsets" from previous compressed block.
 - Decoding tables of previous compressed block for each symbol type
-(literals, litLength, matchLength, offset).
+(literals, literals lengths, match lengths, offsets).
 
 
 ### `Literals_Section`
@@ -510,7 +510,7 @@ Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
 
 #### `Huffman_Tree_Description`
 
-This section is only present when `Literals_Block_Type` type is `Compressed_Block` (`2`).
+This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
 
 Prefix coding represents symbols from an a priori known alphabet
 by bit sequences (codewords), one codeword for each symbol,
@@ -532,9 +532,11 @@ This specification limits maximum code length to 11 bits.
 ##### Representation
 
 All literal values from zero (included) to last present one (excluded)
-are represented by `Weight` values, from 0 to `Max_Number_of_Bits`.
-Transformation from `Weight` to `Number_of_Bits` follows this formulae :
-`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` .
+are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
+Transformation from `Weight` to `Number_of_Bits` follows this formula :
+```
+Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
+```
 The last symbol's `Weight` is deduced from previously decoded ones,
 by completing to the nearest power of 2.
 This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
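Below is a minimal C sketch of the "completing to the nearest power of 2" step described above. It assumes each decoded weight `w > 0` contributes `2^(w-1)` to a running total, as in the reference specification; all helper names are illustrative.

```
#include <stdint.h>

/* Sketch: deduce the tree depth (Max_Number_of_Bits) and the last symbol's
 * Weight from the weights decoded so far, by completing their sum to the
 * nearest power of 2. Returns the last Weight, or -1 on a corrupted total. */
static int deduce_last_weight(const uint8_t *weights, int numDecoded,
                              int *maxNumberOfBits)
{
    uint64_t total = 0;
    for (int i = 0; i < numDecoded; i++)
        if (weights[i] > 0)
            total += (uint64_t)1 << (weights[i] - 1);

    int maxBits = 0;
    while (((uint64_t)1 << maxBits) <= total)   /* smallest power of 2 > total */
        maxBits++;
    *maxNumberOfBits = maxBits;

    uint64_t rest = ((uint64_t)1 << maxBits) - total;
    if ((rest & (rest - 1)) != 0)               /* rest must be a power of 2 */
        return -1;

    int lastWeight = 1;
    while (((uint64_t)1 << (lastWeight - 1)) < rest)
        lastWeight++;
    return lastWeight;
}
```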
@@ -549,7 +551,10 @@ Let's presume the following Huffman tree must be described :
 The tree depth is 4, since its smallest element uses 4 bits.
 Value `5` will not be listed, nor will values above `5`.
 Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
-Weight formula is : `Weight = Number_of_Bits ? Max_Number_of_Bits + 1 - Number_of_Bits : 0`.
+Weight formula is :
+```
+Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
+```
 It gives the following serie of weights :
 
 | `Weight` | 4 | 3 | 2 | 0 | 1 |
@@ -580,9 +585,9 @@ which tells how to decode the list of weights.
 
 - if `headerByte` < 128 :
 the serie of weights is compressed by FSE.
-The length of the FSE-compressed serie is `headerByte` (0-127).
+The length of the FSE-compressed serie is equal to `headerByte` (0-127).
 
-##### FSE (Finite State Entropy) compression of Huffman weights
+##### Finite State Entropy (FSE) compression of Huffman weights
 
 The serie of weights is compressed using FSE compression.
 It's a single bitstream with 2 interleaved states,
@@ -612,9 +617,10 @@ When both states have overflowed the bitstream, end is reached.
 
 ##### Conversion from weights to Huffman prefix codes
 
 All present symbols shall now have a `Weight` value.
-It is possible to transform weights into Number_of_Bits, using this formula :
-`Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0` .
+It is possible to transform weights into Number_of_Bits, using this formula:
+```
+Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0
+```
 Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
 Symbols with a `Weight` of zero are removed.
 Then, starting from lowest weight, prefix codes are distributed in order.
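A short C sketch of this code distribution, under the assumption that codes are assigned from the lowest weight upward (longest codes first) and get one bit shorter at each weight increment; function and variable names are illustrative.

```
#include <stdint.h>

/* Sketch: distribute prefix codes from weights, lowest weight first,
 * keeping natural symbol order within the same weight.
 * Symbols with Weight 0 receive no code at all. */
static void assign_prefix_codes(const uint8_t *weight, int numSymbols,
                                int maxNumberOfBits,
                                uint32_t *code, uint8_t *codeLength)
{
    uint32_t nextVal = 0;
    for (int w = 1; w <= maxNumberOfBits; w++) {
        if (w > 1)
            nextVal >>= 1;                      /* codes get one bit shorter */
        for (int s = 0; s < numSymbols; s++) {
            if (weight[s] != w) continue;
            codeLength[s] = (uint8_t)(maxNumberOfBits + 1 - w);
            code[s] = nextVal++;
        }
    }
}
```

With the weights of the worked example above, this sketch reproduces the prefix codes 0000, 0001, 001, 01 and 1 listed in the table of the next hunk.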
@@ -636,21 +642,21 @@ it gives the following distribution :
 | prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
 
 
-#### Literals bitstreams
+#### The content of Huffman-compressed literal stream
 
 ##### Bitstreams sizes
 
 As seen in a previous paragraph,
-there are 2 flavors of Huffman-compressed literals :
-single stream, and 4-streams.
+there are 2 types of Huffman-compressed literals :
+a single stream and 4 streams.
 
-4-streams is useful for CPU with multiple execution units and out-of-order operations.
+Encoding using 4 streams is useful for CPU with multiple execution units and out-of-order operations.
 Since each stream can be decoded independently,
 it's possible to decode them up to 4x faster than a single stream,
 presuming the CPU has enough parallelism available.
 
 For single stream, header provides both the compressed and regenerated size.
-For 4-streams though,
+For 4 streams though,
 header only provides compressed and regenerated size of all 4 streams combined.
 In order to properly decode the 4 streams,
 it's necessary to know the compressed and regenerated size of each stream.
@@ -663,8 +669,10 @@ bitstreams are preceded by 3 unsigned little-endian 16-bits values.
 Each value represents the compressed size of one stream, in order.
 The last stream size is deducted from total compressed size
 and from previously decoded stream sizes :
+
 `stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
+
 
 ##### Bitstreams read and decode
 
 Each bitstream must be read _backward_,
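A small C sketch of reading this jump table; `src` points at the 6-byte table, `totalCSize` is the compressed size of the 4 streams plus the table itself, and all names are illustrative.

```
#include <stddef.h>
#include <stdint.h>

/* Sketch: parse the 3 little-endian 16-bit sizes and deduce the 4th one,
 * as in the formula above. Returns -1 if the sizes exceed totalCSize. */
static int parse_jump_table(const uint8_t *src, size_t totalCSize,
                            size_t streamCSize[4])
{
    streamCSize[0] = (size_t)src[0] | ((size_t)src[1] << 8);
    streamCSize[1] = (size_t)src[2] | ((size_t)src[3] << 8);
    streamCSize[2] = (size_t)src[4] | ((size_t)src[5] << 8);

    size_t used = 6 + streamCSize[0] + streamCSize[1] + streamCSize[2];
    if (used > totalCSize) return -1;           /* corrupted stream sizes */

    streamCSize[3] = totalCSize - used;         /* deduced last stream size */
    return 0;
}
```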
@@ -706,23 +714,18 @@ When all _sequences_ are decoded,
 if there is any literal left in the _literal section_,
 these bytes are added at the end of the block.
 
-The _Sequences_Section_ regroup all symbols required to decode commands.
+The `Sequences_Section` regroup all symbols required to decode commands.
 There are 3 symbol types : literals lengths, offsets and match lengths.
 They are encoded together, interleaved, in a single _bitstream_.
 
-Each symbol is a _code_ in its own context,
-which specifies a baseline and a number of bits to add.
-_Codes_ are FSE compressed,
-and interleaved with raw additional bits in the same bitstream.
-
-The Sequences section starts by a header,
-followed by optional Probability tables for each symbol type,
+The `Sequences_Section` starts by a header,
+followed by optional probability tables for each symbol type,
 followed by the bitstream.
 
 | `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
 | -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
 
-To decode the Sequence section, it's required to know its size.
+To decode the `Sequences_Section`, it's required to know its size.
 This size is deducted from `blockSize - literalSectionSize`.
 
@@ -753,8 +756,8 @@ This is a single byte, defining the compression mode of each symbol type.
 
 The last field, `Reserved`, must be all-zeroes.
 
-`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the compression mode of
-literals lengths, offsets and match lengths respectively.
+`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
+literals lengths, offsets, and match lengths respectively.
 
 They follow the same enumeration :
 
@@ -769,9 +772,14 @@ They follow the same enumeration :
 A distribution table will be present.
 It will be described in [next part](#distribution-tables).
 
-#### Symbols decoding
+#### The codes for literals lengths, match lengths, and offsets.
 
-##### Literals Length codes
+Each symbol is a _code_ in its own context,
+which specifies `Baseline` and `Number_of_Bits` to add.
+_Codes_ are FSE compressed,
+and interleaved with raw additional bits in the same bitstream.
+
+##### Literals length codes
 
 Literals length codes are values ranging from `0` to `35` included.
 They define lengths from 0 to 131071 bytes.
@@ -783,20 +791,20 @@ They define lengths from 0 to 131071 bytes.
 
 | `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
 | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
+| `Baseline` | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
 | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
 
 | `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
 | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
+| `Baseline` | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
 | `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
 
 | `Literals_Length_Code` | 32 | 33 | 34 | 35 |
 | ---------------------- | ---- | ---- | ---- | ---- |
-| Baseline | 8192 |16384 |32768 |65536 |
+| `Baseline` | 8192 |16384 |32768 |65536 |
 | `Number_of_Bits` | 13 | 14 | 15 | 16 |
 
-__Default distribution__
+##### Default distribution for literals length codes
 
 When `Compression_Mode` is `Predefined_Mode`,
 a predefined distribution is used for FSE compression.
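As a sketch, decoding a literals length from its code might look like the following, with `readBits()` standing in for the backward bitstream reader described later; the two lookup tables hold the `Baseline` and `Number_of_Bits` values above.

```
#include <stdint.h>

extern uint32_t readBits(int nbBits);   /* assumed bitstream primitive */

/* Sketch: length = Baseline[code] + extra bits read from the stream.
 * For example, code 22 has Baseline 32 and 3 extra bits, so it covers
 * lengths 32 to 39. */
static uint32_t decode_literals_length(uint32_t llCode,
                                       const uint32_t *baseline,
                                       const uint8_t *numberOfBits)
{
    return baseline[llCode] + readBits(numberOfBits[llCode]);
}
```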
@@ -809,7 +817,7 @@ short literalsLength_defaultDistribution[36] =
 -1,-1,-1,-1 };
 ```
 
-##### Match Length codes
+##### Match length codes
 
 Match length codes are values ranging from `0` to `52` included.
 They define lengths from 3 to 131074 bytes.
@@ -821,25 +829,25 @@ They define lengths from 3 to 131074 bytes.
 
 | `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
 | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
+| `Baseline` | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
 | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
 
 | `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
 | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
+| `Baseline` | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
 | `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
 
 | `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 |
 | ------------------- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 4098 | 8194 |16486 |32770 |65538 |
+| `Baseline` | 4098 | 8194 |16386 |32770 |65538 |
 | `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 |
 
-__Default distribution__
+##### Default distribution for match length codes
 
 When `Compression_Mode` is defined as `Predefined_Mode`,
 a predefined distribution is used for FSE compression.
 
-Here is its definition. It uses an accuracy of 6 bits (64 states).
+Below is its definition. It uses an accuracy of 6 bits (64 states).
 ```
 short matchLengths_defaultDistribution[53] =
 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
@@ -858,26 +866,27 @@ For information, at the time of this writing.
 the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
 
 An offset code is also the number of additional bits to read,
-and can be translated into an `Offset_Value` using the following formulae :
+and can be translated into an `Offset_Value` using the following formulas :
 
 ```
 Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
 if (Offset_Value > 3) offset = Offset_Value - 3;
 ```
-It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to 2^(N+1))-4
+It means that maximum `Offset_Value` is `2^(N+1)-1` and it supports back-reference distance up to `2^(N+1)-4`
 but is limited by [maximum back-reference distance](#window_descriptor).
 
-Offset_Value from 1 to 3 are special : they define "repeat codes",
+`Offset_Value` from 1 to 3 are special : they define "repeat codes",
 which means one of the previous offsets will be repeated.
 They are sorted in recency order, with 1 meaning the most recent one.
 See [Repeat offsets](#repeat-offsets) paragraph.
 
-__Default distribution__
+
+##### Default distribution for offset codes
 
 When `Compression_Mode` is defined as `Predefined_Mode`,
 a predefined distribution is used for FSE compression.
 
-Here is its definition. It uses an accuracy of 5 bits (32 states),
+Below is its definition. It uses an accuracy of 5 bits (32 states),
 and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
 
 If any sequence in the compressed block requires an offset larger than this,
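A hedged C sketch of the translation above; `readNBits()` is the bitstream primitive named in the formula, and repeat-code resolution is only hinted at.

```
#include <stdint.h>

extern uint32_t readNBits(int nbBits);  /* assumed bitstream primitive */

/* Sketch: translate an offset code into Offset_Value.
 * Values 1-3 are repeat codes (one of the recent offsets is reused);
 * values above 3 encode an actual distance of Offset_Value - 3. */
static uint32_t decode_offset_value(int offsetCode)
{
    return ((uint32_t)1 << offsetCode) + readNBits(offsetCode);
}
```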
@@ -918,7 +927,7 @@ The bitstream starts by reporting on which scale it operates.
 Note that maximum `Accuracy_Log` for literal and match lengths is `9`,
 and for offsets is `8`. Higher values are considered errors.
 
-Then follow each symbol value, from `0` to last present one.
+Then follows each symbol value, from `0` to last present one.
 The number of bits used by each field is variable.
 It depends on :
 
@@ -947,11 +956,11 @@ It depends on :
 
 Symbols probabilities are read one by one, in order.
 
-Probability is obtained from Value decoded by following formulae :
+Probability is obtained from Value decoded by following formula :
 `Proba = value - 1`
 
 It means value `0` becomes negative probability `-1`.
-`-1` is a special probability, which means `less than 1`.
+`-1` is a special probability, which means "less than 1".
 Its effect on distribution table is described in [next paragraph].
 For the purpose of calculating cumulated distribution, it counts as one.
 
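A tiny C sketch of that rule, with illustrative names:

```
/* Sketch: value 0 becomes the special probability -1 ("less than 1"),
 * which still counts as one toward the cumulated distribution. */
static short value_to_proba(unsigned value, unsigned *cumulated)
{
    short proba = (short)value - 1;
    *cumulated += (proba == -1) ? 1u : (unsigned)proba;
    return proba;
}
```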
@@ -1006,7 +1015,7 @@ typically by a "less than 1" probability symbol.
 The result is a list of state values.
 Each state will decode the current symbol.
 
-To get the Number of bits and baseline required for next state,
+To get the `Number_of_Bits` and `Baseline` required for next state,
 it's first necessary to sort all states in their natural order.
 The lower states will need 1 more bit than higher ones.
 
@@ -1030,11 +1039,11 @@ Numbering starts from higher states using less bits.
 | width | 32 | 32 | 32 | 16 | 16 |
 | `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 |
 | range number | 2 | 4 | 6 | 0 | 1 |
-| baseline | 32 | 64 | 96 | 0 | 16 |
+| `Baseline` | 32 | 64 | 96 | 0 | 16 |
 | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
 
 Next state is determined from current state
-by reading the required number of bits, and adding the specified baseline.
+by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
 
 
 #### Bitstream
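A minimal sketch of this state update, assuming a decoding-table entry that carries the `Number_of_Bits` to read and the `Baseline` to add; the exact table layout and `readBitsBackward()` are illustrative placeholders.

```
#include <stdint.h>

extern uint32_t readBitsBackward(int nbBits);   /* assumed backward bit reader */

typedef struct {
    uint8_t  symbol;         /* symbol decoded while in this state */
    uint8_t  numberOfBits;   /* bits to read for the next state */
    uint16_t baseline;       /* value added to those bits */
} FseDecodeEntry;

/* Sketch: nextState = Baseline + bits read from the stream. */
static uint32_t fse_next_state(const FseDecodeEntry *table, uint32_t state)
{
    const FseDecodeEntry *e = &table[state];
    return e->baseline + readBitsBackward(e->numberOfBits);
}
```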
@@ -1064,16 +1073,16 @@ Reminder : always keep in mind that all values are read _backward_.
 ##### Decoding a sequence
 
 A state gives a code.
-A code provides a baseline and number of bits to add.
+A code provides `Baseline` and `Number_of_Bits` to add.
 See [Symbol Decoding] section for details on each symbol.
 
-Decoding starts by reading the number of bits required to decode offset.
-It then does the same for match length,
-and then for literals length.
+Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
+It then does the same for `Match_Length`,
+and then for `Literals_Length`.
 
-Offset / matchLength / litLength define a sequence.
-It starts by inserting the number of literals defined by `litLength`,
-then continue by copying `matchLength` bytes from `currentPos - offset`.
+`Offset`, `Match_Length`, and `Literals_Length` define a sequence.
+It starts by inserting the number of literals defined by `Literals_Length`,
+then continue by copying `Match_Length` bytes from `currentPos - Offset`.
 
 The next operation is to update states.
 Using rules pre-calculated in the decoding tables,
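A byte-level C sketch of executing one sequence as described; buffer handling and names are illustrative.

```
#include <stddef.h>
#include <stdint.h>

/* Sketch: insert Literals_Length bytes from the literals buffer, then copy
 * Match_Length bytes starting at currentPos - Offset. A byte-by-byte copy is
 * used because the match may overlap the output currently being written. */
static void execute_sequence(uint8_t *dst, size_t *dstPos,
                             const uint8_t *literals, size_t *litPos,
                             size_t literalsLength, size_t matchLength,
                             size_t offset)
{
    for (size_t i = 0; i < literalsLength; i++)
        dst[(*dstPos)++] = literals[(*litPos)++];

    size_t matchPos = *dstPos - offset;          /* currentPos - Offset */
    for (size_t i = 0; i < matchLength; i++)
        dst[(*dstPos)++] = dst[matchPos++];
}
```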
@@ -1085,7 +1094,7 @@ This operation will be repeated `Number_of_Sequences` times.
 At the end, the bitstream shall be entirely consumed,
 otherwise bitstream is considered corrupted.
 
-[Symbol Decoding]:#symbols-decoding
+[Symbol Decoding]:#the-codes-for-literals-lengths-match-lengths-and-offsets
 
 ##### Repeat offsets
 
@@ -1143,8 +1152,8 @@ _Reserved ranges :_
 
 __`Entropy_Tables`__ : following the same format as a [compressed blocks].
 They are stored in following order :
-Huffman tables for literals, FSE table for offset,
-FSE table for matchLenth, and FSE table for litLength.
+Huffman tables for literals, FSE table for offsets,
+FSE table for match lengths, and FSE table for literals lengths.
 It's finally followed by 3 offset values, populating recent offsets,
 stored in order, 4-bytes little-endian each, for a total of 12 bytes.
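A short C sketch of reading those 12 bytes; names are illustrative.

```
#include <stdint.h>

/* Sketch: read the 3 recent-offset values that close the dictionary's
 * entropy section, each stored as 4 bytes little-endian. */
static void read_recent_offsets(const uint8_t *p, uint32_t recentOffsets[3])
{
    for (int i = 0; i < 3; i++) {
        recentOffsets[i] = (uint32_t)p[0]
                         | ((uint32_t)p[1] << 8)
                         | ((uint32_t)p[2] << 16)
                         | ((uint32_t)p[3] << 24);
        p += 4;
    }
}
```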