updated zstd_compression_format.md

This commit is contained in:
inikep 2016-08-25 14:59:08 +02:00
parent 0132375ba6
commit de9d13068f
2 changed files with 72 additions and 63 deletions


@ -31,8 +31,8 @@ will rely more and more on previously decoded content to compress the rest of th
Usage of the dictionary builder and created dictionaries with CLI:
1. Create the dictionary : `zstd --train FullPathToTrainingSet/* -o dictionaryName`
2. Compress with dictionary: `zstd FILE -D dictionaryName`
3. Decompress with dictionary: `zstd --decompress FILE.zst -D dictionaryName`
2. Compress with the dictionary: `zstd FILE -D dictionaryName`
3. Decompress with the dictionary: `zstd --decompress FILE.zst -D dictionaryName`


@ -271,7 +271,7 @@ which can be any value from 1 to 2^64-1 bytes (16 EB).
| ----------- | ---------- | ---------- |
| Field name | `Exponent` | `Mantissa` |
Maximum distance is given by the following formulae :
Maximum distance is given by the following formulas :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
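/* worked example (illustration only) : Exponent = 0  =>  windowLog = 10, windowBase = 1 << 10 = 1024 */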
@ -415,7 +415,7 @@ To decode a compressed block, the following elements are necessary :
or all previous blocks when `Single_Segment_flag` is set.
- List of "recent offsets" from previous compressed block.
- Decoding tables of previous compressed block for each symbol type
(literals, litLength, matchLength, offset).
(literals, literals lengths, match lengths, offsets).
### `Literals_Section`
@ -510,7 +510,7 @@ Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian conven
#### `Huffman_Tree_Description`
This section is only present when `Literals_Block_Type` type is `Compressed_Block` (`2`).
This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
Prefix coding represents symbols from an a priori known alphabet
by bit sequences (codewords), one codeword for each symbol,
@ -532,9 +532,11 @@ This specification limits maximum code length to 11 bits.
##### Representation
All literal values from zero (included) to last present one (excluded)
are represented by `Weight` values, from 0 to `Max_Number_of_Bits`.
Transformation from `Weight` to `Number_of_Bits` follows this formulae :
`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` .
are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
Transformation from `Weight` to `Number_of_Bits` follows this formula :
```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
```
The last symbol's `Weight` is deduced from previously decoded ones,
by completing to the nearest power of 2.
This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
@ -549,7 +551,10 @@ Let's presume the following Huffman tree must be described :
The tree depth is 4, since its smallest element uses 4 bits.
Value `5` will not be listed, nor will values above `5`.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is : `Weight = Number_of_Bits ? Max_Number_of_Bits + 1 - Number_of_Bits : 0`.
Weight formula is :
```
Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
```
It gives the following series of weights :
| `Weight` | 4 | 3 | 2 | 0 | 1 |
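As a cross-check (assuming, as in the full specification, that the completion is computed over the sum of `2^(Weight-1)` for non-zero weights) : the listed weights contribute `8 + 4 + 2 + 0 + 1 = 15`, the nearest power of 2 is `16 = 2^4`, so `Max_Number_of_Bits = 4` and the last symbol receives `Weight = 1`, since `2^(1-1) = 16 - 15 = 1`.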
@ -580,9 +585,9 @@ which tells how to decode the list of weights.
- if `headerByte` < 128 :
the series of weights is compressed by FSE.
The length of the FSE-compressed serie is `headerByte` (0-127).
The length of the FSE-compressed series is equal to `headerByte` (0-127).
##### FSE (Finite State Entropy) compression of Huffman weights
##### Finite State Entropy (FSE) compression of Huffman weights
The series of weights is compressed using FSE compression.
It's a single bitstream with 2 interleaved states,
@ -612,9 +617,10 @@ When both states have overflowed the bitstream, end is reached.
##### Conversion from weights to Huffman prefix codes
All present symbols shall now have a `Weight` value.
It is possible to transform weights into Number_of_Bits, using this formula :
`Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0` .
It is possible to transform weights into `Number_of_Bits`, using this formula :
```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
```
Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
Symbols with a `Weight` of zero are removed.
Then, starting from lowest weight, prefix codes are distributed in order.
@ -636,21 +642,21 @@ it gives the following distribution :
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
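A possible way to carry out this distribution, given the rules above (a sketch, not the reference implementation ; `weight[]`, `prefixCode[]`, `codeLength[]`, `maxBits` and `lastSymbol` are illustrative names) :
```
/* weight[] holds the decoded weights, maxBits is Max_Number_of_Bits,
   lastSymbol is the last present literal value. */
unsigned code = 0;
for (unsigned w = 1; w <= maxBits; w++) {                /* lowest weight first : longest codes  */
    unsigned nbBits = maxBits + 1 - w;
    for (unsigned sym = 0; sym <= lastSymbol; sym++) {   /* natural symbol order within a weight */
        if (weight[sym] != w) continue;                  /* weight 0 : symbol is simply absent   */
        prefixCode[sym] = code;                          /* read as a nbBits-bit value           */
        codeLength[sym] = nbBits;
        code++;
    }
    code >>= 1;                                          /* one bit shorter for the next weight  */
}
```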
#### Literals bitstreams
#### The content of Huffman-compressed literals streams
##### Bitstreams sizes
As seen in a previous paragraph,
there are 2 flavors of Huffman-compressed literals :
single stream, and 4-streams.
there are 2 types of Huffman-compressed literals :
a single stream and 4 streams.
4-streams is useful for CPU with multiple execution units and out-of-order operations.
Encoding using 4 streams is useful for CPUs with multiple execution units and out-of-order execution.
Since each stream can be decoded independently,
it's possible to decode them up to 4x faster than a single stream,
presuming the CPU has enough parallelism available.
For a single stream, the header provides both the compressed and regenerated size.
For 4-streams though,
For 4 streams though,
header only provides compressed and regenerated size of all 4 streams combined.
In order to properly decode the 4 streams,
it's necessary to know the compressed and regenerated size of each stream.
@ -663,8 +669,10 @@ bitstreams are preceded by 3 unsigned little-endian 16-bits values.
Each value represents the compressed size of one stream, in order.
The last stream size is deducted from total compressed size
and from previously decoded stream sizes :
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
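As an illustration, the four stream sizes could be recovered as follows (a sketch ; `read_le16()` is a hypothetical little-endian helper and `ptr` points at the start of this jump table) :
```
size_t stream1CSize = read_le16(ptr);       /* compressed size of stream 1 */
size_t stream2CSize = read_le16(ptr + 2);   /* compressed size of stream 2 */
size_t stream3CSize = read_le16(ptr + 4);   /* compressed size of stream 3 */
/* the 6 subtracted bytes are the jump table itself (3 x 2 bytes) */
size_t stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;
```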
##### Bitstreams read and decode
Each bitstream must be read _backward_,
@ -706,23 +714,18 @@ When all _sequences_ are decoded,
if there is any literal left in the _literal section_,
these bytes are added at the end of the block.
The _Sequences_Section_ regroup all symbols required to decode commands.
The `Sequences_Section` regroups all symbols required to decode commands.
There are 3 symbol types : literals lengths, offsets and match lengths.
They are encoded together, interleaved, in a single _bitstream_.
Each symbol is a _code_ in its own context,
which specifies a baseline and a number of bits to add.
_Codes_ are FSE compressed,
and interleaved with raw additional bits in the same bitstream.
The Sequences section starts by a header,
followed by optional Probability tables for each symbol type,
The `Sequences_Section` starts by a header,
followed by optional probability tables for each symbol type,
followed by the bitstream.
| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
To decode the Sequence section, it's required to know its size.
To decode the `Sequences_Section`, it's required to know its size.
This size is deduced from `blockSize - literalSectionSize`.
@ -753,8 +756,8 @@ This is a single byte, defining the compression mode of each symbol type.
The last field, `Reserved`, must be all-zeroes.
`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the compression mode of
literals lengths, offsets and match lengths respectively.
`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
literals lengths, offsets, and match lengths respectively.
They follow the same enumeration :
@ -769,9 +772,14 @@ They follow the same enumeration :
A distribution table will be present.
It will be described in [next part](#distribution-tables).
#### Symbols decoding
#### The codes for literals lengths, match lengths, and offsets
##### Literals Length codes
Each symbol is a _code_ in its own context,
which specifies `Baseline` and `Number_of_Bits` to add.
_Codes_ are FSE compressed,
and interleaved with raw additional bits in the same bitstream.
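Concretely, once a _code_ has been decoded, the corresponding value can be reconstructed as sketched below (`read_bits()` is an illustrative helper returning the raw additional bits) :
```
value = Baseline + read_bits(Number_of_Bits);
```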
##### Literals length codes
Literals length codes are values ranging from `0` to `35` included.
They define lengths from 0 to 131071 bytes.
@ -783,20 +791,20 @@ They define lengths from 0 to 131071 bytes.
| `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
| `Baseline` | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
| `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
| `Baseline` | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
| `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| `Literals_Length_Code` | 32 | 33 | 34 | 35 |
| ---------------------- | ---- | ---- | ---- | ---- |
| Baseline | 8192 |16384 |32768 |65536 |
| `Baseline` | 8192 |16384 |32768 |65536 |
| `Number_of_Bits` | 13 | 14 | 15 | 16 |
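For example, `Literals_Length_Code` `25` has a `Baseline` of `64` and `6` additional bits, so it represents literals lengths from `64` (`64 + 0`) to `127` (`64 + 63`).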
__Default distribution__
##### Default distribution for literals length codes
When `Compression_Mode` is `Predefined_Mode`,
a predefined distribution is used for FSE compression.
@ -809,7 +817,7 @@ short literalsLength_defaultDistribution[36] =
-1,-1,-1,-1 };
```
##### Match Length codes
##### Match length codes
Match length codes are values ranging from `0` to `52` included.
They define lengths from 3 to 131074 bytes.
@ -821,25 +829,25 @@ They define lengths from 3 to 131074 bytes.
| `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
| `Baseline` | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
| `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
| `Baseline` | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
| `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
| `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 |
| ------------------- | ---- | ---- | ---- | ---- | ---- |
| Baseline | 4098 | 8194 |16486 |32770 |65538 |
| `Baseline` | 4098 | 8194 |16386 |32770 |65538 |
| `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 |
__Default distribution__
##### Default distribution for match length codes
When `Compression_Mode` is defined as `Predefined_Mode`,
a predefined distribution is used for FSE compression.
Here is its definition. It uses an accuracy of 6 bits (64 states).
Below is its definition. It uses an accuracy of 6 bits (64 states).
```
short matchLengths_defaultDistribution[53] =
{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
@ -858,26 +866,27 @@ For information, at the time of this writing.
the reference decoder supports a maximum `N` value of `28` in 64-bit mode.
An offset code is also the number of additional bits to read,
and can be translated into an `Offset_Value` using the following formulae :
and can be translated into an `Offset_Value` using the following formulas :
```
Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
if (Offset_Value > 3) offset = Offset_Value - 3;
```
It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to 2^(N+1))-4
It means that the maximum `Offset_Value` is `2^(N+1)-1`, and it supports back-reference distances up to `2^(N+1)-4`
but is limited by [maximum back-reference distance](#window_descriptor).
Offset_Value from 1 to 3 are special : they define "repeat codes",
`Offset_Value` from 1 to 3 are special : they define "repeat codes",
which means one of the previous offsets will be repeated.
They are sorted in recency order, with 1 meaning the most recent one.
See [Repeat offsets](#repeat-offsets) paragraph.
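As a simplified sketch of this rule (illustrative only ; `recentOffsets[]` is assumed to hold the previous offsets, most recent first) :
```
if (Offset_Value > 3)
    offset = Offset_Value - 3;                  /* regular offset */
else
    offset = recentOffsets[Offset_Value - 1];   /* repeat code : re-use a recent offset */
```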
__Default distribution__
##### Default distribution for offset codes
When `Compression_Mode` is defined as `Predefined_Mode`,
a predefined distribution is used for FSE compression.
Here is its definition. It uses an accuracy of 5 bits (32 states),
Below is its definition. It uses an accuracy of 5 bits (32 states),
and supports a maximum `N` of 28, allowing offset values up to 536,870,908.
If any sequence in the compressed block requires an offset larger than this,
@ -918,7 +927,7 @@ The bitstream starts by reporting on which scale it operates.
Note that maximum `Accuracy_Log` for literal and match lengths is `9`,
and for offsets is `8`. Higher values are considered errors.
Then follow each symbol value, from `0` to last present one.
Then follows each symbol value, from `0` to last present one.
The number of bits used by each field is variable.
It depends on :
@ -947,11 +956,11 @@ It depends on :
Symbol probabilities are read one by one, in order.
Probability is obtained from Value decoded by following formulae :
Probability is obtained from the decoded value using the following formula :
`Proba = value - 1`
It means value `0` becomes negative probability `-1`.
`-1` is a special probability, which means `less than 1`.
`-1` is a special probability, which means "less than 1".
Its effect on distribution table is described in [next paragraph].
For the purpose of calculating cumulated distribution, it counts as one.
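A minimal sketch of this mapping, `value` being the field just read from the bitstream :
```
int proba = (int)value - 1;                /* value 0 becomes the special probability -1 ("less than 1") */
cumulated += (proba == -1) ? 1 : proba;    /* -1 still counts as one for the cumulated distribution      */
```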
@ -1006,7 +1015,7 @@ typically by a "less than 1" probability symbol.
The result is a list of state values.
Each state will decode the current symbol.
To get the Number of bits and baseline required for next state,
To get the `Number_of_Bits` and `Baseline` required for next state,
it's first necessary to sort all states in their natural order.
The lower states will need 1 more bit than higher ones.
@ -1030,11 +1039,11 @@ Numbering starts from higher states using less bits.
| width | 32 | 32 | 32 | 16 | 16 |
| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 |
| range number | 2 | 4 | 6 | 0 | 1 |
| baseline | 32 | 64 | 96 | 0 | 16 |
| `Baseline` | 32 | 64 | 96 | 0 | 16 |
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
Next state is determined from current state
by reading the required number of bits, and adding the specified baseline.
by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
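Expressed as a sketch (illustrative names ; `table[]` is the decoding table built as described above and `read_bits()` reads that many bits from the bitstream) :
```
symbol = table[state].Symbol;                                              /* the current state decodes one symbol */
state  = table[state].Baseline + read_bits(table[state].Number_of_Bits);  /* then jumps to the next state         */
```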
#### Bitstream
@ -1064,16 +1073,16 @@ Reminder : always keep in mind that all values are read _backward_.
##### Decoding a sequence
A state gives a code.
A code provides a baseline and number of bits to add.
A code provides `Baseline` and `Number_of_Bits` to add.
See [Symbol Decoding] section for details on each symbol.
Decoding starts by reading the number of bits required to decode offset.
It then does the same for match length,
and then for literals length.
Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
It then does the same for `Match_Length`,
and then for `Literals_Length`.
Offset / matchLength / litLength define a sequence.
It starts by inserting the number of literals defined by `litLength`,
then continue by copying `matchLength` bytes from `currentPos - offset`.
`Offset`, `Match_Length`, and `Literals_Length` define a sequence.
It starts by inserting the number of literals defined by `Literals_Length`,
then continues by copying `Match_Length` bytes from `currentPos - Offset`.
The next operation is to update states.
Using rules pre-calculated in the decoding tables,
@ -1085,7 +1094,7 @@ This operation will be repeated `Number_of_Sequences` times.
At the end, the bitstream shall be entirely consumed,
otherwise the bitstream is considered corrupted.
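Putting it together, one iteration of this loop could look roughly like the sketch below (helper functions and variable names are illustrative, not part of the format ; bounds checks omitted) :
```
for (unsigned n = 0; n < Number_of_Sequences; n++) {
    /* 1) decode the sequence : offset first, then match length, then literals length */
    size_t offset         = decode_offset(&offsetState, &bits);
    size_t matchLength    = decode_match_length(&matchLengthState, &bits);
    size_t literalsLength = decode_literals_length(&literalsLengthState, &bits);

    /* 2) execute it : copy the literals, then copy the match from currentPos - offset */
    memcpy(out, literals, literalsLength);
    out += literalsLength;  literals += literalsLength;

    const unsigned char* match = out - offset;
    for (size_t i = 0; i < matchLength; i++)    /* byte-by-byte copy : the match may overlap the output */
        out[i] = match[i];
    out += matchLength;

    /* 3) update the three FSE states using the pre-calculated decoding tables */
    update_states(&literalsLengthState, &matchLengthState, &offsetState, &bits);
}
```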
[Symbol Decoding]:#symbols-decoding
[Symbol Decoding]:#the-codes-for-literals-lengths-match-lengths-and-offsets
##### Repeat offsets
@ -1143,8 +1152,8 @@ _Reserved ranges :_
__`Entropy_Tables`__ : they follow the same format as [compressed blocks].
They are stored in the following order :
Huffman tables for literals, FSE table for offset,
FSE table for matchLenth, and FSE table for litLength.
Huffman tables for literals, FSE table for offsets,
FSE table for match lengths, and FSE table for literals lengths.
It's finally followed by 3 offset values, populating recent offsets,
stored in order, 4-bytes little-endian each, for a total of 12 bytes.
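As an illustration of these last 12 bytes (a sketch ; `read_le32()` is a hypothetical little-endian helper and `p` points just after the entropy tables) :
```
recentOffsets[0] = read_le32(p);       /* first stored offset  */
recentOffsets[1] = read_le32(p + 4);   /* second stored offset */
recentOffsets[2] = read_le32(p + 8);   /* third stored offset  */
```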