Merge pull request #288 from inikep/Other
improved zstd_compression_format.md
commit d6ad8fed80
@@ -60,8 +60,9 @@ explaining which parameter is unsupported.

 Overall conventions
 -----------
-In this document square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
+In this document:
+- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
+- a naming convention for identifiers is `Mixed_Case_With_Underscores`

 Definitions
 -----------

@@ -106,7 +107,7 @@ The structure of a single Zstandard frame is following:

 __`Magic_Number`__

-4 Bytes, Little-endian format.
+4 Bytes, little-endian format.
 Value : 0xFD2FB527

 __`Frame_Header`__

@@ -122,7 +123,7 @@ __`Content_Checksum`__

 An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
 The content checksum is the result
-of [xxh64() hash function](https://www.xxHash.com)
+of [xxh64() hash function](http://www.xxhash.org)
 digesting the original (decoded) data as input, and a seed of zero.
 The low 4 bytes of the checksum are stored in little endian format.
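As an aside to the checksum description above, a minimal C sketch of verifying it with the xxHash library; the helper name is made up, while the `XXH64` call and the zero seed follow the text:
```
#include <stdint.h>
#include <stddef.h>
#include "xxhash.h"   /* xxHash library, provides XXH64() */

/* Compute the 4-byte Content_Checksum over the fully decoded frame content. */
static uint32_t content_checksum(const void* decoded, size_t size)
{
    uint64_t h = XXH64(decoded, size, 0);   /* seed of zero, per the spec text */
    return (uint32_t)h;                     /* low 4 bytes, stored little-endian in the frame */
}
```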

@@ -214,11 +215,11 @@ __`Dictionary_ID_flag`__

 This is a 2-bits flag (`= FHD & 3`),
 telling if a dictionary ID is provided within the header.
-It also specifies the size of this field.
+It also specifies the size of this field as `Field_Size`.

-| Value    |  0  |  1  |  2  |  3  |
-| -------- | --- | --- | --- | --- |
-|Field size|  0  |  1  |  2  |  4  |
+|`Flag_Value`|  0  |  1  |  2  |  3  |
+| ---------- | --- | --- | --- | --- |
+|`Field_Size`|  0  |  1  |  2  |  4  |

 ### `Window_Descriptor`

@@ -230,22 +231,22 @@ The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag
 In this case, the maximum back-reference distance is the content size itself,
 which can be any value from 1 to 2^64-1 bytes (16 EB).

-| Bit numbers | 7-3      | 0-2      |
-| ----------- | -------- | -------- |
-| Field name  | Exponent | Mantissa |
+| Bit numbers | 7-3        | 0-2        |
+| ----------- | ---------- | ---------- |
+| Field name  | `Exponent` | `Mantissa` |

 Maximum distance is given by the following formulae :
 ```
 windowLog = 10 + Exponent;
 windowBase = 1 << windowLog;
 windowAdd = (windowBase / 8) * Mantissa;
-windowSize = windowBase + windowAdd;
+Window_Size = windowBase + windowAdd;
 ```
 The minimum window size is 1 KB.
 The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
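To make the formulae concrete, a small C sketch computing `Window_Size` from the `Window_Descriptor` byte (the function name is illustrative):
```
#include <stdint.h>

/* Compute Window_Size from the Window_Descriptor byte,
   following the Exponent/Mantissa formulae above. */
static uint64_t window_size_from_descriptor(uint8_t descriptor)
{
    uint32_t exponent = descriptor >> 3;          /* bits 7-3 */
    uint32_t mantissa = descriptor & 7;           /* bits 2-0 */
    uint64_t windowLog  = 10 + exponent;
    uint64_t windowBase = 1ULL << windowLog;
    uint64_t windowAdd  = (windowBase / 8) * mantissa;
    return windowBase + windowAdd;                /* 1 KB minimum, 15*(1<<38) maximum */
}
```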

 To properly decode compressed data,
-a decoder will need to allocate a buffer of at least `windowSize` bytes.
+a decoder will need to allocate a buffer of at least `Window_Size` bytes.

 In order to preserve decoder from unreasonable memory requirements,
 a decoder can refuse a compressed frame
@@ -287,7 +288,7 @@ the following ranges are reserved for future use and should not be used :
 This is the original (uncompressed) size. This information is optional.
 The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
 The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
-Format is Little-endian.
+Format is little-endian.

 | `Field_Size` | Range      |
 | ------------ | ---------- |
@@ -309,7 +310,7 @@ The structure of `Data_Block` is following:
 |:------------:|:------------:|:------------:|:---------------:|
 |    1 bit     |    2 bits    |   21 bits    |     n bytes     |

-The block header uses 3-bytes.
+The block header (`Last_Block`, `Block_Type`, and `Block_Size`) uses 3-bytes.
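A hedged C sketch of splitting the 3-byte block header into the three fields named above; the little-endian read and the bit positions (with `Last_Block` in the lowest bit) are assumptions of this sketch, since the excerpt does not spell them out:
```
#include <stdint.h>

/* Split a 3-byte block header into Last_Block, Block_Type and Block_Size. */
static void parse_block_header(const uint8_t header[3],
                               unsigned* lastBlock, unsigned* blockType, uint32_t* blockSize)
{
    uint32_t h = header[0] | ((uint32_t)header[1] << 8) | ((uint32_t)header[2] << 16);
    *lastBlock = h & 1;           /* 1 bit  (assumed lowest bit) */
    *blockType = (h >> 1) & 3;    /* 2 bits */
    *blockSize = h >> 3;          /* 21 bits */
}
```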

 __`Last_Block`__

@@ -379,7 +380,7 @@ Skippable frames defined in this specification are compatible with [LZ4] ones.

 __`Magic_Number`__

-4 Bytes, Little-endian format.
+4 Bytes, little-endian format.
 Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
 All 16 values are valid to identify a skippable frame.

@@ -387,7 +388,7 @@ __`Frame_Size`__

 This is the size, in bytes, of the following `User_Data`
 (without including the magic number nor the size field itself).
-This field is represented using 4 Bytes, Little-endian format, unsigned 32-bits.
+This field is represented using 4 Bytes, little-endian format, unsigned 32-bits.
 This means `User_Data` can’t be bigger than (2^32-1) bytes.
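Putting the two skippable-frame fields together, a small C sketch that computes how many bytes to skip (names and the minimal length check are illustrative):
```
#include <stdint.h>
#include <stddef.h>

/* Returns the total size of a skippable frame starting at `src`
   (magic + size field + User_Data), or 0 if it is not a skippable frame. */
static size_t skippable_frame_size(const uint8_t* src, size_t srcSize)
{
    if (srcSize < 8) return 0;
    uint32_t magic = src[0] | ((uint32_t)src[1] << 8) | ((uint32_t)src[2] << 16) | ((uint32_t)src[3] << 24);
    if ((magic & 0xFFFFFFF0U) != 0x184D2A50U) return 0;    /* 0x184D2A50 .. 0x184D2A5F */
    uint32_t frameSize = src[4] | ((uint32_t)src[5] << 8) | ((uint32_t)src[6] << 16) | ((uint32_t)src[7] << 24);
    return 8 + (size_t)frameSize;                           /* skip header + User_Data */
}
```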

 __`User_Data`__

@@ -408,7 +409,7 @@ A compressed block consists of 2 sections :

 ### Prerequisites
 To decode a compressed block, the following elements are necessary :
-- Previous decoded blocks, up to a distance of `windowSize`,
+- Previous decoded blocks, up to a distance of `Window_Size`,
   or all previous blocks when `Single_Segment_flag` is set.
 - List of "recent offsets" from previous compressed block.
 - Decoding tables of previous compressed block for each symbol type
@@ -467,19 +468,19 @@ __`Size_Format`__
 and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
 - For `Raw_Block` and `RLE_Block` it's enough to decode `Regenerated_Size`.

-For values spanning several bytes, convention is Little-endian.
+For values spanning several bytes, convention is little-endian.

 __`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :

 - Value : x0 : `Regenerated_Size` uses 5 bits (0-31).
-  Total literal header size is 1 byte.
-  `size = h[0]>>3;`
+  `Literals_Section_Header` has 1 byte.
+  `Regenerated_Size = Header[0]>>3`
 - Value : 01 : `Regenerated_Size` uses 12 bits (0-4095).
-  Total literal header size is 2 bytes.
-  `size = (h[0]>>4) + (h[1]<<4);`
+  `Literals_Section_Header` has 2 bytes.
+  `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)`
 - Value : 11 : `Regenerated_Size` uses 20 bits (0-1048575).
-  Total literal header size is 3 bytes.
-  `size = (h[0]>>4) + (h[1]<<4) + (h[2]<<12);`
+  `Literals_Section_Header` has 3 bytes.
+  `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)`

 Note : it's allowed to represent a short value (ex : `13`)
 using a long format, accepting the reduced compacity.
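The three `Regenerated_Size` layouts above fold into a few lines of C; this sketch assumes the literals block type sits in bits 0-1 of `Header[0]` and `Size_Format` starts at bit 2, which is consistent with the quoted formulas but not stated in this excerpt:
```
#include <stdint.h>
#include <stddef.h>

/* Decode Regenerated_Size from a Raw_Literals_Block / RLE_Literals_Block header.
   Assumed layout: bits 0-1 of header[0] = literals block type, bits 2-3 = Size_Format. */
static uint32_t raw_rle_regenerated_size(const uint8_t* header, size_t* headerSize)
{
    uint32_t sizeFormat = (header[0] >> 2) & 3;
    if ((sizeFormat & 1) == 0) {                 /* "x0" : 5 bits */
        *headerSize = 1;
        return header[0] >> 3;
    }
    if (sizeFormat == 1) {                       /* "01" : 12 bits */
        *headerSize = 2;
        return (header[0] >> 4) + ((uint32_t)header[1] << 4);
    }
    *headerSize = 3;                             /* "11" : 20 bits */
    return (header[0] >> 4) + ((uint32_t)header[1] << 4) + ((uint32_t)header[2] << 12);
}
```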

@@ -487,19 +488,19 @@ using a long format, accepting the reduced compacity.

 __`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :

 - Value : 00 : _Single stream_.
-  `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
-  Total literal header size is 3 bytes.
+  Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
+  `Literals_Section_Header` has 3 bytes.
 - Value : 01 : 4 streams.
-  `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
-  Total literal header size is 3 bytes.
+  Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023).
+  `Literals_Section_Header` has 3 bytes.
 - Value : 10 : 4 streams.
-  `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
-  Total literal header size is 4 bytes.
+  Both `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383).
+  `Literals_Section_Header` has 4 bytes.
 - Value : 11 : 4 streams.
-  `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
-  Total literal header size is 5 bytes.
+  Both `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143).
+  `Literals_Section_Header` has 5 bytes.

-`Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
+Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.


 #### `Huffman_Tree_Description`

@@ -526,53 +527,53 @@ This specification limits maximum code length to 11 bits.

 ##### Representation

 All literal values from zero (included) to last present one (excluded)
-are represented by `weight` values, from 0 to `maxBits`.
-Transformation from `weight` to `nbBits` follows this formulae :
-`nbBits = weight ? maxBits + 1 - weight : 0;` .
-The last symbol's weight is deduced from previously decoded ones,
+are represented by `Weight` values, from 0 to `Max_Number_of_Bits`.
+Transformation from `Weight` to `Number_of_Bits` follows this formulae :
+`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` .
+The last symbol's `Weight` is deduced from previously decoded ones,
 by completing to the nearest power of 2.
-This power of 2 gives `maxBits`, the depth of the current tree.
+This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.

 __Example__ :
 Let's presume the following Huffman tree must be described :

-| literal | 0   | 1   | 2   | 3   | 4   | 5   |
-| ------- | --- | --- | --- | --- | --- | --- |
-| nbBits  | 1   | 2   | 3   | 0   | 4   | 4   |
+| literal          | 0   | 1   | 2   | 3   | 4   | 5   |
+| ---------------- | --- | --- | --- | --- | --- | --- |
+| `Number_of_Bits` | 1   | 2   | 3   | 0   | 4   | 4   |

 The tree depth is 4, since its smallest element uses 4 bits.
 Value `5` will not be listed, nor will values above `5`.
-Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
-Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
+Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
+Weight formula is : `Weight = Number_of_Bits ? Max_Number_of_Bits + 1 - Number_of_Bits : 0`.
 It gives the following serie of weights :

-| weights | 4   | 3   | 2   | 0   | 1   |
-| ------- | --- | --- | --- | --- | --- |
-| literal | 0   | 1   | 2   | 3   | 4   |
+| `Weight` | 4   | 3   | 2   | 0   | 1   |
+| -------- | --- | --- | --- | --- | --- |
+| literal  | 0   | 1   | 2   | 3   | 4   |

 The decoder will do the inverse operation :
 having collected weights of literals from `0` to `4`,
 it knows the last literal, `5`, is present with a non-zero weight.
 The weight of `5` can be deducted by joining to the nearest power of 2.
-Sum of 2^(weight-1) (excluding 0) is :
-`8 + 4 + 2 + 0 + 1 = 15`
+Sum of `2^(Weight-1)` (excluding 0) is :
+`8 + 4 + 2 + 0 + 1 = 15`.
 Nearest power of 2 is 16.
-Therefore, `maxBits = 4` and `weight[5] = 1`.
+Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`.
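The deduction in the example can be written out mechanically; a hedged C sketch that recovers `Max_Number_of_Bits` and the last symbol's `Weight` from the previously decoded weights (the helper name is made up):
```
#include <stdint.h>

/* Given weights[0..numWeights-1] (all present symbols except the last one),
   derive the tree depth and the last symbol's weight by completing the
   sum of 2^(Weight-1) to the nearest power of 2. */
static int last_weight(const uint8_t* weights, int numWeights, int* maxNumberOfBits)
{
    uint32_t total = 0;
    for (int i = 0; i < numWeights; i++)
        if (weights[i] != 0) total += 1u << (weights[i] - 1);   /* e.g. 8+4+2+0+1 = 15 */

    uint32_t power = 1;
    int maxBits = 0;
    while (power < total) { power <<= 1; maxBits++; }           /* nearest power of 2, e.g. 16 */

    uint32_t rest = power - total;                               /* e.g. 16 - 15 = 1 = 2^(Weight-1) */
    int w = 1;
    while ((1u << (w - 1)) < rest) w++;                          /* so Weight = 1 */
    *maxNumberOfBits = maxBits;
    return w;
}
```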

 ##### Huffman Tree header

 This is a single byte value (0-255),
 which tells how to decode the list of weights.

-- if headerByte >= 128 : this is a direct representation,
-  where each weight is written directly as a 4 bits field (0-15).
-  The full representation occupies `((nbSymbols+1)/2)` bytes,
-  meaning it uses a last full byte even if nbSymbols is odd.
-  `nbSymbols = headerByte - 127;`.
-  Note that maximum nbSymbols is 255-127 = 128.
+- if `headerByte` >= 128 : this is a direct representation,
+  where each `Weight` is written directly as a 4 bits field (0-15).
+  The full representation occupies `((Number_of_Symbols+1)/2)` bytes,
+  meaning it uses a last full byte even if `Number_of_Symbols` is odd.
+  `Number_of_Symbols = headerByte - 127`.
+  Note that maximum `Number_of_Symbols` is 255-127 = 128.
   A larger serie must necessarily use FSE compression.

-- if headerByte < 128 :
+- if `headerByte` < 128 :
   the serie of weights is compressed by FSE.
   The length of the FSE-compressed serie is `headerByte` (0-127).
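A hedged C sketch of the `headerByte` dispatch just described, expanding the direct 4-bit representation and leaving the FSE-compressed case as a stub; the high-nibble-first order is an assumption of this sketch:
```
#include <stdint.h>
#include <stddef.h>

/* Decode the Huffman tree header byte and, for the direct representation,
   expand the 4-bit weight nibbles. Returns the number of weights read, or -1. */
static int read_huffman_weights(const uint8_t* src, size_t srcSize, uint8_t weights[128])
{
    if (srcSize < 1) return -1;
    uint8_t headerByte = src[0];
    if (headerByte >= 128) {                       /* direct representation */
        int numSymbols = headerByte - 127;         /* at most 128 */
        size_t packed = (size_t)(numSymbols + 1) / 2;
        if (srcSize < 1 + packed) return -1;
        for (int i = 0; i < numSymbols; i++) {
            uint8_t b = src[1 + i / 2];
            weights[i] = (i & 1) ? (b & 0x0F) : (b >> 4);   /* assumed nibble order: high nibble first */
        }
        return numSymbols;
    }
    /* headerByte < 128 : `headerByte` bytes of FSE-compressed weights follow (not shown here) */
    return -1;
}
```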

@@ -605,29 +606,29 @@ When both states have overflowed the bitstream, end is reached.

 ##### Conversion from weights to Huffman prefix codes

-All present symbols shall now have a `weight` value.
-It is possible to transform weights into nbBits, using this formula :
-`nbBits = nbBits ? maxBits + 1 - weight : 0;` .
+All present symbols shall now have a `Weight` value.
+It is possible to transform weights into Number_of_Bits, using this formula :
+`Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0` .

-Symbols are sorted by weight. Within same weight, symbols keep natural order.
-Symbols with a weight of zero are removed.
+Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
+Symbols with a `Weight` of zero are removed.
 Then, starting from lowest weight, prefix codes are distributed in order.

 __Example__ :
 Let's presume the following list of weights has been decoded :

-| Literal | 0   | 1   | 2   | 3   | 4   | 5   |
-| ------- | --- | --- | --- | --- | --- | --- |
-| weight  | 4   | 3   | 2   | 0   | 1   | 1   |
+| Literal  | 0   | 1   | 2   | 3   | 4   | 5   |
+| -------- | --- | --- | --- | --- | --- | --- |
+| `Weight` | 4   | 3   | 2   | 0   | 1   | 1   |

 Sorted by weight and then natural order,
 it gives the following distribution :

-| Literal      | 3   | 4   | 5   | 2   | 1   | 0    |
-| ------------ | --- | --- | --- | --- | --- | ---- |
-| weight       | 0   | 1   | 1   | 2   | 3   | 4    |
-| nb bits      | 0   | 4   | 4   | 3   | 2   | 1    |
-| prefix codes | N/A | 0000| 0001| 001 | 01  | 1    |
+| Literal          | 3   | 4   | 5   | 2   | 1   | 0    |
+| ---------------- | --- | --- | --- | --- | --- | ---- |
+| `Weight`         | 0   | 1   | 1   | 2   | 3   | 4    |
+| `Number_of_Bits` | 0   | 4   | 4   | 3   | 2   | 1    |
+| prefix codes     | N/A | 0000| 0001| 001 | 01  | 1    |
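The sorted table above follows from a simple canonical assignment; a hedged C sketch (it assumes weights were already converted to bit lengths, and walks codes from longest to shortest, in natural order within a length, as the text prescribes):
```
#include <stdint.h>

/* Assign prefix codes from bit lengths (Number_of_Bits per literal).
   codes[i] receives the prefix code of literal i; length-0 symbols get no code. */
static void assign_prefix_codes(const uint8_t* nbBits, int numSymbols, int maxBits, uint16_t* codes)
{
    uint16_t curVal = 0;
    int curLen = maxBits;
    for (int len = maxBits; len >= 1; len--) {            /* longest codes first (lowest weight) */
        for (int s = 0; s < numSymbols; s++) {
            if (nbBits[s] != len) continue;
            if (len < curLen) { curVal >>= (curLen - len); curLen = len; }
            codes[s] = curVal++;                           /* e.g. 0000, 0001, 001, 01, 1 */
        }
    }
}
```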

 #### Literals bitstreams

@@ -638,7 +639,7 @@ As seen in a previous paragraph,
 there are 2 flavors of Huffman-compressed literals :
 single stream, and 4-streams.

-4-streams is useful for CPU with multiple execution units and OoO operations.
+4-streams is useful for CPU with multiple execution units and out-of-order operations.
 Since each stream can be decoded independently,
 it's possible to decode them up to 4x faster than a single stream,
 presuming the CPU has enough parallelism available.
@@ -653,11 +654,11 @@ Regenerated size of each stream can be calculated by `(totalSize+3)/4`,
 except for last one, which can be up to 3 bytes smaller, to reach `totalSize`.

 Compressed size is provided explicitly : in the 4-streams variant,
-bitstreams are preceded by 3 unsigned Little-Endian 16-bits values.
+bitstreams are preceded by 3 unsigned little-endian 16-bits values.
 Each value represents the compressed size of one stream, in order.
 The last stream size is deducted from total compressed size
 and from previously decoded stream sizes :
-`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
+`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
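A small C sketch of slicing the 4-streams payload using the jump table and the `stream4CSize` formula above (bounds handling and names are illustrative):
```
#include <stdint.h>
#include <stddef.h>

/* Split the 4-streams literals payload: a 6-byte jump table of three
   little-endian 16-bit sizes, followed by the four streams back to back. */
static int split_four_streams(const uint8_t* src, size_t totalCSize,
                              const uint8_t* streamStart[4], size_t streamCSize[4])
{
    if (totalCSize < 6) return -1;
    streamCSize[0] = src[0] | ((size_t)src[1] << 8);
    streamCSize[1] = src[2] | ((size_t)src[3] << 8);
    streamCSize[2] = src[4] | ((size_t)src[5] << 8);
    size_t used = 6 + streamCSize[0] + streamCSize[1] + streamCSize[2];
    if (used > totalCSize) return -1;
    streamCSize[3] = totalCSize - used;          /* stream4CSize = totalCSize - 6 - s1 - s2 - s3 */
    streamStart[0] = src + 6;
    streamStart[1] = streamStart[0] + streamCSize[0];
    streamStart[2] = streamStart[1] + streamCSize[1];
    streamStart[3] = streamStart[2] + streamCSize[2];
    return 0;
}
```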

 ##### Bitstreams read and decode

@@ -676,7 +677,7 @@ Starting from the end,
 it's possible to read the bitstream in a little-endian fashion,
 keeping track of already used bits.

-Reading the last `maxBits` bits,
+Reading the last `Max_Number_of_Bits` bits,
 it's then possible to compare extracted value to decoding table,
 determining the symbol to decode and number of bits to discard.

@@ -696,7 +697,7 @@ A match copy command specifies an offset and a length.
 The offset gives the position to copy from,
 which can be within a previous block.

-There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
+There are 3 symbol types, literals lengths, offsets and match lengths,
 which are encoded together, interleaved in a single _bitstream_.

 Each symbol is a _code_ in its own context,
@@ -708,126 +709,125 @@ The Sequences section starts by a header,
 followed by optional Probability tables for each symbol type,
 followed by the bitstream.

-| Header | [LitLengthTable] | [OffsetTable] | [MatchLengthTable] | bitStream |
-| ------ | ---------------- | ------------- | ------------------ | --------- |
+| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
+| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |

 To decode the Sequence section, it's required to know its size.
 This size is deducted from `blockSize - literalSectionSize`.


-#### Sequences section header
+#### `Sequences_Section_Header`

 Consists in 2 items :
-- Nb of Sequences
-- Flags providing Symbol compression types
+- `Number_of_Sequences`
+- Symbol compression modes

-__Nb of Sequences__
+__`Number_of_Sequences`__

-This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
+This is a variable size field using between 1 and 3 bytes.
 Let's call its first byte `byte0`.
 - `if (byte0 == 0)` : there are no sequences.
   The sequence section stops there.
   Regenerated content is defined entirely by literals section.
-- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
-- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
-- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
+- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
+- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
+- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
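The cases above translate directly into C; a minimal sketch of reading `Number_of_Sequences` (the return convention is illustrative):
```
#include <stdint.h>
#include <stddef.h>

/* Decode Number_of_Sequences; stores the value and returns the field size in bytes (0 on error). */
static size_t read_number_of_sequences(const uint8_t* src, size_t srcSize, uint32_t* nbSeq)
{
    if (srcSize < 1) return 0;
    uint8_t byte0 = src[0];
    if (byte0 < 128) { *nbSeq = byte0; return 1; }          /* includes byte0 == 0 : no sequences */
    if (byte0 < 255) {
        if (srcSize < 2) return 0;
        *nbSeq = ((uint32_t)(byte0 - 128) << 8) + src[1];
        return 2;
    }
    if (srcSize < 3) return 0;
    *nbSeq = src[1] + ((uint32_t)src[2] << 8) + 0x7F00;     /* byte0 == 255 */
    return 3;
}
```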

-__Symbol encoding modes__
+__Symbol compression modes__

 This is a single byte, defining the compression mode of each symbol type.

-| BitNb   | 7-6    | 5-4    | 3-2    | 1-0      |
-| ------- | ------ | ------ | ------ | -------- |
-|FieldName| LLType | OFType | MLType | Reserved |
+|Bit number| 7-6                     | 5-4            | 3-2                  | 1-0        |
+| -------- | ----------------------- | -------------- | -------------------- | ---------- |
+|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |

 The last field, `Reserved`, must be all-zeroes.

-`LLType`, `OFType` and `MLType` define the compression mode of
-Literal Lengths, Offsets and Match Lengths respectively.
+`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the compression mode of
+literals lengths, offsets and match lengths respectively.

 They follow the same enumeration :

-| Value            | 0      | 1   | 2          | 3      |
-| ---------------- | ------ | --- | ---------- | ------ |
-| Compression Mode | predef | RLE | Compressed | Repeat |
+| Value              | 0                 | 1          | 2                     | 3             |
+| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
+| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |

-- "predef" : uses a pre-defined distribution table.
-- "RLE" : it's a single code, repeated `nbSeqs` times.
-- "Repeat" : re-use distribution table from previous compressed block.
-- "Compressed" : standard FSE compression.
+- `Predefined_Mode` : uses a predefined distribution table.
+- `RLE_Mode` : it's a single code, repeated `Number_of_Sequences` times.
+- `Repeat_Mode` : re-use distribution table from previous compressed block.
+- `FSE_Compressed_Mode` : standard FSE compression.
   A distribution table will be present.
   It will be described in [next part](#distribution-tables).
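A tiny C sketch of splitting the compression-modes byte according to the bit layout above; the enum mirrors the new identifiers but is otherwise illustrative:
```
#include <stdint.h>

typedef enum { Predefined_Mode = 0, RLE_Mode = 1, FSE_Compressed_Mode = 2, Repeat_Mode = 3 } Compression_Mode;

/* Split the Symbol compression modes byte; returns -1 if the Reserved bits are not zero. */
static int parse_compression_modes(uint8_t modeByte,
                                   Compression_Mode* llMode, Compression_Mode* ofMode, Compression_Mode* mlMode)
{
    if (modeByte & 0x03) return -1;                        /* bits 1-0 are Reserved, must be zero */
    *llMode = (Compression_Mode)(modeByte >> 6);           /* bits 7-6 */
    *ofMode = (Compression_Mode)((modeByte >> 4) & 3);     /* bits 5-4 */
    *mlMode = (Compression_Mode)((modeByte >> 2) & 3);     /* bits 3-2 */
    return 0;
}
```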

 #### Symbols decoding

-##### Literal Lengths codes
+##### Literals Length codes

-Literal lengths codes are values ranging from `0` to `35` included.
+Literals length codes are values ranging from `0` to `35` included.
 They define lengths from 0 to 131071 bytes.

-| Code   | 0-15 |
-| ------ | ---- |
-| length | Code |
-| nbBits | 0    |
+| `Literals_Length_Code` | 0-15                   |
+| ---------------------- | ---------------------- |
+| length                 | `Literals_Length_Code` |
+| `Number_of_Bits`       | 0                      |

-| Code     | 16   | 17   | 18   | 19   | 20   | 21   | 22   | 23   |
-| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 16   | 18   | 20   | 22   | 24   | 28   | 32   | 40   |
-| nb Bits  | 1    | 1    | 1    | 1    | 2    | 2    | 3    | 3    |
+| `Literals_Length_Code` | 16   | 17   | 18   | 19   | 20   | 21   | 22   | 23   |
+| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| Baseline               | 16   | 18   | 20   | 22   | 24   | 28   | 32   | 40   |
+| `Number_of_Bits`       | 1    | 1    | 1    | 1    | 2    | 2    | 3    | 3    |

-| Code     | 24   | 25   | 26   | 27   | 28   | 29   | 30   | 31   |
-| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 48   | 64   | 128  | 256  | 512  | 1024 | 2048 | 4096 |
-| nb Bits  | 4    | 6    | 7    | 8    | 9    | 10   | 11   | 12   |
+| `Literals_Length_Code` | 24   | 25   | 26   | 27   | 28   | 29   | 30   | 31   |
+| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| Baseline               | 48   | 64   | 128  | 256  | 512  | 1024 | 2048 | 4096 |
+| `Number_of_Bits`       | 4    | 6    | 7    | 8    | 9    | 10   | 11   | 12   |

-| Code     | 32   | 33   | 34   | 35   |
-| -------- | ---- | ---- | ---- | ---- |
-| Baseline | 8192 |16384 |32768 |65536 |
-| nb Bits  | 13   | 14   | 15   | 16   |
+| `Literals_Length_Code` | 32   | 33   | 34   | 35   |
+| ---------------------- | ---- | ---- | ---- | ---- |
+| Baseline               | 8192 |16384 |32768 |65536 |
+| `Number_of_Bits`       | 13   | 14   | 15   | 16   |
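Decoding a literals length from its code is a baseline-plus-extra-bits lookup over the tables above; a hedged C sketch (the bit reader is assumed to exist, and match length codes decode the same way with their own tables):
```
#include <stdint.h>

/* Baselines and extra-bit counts for Literals_Length_Code 0-35, copied from the tables above. */
static const uint32_t LL_baseline[36] = {
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    16, 18, 20, 22, 24, 28, 32, 40, 48, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536 };
static const uint8_t  LL_bits[36] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 2, 2, 3, 3, 4, 6, 7, 8, 9, 10, 11, 12,
    13, 14, 15, 16 };

extern uint32_t readBits(int nbBits);   /* assumed bitstream reader (reads backward) */

static uint32_t decode_literals_length(uint32_t llCode)
{
    return LL_baseline[llCode] + readBits(LL_bits[llCode]);
}
```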

 __Default distribution__

-When "compression mode" is "predef"",
-a pre-defined distribution is used for FSE compression.
+When `Compression_Mode` is `Predefined_Mode`,
+a predefined distribution is used for FSE compression.

 Below is its definition. It uses an accuracy of 6 bits (64 states).
 ```
-short literalLengths_defaultDistribution[36] =
+short literalsLength_defaultDistribution[36] =
         { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
          -1,-1,-1,-1 };
 ```

-##### Match Lengths codes
+##### Match Length codes

-Match lengths codes are values ranging from `0` to `52` included.
+Match length codes are values ranging from `0` to `52` included.
 They define lengths from 3 to 131074 bytes.

-| Code   | 0-31     |
-| ------ | -------- |
-| value  | Code + 3 |
-| nbBits | 0        |
+| `Match_Length_Code` | 0-31                    |
+| ------------------- | ----------------------- |
+| value               | `Match_Length_Code` + 3 |
+| `Number_of_Bits`    | 0                       |

-| Code     | 32   | 33   | 34   | 35   | 36   | 37   | 38   | 39   |
-| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 35   | 37   | 39   | 41   | 43   | 47   | 51   | 59   |
-| nb Bits  | 1    | 1    | 1    | 1    | 2    | 2    | 3    | 3    |
+| `Match_Length_Code` | 32   | 33   | 34   | 35   | 36   | 37   | 38   | 39   |
+| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| Baseline            | 35   | 37   | 39   | 41   | 43   | 47   | 51   | 59   |
+| `Number_of_Bits`    | 1    | 1    | 1    | 1    | 2    | 2    | 3    | 3    |

-| Code     | 40   | 41   | 42   | 43   | 44   | 45   | 46   | 47   |
-| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 67   | 83   | 99   | 131  | 258  | 514  | 1026 | 2050 |
-| nb Bits  | 4    | 4    | 5    | 7    | 8    | 9    | 10   | 11   |
+| `Match_Length_Code` | 40   | 41   | 42   | 43   | 44   | 45   | 46   | 47   |
+| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| Baseline            | 67   | 83   | 99   | 131  | 258  | 514  | 1026 | 2050 |
+| `Number_of_Bits`    | 4    | 4    | 5    | 7    | 8    | 9    | 10   | 11   |

-| Code     | 48   | 49   | 50   | 51   | 52   |
-| -------- | ---- | ---- | ---- | ---- | ---- |
-| Baseline | 4098 | 8194 |16486 |32770 |65538 |
-| nb Bits  | 12   | 13   | 14   | 15   | 16   |
+| `Match_Length_Code` | 48   | 49   | 50   | 51   | 52   |
+| ------------------- | ---- | ---- | ---- | ---- | ---- |
+| Baseline            | 4098 | 8194 |16486 |32770 |65538 |
+| `Number_of_Bits`    | 12   | 13   | 14   | 15   | 16   |

 __Default distribution__

-When "compression mode" is defined as "predef",
-a pre-defined distribution is used for FSE compression.
+When `Compression_Mode` is defined as `Predefined_Mode`,
+a predefined distribution is used for FSE compression.

 Here is its definition. It uses an accuracy of 6 bits (64 states).
 ```
@@ -840,31 +840,32 @@ short matchLengths_defaultDistribution[53] =

 ##### Offset codes

-Offset codes are values ranging from `0` to `N`,
-with `N` being limited by maximum backreference distance.
+Offset codes are values ranging from `0` to `N`.

 A decoder is free to limit its maximum `N` supported.
 Recommendation is to support at least up to `22`.
 For information, at the time of this writing.
 the reference decoder supports a maximum `N` value of `28` in 64-bits mode.

-An offset code is also the nb of additional bits to read,
-and can be translated into an `OFValue` using the following formulae :
+An offset code is also the number of additional bits to read,
+and can be translated into an `Offset_Value` using the following formulae :

 ```
-OFValue = (1 << offsetCode) + readNBits(offsetCode);
-if (OFValue > 3) offset = OFValue - 3;
+Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
+if (Offset_Value > 3) offset = Offset_Value - 3;
 ```
+It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to 2^(N+1))-4
+but is limited by [maximum back-reference distance](#window_descriptor).

-OFValue from 1 to 3 are special : they define "repeat codes",
+Offset_Value from 1 to 3 are special : they define "repeat codes",
 which means one of the previous offsets will be repeated.
 They are sorted in recency order, with 1 meaning the most recent one.
 See [Repeat offsets](#repeat-offsets) paragraph.
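A hedged C sketch of the `Offset_Value` computation quoted above, separating real offsets from repeat codes (`readNBits` is assumed to be provided by the sequence bitstream reader):
```
#include <stdint.h>

extern uint32_t readNBits(int nbBits);   /* assumed backward bitstream reader */

/* Turn an offset code into either an actual offset or a repeat-code index (1-3). */
static void decode_offset(uint32_t offsetCode, uint32_t* offset, uint32_t* repeatCode)
{
    uint32_t offsetValue = (1u << offsetCode) + readNBits((int)offsetCode);
    if (offsetValue > 3) { *offset = offsetValue - 3; *repeatCode = 0; }
    else                 { *offset = 0; *repeatCode = offsetValue; }   /* 1-3 : repeat a recent offset */
}
```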

 __Default distribution__

-When "compression mode" is defined as "predef",
-a pre-defined distribution is used for FSE compression.
+When `Compression_Mode` is defined as `Predefined_Mode`,
+a predefined distribution is used for FSE compression.

 Here is its definition. It uses an accuracy of 5 bits (32 states),
 and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
@@ -882,38 +883,38 @@ short offsetCodes_defaultDistribution[53] =

 Following the header, up to 3 distribution tables can be described.
 When present, they are in this order :
-- Literal lengthes
+- Literals lengths
 - Offsets
-- Match Lengthes
+- Match Lengths

 The content to decode depends on their respective encoding mode :
-- Predef : no content. Use pre-defined distribution table.
-- RLE : 1 byte. This is the only code to use across the whole compressed block.
-- FSE : A distribution table is present.
-- Repeat mode : no content. Re-use distribution from previous compressed block.
+- `Predefined_Mode` : no content. Use predefined distribution table.
+- `RLE_Mode` : 1 byte. This is the only code to use across the whole compressed block.
+- `FSE_Compressed_Mode` : A distribution table is present.
+- `Repeat_Mode` : no content. Re-use distribution from previous compressed block.

 ##### FSE distribution table : condensed format

 An FSE distribution table describes the probabilities of all symbols
 from `0` to the last present one (included)
-on a normalized scale of `1 << AccuracyLog` .
+on a normalized scale of `1 << Accuracy_Log` .

 It's a bitstream which is read forward, in little-endian fashion.
 It's not necessary to know its exact size,
 since it will be discovered and reported by the decoding process.

 The bitstream starts by reporting on which scale it operates.
-`AccuracyLog = low4bits + 5;`
-Note that maximum `AccuracyLog` for literal and match lengthes is `9`,
+`Accuracy_Log = low4bits + 5`.
+Note that maximum `Accuracy_Log` for literal and match length is `9`,
 and for offsets it is `8`. Higher values are considered errors.

 Then follow each symbol value, from `0` to last present one.
-The nb of bits used by each field is variable.
+The number of bits used by each field is variable.
 It depends on :

 - Remaining probabilities + 1 :
   __example__ :
-  Presuming an AccuracyLog of 8,
+  Presuming an `Accuracy_Log` of 8,
   and presuming 100 probabilities points have already been distributed,
   the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
   Therefore, it must read `log2sup(156) == 8` bits.
@@ -927,17 +928,17 @@ It depends on :
   values from 99 to 156 use 8 bits.
   This is achieved through this scheme :

-  | Value read | Value decoded | nb Bits used |
-  | ---------- | ------------- | ------------ |
-  | 0 - 98     | 0 - 98        | 7            |
-  | 99 - 127   | 99 - 127      | 8            |
-  | 128 - 226  | 0 - 98        | 7            |
-  | 227 - 255  | 128 - 156     | 8            |
+  | Value read | Value decoded | Number of bits used |
+  | ---------- | ------------- | ------------------- |
+  | 0 - 98     | 0 - 98        | 7                   |
+  | 99 - 127   | 99 - 127      | 8                   |
+  | 128 - 226  | 0 - 98        | 7                   |
+  | 227 - 255  | 128 - 156     | 8                   |

 Symbols probabilities are read one by one, in order.

 Probability is obtained from Value decoded by following formulae :
-`Proba = value - 1;`
+`Proba = value - 1`

 It means value `0` becomes negative probability `-1`.
 `-1` is a special probability, which means `less than 1`.

@@ -952,9 +953,9 @@ This repeat flag tells how many probabilities of zeroes follow the current one.
 It provides a number ranging from 0 to 3.
 If it is a 3, another 2-bits repeat flag follows, and so on.

-When last symbol reaches cumulated total of `1 << AccuracyLog`,
+When last symbol reaches cumulated total of `1 << Accuracy_Log`,
 decoding is complete.
-If the last symbol makes cumulated total go above `1 << AccuracyLog`,
+If the last symbol makes cumulated total go above `1 << Accuracy_Log`,
 distribution is considered corrupted.
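To make the variable-size probability field concrete, a hedged C sketch of reading one probability given the remaining points to distribute; the bitstream primitives are assumed, and `log2sup` follows the text's usage (the number of bits needed to represent the value):
```
#include <stdint.h>

extern uint32_t peekBits(int nbBits);    /* assumed: look at the next bits, little-endian, without consuming */
extern void     consumeBits(int nbBits); /* assumed: advance the forward-read bitstream */

static int log2sup(uint32_t v) { int n = 0; while ((1u << n) <= v) n++; return n; }

/* Read one symbol's probability. `remainingPlus1` is "Remaining probabilities + 1"
   (156 in the example above). Returns Proba, possibly -1 ("less than 1"). */
static int read_probability(uint32_t remainingPlus1)
{
    int fullBits    = log2sup(remainingPlus1);                 /* 8 in the example  */
    uint32_t small  = 1u << (fullBits - 1);                    /* 128               */
    uint32_t cutoff = (1u << fullBits) - 1 - remainingPlus1;   /* 255 - 156 = 99    */

    uint32_t v = peekBits(fullBits);
    uint32_t value;
    if ((v & (small - 1)) < cutoff) { value = v & (small - 1); consumeBits(fullBits - 1); }
    else                            { value = (v < small) ? v : v - cutoff; consumeBits(fullBits); }
    return (int)value - 1;                                     /* Proba = value - 1 */
}
```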

 Then the decoder can tell how many bytes were used in this process,
@@ -969,14 +970,14 @@ to create a unique decoding table.

 It follows the following build rule :

-The table has a size of `tableSize = 1 << AccuracyLog;`.
+The table has a size of `tableSize = 1 << Accuracy_Log`.
 Each cell describes the symbol decoded,
 and instructions to get the next state.

 Symbols are scanned in their natural order for `less than 1` probabilities.
 Symbols with this probability are being attributed a single cell,
 starting from the end of the table.
-These symbols define a full state reset, reading `AccuracyLog` bits.
+These symbols define a full state reset, reading `Accuracy_Log` bits.

 All remaining symbols are sorted in their natural order.
 Starting from symbol `0` and table position `0`,
@@ -1005,7 +1006,7 @@ It receives 5 state values. States are sorted in natural order.

 Next power of 2 is 8.
 Space of probabilities is divided into 8 equal parts.
-Presuming the AccuracyLog is 7, it defines 128 states.
+Presuming the `Accuracy_Log` is 7, it defines 128 states.
 Divided by 8, each share is 16 large.

 In order to reach 8, 8-5=3 lowest states will count "double",
@@ -1014,13 +1015,13 @@ requiring one more bit in the process.

 Numbering starts from higher states using less bits.

-| state order | 0     | 1     | 2      | 3    | 4     |
-| ----------- | ----- | ----- | ------ | ---- | ----- |
-| width       | 32    | 32    | 32     | 16   | 16    |
-| nb Bits     | 5     | 5     | 5      | 4    | 4     |
-| range nb    | 2     | 4     | 6      | 0    | 1     |
-| baseline    | 32    | 64    | 96     | 0    | 16    |
-| range       | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
+| state order      | 0     | 1     | 2      | 3    | 4     |
+| ---------------- | ----- | ----- | ------ | ---- | ----- |
+| width            | 32    | 32    | 32     | 16   | 16    |
+| `Number_of_Bits` | 5     | 5     | 5      | 4    | 4     |
+| range number     | 2     | 4     | 6      | 0    | 1     |
+| baseline         | 32    | 64    | 96     | 0    | 16    |
+| range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |

 Next state is determined from current state
 by reading the required number of bits, and adding the specified baseline.
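The "count double" rule generalizes beyond the example; a hedged C sketch that, for one symbol of probability `p`, lists `Number_of_Bits` and baseline for each of its states in the same order as the table above:
```
/* For a symbol with probability p in a table of size 1 << accuracyLog,
   fill nbBits[] and baseline[] for its p states, in state order. */
static void symbol_state_rules(int p, int accuracyLog, int nbBits[], int baseline[])
{
    int nextPow2 = 1, logPow2 = 0;
    while (nextPow2 < p) { nextPow2 <<= 1; logPow2++; }      /* 8 for p == 5 */
    int smallBits = accuracyLog - logPow2;                    /* 4 : states of width 16 */
    int numDouble = nextPow2 - p;                             /* 3 lowest states count double */

    /* baselines are numbered starting from the higher states, which use fewer bits */
    int base = 0;
    for (int s = numDouble; s < p; s++) {                     /* states 3,4 : 4 bits, baselines 0,16 */
        nbBits[s] = smallBits;
        baseline[s] = base;
        base += 1 << smallBits;
    }
    for (int s = 0; s < numDouble; s++) {                     /* states 0,1,2 : 5 bits, baselines 32,64,96 */
        nbBits[s] = smallBits + 1;
        baseline[s] = base;
        base += 1 << (smallBits + 1);
    }
}
```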

@@ -1044,9 +1045,9 @@ The bitstream starts with initial state values,
 each using the required number of bits in their respective _accuracy_,
 decoded previously from their normalized distribution.

-It starts by `Literal Length State`,
-followed by `Offset State`,
-and finally `Match Length State`.
+It starts by `Literals_Length_State`,
+followed by `Offset_State`,
+and finally `Match_Length_State`.

 Reminder : always keep in mind that all values are read _backward_.

@@ -1056,9 +1057,9 @@ A state gives a code.
 A code provides a baseline and number of bits to add.
 See [Symbol Decoding] section for details on each symbol.

-Decoding starts by reading the nb of bits required to decode offset.
+Decoding starts by reading the number of bits required to decode offset.
 It then does the same for match length,
-and then for literal length.
+and then for literals length.

 Offset / matchLength / litLength define a sequence.
 It starts by inserting the number of literals defined by `litLength`,
@@ -1066,11 +1067,11 @@ then continue by copying `matchLength` bytes from `currentPos - offset`.

 The next operation is to update states.
 Using rules pre-calculated in the decoding tables,
-`Literal Length State` is updated,
-followed by `Match Length State`,
-and then `Offset State`.
+`Literals_Length_State` is updated,
+followed by `Match_Length_State`,
+and then `Offset_State`.

-This operation will be repeated `NbSeqs` times.
+This operation will be repeated `Number_of_Sequences` times.
 At the end, the bitstream shall be entirely consumed,
 otherwise bitstream is considered corrupted.
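A compact C sketch of the decode-and-execute loop just described; every helper is assumed to exist elsewhere, and only the ordering of reads, copies and state updates is the point here:
```
#include <stdint.h>
#include <string.h>

/* Assumed helpers: each returns a decoded value and advances the backward bitstream. */
extern uint32_t decode_offset_code(void);
extern uint32_t decode_match_length(void);
extern uint32_t decode_literals_length(void);
extern void     update_ll_state(void), update_ml_state(void), update_of_state(void);

/* Replay `nbSeq` sequences into `out`, consuming literals from `*lit`. */
static uint8_t* execute_sequences(uint8_t* out, const uint8_t** lit, uint32_t nbSeq)
{
    for (uint32_t i = 0; i < nbSeq; i++) {
        uint32_t offset   = decode_offset_code();       /* offset bits are read first */
        uint32_t matchLen = decode_match_length();      /* then match length          */
        uint32_t litLen   = decode_literals_length();   /* then literals length       */

        memcpy(out, *lit, litLen);  out += litLen;  *lit += litLen;
        const uint8_t* match = out - offset;             /* may point into a previous block */
        for (uint32_t n = 0; n < matchLen; n++) out[n] = match[n];   /* byte-by-byte: copies may overlap */
        out += matchLen;

        update_ll_state();                                /* states update: LL, then ML, then OF */
        update_ml_state();
        update_of_state();
    }
    return out;
}
```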

@@ -1081,7 +1082,7 @@ otherwise bitstream is considered corrupted.
 As seen in [Offset Codes], the first 3 values define a repeated offset.
 They are sorted in recency order, with 1 meaning "most recent one".

-There is an exception though, when current sequence's literal length is `0`.
+There is an exception though, when current sequence's literals length is `0`.
 In which case, repcodes are "pushed by one",
 so 1 becomes 2, 2 becomes 3,
 and 3 becomes "offset_1 - 1_byte".
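A hedged C sketch of resolving a repeat code against the three most recent offsets, including the literals-length-0 shift described above (the array layout is illustrative):
```
#include <stdint.h>

/* rep[0] is the most recent offset (repeat code 1), rep[1] code 2, rep[2] code 3. */
static uint32_t resolve_repeat_offset(uint32_t repeatCode, uint32_t litLength, uint32_t rep[3])
{
    uint32_t index = repeatCode - 1;
    if (litLength == 0) index += 1;                 /* codes are "pushed by one" */

    uint32_t offset;
    if (index == 3) offset = rep[0] - 1;            /* "offset_1 - 1_byte" */
    else            offset = rep[index];

    if (index != 0) {                               /* move the chosen offset to the front */
        rep[2] = (index >= 2) ? rep[1] : rep[2];
        rep[1] = rep[0];
        rep[0] = offset;
    }
    return offset;
}
```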

@@ -1112,31 +1113,32 @@ But dictionaries created by `zstd --train` follow a format, described here.
 __Pre-requisites__ : a dictionary has a known length,
 defined either by a buffer limit, or a file size.

-| Header | DictID | Stats | Content |
-| ------ | ------ | ----- | ------- |
+| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
+| -------------- | --------------- | ---------------- | --------- |

-__Header__ : 4 bytes ID, value 0xEC30A437, Little-Endian format
+__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, little-endian format

-__Dict_ID__ : 4 bytes, stored in Little-Endian format.
-DictID can be any value, except 0 (which means no DictID).
+__`Dictionary_ID`__ : 4 bytes, stored in little-endian format.
+`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
 It's used by decoders to check if they use the correct dictionary.
-_Reserved ranges :_
+
+_Reserved ranges :_
 If the frame is going to be distributed in a private environment,
-any dictionary ID can be used.
+any `Dictionary_ID` can be used.
 However, for public distribution of compressed frames,
-some ranges are reserved for future use :
+the following ranges are reserved for future use and should not be used :

-- low range : 1 - 32767 : reserved
-- high range : >= (2^31) : reserved
+- low range : 1 - 32767
+- high range : >= (2^31)

-__Stats__ : Entropy tables, following the same format as a [compressed blocks].
+__`Entropy_Tables`__ : following the same format as a [compressed blocks].
 They are stored in following order :
 Huffman tables for literals, FSE table for offset,
 FSE table for matchLenth, and FSE table for litLength.
 It's finally followed by 3 offset values, populating recent offsets,
 stored in order, 4-bytes little-endian each, for a total of 12 bytes.

-__Content__ : Where the actual dictionary content is.
+__`Content`__ : Where the actual dictionary content is.
 Content size depends on Dictionary size.
[compressed blocks]: #the-format-of-compressed_block
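A brief C sketch of checking the dictionary `Magic_Number` and reading `Dictionary_ID` as described above; a real loader would continue with the entropy tables, which are out of scope here:
```
#include <stdint.h>
#include <stddef.h>

/* Check the dictionary Magic_Number and extract Dictionary_ID (returns 0 if not this format). */
static uint32_t read_dictionary_id(const uint8_t* dict, size_t dictSize)
{
    if (dictSize < 8) return 0;
    uint32_t magic = dict[0] | ((uint32_t)dict[1] << 8) | ((uint32_t)dict[2] << 16) | ((uint32_t)dict[3] << 24);
    if (magic != 0xEC30A437U) return 0;                     /* not a zstd --train dictionary */
    return dict[4] | ((uint32_t)dict[5] << 8) | ((uint32_t)dict[6] << 16) | ((uint32_t)dict[7] << 24);
}
```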