Clarifications of Zstandard format specification

from IETF RFC review
2018-04-30 11:35:49 -07:00
commit 82ad249645
parent a11f0c67bb
1 changed file with 88 additions and 57 deletions
--- a/doc/zstd_compression_format.md
+++ b/doc/zstd_compression_format.md
@ -16,7 +16,7 @@ Distribution of this document is unlimited.

 ### Version

-0.2.6 (19/08/17)
+0.2.7 (30/04/18)


 Introduction
@ -112,6 +112,11 @@ __`Magic_Number`__

 4 Bytes, __little-endian__ format.
 Value : 0xFD2FB528
+Note: This value was selected to be unlikely to appear at the beginning of an arbitrary file.
+It avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.),
+contains byte values outside of the ASCII range,
+and doesn't map into UTF-8 space.
+It reduces the chances that a text file represents this value by accident.
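
As an illustration of the note above (not part of the commit), the following Python sketch serializes the magic number in little-endian order and checks two of the listed properties. The helper names `magic_bytes` and `is_valid_utf8` are hypothetical.

```python
import struct

MAGIC_NUMBER = 0xFD2FB528  # value given in the format description


def magic_bytes() -> bytes:
    """Return the 4 magic bytes in the order they appear in a frame (little-endian)."""
    return struct.pack("<I", MAGIC_NUMBER)


def is_valid_utf8(data: bytes) -> bool:
    """Report whether `data` decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = magic_bytes()
print(raw.hex())                    # 28b52ffd
print(any(b > 0x7F for b in raw))   # True: some bytes are outside the ASCII range
print(is_valid_utf8(raw))           # False: 0xB5 is a stray continuation byte
```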

 __`Frame_Header`__

@ -171,8 +176,8 @@ according to the following table:
 |`FCS_Field_Size`| 0 or 1 |  2  |  4  |  8  |

 When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
-if `Single_Segment_flag` is set, `Field_Size` is 1.
-Otherwise, `Field_Size` is 0 : `Frame_Content_Size` is not provided.
+if `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
+Otherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided.

 __`Single_Segment_flag`__

@ -218,11 +223,11 @@ __`Dictionary_ID_flag`__

 This is a 2-bits flag (`= FHD & 3`),
 telling if a dictionary ID is provided within the header.
-It also specifies the size of this field as `Field_Size`.
+It also specifies the size of this field as `DID_Field_Size`.

-|`Flag_Value`|  0  |  1  |  2  |  3  |
-| ---------- | --- | --- | --- | --- |
-|`Field_Size`|  0  |  1  |  2  |  4  |
+|`Flag_Value`    |  0  |  1  |  2  |  3  |
+| -------------- | --- | --- | --- | --- |
+|`DID_Field_Size`|  0  |  1  |  2  |  4  |

 #### `Window_Descriptor`

@ -270,7 +275,8 @@ the ID of the dictionary required to properly decode the frame.
 `Dictionary_ID` field is optional. When it's not present,
 it's up to the decoder to make sure it uses the correct dictionary.

-Field size depends on `Dictionary_ID_flag`.
+`Dictionary_ID` field size is provided by `DID_Field_Size`.
+`DID_Field_Size` is directly derived from the value of `Dictionary_ID_flag`.
 1 byte can represent an ID 0-255.
 2 bytes can represent an ID 0-65535.
 4 bytes can represent an ID 0-4294967295.
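
The relationship between `Dictionary_ID_flag`, `DID_Field_Size`, and the field itself could be sketched as follows. This is a minimal illustration, not the reference decoder; the function names are hypothetical, `fhd` stands for the `Frame_Header_Descriptor` byte, and the field is assumed to follow the format's little-endian convention.

```python
DID_FIELD_SIZES = (0, 1, 2, 4)  # indexed by Dictionary_ID_flag (Flag_Value 0..3)


def did_field_size(fhd: int) -> int:
    """Derive DID_Field_Size from the two low bits of the descriptor byte."""
    dictionary_id_flag = fhd & 3
    return DID_FIELD_SIZES[dictionary_id_flag]


def read_dictionary_id(fhd: int, field: bytes):
    """Return the Dictionary_ID, or None when the field is absent."""
    size = did_field_size(fhd)
    if size == 0:
        return None  # the decoder must know which dictionary to use
    return int.from_bytes(field[:size], "little")


print(did_field_size(0b11))                             # 4
print(read_dictionary_id(0b10, bytes([0x34, 0x12])))    # 4660 (0x1234)
```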
@ -363,16 +369,14 @@ There are 4 block types :
 __`Block_Size`__

 The upper 21 bits of `Block_Header` represent the `Block_Size`.
+`Block_Size` is the size of the block excluding the header.
+A block can contain any number of bytes (even zero), up to
+`Block_Maximum_Decompressed_Size`, which is the smaller of:
+- `Window_Size`
+- 128 KB

-Block sizes must respect a few rules :
- For `Compressed_Block`, `Block_Size` is always strictly less than decompressed size.
- Block decompressed size is always <= `Window_Size`
- Block decompressed size is always <= 128 KB.
-
-A block can contain any number of bytes (even empty),
-up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
- `Window_Size`
- 128 KB
+A `Compressed_Block` has the extra restriction that `Block_Size` is always
+strictly less than the decompressed size.


 Compressed Blocks
@ -392,8 +396,14 @@ To decode a compressed block, the following elements are necessary :
 - Previous decoded data, up to a distance of `Window_Size`,
  or all previously decoded data when `Single_Segment_flag` is set.
 - List of "recent offsets" from previous `Compressed_Block`.
- Decoding tables of previous `Compressed_Block` for each symbol type
-  (literals, literals lengths, match lengths, offsets).
+- The previous Huffman tree, required by `Treeless_Literals_Block` type
+- Previous FSE decoding tables, required by `Repeat_Mode`
+  for each symbol type (literals lengths, match lengths, offsets)
+
+Note that decoding tables aren't always from the previous `Compressed_Block`.
+
+- Every decoding table can come from a dictionary.
+- The Huffman tree comes from the previous `Compressed_Literals_Block`.

 Literals Section
 ----------------
@ -460,17 +470,20 @@ For values spanning several bytes, convention is __little-endian__.

 __`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :

- Value ?0 : `Size_Format` uses 1 bit.
+`Size_Format` uses 1 _or_ 2 bits.
+Its value is : `Size_Format = (Header[0]>>2) & 3`
+
+- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit.
               `Regenerated_Size` uses 5 bits (0-31).
-               `Literals_Section_Header` has 1 byte.
+               `Literals_Section_Header` uses 1 byte.
               `Regenerated_Size = Header[0]>>3`
- Value 01 : `Size_Format` uses 2 bits.
+- `Size_Format` == 01 : `Size_Format` uses 2 bits.
               `Regenerated_Size` uses 12 bits (0-4095).
-               `Literals_Section_Header` has 2 bytes.
+               `Literals_Section_Header` uses 2 bytes.
               `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)`
- Value 11 : `Size_Format` uses 2 bits.
+- `Size_Format` == 11 : `Size_Format` uses 2 bits.
               `Regenerated_Size` uses 20 bits (0-1048575).
-               `Literals_Section_Header` has 3 bytes.
+               `Literals_Section_Header` uses 3 bytes.
               `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)`

 Only Stream1 is present for these cases.
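
The three header layouts above can be exercised with a short Python sketch. It applies the formulas from the list directly; the function name and the `(header_bytes, regenerated_size)` return shape are illustrative choices, not part of the specification.

```python
def parse_raw_rle_literals_header(header: bytes):
    """Return (header size in bytes, Regenerated_Size) for a Raw or RLE literals block."""
    size_format = (header[0] >> 2) & 3
    if size_format in (0b00, 0b10):
        # Only the low bit of Size_Format is meaningful: 1-bit format, 5-bit size.
        return 1, header[0] >> 3
    if size_format == 0b01:
        # 2-bit format, 12-bit size spread over 2 bytes.
        return 2, (header[0] >> 4) + (header[1] << 4)
    # size_format == 0b11: 2-bit format, 20-bit size spread over 3 bytes.
    return 3, (header[0] >> 4) + (header[1] << 4) + (header[2] << 12)


print(parse_raw_rle_literals_header(bytes([0xF8])))              # (1, 31)
print(parse_raw_rle_literals_header(bytes([0xF4, 0xFF])))        # (2, 4095)
print(parse_raw_rle_literals_header(bytes([0xFC, 0xFF, 0xFF])))  # (3, 1048575)
```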
@ -479,18 +492,20 @@ using a long format, even if it's less efficient.

 __`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :

- Value 00 : _A single stream_.
+`Size_Format` always uses 2 bits.
+
+- `Size_Format` == 00 : _A single stream_.
               Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
-               `Literals_Section_Header` has 3 bytes.
- Value 01 : 4 streams.
+               `Literals_Section_Header` uses 3 bytes.
+- `Size_Format` == 01 : 4 streams.
               Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
-               `Literals_Section_Header` has 3 bytes.
- Value 10 : 4 streams.
+               `Literals_Section_Header` uses 3 bytes.
+- `Size_Format` == 10 : 4 streams.
               Both `Regenerated_Size` and `Compressed_Size` use 14 bits (0-16383).
-               `Literals_Section_Header` has 4 bytes.
- Value 11 : 4 streams.
+               `Literals_Section_Header` uses 4 bytes.
+- `Size_Format` == 11 : 4 streams.
               Both `Regenerated_Size` and `Compressed_Size` use 18 bits (0-262143).
-               `Literals_Section_Header` has 5 bytes.
+               `Literals_Section_Header` uses 5 bytes.

 Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
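
A possible way to read these headers is sketched below. It assumes the two size fields are packed back to back, in little-endian bit order, immediately after the 4 low bits holding the block type and `Size_Format`; that layout is consistent with the header sizes listed (4 + 2×10 = 24 bits, 4 + 2×14 = 32 bits, 4 + 2×18 = 40 bits) but is stated here as an assumption. The names are hypothetical.

```python
# Size_Format -> (number of streams, bits per size field, header size in bytes)
_LAYOUT = {0: (1, 10, 3), 1: (4, 10, 3), 2: (4, 14, 4), 3: (4, 18, 5)}


def parse_compressed_literals_header(header: bytes):
    """Return (streams, Regenerated_Size, Compressed_Size, header size in bytes)."""
    size_format = (header[0] >> 2) & 3
    streams, bits, nbytes = _LAYOUT[size_format]
    value = int.from_bytes(header[:nbytes], "little")
    mask = (1 << bits) - 1
    regenerated_size = (value >> 4) & mask          # first field after the 4 low bits
    compressed_size = (value >> (4 + bits)) & mask  # second field follows directly
    return streams, regenerated_size, compressed_size, nbytes


# Build a 4-byte header by hand: block type 2, Size_Format 2, sizes 1000 and 500.
packed = 2 | (2 << 2) | (1000 << 4) | (500 << 18)
print(parse_compressed_literals_header(packed.to_bytes(4, "little")))  # (4, 1000, 500, 4)
```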
 Note: `Compressed_Size` __includes__ the size of the Huffman Tree description
@ -516,7 +531,8 @@ it must be used to determine where streams begin.
 `Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.

 For `Treeless_Literals_Block`,
-the Huffman table comes from previously compressed literals block.
+the Huffman table comes from previously compressed literals block,
+or from a dictionary.

 Huffman compressed data consists of either 1 or 4 Huffman-coded streams.

@ -570,7 +586,8 @@ followed by the bitstream.
 | -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |

 To decode the `Sequences_Section`, it's required to know its size.
-This size is deduced from `Block_Size - Literals_Section_Size`.
+This size is deduced from the literals section size:
+`Sequences_Section_Size = Block_Size - Literals_Section_Size`.


 #### `Sequences_Section_Header`
@ -614,9 +631,11 @@ They follow the same enumeration :
          No distribution table will be present.
 - `RLE_Mode` : The table description consists of a single byte.
          This code will be repeated for all sequences.
- `Repeat_Mode` : The table used in the previous compressed block will be used again.
+- `Repeat_Mode` : The table used in the previous `Compressed_Block` will be used again,
+          or, if this is the first block, the table in the dictionary will be used.
          No distribution table will be present.
-          Note: this includes RLE mode, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
+          Note that this includes `RLE_Mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
+          Note that this also includes `Predefined_Mode`.
          If this mode is used without any previous sequence table in the frame
          (or [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
 - `FSE_Compressed_Mode` : standard FSE compression.
@ -624,6 +643,8 @@ They follow the same enumeration :
          The format of this distribution table is described in [FSE Table Description](#fse-table-description).
          Note that the maximum allowed accuracy log for literals length and match length tables is 9,
          and the maximum accuracy log for the offsets table is 8.
+          `FSE_Compressed_Mode` must not be used when only one symbol is present;
+          `RLE_Mode` should be used instead (although any other mode will work).

 #### The codes for literals lengths, match lengths, and offsets.

@ -696,7 +717,7 @@ Offset codes are values ranging from `0` to `N`.
 A decoder is free to limit its maximum `N` supported.
 Recommendation is to support at least up to `22`.
 For information, at the time of this writing,
-the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
+the reference decoder supports a maximum `N` value of `31` in 64-bits mode.

 An offset code is also the number of additional bits to read in __little-endian__ fashion,
 and can be translated into an `Offset_Value` using the following formulas :
@ -856,7 +877,8 @@ so an `offset_value` of 1 means `Repeated_Offset2`,
 an `offset_value` of 2 means `Repeated_Offset3`,
 and an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`.

-For the first block, the starting offset history is populated with the following values : 1, 4 and 8 (in order).
+For the first block, the starting offset history is populated with the following values : 1, 4 and 8 (in order),
+unless a dictionary is used, in which case they come from the dictionary.

 Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
 Note that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history.
@ -919,6 +941,8 @@ FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
 FSE encoding/decoding involves a state that is carried over between symbols,
 so decoding must be done in the opposite direction as encoding.
 Therefore, all FSE bitstreams are read from end to beginning.
+Note that the order of the bits in the stream is not reversed;
+the elements are simply read in the reverse of the order in which they were written.

 For additional details on FSE, see [Finite State Entropy].

@ -943,6 +967,7 @@ The Zstandard format encodes FSE table descriptions as follows:
 An FSE distribution table describes the probabilities of all symbols
 from `0` to the last present one (included)
 on a normalized scale of `1 << Accuracy_Log` .
+Note that there must be two or more symbols with nonzero probability.

 It's a bitstream which is read forward, in __little-endian__ fashion.
 It's not necessary to know its exact size,
@ -959,24 +984,24 @@ It depends on :
  __example__ :
  Presuming an `Accuracy_Log` of 8,
  and presuming 100 probabilities points have already been distributed,
-  the decoder may read any value from `0` to `255 - 100 + 1 == 156` (inclusive).
-  Therefore, it must read `log2sup(156) == 8` bits.
+  the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
+  Therefore, it must read `log2sup(157) == 8` bits.

 - Value decoded : small values use 1 less bit :
  __example__ :
-  Presuming values from 0 to 156 (inclusive) are possible,
-  255-156 = 99 values are remaining in an 8-bits field.
+  Presuming values from 0 to 157 (inclusive) are possible,
+  255-157 = 98 values are remaining in an 8-bits field.
  They are used this way :
-  first 99 values (hence from 0 to 98) use only 7 bits,
-  values from 99 to 156 use 8 bits.
+  first 98 values (hence from 0 to 97) use only 7 bits,
+  values from 98 to 157 use 8 bits.
  This is achieved through this scheme :

  | Value read | Value decoded | Number of bits used |
  | ---------- | ------------- | ------------------- |
-  |   0 -  98  |   0 -  98     |  7                  |
-  |  99 - 127  |  99 - 127     |  8                  |
-  | 128 - 226  |   0 -  98     |  7                  |
-  | 227 - 255  | 128 - 156     |  8                  |
+  |   0 -  97  |   0 -  97     |  7                  |
+  |  98 - 127  |  98 - 127     |  8                  |
+  | 128 - 225  |   0 -  97     |  7                  |
+  | 226 - 255  | 128 - 157     |  8                  |

 Symbols probabilities are read one by one, in order.
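
The corrected example can be checked with a small Python sketch of the variable-width read. It is an illustration under assumptions: `peek` stands for the next `nbits` bits of the forward bitstream taken as an integer, `max_value` is the largest value that may be read (157 in the example), and the function name is hypothetical.

```python
def read_probability_value(peek: int, max_value: int):
    """Return (decoded value, number of bits consumed) following the table above."""
    nbits = max_value.bit_length()            # 8 bits when max_value is 157
    half = 1 << (nbits - 1)                   # 128
    spare = (1 << nbits) - 1 - max_value      # 255 - 157 = 98 short codes
    low = peek & (half - 1)
    if low < spare:
        return low, nbits - 1                 # small values use 1 less bit
    value = peek & ((1 << nbits) - 1)
    if value >= half:
        value -= spare                        # 226..255 map to 128..157
    return value, nbits


for peek in (50, 100, 130, 226, 255):
    print(peek, read_probability_value(peek, 157))
# 50 (50, 7) / 100 (100, 8) / 130 (2, 7) / 226 (128, 8) / 255 (157, 8)
```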

@ -1019,12 +1044,12 @@ and instructions to get the next state.

 Symbols are scanned in their natural order for "less than 1" probabilities.
 Symbols with this probability are being attributed a single cell,
-starting from the end of the table.
+starting from the end of the table and retreating.
 These symbols define a full state reset, reading `Accuracy_Log` bits.

-All remaining symbols are sorted in their natural order.
+All remaining symbols are allocated in their natural order.
 Starting from symbol `0` and table position `0`,
-each symbol gets attributed as many cells as its probability.
+each symbol gets allocated as many cells as its probability.
 Cell allocation is spread, not linear :
 each successor position follows this rule :

@ -1044,6 +1069,7 @@ Each state will decode the current symbol.
 To get the `Number_of_Bits` and `Baseline` required for next state,
 it's first necessary to sort all states in their natural order.
 The lower states will need 1 more bit than higher ones.
+The process is repeated for each symbol.

 __Example__ :
 Presuming a symbol has a probability of 5.
@ -1055,10 +1081,12 @@ Presuming the `Accuracy_Log` is 7, it defines 128 states.
 Divided by 8, each share is 16 large.

 In order to reach 8, 8-5=3 lowest states will count "double",
-taking shares twice larger,
+doubling the number of shares (32 in width),
 requiring one more bit in the process.

-Numbering starts from higher states using less bits.
+Baseline is assigned starting from the higher states, which use fewer bits,
+proceeding in natural order, then resuming at the first state;
+each state takes its allocated width from Baseline.

 | state order      |   0   |   1   |    2   |   3  |   4   |
 | ---------------- | ----- | ----- | ------ | ---- | ----- |
@ -1075,6 +1103,7 @@ See [Appendix A] for the results of this process applied to the default distribu

 [Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
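
The `Number_of_Bits` and `Baseline` assignment described in the example (probability 5, `Accuracy_Log` 7) could be sketched in Python as below. This is an illustration of the prose only, for symbols with a probability of at least 1; the function name and the list-of-tuples return shape are hypothetical.

```python
def symbol_state_transitions(probability: int, accuracy_log: int):
    """Return [(Number_of_Bits, Baseline), ...] for one symbol, indexed by state order."""
    table_size = 1 << accuracy_log
    shares = 1 << (probability - 1).bit_length()  # next power of 2 >= probability
    base_width = table_size // shares             # 128 / 8 = 16 in the example
    base_bits = base_width.bit_length() - 1       # log2(base_width)
    doubled = shares - probability                # lowest states that count "double"

    result = [None] * probability
    baseline = 0
    # Higher states come first: they use fewer bits and a single share each.
    for order in range(doubled, probability):
        result[order] = (base_bits, baseline)
        baseline += base_width
    # Then resume at the first state: these take a doubled share and one more bit.
    for order in range(doubled):
        result[order] = (base_bits + 1, baseline)
        baseline += 2 * base_width
    return result


print(symbol_state_transitions(5, 7))
# [(5, 32), (5, 64), (5, 96), (4, 0), (4, 16)]
```

The widths always sum to the table size (3×32 + 2×16 = 128 here), so the ranges cover every state exactly once.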

+
 Huffman Coding
 --------------
 Zstandard Huffman-coded streams are read backwards,
@ -1096,6 +1125,7 @@ The bitstream contains Huffman-coded symbols in __little-endian__ order,
 with the codes defined by the method below.

 ### Huffman Tree Description
+
 Prefix coding represents symbols from an a priori known alphabet
 by bit sequences (codewords), one codeword for each symbol,
 in a manner such that different symbols may be represented
@ -1112,7 +1142,6 @@ More bits improve accuracy but cost more header size,
 and require more memory or more complex decoding operations.
 This specification limits maximum code length to 11 bits.

-
 ##### Representation

 All literal values from zero (included) to last present one (excluded)
@ -1190,7 +1219,7 @@ and last symbol's weight is not represented.

 An FSE bitstream starts by a header, describing probabilities distribution.
 It will create a Decoding Table.
-For a list of Huffman weights, the maximum accuracy log is 7 bits.
+For a list of Huffman weights, the maximum accuracy log is 6 bits.
 For more description see the [FSE header description](#fse-table-description)

 The Huffman header compression uses 2 states,
@ -1330,7 +1359,8 @@ __`Content`__ : The rest of the dictionary is its content.
              As long as the amount of data decoded from this frame is less than or
              equal to `Window_Size`, sequence commands may specify offsets longer
              than the total length of decoded output so far to reference back to the
-              dictionary.  After the total output has surpassed `Window_Size` however,
+              dictionary, even parts of the dictionary with offsets larger than `Window_Size`.  
+              After the total output has surpassed `Window_Size` however,
              this is no longer allowed and the dictionary is no longer accessible.

 [compressed blocks]: #the-format-of-compressed_block
@ -1523,6 +1553,7 @@ to crosscheck that an implementation build its decoding tables correctly.

 Version changes
 ---------------
+- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
 - 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
 - 0.2.5 : minor typos and clarifications
 - 0.2.4 : section restructuring, by Sean Purcell