Minor formatting changes

* In the description about "three categories", explicitly number them instead
of using a giant paragraph that is harder to follow.

* Switch lists of items to consistently use American style commas. The American
style lists is better for clarity purposes. Consider the following:
	-Each category of value (insert and copy lengths, literals and distances)
	+Each category of value (insert and copy lengths, literals, and distances)

* Make sure not to break a hyphenated phrase with a newline. When the nroff
file is processed, "insert-\nand-copy" becomes "insert- and-copy", making it
inconsistent with other uses of the hyphenated phrase.

* Consistently use the same hyphenated phrase if referred to as a single unit.
	"insert and copy"   -> "insert-and-copy"
	"least significant" -> "least-significant"
	"most significant"  -> "most-significant"
	"fixed length"      -> "fixed-length"
	"block switch"      -> "block-switch".

* Consistently use "indexes" instead of "indices"
This commit is contained in:
Joe Tsai 2015-11-01 16:50:13 -08:00
parent 166edb0287
commit c5b6b5c7c1

View File

@ -160,11 +160,11 @@ String: a sequence of arbitrary bytes.
Bytes stored within a computer do not have a "bit order", since
they are always treated as a unit. However, a byte considered as
an integer between 0 and 255 does have a most- and least-
significant bit, and since we write numbers with the most-
significant digit on the left, we also write bytes with the most-
significant bit on the left. In the diagrams below, we number the
bits of a byte so that bit 0 is the least-significant bit, i.e.,
an integer between 0 and 255 does have a most- and
least-significant bit, and since we write numbers with the
most-significant digit on the left, we also write bytes with the
most-significant bit on the left. In the diagrams below, we number
the bits of a byte so that bit 0 is the least-significant bit, i.e.,
the bits are numbered:
.nf
@ -209,15 +209,15 @@ compressed byte sequence:
starting with the least-significant bit of the data
element. These are referred to here as integer values
and are considered unsigned.
* Prefix codes are packed starting with the most-
significant bit of the code.
* Prefix codes are packed starting with the most-significant
bit of the code.
.fi
In other words, if one were to print out the compressed data as
a sequence of bytes, starting with the first byte at the
*right* margin and proceeding to the *left*, with the most-
significant bit of each byte on the left as usual, one would be
able to parse the result from right to left, with fixed-width
*right* margin and proceeding to the *left*, with the
most-significant bit of each byte on the left as usual, one would
be able to parse the result from right to left, with fixed-width
elements in the correct MSB-to-LSB order and prefix codes in
bit-reversed order (i.e., with the first bit of the code in the
relative LSB position).
@ -225,8 +225,8 @@ relative LSB position).
.ti 0
2. Compressed representation overview
A compressed data set consists of a header and a series of meta-
blocks. Each meta-block decompresses to a sequence of 0
A compressed data set consists of a header and a series of
meta-blocks. Each meta-block decompresses to a sequence of 0
to 16,777,216 (16 MiB) uncompressed bytes. The final uncompressed data is
the concatenation of the uncompressed sequences from each meta-block.
@ -259,30 +259,37 @@ duplicated is two, but the last command in the meta-block is permitted to have
only literals and no pointer to a string to duplicate.
Each command in the compressed data is represented using three categories
of prefix codes: one set of prefix codes are for the literal sequence
lengths (also referred to as literal insertion lengths) and backward
copy lengths (that is, a single code word represents two lengths,
one of the literal sequence and one of the backward copy), a separate
set of prefix codes are for literals, and a third set of prefix codes are for
distances. The prefix code descriptions for each meta-block appear in a compact
form just before the compressed data in the meta-block header. The insert and
copy length and distance prefix codes may be followed by extra bits that are
added to the base values determined by the codes. The number of extra bits is
determined by the code.
of prefix codes:
1) One set of prefix codes are for the literal sequence lengths
(also referred to as literal insertion lengths) and backward
copy lengths (that is, a single code word represents two lengths,
one of the literal sequence and one of the backward copy).
2) One set of prefix codes are for literals.
3) One set of prefix codes are for distances.
The prefix code descriptions for each meta-block appear in a compact
form just before the compressed data in the meta-block header.
The insert-and-copy length and distance prefix codes may be followed by
extra bits that are added to the base values determined by the codes.
The number of extra bits is determined by the code.
One meta-block command then appears as a sequence of prefix codes:
Insert and copy length, literal, literal, ..., literal, distance
Insert-and-copy length, literal, literal, ..., literal, distance
where the insert and copy defines the number of literals that immediately
follow and the copy length, and the distance defines how far back to go
for the copy, used in combination with the copy length. The resulting
where the insert-and-copy defines an insertion length and a copy length.
The insertion length determines the number of literals that immediately
follow. The distance defines how far back to go for the copy and the
copy length determine the number of bytes to copy. The resulting
uncompressed data is the sequence of bytes:
literal, literal, ..., literal, copy, copy, ..., copy
where the number of literal bytes and copy bytes are determined by the
insert and copy length code. (The number of bytes copied for a static
insert-and-copy length code. (The number of bytes copied for a static
dictionary entry can vary from the copy length.)
The last command in the meta-block may end with the last literal if the
@ -301,14 +308,13 @@ a new block type and block count is read from the stream immediately
preceding the next element of that category, which will use the new
block type.
The insert and copy block type directly determines which prefix code to
use for the next insert and copy element. For the literal and distance
The insert-and-copy block type directly determines which prefix code to
use for the next insert-and-copy element. For the literal and distance
elements, the respective block type is used in combination with other
context information to determine which prefix code to use for the next
element.
Consider the following
example:
Consider the following example:
(IaC0, L0, L1, L2, D0)(IaC1, D1)(IaC2, L3, L4, D2)(IaC3, L5, D3)
@ -341,7 +347,7 @@ meta-block is then:
IaC0 L0 L1 LBlockSwitch(1, 3) L2 D0 IaC1 DBlockSwitch(1, 3) D1
IaCBlockSwitch(1, 2) IaC2 L3 L4 D2 IaC3 LBlockSwitch(0, 1) L5 D3
where *BlockSwitch(t, n) switches to block type t for a count of n elements.
where xBlockSwitch(t, n) switches to block type t for a count of n elements.
Note that in this example DBlockSwitch(1, 3) immediately precedes the
next required distance D1. It does not follow the last distance of
the previous block, D0. Whenever an element of a category is needed,
@ -349,14 +355,14 @@ and the block count for that category has reached zero, then a new
block type and count is read from the stream just before reading that next
element.
The block switch commands for the first blocks of each category are not part
The block-switch commands for the first blocks of each category are not part
of the meta-block compressed data. Instead the first block type
is defined to be 0, and the first block count for each category is
encoded in the meta-block header. The prefix codes for the block types and counts, a total of
six prefix codes over the three categories, are defined in a compact form in the meta-block
header.
Each category of value (insert-and-copy lengths, literals and distances)
Each category of value (insert-and-copy lengths, literals, and distances)
can be encoded with any prefix code from a collection of prefix
codes belonging to the same category appearing in the meta-block header. The
particular prefix code used can depend on two factors: the block
@ -366,10 +372,9 @@ the uncompressed data, and in the case of distances, the context is the copy
length from the same command. For insert-and-copy lengths, no context
is used and the prefix code depends only on the block type. In the case
of literals and distances, the context is mapped to a context ID in
the range 0..63 for literals and 0..3 for distances and the matrix
of the prefix code indices for each block type and context ID,
called the context map, is encoded in a compact form in the meta-
block header.
the range 0..63 for literals and 0..3 for distances. The matrix
of the prefix code indexes for each block type and context ID,
called the context map, is encoded in a compact form in the meta-block header.
For example, the prefix code to use to decode L2 depends on the
block type (1), and the literal context ID determined by the two uncompressed
@ -379,8 +384,8 @@ type (0), and the distance context ID determined by the copy length decoded
from IaC0. The prefix code to use to decode IaC3 depends only on the block
type (1).
In addition to the parts listed above (prefix code for insert-
and-copy lengths, literals, distances, block types and block counts
In addition to the parts listed above (prefix code for insert-and-copy
lengths, literals, distances, block types and block counts,
and the context map), the meta-block header contains the number of
uncompressed bytes coded in the meta-block and two additional parameters used in
the representation of match distances: the number of postfix bits and
@ -515,7 +520,7 @@ initially in tree[I].Len; the codes are produced in tree[I].Code.
value.
.KS
for (n = 0; n <= max_code; n++) {
for (n = 0; n <= max_code; n++) {
len = tree[n].Len;
if (len != 0) {
tree[n].Code = next_code[len];
@ -577,7 +582,7 @@ Prefix codes are used for different purposes in the brotli
format, and each purpose has a different alphabet size. For
literal codes the alphabet size is 256. For insert-and-copy
length codes the alphabet size is 704. For block count codes,
the alphabet size is 26. For distance codes, block type codes and
the alphabet size is 26. For distance codes, block type codes, and
the prefix codes used in compressing the context map, the
alphabet size is dynamic and is based on parameters defined in
later sections. The following table summarizes the alphabet sizes
@ -585,25 +590,25 @@ for the various prefix codes and the sections where they are defined.
.nf
.KS
+-----------------+------------------------+-------------+
| Prefix code | Alphabet size | Definition |
+-----------------+------------------------+-------------+
| literal | 256 | |
+-----------------+------------------------+-------------+
| distance | 16 + NDIRECT + | Section 4. |
| | (48 << NPOSTFIX) | |
+-----------------+------------------------+-------------+
| insert-and-copy | 704 | Section 5. |
| length | | |
+-----------------+------------------------+-------------+
| block count | 26 | Section 6. |
+-----------------+------------------------+-------------+
| block type | NBLTYPESx + 2, | Section 6. |
| | (where x is I, L or D) | |
+-----------------+------------------------+-------------+
| context map | NTREESx + RLEMAXx | Section 7. |
| | (where x is L or D) | |
+-----------------+------------------------+-------------+
+-----------------+-------------------------+------------+
| Prefix code | Alphabet size | Definition |
+-----------------+-------------------------+------------+
| literal | 256 | |
+-----------------+-------------------------+------------+
| distance | 16 + NDIRECT + | Section 4. |
| | (48 << NPOSTFIX) | |
+-----------------+-------------------------+------------+
| insert-and-copy | 704 | Section 5. |
| length | | |
+-----------------+-------------------------+------------+
| block count | 26 | Section 6. |
+-----------------+-------------------------+------------+
| block type | NBLTYPESx + 2, | Section 6. |
| | (where x is I, L, or D) | |
+-----------------+-------------------------+------------+
| context map | NTREESx + RLEMAXx | Section 7. |
| | (where x is L or D) | |
+-----------------+-------------------------+------------+
.KE
.fi
@ -615,8 +620,8 @@ prefix code distinguishes between simple and complex prefix
codes. If this value is 1, then a simple prefix code follows.
Otherwise the value indicates the number of leading zeros.
A simple prefix code can have only up to four symbols with non-
zero code length. The format of the simple prefix code is as
A simple prefix code can have only up to four symbols with
non-zero code length. The format of the simple prefix code is as
follows:
.nf
@ -657,7 +662,7 @@ follows:
prefix code.
* if NSYM = 4, the code lengths (in order of symbols decoded)
depend on the tree-select bit: 2, 2, 2, 2, (tree-select bit 0)
depend on the tree-select bit: 2, 2, 2, 2 (tree-select bit 0),
or 1, 2, 3, 3 (tree-select bit 1).
.fi
@ -724,11 +729,10 @@ in the compressed data, where the bits are parsed from right to left):
.KE
.fi
We can now define the format of the complex prefix code as
follows:
We can now define the format of the complex prefix code as follows:
.nf
2 bits: HSKIP, values of 0, 2 or 3 represent the respective
2 bits: HSKIP, values of 0, 2, or 3 represent the respective
number of skipped code lengths. The skipped lengths
are taken to be zero. (An HSKIP of 1 indicates a
Simple prefix code.)
@ -742,7 +746,7 @@ follows:
is for symbol 4.
The code lengths of code length symbols are between 0 and
5 and they are represented with 2 - 4 bits according to
5, and they are represented with 2 - 4 bits according to
the variable length code above. A code length of 0 means
the corresponding code length symbol is not used.
@ -805,7 +809,7 @@ the number of direct distance codes, denoted by NDIRECT (0..120). Both of
these parameters are encoded in the meta-block header. We will also
use the following derived parameter:
POSTFIX_MASK = ((1 << NPOSTFIX) - 1)
POSTFIX_MASK = (1 << NPOSTFIX) - 1
The first 16 distance symbols are special symbols that reference
past distances as follows:
@ -830,10 +834,10 @@ past distances as follows:
.fi
The ring buffer of four last distances is initialized by the values
16, 15, 11 and 4 (i.e. the fourth-to-last is set to 16, the third-to-last
to 15, the second-to-last to 11 and the last distance to 4) at the
beginning of the *stream* (as opposed to the beginning of the meta-
block) and it is not reset at meta-block boundaries. When a distance
16, 15, 11, and 4 (i.e. the fourth-to-last is set to 16, the third-to-last
to 15, the second-to-last to 11, and the last distance to 4) at the
beginning of the *stream* (as opposed to the beginning of the meta-block)
and it is not reset at meta-block boundaries. When a distance
symbol 0 appears, the distance it represents (i.e. the last distance
in the sequence of distances) is not pushed to the ring buffer of
last distances, in other words, the expression "(second, third,
@ -865,7 +869,7 @@ Given a distance symbol "dcode" (>= 16 + NDIRECT), and extra bits
.nf
hcode = (dcode - NDIRECT - 16) >> NPOSTFIX
lcode = (dcode - NDIRECT - 16) & POSTFIX_MASK
offset = ((2 + (hcode & 1)) << ndistbits) - 4;
offset = ((2 + (hcode & 1)) << ndistbits) - 4
distance = ((offset + dextra) << NPOSTFIX) + lcode + NDIRECT + 1
.fi
@ -881,7 +885,7 @@ of a meta-block is represented with the following triplet:
<insert-and-copy length code, insert extra bits, copy extra bits>
The insert-and-copy length code, the insert extra bits and the copy
The insert-and-copy length code, the insert extra bits, and the copy
extra bits are encoded back-to-back, the insert-and-copy length code
is encoded using a prefix code over the insert-and-copy length code
alphabet, while the extra bits values are encoded as fixed-width
@ -999,7 +1003,7 @@ are initialized to 1 and 0, respectively, at the end of the
meta-block header.
Since the first block type of each block category is 0, the block
type of the first block switch command is not encoded in
type of the first block-switch command is not encoded in
the compressed data. Instead the block count for each category
that has more than one type is encoded in the meta-block header.
@ -1009,13 +1013,13 @@ count down to exactly zero at the end of the meta-block.
The number of different block types in each block category, denoted
by NBLTYPESL, NBLTYPESI, and NBLTYPESD for literals, insert-and-copy
lengths and distances, respectively, is encoded in the meta-block
lengths, and distances, respectively, is encoded in the meta-block
header, and it must equal to the largest block type plus one in that
block category. In other words, the set of literal, insert-and-copy
length and distance block types must be [0..NBLTYPESL-1],
length, and distance block types must be [0..NBLTYPESL-1],
[0..NBLTYPESI-1], and [0..NBLTYPESD-1], respectively. From this it
follows that the alphabet size of literal, insert-and-copy length and
distance block type codes is NBLTYPESL + 2, NBLTYPESI + 2 and
follows that the alphabet size of literal, insert-and-copy length, and
distance block type codes is NBLTYPESL + 2, NBLTYPESI + 2, and
NBLTYPESD + 2, respectively.
Each block count in the compressed data is represented with a pair
@ -1044,7 +1048,7 @@ number of extra bits and the range of block counts are as follows:
.KE
.fi
The first block switch command of each block category is special in
The first block-switch command of each block category is special in
the sense that it is encoded in the meta-block header, and as
described earlier the block type code is omitted, since it is an
implicit zero.
@ -1072,10 +1076,10 @@ p1 and p2 are initialized to zero.
There are four methods, called context modes, to compute the
Context ID:
.nf
* LSB6, where the Context ID is the value of six least
significant bits of p1,
* MSB6, where the Context ID is the value of six most
significant bits of p1,
* LSB6, where the Context ID is the value of six
least-significant bits of p1,
* MSB6, where the Context ID is the value of six
most-significant bits of p1,
* UTF8, where the Context ID is a complex function of p1, p2,
optimized for text compression, and
* Signed, where Context ID is a complex function of p1, p2,
@ -1156,9 +1160,9 @@ Given p1 is the last uncompressed byte and p2 is the second-to-last
uncompressed byte the context IDs can be computed as follows:
.nf
For LSB6 : Context ID = p1 & 0x3f
For MSB6 : Context ID = p1 >> 2
For UTF8 : Context ID = Lut0[p1] | Lut1[p2]
For LSB6: Context ID = p1 & 0x3f
For MSB6: Context ID = p1 >> 2
For UTF8: Context ID = Lut0[p1] | Lut1[p2]
For Signed: Context ID = (Lut2[p1] << 3) | Lut2[p2]
.fi
@ -1466,7 +1470,7 @@ the following:
fill bits are not zero, then the stream should be
rejected as invalid)
2 bits: MNIBBLES, # of nibbles to represent the uncompressed
length, encoded with the following fixed length code:
length, encoded with the following fixed-length code:
Value Bit Pattern
----- -----------
@ -1563,7 +1567,7 @@ the following:
2 bits: NPOSTFIX, parameter used in the distance coding
4 bits: four most significant bits of NDIRECT, to get the
4 bits: four most-significant bits of NDIRECT, to get the
actual value of the parameter NDIRECT, left-shift
this four bit number by NPOSTFIX bits
@ -1773,7 +1777,7 @@ specification, where non-conformant compressed data sequences should be discarde
A possible attack against a system containing a decompressor
implementation (e.g. a web browser) is to exploit a buffer
overflow caused by an invalid compressed data. Therefore decompressor
implementations should perform bound-checking for each memory access
implementations should perform bounds-checking for each memory access
that result from values decoded from the compressed stream.
.ti 0
@ -5678,8 +5682,8 @@ length is 122,784 bytes and the zlib CRC-32 of the byte sequence is
9ce0a58de0a49ee0a4bee0a4aae0a4a8e0a495e0a4bee0a4b0e0a58de0a4b0e0
a4b5e0a4bee0a488e0a4b8e0a495e0a58de0a4b0e0a4bfe0a4afe0a4a4e0a4be
The number of words for each length is given by the following bit-
depth array:
The number of words for each length is given by the following
bit-depth array:
NDBITS := 0, 0, 0, 0, 10, 10, 11, 11, 10, 10,
10, 10, 10, 9, 9, 8, 7, 7, 8, 7,