mirror of https://github.com/google/brotli.git (synced 2024-11-24 04:20:16 +00:00)
Minor formatting changes
* In the description about "three categories", explicitly number them
  instead of using a giant paragraph that is harder to follow.
* Switch lists of items to consistently use American style commas.
  The American style lists is better for clarity purposes.
  Consider the following:
    -Each category of value (insert and copy lengths, literals and distances)
    +Each category of value (insert and copy lengths, literals, and distances)
* Make sure not to break a hyphenated phrase with a newline.
  When the nroff file is processed, "insert-\nand-copy" becomes
  "insert- and-copy", making it inconsistent with other uses of
  the hyphenated phrase.
* Consistently use the same hyphenated phrase if referred to as a single unit.
    "insert and copy" -> "insert-and-copy"
    "least significant" -> "least-significant"
    "most significant" -> "most-significant"
    "fixed length" -> "fixed-length"
    "block switch" -> "block-switch".
* Consistently use "indexes" instead of "indices"
parent 166edb0287
commit c5b6b5c7c1
@@ -160,11 +160,11 @@ String: a sequence of arbitrary bytes.
 
 Bytes stored within a computer do not have a "bit order", since
 they are always treated as a unit. However, a byte considered as
-an integer between 0 and 255 does have a most- and least-
-significant bit, and since we write numbers with the most-
-significant digit on the left, we also write bytes with the most-
-significant bit on the left. In the diagrams below, we number the
-bits of a byte so that bit 0 is the least-significant bit, i.e.,
+an integer between 0 and 255 does have a most- and
+least-significant bit, and since we write numbers with the
+most-significant digit on the left, we also write bytes with the
+most-significant bit on the left. In the diagrams below, we number
+the bits of a byte so that bit 0 is the least-significant bit, i.e.,
 the bits are numbered:
 
 .nf
@@ -209,15 +209,15 @@ compressed byte sequence:
 starting with the least-significant bit of the data
 element. These are referred to here as integer values
 and are considered unsigned.
-* Prefix codes are packed starting with the most-
-significant bit of the code.
+* Prefix codes are packed starting with the most-significant
+bit of the code.
 .fi
 
 In other words, if one were to print out the compressed data as
 a sequence of bytes, starting with the first byte at the
-*right* margin and proceeding to the *left*, with the most-
-significant bit of each byte on the left as usual, one would be
-able to parse the result from right to left, with fixed-width
+*right* margin and proceeding to the *left*, with the
+most-significant bit of each byte on the left as usual, one would
+be able to parse the result from right to left, with fixed-width
 elements in the correct MSB-to-LSB order and prefix codes in
 bit-reversed order (i.e., with the first bit of the code in the
 relative LSB position).
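The two packing rules in the hunk above can be illustrated with a minimal Python sketch. The helper names `read_bits` and `read_code_bits` are hypothetical and do not come from the brotli sources, and the sketch assumes that bits are consumed from each byte starting at bit 0 (the least-significant bit), which is how the right-to-left reading described above works out.

```python
def _bit_at(data: bytes, pos: int) -> int:
    """Return the bit at absolute position pos, counting from bit 0 of byte 0."""
    return (data[pos >> 3] >> (pos & 7)) & 1


def read_bits(data: bytes, bit_pos: int, n: int) -> int:
    """Read an n-bit unsigned integer: the first bit read is its LSB."""
    value = 0
    for i in range(n):
        value |= _bit_at(data, bit_pos + i) << i
    return value


def read_code_bits(data: bytes, bit_pos: int, n: int) -> int:
    """Read n bits of a prefix code: the first bit read is the code's MSB."""
    code = 0
    for i in range(n):
        code = (code << 1) | _bit_at(data, bit_pos + i)
    return code
```

Reading the same three bits of the byte 0b10110100 from position 0 yields 4 as an integer value but 1 as a prefix code, which is the bit reversal the paragraph describes.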
@@ -225,8 +225,8 @@ relative LSB position).
 .ti 0
 2. Compressed representation overview
 
-A compressed data set consists of a header and a series of meta-
-blocks. Each meta-block decompresses to a sequence of 0
+A compressed data set consists of a header and a series of
+meta-blocks. Each meta-block decompresses to a sequence of 0
 to 16,777,216 (16 MiB) uncompressed bytes. The final uncompressed data is
 the concatenation of the uncompressed sequences from each meta-block.
 
@@ -259,30 +259,37 @@ duplicated is two, but the last command in the meta-block is permitted to have
 only literals and no pointer to a string to duplicate.
 
 Each command in the compressed data is represented using three categories
-of prefix codes: one set of prefix codes are for the literal sequence
-lengths (also referred to as literal insertion lengths) and backward
-copy lengths (that is, a single code word represents two lengths,
-one of the literal sequence and one of the backward copy), a separate
-set of prefix codes are for literals, and a third set of prefix codes are for
-distances. The prefix code descriptions for each meta-block appear in a compact
-form just before the compressed data in the meta-block header. The insert and
-copy length and distance prefix codes may be followed by extra bits that are
-added to the base values determined by the codes. The number of extra bits is
-determined by the code.
+of prefix codes:
+
+1) One set of prefix codes are for the literal sequence lengths
+(also referred to as literal insertion lengths) and backward
+copy lengths (that is, a single code word represents two lengths,
+one of the literal sequence and one of the backward copy).
+
+2) One set of prefix codes are for literals.
+
+3) One set of prefix codes are for distances.
+
+The prefix code descriptions for each meta-block appear in a compact
+form just before the compressed data in the meta-block header.
+The insert-and-copy length and distance prefix codes may be followed by
+extra bits that are added to the base values determined by the codes.
+The number of extra bits is determined by the code.
 
 One meta-block command then appears as a sequence of prefix codes:
 
-Insert and copy length, literal, literal, ..., literal, distance
+Insert-and-copy length, literal, literal, ..., literal, distance
 
-where the insert and copy defines the number of literals that immediately
-follow and the copy length, and the distance defines how far back to go
-for the copy, used in combination with the copy length. The resulting
+where the insert-and-copy defines an insertion length and a copy length.
+The insertion length determines the number of literals that immediately
+follow. The distance defines how far back to go for the copy and the
+copy length determine the number of bytes to copy. The resulting
 uncompressed data is the sequence of bytes:
 
 literal, literal, ..., literal, copy, copy, ..., copy
 
 where the number of literal bytes and copy bytes are determined by the
-insert and copy length code. (The number of bytes copied for a static
+insert-and-copy length code. (The number of bytes copied for a static
 dictionary entry can vary from the copy length.)
 
 The last command in the meta-block may end with the last literal if the
@@ -301,14 +308,13 @@ a new block type and block count is read from the stream immediately
 preceding the next element of that category, which will use the new
 block type.
 
-The insert and copy block type directly determines which prefix code to
-use for the next insert and copy element. For the literal and distance
+The insert-and-copy block type directly determines which prefix code to
+use for the next insert-and-copy element. For the literal and distance
 elements, the respective block type is used in combination with other
 context information to determine which prefix code to use for the next
 element.
 
-Consider the following
-example:
+Consider the following example:
 
 (IaC0, L0, L1, L2, D0)(IaC1, D1)(IaC2, L3, L4, D2)(IaC3, L5, D3)
 
@@ -341,7 +347,7 @@ meta-block is then:
 IaC0 L0 L1 LBlockSwitch(1, 3) L2 D0 IaC1 DBlockSwitch(1, 3) D1
 IaCBlockSwitch(1, 2) IaC2 L3 L4 D2 IaC3 LBlockSwitch(0, 1) L5 D3
 
-where *BlockSwitch(t, n) switches to block type t for a count of n elements.
+where xBlockSwitch(t, n) switches to block type t for a count of n elements.
 Note that in this example DBlockSwitch(1, 3) immediately precedes the
 next required distance D1. It does not follow the last distance of
 the previous block, D0. Whenever an element of a category is needed,
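The lazy block-switch rule in the hunk above (a new type and count are read only when the next element of that category is needed and the count has reached zero) can be sketched in Python. The function name `block_types` is hypothetical. The first block counts used in the usage note (2 literals, 1 distance, 2 insert-and-copy elements) are inferred from where the switch commands appear in the example stream, not stated in the hunk. The sketch also takes each switch as an explicit (type, count) pair, mirroring the xBlockSwitch(t, n) notation rather than the encoded form.

```python
def block_types(first_count, switches, n_elements):
    """Return the block type used by each element of one category.

    first_count: count of the implicit first block (type 0), from the header.
    switches:    (block_type, count) pairs in stream order.
    """
    block_type, remaining = 0, first_count
    pending = iter(switches)
    types = []
    for _ in range(n_elements):
        if remaining == 0:
            # Read the switch just before the element that needs it.
            block_type, remaining = next(pending)
        types.append(block_type)
        remaining -= 1
    return types
```

For the literals of the example, `block_types(2, [(1, 3), (0, 1)], 6)` gives `[0, 0, 1, 1, 1, 0]`, so L2 is decoded under block type 1. For the distances, `block_types(1, [(1, 3)], 4)` gives `[0, 1, 1, 1]`.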
@@ -349,14 +355,14 @@ and the block count for that category has reached zero, then a new
 block type and count is read from the stream just before reading that next
 element.
 
-The block switch commands for the first blocks of each category are not part
+The block-switch commands for the first blocks of each category are not part
 of the meta-block compressed data. Instead the first block type
 is defined to be 0, and the first block count for each category is
 encoded in the meta-block header. The prefix codes for the block types and counts, a total of
 six prefix codes over the three categories, are defined in a compact form in the meta-block
 header.
 
-Each category of value (insert-and-copy lengths, literals and distances)
+Each category of value (insert-and-copy lengths, literals, and distances)
 can be encoded with any prefix code from a collection of prefix
 codes belonging to the same category appearing in the meta-block header. The
 particular prefix code used can depend on two factors: the block
@@ -366,10 +372,9 @@ the uncompressed data, and in the case of distances, the context is the copy
 length from the same command. For insert-and-copy lengths, no context
 is used and the prefix code depends only on the block type. In the case
 of literals and distances, the context is mapped to a context ID in
-the range 0..63 for literals and 0..3 for distances and the matrix
-of the prefix code indices for each block type and context ID,
-called the context map, is encoded in a compact form in the meta-
-block header.
+the range 0..63 for literals and 0..3 for distances. The matrix
+of the prefix code indexes for each block type and context ID,
+called the context map, is encoded in a compact form in the meta-block header.
 
 For example, the prefix code to use to decode L2 depends on the
 block type (1), and the literal context ID determined by the two uncompressed
@@ -379,8 +384,8 @@ type (0), and the distance context ID determined by the copy length decoded
 from IaC0. The prefix code to use to decode IaC3 depends only on the block
 type (1).
 
-In addition to the parts listed above (prefix code for insert-
-and-copy lengths, literals, distances, block types and block counts
+In addition to the parts listed above (prefix code for insert-and-copy
+lengths, literals, distances, block types and block counts,
 and the context map), the meta-block header contains the number of
 uncompressed bytes coded in the meta-block and two additional parameters used in
 the representation of match distances: the number of postfix bits and
@@ -515,7 +520,7 @@ initially in tree[I].Len; the codes are produced in tree[I].Code.
 value.
 
 .KS
-for (n = 0; n <= max_code; n++) {
+for (n = 0; n <= max_code; n++) {
 len = tree[n].Len;
 if (len != 0) {
 tree[n].Code = next_code[len];
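The loop in the hunk above is the last step of assigning canonical prefix codes from code lengths, and the hunk cuts it off before it closes. A minimal Python sketch of the whole procedure follows. The counting and `next_code` steps are not visible in this diff and are assumed to follow the usual Deflate-style construction, as is the increment of `next_code[len]` after each assignment. The function name `assign_canonical_codes` is hypothetical.

```python
def assign_canonical_codes(lengths):
    """Return one code value per symbol, or None for unused symbols (length 0)."""
    max_len = max(lengths, default=0)

    # Step 1 (assumed): count the symbols that use each non-zero code length.
    bl_count = [0] * (max_len + 1)
    for length in lengths:
        if length:
            bl_count[length] += 1

    # Step 2 (assumed): find the smallest code value for each length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code

    # Step 3: the loop from the hunk, handing out consecutive values per length.
    codes = []
    for length in lengths:
        if length:
            codes.append(next_code[length])
            next_code[length] += 1
        else:
            codes.append(None)
    return codes
```

With code lengths (3, 3, 3, 3, 3, 2, 4, 4), the sketch yields the code values 2, 3, 4, 5, 6, 0, 14, 15, each to be read as a bit string of the symbol's length.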
@@ -577,7 +582,7 @@ Prefix codes are used for different purposes in the brotli
 format, and each purpose has a different alphabet size. For
 literal codes the alphabet size is 256. For insert-and-copy
 length codes the alphabet size is 704. For block count codes,
-the alphabet size is 26. For distance codes, block type codes and
+the alphabet size is 26. For distance codes, block type codes, and
 the prefix codes used in compressing the context map, the
 alphabet size is dynamic and is based on parameters defined in
 later sections. The following table summarizes the alphabet sizes
@@ -585,25 +590,25 @@ for the various prefix codes and the sections where they are defined.
 
 .nf
 .KS
-+-----------------+------------------------+-------------+
-| Prefix code | Alphabet size | Definition |
-+-----------------+------------------------+-------------+
-| literal | 256 | |
-+-----------------+------------------------+-------------+
-| distance | 16 + NDIRECT + | Section 4. |
-| | (48 << NPOSTFIX) | |
-+-----------------+------------------------+-------------+
-| insert-and-copy | 704 | Section 5. |
-| length | | |
-+-----------------+------------------------+-------------+
-| block count | 26 | Section 6. |
-+-----------------+------------------------+-------------+
-| block type | NBLTYPESx + 2, | Section 6. |
-| | (where x is I, L or D) | |
-+-----------------+------------------------+-------------+
-| context map | NTREESx + RLEMAXx | Section 7. |
-| | (where x is L or D) | |
-+-----------------+------------------------+-------------+
++-----------------+-------------------------+------------+
+| Prefix code | Alphabet size | Definition |
++-----------------+-------------------------+------------+
+| literal | 256 | |
++-----------------+-------------------------+------------+
+| distance | 16 + NDIRECT + | Section 4. |
+| | (48 << NPOSTFIX) | |
++-----------------+-------------------------+------------+
+| insert-and-copy | 704 | Section 5. |
+| length | | |
++-----------------+-------------------------+------------+
+| block count | 26 | Section 6. |
++-----------------+-------------------------+------------+
+| block type | NBLTYPESx + 2, | Section 6. |
+| | (where x is I, L, or D) | |
++-----------------+-------------------------+------------+
+| context map | NTREESx + RLEMAXx | Section 7. |
+| | (where x is L or D) | |
++-----------------+-------------------------+------------+
 .KE
 .fi
 
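The dynamic rows of the table above reduce to small formulas. The following Python sketch restates them. The function names are hypothetical, and the range checks assume NPOSTFIX is a 2-bit value (0..3) and NDIRECT lies in 0..120, as stated elsewhere in this diff.

```python
LITERAL_ALPHABET_SIZE = 256
INSERT_AND_COPY_ALPHABET_SIZE = 704
BLOCK_COUNT_ALPHABET_SIZE = 26


def distance_alphabet_size(ndirect: int, npostfix: int) -> int:
    """16 + NDIRECT + (48 << NPOSTFIX), per the table row for distance codes."""
    if not 0 <= npostfix <= 3:
        raise ValueError("NPOSTFIX is assumed to be a 2-bit value")
    if not 0 <= ndirect <= 120:
        raise ValueError("NDIRECT is assumed to be in 0..120")
    return 16 + ndirect + (48 << npostfix)


def block_type_alphabet_size(nbltypes: int) -> int:
    """NBLTYPESx + 2, where x is I, L, or D."""
    return nbltypes + 2


def context_map_alphabet_size(ntrees: int, rlemax: int) -> int:
    """NTREESx + RLEMAXx, where x is L or D."""
    return ntrees + rlemax
```

With NDIRECT = 0 and NPOSTFIX = 0 the distance alphabet has 64 symbols. At the upper limits (NDIRECT = 120, NPOSTFIX = 3) it has 520.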
@@ -615,8 +620,8 @@ prefix code distinguishes between simple and complex prefix
 codes. If this value is 1, then a simple prefix code follows.
 Otherwise the value indicates the number of leading zeros.
 
-A simple prefix code can have only up to four symbols with non-
-zero code length. The format of the simple prefix code is as
+A simple prefix code can have only up to four symbols with
+non-zero code length. The format of the simple prefix code is as
 follows:
 
 .nf
@@ -657,7 +662,7 @@ follows:
 prefix code.
 
 * if NSYM = 4, the code lengths (in order of symbols decoded)
-depend on the tree-select bit: 2, 2, 2, 2, (tree-select bit 0)
+depend on the tree-select bit: 2, 2, 2, 2 (tree-select bit 0),
 or 1, 2, 3, 3 (tree-select bit 1).
 .fi
 
@@ -724,11 +729,10 @@ in the compressed data, where the bits are parsed from right to left):
 .KE
 .fi
 
-We can now define the format of the complex prefix code as
-follows:
+We can now define the format of the complex prefix code as follows:
 
 .nf
-2 bits: HSKIP, values of 0, 2 or 3 represent the respective
+2 bits: HSKIP, values of 0, 2, or 3 represent the respective
 number of skipped code lengths. The skipped lengths
 are taken to be zero. (An HSKIP of 1 indicates a
 Simple prefix code.)
@@ -742,7 +746,7 @@ follows:
 is for symbol 4.
 
 The code lengths of code length symbols are between 0 and
-5 and they are represented with 2 - 4 bits according to
+5, and they are represented with 2 - 4 bits according to
 the variable length code above. A code length of 0 means
 the corresponding code length symbol is not used.
 
@@ -805,7 +809,7 @@ the number of direct distance codes, denoted by NDIRECT (0..120). Both of
 these parameters are encoded in the meta-block header. We will also
 use the following derived parameter:
 
-POSTFIX_MASK = ((1 << NPOSTFIX) - 1)
+POSTFIX_MASK = (1 << NPOSTFIX) - 1
 
 The first 16 distance symbols are special symbols that reference
 past distances as follows:
@@ -830,10 +834,10 @@ past distances as follows:
 .fi
 
 The ring buffer of four last distances is initialized by the values
-16, 15, 11 and 4 (i.e. the fourth-to-last is set to 16, the third-to-last
-to 15, the second-to-last to 11 and the last distance to 4) at the
-beginning of the *stream* (as opposed to the beginning of the meta-
-block) and it is not reset at meta-block boundaries. When a distance
+16, 15, 11, and 4 (i.e. the fourth-to-last is set to 16, the third-to-last
+to 15, the second-to-last to 11, and the last distance to 4) at the
+beginning of the *stream* (as opposed to the beginning of the meta-block)
+and it is not reset at meta-block boundaries. When a distance
 symbol 0 appears, the distance it represents (i.e. the last distance
 in the sequence of distances) is not pushed to the ring buffer of
 last distances, in other words, the expression "(second, third,
@@ -865,7 +869,7 @@ Given a distance symbol "dcode" (>= 16 + NDIRECT), and extra bits
 .nf
 hcode = (dcode - NDIRECT - 16) >> NPOSTFIX
 lcode = (dcode - NDIRECT - 16) & POSTFIX_MASK
-offset = ((2 + (hcode & 1)) << ndistbits) - 4;
+offset = ((2 + (hcode & 1)) << ndistbits) - 4
 distance = ((offset + dextra) << NPOSTFIX) + lcode + NDIRECT + 1
 .fi
 
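The formulas in the hunk above translate almost directly into Python. In the sketch below, the definition of `ndistbits` is not part of this diff. It is taken from the published brotli format description, 1 + ((dcode - NDIRECT - 16) >> (NPOSTFIX + 1)), and should be treated as an assumption here. The function name `decode_distance` is hypothetical.

```python
def decode_distance(dcode: int, dextra: int, npostfix: int, ndirect: int) -> int:
    """Compute the distance for a symbol dcode >= 16 + NDIRECT and its extra bits."""
    if dcode < 16 + ndirect:
        raise ValueError("only symbols >= 16 + NDIRECT are handled by this formula")
    postfix_mask = (1 << npostfix) - 1
    base = dcode - ndirect - 16
    # Assumption: number of extra bits, as defined in the published format.
    ndistbits = 1 + (base >> (npostfix + 1))
    if not 0 <= dextra < (1 << ndistbits):
        raise ValueError("dextra does not fit in ndistbits bits")
    hcode = base >> npostfix
    lcode = base & postfix_mask
    offset = ((2 + (hcode & 1)) << ndistbits) - 4
    return ((offset + dextra) << npostfix) + lcode + ndirect + 1
```

With NPOSTFIX = 0 and NDIRECT = 0, symbol 16 covers distances 1..2, symbol 17 covers 3..4, and symbol 18 covers 5..8, so consecutive symbols tile the distance range without gaps.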
@@ -881,7 +885,7 @@ of a meta-block is represented with the following triplet:
 
 <insert-and-copy length code, insert extra bits, copy extra bits>
 
-The insert-and-copy length code, the insert extra bits and the copy
+The insert-and-copy length code, the insert extra bits, and the copy
 extra bits are encoded back-to-back, the insert-and-copy length code
 is encoded using a prefix code over the insert-and-copy length code
 alphabet, while the extra bits values are encoded as fixed-width
@@ -999,7 +1003,7 @@ are initialized to 1 and 0, respectively, at the end of the
 meta-block header.
 
 Since the first block type of each block category is 0, the block
-type of the first block switch command is not encoded in
+type of the first block-switch command is not encoded in
 the compressed data. Instead the block count for each category
 that has more than one type is encoded in the meta-block header.
 
@@ -1009,13 +1013,13 @@ count down to exactly zero at the end of the meta-block.
 
 The number of different block types in each block category, denoted
 by NBLTYPESL, NBLTYPESI, and NBLTYPESD for literals, insert-and-copy
-lengths and distances, respectively, is encoded in the meta-block
+lengths, and distances, respectively, is encoded in the meta-block
 header, and it must equal to the largest block type plus one in that
 block category. In other words, the set of literal, insert-and-copy
-length and distance block types must be [0..NBLTYPESL-1],
+length, and distance block types must be [0..NBLTYPESL-1],
 [0..NBLTYPESI-1], and [0..NBLTYPESD-1], respectively. From this it
-follows that the alphabet size of literal, insert-and-copy length and
-distance block type codes is NBLTYPESL + 2, NBLTYPESI + 2 and
+follows that the alphabet size of literal, insert-and-copy length, and
+distance block type codes is NBLTYPESL + 2, NBLTYPESI + 2, and
 NBLTYPESD + 2, respectively.
 
 Each block count in the compressed data is represented with a pair
@@ -1044,7 +1048,7 @@ number of extra bits and the range of block counts are as follows:
 .KE
 .fi
 
-The first block switch command of each block category is special in
+The first block-switch command of each block category is special in
 the sense that it is encoded in the meta-block header, and as
 described earlier the block type code is omitted, since it is an
 implicit zero.
@@ -1072,10 +1076,10 @@ p1 and p2 are initialized to zero.
 There are four methods, called context modes, to compute the
 Context ID:
 .nf
-* LSB6, where the Context ID is the value of six least
-significant bits of p1,
-* MSB6, where the Context ID is the value of six most
-significant bits of p1,
+* LSB6, where the Context ID is the value of six
+least-significant bits of p1,
+* MSB6, where the Context ID is the value of six
+most-significant bits of p1,
 * UTF8, where the Context ID is a complex function of p1, p2,
 optimized for text compression, and
 * Signed, where Context ID is a complex function of p1, p2,
@@ -1156,9 +1160,9 @@ Given p1 is the last uncompressed byte and p2 is the second-to-last
 uncompressed byte the context IDs can be computed as follows:
 
 .nf
-For LSB6 : Context ID = p1 & 0x3f
-For MSB6 : Context ID = p1 >> 2
-For UTF8 : Context ID = Lut0[p1] | Lut1[p2]
+For LSB6: Context ID = p1 & 0x3f
+For MSB6: Context ID = p1 >> 2
+For UTF8: Context ID = Lut0[p1] | Lut1[p2]
 For Signed: Context ID = (Lut2[p1] << 3) | Lut2[p2]
 .fi
 
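The four formulas in the hunk above can be written as small Python functions. The lookup tables Lut0, Lut1, and Lut2 are defined elsewhere in the specification and do not appear in this diff, so the sketch takes them as parameters rather than guessing their contents. The function names are hypothetical.

```python
def context_id_lsb6(p1: int) -> int:
    """Six least-significant bits of the last uncompressed byte."""
    return p1 & 0x3F


def context_id_msb6(p1: int) -> int:
    """Six most-significant bits of the last uncompressed byte."""
    return p1 >> 2


def context_id_utf8(p1: int, p2: int, lut0, lut1) -> int:
    """Lut0[p1] | Lut1[p2]; the tables must be supplied by the caller."""
    return lut0[p1] | lut1[p2]


def context_id_signed(p1: int, p2: int, lut2) -> int:
    """(Lut2[p1] << 3) | Lut2[p2]; the table must be supplied by the caller."""
    return (lut2[p1] << 3) | lut2[p2]
```

For p1 = 0x41 the LSB6 context ID is 1 and the MSB6 context ID is 16. Both modes stay within the 0..63 range required for literal context IDs.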
@@ -1466,7 +1470,7 @@ the following:
 fill bits are not zero, then the stream should be
 rejected as invalid)
 2 bits: MNIBBLES, # of nibbles to represent the uncompressed
-length, encoded with the following fixed length code:
+length, encoded with the following fixed-length code:
 
 Value Bit Pattern
 ----- -----------
@@ -1563,7 +1567,7 @@ the following:
 
 2 bits: NPOSTFIX, parameter used in the distance coding
 
-4 bits: four most significant bits of NDIRECT, to get the
+4 bits: four most-significant bits of NDIRECT, to get the
 actual value of the parameter NDIRECT, left-shift
 this four bit number by NPOSTFIX bits
 
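The header fields in the hunk above can be combined in a short Python sketch. The function name `ndirect_from_header` and its argument names are hypothetical. The sketch only applies the left shift described in the hunk.

```python
def ndirect_from_header(ndirect_bits: int, npostfix: int) -> int:
    """Recover NDIRECT from its 4-bit header field and the 2-bit NPOSTFIX."""
    if not 0 <= ndirect_bits <= 15:
        raise ValueError("the NDIRECT field is 4 bits wide")
    if not 0 <= npostfix <= 3:
        raise ValueError("the NPOSTFIX field is 2 bits wide")
    return ndirect_bits << npostfix
```

The largest field values (15 and 3) give NDIRECT = 120, which matches the 0..120 range quoted for NDIRECT in the distance-coding section of this diff.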
@@ -1773,7 +1777,7 @@ specification, where non-conformant compressed data sequences should be discarde
 A possible attack against a system containing a decompressor
 implementation (e.g. a web browser) is to exploit a buffer
 overflow caused by an invalid compressed data. Therefore decompressor
-implementations should perform bound-checking for each memory access
+implementations should perform bounds-checking for each memory access
 that result from values decoded from the compressed stream.
 
 .ti 0
@@ -5678,8 +5682,8 @@ length is 122,784 bytes and the zlib CRC-32 of the byte sequence is
 9ce0a58de0a49ee0a4bee0a4aae0a4a8e0a495e0a4bee0a4b0e0a58de0a4b0e0
 a4b5e0a4bee0a488e0a4b8e0a495e0a58de0a4b0e0a4bfe0a4afe0a4a4e0a4be
 
-The number of words for each length is given by the following bit-
-depth array:
+The number of words for each length is given by the following
+bit-depth array:
 
 NDBITS := 0, 0, 0, 0, 10, 10, 11, 11, 10, 10,
 10, 10, 10, 9, 9, 8, 7, 7, 8, 7,