Merge pull request #339 from szabadka/master

Address review comments in the specification.
szabadka 2016-04-20 11:23:04 +02:00
commit 769308d6dd
2 changed files with 1440 additions and 1034 deletions

@ -7,9 +7,9 @@
.ds LF Alakuijala & Szabadka
.ds RF FORMFEED[Page %]
.ds LH Internet-Draft
.ds RH December 2015
.ds RH April 2016
.ds CH Brotli
.ds CF Expires June 10, 2016
.ds CF Expires October 19, 2016
.hy 0
.nh
.ad l
@ -18,13 +18,13 @@
.tl 'Network Working Group''J. Alakuijala'
.tl 'Internet-Draft''Z. Szabadka'
.tl 'Intended Status: Informational''Google, Inc'
.tl 'Expires: June 10, 2016''December 2015'
.tl 'Expires: October 19, 2016''April 2016'
.fi
.ce 2
Brotli Compressed Data Format
draft-alakuijala-brotli-08
draft-alakuijala-brotli-09
.fi
.in 3
@ -52,12 +52,12 @@ and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 10, 2016.
This Internet-Draft will expire on October 19, 2016.
.ti 0
Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the document
Copyright (c) 2016 IETF Trust and the persons identified as the document
authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
@ -220,9 +220,27 @@ a sequence of bytes, starting with the first byte at the
*right* margin and proceeding to the *left*, with the
most-significant bit of each byte on the left as usual, one would
be able to parse the result from right to left, with fixed-width
elements in the correct MSB-to-LSB order and prefix codes in
elements in the correct msb-to-lsb order and prefix codes in
bit-reversed order (i.e., with the first bit of the code in the
relative LSB position).
relative lsb position).
As an example, consider packing the following data elements into
a sequence of 3 bytes: 3-bit integer value 6, 4-bit integer value 2,
prefix code 110, prefix code 10, 12-bit integer value 3628.
.nf
byte 2 byte 1 byte 0
+--------+--------+--------+
|11100010|11000101|10010110|
+--------+--------+--------+
^ ^ ^ ^ ^
| | | | |
| | | | +------ integer value 6
| | | +---------- integer value 2
| | +-------------- prefix code 110
| +---------------- prefix code 10
+----------------------------- integer value 3628
.fi
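The packing in the example above can be reproduced with a short C++ sketch. The BitWriter class, its method names, and PackExample are hypothetical helpers introduced only for illustration; they are not part of the format. The sketch assumes that fixed-width integers are written starting from their least-significant bit and that prefix codes are written in code order, as the text describes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of the format): appends bits to a byte
// sequence, filling each byte from its least-significant bit upwards.
class BitWriter {
 public:
  // Writes the nbits low bits of value as a fixed-width integer, lsb first.
  void WriteBits(int nbits, uint32_t value) {
    for (int k = 0; k < nbits; ++k) {
      PutBit((value >> k) & 1);
    }
  }
  // Writes a prefix code given as a string of '0'/'1' characters in code
  // order, so the first bit of the code lands in the relative lsb position.
  void WritePrefixCode(const std::string& code) {
    for (char ch : code) {
      PutBit(ch == '1' ? 1 : 0);
    }
  }
  const std::vector<uint8_t>& bytes() const { return bytes_; }

 private:
  void PutBit(uint32_t bit) {
    if (bit_pos_ % 8 == 0) bytes_.push_back(0);
    bytes_.back() |= static_cast<uint8_t>(bit << (bit_pos_ % 8));
    ++bit_pos_;
  }
  std::vector<uint8_t> bytes_;
  size_t bit_pos_ = 0;
};

// Packs the five data elements of the example, in order.
std::vector<uint8_t> PackExample() {
  BitWriter w;
  w.WriteBits(3, 6);         // 3-bit integer value 6
  w.WriteBits(4, 2);         // 4-bit integer value 2
  w.WritePrefixCode("110");  // prefix code 110
  w.WritePrefixCode("10");   // prefix code 10
  w.WriteBits(12, 3628);     // 12-bit integer value 3628
  return w.bytes();
}
```

Under these assumptions, PackExample() yields byte 0 = 10010110, byte 1 = 11000101 and byte 2 = 11100010, matching the diagram.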
.ti 0
2. Compressed representation overview
@ -693,26 +711,26 @@ are compressed using a prefix code. The alphabet for code lengths
is as follows:
.nf
0 - 15: Represent code lengths of 0 - 15
16: Copy the previous non-zero code length 3 - 6 times
0..15: Represent code lengths of 0..15
16: Copy the previous non-zero code length 3..6 times
The next 2 bits indicate repeat length
(0 = 3, ... , 3 = 6)
If this is the first code length, or all previous
code lengths are zero, a code length of 8 is
repeated 3 - 6 times
repeated 3..6 times
A repeated code length code of 16 modifies the
repeat count of the previous one as follows:
repeat count = (4 * (repeat count - 2)) +
(3 - 6 on the next 2 bits)
(3..6 on the next 2 bits)
Example: Codes 7, 16 (+2 bits 11), 16 (+2 bits 10)
will expand to 22 code lengths of 7
(1 + 4 * (6 - 2) + 5)
17: Repeat a code length of 0 for 3 - 10 times.
17: Repeat a code length of 0 for 3..10 times.
(3 bits of length)
A repeated code length code of 17 modifies the
repeat count of the previous one as follows:
repeat count = (8 * (repeat count - 2)) +
(3 - 10 on the next 3 bits)
(3..10 on the next 3 bits)
.fi
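The repeat rules for codes 16 and 17 can be sketched as follows. This is an illustration under stated assumptions, not a reference decoder: the CodeLengthSymbol struct and the ExpandCodeLengths function are hypothetical names, the extra bits are assumed to have been read already, and no check is made that the expanded sequence fits the alphabet size.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical input element: a code length symbol (0..17) together with
// the value of its extra bits (unused for symbols 0..15).
struct CodeLengthSymbol {
  int symbol;
  int extra;
};

// Expands a sequence of code length symbols into plain code lengths.
std::vector<int> ExpandCodeLengths(const std::vector<CodeLengthSymbol>& in) {
  std::vector<int> out;
  int prev_nonzero = 8;         // used by 16 when no non-zero length precedes
  int last_repeat_symbol = -1;  // 16 or 17 if the previous symbol was a repeat
  int repeat = 0;               // repeat count accumulated by that symbol
  for (const CodeLengthSymbol& s : in) {
    if (s.symbol < 16) {
      out.push_back(s.symbol);
      if (s.symbol != 0) prev_nonzero = s.symbol;
      last_repeat_symbol = -1;
      repeat = 0;
      continue;
    }
    const int extra_bits = (s.symbol == 16) ? 2 : 3;
    const int value = (s.symbol == 16) ? prev_nonzero : 0;
    const int old_repeat = (last_repeat_symbol == s.symbol) ? repeat : 0;
    // A repeated 16 (or 17) rescales the previous count by 4 (or 8).
    const int new_repeat = (old_repeat > 0)
        ? ((old_repeat - 2) << extra_bits) + 3 + s.extra
        : 3 + s.extra;
    // Only the difference to the already emitted count is appended.
    out.insert(out.end(), static_cast<size_t>(new_repeat - old_repeat), value);
    repeat = new_repeat;
    last_repeat_symbol = s.symbol;
  }
  return out;
}
```

For the example in the table, the symbols 7, 16 (extra bits value 3), 16 (extra bits value 2) expand to 1 + 6 + 15 = 22 code lengths of 7.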
Note that a code of 16 that follows an immediately preceding 16 modifies the
@ -763,7 +781,7 @@ We can now define the format of the complex prefix code as follows:
is for symbol 4.
The code lengths of code length symbols are between 0 and
5, and they are represented with 2 - 4 bits according to
5, and they are represented with 2..4 bits according to
the variable length code above. A code length of 0 means
the corresponding code length symbol is not used.
@ -816,7 +834,7 @@ represented with a pair <distance code, extra bits>. The distance
code and the extra bits are encoded back-to-back, the distance code
is encoded using a prefix code over the distance alphabet,
while the extra bits value is encoded as a fixed-width integer
value. The number of extra bits can be 0 - 24, and it is dependent
value. The number of extra bits can be 0..24, and it is dependent
on the distance code.
To convert a distance code and associated extra bits to a backward
@ -913,7 +931,7 @@ extra bits are encoded back-to-back, the insert-and-copy length code
is encoded using a prefix code over the insert-and-copy length code
alphabet, while the extra bits values are encoded as fixed-width
integer values. The number of insert and copy extra bits can be
0 - 24, and they are dependent on the insert-and-copy length code.
0..24, and they are dependent on the insert-and-copy length code.
Some of the insert-and-copy length codes also express the fact that
the distance symbol of the distance in the same command is 0, i.e. the
@ -929,17 +947,17 @@ are as follows:
.nf
.KS
Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------ ---- ---- ------- ---- ---- -------
0 0 0 8 2 10-13 16 6 130-193
1 0 1 9 2 14-17 17 7 194-321
2 0 2 10 3 18-25 18 8 322-577
3 0 3 11 3 26-33 19 9 578-1089
4 0 4 12 4 34-49 20 10 1090-2113
5 0 5 13 4 50-65 21 12 2114-6209
6 1 6,7 14 5 66-97 22 14 6210-22593
7 1 8,9 15 5 98-129 23 24 22594-16799809
Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------- ---- ---- ------- ---- ---- -------
0 0 0 8 2 10..13 16 6 130..193
1 0 1 9 2 14..17 17 7 194..321
2 0 2 10 3 18..25 18 8 322..577
3 0 3 11 3 26..33 19 9 578..1089
4 0 4 12 4 34..49 20 10 1090..2113
5 0 5 13 4 50..65 21 12 2114..6209
6 1 6,7 14 5 66..97 22 14 6210..22593
7 1 8,9 15 5 98..129 23 24 22594..16799809
.KE
.fi
@ -948,17 +966,17 @@ of copy extra bits, and the range of copy lengths are as follows:
.nf
.KS
Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------ ---- ---- ------- ---- ---- -------
0 0 2 8 1 10,11 16 5 70-101
1 0 3 9 1 12,13 17 5 102-133
2 0 4 10 2 14-17 18 6 134-197
3 0 5 11 2 18-21 19 7 198-325
4 0 6 12 3 22-29 20 8 326-581
5 0 7 13 3 30-37 21 9 582-1093
6 0 8 14 4 38-53 22 10 1094-2117
7 0 9 15 4 54-69 23 24 2118-16779333
Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------- ---- ---- ------- ---- ---- -------
0 0 2 8 1 10,11 16 5 70..101
1 0 3 9 1 12,13 17 5 102..133
2 0 4 10 2 14..17 18 6 134..197
3 0 5 11 2 18..21 19 7 198..325
4 0 6 12 3 22..29 20 8 326..581
5 0 7 13 3 30..37 21 9 582..1093
6 0 8 14 4 38..53 22 10 1094..2117
7 0 9 15 4 54..69 23 24 2118..16779333
.KE
.fi
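Both length tables follow one pattern: the first length of a code equals the first length of code 0 plus the sizes of all earlier ranges. The sketch below uses that observation to turn a code and its extra bits value into a length. The array and function names are hypothetical, and the extra bits counts are transcribed from the two tables above.

```cpp
#include <cstdint>

// Number of extra bits per insert length code and per copy length code.
static const int kInsertExtraBits[24] = {0, 0, 0, 0, 0, 0, 1, 1,
                                         2, 2, 3, 3, 4, 4, 5, 5,
                                         6, 7, 8, 9, 10, 12, 14, 24};
static const int kCopyExtraBits[24] = {0, 0, 0, 0, 0, 0, 0, 0,
                                       1, 1, 2, 2, 3, 3, 4, 4,
                                       5, 5, 6, 7, 8, 9, 10, 24};

// Smallest length represented by a code: the first length of code 0 plus
// the number of values covered by all earlier codes.
uint32_t BaseLength(const int* extra_bits, uint32_t first_length, int code) {
  uint32_t base = first_length;
  for (int k = 0; k < code; ++k) {
    base += 1u << extra_bits[k];
  }
  return base;
}

// Insert lengths start at 0, copy lengths start at 2.
uint32_t InsertLength(int code, uint32_t extra) {
  return BaseLength(kInsertExtraBits, 0, code) + extra;
}

uint32_t CopyLength(int code, uint32_t extra) {
  return BaseLength(kCopyExtraBits, 2, code) + extra;
}
```

A production decoder would more likely store the base values in a second table; the loop is used here only to show how the ranges in the tables chain together.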
@ -969,34 +987,34 @@ and a copy length code, the following table can be used:
.KS
Insert
length Copy length code
code 0-7 8-15 16-23
+---------+---------+
| | |
0-7 | 0-63 | 64-127 | <--- distance symbol 0
| | |
+---------+---------+---------+
| | | |
0-7 | 128-191 | 192-255 | 384-447 |
| | | |
+---------+---------+---------+
| | | |
8-15 | 256-319 | 320-383 | 512-575 |
| | | |
+---------+---------+---------+
| | | |
16-23 | 448-511 | 576-639 | 640-703 |
| | | |
+---------+---------+---------+
code 0..7 8..15 16..23
+----------+----------+
| | |
0..7 | 0..63 | 64..127 | <--- distance symbol 0
| | |
+----------+----------+----------+
| | | |
0..7 | 128..191 | 192..255 | 384..447 |
| | | |
+----------+----------+----------+
| | | |
8..15 | 256..319 | 320..383 | 512..575 |
| | | |
+----------+----------+----------+
| | | |
16..23 | 448..511 | 576..639 | 640..703 |
| | | |
+----------+----------+----------+
.KE
.fi
First, look up the cell with the 64 value range containing the
insert-and-copy length code; this gives the insert length code and
the copy length code ranges, both 8 values long.
The copy length code within its range is determined by bits 0-2
(counted from the LSB) of the insert-and-copy length code.
The insert length code within its range is determined by bits 3-5
(counted from the LSB) of the insert-and-copy length code.
The copy length code within its range is determined by bits 0..2
(counted from the lsb) of the insert-and-copy length code.
The insert length code within its range is determined by bits 3..5
(counted from the lsb) of the insert-and-copy length code.
Given the insert length and copy length codes, the actual insert
and copy lengths can be obtained by reading the number of extra
bits given by the tables above.
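The table lookup described above can be sketched in C++ as follows. The struct and function names are hypothetical; the two range-start arrays are transcribed from the table, indexed by the 64-value cell (the insert-and-copy length code divided by 64).

```cpp
// Hypothetical result type for the split of an insert-and-copy length code.
struct InsertCopyCodes {
  int insert_code;              // insert length code, 0..23
  int copy_code;                // copy length code, 0..23
  bool implicit_distance_zero;  // true for the cells marked "distance symbol 0"
};

// Splits an insert-and-copy length code (0..703) into its two components.
InsertCopyCodes SplitInsertAndCopyCode(int code) {
  // First insert / copy length code of each 64-value cell, cells 0..10.
  static const int kInsertRangeStart[11] = {0, 0, 0, 0, 8, 8, 0, 16, 8, 16, 16};
  static const int kCopyRangeStart[11] = {0, 8, 0, 8, 0, 8, 16, 0, 16, 8, 16};
  const int cell = code >> 6;
  InsertCopyCodes r;
  r.insert_code = kInsertRangeStart[cell] + ((code >> 3) & 7);  // bits 3..5
  r.copy_code = kCopyRangeStart[cell] + (code & 7);             // bits 0..2
  r.implicit_distance_zero = code < 128;
  return r;
}
```

For instance, code 469 lies in the cell 448..511, so the insert length code range starts at 16 and the copy length code range at 0; bits 3..5 are 2 and bits 0..2 are 5, giving insert length code 18 and copy length code 5.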
@ -1020,8 +1038,8 @@ the block type that preceded the current type,
while a block type symbol 1 means that the new block type equals the current
block type plus one. If the current block type is the maximal possible,
then a block type symbol of 1 results in wrapping to a new block type of 0.
Block type symbols 2 - 257
represent block types 0 - 255 respectively. The previous and current block types
Block type symbols 2..257
represent block types 0..255 respectively. The previous and current block types
are initialized to 1 and 0, respectively, at the end of the
meta-block header.
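A minimal sketch of the block type update rule follows. BlockTypeState and NextBlockType are hypothetical names, and num_types stands for the number of block types of the category in question (for example NBLTYPESL for literals).

```cpp
// Hypothetical per-category state, initialized as described above.
struct BlockTypeState {
  int previous = 1;
  int current = 0;
};

// Applies one block type symbol (0..num_types + 1) and returns the new
// block type. num_types is the number of block types in the category.
int NextBlockType(BlockTypeState& s, int symbol, int num_types) {
  int next;
  if (symbol == 0) {
    next = s.previous;  // the type that preceded the current one
  } else if (symbol == 1) {
    next = (s.current + 1 == num_types) ? 0 : s.current + 1;  // wrap around
  } else {
    next = symbol - 2;  // symbols 2..257 name block types 0..255
  }
  s.previous = s.current;
  s.current = next;
  return next;
}
```

With three block types, the symbol sequence 1, 1, 1, 0 starting from the initial state produces the block types 1, 2, 0, 2.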
@ -1051,24 +1069,24 @@ Each block count in the compressed data is represented with a pair
bits are encoded back-to-back, the block count code is encoded using
a prefix code over the block count code alphabet, while the extra
bits value is encoded as a fixed-width integer value. The number of
extra bits can be 0 - 24, and it is dependent on the block count
extra bits can be 0..24, and it is dependent on the block count
code. The symbols of the block count code alphabet, along with the
number of extra bits, and the range of block counts are as follows:
.nf
.KS
Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------ ---- ---- ------- ---- ---- -------
0 2 1-4 9 4 65-80 18 7 369-496
1 2 5-8 10 4 81-96 19 8 497-752
2 2 9-12 11 4 97-112 20 9 753-1264
3 2 13-16 12 5 113-144 21 10 1265-2288
4 3 17-24 13 5 145-176 22 11 2289-4336
5 3 25-32 14 5 177-208 23 12 4337-8432
6 3 33-40 15 5 209-240 24 13 8433-16624
7 3 41-48 16 6 241-304 25 24 16625-16793840
8 4 49-64 17 6 305-368
Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------- ---- ---- ------- ---- ---- -------
0 2 1..4 9 4 65..80 18 7 369..496
1 2 5..8 10 4 81..96 19 8 497..752
2 2 9..12 11 4 97..112 20 9 753..1264
3 2 13..16 12 5 113..144 21 10 1265..2288
4 3 17..24 13 5 145..176 22 11 2289..4336
5 3 25..32 14 5 177..208 23 12 4337..8432
6 3 33..40 15 5 209..240 24 13 8433..16624
7 3 41..48 16 6 241..304 25 24 16625..16793840
8 4 49..64 17 6 305..368
.KE
.fi
@ -1262,9 +1280,9 @@ now define the format of the context map (the same format is used
for literal and distance context maps):
.nf
1-5 bits: RLEMAX, 0 is encoded with one 0 bit, and values
1 - 16 are encoded with bit pattern xxxx1 (so 01001
is 5)
1..5 bits: RLEMAX, 0 is encoded with one 0 bit, and values
1..16 are encoded with bit pattern xxxx1 (so 01001
is 5)
Prefix code with alphabet size NTREES + RLEMAX
@ -1398,7 +1416,7 @@ The form of these elementary transforms is as follows:
.fi
For the purposes of UppercaseAll, word is parsed into UTF-8
characters and converted to upper-case by taking 1 - 3 bytes at a time,
characters and converted to upper-case by taking 1..3 bytes at a time,
using the algorithm below:
.nf
@ -1447,10 +1465,10 @@ previous sections.
The stream header has only the following one field:
.nf
1-7 bits: WBITS, a value in the range 10 - 24, encoded with
the following variable length code (as it appears in
the compressed data, where the bits are parsed from
right to left):
1..7 bits: WBITS, a value in the range 10..24, encoded with
the following variable length code (as it appears in
the compressed data, where the bits are parsed from
right to left):
Value Bit Pattern
----- -----------
@ -1527,7 +1545,7 @@ the following:
zeros, then the stream should be rejected
as invalid)
0 - 7 bits: fill bits until the next byte boundary,
0..7 bits: fill bits until the next byte boundary,
must be all zeros
MSKIPLEN bytes of metadata, not part of the
@ -1546,7 +1564,7 @@ the following:
ISLAST bit is not set (if the ignored bits are not
all zeros, the stream should be rejected as invalid)
1-11 bits: NBLTYPESL, # of literal block types, encoded with
1..11 bits: NBLTYPESL, # of literal block types, encoded with
the following variable length code (as it appears in
the compressed data, where the bits are parsed from
right to left, so 0110111 has the value 12):
@ -1555,13 +1573,13 @@ the following:
----- -----------
1 0
2 0001
3-4 x0011
5-8 xx0101
9-16 xxx0111
17-32 xxxx1001
33-64 xxxxx1011
65-128 xxxxxx1101
129-256 xxxxxxx1111
3..4 x0011
5..8 xx0101
9..16 xxx0111
17..32 xxxx1001
33..64 xxxxx1011
65..128 xxxxxx1101
129..256 xxxxxxx1111
Prefix code over the block type code alphabet for
literal block types, appears only if NBLTYPESL >= 2
@ -1572,8 +1590,8 @@ the following:
Block count code + extra bits for first literal
block count, appears only if NBLTYPESL >= 2
1-11 bits: NBLTYPESI, # of insert-and-copy block types, encoded
with the same variable length code as above
1..11 bits: NBLTYPESI, # of insert-and-copy block types, encoded
with the same variable length code as above
Prefix code over the block type code alphabet for
insert-and-copy block types, appears only if NBLTYPESI >= 2
@ -1584,8 +1602,8 @@ the following:
Block count code + extra bits for first insert-and-copy
block count, appears only if NBLTYPESI >= 2
1-11 bits: NBLTYPESD, # of distance block types, encoded
with the same variable length code as above
1..11 bits: NBLTYPESD, # of distance block types, encoded
with the same variable length code as above
Prefix code over the block type code alphabet for
distance block types, appears only if NBLTYPESD >= 2
@ -1604,15 +1622,15 @@ the following:
NBLTYPESL x 2 bits: context mode for each literal block type
1-11 bits: NTREESL, # of literal prefix trees, encoded
with the same variable length code as NBLTYPESL
1..11 bits: NTREESL, # of literal prefix trees, encoded
with the same variable length code as NBLTYPESL
Literal context map, encoded as described in Section 7.3.,
appears only if NTREESL >= 2, otherwise the context map
has only zero values
1-11 bits: NTREESD, # of distance prefix trees, encoded
with the same variable length code as NBLTYPESD
1..11 bits: NTREESD, # of distance prefix trees, encoded
with the same variable length code as NBLTYPESD
Distance context map, encoded as described in Section 7.3.,
appears only if NTREESD >= 2, otherwise the context map
@ -1806,19 +1824,183 @@ reference with <length = 5, distance = 2> adds X,Y,X,Y,X to the
uncompressed stream.
.ti 0
11. Security Considerations
11. Considerations for compressor implementations
Since the intent of this document is to define the brotli compressed data format
without reference to any particular compression algorithm, the material in this
section is not part of the definition of the format, and a compressor need not
follow it in order to be compliant.
.ti 0
11.1. Trivial compressor
In this section we present a very simple algorithm that produces a valid brotli
stream representing an arbitrary sequence of uncompressed bytes in the form of
the following C++ language function.
.nf
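#include <cstddef>
#include <string>

using std::string;

string BrotliCompressTrivial(const string& u) {
   if (u.empty()) {
      // WBITS = 16, ISLAST = 1, ISLASTEMPTY = 1
      return string(1, 6);
   }
   size_t i;
   string c;
   // WBITS = 16, then an empty metadata meta-block for byte alignment
   c.append(1, 12);
   for (i = 0; i + 65535 < u.size(); i += 65536) {
      // Header of an uncompressed meta-block with MLEN = 65536
      c.append(1, 248);
      c.append(1, 255);
      c.append(1, 15);
      c.append(&u[i], 65536);
   }
   if (i < u.size()) {
      // Header of an uncompressed meta-block with MLEN = r + 1
      int r = static_cast<int>(u.size() - i - 1);
      c.append(1, (r & 31) << 3);
      c.append(1, (r >> 5) & 255);
      c.append(1, 8 + (r >> 13));
      c.append(&u[i], r + 1);
   }
   // Empty last meta-block: ISLAST = 1, ISLASTEMPTY = 1
   c.append(1, 3);
   return c;
}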
.fi
Note that this simple algorithm does not actually compress data, that is, the
brotli representation will always be bigger than the original, but it
shows that every sequence of N uncompressed bytes can be represented with a
valid brotli stream that is not longer than N + (3 * (N >> 16) + 5) bytes.
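The bound can be checked against the exact output length of the function above: one stream header byte, three header bytes per uncompressed meta-block of at most 65536 bytes, the N data bytes, and one final byte. The following sketch (function names are hypothetical) computes both quantities.

```cpp
#include <cstddef>

// Exact length of the stream produced by the trivial algorithm for n
// uncompressed bytes.
size_t TrivialStreamSize(size_t n) {
  if (n == 0) return 1;                          // the single byte 6
  const size_t meta_blocks = (n + 65535) >> 16;  // ceil(n / 65536)
  return 1 + 3 * meta_blocks + n + 1;
}

// Upper bound stated in the text.
size_t TrivialStreamBound(size_t n) {
  return n + (3 * (n >> 16) + 5);
}
```

Since ceil(N / 65536) is at most (N >> 16) + 1, the exact size never exceeds the bound; the two are equal whenever N is not a multiple of 65536.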
.ti 0
11.2. Aligning compressed meta-blocks to byte boundaries
As described in Section 9., only those meta-blocks that immediately follow an
uncompressed meta-block or a metadata meta-block are guaranteed to start on a
byte boundary. In some applications, it might be required that every
non-metadata meta-block starts on a byte boundary. This can be achieved by
appending an empty metadata meta-block after every non-metadata meta-block that
does not end on a byte boundary.
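As an illustration, the sketch below appends an empty metadata meta-block at an arbitrary bit position. It assumes the metadata meta-block header layout of Section 9. (ISLAST, MNIBBLES, reserved bit, MSKIPBYTES, then fill bits) and that the caller keeps the unused high bits of the last byte cleared; the function name and its calling convention are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends an empty metadata meta-block to a partially written stream.
// bytes holds ceil(bit_pos / 8) bytes, with unused high bits of the last
// byte set to zero. Returns the new bit position, a multiple of 8.
size_t AppendEmptyMetadataBlock(std::vector<uint8_t>& bytes, size_t bit_pos) {
  // Header fields as {width, value}, each written lsb first.
  const int fields[4][2] = {
      {1, 0},  // ISLAST = 0
      {2, 3},  // MNIBBLES code 3: zero length, i.e. a metadata meta-block
      {1, 0},  // reserved bit, must be zero
      {2, 0},  // MSKIPBYTES = 0: no metadata bytes follow
  };
  for (const auto& f : fields) {
    for (int k = 0; k < f[0]; ++k, ++bit_pos) {
      if (bit_pos % 8 == 0) bytes.push_back(0);
      bytes.back() |= static_cast<uint8_t>(((f[1] >> k) & 1) << (bit_pos % 8));
    }
  }
  // The fill bits up to the next byte boundary are already zero.
  return (bit_pos + 7) / 8 * 8;
}
```

Called right after a one-bit stream header (WBITS = 16), this produces the single byte 12, which is the first byte emitted by the trivial compressor in Section 11.1.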
.ti 0
11.3. Creating self-contained parts within the compressed data
In some encoder implementations it might be required to make a sequence of
bytes within a brotli stream self-contained, that is, such that they
can be decompressed independently from previous parts of the compressed data.
This is a useful feature for three reasons. First, if a large compressed file
is damaged, it is possible to recover some of the file after the damage.
Second, it is useful when doing differential transfer of compressed data. If
a sequence of uncompressed bytes is unchanged and compressed independently
from previous data, then the compressed representation may also be
unchanged and can therefore be transferred very cheaply. Third, if sequences of
uncompressed bytes are compressed independently, it allows for parallel
compression of these byte sequences within the same file, in addition
to parallel compression of multiple files.
Given two sequences of uncompressed bytes, U0 and U1, we will now describe how
to create two sequences of compressed bytes, C0 and C1, such that the
concatenation of C0 and C1 is a valid brotli stream, and that C0 and C1
(together with the first byte of C0 that contains the window size)
can be decompressed independently from each other to U0 and U1.
When compressing the byte sequence U0 to produce C0, we can use any compressor
that works on the complete set of uncompressed bytes U0, with the following two
changes. First, the ISLAST bit of the last meta-block of C0 must not be set.
Second, C0 must end at a byte-boundary, which can be ensured by appending an
empty metadata meta-block to it, as in Section 11.2.
When compressing the byte sequence U1 to produce C1, we can use any compressor
that starts a new meta-block at the beginning of U1 within the U0+U1 input
stream, with the following two changes. First, backward distances in C1 must
not refer to static dictionary words or uncompressed bytes in U0.
Even if a sequence of bytes in U1 would match a static dictionary word, or a
sequence of bytes that overlaps U0, the compressor must represent this
sequence of bytes with a combination of literal insertions and backward
references to bytes in U1 instead. Second, the ring
buffer of last four distances must be replenished first with distances in C1
before using it to encode other distances in C1. Note that both compressors
producing C0 and C1 have to use the same window size, but the stream header is
emitted only by the compressor that produces C0.
Note that this method can be easily generalized to more than two sequences
of uncompressed bytes.
.ti 0
12. Security Considerations
As with any compressed file format, decompressor implementations should
handle all compressed data byte sequences, not only those that conform to this
specification, where non-conformant compressed data sequences should be discarded.
specification, where non-conformant compressed data sequences should be
discarded.
A possible attack against a system containing a decompressor
implementation (e.g. a web browser) is to exploit a buffer
overflow caused by an invalid compressed data. Therefore decompressor
implementation (e.g. a web browser) is to exploit a buffer overflow
triggered by invalid compressed data. Therefore decompressor
implementations should perform bounds-checking for each memory access
that result from values decoded from the compressed stream.
that results from values decoded from the compressed stream and derivatives
thereof.
Another possible attack against a system containing a decompressor
implementation is to provide it (either valid or invalid) compressed data
that makes the decompressor system's resource consumption (CPU, memory, or
storage) disproportionately large compared to the size of the
compressed data. In addition to the size of the compressed data, the amount of
CPU, memory and storage required to decompress a single compressed meta-block
within a brotli stream is controlled by the following two parameters: the size of
the uncompressed meta-block, which is encoded at the start of the compressed
meta-block, and the size of the sliding window, which is encoded at the start
of the brotli stream. Decompressor implementations in systems where
memory or storage is constrained should perform a sanity-check on these two
parameters. The uncompressed meta-block size that was decoded from the
compressed stream should be compared against either a hard limit, given by the
system's constraints or some expectation about the uncompressed data, or against
a certain multiple of the size of the compressed data. If the uncompressed
meta-block size is determined to be too high, the compressed data should be
rejected. Likewise, when the complete uncompressed stream is kept in the
system containing the decompressor implementation, the total uncompressed
size of the stream should be checked before decompressing each additional
meta-block. If the size of the sliding window that was decoded from the start
of the compressed stream is greater than a certain soft limit, then the
decompressor implementation should, at first, allocate a smaller sliding
window that fits the first uncompressed meta-block, and afterwards, before
decompressing each additional meta-block, it should increase the size of the
sliding window until the sliding window size specified in the compressed data
is reached.
Correspondingly, possible attacks against a system containing a compressor
implementation (e.g. a web server) are to exploit a buffer overflow or cause
disproportionately large resource consumption by providing e.g. uncompressible
data.
As described in Section 11.1., an output buffer of
.nf
S(N) = N + (3 * (N >> 16) + 5)
.fi
bytes is sufficient to hold a valid compressed brotli
stream representing an arbitrary sequence of N uncompressed bytes.
Therefore compressor implementations should allocate at least S(N) bytes of
output buffer before compressing N bytes of data with unknown compressibility
and should perform bounds-checking for each write into this output buffer.
If their output buffer is full, compressor implementations should
revert to the trivial compression algorithm described in Section 11.1.
The resource consumption of a compressor implementation for a particular input
data depends mostly on the algorithm used to find backward matches and on the
algorithm used to construct context maps and prefix codes and only to a lesser
extent on the input data itself. If the system containing a compressor
implementation is overloaded, a possible way to reduce resource usage is to
switch to simpler algorithms for backward reference search and prefix code
construction, or to fall back to the trivial compression algorithm described in
Section 11.1.
A possible attack against a system that sends compressed data over an encrypted
channel is the following. An attacker who can repeatedly mix arbitrary
(attacker-supplied) data with secret data (passwords, cookies) and observe the
length of the ciphertext can potentially reconstruct the secret data. To
protect against this kind of attack, applications should not mix sensitive data
with non-sensitive, potentially attacker-supplied data in the same compressed
stream.
.ti 0
12. IANA Considerations
13. IANA Considerations
The "HTTP Content Coding Registry" has been updated with the
registration below:
@ -1834,7 +2016,7 @@ registration below:
.fi
.ti 0
13. Informative References
14. Informative References
.in 14
.ti 3
@ -1858,7 +2040,7 @@ http://www.ietf.org/rfc/rfc1951.txt
.in 3
.ti 0
14. Source code
15. Source code
Source code for a C language implementation of a brotli compliant
decompressor and a C++ language implementation of a compressor is
@ -1866,7 +2048,7 @@ available in the brotli open-source project:
https://github.com/google/brotli
.ti 0
15. Acknowledgments
16. Acknowledgments
The authors would like to thank Mark Adler, Robert Obryk, Thomas
Pickert, and Joe Tsai for providing helpful review comments,