mirror of
https://github.com/google/brotli.git
synced 2024-11-21 19:20:09 +00:00
1447345cbb
This change removes the redundant HCLEN, HLENINC and HLEN fields from the encoding of the complex Huffman codes and derives these from an invariant of the code length sequence. Based on a patch by Robert Obryk.
1259 lines
56 KiB
Plaintext
1259 lines
56 KiB
Plaintext
J. Alakuijala
|
|
Z. Szabadka
|
|
______ _______ _______ _______ _________
|
|
( __ \ ( ____ )( ___ )( ____ \\__ __/
|
|
| ( \ )| ( )|| ( ) || ( \/ ) (
|
|
| | ) || (____)|| (___) || (__ | |
|
|
| | | || __)| ___ || __) | |
|
|
| | ) || (\ ( | ( ) || ( | |
|
|
| (__/ )| ) \ \__| ) ( || ) | |
|
|
(______/ |/ \__/|/ \||/ )_(
|
|
|
|
|
|
DRAFT of
|
|
Brotli Compression Algorithm Compressed Data Format Specification 1.0
|
|
|
|
Status of This Memo
|
|
|
|
This memo provides information for the Internet community. This memo
|
|
does not specify an Internet standard of any kind. Distribution of
|
|
this memo is unlimited.
|
|
|
|
Notices
|
|
|
|
Copyright (c) 2013 J. Alakuijala and Z. Szabadka
|
|
|
|
Permission is granted to copy and distribute this document for any
|
|
purpose and without charge, including translations into other
|
|
languages and incorporation into compilations, provided that the
|
|
copyright notice and this notice are preserved, and that any
|
|
substantive changes or deletions from the original are clearly
|
|
marked.
|
|
|
|
Abstract
|
|
|
|
This specification defines a lossless compressed data format that
|
|
compresses data using a combination of the LZ77 algorithm and Huffman
|
|
coding, with efficiency comparable to the best currently available
|
|
general-purpose compression methods.
|
|
|
|
1. Introduction
|
|
|
|
1.1. Purpose
|
|
|
|
The purpose of this specification is to define a lossless
|
|
compressed data format that:
|
|
* Is independent of CPU type, operating system, file system,
|
|
and character set, and hence can be used for interchange;
|
|
* Can be produced or consumed, even for an arbitrarily long
|
|
sequentially presented input data stream, using only an a
|
|
priori bounded amount of intermediate storage, and hence
|
|
can be used in data communications or similar structures,
|
|
such as Unix filters;
|
|
* Compresses data with efficiency comparable to the best
|
|
currently available general-purpose compression methods,
|
|
and in particular considerably better than the gzip program;
|
|
* Decompresses much faster than the LZMA implementations.
|
|
|
|
The data format defined by this specification does not attempt to:
|
|
* Allow random access to compressed data;
|
|
* Compress specialized data (e.g., raster graphics) as well
|
|
as the best currently available specialized algorithms.
|
|
|
|
1.2. Intended audience
|
|
|
|
This specification is intended for use by software implementers
|
|
to compress data into and/or decompress data from "brotli" format.
|
|
|
|
The text of the specification assumes a basic background in
|
|
programming at the level of bits and other primitive data
|
|
representations. Familiarity with the technique of Huffman coding
|
|
is helpful but not required.
|
|
|
|
This specification uses heavily the notations and terminology
|
|
introduced in the DEFLATE format specification (RFC 1951, see
|
|
reference [3] below). For the sake of completeness, we always
|
|
include the whole text of the relevant parts of RFC 1951,
|
|
therefore familiarity with the DEFLATE format is helpful but not
|
|
required.
|
|
|
|
1.3. Scope
|
|
|
|
The specification specifies a method for representing a sequence
|
|
of bytes as a (usually shorter) sequence of bits, and a method for
|
|
packing the latter bit sequence into bytes.
|
|
|
|
1.4. Compliance
|
|
|
|
Unless otherwise indicated below, a compliant decompressor must be
|
|
able to accept and decompress any data set that conforms to all
|
|
the specifications presented here. A compliant compressor must
|
|
produce data sets that conform to all the specifications presented
|
|
here.
|
|
|
|
1.5. Definitions of terms and conventions used
|
|
|
|
Byte: 8 bits stored or transmitted as a unit (same as an octet).
|
|
For this specification, a byte is exactly 8 bits, even on machines
|
|
which store a character on a number of bits different from eight.
|
|
See below for the numbering of bits within a byte.
|
|
|
|
String: a sequence of arbitrary bytes.
|
|
|
|
Bytes stored within a computer do not have a "bit order", since
|
|
they are always treated as a unit. However, a byte considered as
|
|
an integer between 0 and 255 does have a most- and least-
|
|
significant bit, and since we write numbers with the most-
|
|
significant digit on the left, we also write bytes with the most-
|
|
significant bit on the left. In the diagrams below, we number the
|
|
bits of a byte so that bit 0 is the least-significant bit, i.e.,
|
|
the bits are numbered:
|
|
|
|
+--------+
|
|
|76543210|
|
|
+--------+
|
|
|
|
Within a computer, a number may occupy multiple bytes. All
|
|
multi-byte numbers in the format described here are stored with
|
|
the least-significant byte first (at the lower memory address).
|
|
For example, the decimal number 520 is stored as:
|
|
|
|
0 1
|
|
+--------+--------+
|
|
|00001000|00000010|
|
|
+--------+--------+
|
|
^ ^
|
|
| |
|
|
| + more significant byte = 2 x 256
|
|
+ less significant byte = 8
|
|
|
|
|
|
1.5.1. Packing into bytes
|
|
|
|
This document does not address the issue of the order in which
|
|
bits of a byte are transmitted on a bit-sequential medium,
|
|
since the final data format described here is byte- rather than
|
|
bit-oriented. However, we describe the compressed block format
|
|
below as a sequence of data elements of various bit
|
|
lengths, not a sequence of bytes. We must therefore specify
|
|
how to pack these data elements into bytes to form the final
|
|
compressed byte sequence:
|
|
|
|
* Data elements are packed into bytes in order of
|
|
increasing bit number within the byte, i.e., starting
|
|
with the least-significant bit of the byte.
|
|
* Data elements other than Huffman codes are packed
|
|
starting with the least-significant bit of the data
|
|
element.
|
|
* Huffman codes are packed starting with the most-
|
|
significant bit of the code.
|
|
|
|
In other words, if one were to print out the compressed data as
|
|
a sequence of bytes, starting with the first byte at the
|
|
*right* margin and proceeding to the *left*, with the most-
|
|
significant bit of each byte on the left as usual, one would be
|
|
able to parse the result from right to left, with fixed-width
|
|
elements in the correct MSB-to-LSB order and Huffman codes in
|
|
bit-reversed order (i.e., with the first bit of the code in the
|
|
relative LSB position).
|
|
|
|
2. Compressed representation overview
|
|
|
|
A compressed data set consists of a header and a series of meta-
|
|
blocks corresponding to successive meta-blocks of input data. The
|
|
meta-block sizes are limited to bytes and the maximum meta-block size
|
|
is 268,435,456 bytes.
|
|
|
|
The header contains the size of a sliding window on the input data
|
|
that is sufficient to keep on the intermediate storage at any given
|
|
point during decoding the stream.
|
|
|
|
Each meta-block is compressed using a combination of the LZ77
|
|
algorithm (Lempel-Ziv 1977, see reference [2] below) and Huffman
|
|
coding. The Huffman trees for each block are independent of those for
|
|
previous or subsequent blocks; the LZ77 algorithm may use a
|
|
reference to a duplicated string occurring in a previous meta-block,
|
|
up to sliding window size input bytes before.
|
|
|
|
Each meta-block consists of two parts: a meta-block header that
|
|
describes the representation of the compressed data part, and a
|
|
compressed data part. The compressed data consists of a series of
|
|
commands. Each command consists of two parts: a sequence of literal
|
|
bytes (of strings that have not been detected as duplicated within
|
|
the sliding window), and a pointer to a duplicated string,
|
|
represented as a pair <length, backward distance>.
|
|
|
|
Each command in the compressed data is represented using three kinds
|
|
of Huffman codes: one kind of code tree for the literal sequence
|
|
lengths (also referred to as literal insertion lengths) and backward
|
|
copy lengths (that is, a single code word represents two lengths,
|
|
one of the literal sequence and one of the backward copy), a separate
|
|
kind of code tree for literals, and a third kind of code tree for
|
|
distances. The code trees for each meta-block appear in a compact
|
|
form just before the compressed data in the meta-block header.
|
|
|
|
The sequence of each type of value in the representation of a command
|
|
(insert-and-copy lengths, literals and distances) within a meta-
|
|
block is further divided into blocks. In the "brotli" format, blocks
|
|
are not contiguous chunks of compressed data, but rather the pieces
|
|
of compressed data belonging to a block are interleaved with pieces
|
|
of data belonging to other blocks. Each meta-block can be logically
|
|
decomposed into a series of insert-and-copy length blocks, a series
|
|
of literal blocks and a series of distance blocks. These are also
|
|
called the three block categories: a meta-block has a series of
|
|
blocks for each block category. Note that the physical structure of
|
|
the meta-block is a series of commands, while the three series of
|
|
blocks is the logical structure. Consider the following example:
|
|
|
|
(IaC0, L0, L1, L2, D0)(IaC1, D1)(IaC2, L3, L4, D2)(IaC3, L5, D3)
|
|
|
|
The meta-block here has 4 commands, and each three types of symbols
|
|
within these commands can be rearranged for example into the
|
|
following logical block structure:
|
|
|
|
[IaC0, IaC1][IaC2, IaC3] <-- block types 0 and 1
|
|
|
|
[L0, L1][L2, L3, L4][L5] <-- block types 0, 1, and 0
|
|
|
|
[D0][D1, D2, D3] <-- block types 0 and 1
|
|
|
|
The subsequent blocks within each block category must have different
|
|
block types, but blocks further away in the block sequence can have
|
|
the same types. The block types are numbered from 0 to the maximum
|
|
block type number of 255 and the first block of each block category
|
|
must have type 0. The block structure of a meta-block is represented
|
|
by the sequence of block-switch commands for each block category,
|
|
where a block-switch command is a pair <block type, block length>.
|
|
The block-switch commands are represented in the compressed data
|
|
before the start of each new block using a Huffman code tree for
|
|
block types and a separate Huffman code tree for block lengths for
|
|
each block category. In the above example the physical layout of the
|
|
meta-block is the following:
|
|
|
|
IaC0 L0 L1 LBlockSwitch(1, 3) L2 D0 IaC1 DBlockSwitch(1, 1) D1
|
|
IaCBlockSwitch(1, 2) IaC2 L3 L4 D2 IaC3 LBlockSwitch(0, 1) D3
|
|
|
|
Note that the block switch commands for the first blocks are not part
|
|
of the meta-block compressed data part, they are encoded in the meta-
|
|
block header. The code trees for block types and lengths (total of
|
|
six Huffman code trees) appear in a compact form in the meta-block
|
|
header.
|
|
|
|
Each type of value (insert-and-copy lengths, literals and distances)
|
|
can be encoded with any Huffman tree from a collection of Huffman
|
|
trees of the same kind appearing in the meta-block header. The
|
|
particular Huffman tree used can depend on two factors: the block type
|
|
of the block the value appears in, and the context of the value. In
|
|
the case of the literals, the context is the previous two bytes in
|
|
the input data, and in the case of distances, the context is the copy
|
|
length from the same command. For insert-and-copy lengths, no context
|
|
is used and the Huffman tree depends only on the block type (in fact,
|
|
the index of the Huffman tree is the block type number). In the case
|
|
of literals and distances, the context is mapped to a context ID in
|
|
the rage [0, 63] for literals and [0, 3] for distances and the matrix
|
|
of the Huffman tree indices for each block type and context ID,
|
|
called the context map, is encoded in a compact form in the meta-
|
|
block header.
|
|
|
|
In addition to the parts listed above (Huffman code trees for insert-
|
|
and-copy lengths, literals, distances, block types and block lengths
|
|
and the context map), the meta-block header contains the number of
|
|
input bytes in the meta-block and two additional parameters used in
|
|
the representation of copy distances (number of "postfix bits" and
|
|
number of direct distance codes).
|
|
|
|
3. Compressed representation of Huffman codes
|
|
|
|
3.1. Introduction to prefix and Huffman coding
|
|
|
|
Prefix coding represents symbols from an a priori known alphabet
|
|
by bit sequences (codes), one code for each symbol, in a manner
|
|
such that different symbols may be represented by bit sequences of
|
|
different lengths, but a parser can always parse an encoded string
|
|
unambiguously symbol-by-symbol.
|
|
|
|
We define a prefix code in terms of a binary tree in which the two
|
|
edges descending from each non-leaf node are labeled 0 and 1 and
|
|
in which the leaf nodes correspond one-for-one with (are labeled
|
|
with) the symbols of the alphabet; then the code for a symbol is
|
|
the sequence of 0's and 1's on the edges leading from the root to
|
|
the leaf labeled with that symbol. For example:
|
|
|
|
/\ Symbol Code
|
|
0 1 ------ ----
|
|
/ \ A 00
|
|
/\ B B 1
|
|
0 1 C 011
|
|
/ \ D 010
|
|
A /\
|
|
0 1
|
|
/ \
|
|
D C
|
|
|
|
A parser can decode the next symbol from an encoded input stream
|
|
by walking down the tree from the root, at each step choosing the
|
|
edge corresponding to the next input bit.
|
|
|
|
Given an alphabet with known symbol frequencies, the Huffman
|
|
algorithm allows the construction of an optimal prefix code (one
|
|
which represents strings with those symbol frequencies using the
|
|
fewest bits of any possible prefix codes for that alphabet). Such
|
|
a code is called a Huffman code. (See reference [1] in Chapter 5,
|
|
references for additional information on Huffman codes.)
|
|
|
|
Note that in the "brotli" format, the Huffman codes for the
|
|
various alphabets must not exceed certain maximum code lengths.
|
|
This constraint complicates the algorithm for computing code
|
|
lengths from symbol frequencies. Again, see Chapter 5, references
|
|
for details.
|
|
|
|
3.2. Use of Huffman coding in the "brotli" format
|
|
|
|
The Huffman codes used for each alphabet in the "brotli" format
|
|
are canonical Huffman codes, which have two additional rules:
|
|
|
|
* All codes of a given bit length have lexicographically
|
|
consecutive values, in the same order as the symbols they
|
|
represent;
|
|
|
|
* Shorter codes lexicographically precede longer codes.
|
|
|
|
We could recode the example above to follow this rule as follows,
|
|
assuming that the order of the alphabet is ABCD:
|
|
|
|
Symbol Code
|
|
------ ----
|
|
A 10
|
|
B 0
|
|
C 110
|
|
D 111
|
|
|
|
I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
|
|
lexicographically consecutive.
|
|
|
|
Given this rule, we can define the canonical Huffman code for an
|
|
alphabet just by giving the bit lengths of the codes for each
|
|
symbol of the alphabet in order; this is sufficient to determine
|
|
the actual codes. In our example, the code is completely defined
|
|
by the sequence of bit lengths (2, 1, 3, 3). The following
|
|
algorithm generates the codes as integers, intended to be read
|
|
from most- to least-significant bit. The code lengths are
|
|
initially in tree[I].Len; the codes are produced in tree[I].Code.
|
|
|
|
1) Count the number of codes for each code length. Let
|
|
bl_count[N] be the number of codes of length N, N >= 1.
|
|
|
|
2) Find the numerical value of the smallest code for each
|
|
code length:
|
|
|
|
code = 0;
|
|
bl_count[0] = 0;
|
|
for (bits = 1; bits <= MAX_BITS; bits++) {
|
|
code = (code + bl_count[bits-1]) << 1;
|
|
next_code[bits] = code;
|
|
}
|
|
|
|
3) Assign numerical values to all codes, using consecutive
|
|
values for all codes of the same length with the base
|
|
values determined at step 2. Codes that are never used
|
|
(which have a bit length of zero) must not be assigned a
|
|
value.
|
|
|
|
for (n = 0; n <= max_code; n++) {
|
|
len = tree[n].Len;
|
|
if (len != 0) {
|
|
tree[n].Code = next_code[len];
|
|
next_code[len]++;
|
|
}
|
|
}
|
|
|
|
Example:
|
|
|
|
Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3, 3,
|
|
2, 4, 4). After step 1, we have:
|
|
|
|
N bl_count[N]
|
|
- -----------
|
|
2 1
|
|
3 5
|
|
4 2
|
|
|
|
Step 2 computes the following next_code values:
|
|
|
|
N next_code[N]
|
|
- ------------
|
|
1 0
|
|
2 0
|
|
3 2
|
|
4 14
|
|
|
|
Step 3 produces the following code values:
|
|
|
|
Symbol Length Code
|
|
------ ------ ----
|
|
A 3 010
|
|
B 3 011
|
|
C 3 100
|
|
D 3 101
|
|
E 3 110
|
|
F 2 00
|
|
G 4 1110
|
|
H 4 1111
|
|
|
|
3.3. Alphabet sizes
|
|
|
|
Huffman codes are used for different purposes in the "brotli"
|
|
format, and each purpose has a different alphabet size. For
|
|
literal codes the alphabet size is 256. For insert-and-copy
|
|
length codes the alphabet size is 704. For block length codes,
|
|
the alphabet size is 26. For distance codes, block type codes and
|
|
the Huffman codes used in compressing the context map, the
|
|
alphabet size is dynamic and is based on other parameters.
|
|
|
|
3.4. Simple Huffman codes
|
|
|
|
The first bit of the compressed representation of each Huffman
|
|
code distinguishes between simple and complex Huffman codes. If
|
|
the first bit is 1, then a simple, otherwise a complex Huffman
|
|
code follows.
|
|
|
|
A simple Huffman code can have only up to four symbols with non-
|
|
zero code length. The format of the simple Huffman code is as
|
|
follows:
|
|
|
|
1 bit: 1, indicating a simple Huffman code
|
|
2 bits: NSYM - 1, where NSYM = # of symbols with non-zero
|
|
code length
|
|
|
|
NSYM symbols, each encoded using ALPHABET_BITS bits
|
|
|
|
1 bit: tree-select, present only for NSYM = 4
|
|
|
|
The value of ALPHABET_BITS depends on the alphabet of the Huffman
|
|
code: it is the smallest number of bits that can represent all
|
|
symbols in the alphabet. E.g. for the alphabet of literal bytes,
|
|
ALPHABET_BITS is 8.
|
|
|
|
The (non-zero) code lengths of the symbols can be reconstructed as
|
|
follows:
|
|
|
|
* if NSYM = 1, the code length for the one symbol is one at
|
|
this stage, but only to distinguish it from the other zero
|
|
code length symbols, when encoding this symbol in the
|
|
compressed data stream using this Huffman code later, no
|
|
actual bits are emitted. Similarly, when decoding a symbol
|
|
using this Huffman code, no bits are read and the one symbol
|
|
is returned.
|
|
|
|
* if NSYM = 2, both symbols have code length 1.
|
|
|
|
* if NSYM = 3, the code lengths for the symbols are 1, 2, 2 in
|
|
the order they appear in the representation of the simple
|
|
Huffman code.
|
|
|
|
* if NSYM = 4, the code lengths (in order of symbols decoded)
|
|
depend on the tree-select bit: 2, 2, 2, 2, (tree-select bit 0)
|
|
or 1, 2, 3, 3 (tree-select bit 1).
|
|
|
|
3.5. Complex Huffman codes
|
|
|
|
A complex Huffman code is a canonical Huffman code, defined by the
|
|
sequence of code lengths, as discussed in Paragraph 3.2, above.
|
|
For even greater compactness, the code length sequences themselves
|
|
are compressed using a Huffman code. The alphabet for code lengths
|
|
is as follows:
|
|
|
|
0 - 15: Represent code lengths of 0 - 15
|
|
16: Copy the previous non-zero code length 3 - 6 times
|
|
The next 2 bits indicate repeat length
|
|
(0 = 3, ... , 3 = 6)
|
|
If this is the first code length, or all previous
|
|
code lengths are zero, a code length of 8 is
|
|
repeated 3 - 6 times
|
|
A repeated code length code of 16 modifies the
|
|
repeat count of the previous one as follows:
|
|
repeat count = (4 * (repeat count - 2)) +
|
|
(3 - 6 on the next 2 bits)
|
|
Example: Codes 7, 16 (+2 bits 11), 16 (+2 bits 10)
|
|
will expand to 22 code lengths of 7
|
|
(1 + 4 * (6 - 2) + 5)
|
|
17: Repeat a code length of 0 for 3 - 10 times.
|
|
(3 bits of length)
|
|
A repeated code length code of 17 modifies the
|
|
repeat count of the previous one as follows:
|
|
repeat count = (8 * (repeat count - 2)) +
|
|
(3 - 10 on the next 3 bits)
|
|
|
|
A code length of 0 indicates that the corresponding symbol in the
|
|
alphabet will not occur in the compressed data, and should not
|
|
participate in the Huffman code construction algorithm given
|
|
earlier. A complex Huffman code must have at least two non-zero
|
|
code lengths.
|
|
|
|
The bit lengths of the Huffman code over the code length alphabet
|
|
are compressed with the following static Huffman code:
|
|
|
|
Symbol Code
|
|
------ ----
|
|
0 00
|
|
1 1010
|
|
2 100
|
|
3 11
|
|
4 01
|
|
5 1011
|
|
|
|
We can now define the format of the complex Huffman code as
|
|
follows:
|
|
|
|
1 bit: 0, indicating a complex Huffman code
|
|
1 bit : HSKIP, if 1, skip over first two code length codes
|
|
|
|
Code lengths for symbols in the code length alphabet given
|
|
just above, in the order: 1, 2, 3, 4, 0, 17, 5, 6, 16, 7,
|
|
8, 9, 10, 11, 12, 13, 14, 15
|
|
|
|
The code lengths of code length symbols are between 0 and
|
|
5 and they are represented with 2 - 5 bits according to
|
|
the static Huffman code above. A code length of 0 means
|
|
the corresponding code length symbol is not used.
|
|
|
|
If HSKIP is 1, code lengths of code length symbols 1 and
|
|
2 are implicit zeros and are not present in the code
|
|
lengths sequence above. If there are at least two non-
|
|
zero code lengths, any trailing zero code lengths are
|
|
omitted, i.e. the last code length in the sequence must
|
|
be non-zero. In this case the sum of (32 >> code length)
|
|
over all the non-zero code lengths must equal to 32.
|
|
|
|
Sequence of code lengths symbols, encoded using the code
|
|
length Huffman code. Any trailing 0 or 17 must be
|
|
omitted, i.e. the last encoded code length symbol must be
|
|
between 1 and 16. The sum of (32768 >> code length) over
|
|
all the non-zero code lengths in the alphabet, including
|
|
those encoded using repeat code(s) of 16, must equal to
|
|
32768.
|
|
|
|
4. Encoding of distances
|
|
|
|
As described in Section 2, one component of a compressed meta-block
|
|
is a sequence of backward distances. In this section we provide the
|
|
details to the encoding of distances.
|
|
|
|
Each distance in the compressed data part of a meta-block is
|
|
represented with a pair <distance code, extra bits>. The distance
|
|
code and the extra bits are encoded back-to-back, the distance code
|
|
is encoded using a Huffman code over the distance code alphabet,
|
|
while the extra bits value is encoded as a fixed-width machine
|
|
integer. The number of extra bits can be 0 - 24, and it is dependent
|
|
on the distance code.
|
|
|
|
To convert a distance code and associated extra bits to a backward
|
|
distance, we need the sequence of past distances and two additional
|
|
parameters, the number of "postfix bits", denoted by NPOSTFIX, and
|
|
the number of direct distance codes, denoted by NDIRECT. Both of
|
|
these parameters are encoded in the meta-block header. We will also
|
|
use the following derived parameter:
|
|
|
|
POSTFIX_MASK = ((1 << NPOSTFIX) - 1)
|
|
|
|
The first 16 distance codes are special short codes that reference
|
|
past distances as follows:
|
|
|
|
0: last distance
|
|
1: second last distance
|
|
2: third last distance
|
|
3: fourth last distance
|
|
4: last distance - 1
|
|
5: last distance + 1
|
|
6: last distance - 2
|
|
7: last distance + 2
|
|
8: last distance - 3
|
|
9: last disatnce + 3
|
|
10: second last distance - 1
|
|
11: second last distance + 1
|
|
12: second last distance - 2
|
|
13: second last distance + 2
|
|
14: second last distance - 3
|
|
15: second last distance + 3
|
|
|
|
The ring-buffer of four last distances is initialized by the values
|
|
16, 15, 11 and 4 (i.e. the fourth last is set to 16, the third last
|
|
to 15, the second last to 11 and the last distance to 4) at the
|
|
beginning of the *stream* (as opposed to the beginning of the meta-
|
|
block) and it is not reset at meta-block boundaries. When a distance
|
|
code 0 appears, the distance it represents (i.e. the last distance
|
|
in the sequence of distances) is not pushed to the ring-buffer of
|
|
last distances, in other words, the expression "(second, third,
|
|
fourth) last distance" means the (second, third, fourth) last
|
|
distance that was not represented by a 0 distance code.
|
|
|
|
The next NDIRECT distance codes, from 16 to 15 + NDIRECT, represent
|
|
distances from 1 to NDIRECT. Neither the distance short codes, nor
|
|
the NDIRECT direct distance codes have any extra bits.
|
|
|
|
Distance codes 16 + NDIRECT and greater all have extra bits, the
|
|
number of extra bits for a distance code "dcode" is given by the
|
|
following formula:
|
|
|
|
ndistbits = 1 + ((dcode - NDIRECT - 16) >> (NPOSTFIX + 1))
|
|
|
|
The maximum number of extra bits is 24, therefore the size of the
|
|
distance code alphabet is (16 + NDIRECT + (48 << NPOSTFIX)).
|
|
|
|
Given a distance code "dcode" (>= 16 + NDIRECT), and extra bits
|
|
"dextra", the backward distance is given by the following formula:
|
|
|
|
hcode = (dcode - NDIRECT - 16) >> NPOSTFIX
|
|
lcode = (dcode - NDIRECT - 16) & POSTFIX_MASK
|
|
offset = ((2 + (hcode & 1)) << ndistbits) - 4;
|
|
distance = ((offset + dextra) << NPOSTFIX) + lcode + NDIRECT + 1
|
|
|
|
5. Encoding of literal insertion lengths and copy lengths
|
|
|
|
As described in Section 2, the literal insertion lengths and backward
|
|
copy lengths are encoded using a single Huffman code. This section
|
|
provides the details to this encoding.
|
|
|
|
Each <insertion length, copy length> pair in the compressed data part
|
|
of a meta-block is represented with the following triplet:
|
|
|
|
<insert-and-copy length code, insert extra bits, copy extra bits>
|
|
|
|
The insert-and-copy length code, the insert extra bits and the copy
|
|
extra bits are encoded back-to-back, the insert-and-copy length code
|
|
is encoded using a Huffman code over the insert-and-copy length code
|
|
alphabet, while the extra bits values are encoded as fixed-width
|
|
machine integers. The number of insert and copy extra bits can be
|
|
0 - 24, and they are dependent on the insert-and-copy length code.
|
|
|
|
Some of the insert-and-copy length codes also express the fact that
|
|
the distance code of the distance in the same command is 0, i.e. the
|
|
distance component of the command is the same as that of the previous
|
|
command. In this case, the distance code and extra bits for the
|
|
distance are omitted from the compressed data stream.
|
|
|
|
We describe the insert-and-copy length code alphabet in terms of the
|
|
(not directly used) insert length code and copy length code
|
|
alphabets. The symbols of the insert length code alphabet, along with
|
|
the number of insert extra bits and the range of the insert lengths
|
|
are as follows:
|
|
|
|
Extra Extra Extra
|
|
Code Bits Lengths Code Bits Lengths Code Bits Lengths
|
|
---- ---- ------ ---- ---- ------- ---- ---- -------
|
|
0 0 0 8 2 10-13 16 6 130-193
|
|
1 0 1 9 2 14-17 17 7 194-321
|
|
2 0 2 10 3 18-25 18 8 322-527
|
|
3 0 3 11 3 26-33 19 9 578-1089
|
|
4 0 4 12 4 34-49 20 10 1090-2113
|
|
5 0 5 13 4 50-65 21 12 2114-6209
|
|
6 1 6,7 14 5 66-97 22 14 6210-22593
|
|
7 1 8,9 15 5 98-129 23 24 22594-16799809
|
|
|
|
The symbols of the copy length code alphabet, along with the number
|
|
of copy extra bits and the range of copy lengths are as follows:
|
|
|
|
Extra Extra Extra
|
|
Code Bits Lengths Code Bits Lengths Code Bits Lengths
|
|
---- ---- ------ ---- ---- ------- ---- ---- -------
|
|
0 0 2 8 1 10,11 16 5 70-101
|
|
1 0 3 9 1 12,13 17 5 102-133
|
|
2 0 4 10 2 14-17 18 6 134-197
|
|
3 0 5 11 2 18-21 19 7 198-325
|
|
4 0 6 12 3 22-29 20 8 326-581
|
|
5 0 7 13 3 30-37 21 9 582-1093
|
|
6 0 8 14 4 38-53 22 10 1094-2117
|
|
7 0 9 15 4 54-69 23 24 2118-16779333
|
|
|
|
To convert an insert-and-copy length code to an insert length code
|
|
and a copy length code, the following table can be used:
|
|
|
|
Insert
|
|
length Copy length code
|
|
code 0-7 8-15 16-23
|
|
+---------+---------+
|
|
| | |
|
|
0-7 | 0-63 | 64-127 | <--- distance code 0
|
|
| | |
|
|
+---------+---------+---------+
|
|
| | | |
|
|
0-7 | 128-191 | 192-255 | 383-447 |
|
|
| | | |
|
|
+---------+---------+---------+
|
|
| | | |
|
|
8-15 | 256-319 | 320-383 | 512-575 |
|
|
| | | |
|
|
+---------+---------+---------+
|
|
| | | |
|
|
16-23 | 448-551 | 576-639 | 640-703 |
|
|
| | | |
|
|
+---------+---------+---------+
|
|
|
|
First, look up the cell with the 64 value range containing the
|
|
insert-and-copy length code, this gives the insert length code and
|
|
the copy length code ranges, both 8 values long. The copy length
|
|
code within its range is determined by the lowest 3 bits of the
|
|
insert-and-copy length code, and the insert length code within its
|
|
range is determined by bits 3-5 (counted from the LSB) of the insert-
|
|
and-copy length code. Given the insert length and copy length codes,
|
|
the actual insert and copy lengths can be obtained by reading the
|
|
number of extra bits given by the tables above.
|
|
|
|
If the insert-and-copy length code is between 0 and 127, the distance
|
|
code of the command is set to zero (the last distance reused).
|
|
|
|
6. Encoding of block switch commands
|
|
|
|
As described in Section 2, a block-switch command is a pair
|
|
<block type, block length>. These are encoded in the compressed data
|
|
part of the meta-block, right before the start of each new block of a
|
|
particular block category.
|
|
|
|
Each block type in the compressed data is represented with a block
|
|
type code, encoded using a Huffman code over the block type code
|
|
alphabet. A block type code 0 means that the block type is the same
|
|
as the type of the second last block from the same block category,
|
|
while a block type code 1 means that the block type equals the last
|
|
block type plus one. Block type codes 2 - 257 represent block types
|
|
0 - 255. The second last and last block types are initialized with 0
|
|
and 1, respectively, at the beginning of each meta-block.
|
|
|
|
The first block type of each block category must be 0 and the block
|
|
type of the first block switch command is therefore not encoded in
|
|
the compressed data.
|
|
|
|
The number of different block types in each block category, denoted
|
|
by NBLTYPESL, NBLTYPESI, and NBLTYPESD for literals, insert-and-copy
|
|
lengths and distances, respectively, is encoded in the meta-block
|
|
header, and it must equal to the largest block type plus one in that
|
|
block category. In other words, the set of literal, insert-and-copy
|
|
length and distance block types must be [0..NBLTYPESL-1],
|
|
[0..NBLTYPESI-1], and [0..NBLTYPESD-1], respectively. From this it
|
|
follows that the alphabet size of literal, insert-and-copy length and
|
|
distance block type codes is NBLTYPES + 2, NBLTYPESI + 2 and
|
|
NBLTYPESD + 2, respectively.
|
|
|
|
Each block length in the compressed data is represented with a pair
|
|
<block length code, extra bits>. The block length code and the extra
|
|
bits are encoded back-to-back, the block length code is encoded using
|
|
a Huffman code over the block length code alphabet, while the extra
|
|
bits value is encoded as a fixed-width machine integer. The number of
|
|
extra bits can be 0 - 24, and it is dependent on the block length
|
|
code. The symbols of the block length code alphabet, along with the
|
|
number of extra bits and the range of block lengths are as follows:
|
|
|
|
Extra Extra Extra
|
|
Code Bits Lengths Code Bits Lengths Code Bits Lengths
|
|
---- ---- ------ ---- ---- ------- ---- ---- -------
|
|
0 2 1-4 9 4 65-80 18 7 369-496
|
|
1 2 5-8 10 4 81-96 19 8 497-752
|
|
2 2 9-12 11 4 97-112 20 9 753-1264
|
|
3 2 13-16 12 5 113-144 21 10 1265-2288
|
|
4 3 17-24 13 5 145-176 22 11 2289-4336
|
|
5 3 25-32 14 5 177-208 23 12 4337-8432
|
|
6 3 33-40 15 5 209-240 24 13 8433-16624
|
|
7 3 41-48 16 6 241-304 25 24 16625-16793840
|
|
8 4 49-64 17 6 305-368
|
|
|
|
The first block switch command of each block category is special in
|
|
the sense that it is encoded in the meta-block header, and as
|
|
described earlier the block type code is omitted, since it is an
|
|
implicit zero.
|
|
|
|
7. Context modeling
|
|
|
|
As described in Section 2, the Huffman tree used to encode a literal
|
|
byte or a distance code depends on the context ID and the block type.
|
|
This section specifies how to compute the context ID for a particular
|
|
literal and distance code, and how to encode the context map that
|
|
maps a <context ID, block type> pair to the index of a Huffman
|
|
tree in the array of literal and distance Huffman trees.
|
|
|
|
7.1. Context modes and context ID lookup for literals
|
|
|
|
The context for encoding the next literal is defined by the last
|
|
two bytes in the stream (p1, p2, where p1 is the most recent
|
|
byte), regardless if these bytes are produced by backward
|
|
references or by literal insertions.
|
|
|
|
There are four methods, called context modes, to compute the
|
|
Context ID:
|
|
* MSB6, where the Context ID is the value of six most
|
|
significant bits of p1,
|
|
* LSB6, where the Context ID is the value of six least
|
|
significant bits of p1,
|
|
* UTF8, where the Context ID is a complex function of p1, p2,
|
|
optimized for text compression, and
|
|
* Signed, where Context ID is a complex function of p1, p2,
|
|
optimized for compressing sequences of signed integers.
|
|
|
|
The Context ID for the UTF8 and Signed context modes is computed
|
|
using the following lookup tables Lut0, Lut1, and Lut2.
|
|
|
|
Lut0 :=
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 4, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
8, 12, 16, 12, 12, 20, 12, 16, 24, 28, 12, 12, 32, 12, 36, 12,
|
|
44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 32, 32, 24, 40, 28, 12,
|
|
12, 48, 52, 52, 52, 48, 52, 52, 52, 48, 52, 52, 52, 52, 52, 48,
|
|
52, 52, 52, 52, 52, 48, 52, 52, 52, 52, 52, 24, 12, 28, 12, 12,
|
|
12, 56, 60, 60, 60, 56, 60, 60, 60, 56, 60, 60, 60, 60, 60, 56,
|
|
60, 60, 60, 60, 60, 56, 60, 60, 60, 60, 60, 24, 12, 28, 12, 0,
|
|
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
|
|
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
|
|
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
|
|
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
|
|
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
|
|
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
|
|
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
|
|
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3
|
|
|
|
Lut1 :=
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
|
|
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
|
|
1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
|
|
|
|
Lut2 :=
|
|
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
|
|
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
|
|
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
|
|
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
|
|
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
|
|
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
|
|
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
|
|
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7
|
|
|
|
Given p1 is the last decoded byte and p2 is the second last
|
|
decoded byte the context IDs can be computed as follows:
|
|
|
|
For LSB6 : Context ID = p1 & 0x3f
|
|
For MSB6 : Context ID = p1 >> 2
|
|
For UTF8 : Context ID = Lut0[p1] | Lut1[p2]
|
|
For Signed: Context ID = (Lut2[p1] << 3) | Lut2[p2]
|
|
|
|
The context modes LSB6, MSB6, UTF8, and Signed are denoted by
|
|
integers 0, 1, 2, 3.
|
|
|
|
The context mode is defined for each literal block type and they
|
|
are stored in a consecutive array of bits in the meta-block
|
|
header, always two bits per block type.
|
|
|
|
7.2. Context ID for distances
|
|
|
|
The context for encoding a distance code is defined by the copy
|
|
length corresponding to the distance. The context IDs are 0, 1, 2,
|
|
and 3 for copy lengths 2, 3, 4, and more than 4, respectively.
|
|
|
|
7.3. Encoding of the context map
|
|
|
|
There are two kinds of context maps, for literals and for
|
|
distances. The size of the context map is 64 * NBLTYPESL for
|
|
literals, and 4 * NBLTYPESD for distances. Each value in the
|
|
context map is an integer between 0 and 255, indicating the index
|
|
of the Huffman tree to be used when encoding the next literal or
|
|
distance.
|
|
|
|
The context map is encoded as a one-dimensional array,
|
|
CMAPL[0..(64 * NBLTYPESL - 1)] and CMAPD[0..(4 * NBLTYPESD - 1)].
|
|
|
|
The index of the Huffman tree for encoding a literal or distance
|
|
code with context ID "cid" and block type "bltype" is
|
|
|
|
index of literal Huffman tree = CMAPL[bltype * 64 + cid]
|
|
|
|
index of distance Huffman tree = CMAPD[bltype * 4 + cid]
|
|
|
|
The values of the context map are encoded with the combination
|
|
of run length encoding for zero values and Huffman coding. Let
|
|
RLEMAX denote the number of run length codes and NTREES denote the
|
|
maximum value in the context map plus one. NTREES must equal the
|
|
number of different values in the context map, in other words,
|
|
the different values in the context map must be the [0..NTREES-1]
|
|
interval. The alphabet of the Huffman code has the following
|
|
RLEMAX + NTREES symbols:
|
|
|
|
0: value zero
|
|
1: repeat a zero 2-3 times, read 1 bit for repeat length
|
|
2: repeat a zero 4-7 times, read 2 bits for repeat length
|
|
...
|
|
RLEMAX: repeat a zero (2^RLEMAX)-(2^(RLEMAX+1) - 1) times,
|
|
read RLEMAX bits for repeat length
|
|
RLEMAX + 1: value 1
|
|
...
|
|
RLEMAX + NTREES - 1: value NTREES - 1
|
|
|
|
If RLEMAX = 0, the run length coding is not used, and the symbols
|
|
of the alphabet are directly the values in the context map. We can
|
|
now define the format of the context map (the same format is used
|
|
for literal and distance context maps):
|
|
|
|
1-5 bits: RLEMAX, 0 is encoded with one 0 bit, and values
|
|
1 - 16 are encoded with bit pattern 1xxxx
|
|
|
|
Huffman code with alphabet size NTREES + RLEMAX
|
|
|
|
Context map size values encoded with the above Huffman code
|
|
and run length coding for zero values
|
|
|
|
1 bit: IMTF bit, if set, we do an inverse move-to-front
|
|
transform on the values in the context map to get
|
|
the Huffman code indexes
|
|
|
|
For the encoding of NTREES see Section 9.2.
|
|
|
|
8. Static dictionary
|
|
|
|
At any given point during decoding the compressed data, a reference
|
|
to a duplicated string in the output produced so far has a maximum
|
|
backward distance value, which is the minimum of the window size and
|
|
the number of output bytes produced. However, decoding a distance
|
|
from the input stream, as described in section 4, can produce
|
|
distances that are greater than this maximum allowed value. The
|
|
difference between these distances and the first invalid distance
|
|
value is treated as reference to a word in the static dictionary
|
|
given in Appendix A. The maximum valid copy length for a static
|
|
dictionary reference is 24. The static dictionary has three parts:
|
|
|
|
* DICT[0..DICTSIZE], an array of bytes
|
|
* DOFFSET[0..24], an array of byte offset values for each length
|
|
* NDBITS[0..24], an array of bit-depth values for each length
|
|
|
|
The number of static dictionary words for a given length is:
|
|
|
|
NWORDS[length] = 0 (if length < 3)
|
|
NWORDS[length] = (1 << NDBITS[lengths]) (if length >= 3)
|
|
|
|
DOFFSET and DICTSIZE are defined by the following recursion:
|
|
|
|
DOFFSET[0] = 0
|
|
DOFFSET[length + 1] = DOFFSET[length] + length * NWORDS[length]
|
|
DICTSIZE = DOFFSET[24] + 24 * NWORDS[24]
|
|
|
|
The offset of a word within the DICT array for a given length and
|
|
index is:
|
|
|
|
offset(length, index) = DOFFSET[length] + index * length
|
|
|
|
Each static dictionary word has 64 different forms, given by applying
|
|
a word transformation to a base word in the DICT array. The list of
|
|
word transformations is given in Appendix B. The static dictionary
|
|
word for a <length, distance> pair can be reconstructed as follows:
|
|
|
|
word_id = distance - (max allowed distance + 1)
|
|
index = word_id % NWORDS[length]
|
|
base_word = DICT[offset(length, index)..offset(length, index+1))
|
|
transform_id = word_id >> NBITS[length]
|
|
|
|
The string copied to the output stream is computed by applying the
|
|
transformation to the base dictionary word. If transform_id is
|
|
greater than 63 or length is greater than 24, the compressed data set
|
|
is invalid and must be discarded.
|
|
|
|
9. Compressed data format
|
|
|
|
In this section we describe the format of the compressed data set in
|
|
terms of the format of the individual data items described in the
|
|
previous sections.
|
|
|
|
9.1. Format of the stream header
|
|
|
|
The stream header has only the following one field:
|
|
|
|
1-4 bits: WBITS, a value in the range 16 - 24, value 16 is
|
|
encoded with one 0 bit, and values 17 - 24 are
|
|
encoded with bit pattern 1xxx
|
|
|
|
The size of the sliding window, which is the maximum value of any
|
|
non-dictionary reference backward distance, is given by the
|
|
following formula:
|
|
|
|
window size = (1 << WBITS) - 16
|
|
|
|
9.2. Format of the meta-block header
|
|
|
|
A compliant compressed data set has at least one meta-block. Each
|
|
meta-block contains a header with information about the
|
|
uncompressed length of the meta-block, and a bit signaling if the
|
|
meta-block is the last one. The format of the meta-block header is
|
|
the following:
|
|
|
|
1 bit: ISLAST, set to 1 if this is the last meta-block
|
|
1 bit: ISEMPTY, set to 1 if the meta-block is empty, this
|
|
field is only present if ISLAST bit is set, since
|
|
only the last meta-block can be empty
|
|
2 bits: MNIBBLES, (# of nibbles to represent the length) - 4
|
|
|
|
(MNIBBLES + 4) x 4 bits: MLEN - 1, where MLEN is the length
|
|
of the meta-block in the input data in bytes
|
|
|
|
1 bit: ISUNCOMPRESSED, if set to 1, any bits of input up to
|
|
the next byte boundary are ignored, and the rest of
|
|
the meta-block contains MLEN bytes of literal data;
|
|
this field is only present if ISLAST bit is not set
|
|
|
|
1-11 bits: NBLTYPESL, # of literal block types, encoded with
|
|
the following variable length code:
|
|
|
|
Value Bit Pattern
|
|
----- -----------
|
|
1 0
|
|
2 1000
|
|
3-4 1001x
|
|
5-8 1010xx
|
|
9-16 1011xxx
|
|
17-32 1100xxxx
|
|
33-64 1101xxxxx
|
|
65-128 1110xxxxxx
|
|
129-256 1111xxxxxxx
|
|
|
|
Huffman code over the block type code alphabet for literal
|
|
block types, appears only if NBLTYPESL >= 2
|
|
|
|
Huffman code over the block length code alphabet for literal
|
|
block lengths, appears only if NBLTYPESL >= 2
|
|
|
|
Block length code + Extra bits for first literal block
|
|
length, appears only if NBLTYPESL >= 2
|
|
|
|
1-11 bits: NBLTYPESI, # of insert-and-copy block types, encoded
|
|
with the same variable length code as above
|
|
|
|
Huffman code over the block type code alphabet for insert-
|
|
and-copy block types, only if NBLTYPESI >= 2
|
|
|
|
Huffman code over the block length code alphabet for insert-
|
|
and-copy block lengths, only if NBLTYPESI >= 2
|
|
|
|
Block length code + Extra bits for first insert-and-copy
|
|
block length, only if NBLTYPESI >= 2
|
|
|
|
1-11 bits: NBLTYPESD, # of distance block types, encoded with
|
|
the same variable length code as above
|
|
|
|
Huffman code over the block type code alphabet for distance
|
|
block types, appears only if NBLTYPESD >= 2
|
|
|
|
Huffman code over the block length code alphabet for
|
|
distance block lengths, only if NBLTYPESD >= 2
|
|
|
|
Block length code + Extra bits for first distance block
|
|
length, only if NBLTYPESD >= 2
|
|
|
|
2 bits: NPOSTFIX, parameter used in the distance coding
|
|
|
|
4 bits: four most significant bits of NDIRECT, to get the
|
|
actual value of the parameter NDIRECT, left-shift
|
|
this four bit number by NPOSTFIX bits
|
|
|
|
NBLTYPESL x 2 bits: context mode for each literal block type
|
|
|
|
1-11 bits: NTREESL, # of literal Huffman trees, encoded with
|
|
the same variable length code as NBLTYPESL
|
|
|
|
Literal context map, encoded as described in Paragraph 7.3,
|
|
appears only if NTREESL >= 2, otherwise the context map
|
|
has only zero values
|
|
|
|
1-11 bits: NTREESD, # of distance Huffman trees, encoded with
|
|
the same variable length code as NBLTYPESD
|
|
|
|
Distance context map, encoded as described in Paragraph 7.3,
|
|
appears only if NTREESD >= 2, otherwise the context map
|
|
has only zero values
|
|
|
|
NTREESL Huffman codes for literals
|
|
|
|
NBLTYPESI Huffman codes for insert-and-copy lengths
|
|
|
|
NTREESD Huffman codes for distances
|
|
|
|
9.3. Format of the meta-block data
|
|
|
|
The compressed data part of a meta-block consists of a series of
|
|
commands. Each command has the following format:
|
|
|
|
Block type code for next insert-and-copy block type, appears
|
|
only if NBLTYPESI >= 2 and the previous insert-and-copy
|
|
block has ended
|
|
|
|
Block length code + Extra bits for next insert-and-copy
|
|
block length, appears only if NBLTYPESI >= 2 and the
|
|
previous insert and-copy block has ended
|
|
|
|
Insert-and-copy length, encoded as in section 5, using the
|
|
insert-and-copy length Huffman code with the current
|
|
insert-and-copy block type index
|
|
|
|
Insert length number of literals, with the following format:
|
|
|
|
Block type code for next literal block type, appears
|
|
only if NBLTYPESL >= 2 and the previous literal
|
|
block has ended
|
|
|
|
Block length code + Extra bits for next literal block
|
|
length, appears only if NBLTYPESL >= 2 and the
|
|
previous literal block has ended
|
|
|
|
Next byte of the input data, encoded with the literal
|
|
Huffman code with the index determined by the
|
|
previuos two bytes of the input data, the current
|
|
literal block type and the context map, as
|
|
described in Paragraph 7.3.
|
|
|
|
Block type code for next distance block type, appears only
|
|
if NBLTYPESD >= 2 and the previous distance block has
|
|
ended
|
|
|
|
Block length code + Extra bits for next distance block
|
|
length, appears only if NBLTYPESD >= 2 and the previous
|
|
distance block has ended
|
|
|
|
Distance code, encoded as in section 4, using the distance
|
|
Huffman code with the current distance block type index,
|
|
appears only if the distance code is not an implicit 0,
|
|
as indicated by the insert-and-copy length code
|
|
|
|
The number of commands in the meta-block is such that the sum of
|
|
insert lengths and copy lengths over all the commands gives the
|
|
uncompressed length, MLEN encoded in the meta-block header.
|
|
|
|
10. Decoding algorithm
|
|
|
|
The decoding algorithm that produces the output data is as follows:
|
|
|
|
read window size
|
|
do
|
|
read ISLAST bit
|
|
if ISLAST
|
|
read ISEMPTY bit
|
|
if ISEMPTY
|
|
break from loop
|
|
read MLEN
|
|
if not ISLAST
|
|
read ISUNCOMPRESSED bit
|
|
if ISUNCOMPRESSED
|
|
skip any bits up to the next byte boundary
|
|
copy MLEN bytes of input to the output stream
|
|
continue to the next meta-block
|
|
loop for each three block categories (i = L, I, D)
|
|
read NBLTYPESi
|
|
if NBLTYPESi >= 2
|
|
read Huffman code for block types, HTREE_BTYPE_i
|
|
read Huffman code for block lengths, HTREE_BLEN_i
|
|
read block length, BLEN_i
|
|
set block type, BTYPE_i to 0
|
|
initialize second last and last block types to 0 and 1
|
|
else
|
|
set block type, BTYPE_i to 0
|
|
set block length, BLEN_i to 268435456
|
|
read NPOSTFIX and NDIRECT
|
|
read array of literal context modes, CMODE[]
|
|
read NTREESL
|
|
if NTREESL >= 2
|
|
read literal context map, CMAPL[]
|
|
else
|
|
fill CMAPL[] with zeros
|
|
read NTREESD
|
|
if NTREESD >= 2
|
|
read distance context map, CMAPD[]
|
|
else
|
|
fill CMAPD[] with zeros
|
|
read array of Huffman codes for literals, HTREEL[]
|
|
read array of Huffman codes for insert-and-copy, HTREEI[]
|
|
read array of Huffman codes for distances, HTREED[]
|
|
do
|
|
if BLEN_I is zero
|
|
read block type using HTREE_BTYPE_I and set BTYPE_I
|
|
read block length using HTREE_BLEN_I and set BLEN_I
|
|
decrement BLEN_I
|
|
read insert and copy length, ILEN, CLEN with HTREEI[BTYPE_I]
|
|
loop for ILEN
|
|
if BLEN_L is zero
|
|
read block type using HTREE_BTYPE_L and set BTYPE_L
|
|
read block length using HTREE_BLEN_L and set BLEN_L
|
|
decrement BLEN_L
|
|
look up context mode CMODE[BTYPE_L]
|
|
compute context ID, CIDL from last two bytes of output
|
|
read literal using HTREEL[CMAPL[64 * BTYPE_L + CIDL]]
|
|
copy literal to output stream
|
|
if number of output bytes produced in the loop is MLEN
|
|
break from loop
|
|
if distance code is implicit zero from insert-and-copy code
|
|
set backward distance to the last distance
|
|
else
|
|
if BLEN_D is zero
|
|
read block type using HTREE_BTYPE_D and set BTYPE_D
|
|
read block length using HTREE_BLEN_D and set BLEN_D
|
|
decrement BLEN_D
|
|
compute context ID, CIDD from CLEN
|
|
read distance code with HTREED[CMAPD[4 * BTYPE_D + CIDD]]
|
|
compute distance by distance short code substitution
|
|
move backwards distance bytes in the output stream, and
|
|
copy CLEN bytes from this position to the output stream,
|
|
or look up the static dictionary word and copy it to the
|
|
output stram
|
|
while number of output bytes produced in the loop < MLEN
|
|
while not ISLAST
|
|
|
|
Note that a duplicated string reference may refer to a string in a
|
|
previous meta-block, i.e. the backward distance may cross one or
|
|
more meta-block boundaries. However a backward copy distance
|
|
cannot refer past the beginning of the output stream and it can
|
|
not be greater than the window size; any such distance must be
|
|
interpreted as a reference to a static dictionary word. Also note
|
|
that the referenced string may overlap the current position, for
|
|
example, if the last 2 bytes decoded have values X and Y, a string
|
|
reference with <length = 5, distance = 2> adds X,Y,X,Y,X to the
|
|
output stream.
|
|
|
|
11. References
|
|
|
|
[1] Huffman, D. A., "A Method for the Construction of Minimum
|
|
Redundancy Codes", Proceedings of the Institute of Radio
|
|
Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
|
|
|
|
[2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
|
|
Compression", IEEE Transactions on Information Theory, Vol. 23,
|
|
No. 3, pp. 337-343.
|
|
|
|
[3] Deutsch, P., "DEFLATE Compressed Data Format Specification
|
|
version 1.3", RFC 1951, Aladdin Enterprises, May 1996.
|
|
http://www.ietf.org/rfc/rfc1951.txt
|
|
|
|
12. Source code
|
|
|
|
Source code for a C language implementation of a "brotli" compliant
|
|
decompressor and a C++ language implementation of a compressor is
|
|
available in the brotli/ directory within the font-compression-
|
|
reference open-source project:
|
|
https://code.google.com/p/font-compression-reference/source/browse/
|
|
|
|
Appendix A. List of dictionary words
|
|
|
|
TO BE WRITTEN
|
|
|
|
Appendix B. List of word transformations
|
|
|
|
TO BE WRITTEN
|