mirror of
https://github.com/google/brotli.git
synced 2024-11-24 12:30:15 +00:00
2bcd58bb5a
Combine the HSKIP and the simple/complex Huffman code type bits.
1265 lines
57 KiB
Plaintext
1265 lines
57 KiB
Plaintext
J. Alakuijala
|
|
Z. Szabadka
|
|
______ _______ _______ _______ _________
|
|
( __ \ ( ____ )( ___ )( ____ \\__ __/
|
|
| ( \ )| ( )|| ( ) || ( \/ ) (
|
|
| | ) || (____)|| (___) || (__ | |
|
|
| | | || __)| ___ || __) | |
|
|
| | ) || (\ ( | ( ) || ( | |
|
|
| (__/ )| ) \ \__| ) ( || ) | |
|
|
(______/ |/ \__/|/ \||/ )_(
|
|
|
|
|
|
DRAFT of
|
|
Brotli Compression Algorithm Compressed Data Format Specification 1.0
|
|
|
|
Status of This Memo
|
|
|
|
This memo provides information for the Internet community. This memo
|
|
does not specify an Internet standard of any kind. Distribution of
|
|
this memo is unlimited.
|
|
|
|
Notices
|
|
|
|
Copyright (c) 2013 J. Alakuijala and Z. Szabadka
|
|
|
|
Permission is granted to copy and distribute this document for any
|
|
purpose and without charge, including translations into other
|
|
languages and incorporation into compilations, provided that the
|
|
copyright notice and this notice are preserved, and that any
|
|
substantive changes or deletions from the original are clearly
|
|
marked.
|
|
|
|
Abstract
|
|
|
|
This specification defines a lossless compressed data format that
|
|
compresses data using a combination of the LZ77 algorithm and Huffman
|
|
coding, with efficiency comparable to the best currently available
|
|
general-purpose compression methods.
|
|
|
|
1. Introduction
|
|
|
|
1.1. Purpose
|
|
|
|
The purpose of this specification is to define a lossless
|
|
compressed data format that:
|
|
* Is independent of CPU type, operating system, file system,
|
|
and character set, and hence can be used for interchange;
|
|
* Can be produced or consumed, even for an arbitrarily long
|
|
sequentially presented input data stream, using only an a
|
|
priori bounded amount of intermediate storage, and hence
|
|
can be used in data communications or similar structures,
|
|
such as Unix filters;
|
|
* Compresses data with efficiency comparable to the best
|
|
currently available general-purpose compression methods,
|
|
and in particular considerably better than the gzip program;
|
|
* Decompresses much faster than the LZMA implementations.
|
|
|
|
The data format defined by this specification does not attempt to:
|
|
* Allow random access to compressed data;
|
|
* Compress specialized data (e.g., raster graphics) as well
|
|
as the best currently available specialized algorithms.
|
|
|
|
1.2. Intended audience
|
|
|
|
This specification is intended for use by software implementers
|
|
to compress data into and/or decompress data from "brotli" format.
|
|
|
|
The text of the specification assumes a basic background in
|
|
programming at the level of bits and other primitive data
|
|
representations. Familiarity with the technique of Huffman coding
|
|
is helpful but not required.
|
|
|
|
This specification uses heavily the notations and terminology
|
|
introduced in the DEFLATE format specification (RFC 1951, see
|
|
reference [3] below). For the sake of completeness, we always
|
|
include the whole text of the relevant parts of RFC 1951,
|
|
therefore familiarity with the DEFLATE format is helpful but not
|
|
required.
|
|
|
|
1.3. Scope
|
|
|
|
The specification specifies a method for representing a sequence
|
|
of bytes as a (usually shorter) sequence of bits, and a method for
|
|
packing the latter bit sequence into bytes.
|
|
|
|
1.4. Compliance
|
|
|
|
Unless otherwise indicated below, a compliant decompressor must be
|
|
able to accept and decompress any data set that conforms to all
|
|
the specifications presented here. A compliant compressor must
|
|
produce data sets that conform to all the specifications presented
|
|
here.
|
|
|
|
1.5. Definitions of terms and conventions used
|
|
|
|
Byte: 8 bits stored or transmitted as a unit (same as an octet).
|
|
For this specification, a byte is exactly 8 bits, even on machines
|
|
which store a character on a number of bits different from eight.
|
|
See below for the numbering of bits within a byte.
|
|
|
|
String: a sequence of arbitrary bytes.
|
|
|
|
Bytes stored within a computer do not have a "bit order", since
|
|
they are always treated as a unit. However, a byte considered as
|
|
an integer between 0 and 255 does have a most- and least-
|
|
significant bit, and since we write numbers with the most-
|
|
significant digit on the left, we also write bytes with the most-
|
|
significant bit on the left. In the diagrams below, we number the
|
|
bits of a byte so that bit 0 is the least-significant bit, i.e.,
|
|
the bits are numbered:
|
|
|
|
+--------+
|
|
|76543210|
|
|
+--------+
|
|
|
|
Within a computer, a number may occupy multiple bytes. All
|
|
multi-byte numbers in the format described here are stored with
|
|
the least-significant byte first (at the lower memory address).
|
|
For example, the decimal number 520 is stored as:
|
|
|
|
0 1
|
|
+--------+--------+
|
|
|00001000|00000010|
|
|
+--------+--------+
|
|
^ ^
|
|
| |
|
|
| + more significant byte = 2 x 256
|
|
+ less significant byte = 8
|
|
|
|
|
|
1.5.1. Packing into bytes
|
|
|
|
This document does not address the issue of the order in which
|
|
bits of a byte are transmitted on a bit-sequential medium,
|
|
since the final data format described here is byte- rather than
|
|
bit-oriented. However, we describe the compressed block format
|
|
below as a sequence of data elements of various bit
|
|
lengths, not a sequence of bytes. We must therefore specify
|
|
how to pack these data elements into bytes to form the final
|
|
compressed byte sequence:
|
|
|
|
* Data elements are packed into bytes in order of
|
|
increasing bit number within the byte, i.e., starting
|
|
with the least-significant bit of the byte.
|
|
* Data elements other than Huffman codes are packed
|
|
starting with the least-significant bit of the data
|
|
element.
|
|
* Huffman codes are packed starting with the most-
|
|
significant bit of the code.
|
|
|
|
In other words, if one were to print out the compressed data as
|
|
a sequence of bytes, starting with the first byte at the
|
|
*right* margin and proceeding to the *left*, with the most-
|
|
significant bit of each byte on the left as usual, one would be
|
|
able to parse the result from right to left, with fixed-width
|
|
elements in the correct MSB-to-LSB order and Huffman codes in
|
|
bit-reversed order (i.e., with the first bit of the code in the
|
|
relative LSB position).
|
|
|
|
2. Compressed representation overview
|
|
|
|
A compressed data set consists of a header and a series of meta-
|
|
blocks corresponding to successive meta-blocks of input data. The
|
|
meta-block sizes are limited to bytes and the maximum meta-block size
|
|
is 268,435,456 bytes.
|
|
|
|
The header contains the size of a sliding window on the input data
|
|
that is sufficient to keep on the intermediate storage at any given
|
|
point during decoding the stream.
|
|
|
|
Each meta-block is compressed using a combination of the LZ77
|
|
algorithm (Lempel-Ziv 1977, see reference [2] below) and Huffman
|
|
coding. The Huffman trees for each block are independent of those for
|
|
previous or subsequent blocks; the LZ77 algorithm may use a
|
|
reference to a duplicated string occurring in a previous meta-block,
|
|
up to sliding window size input bytes before.
|
|
|
|
Each meta-block consists of two parts: a meta-block header that
|
|
describes the representation of the compressed data part, and a
|
|
compressed data part. The compressed data consists of a series of
|
|
commands. Each command consists of two parts: a sequence of literal
|
|
bytes (of strings that have not been detected as duplicated within
|
|
the sliding window), and a pointer to a duplicated string,
|
|
represented as a pair <length, backward distance>.
|
|
|
|
Each command in the compressed data is represented using three kinds
|
|
of Huffman codes: one kind of code tree for the literal sequence
|
|
lengths (also referred to as literal insertion lengths) and backward
|
|
copy lengths (that is, a single code word represents two lengths,
|
|
one of the literal sequence and one of the backward copy), a separate
|
|
kind of code tree for literals, and a third kind of code tree for
|
|
distances. The code trees for each meta-block appear in a compact
|
|
form just before the compressed data in the meta-block header.
|
|
|
|
The sequence of each type of value in the representation of a command
|
|
(insert-and-copy lengths, literals and distances) within a meta-
|
|
block is further divided into blocks. In the "brotli" format, blocks
|
|
are not contiguous chunks of compressed data, but rather the pieces
|
|
of compressed data belonging to a block are interleaved with pieces
|
|
of data belonging to other blocks. Each meta-block can be logically
|
|
decomposed into a series of insert-and-copy length blocks, a series
|
|
of literal blocks and a series of distance blocks. These are also
|
|
called the three block categories: a meta-block has a series of
|
|
blocks for each block category. Note that the physical structure of
|
|
the meta-block is a series of commands, while the three series of
|
|
blocks is the logical structure. Consider the following example:
|
|
|
|
(IaC0, L0, L1, L2, D0)(IaC1, D1)(IaC2, L3, L4, D2)(IaC3, L5, D3)
|
|
|
|
The meta-block here has 4 commands, and each three types of symbols
|
|
within these commands can be rearranged for example into the
|
|
following logical block structure:
|
|
|
|
[IaC0, IaC1][IaC2, IaC3] <-- block types 0 and 1
|
|
|
|
[L0, L1][L2, L3, L4][L5] <-- block types 0, 1, and 0
|
|
|
|
[D0][D1, D2, D3] <-- block types 0 and 1
|
|
|
|
The subsequent blocks within each block category must have different
|
|
block types, but blocks further away in the block sequence can have
|
|
the same types. The block types are numbered from 0 to the maximum
|
|
block type number of 255 and the first block of each block category
|
|
must have type 0. The block structure of a meta-block is represented
|
|
by the sequence of block-switch commands for each block category,
|
|
where a block-switch command is a pair <block type, block length>.
|
|
The block-switch commands are represented in the compressed data
|
|
before the start of each new block using a Huffman code tree for
|
|
block types and a separate Huffman code tree for block lengths for
|
|
each block category. In the above example the physical layout of the
|
|
meta-block is the following:
|
|
|
|
IaC0 L0 L1 LBlockSwitch(1, 3) L2 D0 IaC1 DBlockSwitch(1, 1) D1
|
|
IaCBlockSwitch(1, 2) IaC2 L3 L4 D2 IaC3 LBlockSwitch(0, 1) D3
|
|
|
|
Note that the block switch commands for the first blocks are not part
|
|
of the meta-block compressed data part, they are encoded in the meta-
|
|
block header. The code trees for block types and lengths (total of
|
|
six Huffman code trees) appear in a compact form in the meta-block
|
|
header.
|
|
|
|
Each type of value (insert-and-copy lengths, literals and distances)
|
|
can be encoded with any Huffman tree from a collection of Huffman
|
|
trees of the same kind appearing in the meta-block header. The
|
|
particular Huffman tree used can depend on two factors: the block type
|
|
of the block the value appears in, and the context of the value. In
|
|
the case of the literals, the context is the previous two bytes in
|
|
the input data, and in the case of distances, the context is the copy
|
|
length from the same command. For insert-and-copy lengths, no context
|
|
is used and the Huffman tree depends only on the block type (in fact,
|
|
the index of the Huffman tree is the block type number). In the case
|
|
of literals and distances, the context is mapped to a context ID in
|
|
the rage [0, 63] for literals and [0, 3] for distances and the matrix
|
|
of the Huffman tree indices for each block type and context ID,
|
|
called the context map, is encoded in a compact form in the meta-
|
|
block header.
|
|
|
|
In addition to the parts listed above (Huffman code trees for insert-
|
|
and-copy lengths, literals, distances, block types and block lengths
|
|
and the context map), the meta-block header contains the number of
|
|
input bytes in the meta-block and two additional parameters used in
|
|
the representation of copy distances (number of "postfix bits" and
|
|
number of direct distance codes).
|
|
|
|
3. Compressed representation of Huffman codes
|
|
|
|
3.1. Introduction to prefix and Huffman coding
|
|
|
|
Prefix coding represents symbols from an a priori known alphabet
|
|
by bit sequences (codes), one code for each symbol, in a manner
|
|
such that different symbols may be represented by bit sequences of
|
|
different lengths, but a parser can always parse an encoded string
|
|
unambiguously symbol-by-symbol.
|
|
|
|
We define a prefix code in terms of a binary tree in which the two
|
|
edges descending from each non-leaf node are labeled 0 and 1 and
|
|
in which the leaf nodes correspond one-for-one with (are labeled
|
|
with) the symbols of the alphabet; then the code for a symbol is
|
|
the sequence of 0's and 1's on the edges leading from the root to
|
|
the leaf labeled with that symbol. For example:
|
|
|
|
/\ Symbol Code
|
|
0 1 ------ ----
|
|
/ \ A 00
|
|
/\ B B 1
|
|
0 1 C 011
|
|
/ \ D 010
|
|
A /\
|
|
0 1
|
|
/ \
|
|
D C
|
|
|
|
A parser can decode the next symbol from an encoded input stream
|
|
by walking down the tree from the root, at each step choosing the
|
|
edge corresponding to the next input bit.
|
|
|
|
Given an alphabet with known symbol frequencies, the Huffman
|
|
algorithm allows the construction of an optimal prefix code (one
|
|
which represents strings with those symbol frequencies using the
|
|
fewest bits of any possible prefix codes for that alphabet). Such
|
|
a code is called a Huffman code. (See reference [1] in Chapter 5,
|
|
references for additional information on Huffman codes.)
|
|
|
|
Note that in the "brotli" format, the Huffman codes for the
|
|
various alphabets must not exceed certain maximum code lengths.
|
|
This constraint complicates the algorithm for computing code
|
|
lengths from symbol frequencies. Again, see Chapter 5, references
|
|
for details.
|
|
|
|
3.2. Use of Huffman coding in the "brotli" format
|
|
|
|
The Huffman codes used for each alphabet in the "brotli" format
|
|
are canonical Huffman codes, which have two additional rules:
|
|
|
|
* All codes of a given bit length have lexicographically
|
|
consecutive values, in the same order as the symbols they
|
|
represent;
|
|
|
|
* Shorter codes lexicographically precede longer codes.
|
|
|
|
We could recode the example above to follow this rule as follows,
|
|
assuming that the order of the alphabet is ABCD:
|
|
|
|
Symbol Code
|
|
------ ----
|
|
A 10
|
|
B 0
|
|
C 110
|
|
D 111
|
|
|
|
I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
|
|
lexicographically consecutive.
|
|
|
|
Given this rule, we can define the canonical Huffman code for an
|
|
alphabet just by giving the bit lengths of the codes for each
|
|
symbol of the alphabet in order; this is sufficient to determine
|
|
the actual codes. In our example, the code is completely defined
|
|
by the sequence of bit lengths (2, 1, 3, 3). The following
|
|
algorithm generates the codes as integers, intended to be read
|
|
from most- to least-significant bit. The code lengths are
|
|
initially in tree[I].Len; the codes are produced in tree[I].Code.
|
|
|
|
1) Count the number of codes for each code length. Let
|
|
bl_count[N] be the number of codes of length N, N >= 1.
|
|
|
|
2) Find the numerical value of the smallest code for each
|
|
code length:
|
|
|
|
code = 0;
|
|
bl_count[0] = 0;
|
|
for (bits = 1; bits <= MAX_BITS; bits++) {
|
|
code = (code + bl_count[bits-1]) << 1;
|
|
next_code[bits] = code;
|
|
}
|
|
|
|
3) Assign numerical values to all codes, using consecutive
|
|
values for all codes of the same length with the base
|
|
values determined at step 2. Codes that are never used
|
|
(which have a bit length of zero) must not be assigned a
|
|
value.
|
|
|
|
for (n = 0; n <= max_code; n++) {
|
|
len = tree[n].Len;
|
|
if (len != 0) {
|
|
tree[n].Code = next_code[len];
|
|
next_code[len]++;
|
|
}
|
|
}
|
|
|
|
Example:
|
|
|
|
Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3, 3,
|
|
2, 4, 4). After step 1, we have:
|
|
|
|
N bl_count[N]
|
|
- -----------
|
|
2 1
|
|
3 5
|
|
4 2
|
|
|
|
Step 2 computes the following next_code values:
|
|
|
|
N next_code[N]
|
|
- ------------
|
|
1 0
|
|
2 0
|
|
3 2
|
|
4 14
|
|
|
|
Step 3 produces the following code values:
|
|
|
|
Symbol Length Code
|
|
------ ------ ----
|
|
A 3 010
|
|
B 3 011
|
|
C 3 100
|
|
D 3 101
|
|
E 3 110
|
|
F 2 00
|
|
G 4 1110
|
|
H 4 1111
|
|
|
|
3.3. Alphabet sizes
|
|
|
|
Huffman codes are used for different purposes in the "brotli"
|
|
format, and each purpose has a different alphabet size. For
|
|
literal codes the alphabet size is 256. For insert-and-copy
|
|
length codes the alphabet size is 704. For block length codes,
|
|
the alphabet size is 26. For distance codes, block type codes and
|
|
the Huffman codes used in compressing the context map, the
|
|
alphabet size is dynamic and is based on other parameters.
|
|
|
|
3.4. Simple Huffman codes
|
|
|
|
The first two bits of the compressed representation of each Huffman
|
|
code distinguishes between simple and complex Huffman codes. If
|
|
this value is 1, then a simple Huffman code follows. Otherwise
|
|
the value indicates the number of leading zeros.
|
|
|
|
A simple Huffman code can have only up to four symbols with non-
|
|
zero code length. The format of the simple Huffman code is as
|
|
follows:
|
|
|
|
2 bits: value of 1 indicates a simple Huffman code
|
|
2 bits: NSYM - 1, where NSYM = # of symbols with non-zero
|
|
code length
|
|
|
|
NSYM symbols, each encoded using ALPHABET_BITS bits
|
|
|
|
1 bit: tree-select, present only for NSYM = 4
|
|
|
|
The value of ALPHABET_BITS depends on the alphabet of the Huffman
|
|
code: it is the smallest number of bits that can represent all
|
|
symbols in the alphabet. E.g. for the alphabet of literal bytes,
|
|
ALPHABET_BITS is 8. The value of each of the NSYM symbols above is
|
|
the value of the ALPHABETS_BITS width machine integer representing
|
|
the symbol modulo the alphabet size of the Huffman code.
|
|
|
|
The (non-zero) code lengths of the symbols can be reconstructed as
|
|
follows:
|
|
|
|
* if NSYM = 1, the code length for the one symbol is one at
|
|
this stage, but only to distinguish it from the other zero
|
|
code length symbols, when encoding this symbol in the
|
|
compressed data stream using this Huffman code later, no
|
|
actual bits are emitted. Similarly, when decoding a symbol
|
|
using this Huffman code, no bits are read and the one symbol
|
|
is returned.
|
|
|
|
* if NSYM = 2, both symbols have code length 1.
|
|
|
|
* if NSYM = 3, the code lengths for the symbols are 1, 2, 2 in
|
|
the order they appear in the representation of the simple
|
|
Huffman code.
|
|
|
|
* if NSYM = 4, the code lengths (in order of symbols decoded)
|
|
depend on the tree-select bit: 2, 2, 2, 2, (tree-select bit 0)
|
|
or 1, 2, 3, 3 (tree-select bit 1).
|
|
|
|
3.5. Complex Huffman codes
|
|
|
|
A complex Huffman code is a canonical Huffman code, defined by the
|
|
sequence of code lengths, as discussed in Paragraph 3.2, above.
|
|
For even greater compactness, the code length sequences themselves
|
|
are compressed using a Huffman code. The alphabet for code lengths
|
|
is as follows:
|
|
|
|
0 - 15: Represent code lengths of 0 - 15
|
|
16: Copy the previous non-zero code length 3 - 6 times
|
|
The next 2 bits indicate repeat length
|
|
(0 = 3, ... , 3 = 6)
|
|
If this is the first code length, or all previous
|
|
code lengths are zero, a code length of 8 is
|
|
repeated 3 - 6 times
|
|
A repeated code length code of 16 modifies the
|
|
repeat count of the previous one as follows:
|
|
repeat count = (4 * (repeat count - 2)) +
|
|
(3 - 6 on the next 2 bits)
|
|
Example: Codes 7, 16 (+2 bits 11), 16 (+2 bits 10)
|
|
will expand to 22 code lengths of 7
|
|
(1 + 4 * (6 - 2) + 5)
|
|
17: Repeat a code length of 0 for 3 - 10 times.
|
|
(3 bits of length)
|
|
A repeated code length code of 17 modifies the
|
|
repeat count of the previous one as follows:
|
|
repeat count = (8 * (repeat count - 2)) +
|
|
(3 - 10 on the next 3 bits)
|
|
|
|
A code length of 0 indicates that the corresponding symbol in the
|
|
alphabet will not occur in the compressed data, and should not
|
|
participate in the Huffman code construction algorithm given
|
|
earlier. A complex Huffman code must have at least two non-zero
|
|
code lengths.
|
|
|
|
The bit lengths of the Huffman code over the code length alphabet
|
|
are compressed with the following static Huffman code:
|
|
|
|
Symbol Code
|
|
------ ----
|
|
0 00
|
|
1 1010
|
|
2 100
|
|
3 11
|
|
4 01
|
|
5 1011
|
|
|
|
We can now define the format of the complex Huffman code as
|
|
follows:
|
|
|
|
2 bits: HSKIP, values of 0, 2 or 3 represent the respective
|
|
number of leading zeros. (Value of 1 indicates the
|
|
Simple Huffman code.)
|
|
|
|
|
|
Code lengths for symbols in the code length alphabet given
|
|
just above, in the order: 1, 2, 3, 4, 0, 17, 5, 6, 16, 7,
|
|
8, 9, 10, 11, 12, 13, 14, 15
|
|
|
|
The code lengths of code length symbols are between 0 and
|
|
5 and they are represented with 2 - 5 bits according to
|
|
the static Huffman code above. A code length of 0 means
|
|
the corresponding code length symbol is not used.
|
|
|
|
If HSKIP is 2 or 3, a respective number of leading code
|
|
lengths are implicit zeros and are not present in the
|
|
code lengths sequence above. If there are at least two non-
|
|
zero code lengths, any trailing zero code lengths are
|
|
omitted, i.e. the last code length in the sequence must
|
|
be non-zero. In this case the sum of (32 >> code length)
|
|
over all the non-zero code lengths must equal to 32.
|
|
|
|
Sequence of code lengths symbols, encoded using the code
|
|
length Huffman code. Any trailing 0 or 17 must be
|
|
omitted, i.e. the last encoded code length symbol must be
|
|
between 1 and 16. The sum of (32768 >> code length) over
|
|
all the non-zero code lengths in the alphabet, including
|
|
those encoded using repeat code(s) of 16, must equal to
|
|
32768.
|
|
|
|
4. Encoding of distances
|
|
|
|
As described in Section 2, one component of a compressed meta-block
|
|
is a sequence of backward distances. In this section we provide the
|
|
details to the encoding of distances.
|
|
|
|
Each distance in the compressed data part of a meta-block is
|
|
represented with a pair <distance code, extra bits>. The distance
|
|
code and the extra bits are encoded back-to-back, the distance code
|
|
is encoded using a Huffman code over the distance code alphabet,
|
|
while the extra bits value is encoded as a fixed-width machine
|
|
integer. The number of extra bits can be 0 - 24, and it is dependent
|
|
on the distance code.
|
|
|
|
To convert a distance code and associated extra bits to a backward
|
|
distance, we need the sequence of past distances and two additional
|
|
parameters, the number of "postfix bits", denoted by NPOSTFIX, and
|
|
the number of direct distance codes, denoted by NDIRECT. Both of
|
|
these parameters are encoded in the meta-block header. We will also
|
|
use the following derived parameter:
|
|
|
|
POSTFIX_MASK = ((1 << NPOSTFIX) - 1)
|
|
|
|
The first 16 distance codes are special short codes that reference
|
|
past distances as follows:
|
|
|
|
0: last distance
|
|
1: second last distance
|
|
2: third last distance
|
|
3: fourth last distance
|
|
4: last distance - 1
|
|
5: last distance + 1
|
|
6: last distance - 2
|
|
7: last distance + 2
|
|
8: last distance - 3
|
|
9: last disatnce + 3
|
|
10: second last distance - 1
|
|
11: second last distance + 1
|
|
12: second last distance - 2
|
|
13: second last distance + 2
|
|
14: second last distance - 3
|
|
15: second last distance + 3
|
|
|
|
The ring-buffer of four last distances is initialized by the values
|
|
16, 15, 11 and 4 (i.e. the fourth last is set to 16, the third last
|
|
to 15, the second last to 11 and the last distance to 4) at the
|
|
beginning of the *stream* (as opposed to the beginning of the meta-
|
|
block) and it is not reset at meta-block boundaries. When a distance
|
|
code 0 appears, the distance it represents (i.e. the last distance
|
|
in the sequence of distances) is not pushed to the ring-buffer of
|
|
last distances, in other words, the expression "(second, third,
|
|
fourth) last distance" means the (second, third, fourth) last
|
|
distance that was not represented by a 0 distance code.
|
|
|
|
The next NDIRECT distance codes, from 16 to 15 + NDIRECT, represent
|
|
distances from 1 to NDIRECT. Neither the distance short codes, nor
|
|
the NDIRECT direct distance codes have any extra bits.
|
|
|
|
Distance codes 16 + NDIRECT and greater all have extra bits, the
|
|
number of extra bits for a distance code "dcode" is given by the
|
|
following formula:
|
|
|
|
ndistbits = 1 + ((dcode - NDIRECT - 16) >> (NPOSTFIX + 1))
|
|
|
|
The maximum number of extra bits is 24, therefore the size of the
|
|
distance code alphabet is (16 + NDIRECT + (48 << NPOSTFIX)).
|
|
|
|
Given a distance code "dcode" (>= 16 + NDIRECT), and extra bits
|
|
"dextra", the backward distance is given by the following formula:
|
|
|
|
hcode = (dcode - NDIRECT - 16) >> NPOSTFIX
|
|
lcode = (dcode - NDIRECT - 16) & POSTFIX_MASK
|
|
offset = ((2 + (hcode & 1)) << ndistbits) - 4;
|
|
distance = ((offset + dextra) << NPOSTFIX) + lcode + NDIRECT + 1
|
|
|
|
5. Encoding of literal insertion lengths and copy lengths
|
|
|
|
As described in Section 2, the literal insertion lengths and backward
|
|
copy lengths are encoded using a single Huffman code. This section
|
|
provides the details to this encoding.
|
|
|
|
Each <insertion length, copy length> pair in the compressed data part
|
|
of a meta-block is represented with the following triplet:
|
|
|
|
<insert-and-copy length code, insert extra bits, copy extra bits>
|
|
|
|
The insert-and-copy length code, the insert extra bits and the copy
|
|
extra bits are encoded back-to-back, the insert-and-copy length code
|
|
is encoded using a Huffman code over the insert-and-copy length code
|
|
alphabet, while the extra bits values are encoded as fixed-width
|
|
machine integers. The number of insert and copy extra bits can be
|
|
0 - 24, and they are dependent on the insert-and-copy length code.
|
|
|
|
Some of the insert-and-copy length codes also express the fact that
|
|
the distance code of the distance in the same command is 0, i.e. the
|
|
distance component of the command is the same as that of the previous
|
|
command. In this case, the distance code and extra bits for the
|
|
distance are omitted from the compressed data stream.
|
|
|
|
We describe the insert-and-copy length code alphabet in terms of the
|
|
(not directly used) insert length code and copy length code
|
|
alphabets. The symbols of the insert length code alphabet, along with
|
|
the number of insert extra bits and the range of the insert lengths
|
|
are as follows:
|
|
|
|
Extra Extra Extra
|
|
Code Bits Lengths Code Bits Lengths Code Bits Lengths
|
|
---- ---- ------ ---- ---- ------- ---- ---- -------
|
|
0 0 0 8 2 10-13 16 6 130-193
|
|
1 0 1 9 2 14-17 17 7 194-321
|
|
2 0 2 10 3 18-25 18 8 322-527
|
|
3 0 3 11 3 26-33 19 9 578-1089
|
|
4 0 4 12 4 34-49 20 10 1090-2113
|
|
5 0 5 13 4 50-65 21 12 2114-6209
|
|
6 1 6,7 14 5 66-97 22 14 6210-22593
|
|
7 1 8,9 15 5 98-129 23 24 22594-16799809
|
|
|
|
The symbols of the copy length code alphabet, along with the number
|
|
of copy extra bits and the range of copy lengths are as follows:
|
|
|
|
Extra Extra Extra
|
|
Code Bits Lengths Code Bits Lengths Code Bits Lengths
|
|
---- ---- ------ ---- ---- ------- ---- ---- -------
|
|
0 0 2 8 1 10,11 16 5 70-101
|
|
1 0 3 9 1 12,13 17 5 102-133
|
|
2 0 4 10 2 14-17 18 6 134-197
|
|
3 0 5 11 2 18-21 19 7 198-325
|
|
4 0 6 12 3 22-29 20 8 326-581
|
|
5 0 7 13 3 30-37 21 9 582-1093
|
|
6 0 8 14 4 38-53 22 10 1094-2117
|
|
7 0 9 15 4 54-69 23 24 2118-16779333
|
|
|
|
To convert an insert-and-copy length code to an insert length code
|
|
and a copy length code, the following table can be used:
|
|
|
|
Insert
|
|
length Copy length code
|
|
code 0-7 8-15 16-23
|
|
+---------+---------+
|
|
| | |
|
|
0-7 | 0-63 | 64-127 | <--- distance code 0
|
|
| | |
|
|
+---------+---------+---------+
|
|
| | | |
|
|
0-7 | 128-191 | 192-255 | 383-447 |
|
|
| | | |
|
|
+---------+---------+---------+
|
|
| | | |
|
|
8-15 | 256-319 | 320-383 | 512-575 |
|
|
| | | |
|
|
+---------+---------+---------+
|
|
| | | |
|
|
16-23 | 448-551 | 576-639 | 640-703 |
|
|
| | | |
|
|
+---------+---------+---------+
|
|
|
|
First, look up the cell with the 64 value range containing the
|
|
insert-and-copy length code, this gives the insert length code and
|
|
the copy length code ranges, both 8 values long. The copy length
|
|
code within its range is determined by the lowest 3 bits of the
|
|
insert-and-copy length code, and the insert length code within its
|
|
range is determined by bits 3-5 (counted from the LSB) of the insert-
|
|
and-copy length code. Given the insert length and copy length codes,
|
|
the actual insert and copy lengths can be obtained by reading the
|
|
number of extra bits given by the tables above.
|
|
|
|
If the insert-and-copy length code is between 0 and 127, the distance
|
|
code of the command is set to zero (the last distance reused).
|
|
|
|
6. Encoding of block switch commands
|
|
|
|
As described in Section 2, a block-switch command is a pair
|
|
<block type, block length>. These are encoded in the compressed data
|
|
part of the meta-block, right before the start of each new block of a
|
|
particular block category.
|
|
|
|
Each block type in the compressed data is represented with a block
|
|
type code, encoded using a Huffman code over the block type code
|
|
alphabet. A block type code 0 means that the block type is the same
|
|
as the type of the second last block from the same block category,
|
|
while a block type code 1 means that the block type equals the last
|
|
block type plus one. If the last block type is the maximal possible,
|
|
then a block type code 1 means block type 0. Block type codes 2 - 257
|
|
represent block types 0 - 255. The second last and last block types
|
|
are initialized with 0 and 1, respectively, at the beginning of each
|
|
meta-block.
|
|
|
|
The first block type of each block category must be 0 and the block
|
|
type of the first block switch command is therefore not encoded in
|
|
the compressed data.
|
|
|
|
The number of different block types in each block category, denoted
|
|
by NBLTYPESL, NBLTYPESI, and NBLTYPESD for literals, insert-and-copy
|
|
lengths and distances, respectively, is encoded in the meta-block
|
|
header, and it must equal to the largest block type plus one in that
|
|
block category. In other words, the set of literal, insert-and-copy
|
|
length and distance block types must be [0..NBLTYPESL-1],
|
|
[0..NBLTYPESI-1], and [0..NBLTYPESD-1], respectively. From this it
|
|
follows that the alphabet size of literal, insert-and-copy length and
|
|
distance block type codes is NBLTYPES + 2, NBLTYPESI + 2 and
|
|
NBLTYPESD + 2, respectively.
|
|
|
|
Each block length in the compressed data is represented with a pair
|
|
<block length code, extra bits>. The block length code and the extra
|
|
bits are encoded back-to-back, the block length code is encoded using
|
|
a Huffman code over the block length code alphabet, while the extra
|
|
bits value is encoded as a fixed-width machine integer. The number of
|
|
extra bits can be 0 - 24, and it is dependent on the block length
|
|
code. The symbols of the block length code alphabet, along with the
|
|
number of extra bits and the range of block lengths are as follows:
|
|
|
|
Extra Extra Extra
|
|
Code Bits Lengths Code Bits Lengths Code Bits Lengths
|
|
---- ---- ------ ---- ---- ------- ---- ---- -------
|
|
0 2 1-4 9 4 65-80 18 7 369-496
|
|
1 2 5-8 10 4 81-96 19 8 497-752
|
|
2 2 9-12 11 4 97-112 20 9 753-1264
|
|
3 2 13-16 12 5 113-144 21 10 1265-2288
|
|
4 3 17-24 13 5 145-176 22 11 2289-4336
|
|
5 3 25-32 14 5 177-208 23 12 4337-8432
|
|
6 3 33-40 15 5 209-240 24 13 8433-16624
|
|
7 3 41-48 16 6 241-304 25 24 16625-16793840
|
|
8 4 49-64 17 6 305-368
|
|
|
|
The first block switch command of each block category is special in
|
|
the sense that it is encoded in the meta-block header, and as
|
|
described earlier the block type code is omitted, since it is an
|
|
implicit zero.
|
|
|
|
7. Context modeling
|
|
|
|
As described in Section 2, the Huffman tree used to encode a literal
|
|
byte or a distance code depends on the context ID and the block type.
|
|
This section specifies how to compute the context ID for a particular
|
|
literal and distance code, and how to encode the context map that
|
|
maps a <context ID, block type> pair to the index of a Huffman
|
|
tree in the array of literal and distance Huffman trees.
|
|
|
|
7.1. Context modes and context ID lookup for literals
|
|
|
|
The context for encoding the next literal is defined by the last
|
|
two bytes in the stream (p1, p2, where p1 is the most recent
|
|
byte), regardless if these bytes are produced by backward
|
|
references or by literal insertions.
|
|
|
|
There are four methods, called context modes, to compute the
|
|
Context ID:
|
|
* MSB6, where the Context ID is the value of six most
|
|
significant bits of p1,
|
|
* LSB6, where the Context ID is the value of six least
|
|
significant bits of p1,
|
|
* UTF8, where the Context ID is a complex function of p1, p2,
|
|
optimized for text compression, and
|
|
* Signed, where Context ID is a complex function of p1, p2,
|
|
optimized for compressing sequences of signed integers.
|
|
|
|
The Context ID for the UTF8 and Signed context modes is computed
|
|
using the following lookup tables Lut0, Lut1, and Lut2.
|
|
|
|
Lut0 :=
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 4, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
8, 12, 16, 12, 12, 20, 12, 16, 24, 28, 12, 12, 32, 12, 36, 12,
|
|
44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 32, 32, 24, 40, 28, 12,
|
|
12, 48, 52, 52, 52, 48, 52, 52, 52, 48, 52, 52, 52, 52, 52, 48,
|
|
52, 52, 52, 52, 52, 48, 52, 52, 52, 52, 52, 24, 12, 28, 12, 12,
|
|
12, 56, 60, 60, 60, 56, 60, 60, 60, 56, 60, 60, 60, 60, 60, 56,
|
|
60, 60, 60, 60, 60, 56, 60, 60, 60, 60, 60, 24, 12, 28, 12, 0,
|
|
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
|
|
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
|
|
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
|
|
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
|
|
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
|
|
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
|
|
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
|
|
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3
|
|
|
|
Lut1 :=
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
|
|
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
|
|
1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
|
|
|
|
Lut2 :=
|
|
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
|
|
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
|
|
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
|
|
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
|
|
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
|
|
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
|
|
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
|
|
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
|
|
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7
|
|
|
|
Given p1 is the last decoded byte and p2 is the second last
|
|
decoded byte the context IDs can be computed as follows:
|
|
|
|
For LSB6 : Context ID = p1 & 0x3f
|
|
For MSB6 : Context ID = p1 >> 2
|
|
For UTF8 : Context ID = Lut0[p1] | Lut1[p2]
|
|
For Signed: Context ID = (Lut2[p1] << 3) | Lut2[p2]
|
|
|
|
The context modes LSB6, MSB6, UTF8, and Signed are denoted by
|
|
integers 0, 1, 2, 3.
|
|
|
|
The context mode is defined for each literal block type and they
|
|
are stored in a consecutive array of bits in the meta-block
|
|
header, always two bits per block type.
|
|
|
|
7.2. Context ID for distances
|
|
|
|
The context for encoding a distance code is defined by the copy
|
|
length corresponding to the distance. The context IDs are 0, 1, 2,
|
|
and 3 for copy lengths 2, 3, 4, and more than 4, respectively.
|
|
|
|
7.3. Encoding of the context map
|
|
|
|
There are two kinds of context maps, for literals and for
|
|
distances. The size of the context map is 64 * NBLTYPESL for
|
|
literals, and 4 * NBLTYPESD for distances. Each value in the
|
|
context map is an integer between 0 and 255, indicating the index
|
|
of the Huffman tree to be used when encoding the next literal or
|
|
distance.
|
|
|
|
The context map is encoded as a one-dimensional array,
|
|
CMAPL[0..(64 * NBLTYPESL - 1)] and CMAPD[0..(4 * NBLTYPESD - 1)].
|
|
|
|
The index of the Huffman tree for encoding a literal or distance
|
|
code with context ID "cid" and block type "bltype" is
|
|
|
|
index of literal Huffman tree = CMAPL[bltype * 64 + cid]
|
|
|
|
index of distance Huffman tree = CMAPD[bltype * 4 + cid]
|
|
|
|
The values of the context map are encoded with the combination
|
|
of run length encoding for zero values and Huffman coding. Let
|
|
RLEMAX denote the number of run length codes and NTREES denote the
|
|
maximum value in the context map plus one. NTREES must equal the
|
|
number of different values in the context map, in other words,
|
|
the different values in the context map must be the [0..NTREES-1]
|
|
interval. The alphabet of the Huffman code has the following
|
|
RLEMAX + NTREES symbols:
|
|
|
|
0: value zero
|
|
1: repeat a zero 2-3 times, read 1 bit for repeat length
|
|
2: repeat a zero 4-7 times, read 2 bits for repeat length
|
|
...
|
|
RLEMAX: repeat a zero (2^RLEMAX)-(2^(RLEMAX+1) - 1) times,
|
|
read RLEMAX bits for repeat length
|
|
RLEMAX + 1: value 1
|
|
...
|
|
RLEMAX + NTREES - 1: value NTREES - 1
|
|
|
|
If RLEMAX = 0, the run length coding is not used, and the symbols
|
|
of the alphabet are directly the values in the context map. We can
|
|
now define the format of the context map (the same format is used
|
|
for literal and distance context maps):
|
|
|
|
1-5 bits: RLEMAX, 0 is encoded with one 0 bit, and values
|
|
1 - 16 are encoded with bit pattern 1xxxx
|
|
|
|
Huffman code with alphabet size NTREES + RLEMAX
|
|
|
|
Context map size values encoded with the above Huffman code
|
|
and run length coding for zero values
|
|
|
|
1 bit: IMTF bit, if set, we do an inverse move-to-front
|
|
transform on the values in the context map to get
|
|
the Huffman code indexes
|
|
|
|
For the encoding of NTREES see Section 9.2.
|
|
|
|
8. Static dictionary
|
|
|
|
At any given point during decoding the compressed data, a reference
|
|
to a duplicated string in the output produced so far has a maximum
|
|
backward distance value, which is the minimum of the window size and
|
|
the number of output bytes produced. However, decoding a distance
|
|
from the input stream, as described in section 4, can produce
|
|
distances that are greater than this maximum allowed value. The
|
|
difference between these distances and the first invalid distance
|
|
value is treated as reference to a word in the static dictionary
|
|
given in Appendix A. The maximum valid copy length for a static
|
|
dictionary reference is 24. The static dictionary has three parts:
|
|
|
|
* DICT[0..DICTSIZE], an array of bytes
|
|
* DOFFSET[0..24], an array of byte offset values for each length
|
|
* NDBITS[0..24], an array of bit-depth values for each length
|
|
|
|
The number of static dictionary words for a given length is:
|
|
|
|
NWORDS[length] = 0 (if length < 3)
|
|
NWORDS[length] = (1 << NDBITS[lengths]) (if length >= 3)
|
|
|
|
DOFFSET and DICTSIZE are defined by the following recursion:
|
|
|
|
DOFFSET[0] = 0
|
|
DOFFSET[length + 1] = DOFFSET[length] + length * NWORDS[length]
|
|
DICTSIZE = DOFFSET[24] + 24 * NWORDS[24]
|
|
|
|
The offset of a word within the DICT array for a given length and
|
|
index is:
|
|
|
|
offset(length, index) = DOFFSET[length] + index * length
|
|
|
|
Each static dictionary word has 64 different forms, given by applying
|
|
a word transformation to a base word in the DICT array. The list of
|
|
word transformations is given in Appendix B. The static dictionary
|
|
word for a <length, distance> pair can be reconstructed as follows:
|
|
|
|
word_id = distance - (max allowed distance + 1)
|
|
index = word_id % NWORDS[length]
|
|
base_word = DICT[offset(length, index)..offset(length, index+1))
|
|
transform_id = word_id >> NBITS[length]
|
|
|
|
The string copied to the output stream is computed by applying the
|
|
transformation to the base dictionary word. If transform_id is
|
|
greater than 63 or length is greater than 24, the compressed data set
|
|
is invalid and must be discarded.
|
|
|
|
9. Compressed data format
|
|
|
|
In this section we describe the format of the compressed data set in
|
|
terms of the format of the individual data items described in the
|
|
previous sections.
|
|
|
|
9.1. Format of the stream header
|
|
|
|
The stream header has only the following one field:
|
|
|
|
1-4 bits: WBITS, a value in the range 16 - 24, value 16 is
|
|
encoded with one 0 bit, and values 17 - 24 are
|
|
encoded with bit pattern 1xxx
|
|
|
|
The size of the sliding window, which is the maximum value of any
|
|
non-dictionary reference backward distance, is given by the
|
|
following formula:
|
|
|
|
window size = (1 << WBITS) - 16
|
|
|
|
9.2. Format of the meta-block header
|
|
|
|
A compliant compressed data set has at least one meta-block. Each
|
|
meta-block contains a header with information about the
|
|
uncompressed length of the meta-block, and a bit signaling if the
|
|
meta-block is the last one. The format of the meta-block header is
|
|
the following:
|
|
|
|
1 bit: ISLAST, set to 1 if this is the last meta-block
|
|
1 bit: ISEMPTY, set to 1 if the meta-block is empty, this
|
|
field is only present if ISLAST bit is set, since
|
|
only the last meta-block can be empty
|
|
2 bits: MNIBBLES, (# of nibbles to represent the length) - 4
|
|
|
|
(MNIBBLES + 4) x 4 bits: MLEN - 1, where MLEN is the length
|
|
of the meta-block in the input data in bytes
|
|
|
|
1 bit: ISUNCOMPRESSED, if set to 1, any bits of input up to
|
|
the next byte boundary are ignored, and the rest of
|
|
the meta-block contains MLEN bytes of literal data;
|
|
this field is only present if ISLAST bit is not set
|
|
|
|
1-11 bits: NBLTYPESL, # of literal block types, encoded with
|
|
the following variable length code:
|
|
|
|
Value Bit Pattern
|
|
----- -----------
|
|
1 0
|
|
2 1000
|
|
3-4 1001x
|
|
5-8 1010xx
|
|
9-16 1011xxx
|
|
17-32 1100xxxx
|
|
33-64 1101xxxxx
|
|
65-128 1110xxxxxx
|
|
129-256 1111xxxxxxx
|
|
|
|
Huffman code over the block type code alphabet for literal
|
|
block types, appears only if NBLTYPESL >= 2
|
|
|
|
Huffman code over the block length code alphabet for literal
|
|
block lengths, appears only if NBLTYPESL >= 2
|
|
|
|
Block length code + Extra bits for first literal block
|
|
length, appears only if NBLTYPESL >= 2
|
|
|
|
1-11 bits: NBLTYPESI, # of insert-and-copy block types, encoded
|
|
with the same variable length code as above
|
|
|
|
Huffman code over the block type code alphabet for insert-
|
|
and-copy block types, only if NBLTYPESI >= 2
|
|
|
|
Huffman code over the block length code alphabet for insert-
|
|
and-copy block lengths, only if NBLTYPESI >= 2
|
|
|
|
Block length code + Extra bits for first insert-and-copy
|
|
block length, only if NBLTYPESI >= 2
|
|
|
|
1-11 bits: NBLTYPESD, # of distance block types, encoded with
|
|
the same variable length code as above
|
|
|
|
Huffman code over the block type code alphabet for distance
|
|
block types, appears only if NBLTYPESD >= 2
|
|
|
|
Huffman code over the block length code alphabet for
|
|
distance block lengths, only if NBLTYPESD >= 2
|
|
|
|
Block length code + Extra bits for first distance block
|
|
length, only if NBLTYPESD >= 2
|
|
|
|
2 bits: NPOSTFIX, parameter used in the distance coding
|
|
|
|
4 bits: four most significant bits of NDIRECT, to get the
|
|
actual value of the parameter NDIRECT, left-shift
|
|
this four bit number by NPOSTFIX bits
|
|
|
|
NBLTYPESL x 2 bits: context mode for each literal block type
|
|
|
|
1-11 bits: NTREESL, # of literal Huffman trees, encoded with
|
|
the same variable length code as NBLTYPESL
|
|
|
|
Literal context map, encoded as described in Paragraph 7.3,
|
|
appears only if NTREESL >= 2, otherwise the context map
|
|
has only zero values
|
|
|
|
1-11 bits: NTREESD, # of distance Huffman trees, encoded with
|
|
the same variable length code as NBLTYPESD
|
|
|
|
Distance context map, encoded as described in Paragraph 7.3,
|
|
appears only if NTREESD >= 2, otherwise the context map
|
|
has only zero values
|
|
|
|
NTREESL Huffman codes for literals
|
|
|
|
NBLTYPESI Huffman codes for insert-and-copy lengths
|
|
|
|
NTREESD Huffman codes for distances
|
|
|
|
9.3. Format of the meta-block data
|
|
|
|
The compressed data part of a meta-block consists of a series of
|
|
commands. Each command has the following format:
|
|
|
|
Block type code for next insert-and-copy block type, appears
|
|
only if NBLTYPESI >= 2 and the previous insert-and-copy
|
|
block has ended
|
|
|
|
Block length code + Extra bits for next insert-and-copy
|
|
block length, appears only if NBLTYPESI >= 2 and the
|
|
previous insert and-copy block has ended
|
|
|
|
Insert-and-copy length, encoded as in section 5, using the
|
|
insert-and-copy length Huffman code with the current
|
|
insert-and-copy block type index
|
|
|
|
Insert length number of literals, with the following format:
|
|
|
|
Block type code for next literal block type, appears
|
|
only if NBLTYPESL >= 2 and the previous literal
|
|
block has ended
|
|
|
|
Block length code + Extra bits for next literal block
|
|
length, appears only if NBLTYPESL >= 2 and the
|
|
previous literal block has ended
|
|
|
|
Next byte of the input data, encoded with the literal
|
|
Huffman code with the index determined by the
|
|
previuos two bytes of the input data, the current
|
|
literal block type and the context map, as
|
|
described in Paragraph 7.3.
|
|
|
|
Block type code for next distance block type, appears only
|
|
if NBLTYPESD >= 2 and the previous distance block has
|
|
ended
|
|
|
|
Block length code + Extra bits for next distance block
|
|
length, appears only if NBLTYPESD >= 2 and the previous
|
|
distance block has ended
|
|
|
|
Distance code, encoded as in section 4, using the distance
|
|
Huffman code with the current distance block type index,
|
|
appears only if the distance code is not an implicit 0,
|
|
as indicated by the insert-and-copy length code
|
|
|
|
The number of commands in the meta-block is such that the sum of
|
|
insert lengths and copy lengths over all the commands gives the
|
|
uncompressed length, MLEN encoded in the meta-block header.
|
|
|
|
10. Decoding algorithm
|
|
|
|
The decoding algorithm that produces the output data is as follows:
|
|
|
|
read window size
|
|
do
|
|
read ISLAST bit
|
|
if ISLAST
|
|
read ISEMPTY bit
|
|
if ISEMPTY
|
|
break from loop
|
|
read MLEN
|
|
if not ISLAST
|
|
read ISUNCOMPRESSED bit
|
|
if ISUNCOMPRESSED
|
|
skip any bits up to the next byte boundary
|
|
copy MLEN bytes of input to the output stream
|
|
continue to the next meta-block
|
|
loop for each three block categories (i = L, I, D)
|
|
read NBLTYPESi
|
|
if NBLTYPESi >= 2
|
|
read Huffman code for block types, HTREE_BTYPE_i
|
|
read Huffman code for block lengths, HTREE_BLEN_i
|
|
read block length, BLEN_i
|
|
set block type, BTYPE_i to 0
|
|
initialize second last and last block types to 0 and 1
|
|
else
|
|
set block type, BTYPE_i to 0
|
|
set block length, BLEN_i to 268435456
|
|
read NPOSTFIX and NDIRECT
|
|
read array of literal context modes, CMODE[]
|
|
read NTREESL
|
|
if NTREESL >= 2
|
|
read literal context map, CMAPL[]
|
|
else
|
|
fill CMAPL[] with zeros
|
|
read NTREESD
|
|
if NTREESD >= 2
|
|
read distance context map, CMAPD[]
|
|
else
|
|
fill CMAPD[] with zeros
|
|
read array of Huffman codes for literals, HTREEL[]
|
|
read array of Huffman codes for insert-and-copy, HTREEI[]
|
|
read array of Huffman codes for distances, HTREED[]
|
|
do
|
|
if BLEN_I is zero
|
|
read block type using HTREE_BTYPE_I and set BTYPE_I
|
|
read block length using HTREE_BLEN_I and set BLEN_I
|
|
decrement BLEN_I
|
|
read insert and copy length, ILEN, CLEN with HTREEI[BTYPE_I]
|
|
loop for ILEN
|
|
if BLEN_L is zero
|
|
read block type using HTREE_BTYPE_L and set BTYPE_L
|
|
read block length using HTREE_BLEN_L and set BLEN_L
|
|
decrement BLEN_L
|
|
look up context mode CMODE[BTYPE_L]
|
|
compute context ID, CIDL from last two bytes of output
|
|
read literal using HTREEL[CMAPL[64 * BTYPE_L + CIDL]]
|
|
copy literal to output stream
|
|
if number of output bytes produced in the loop is MLEN
|
|
break from loop
|
|
if distance code is implicit zero from insert-and-copy code
|
|
set backward distance to the last distance
|
|
else
|
|
if BLEN_D is zero
|
|
read block type using HTREE_BTYPE_D and set BTYPE_D
|
|
read block length using HTREE_BLEN_D and set BLEN_D
|
|
decrement BLEN_D
|
|
compute context ID, CIDD from CLEN
|
|
read distance code with HTREED[CMAPD[4 * BTYPE_D + CIDD]]
|
|
compute distance by distance short code substitution
|
|
move backwards distance bytes in the output stream, and
|
|
copy CLEN bytes from this position to the output stream,
|
|
or look up the static dictionary word and copy it to the
|
|
output stram
|
|
while number of output bytes produced in the loop < MLEN
|
|
while not ISLAST
|
|
|
|
Note that a duplicated string reference may refer to a string in a
|
|
previous meta-block, i.e. the backward distance may cross one or
|
|
more meta-block boundaries. However a backward copy distance
|
|
cannot refer past the beginning of the output stream and it can
|
|
not be greater than the window size; any such distance must be
|
|
interpreted as a reference to a static dictionary word. Also note
|
|
that the referenced string may overlap the current position, for
|
|
example, if the last 2 bytes decoded have values X and Y, a string
|
|
reference with <length = 5, distance = 2> adds X,Y,X,Y,X to the
|
|
output stream.
|
|
|
|
11. References
|
|
|
|
[1] Huffman, D. A., "A Method for the Construction of Minimum
|
|
Redundancy Codes", Proceedings of the Institute of Radio
|
|
Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
|
|
|
|
[2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
|
|
Compression", IEEE Transactions on Information Theory, Vol. 23,
|
|
No. 3, pp. 337-343.
|
|
|
|
[3] Deutsch, P., "DEFLATE Compressed Data Format Specification
|
|
version 1.3", RFC 1951, Aladdin Enterprises, May 1996.
|
|
http://www.ietf.org/rfc/rfc1951.txt
|
|
|
|
12. Source code
|
|
|
|
Source code for a C language implementation of a "brotli" compliant
|
|
decompressor and a C++ language implementation of a compressor is
|
|
available in the brotli/ directory within the font-compression-
|
|
reference open-source project:
|
|
https://code.google.com/p/font-compression-reference/source/browse/
|
|
|
|
Appendix A. List of dictionary words
|
|
|
|
TO BE WRITTEN
|
|
|
|
Appendix B. List of word transformations
|
|
|
|
TO BE WRITTEN
|