Documentation:
- add note that brotli is a "stream" format, not an archive-like
- regenerate .1 with Pandoc
Build:
- drop legacy "BROTLI_BUILD_PORTABLE" option
- drop "BROTLI_SANITIZED" definition
Code:
- c: comb includes
- c/enc: extract encoder state into separate header
- c/enc: drop designated q10 codepath
- c/enc: dealing better with flushing of empty stream
- fix MSVC compilation
API:
- py: use library version instead of one in version.h
- c: add plugable API to report consumed input / produced output
- c/java: support "lean" prepared dictionaries (without copy of source)
Not all combinations are migrated to the initial configuration; corresponding TODOs added.
Drive-by: additional combinations uncovered minor portability problems -> fixed
Drive-by: remove no-longer used "script" files.
Co-authored-by: Eugene Kliuchnikov <eustas@chromium.org>
- fix formatting
- fix type conversion
- fix no-op arithmetic with null-pointer
- improve performance of hash_longest_match64
- go: detect read after close
- java decoder: support compound dictionary
- remove executable flag on non-scripts
* New feature: "Large Window Brotli"
By setting special encoder/decoder flag it is now possible to extend
LZ-window up to 30 bits; though produced stream will not be RFC7932
compliant.
Added new dictionary generator - "DSH". It combines speed of "Sieve"
and quality of "DM". Plus utilities to prepare train corpora
(remove unique strings).
Improved compression ratio: now two sub-blocks could be stitched:
the last copy command could be extended to span the next sub-block.
Fixed compression ineffectiveness caused by floating numbers rounding and
wrong cost heuristic.
Other C changes:
- combined / moved `context.h` to `common`
- moved transforms to `common`
- unified some aspects of code formatting
- added an abstraction for encoder (static) dictionary
- moved default allocator/deallocator functions to `common`
brotli CLI:
- window size is auto-adjusted if not specified explicitly
Java:
- added "eager" decoding both to JNI wrapper and pure decoder
- huge speed-up of `DictionaryData` initialization
* Add dictionaryless compressed dictionary
* Fix `sources.lst`
* Fix `sources.lst` and add a note that `libtool` is also required.
* Update setup.py
* Fix `EagerStreamTest`
* Fix BUILD file
* Add missing `libdivsufsort` dependency
* Fix "unused parameter" warning.
* remove `const` on `BrotliDictionary` members
* extend `ZofliNode` distance range to 128MiB
* add missing `port.h` include to `quality.h`
* fix typo in encoder API-doc
* regenerate `decode.min.js`
* Add .nf and .fi tags everywhere they were missing
* Consistently use Section X.X. instead of the following:
Paragraph X.X.
section X
* Fix minor grammar issues
* At the beginning of the simple prefix code section, telling us that "a value
of 1 indicates the number of leading zeros" is not very helpful. Instead, it
should indicate that it means a complex prefix code and point the reader to
the relevant section (which repeats this information in more detail)
* Clearly indicate that reusing a value is an error! This seems to be the
behavior of the of the reference implementation.
* Clarify what the termination conditions are while reading the prefix codes.
Also, indicate that it is an error if the prefix tree is over-subscribed or
under-subscribed.
* Clearly state what is the maximum number of individual symbols that may be
read. This ensures that it is forbidden to an stream that continually says that
the symbols have zero length.
* In the description about "three categories", explicitly number them instead
of using a giant paragraph that is harder to follow.
* Switch lists of items to consistently use American style commas. The American
style lists is better for clarity purposes. Consider the following:
-Each category of value (insert and copy lengths, literals and distances)
+Each category of value (insert and copy lengths, literals, and distances)
* Make sure not to break a hyphenated phrase with a newline. When the nroff
file is processed, "insert-\nand-copy" becomes "insert- and-copy", making it
inconsistent with other uses of the hyphenated phrase.
* Consistently use the same hyphenated phrase if referred to as a single unit.
"insert and copy" -> "insert-and-copy"
"least significant" -> "least-significant"
"most significant" -> "most-significant"
"fixed length" -> "fixed-length"
"block switch" -> "block-switch".
* Consistently use "indexes" instead of "indices"
Many of the fields are copy-pastes of each other, but differ slightly
in placement of words, capitalization, or other random
oddities. This commit makes it so that if you simply do a search
replace on these following passages, you get the same thing:
s/NBLTYPESX/(NBLTYPESI|NBLTYPESL|NBLTYPESD)/g
s/CATEGORY/(insert-and-copy|literal|distance)/g
>>>
1-11 bits: NBLTYPESX, # of CATEGORY block types, encoded
with the same variable length code as above
Prefix code over the block type code alphabet for
CATEGORY block types, appears only if NBLTYPESX >= 2
Prefix code over the block count code alphabet for
CATEGORY block counts, appears only if NBLTYPESX >= 2
Block count code + Extra bits for first CATEGORY
block count, appears only if NBLTYPESX >= 2
<<<
>>>
Block type code for next CATEGORY block type, appears
only if NBLTYPESX >= 2 and the previous CATEGORY
block count is zero
Block count code + extra bits for next CATEGORY
block count, appears only if NBLTYPESX >= 2 and the
previous CATEGORY block count is zero
<<<
* Acknowledge the fact that the context map is conceptually really a
two-dimensional matrix with 2 different keys, but in reality stored
as a one-dimensional array.
* Mention that InverseMoveToFrontTransform will not cause the
context map to have invalid indexes. This gives someone implementing
a decoder sanity that they do not have to go through the context
map again and check that all values are less than NTREES.
* The phrase "difference between these distances" can either refer to
the conceptual difference (i.e. they hae different semantic meaning)
or to the mathematical difference (i.e. use substraction for the two).
Instead, just remove the sentence since the equations below make it
clear what we're supposed to do here.
* This value is useful in implementing the decoder since we can know
ahead-of-time what size buffer is needed to contain the output of a
transformed word.
* Rather than say "lower 3 bits" in one sentence and "bits 3-5" in
the sentence right after, just consistently use the same convention
and say "0-2" and "3-5".
* Provide exhaustive list of all the ways the last two bytes can be
sourced from.
* Also make a clear connection in this section that there are only 64
context IDs for literals. This is important for the indexing math
in context maps to make sense.