brotli/research

Introduction

In this directory we publish simple tools for analyzing backward reference distance distributions in LZ77 compression. We developed these tools to make the encoding of distances in large-window brotli more efficient. In large-window compression the average cost of a backward reference distance is higher, which may allow more advanced encoding strategies, such as delta coding or an increase in context size, to bring significant improvements in compression density. Our tools visualize the backward references as histogram images: each pixel shows how many distances of a certain range exist at a certain locality in the data. The human visual system is excellent at pattern detection, so we tried to roughly identify patterns visually before moving on to more quantitative analysis. These tools may also prove useful in the development of other LZ77-based compressors, and we hope you try them out.

Tools

find_opt_references

This tool generates optimal (match-length-wise) backward references for every position in the input files and stores them in a *.dist file, whose format is described below.

Example usage:

find_opt_references input.txt output.dist
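To illustrate what "optimal (match-length-wise)" means, here is a minimal brute-force sketch that finds, for each position, the longest match starting at any earlier position. It is not the tool's implementation: the `Match` struct, the `LongestMatches` function, and the tie-breaking rule (prefer the smaller distance) are assumptions made for this example, and the quadratic scan is only practical for small inputs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch: length 0 means "no match".
struct Match {
  size_t length;
  size_t distance;
};

// For each position i, scans all earlier positions j < i and keeps the
// longest common prefix of data[j..] and data[i..]. A match may run past
// position i (overlapping copy), as is usual in LZ77.
// Assumption: ties are broken in favor of the smaller distance.
std::vector<Match> LongestMatches(const std::string& data) {
  std::vector<Match> best(data.size(), Match{0, 0});
  for (size_t i = 1; i < data.size(); ++i) {
    // Iterate j from i - 1 down to 0 so that closer candidates come first.
    for (size_t j = i; j-- > 0;) {
      size_t len = 0;
      while (i + len < data.size() && data[j + len] == data[i + len]) ++len;
      if (len > best[i].length) best[i] = Match{len, i - j};
    }
  }
  return best;
}
```

For the input "abcabc", position 3 gets a match of length 3 at distance 3, while positions 0 to 2 have no match.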

draw_histogram

This tool generates a visualization of the distribution of backward references stored in a *.dist file. The original file size has to be specified as the second parameter. The output is a grayscale PGM (binary) image.

Example usage:

draw_histogram input.dist 65536 output.pgm

[Figure: example of an image produced by draw_histogram]
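The idea of "one pixel per range of distances at a certain locality" can be sketched as a two-dimensional bucket count followed by a binary PGM write. This is a rough illustration rather than the tool's code: the `Reference` struct, both function names, and the linear bucketing on both axes are assumptions, and draw_histogram itself may map distances to rows differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical in-memory record: one backward reference.
struct Reference {
  int pos;
  int dist;
};

// Counts references per (position bucket, distance bucket) cell.
// Assumptions: linear bucketing on both axes, file_size > 0, max_dist > 0.
// Column x covers a range of positions, row y covers a range of distances.
std::vector<uint32_t> BuildHistogram(const std::vector<Reference>& refs,
                                     int file_size, int max_dist,
                                     int width, int height) {
  std::vector<uint32_t> counts(static_cast<size_t>(width) * height, 0);
  for (const Reference& r : refs) {
    if (r.pos < 0 || r.pos >= file_size) continue;   // out of range
    if (r.dist <= 0 || r.dist > max_dist) continue;  // out of range
    int x = static_cast<int>(static_cast<int64_t>(r.pos) * width / file_size);
    int y = static_cast<int>(static_cast<int64_t>(r.dist - 1) * height /
                             max_dist);
    ++counts[static_cast<size_t>(y) * width + x];
  }
  return counts;
}

// Scales counts to 0..255 and writes them as a binary (P5) PGM image.
bool WritePgm(const char* path, const std::vector<uint32_t>& counts,
              int width, int height) {
  uint32_t max_count = 1;  // avoids division by zero for an empty histogram
  for (uint32_t c : counts) max_count = std::max(max_count, c);
  FILE* f = fopen(path, "wb");
  if (!f) return false;
  fprintf(f, "P5\n%d %d\n255\n", width, height);
  for (uint32_t c : counts) {
    fputc(static_cast<int>(static_cast<uint64_t>(c) * 255 / max_count), f);
  }
  return fclose(f) == 0;
}
```

Brighter pixels then correspond to cells that contain more backward references.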

draw_diff

This tool generates a diff PPM (binary) image from two input 8-bit PGM (binary) images. The input images must be the same size. It is useful for comparing different backward reference distributions for the same input file, and is normally used to compare output images from the draw_histogram tool.

Example usage:

draw_diff image1.pgm image2.pgm diff.ppm

[Figures: two example input images and the resulting diff image]
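A per-pixel diff of two grayscale buffers can be sketched as follows. The color scheme here (red where the first image is brighter, green where the second is brighter) and the `DiffToRgb` name are assumptions chosen for illustration. The colors draw_diff actually uses are not described in this document.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds interleaved RGB pixel data (the layout used by binary PPM) from two
// equally sized 8-bit grayscale buffers. Returns an empty vector if the
// sizes differ.
// Assumed color scheme: red = first image brighter, green = second brighter,
// black = identical pixels.
std::vector<uint8_t> DiffToRgb(const std::vector<uint8_t>& a,
                               const std::vector<uint8_t>& b) {
  std::vector<uint8_t> rgb;
  if (a.size() != b.size()) return rgb;
  rgb.reserve(a.size() * 3);
  for (size_t i = 0; i < a.size(); ++i) {
    uint8_t red = a[i] > b[i] ? static_cast<uint8_t>(a[i] - b[i]) : 0;
    uint8_t green = b[i] > a[i] ? static_cast<uint8_t>(b[i] - a[i]) : 0;
    rgb.push_back(red);
    rgb.push_back(green);
    rgb.push_back(0);  // blue channel is unused in this sketch
  }
  return rgb;
}
```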

Backward distance file format

The format of *.dist files is as follows:

[[     0| match length][     1|position|distance]...]
 [1 byte|      4 bytes][1 byte| 4 bytes| 4 bytes]

In more detail: each backward reference is stored as a position-distance pair, optionally preceded by a copy length. A copy length is prefixed with flag byte 0; a position-distance pair is prefixed with flag byte 1. Each number is a 32-bit integer. A copy length always comes before its position-distance pair. A standalone copy length (one not followed by a position-distance pair) is allowed, in which case it is ignored.
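The record layout can be made concrete with a small serialization sketch. The helper names below are hypothetical and not part of the research tools. The sketch also assumes that the 32-bit integers are stored in the machine's native byte order, which the format description does not state.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Appends a 32-bit integer to the buffer.
// Assumption: native byte order, since the format text does not specify one.
static void AppendInt32(std::vector<uint8_t>* out, int32_t value) {
  uint8_t bytes[sizeof(value)];
  std::memcpy(bytes, &value, sizeof(value));
  out->insert(out->end(), bytes, bytes + sizeof(value));
}

// Appends an optional copy-length record: flag byte 0, then the length.
void AppendCopyLength(std::vector<uint8_t>* out, int32_t copy) {
  out->push_back(0);
  AppendInt32(out, copy);
}

// Appends a position-distance record: flag byte 1, then position, distance.
void AppendPositionDistance(std::vector<uint8_t>* out, int32_t pos,
                            int32_t dist) {
  out->push_back(1);
  AppendInt32(out, pos);
  AppendInt32(out, dist);
}
```

A reference with a copy length therefore occupies 5 + 9 = 14 bytes, and one without a copy length occupies 9 bytes.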

Here's an example of how to read from a *.dist file:

#include "read_dist.h"

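/* Open in binary mode; "input.dist" is an example file name. */
FILE* fin = fopen("input.dist", "rb");
int copy, pos, dist;
/* Each successful call yields one backward reference. */
while (ReadBackwardReference(fin, &copy, &pos, &dist)) {
   ...
}
fclose(fin);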