From c23a07830e8b63d610a6018a4a1e1bf8cd7975e7 Mon Sep 17 00:00:00 2001 From: inikep Date: Wed, 24 Aug 2016 21:42:43 +0200 Subject: [PATCH 1/7] updated projects\README.md --- projects/README.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/projects/README.md b/projects/README.md index c2fa7478..96202c1c 100644 --- a/projects/README.md +++ b/projects/README.md @@ -5,6 +5,19 @@ projects for various integrated development environments (IDE) The following projects are included with the zstd distribution: - cmake - CMake project contributed by Artyom Dymchenko +- VS2005 - Visual Studio 2005 project - VS2008 - Visual Studio 2008 project - VS2010 - Visual Studio 2010 project (which also works well with Visual Studio 2012, 2013, 2015) - build - command line scripts prepared for Visual Studio compilation without IDE + + +#### How to compile zstd with Visual Studio + +1. Install Visual Studio e.g. VS 2015 Community Edition (it's free). +2. Download the latest version of zstd from https://github.com/Cyan4973/zstd/releases +3. Decompress ZIP archive. +4. Go to decompressed directory then to "projects" then "VS2010" and open zstd.sln +5. Visual Studio will ask about converting VS2010 project to VS2015 and you should agree. +6. Change "Debug" to "Release" and if you have 64-bit Windows change also "Win32" to "x64". +7. Press F7 on keyboard or select "BUILD" from the menu bar and choose "Build Solution". +8. If compilation will be fine a compiled executable will be in projects\VS2010\bin\x64\Release\zstd.exe From 7f18c6e5ef36e0c0e4921384c45502d726ac13fb Mon Sep 17 00:00:00 2001 From: inikep Date: Thu, 25 Aug 2016 08:47:18 +0200 Subject: [PATCH 2/7] improved tests/README.md --- projects/README.md | 8 ++++---- tests/Makefile | 10 +++++++--- tests/README.md | 13 ++++++++++++- tests/zbufftest.c | 2 +- 4 files changed, 24 insertions(+), 9 deletions(-) diff --git a/projects/README.md b/projects/README.md index 96202c1c..d724cdcb 100644 --- a/projects/README.md +++ b/projects/README.md @@ -16,8 +16,8 @@ The following projects are included with the zstd distribution: 1. Install Visual Studio e.g. VS 2015 Community Edition (it's free). 2. Download the latest version of zstd from https://github.com/Cyan4973/zstd/releases 3. Decompress ZIP archive. -4. Go to decompressed directory then to "projects" then "VS2010" and open zstd.sln +4. Go to decompressed directory then to `projects` then `VS2010` and open `zstd.sln` 5. Visual Studio will ask about converting VS2010 project to VS2015 and you should agree. -6. Change "Debug" to "Release" and if you have 64-bit Windows change also "Win32" to "x64". -7. Press F7 on keyboard or select "BUILD" from the menu bar and choose "Build Solution". -8. If compilation will be fine a compiled executable will be in projects\VS2010\bin\x64\Release\zstd.exe +6. Change `Debug` to `Release` and if you have 64-bit Windows change also `Win32` to `x64`. +7. Press F7 on keyboard or select `BUILD` from the menu bar and choose `Build Solution`. +8. 
If compilation will be fine a compiled executable will be in `projects\VS2010\bin\x64\Release\zstd.exe` diff --git a/tests/Makefile b/tests/Makefile index a8fb97c2..c054bbbb 100644 --- a/tests/Makefile +++ b/tests/Makefile @@ -22,13 +22,17 @@ # - zstd homepage : http://www.zstd.net/ # ########################################################################## # datagen : Synthetic and parametrable data generator, for tests +# fullbench : Precisely measure speed for each zstd inner functions +# fullbench32: Same as fullbench, but forced to compile in 32-bits mode # fuzzer : Test tool, to check zstd integrity on target platform # fuzzer32: Same as fuzzer, but forced to compile in 32-bits mode +# paramgrill : parameter tester for zstd +# test-zstd-speed.py : script for testing zstd speed difference between commits +# versionsTest : compatibility test between zstd versions stored on Github (v0.1+) # zbufftest : Test tool, to check ZBUFF integrity on target platform # zbufftest32: Same as zbufftest, but forced to compile in 32-bits mode -# fullbench : Precisely measure speed for each zstd inner function -# fullbench32: Same as fullbench, but forced to compile in 32-bits mode -# versionstest : Compatibility test between zstd versions stored on Github (v0.1+) +# zstreamtest : Fuzzer test tool for zstd streaming API +# zbufftest32: Same as zstreamtest, but forced to compile in 32-bits mode # ########################################################################## DESTDIR?= diff --git a/tests/README.md b/tests/README.md index 7b379541..309638f9 100644 --- a/tests/README.md +++ b/tests/README.md @@ -1,6 +1,17 @@ -scripts for automated testing of zstd +programs and scripts for automated testing of zstd ================================ +This directory contains the following programs and scripts: +- datagen : Synthetic and parametrable data generator, for tests +- fullbench : Precisely measure speed for each zstd inner functions +- fuzzer : Test tool, to check zstd integrity on target platform +- paramgrill : parameter tester for zstd +- test-zstd-speed.py : script for testing zstd speed difference between commits +- test-zstd-versions.py : compatibility test between zstd versions stored on Github (v0.1+) +- zbufftest : Test tool, to check ZBUFF integrity on target platform +- zstreamtest : Fuzzer test tool for zstd streaming API + + #### test-zstd-versions.py - script for testing zstd interoperability between versions This script creates `versionsTest` directory to which zstd repository is cloned. 
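As an illustration, a local test session with these tools might look like the following sketch. The `make` target names come from the Makefile comments above; the `datagen -g1M` size flag and the plain `python` invocation are assumptions about the tools' interfaces, not documented behavior.

```
# from the tests/ directory: build a few of the tools listed above
make datagen fuzzer zstreamtest

# generate 1 MB of synthetic, compressible data, then run the fuzzers
./datagen -g1M > sample.bin
./fuzzer
./zstreamtest

# interoperability across released versions (clones zstd into versionsTest/)
python test-zstd-versions.py
```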
diff --git a/tests/zbufftest.c b/tests/zbufftest.c index ce6beb24..151785c6 100644 --- a/tests/zbufftest.c +++ b/tests/zbufftest.c @@ -1,5 +1,5 @@ /* - Fuzzer test tool for zstd_buffered + Fuzzer test tool for ZBUFF - a buffered streaming API for ZSTD Copyright (C) Yann Collet 2015-2016 GPL v2 License From ab2f7708c132aa85b3cca59698829a41baf7ac86 Mon Sep 17 00:00:00 2001 From: inikep Date: Thu, 25 Aug 2016 10:07:20 +0200 Subject: [PATCH 3/7] created programs\README.md --- programs/Makefile | 6 +-- programs/README.md | 93 ++++++++++++++++++++++++++++++++++++++++++++++ projects/README.md | 10 ++--- tests/README.md | 20 +++++----- 4 files changed, 111 insertions(+), 18 deletions(-) create mode 100644 programs/README.md diff --git a/programs/Makefile b/programs/Makefile index d738f40c..dd6e7838 100644 --- a/programs/Makefile +++ b/programs/Makefile @@ -24,9 +24,9 @@ # zstd : Command Line Utility, supporting gzip-like arguments # zstd32 : Same as zstd, but forced to compile in 32-bits mode # zstd_nolegacy : zstd without support of decompression of legacy versions -# zstd-small: minimal zstd without dictBuilder and bench -# zstd-compress: compressor-only version of zstd -# zstd-decompress: decompressor-only version of zstd +# zstd-small : minimal zstd without dictionary builder and benchmark +# zstd-compress : compressor-only version of zstd +# zstd-decompress : decompressor-only version of zstd # ########################################################################## DESTDIR?= diff --git a/programs/README.md b/programs/README.md new file mode 100644 index 00000000..f38e6c49 --- /dev/null +++ b/programs/README.md @@ -0,0 +1,93 @@ +zstd - Command Line Interface +================================ + +Command Line Interface (CLI) can be created using the `make` command without any additional parameters. +There are however other Makefile targets that create different variations of CLI: +- `zstd` : default CLI supporting gzip-like arguments; includes dictionary builder, benchmark, and support for decompression of legacy zstd versions +- `zstd32` : Same as `zstd`, but forced to compile in 32-bits mode +- `zstd_nolegacy` : Same as `zstd` except of support for decompression of legacy zstd versions +- `zstd-small` : CLI optimized for minimal size; without dictionary builder, benchmark, and support for decompression of legacy zstd versions +- `zstd-compress` : compressor-only version of CLI; without dictionary builder, benchmark, and support for decompression of legacy zstd versions +- `zstd-decompress` : decompressor-only version of CLI; without dictionary builder, benchmark, and support for decompression of legacy zstd versions + + +#### Aggregation of parameters +CLI supports aggregation of parameters i.e. `-b1`, `-e18`, and `-i1` can be joined into `-b1e18i1`. + + +#### Dictionary builder in Command Line Interface +Zstd offers a training mode, which can be used to tune the algorithm for a selected +type of data, by providing it with a few samples. The result of the training is stored +in a file selected with the `-o` option (default name is `dictionary`), +which can be loaded before compression and decompression. + +Using a dictionary, the compression ratio achievable on small data improves dramatically. +These compression gains are achieved while simultaneously providing faster compression and decompression speeds. +Dictionary work if there is some correlation in a family of small data (there is no universal dictionary). 
+Hence, deploying one dictionary per type of data will provide the greater benefits. +Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm +will rely more and more on previously decoded content to compress the rest of the file. + +Usage of the dictionary builder and created dictionaries with CLI: +1) Create the dictionary : `zstd --train FullPathToTrainingSet/* -o dictionaryName` +2) Compress with dictionary: `zstd FILE -D dictionaryName` +3) Decompress with dictionary: `zstd --decompress FILE.zst -D dictionaryName` + + + +#### Benchmark in Command Line Interface +CLI includes in-memory compression benchmark module for zstd. +The benchmark is conducted using given filenames which are read into memory and joined together. +It makes benchmark more precise as it eliminates I/O overhead. +Many filenames can be supplied as multiple parameters, parameters with wildcards or +names of directories can be used as parameters with the `-r` option. + +The benchmark measures ratio, compressed size, compression and decompression speed. +One can select compression levels starting from `-b` and ending with `-e`. +The `-i` parameter selects minimal time used for each of tested levels. + + + +#### Usage of Command Line Interface +The full list of options can be obtained with `-h` or `-H` parameter: +``` +Usage : + zstd [args] [FILE(s)] [-o file] + +FILE : a filename + with no FILE, or when FILE is - , read standard input +Arguments : + -# : # compression level (1-19, default:3) + -d : decompression + -D file: use `file` as Dictionary + -o file: result stored into `file` (only if 1 input file) + -f : overwrite output without prompting +--rm : remove source file(s) after successful de/compression + -k : preserve source file(s) (default) + -h/-H : display help/long help and exit + +Advanced arguments : + -V : display Version number and exit + -v : verbose mode; specify multiple times to increase log level (default:2) + -q : suppress warnings; specify twice to suppress errors too + -c : force write to standard output, even if it is the console + -r : operate recursively on directories +--ultra : enable levels beyond 19, up to 22 (requires more memory) +--no-dictID : don't write dictID into header (dictionary compression) +--[no-]check : integrity check (default:enabled) +--test : test compressed file integrity +--[no-]sparse : sparse mode (default:enabled on file, disabled on stdout) + +Dictionary builder : +--train ## : create a dictionary from a training set of files + -o file : `file` is dictionary name (default: dictionary) +--maxdict ## : limit dictionary to specified size (default : 112640) + -s# : dictionary selectivity level (default: 9) +--dictID ## : force dictionary ID to specified value (default: random) + +Benchmark arguments : + -b# : benchmark file(s), using # compression level (default : 1) + -e# : test all compression levels from -bX to # (default: 1) + -i# : minimum evaluation time in seconds (default : 3s) + -B# : cut file into independent blocks of size # (default: no block) + ``` \ No newline at end of file diff --git a/projects/README.md b/projects/README.md index d724cdcb..6353623b 100644 --- a/projects/README.md +++ b/projects/README.md @@ -4,11 +4,11 @@ projects for various integrated development environments (IDE) #### Included projects The following projects are included with the zstd distribution: -- cmake - CMake project contributed by Artyom Dymchenko -- VS2005 - Visual Studio 2005 project -- VS2008 - Visual Studio 2008 project -- VS2010 - 
Visual Studio 2010 project (which also works well with Visual Studio 2012, 2013, 2015) -- build - command line scripts prepared for Visual Studio compilation without IDE +- `cmake` - CMake project contributed by Artyom Dymchenko +- `VS2005` - Visual Studio 2005 project +- `VS2008` - Visual Studio 2008 project +- `VS2010` - Visual Studio 2010 project (which also works well with Visual Studio 2012, 2013, 2015) +- `build` - command line scripts prepared for Visual Studio compilation without IDE #### How to compile zstd with Visual Studio diff --git a/tests/README.md b/tests/README.md index 309638f9..72be1a9b 100644 --- a/tests/README.md +++ b/tests/README.md @@ -2,24 +2,24 @@ programs and scripts for automated testing of zstd ================================ This directory contains the following programs and scripts: -- datagen : Synthetic and parametrable data generator, for tests -- fullbench : Precisely measure speed for each zstd inner functions -- fuzzer : Test tool, to check zstd integrity on target platform -- paramgrill : parameter tester for zstd -- test-zstd-speed.py : script for testing zstd speed difference between commits -- test-zstd-versions.py : compatibility test between zstd versions stored on Github (v0.1+) -- zbufftest : Test tool, to check ZBUFF integrity on target platform -- zstreamtest : Fuzzer test tool for zstd streaming API +- `datagen` : Synthetic and parametrable data generator, for tests +- `fullbench` : Precisely measure speed for each zstd inner functions +- `fuzzer` : Test tool, to check zstd integrity on target platform +- `paramgrill` : parameter tester for zstd +- `test-zstd-speed.py` : script for testing zstd speed difference between commits +- `test-zstd-versions.py` : compatibility test between zstd versions stored on Github (v0.1+) +- `zbufftest` : Test tool to check ZBUFF integrity on target platform +- `zstreamtest` : Fuzzer test tool for zstd streaming API -#### test-zstd-versions.py - script for testing zstd interoperability between versions +#### `test-zstd-versions.py` - script for testing zstd interoperability between versions This script creates `versionsTest` directory to which zstd repository is cloned. Then all taged (released) versions of zstd are compiled. In the following step interoperability between zstd versions is checked. -#### test-zstd-speed.py - script for testing zstd speed difference between commits +#### `test-zstd-speed.py` - script for testing zstd speed difference between commits This script creates `speedTest` directory to which zstd repository is cloned. Then it compiles all branches of zstd and performs a speed benchmark for a given list of files (the `testFileNames` parameter). From 637d33566dabd11527b8ccc7b6ef0ca8094a3fcc Mon Sep 17 00:00:00 2001 From: inikep Date: Thu, 25 Aug 2016 10:42:49 +0200 Subject: [PATCH 4/7] minor README.md corrections --- programs/README.md | 8 ++++---- tests/README.md | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/programs/README.md b/programs/README.md index f38e6c49..4028756c 100644 --- a/programs/README.md +++ b/programs/README.md @@ -29,15 +29,15 @@ Dictionary gains are mostly effective in the first few KB. Then, the compression will rely more and more on previously decoded content to compress the rest of the file. 
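A hypothetical session, before the generic steps listed just below; the options (`--train`, `-o`, `-D`, `--decompress`) are the ones documented in this README, while the corpus layout and file names (`samples/*.json`, `json.dict`, `user1234.json`) are illustrative only:

```
# train a dictionary on a corpus of small, similar files
zstd --train samples/*.json -o json.dict

# compress and decompress one small record with that dictionary
zstd user1234.json -D json.dict
zstd --decompress user1234.json.zst -D json.dict
```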
Usage of the dictionary builder and created dictionaries with CLI: -1) Create the dictionary : `zstd --train FullPathToTrainingSet/* -o dictionaryName` -2) Compress with dictionary: `zstd FILE -D dictionaryName` -3) Decompress with dictionary: `zstd --decompress FILE.zst -D dictionaryName` +1. Create the dictionary : `zstd --train FullPathToTrainingSet/* -o dictionaryName` +2. Compress with dictionary: `zstd FILE -D dictionaryName` +3. Decompress with dictionary: `zstd --decompress FILE.zst -D dictionaryName` #### Benchmark in Command Line Interface CLI includes in-memory compression benchmark module for zstd. -The benchmark is conducted using given filenames which are read into memory and joined together. +The benchmark is conducted using given filenames. The files are read into memory and joined together. It makes benchmark more precise as it eliminates I/O overhead. Many filenames can be supplied as multiple parameters, parameters with wildcards or names of directories can be used as parameters with the `-r` option. diff --git a/tests/README.md b/tests/README.md index 72be1a9b..0d64c98d 100644 --- a/tests/README.md +++ b/tests/README.md @@ -8,7 +8,7 @@ This directory contains the following programs and scripts: - `paramgrill` : parameter tester for zstd - `test-zstd-speed.py` : script for testing zstd speed difference between commits - `test-zstd-versions.py` : compatibility test between zstd versions stored on Github (v0.1+) -- `zbufftest` : Test tool to check ZBUFF integrity on target platform +- `zbufftest` : Test tool to check ZBUFF (a buffered streaming API) integrity - `zstreamtest` : Fuzzer test tool for zstd streaming API From 0132375ba6975b8acb781d83bb3f9614424ca817 Mon Sep 17 00:00:00 2001 From: inikep Date: Thu, 25 Aug 2016 12:20:38 +0200 Subject: [PATCH 5/7] improved zstd_compression_format.md --- programs/README.md | 1 + zstd_compression_format.md | 105 +++++++++++++++++++------------------ 2 files changed, 56 insertions(+), 50 deletions(-) diff --git a/programs/README.md b/programs/README.md index 4028756c..0fbb8a35 100644 --- a/programs/README.md +++ b/programs/README.md @@ -29,6 +29,7 @@ Dictionary gains are mostly effective in the first few KB. Then, the compression will rely more and more on previously decoded content to compress the rest of the file. Usage of the dictionary builder and created dictionaries with CLI: + 1. Create the dictionary : `zstd --train FullPathToTrainingSet/* -o dictionaryName` 2. Compress with dictionary: `zstd FILE -D dictionaryName` 3. Decompress with dictionary: `zstd --decompress FILE.zst -D dictionaryName` diff --git a/zstd_compression_format.md b/zstd_compression_format.md index 61f6a57b..867a9b03 100644 --- a/zstd_compression_format.md +++ b/zstd_compression_format.md @@ -97,6 +97,42 @@ to decode all concatenated frames in their sequential order, delivering the final decompressed result as if it was a single content. +Skippable Frames +---------------- + +| `Magic_Number` | `Frame_Size` | `User_Data` | +|:--------------:|:------------:|:-----------:| +| 4 bytes | 4 bytes | n bytes | + +Skippable frames allow the insertion of user-defined data +into a flow of concatenated frames. +Its design is pretty straightforward, +with the sole objective to allow the decoder to quickly skip +over user-defined data and continue decoding. + +Skippable frames defined in this specification are compatible with [LZ4] ones. + +[LZ4]:http://www.lz4.org + +__`Magic_Number`__ + +4 Bytes, little-endian format. 
+Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. +All 16 values are valid to identify a skippable frame. + +__`Frame_Size`__ + +This is the size, in bytes, of the following `User_Data` +(without including the magic number nor the size field itself). +This field is represented using 4 Bytes, little-endian format, unsigned 32-bits. +This means `User_Data` can’t be bigger than (2^32-1) bytes. + +__`User_Data`__ + +The `User_Data` can be anything. Data will just be skipped by the decoder. + + + General Structure of Zstandard Frame format ------------------------------------------- The structure of a single Zstandard frame is following: @@ -163,9 +199,9 @@ The `Flag_Value` can be converted into `Field_Size`, which is the number of bytes used by `Frame_Content_Size` according to the following table: -|`Flag_Value`| 0 | 1 | 2 | 3 | -| ---------- | --- | --- | --- | --- | -|`Field_Size`| 0-1 | 2 | 4 | 8 | +|`Flag_Value`| 0 | 1 | 2 | 3 | +| ---------- | ------ | --- | --- | --- | +|`Field_Size`| 0 or 1 | 2 | 4 | 8 | When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` : if `Single_Segment_flag` is set, `Field_Size` is 1. @@ -361,40 +397,6 @@ up to `Block_Maximum_Decompressed_Size`, which is the smallest of : - 128 KB -Skippable Frames ----------------- - -| `Magic_Number` | `Frame_Size` | `User_Data` | -|:--------------:|:------------:|:-----------:| -| 4 bytes | 4 bytes | n bytes | - -Skippable frames allow the insertion of user-defined data -into a flow of concatenated frames. -Its design is pretty straightforward, -with the sole objective to allow the decoder to quickly skip -over user-defined data and continue decoding. - -Skippable frames defined in this specification are compatible with [LZ4] ones. - -[LZ4]:http://www.lz4.org - -__`Magic_Number`__ - -4 Bytes, little-endian format. -Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. -All 16 values are valid to identify a skippable frame. - -__`Frame_Size`__ - -This is the size, in bytes, of the following `User_Data` -(without including the magic number nor the size field itself). -This field is represented using 4 Bytes, little-endian format, unsigned 32-bits. -This means `User_Data` can’t be bigger than (2^32-1) bytes. - -__`User_Data`__ - -The `User_Data` can be anything. Data will just be skipped by the decoder. - The format of `Compressed_Block` -------------------------------- @@ -447,9 +449,12 @@ __`Literals_Block_Type`__ This field uses 2 lowest bits of first byte, describing 4 different block types : -| Value | 0 | 1 | 2 | 3 | -| --------------------- | -------------------- | -------------------- | --------------------------- | ----------------------------- | -| `Literals_Block_Type` | `Raw_Literals_Block` | `RLE_Literals_Block` | `Compressed_Literals_Block` | `Repeat_Stats_Literals_Block` | +| `Literals_Block_Type` | Value | +| ----------------------------- | ----- | +| `Raw_Literals_Block` | 0 | +| `RLE_Literals_Block` | 1 | +| `Compressed_Literals_Block` | 2 | +| `Repeat_Stats_Literals_Block` | 3 | - `Raw_Literals_Block` - Literals are stored uncompressed. - `RLE_Literals_Block` - Literals consist of a single byte value repeated N times. @@ -466,37 +471,37 @@ __`Size_Format`__ - For `Compressed_Block`, it requires to decode both `Compressed_Size` and `Regenerated_Size` (the decompressed size). It will also decode the number of streams. -- For `Raw_Block` and `RLE_Block` it's enough to decode `Regenerated_Size`. 
+- For `Raw_Literals_Block` and `RLE_Literals_Block` it's enough to decode `Regenerated_Size`. For values spanning several bytes, convention is little-endian. __`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ : -- Value : x0 : `Regenerated_Size` uses 5 bits (0-31). +- Value x0 : `Regenerated_Size` uses 5 bits (0-31). `Literals_Section_Header` has 1 byte. `Regenerated_Size = Header[0]>>3` -- Value : 01 : `Regenerated_Size` uses 12 bits (0-4095). +- Value 01 : `Regenerated_Size` uses 12 bits (0-4095). `Literals_Section_Header` has 2 bytes. `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4)` -- Value : 11 : `Regenerated_Size` uses 20 bits (0-1048575). +- Value 11 : `Regenerated_Size` uses 20 bits (0-1048575). `Literals_Section_Header` has 3 bytes. `Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + (Header[2]<<12)` -Note : it's allowed to represent a short value (ex : `13`) -using a long format, accepting the reduced compacity. +Note : it's allowed to represent a short value (for example `13`) +using a long format, accepting the increased compressed data size. __`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ : -- Value : 00 : _Single stream_. +- Value 00 : _A single stream_. Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023). `Literals_Section_Header` has 3 bytes. -- Value : 01 : 4 streams. +- Value 01 : 4 streams. Both `Compressed_Size` and `Regenerated_Size` use 10 bits (0-1023). `Literals_Section_Header` has 3 bytes. -- Value : 10 : 4 streams. +- Value 10 : 4 streams. Both `Compressed_Size` and `Regenerated_Size` use 14 bits (0-16383). `Literals_Section_Header` has 4 bytes. -- Value : 11 : 4 streams. +- Value 11 : 4 streams. Both `Compressed_Size` and `Regenerated_Size` use 18 bits (0-262143). `Literals_Section_Header` has 5 bytes. From de9d13068ffc864c9ccc75700715700b27fe520d Mon Sep 17 00:00:00 2001 From: inikep Date: Thu, 25 Aug 2016 14:59:08 +0200 Subject: [PATCH 6/7] updated zstd_compression_format.md --- programs/README.md | 4 +- zstd_compression_format.md | 131 ++++++++++++++++++++----------------- 2 files changed, 72 insertions(+), 63 deletions(-) diff --git a/programs/README.md b/programs/README.md index 0fbb8a35..9bd1e71b 100644 --- a/programs/README.md +++ b/programs/README.md @@ -31,8 +31,8 @@ will rely more and more on previously decoded content to compress the rest of th Usage of the dictionary builder and created dictionaries with CLI: 1. Create the dictionary : `zstd --train FullPathToTrainingSet/* -o dictionaryName` -2. Compress with dictionary: `zstd FILE -D dictionaryName` -3. Decompress with dictionary: `zstd --decompress FILE.zst -D dictionaryName` +2. Compress with the dictionary: `zstd FILE -D dictionaryName` +3. Decompress with the dictionary: `zstd --decompress FILE.zst -D dictionaryName` diff --git a/zstd_compression_format.md b/zstd_compression_format.md index 867a9b03..7143eea3 100644 --- a/zstd_compression_format.md +++ b/zstd_compression_format.md @@ -271,7 +271,7 @@ which can be any value from 1 to 2^64-1 bytes (16 EB). | ----------- | ---------- | ---------- | | Field name | `Exponent` | `Mantissa` | -Maximum distance is given by the following formulae : +Maximum distance is given by the following formulas : ``` windowLog = 10 + Exponent; windowBase = 1 << windowLog; @@ -415,7 +415,7 @@ To decode a compressed block, the following elements are necessary : or all previous blocks when `Single_Segment_flag` is set. - List of "recent offsets" from previous compressed block. 
- Decoding tables of previous compressed block for each symbol type - (literals, litLength, matchLength, offset). + (literals, literals lengths, match lengths, offsets). ### `Literals_Section` @@ -510,7 +510,7 @@ Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian conven #### `Huffman_Tree_Description` -This section is only present when `Literals_Block_Type` type is `Compressed_Block` (`2`). +This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`). Prefix coding represents symbols from an a priori known alphabet by bit sequences (codewords), one codeword for each symbol, @@ -532,9 +532,11 @@ This specification limits maximum code length to 11 bits. ##### Representation All literal values from zero (included) to last present one (excluded) -are represented by `Weight` values, from 0 to `Max_Number_of_Bits`. -Transformation from `Weight` to `Number_of_Bits` follows this formulae : -`Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0` . +are represented by `Weight` with values from `0` to `Max_Number_of_Bits`. +Transformation from `Weight` to `Number_of_Bits` follows this formula : +``` +Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 +``` The last symbol's `Weight` is deduced from previously decoded ones, by completing to the nearest power of 2. This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree. @@ -549,7 +551,10 @@ Let's presume the following Huffman tree must be described : The tree depth is 4, since its smallest element uses 4 bits. Value `5` will not be listed, nor will values above `5`. Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`. -Weight formula is : `Weight = Number_of_Bits ? Max_Number_of_Bits + 1 - Number_of_Bits : 0`. +Weight formula is : +``` +Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0 +``` It gives the following serie of weights : | `Weight` | 4 | 3 | 2 | 0 | 1 | @@ -580,9 +585,9 @@ which tells how to decode the list of weights. - if `headerByte` < 128 : the serie of weights is compressed by FSE. - The length of the FSE-compressed serie is `headerByte` (0-127). + The length of the FSE-compressed serie is equal to `headerByte` (0-127). -##### FSE (Finite State Entropy) compression of Huffman weights +##### Finite State Entropy (FSE) compression of Huffman weights The serie of weights is compressed using FSE compression. It's a single bitstream with 2 interleaved states, @@ -612,9 +617,10 @@ When both states have overflowed the bitstream, end is reached. ##### Conversion from weights to Huffman prefix codes All present symbols shall now have a `Weight` value. -It is possible to transform weights into Number_of_Bits, using this formula : -`Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0` . - +It is possible to transform weights into Number_of_Bits, using this formula: +``` +Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0 +``` Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order. Symbols with a `Weight` of zero are removed. Then, starting from lowest weight, prefix codes are distributed in order. @@ -636,21 +642,21 @@ it gives the following distribution : | prefix codes | N/A | 0000| 0001| 001 | 01 | 1 | -#### Literals bitstreams +#### The content of Huffman-compressed literal stream ##### Bitstreams sizes As seen in a previous paragraph, -there are 2 flavors of Huffman-compressed literals : -single stream, and 4-streams. 
+there are 2 types of Huffman-compressed literals : +a single stream and 4 streams. -4-streams is useful for CPU with multiple execution units and out-of-order operations. +Encoding using 4 streams is useful for CPU with multiple execution units and out-of-order operations. Since each stream can be decoded independently, it's possible to decode them up to 4x faster than a single stream, presuming the CPU has enough parallelism available. For single stream, header provides both the compressed and regenerated size. -For 4-streams though, +For 4 streams though, header only provides compressed and regenerated size of all 4 streams combined. In order to properly decode the 4 streams, it's necessary to know the compressed and regenerated size of each stream. @@ -663,8 +669,10 @@ bitstreams are preceded by 3 unsigned little-endian 16-bits values. Each value represents the compressed size of one stream, in order. The last stream size is deducted from total compressed size and from previously decoded stream sizes : + `stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`. + ##### Bitstreams read and decode Each bitstream must be read _backward_, @@ -706,23 +714,18 @@ When all _sequences_ are decoded, if there is any literal left in the _literal section_, these bytes are added at the end of the block. -The _Sequences_Section_ regroup all symbols required to decode commands. +The `Sequences_Section` regroup all symbols required to decode commands. There are 3 symbol types : literals lengths, offsets and match lengths. They are encoded together, interleaved, in a single _bitstream_. -Each symbol is a _code_ in its own context, -which specifies a baseline and a number of bits to add. -_Codes_ are FSE compressed, -and interleaved with raw additional bits in the same bitstream. - -The Sequences section starts by a header, -followed by optional Probability tables for each symbol type, +The `Sequences_Section` starts by a header, +followed by optional probability tables for each symbol type, followed by the bitstream. | `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream | | -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- | -To decode the Sequence section, it's required to know its size. +To decode the `Sequences_Section`, it's required to know its size. This size is deducted from `blockSize - literalSectionSize`. @@ -753,8 +756,8 @@ This is a single byte, defining the compression mode of each symbol type. The last field, `Reserved`, must be all-zeroes. -`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the compression mode of -literals lengths, offsets and match lengths respectively. +`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of +literals lengths, offsets, and match lengths respectively. They follow the same enumeration : @@ -769,9 +772,14 @@ They follow the same enumeration : A distribution table will be present. It will be described in [next part](#distribution-tables). -#### Symbols decoding +#### The codes for literals lengths, match lengths, and offsets. -##### Literals Length codes +Each symbol is a _code_ in its own context, +which specifies `Baseline` and `Number_of_Bits` to add. +_Codes_ are FSE compressed, +and interleaved with raw additional bits in the same bitstream. + +##### Literals length codes Literals length codes are values ranging from `0` to `35` included. 
They define lengths from 0 to 131071 bytes. @@ -783,20 +791,20 @@ They define lengths from 0 to 131071 bytes. | `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 | +| `Baseline` | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 | | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | | `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | +| `Baseline` | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | | `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | | `Literals_Length_Code` | 32 | 33 | 34 | 35 | | ---------------------- | ---- | ---- | ---- | ---- | -| Baseline | 8192 |16384 |32768 |65536 | +| `Baseline` | 8192 |16384 |32768 |65536 | | `Number_of_Bits` | 13 | 14 | 15 | 16 | -__Default distribution__ +##### Default distribution for literals length codes When `Compression_Mode` is `Predefined_Mode`, a predefined distribution is used for FSE compression. @@ -809,7 +817,7 @@ short literalsLength_defaultDistribution[36] = -1,-1,-1,-1 }; ``` -##### Match Length codes +##### Match length codes Match length codes are values ranging from `0` to `52` included. They define lengths from 3 to 131074 bytes. @@ -821,25 +829,25 @@ They define lengths from 3 to 131074 bytes. | `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 | +| `Baseline` | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 | | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | | `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -| Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 | +| `Baseline` | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 | | `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 | | `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 | | ------------------- | ---- | ---- | ---- | ---- | ---- | -| Baseline | 4098 | 8194 |16486 |32770 |65538 | +| `Baseline` | 4098 | 8194 |16486 |32770 |65538 | | `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 | -__Default distribution__ +##### Default distribution for match length codes When `Compression_Mode` is defined as `Predefined_Mode`, a predefined distribution is used for FSE compression. -Here is its definition. It uses an accuracy of 6 bits (64 states). +Below is its definition. It uses an accuracy of 6 bits (64 states). ``` short matchLengths_defaultDistribution[53] = { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, @@ -858,26 +866,27 @@ For information, at the time of this writing. the reference decoder supports a maximum `N` value of `28` in 64-bits mode. 
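All three code families share the same `Baseline` plus `Number_of_Bits` mechanism. Below is a sketch of that mechanism applied to the literals-length table above, before the offset expansion formula that follows. It is a non-authoritative illustration: `read_extra_bits()` stands in for the backward bitstream reader defined later, and the rows for codes 0-15 (baseline equal to the code value, no extra bits) are assumed from the full specification, since the diff above only shows the rows this patch touches.

```
#include <stdio.h>

/* Literals-length code table, transcribed from the spec tables above.
   Index = Literals_Length_Code (0-35). */
static const unsigned LL_baseline[36] = {
        0,     1,     2,     3,     4,    5,    6,    7,
        8,     9,    10,    11,    12,   13,   14,   15,
       16,    18,    20,    22,    24,   28,   32,   40,
       48,    64,   128,   256,   512, 1024, 2048, 4096,
     8192, 16384, 32768, 65536 };
static const unsigned char LL_extra_bits[36] = {
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  0,  0,  0,  0,  0,  0,
     1, 1, 1, 1, 2, 2, 3, 3, 4, 6,  7,  8,  9, 10, 11, 12,
    13, 14, 15, 16 };

/* Stand-in for the backward bit reader described in the Bitstream
   section; a real decoder reads nbBits from the sequence bitstream. */
static unsigned read_extra_bits(unsigned nbBits) { (void)nbBits; return 0; }

static unsigned decode_literals_length(unsigned llCode)
{
    return LL_baseline[llCode] + read_extra_bits(LL_extra_bits[llCode]);
}

int main(void)
{
    /* Code 25 has baseline 64 and 6 extra bits per the table above,
       so it decodes to a length between 64 and 127. */
    printf("LL code 25 -> length %u (with all extra bits read as 0)\n",
           decode_literals_length(25));
    return 0;
}
```

Match lengths and offsets are decoded the same way, each with its own table.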
An offset code is also the number of additional bits to read, -and can be translated into an `Offset_Value` using the following formulae : +and can be translated into an `Offset_Value` using the following formulas : ``` Offset_Value = (1 << offsetCode) + readNBits(offsetCode); if (Offset_Value > 3) offset = Offset_Value - 3; ``` -It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to 2^(N+1))-4 +It means that maximum `Offset_Value` is `2^(N+1))-1` and it supports back-reference distance up to `2^(N+1))-4` but is limited by [maximum back-reference distance](#window_descriptor). -Offset_Value from 1 to 3 are special : they define "repeat codes", +`Offset_Value` from 1 to 3 are special : they define "repeat codes", which means one of the previous offsets will be repeated. They are sorted in recency order, with 1 meaning the most recent one. See [Repeat offsets](#repeat-offsets) paragraph. -__Default distribution__ + +##### Default distribution for offset codes When `Compression_Mode` is defined as `Predefined_Mode`, a predefined distribution is used for FSE compression. -Here is its definition. It uses an accuracy of 5 bits (32 states), +Below is its definition. It uses an accuracy of 5 bits (32 states), and supports a maximum `N` of 28, allowing offset values up to 536,870,908 . If any sequence in the compressed block requires an offset larger than this, @@ -918,7 +927,7 @@ The bitstream starts by reporting on which scale it operates. Note that maximum `Accuracy_Log` for literal and match lengths is `9`, and for offsets is `8`. Higher values are considered errors. -Then follow each symbol value, from `0` to last present one. +Then follows each symbol value, from `0` to last present one. The number of bits used by each field is variable. It depends on : @@ -947,11 +956,11 @@ It depends on : Symbols probabilities are read one by one, in order. -Probability is obtained from Value decoded by following formulae : +Probability is obtained from Value decoded by following formula : `Proba = value - 1` It means value `0` becomes negative probability `-1`. -`-1` is a special probability, which means `less than 1`. +`-1` is a special probability, which means "less than 1". Its effect on distribution table is described in [next paragraph]. For the purpose of calculating cumulated distribution, it counts as one. @@ -1006,7 +1015,7 @@ typically by a "less than 1" probability symbol. The result is a list of state values. Each state will decode the current symbol. -To get the Number of bits and baseline required for next state, +To get the `Number_of_Bits` and `Baseline` required for next state, it's first necessary to sort all states in their natural order. The lower states will need 1 more bit than higher ones. @@ -1030,11 +1039,11 @@ Numbering starts from higher states using less bits. | width | 32 | 32 | 32 | 16 | 16 | | `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 | | range number | 2 | 4 | 6 | 0 | 1 | -| baseline | 32 | 64 | 96 | 0 | 16 | +| `Baseline` | 32 | 64 | 96 | 0 | 16 | | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | Next state is determined from current state -by reading the required number of bits, and adding the specified baseline. +by reading the required `Number_of_Bits`, and adding the specified `Baseline`. #### Bitstream @@ -1064,16 +1073,16 @@ Reminder : always keep in mind that all values are read _backward_. ##### Decoding a sequence A state gives a code. -A code provides a baseline and number of bits to add. 
+A code provides `Baseline` and `Number_of_Bits` to add. See [Symbol Decoding] section for details on each symbol. -Decoding starts by reading the number of bits required to decode offset. -It then does the same for match length, -and then for literals length. +Decoding starts by reading the `Number_of_Bits` required to decode `Offset`. +It then does the same for `Match_Length`, +and then for `Literals_Length`. -Offset / matchLength / litLength define a sequence. -It starts by inserting the number of literals defined by `litLength`, -then continue by copying `matchLength` bytes from `currentPos - offset`. +`Offset`, `Match_Length`, and `Literals_Length` define a sequence. +It starts by inserting the number of literals defined by `Literals_Length`, +then continue by copying `Match_Length` bytes from `currentPos - Offset`. The next operation is to update states. Using rules pre-calculated in the decoding tables, @@ -1085,7 +1094,7 @@ This operation will be repeated `Number_of_Sequences` times. At the end, the bitstream shall be entirely consumed, otherwise bitstream is considered corrupted. -[Symbol Decoding]:#symbols-decoding +[Symbol Decoding]:#the-codes-for-literals-lengths-match-lengths-and-offsets ##### Repeat offsets @@ -1143,8 +1152,8 @@ _Reserved ranges :_ __`Entropy_Tables`__ : following the same format as a [compressed blocks]. They are stored in following order : - Huffman tables for literals, FSE table for offset, - FSE table for matchLenth, and FSE table for litLength. + Huffman tables for literals, FSE table for offsets, + FSE table for match lengths, and FSE table for literals lengths. It's finally followed by 3 offset values, populating recent offsets, stored in order, 4-bytes little-endian each, for a total of 12 bytes. From 12731a9644e277e049286d7f96565b72b99ecef5 Mon Sep 17 00:00:00 2001 From: inikep Date: Thu, 25 Aug 2016 15:19:37 +0200 Subject: [PATCH 7/7] minor changes in zstd_compression_format.md --- zstd_compression_format.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/zstd_compression_format.md b/zstd_compression_format.md index 7143eea3..3facb321 100644 --- a/zstd_compression_format.md +++ b/zstd_compression_format.md @@ -993,7 +993,7 @@ The table has a size of `tableSize = 1 << Accuracy_Log`. Each cell describes the symbol decoded, and instructions to get the next state. -Symbols are scanned in their natural order for `less than 1` probabilities. +Symbols are scanned in their natural order for "less than 1" probabilities. Symbols with this probability are being attributed a single cell, starting from the end of the table. These symbols define a full state reset, reading `Accuracy_Log` bits. @@ -1098,13 +1098,13 @@ otherwise bitstream is considered corrupted. ##### Repeat offsets -As seen in [Offset Codes], the first 3 values define a repeated offset. -They are sorted in recency order, with 1 meaning "most recent one". +As seen in [Offset Codes], the first 3 values define a repeated offset and we will call them `Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`. +They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one". There is an exception though, when current sequence's literals length is `0`. -In which case, repcodes are "pushed by one", -so 1 becomes 2, 2 becomes 3, -and 3 becomes "offset_1 - 1_byte". 
+In which case, repeated offsets are "pushed by one",
+so `Repeated_Offset1` becomes `Repeated_Offset2`, `Repeated_Offset2` becomes `Repeated_Offset3`,
+and `Repeated_Offset3` becomes `Repeated_Offset1 - 1_byte`.
 
 On first block, offset history is populated by the following values : 1, 4 and 8 (in order).
 
@@ -1119,8 +1119,8 @@ they do not contribute to offset history.
 New offset take the lead in offset history,
 up to its previous place if it was already present.
 
-It means that when repeat offset 1 (most recent) is used, history is unmodified.
-When repeat offset 2 is used, it's swapped with offset 1.
+It means that when `Repeated_Offset1` (most recent) is used, history is unmodified.
+When `Repeated_Offset2` is used, it's swapped with `Repeated_Offset1`.
 
 
 Dictionary format
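As a closing illustration of the repeat-offset rules edited above, a sketch of the history update in C. It is illustrative only; names like `RepHistory` and `resolve_offset` are invented for this example and do not come from the reference decoder.

```
#include <stdio.h>

/* The three recent offsets; on the first block they are 1, 4 and 8. */
typedef struct { unsigned rep[3]; } RepHistory;

/* offsetValue = (1 << offsetCode) + extra bits, as computed earlier.
   Values 1-3 are repeat codes; litLength == 0 triggers the
   "pushed by one" exception described above. */
static unsigned resolve_offset(RepHistory* h, unsigned offsetValue,
                               unsigned litLength)
{
    unsigned idx, offset;
    if (offsetValue > 3) {                 /* a real offset: takes the lead */
        offset = offsetValue - 3;
        h->rep[2] = h->rep[1];
        h->rep[1] = h->rep[0];
        h->rep[0] = offset;
        return offset;
    }
    idx = offsetValue - 1;                 /* 0 = Repeated_Offset1, ... */
    if (litLength == 0) idx++;             /* "pushed by one" exception */
    if (idx == 0) return h->rep[0];        /* most recent: history unchanged */
    /* idx == 3 only occurs via the exception: Repeated_Offset1 - 1 byte */
    offset = (idx == 3) ? h->rep[0] - 1 : h->rep[idx];
    if (idx >= 2) h->rep[2] = h->rep[1];   /* shift down to its previous place */
    h->rep[1] = h->rep[0];
    h->rep[0] = offset;
    return offset;
}

int main(void)
{
    RepHistory h = { { 1, 4, 8 } };        /* first-block initialization */
    printf("repeat code 2 -> offset %u\n", resolve_offset(&h, 2, 5));
    printf("history now: %u %u %u\n", h.rep[0], h.rep[1], h.rep[2]);
    return 0;
}
```

With a non-zero literals length, repeat code 2 resolves to the second most recent offset (4 here) and is swapped to the front, leaving the third entry untouched, which is exactly the behavior the patch spells out.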