b830599582
Summary: The idea behind wildcopy is that it can be cheaper to copy more bytes (say, 8) than it is to copy fewer (say, 3). This change takes that further by exploiting two properties:

1. It's almost always OK to copy 16 bytes instead of 8, which means fewer copy instructions and fewer branches.
2. A 16-byte chunk size means that ~90% of wildcopy invocations will have a trip count of 1, so branch prediction will be improved.

Speedup on Xeon E5-2680v4 is in the range of 3-5%.

Measured wildcopy length distributions on silesia.tar:

level | <=8 | <=16 | <=24 | >24
---|---|---|---|---
1 | 78.05% | 11.49% | 3.52% | 6.94%
3 | 82.14% | 8.99% | 2.44% | 6.43%
6 | 85.81% | 6.51% | 2.92% | 4.76%
8 | 83.02% | 7.31% | 3.64% | 6.03%
10 | 84.13% | 6.67% | 3.29% | 5.91%
15 | 77.58% | 7.55% | 5.21% | 9.66%
16 | 80.07% | 7.20% | 3.98% | 8.75%

Test Plan: benchmark silesia, make check
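The chunked-copy idea in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the names `sketch_copy16`, `sketch_wildcopy16`, and `SKETCH_CHUNK` are hypothetical, and the sketch assumes the caller guarantees non-overlapping buffers with at least 15 bytes of slack past `length` on both the source and the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size for this sketch (the summary's 16 bytes). */
#define SKETCH_CHUNK 16

/* Copy exactly one 16-byte chunk. A fixed-size memcpy is typically
 * lowered by the compiler to one or two wide load/store instructions. */
static void sketch_copy16(void *dst, const void *src)
{
    memcpy(dst, src, SKETCH_CHUNK);
}

/* Copy at least `length` bytes, rounded up to a multiple of 16.
 * ASSUMPTIONS (caller's responsibility in this sketch):
 *   - length > 0
 *   - src and dst do not overlap
 *   - both buffers have room for up to 15 bytes beyond `length`
 * For length <= 16 the loop body runs exactly once (trip count 1). */
static void sketch_wildcopy16(void *dst, const void *src, size_t length)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *const end = d + length;

    assert(length > 0);
    do {
        sketch_copy16(d, s);
        d += SKETCH_CHUNK;
        s += SKETCH_CHUNK;
    } while (d < end);
}
```

With this shape, a request for 3 bytes and a request for 16 bytes take the same single pass through the loop, so the loop-exit branch follows the same path for every length in the first two columns of the table.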
||
---|---|---|
bitstream.h | ||
compiler.h | ||
cpu.h | ||
debug.c | ||
debug.h | ||
entropy_common.c | ||
error_private.c | ||
error_private.h | ||
fse_decompress.c | ||
fse.h | ||
huf.h | ||
mem.h | ||
pool.c | ||
pool.h | ||
threading.c | ||
threading.h | ||
xxhash.c | ||
xxhash.h | ||
zstd_common.c | ||
zstd_errors.h | ||
zstd_internal.h |