Merge pull request #2050 from terrelln/align

Align decompress sequences loop to 32+16 bytes
Nick Terrell 2020-03-24 11:42:59 -07:00 committed by GitHub
commit f5029e285f

@@ -974,6 +974,47 @@ ZSTD_decompressSequences_body( ZSTD_DCtx* dctx,
BIT_DStream_endOfBuffer < BIT_DStream_completed &&
BIT_DStream_completed < BIT_DStream_overflow);
#if defined(__GNUC__) && defined(__x86_64__)
/* Align the decompression loop to 32 + 16 bytes.
*
* zstd compiled with gcc-9 on an Intel i9-9900k shows 10% decompression
* speed swings based on the alignment of the decompression loop. This
* performance swing is caused by parts of the decompression loop falling
* out of the DSB. The entire decompression loop should fit in the DSB;
* when it doesn't, performance is much worse. You can measure whether
* you've hit the good case or the bad case with this perf command for
* some compressed file test.zst:
*
* perf stat -e cycles -e instructions -e idq.all_dsb_cycles_any_uops \
* -e idq.all_mite_cycles_any_uops -- ./zstd -tq test.zst
*
* If you see most cycles served out of the MITE you've hit the bad case.
* If you see most cycles served out of the DSB you've hit the good case.
* If it is pretty even then you may be in an okay case.
*
* I've been able to reproduce this issue on the following CPUs:
* - Kabylake: Macbook Pro (15-inch, 2019) 2.4 GHz Intel Core i9
* Use Instruments->Counters to get DSB/MITE cycles.
* I never got performance swings, but I was able to
* go from the good case of mostly DSB to half of the
* cycles served from MITE.
* - Coffeelake: Intel i9-9900k
*
* I haven't been able to reproduce the instability or DSB misses on any
* of the following CPUs:
* - Haswell
* - Broadwell: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
* - Skylake
*
* If you are seeing performance instability this script can help test.
* It tests on 4 commits in zstd where I saw performance change.
*
* https://gist.github.com/terrelln/9889fc06a423fd5ca6e99351564473f4
*/
__asm__(".p2align 5");
__asm__("nop");
__asm__(".p2align 4");
#endif
for ( ; ; ) {
seq_t const sequence = ZSTD_decodeSequence(&seqState, isLongOffset, ZSTD_p_noPrefetch);
size_t const oneSeqSize = ZSTD_execSequence(op, oend, sequence, &litPtr, litEnd, prefixStart, vBase, dictEnd);
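
For readers who want to try the alignment trick outside zstd, here is a minimal sketch of the same three directives placed in front of a small loop. The function `sum_bytes` is hypothetical and is not part of zstd; it only stands in for a hot loop. `.p2align 5` pads to a 32-byte boundary, the one-byte `nop` moves past that boundary, and `.p2align 4` then pads to the next 16-byte boundary, so the following instruction sits at an address that is 16 modulo 32. The compiler is still free to emit instructions between the directives and the loop head, so treat this as a hint and check the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hot loop used only for illustration; not zstd code. */
static uint32_t sum_bytes(const uint8_t* src, size_t len)
{
    uint32_t total = 0;
    size_t i = 0;
#if defined(__GNUC__) && defined(__x86_64__)
    /* Same directives as in the diff above:
     *   .p2align 5  -> pad to a 32-byte boundary
     *   nop         -> one byte past that boundary
     *   .p2align 4  -> pad to the next 16-byte boundary (16 mod 32)
     * The compiler may still place setup code between here and the loop. */
    __asm__(".p2align 5");
    __asm__("nop");
    __asm__(".p2align 4");
#endif
    for ( ; i < len; ++i) {
        total += src[i];
    }
    return total;
}
```

Calling `sum_bytes(buf, n)` returns the sum of the first `n` bytes of `buf`. On compilers or targets that fail the preprocessor guard, the directives are skipped and the function behaves identically, which is the same fallback the diff relies on.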