glibc/sysdeps
Noah Goldstein 9e5693b446 x86: Increase non_temporal_threshold to roughly sizeof_L3 / 4
The current `non_temporal_threshold` is set to roughly
'3/4 * sizeof_L3 / ncores_per_socket'. This patch updates that value
to roughly 'sizeof_L3 / 4'.
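
As a rough illustration of the two formulas above, here is a minimal
sketch. The function names are hypothetical and are not glibc
identifiers; the real computation in glibc's x86 cache setup code
involves more inputs than shown here.

```c
#include <stddef.h>

/* Hypothetical helper: the previous heuristic, roughly
   3/4 * sizeof_L3 / ncores_per_socket.  */
static size_t
old_non_temporal_threshold (size_t l3_size, size_t ncores_per_socket)
{
  return l3_size * 3 / 4 / ncores_per_socket;
}

/* Hypothetical helper: the new heuristic, roughly sizeof_L3 / 4.  */
static size_t
new_non_temporal_threshold (size_t l3_size)
{
  return l3_size / 4;
}
```

For an assumed 32 MiB L3 shared by 16 cores, the old formula gives
1.5 MiB while the new one gives 8 MiB.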

The original value (specifically dividing by `ncores_per_socket`) was
chosen to limit the amount of other threads' data a `memcpy`/`memset`
could evict.

Dividing by 'ncores_per_socket', however, leads to exceedingly low
non-temporal thresholds, so non-temporal stores end up being used in
cases where REP MOVSB is multiple times faster.

Furthermore, non-temporal stores are written directly to main memory,
so using them at a size much smaller than L3 can place soon-to-be
accessed data much further away than it otherwise could be. In
addition, modern machines are able to detect streaming patterns
(especially if REP MOVSB is used) and provide LRU hints to the memory
subsystem. This in effect caps the total amount of eviction at
1/cache_associativity, far below meaningfully thrashing the entire
cache.

As best I can tell, the benchmarks that led to this small threshold
were done comparing non-temporal stores versus standard cacheable
stores. A better comparison (linked below) is to REP MOVSB which,
on the measured systems, is nearly 2x faster than non-temporal stores
at the low end of the previous threshold, and within 10% for copies
over 100MB (well past even the current threshold). In cases with a
low number of threads competing for bandwidth, REP MOVSB is ~2x faster
up to `sizeof_L3`.

The divisor of `4` is a somewhat arbitrary value. From benchmarks it
seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs
such as Broadwell prefer something closer to `8`. This patch is meant
to be followed up by another one to make the divisor CPU-specific, but
in the meantime (and for easier backporting), this patch settles on
`4` as a middle ground.
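
To make the trade-off concrete, a CPU-specific divisor could be
sketched as follows. This is only an illustration of the idea described
above, not the follow-up patch: the enum and function names are
hypothetical, and the divisors are simply the rough preferences
mentioned in this message.

```c
#include <stddef.h>

/* Hypothetical microarchitecture tags; not glibc identifiers.  */
enum uarch
{
  UARCH_DEFAULT,
  UARCH_BROADWELL,
  UARCH_SKYLAKE,
  UARCH_ICELAKE
};

/* Pick a divisor from the rough preferences noted in the benchmarks:
   2 for Skylake/Icelake, 8 for Broadwell, 4 otherwise.  */
static size_t
non_temporal_divisor (enum uarch u)
{
  switch (u)
    {
    case UARCH_SKYLAKE:
    case UARCH_ICELAKE:
      return 2;
    case UARCH_BROADWELL:
      return 8;
    default:
      return 4;
    }
}

/* Threshold as sizeof_L3 divided by the chosen divisor.  */
static size_t
non_temporal_threshold_for (enum uarch u, size_t l3_size)
{
  return l3_size / non_temporal_divisor (u);
}
```

With an assumed 32 MiB L3, this would give 16 MiB on Skylake or
Icelake, 4 MiB on Broadwell, and 8 MiB with the default divisor that
this patch uses everywhere.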

Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable
stores were done using:
https://github.com/goldsteinn/memcpy-nt-benchmarks

Spreadsheet results (also available as a PDF on GitHub):
https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

(cherry picked from commit af992e7abd)
2023-09-11 22:47:46 -05:00
..
aarch64 aarch64: Fix DT_AARCH64_VARIANT_PCS handling [BZ #26798] 2020-11-04 12:23:20 +00:00
alpha Add libm_alias_finite for _finite symbols 2020-01-03 10:02:04 -03:00
arm arm: CVE-2020-6096: Fix multiarch memcpy for negative length [BZ #25620] 2020-07-14 15:46:21 -04:00
csky Update copyright dates with scripts/update-copyrights. 2020-01-01 00:14:33 +00:00
generic mips: Fix bracktrace result for signal frames 2020-03-30 23:18:26 +02:00
gnu Update copyright dates with scripts/update-copyrights. 2020-01-01 00:14:33 +00:00
hppa Add new file missed in previous hppa commit. 2020-05-04 21:56:38 +02:00
htl htl: Add internal version of __pthread_mutex_timedlock 2020-01-13 20:41:07 +01:00
hurd Update copyright dates with scripts/update-copyrights. 2020-01-01 00:14:33 +00:00
i386 x86: Check IFUNC definition in unrelocated executable [BZ #20019] 2021-01-13 14:02:11 -08:00
ia64 Linux: Use system call tables during build 2020-01-02 10:18:23 +01:00
ieee754 math/test-sinl-pseudo: Use stack protector only if available 2020-03-11 12:25:15 +01:00
init_array Update copyright dates with scripts/update-copyrights. 2020-01-01 00:14:33 +00:00
m68k Add libm_alias_finite for _finite symbols 2020-01-03 10:02:04 -03:00
mach htl: Add internal version of __pthread_mutex_timedlock 2020-01-13 20:41:07 +01:00
microblaze Linux: Use system call tables during build 2020-01-02 10:18:23 +01:00
mips MIPS: Fix circular definition of __LDBL_MANT_DIG__ in ieee754.h 2020-01-08 14:28:23 -05:00
nios2 Update copyright dates with scripts/update-copyrights. 2020-01-01 00:14:33 +00:00
nptl nptl: Avoid using PTHREAD_MUTEX_DEFAULT in macro definition [BZ #25271] 2020-01-30 15:54:49 +01:00
posix getcwd: Set errno to ERANGE for size == 1 (CVE-2021-3999) 2022-08-18 00:14:28 +02:00
powerpc powerpc64: Workaround sigtramp vdso return call 2021-03-08 11:16:27 -03:00
pthread Update copyright dates with scripts/update-copyrights. 2020-01-01 00:14:33 +00:00
riscv Add libm_alias_finite for _finite symbols 2020-01-03 10:02:04 -03:00
s390 S390: Also check vector support in memmove ifunc-selector [BZ #27511] 2021-03-26 11:01:51 +01:00
sh sh: Add sh4 fpu Implies folder 2020-11-27 09:34:02 -03:00
sparc Add libm_alias_finite for _finite symbols 2020-01-03 10:02:04 -03:00
unix Linux: Require properly configured /dev/pts for PTYs 2022-08-18 12:28:36 +02:00
wordsize-32 Update copyright dates with scripts/update-copyrights. 2020-01-01 00:14:33 +00:00
wordsize-64 Update copyright dates with scripts/update-copyrights. 2020-01-01 00:14:33 +00:00
x86 x86: Increase non_temporal_threshold to roughly sizeof_L3 / 4 2023-09-11 22:47:46 -05:00
x86_64 x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591] 2022-11-24 17:15:54 -08:00