glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-12-13 14:50:17 +00:00


Noah Goldstein 9e5693b446 x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`

Current `non_temporal_threshold` is set to roughly `3/4 * sizeof_L3 / ncores_per_socket`. This patch updates that value to roughly `sizeof_L3 / 4`.

The original value (specifically, dividing by `ncores_per_socket`) was chosen to limit the amount of other threads' data a `memcpy`/`memset` could evict.

Dividing by `ncores_per_socket`, however, leads to exceedingly low non-temporal thresholds and to using non-temporal stores in cases where REP MOVSB is multiple times faster.

Furthermore, non-temporal stores are written directly to main memory, so using them at sizes much smaller than L3 can place soon-to-be-accessed data much further away than it otherwise would be. In addition, modern machines are able to detect streaming patterns (especially if REP MOVSB is used) and provide LRU hints to the memory subsystem. This in effect caps the total amount of eviction at 1/cache_associativity, far below meaningfully thrashing the entire cache.

As best I can tell, the benchmarks that led to this small threshold were done comparing non-temporal stores against standard cacheable stores. A better comparison (linked below) is against REP MOVSB which, on the measured systems, is nearly 2x faster than non-temporal stores at the low end of the previous threshold, and within 10% for copies over 100MB (well past even the current threshold). In cases with a low number of threads competing for bandwidth, REP MOVSB is ~2x faster up to `sizeof_L3`.

The divisor of `4` is a somewhat arbitrary value. From benchmarks it seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs such as Broadwell prefer something closer to `8`. This patch is meant to be followed up by another one to make the divisor CPU-specific, but in the meantime (and for easier backporting), this patch settles on `4` as a middle ground.

Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable stores were done using: https://github.com/goldsteinn/memcpy-nt-benchmarks

Sheets results (also available in pdf on the github): https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml

Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit `af992e7abd`)		2023-09-11 22:47:46 -05:00
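The two heuristics described in the commit message can be compared with a small arithmetic sketch. The helper names below are hypothetical and exist only for illustration; they are not glibc functions, and the real tuning code may apply additional rounding or clamping that is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers illustrating the arithmetic from the commit message.
   They are NOT glibc's actual implementation. */

/* Previous heuristic: roughly 3/4 of L3, divided among the cores of a socket. */
static size_t
old_non_temporal_threshold (size_t sizeof_l3, unsigned int ncores_per_socket)
{
  assert (ncores_per_socket > 0);
  return sizeof_l3 * 3 / 4 / ncores_per_socket;
}

/* Heuristic described by this patch: roughly a quarter of L3,
   independent of the core count. */
static size_t
new_non_temporal_threshold (size_t sizeof_l3)
{
  return sizeof_l3 / 4;
}
```

As an assumed example, a 32 MiB L3 shared by 16 cores gives a threshold of 1.5 MiB under the old formula and 8 MiB under the new one, which shows how quickly the old value shrinks as the core count grows.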
aarch64	aarch64: Fix DT_AARCH64_VARIANT_PCS handling [BZ #26798]	2020-11-04 12:23:20 +00:00
alpha	Add libm_alias_finite for _finite symbols	2020-01-03 10:02:04 -03:00
arm	arm: CVE-2020-6096: Fix multiarch memcpy for negative length [BZ #25620]	2020-07-14 15:46:21 -04:00
csky	Update copyright dates with scripts/update-copyrights.	2020-01-01 00:14:33 +00:00
generic	mips: Fix bracktrace result for signal frames	2020-03-30 23:18:26 +02:00
gnu	Update copyright dates with scripts/update-copyrights.	2020-01-01 00:14:33 +00:00
hppa	Add new file missed in previous hppa commit.	2020-05-04 21:56:38 +02:00
htl	htl: Add internal version of __pthread_mutex_timedlock	2020-01-13 20:41:07 +01:00
hurd	Update copyright dates with scripts/update-copyrights.	2020-01-01 00:14:33 +00:00
i386	x86: Check IFUNC definition in unrelocated executable [BZ #20019]	2021-01-13 14:02:11 -08:00
ia64	Linux: Use system call tables during build	2020-01-02 10:18:23 +01:00
ieee754	math/test-sinl-pseudo: Use stack protector only if available	2020-03-11 12:25:15 +01:00
init_array	Update copyright dates with scripts/update-copyrights.	2020-01-01 00:14:33 +00:00
m68k	Add libm_alias_finite for _finite symbols	2020-01-03 10:02:04 -03:00
mach	htl: Add internal version of __pthread_mutex_timedlock	2020-01-13 20:41:07 +01:00
microblaze	Linux: Use system call tables during build	2020-01-02 10:18:23 +01:00
mips	MIPS: Fix circular definition of __LDBL_MANT_DIG__ in ieee754.h	2020-01-08 14:28:23 -05:00
nios2	Update copyright dates with scripts/update-copyrights.	2020-01-01 00:14:33 +00:00
nptl	nptl: Avoid using PTHREAD_MUTEX_DEFAULT in macro definition [BZ #25271]	2020-01-30 15:54:49 +01:00
posix	getcwd: Set errno to ERANGE for size == 1 (CVE-2021-3999)	2022-08-18 00:14:28 +02:00
powerpc	powerpc64: Workaround sigtramp vdso return call	2021-03-08 11:16:27 -03:00
pthread	Update copyright dates with scripts/update-copyrights.	2020-01-01 00:14:33 +00:00
riscv	Add libm_alias_finite for _finite symbols	2020-01-03 10:02:04 -03:00
s390	S390: Also check vector support in memmove ifunc-selector [BZ #27511]	2021-03-26 11:01:51 +01:00
sh	sh: Add sh4 fpu Implies folder	2020-11-27 09:34:02 -03:00
sparc	Add libm_alias_finite for _finite symbols	2020-01-03 10:02:04 -03:00
unix	Linux: Require properly configured /dev/pts for PTYs	2022-08-18 12:28:36 +02:00
wordsize-32	Update copyright dates with scripts/update-copyrights.	2020-01-01 00:14:33 +00:00
wordsize-64	Update copyright dates with scripts/update-copyrights.	2020-01-01 00:14:33 +00:00
x86	x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`	2023-09-11 22:47:46 -05:00
x86_64	x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591]	2022-11-24 17:15:54 -08:00