glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-12-29 05:51:10 +00:00


Noah Goldstein 5bf0ab8057 x86: Improve large memset perf with non-temporal stores [RHEL-29312]	2024-05-30 12:36:09 -05:00

Previously we used `rep stosb` for all medium/large memsets. This is notably worse than non-temporal stores for large (above a few MBs) memsets.

See: https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing for data using different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster on SKX. Historically, these numbers would not have been so good because of the zero-over-zero writeback optimization that `rep stosb` is able to do. But the zero-over-zero writeback optimization has been removed as a potential side-channel attack, so there is no longer any good reason to rely only on `rep stosb` for large memsets. On the flip side, non-temporal writes can avoid data in their RFO requests, saving memory bandwidth.

All of the other changes to the file are to re-organize the code-blocks to maintain "good" alignment given the new code added in the `L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show that adding the non-temporal case didn't add any regressions to the other cases.

The results on the memset-large benchmark suite on TGL-client for N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers (likely because clients typically have faster single-core bandwidth, so saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
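
The idea in the commit message above, switching to non-temporal stores once a memset is large, can be illustrated with a minimal C sketch. This is not the glibc implementation, which is hand-written assembly: the function name `memset_nt_sketch` and the 4 MiB cutoff `NT_THRESHOLD_SKETCH` are hypothetical, chosen only because the message says "above a few MBs". The sketch uses the SSE2 intrinsics `_mm_stream_si128` and `_mm_sfence`, and falls back to plain `memset` for smaller sizes or when SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical cutoff for this sketch only. The commit message says only
   that non-temporal stores win "above a few MBs". */
#define NT_THRESHOLD_SKETCH ((size_t) 4 * 1024 * 1024)

/* Illustrative memset: non-temporal stores for large sizes, ordinary
   memset otherwise. Not the code added by the commit. */
void *memset_nt_sketch(void *dst, int c, size_t n)
{
    assert(dst != NULL || n == 0);
#if defined(__SSE2__)
    if (n >= NT_THRESHOLD_SKETCH) {
        unsigned char *p = dst;
        unsigned char *end = p + n;

        /* Head: ordinary stores up to the next 16-byte boundary,
           because _mm_stream_si128 requires an aligned address. */
        size_t head = (size_t) (-(uintptr_t) p & 15);
        memset(p, c, head);
        p += head;

        /* Body: 16-byte non-temporal stores that bypass the cache. */
        __m128i v = _mm_set1_epi8((char) c);
        for (; (size_t) (end - p) >= 16; p += 16)
            _mm_stream_si128((__m128i *) p, v);

        /* Non-temporal stores are weakly ordered; fence before
           returning so later stores are ordered after them. */
        _mm_sfence();

        /* Tail: the remaining 0..15 bytes with ordinary stores. */
        memset(p, c, (size_t) (end - p));
        return dst;
    }
#endif
    return memset(dst, c, n);
}
```

A call such as `memset_nt_sketch(buf, 0, len)` behaves like `memset(buf, 0, len)`; only the store strategy for large `len` differs. Whether it is actually faster depends on the CPU and on buffer size, as the benchmark figures in the commit message indicate.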
aarch64	aarch64: Remove duplicate memchr/strlen in libc.a (BZ 31777)	2024-05-23 09:36:08 -03:00
alpha	string: Use builtins for ffs and ffsll	2024-02-01 09:31:33 -03:00
arc	login: Check default sizes of structs utmp, utmpx, lastlog	2024-04-19 14:38:17 +02:00
arm	arm: Update ulps	2024-05-20 13:12:23 -03:00
csky	login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701)	2024-04-19 14:38:17 +02:00
generic	login: Check default sizes of structs utmp, utmpx, lastlog	2024-04-19 14:38:17 +02:00
gnu	login: Use unsigned 32-bit types for seconds-since-epoch	2024-04-19 14:38:17 +02:00
hppa	login: Check default sizes of structs utmp, utmpx, lastlog	2024-04-19 14:38:17 +02:00
htl	htl: Implement some support for TLS_DTV_AT_TP	2024-03-23 23:00:30 +01:00
hurd	hurd: Move internal functions to internal header	2024-03-23 22:43:07 +01:00
i386	math: Fix i386 and m68k exp10 on static build (BZ 31775)	2024-05-21 13:44:22 -03:00
ieee754	sysdeps/ieee754/ldbl-opt/Makefile: Split and sort libnldbl-calls	2024-05-24 10:25:40 -07:00
loongarch	LoongArch: Use "$fcsr0" instead of "$r0" in _FPU_{GET,SET}CW	2024-05-28 09:17:05 +08:00
m68k	math: Fix i386 and m68k exp10 on static build (BZ 31775)	2024-05-21 13:44:22 -03:00
mach	Implement C23 log2p1	2024-05-20 13:41:39 +00:00
microblaze	login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701)	2024-04-19 14:38:17 +02:00
mips	login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701)	2024-04-19 14:38:17 +02:00
nios2	login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701)	2024-04-19 14:38:17 +02:00
nptl	Always define __USE_TIME_BITS64 when 64 bit time_t is used	2024-04-02 15:28:36 -03:00
or1k	or1k: Add hard float support	2024-05-03 18:28:18 +01:00
posix	posix: Sync tempname with gnulib	2024-04-10 14:53:39 -03:00
powerpc	powerpc: Remove duplicated llrintf and llrintf32 from libm.a (BZ 31787)	2024-05-23 09:36:08 -03:00
pthread	Add crt1-2.0.o for glibc 2.0 compatibility tests	2024-05-06 07:49:40 -07:00
riscv	login: Check default sizes of structs utmp, utmpx, lastlog	2024-04-19 14:38:17 +02:00
s390	s390x: Regenerate ULPs.	2024-05-24 09:53:49 +02:00
sh	login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701)	2024-04-19 14:38:17 +02:00
sparc	elf: Only process multiple tunable once (BZ 31686)	2024-05-07 12:16:36 -03:00
unix	parse_fdinfo: Don't advance pointer twice [BZ #31798 ]	2024-05-27 06:52:45 -07:00
wordsize-32	Update copyright dates with scripts/update-copyrights	2024-01-01 10:53:40 -08:00
wordsize-64	Update copyright dates with scripts/update-copyrights	2024-01-01 10:53:40 -08:00
x86	i386: Disable Intel Xeon Phi tests for GCC 15 and above (BZ 31782)	2024-05-27 12:28:13 -07:00
x86_64	x86: Improve large memset perf with non-temporal stores [RHEL-29312]	2024-05-30 12:36:09 -05:00