Commit Graph

14497 Commits

Stefan Liebler
89b638f48a S390: Enable static PIE
This commit enables static PIE on 64-bit.  On 31-bit, static PIE is
not supported.

A new configure check in sysdeps/s390/s390-64/configure.ac also performs
a minimal test for the requirements in ld.  Make sure you also have the
following patches:
- binutils (ld)
  - "[PR ld/22263] s390: Avoid dynamic TLS relocs in PIE"
    https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=26b1426577b5dcb32d149c64cca3e603b81948a9
    (Tested by configure check above)
    Otherwise there will be an R_390_TLS_TPOFF relocation, which fails to
    be processed in _dl_relocate_static_pie() as the static TLS map is not
    set up.
  - "s390: Add DT_JMPREL pointing to .rela.[i]plt with static-pie"
    https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d942d8db12adf4c9e5c7d9ed6496a779ece7149e
    (We can't test it in configure as we are not able to link a static PIE
    executable if the system glibc lacks static PIE support)
    Otherwise there won't be DT_JMPREL, DT_PLTRELA, DT_PLTRELASZ entries
    and the IFUNC symbols are not processed, which leads to crashes.

- kernel (the commit links below are from the 5.19 merge window):
  - "s390/mmap: increase stack/mmap gap to 128MB"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=f2f47d0ef72c30622e62471903ea19446ea79ee2
  - "s390/vdso: move vdso mapping to its own function"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=57761da4dc5cd60bed2c81ba0edb7495c3c740b8
  - "s390/vdso: map vdso above stack"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=9e37a2e8546f9e48ea76c839116fa5174d14e033
  - "s390/vdso: add vdso randomization"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=41cd81abafdc4e58a93fcb677712a76885e3ca25
  (We can't test the kernel of the target system)
  Otherwise, if /proc/sys/kernel/randomize_va_space is turned off (0),
  static PIE executables like ldconfig will crash.  During startup, sbrk
  is used to enlarge the heap, but the underlying brk syscall fails as
  there is not enough space after the heap.  The address of the TLS
  image is then invalid and the following memcpy in __libc_setup_tls()
  leads to a segfault.
  If /proc/sys/kernel/randomize_va_space is enabled (default: 2), there
  is enough space after the heap.

- glibc
  - "Linux: Define MMAP_CALL_INTERNAL"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=c1b68685d438373efe64e5f076f4215723004dfb
  - "i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=6e5c7a1e262961adb52443ab91bd2c9b72316402
  - "i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=60f0f2130d30cfd008ca39743027f1e200592dff
  - "ia64: Always define IA64_USE_NEW_STUB as a flag macro"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=18bd9c3d3b1b6a9182698c85354578d1d58e9d64
  - "Linux: Implement a useful version of _startup_fatal"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=a2a6bce7d7e52c1c34369a7da62c501cc350bc31
  - "Linux: Introduce __brk_call for invoking the brk system call"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=b57ab258c1140bc45464b4b9908713e3e0ee35aa
  - "csu: Implement and use _dl_early_allocate during static startup"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=f787e138aa0bf677bf74fa2a08595c446292f3d7
  This patch series by Florian Weimer avoids the failing brk call
  described above by falling back to mmap (a sketch of the idea follows).
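
  The idea behind the fallback, as a minimal C sketch with assumed names
  (this is not glibc's actual _dl_early_allocate implementation):

      #include <stddef.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* Hypothetical helper: try to grow the heap with sbrk first; if
         the kernel refuses (no room above the heap), fall back to an
         anonymous mmap.  */
      static void *
      early_allocate (size_t size)
      {
        void *p = sbrk (size);
        if (p != (void *) -1)
          return p;

        p = mmap (NULL, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
      }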

This commit also adjusts the startup code in start.S to be ready for
static PIE.  We have to add a wrapper function for main, as we are not
allowed to use GOT relocations before __libc_start_main is called (a
C-level sketch of the idea follows below).
(Compare also to:
- commit 14d886edbd
  "aarch64: fix start code for static pie"
- commit 3d1d79283e
  "aarch64: fix static pie enabled libc when main is in a shared library"
)
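
A C-level sketch of the wrapper idea, with assumed names (the actual
change is s390 assembly in start.S): the address of a file-local wrapper
can be formed PC-relative, so the startup code can hand it to
__libc_start_main without needing a GOT relocation first.

    /* glibc passes argc/argv/envp to main.  */
    extern int main (int argc, char **argv, char **envp);

    /* Hypothetical wrapper: a local symbol whose address does not need a
       GOT entry; the startup code passes this to __libc_start_main
       instead of taking the address of main directly.  */
    static int
    main_wrapper (int argc, char **argv, char **envp)
    {
      return main (argc, argv, envp);
    }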

(cherry picked from commit 728894dba4)
2022-05-19 17:15:57 +02:00
Florian Weimer
ede8d94d15 csu: Implement and use _dl_early_allocate during static startup
This implements mmap fallback for a brk failure during TLS
allocation.

scripts/tls-elf-edit.py is updated to support the new patching method.
The script no longer requires that the input object be of ET_DYN type.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit f787e138aa)
2022-05-19 12:13:53 +02:00
Florian Weimer
43d77ef9b8 Linux: Introduce __brk_call for invoking the brk system call
Alpha and sparc can now use the generic implementation.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit b57ab258c1)
2022-05-19 12:13:53 +02:00
Florian Weimer
e7ca2a475c Linux: Implement a useful version of _startup_fatal
On i386 and ia64, the TCB is not available at this point.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit a2a6bce7d7)
2022-05-19 12:13:53 +02:00
Florian Weimer
b2387bea84 ia64: Always define IA64_USE_NEW_STUB as a flag macro
And keep the previous definition if it exists.  This allows
disabling IA64_USE_NEW_STUB while keeping USE_DL_SYSINFO defined.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 18bd9c3d3b)
2022-05-19 12:13:53 +02:00
Florian Weimer
b38c9cdb58 Linux: Define MMAP_CALL_INTERNAL
Unlike MMAP_CALL, this avoids a TCB dependency for an errno update
on failure.

<mmap_internal.h> cannot be included as is on several architectures
due to the definition of page_unit, so introduce a separate header
file for the definition of MMAP_CALL and MMAP_CALL_INTERNAL,
<mmap_call.h>.

Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
(cherry picked from commit c1b68685d4)
2022-05-19 12:13:53 +02:00
Florian Weimer
1a5b9d1a23 i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls
Introduce an int-80h-based version of __libc_do_syscall and use
it if I386_USE_SYSENTER is defined as 0.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 60f0f2130d)
2022-05-19 12:13:53 +02:00
Florian Weimer
1e7b011f87 i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S
After commit a78e6a10d0
("i386: Remove broken CAN_USE_REGISTER_ASM_EBP (bug 28771)"),
it is never defined.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6e5c7a1e26)
2022-05-19 12:13:53 +02:00
Fangrui Song
be9240c84c elf: Remove __libc_init_secure
After 73fc4e28b9,
__libc_enable_secure_decided is always 0 and a statically linked
executable may overwrite __libc_enable_secure without considering
AT_SECURE.

__libc_enable_secure has already been correctly initialized in _dl_aux_init,
so just remove __libc_enable_secure_decided and __libc_init_secure.
This allows us to remove some startup_get*id functions from
22b79ed7f4.
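
For reference, AT_SECURE is the auxiliary-vector flag behind
__libc_enable_secure; a small (hypothetical) user-space check would look
like this:

    #include <stdio.h>
    #include <sys/auxv.h>

    int
    main (void)
    {
      /* Non-zero when the kernel marked the process as security
         sensitive (e.g. set-uid/set-gid).  */
      if (getauxval (AT_SECURE))
        puts ("running in secure (AT_SECURE) mode");
      return 0;
    }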

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 3e9acce8c5)
2022-05-19 12:13:52 +02:00
Florian Weimer
ff900fad89 Linux: Consolidate auxiliary vector parsing (redo)
And optimize it slightly.

This is commit 8c8510ab27 revised.

In _dl_aux_init in elf/dl-support.c, use an explicit loop
and -fno-tree-loop-distribute-patterns to avoid memset.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 73fc4e28b9)
2022-05-19 12:13:51 +02:00
Florian Weimer
28bdb03b1b Linux: Include <dl-auxv.h> in dl-sysdep.c only for SHARED
Otherwise, <dl-auxv.h> on POWER ends up being included twice,
once in dl-sysdep.c, once in dl-support.c.  That leads to a linker
failure due to multiple definitions of _dl_cache_line_size.

Fixes commit d96d2995c1
("Revert "Linux: Consolidate auxiliary vector parsing").

(cherry picked from commit 098c795e85)
2022-05-19 11:49:06 +02:00
Florian Weimer
1cc4ddfeeb Revert "Linux: Consolidate auxiliary vector parsing"
This reverts commit 8c8510ab27.  The
revert is not perfect because the commit included a bug fix for
_dl_sysdep_start with an empty argv, introduced in commit
2d47fa6862 ("Linux: Remove
DL_FIND_ARG_COMPONENTS"), and this bug fix is kept.

The revert is necessary because the reverted commit introduced an
early memset call on aarch64, which leads to a crash due to the lack
of TCB initialization.

(cherry picked from commit d96d2995c1)
2022-05-19 11:49:06 +02:00
Florian Weimer
4b9cd5465d Linux: Consolidate auxiliary vector parsing
And optimize it slightly.

The large switch statement in _dl_sysdep_start can be replaced with
a large array.  This reduces source code and binary size.  On
i686-linux-gnu:

Before:

   text	   data	    bss	    dec	    hex	filename
   7791	     12	      0	   7803	   1e7b	elf/dl-sysdep.os

After:

   text	   data	    bss	    dec	    hex	filename
   7135	     12	      0	   7147	   1beb	elf/dl-sysdep.os
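
To illustrate the switch-to-array change, a minimal sketch with
hypothetical names (not glibc's actual _dl_sysdep_start code):

    #include <elf.h>

    #define AUXV_SLOTS 64              /* covers the AT_* tags of interest */
    static unsigned long aux_values[AUXV_SLOTS];

    /* Instead of one switch case per AT_* tag, store every known tag in
       a table indexed by a_type and look the values up later.  */
    static void
    parse_auxv (const Elf64_auxv_t *av)
    {
      for (; av->a_type != AT_NULL; ++av)
        if (av->a_type < AUXV_SLOTS)
          aux_values[av->a_type] = av->a_un.a_val;
    }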

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 8c8510ab27)
2022-05-19 11:49:06 +02:00
Florian Weimer
08728256fa Linux: Assume that NEED_DL_SYSINFO_DSO is always defined
The definition itself is still needed for generic code.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit f19fc997a5)
2022-05-19 11:49:06 +02:00
Florian Weimer
458733fffe Linux: Remove DL_FIND_ARG_COMPONENTS
The generic definition is always used since the Native Client
port has been removed.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 2d47fa6862)
2022-05-19 11:49:06 +02:00
Florian Weimer
2139b1848e Linux: Remove HAVE_AUX_SECURE, HAVE_AUX_XID, HAVE_AUX_PAGESIZE
They are always defined.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit b9c3d3382f)
2022-05-19 11:49:06 +02:00
Florian Weimer
b0bd6a1323 elf: Merge dl-sysdep.c into the Linux version
The generic version is the de-facto Linux implementation.  It
requires an auxiliary vector, so Hurd does not use it.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 91c0a47ffb)
2022-05-19 11:43:53 +02:00
Noah Goldstein
596c9a32cc x86: Optimize {str|wcs}rchr-evex
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.755
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit c966099cdc)
2022-05-16 18:55:54 -07:00
Noah Goldstein
00f09a14d2 x86: Optimize {str|wcs}rchr-avx2
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.832
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit df7e295d18)
2022-05-16 18:55:45 -07:00
Noah Goldstein
0a11305416 x86: Optimize {str|wcs}rchr-sse2
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.741
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 5307aa9c18)
2022-05-16 18:55:37 -07:00
Noah Goldstein
df5de87260 x86: Cleanup page cross code in memcmp-avx2-movbe.S
The old code was both inefficient and wasted code size.  The new code is
smaller (-62 bytes) with comparable or better performance in the page
cross case.

geometric_mean(N=20) of page cross cases New / Original: 0.960

size, align0, align1, ret, New Time/Old Time
   1,   4095,      0,   0,             1.001
   1,   4095,      0,   1,             0.999
   1,   4095,      0,  -1,               1.0
   2,   4094,      0,   0,               1.0
   2,   4094,      0,   1,               1.0
   2,   4094,      0,  -1,               1.0
   3,   4093,      0,   0,               1.0
   3,   4093,      0,   1,               1.0
   3,   4093,      0,  -1,               1.0
   4,   4092,      0,   0,             0.987
   4,   4092,      0,   1,               1.0
   4,   4092,      0,  -1,               1.0
   5,   4091,      0,   0,             0.984
   5,   4091,      0,   1,             1.002
   5,   4091,      0,  -1,             1.005
   6,   4090,      0,   0,             0.993
   6,   4090,      0,   1,             1.001
   6,   4090,      0,  -1,             1.003
   7,   4089,      0,   0,             0.991
   7,   4089,      0,   1,               1.0
   7,   4089,      0,  -1,             1.001
   8,   4088,      0,   0,             0.875
   8,   4088,      0,   1,             0.881
   8,   4088,      0,  -1,             0.888
   9,   4087,      0,   0,             0.872
   9,   4087,      0,   1,             0.879
   9,   4087,      0,  -1,             0.883
  10,   4086,      0,   0,             0.878
  10,   4086,      0,   1,             0.886
  10,   4086,      0,  -1,             0.873
  11,   4085,      0,   0,             0.878
  11,   4085,      0,   1,             0.881
  11,   4085,      0,  -1,             0.879
  12,   4084,      0,   0,             0.873
  12,   4084,      0,   1,             0.889
  12,   4084,      0,  -1,             0.875
  13,   4083,      0,   0,             0.873
  13,   4083,      0,   1,             0.863
  13,   4083,      0,  -1,             0.863
  14,   4082,      0,   0,             0.838
  14,   4082,      0,   1,             0.869
  14,   4082,      0,  -1,             0.877
  15,   4081,      0,   0,             0.841
  15,   4081,      0,   1,             0.869
  15,   4081,      0,  -1,             0.876
  16,   4080,      0,   0,             0.988
  16,   4080,      0,   1,              0.99
  16,   4080,      0,  -1,             0.989
  17,   4079,      0,   0,             0.978
  17,   4079,      0,   1,             0.981
  17,   4079,      0,  -1,              0.98
  18,   4078,      0,   0,             0.981
  18,   4078,      0,   1,              0.98
  18,   4078,      0,  -1,             0.985
  19,   4077,      0,   0,             0.977
  19,   4077,      0,   1,             0.979
  19,   4077,      0,  -1,             0.986
  20,   4076,      0,   0,             0.977
  20,   4076,      0,   1,             0.986
  20,   4076,      0,  -1,             0.984
  21,   4075,      0,   0,             0.977
  21,   4075,      0,   1,             0.983
  21,   4075,      0,  -1,             0.988
  22,   4074,      0,   0,             0.983
  22,   4074,      0,   1,             0.994
  22,   4074,      0,  -1,             0.993
  23,   4073,      0,   0,              0.98
  23,   4073,      0,   1,             0.992
  23,   4073,      0,  -1,             0.995
  24,   4072,      0,   0,             0.989
  24,   4072,      0,   1,             0.989
  24,   4072,      0,  -1,             0.991
  25,   4071,      0,   0,              0.99
  25,   4071,      0,   1,             0.999
  25,   4071,      0,  -1,             0.996
  26,   4070,      0,   0,             0.993
  26,   4070,      0,   1,             0.995
  26,   4070,      0,  -1,             0.998
  27,   4069,      0,   0,             0.993
  27,   4069,      0,   1,             0.999
  27,   4069,      0,  -1,               1.0
  28,   4068,      0,   0,             0.997
  28,   4068,      0,   1,               1.0
  28,   4068,      0,  -1,             0.999
  29,   4067,      0,   0,             0.996
  29,   4067,      0,   1,             0.999
  29,   4067,      0,  -1,             0.999
  30,   4066,      0,   0,             0.991
  30,   4066,      0,   1,             1.001
  30,   4066,      0,  -1,             0.999
  31,   4065,      0,   0,             0.988
  31,   4065,      0,   1,             0.998
  31,   4065,      0,  -1,             0.998
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 23102686ec)
2022-05-16 18:55:24 -07:00
Noah Goldstein
ffe75982cc x86: Remove memcmp-sse4.S
The code didn't actually use any SSE4 instructions since `ptest` was
removed in:

commit 2f9062d717
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Nov 10 16:18:56 2021 -0600

    x86: Shrink memcmp-sse4.S code size

The new memcmp-sse2 implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions from preferring SSE2, for Size = 1 and
Size = 65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional.  Size == 1 is significantly less hot, based on
profiles of GCC11 and Python3, than sizes [4, 8] (which are made
hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in range [65, 96] are now unrolled 2x
and B) smaller values <= 16 are now given a hotter path.  By contrast,
the SSE4 version has a branch for Size = 80.  The unrolled version gets
better performance for returns which need both comparisons.

size, align0, align1, ret, New Time/Old Time
 128,      4,      8,   0,             0.858
 128,      4,      8,   1,             0.879
 128,      4,      8,  -1,             0.888

As well, outside of microbenchmark environments, where branch behavior
is not fully predictable, the branch will have a real cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 7cbc03d030)
2022-05-16 18:55:16 -07:00
Noah Goldstein
4ff6ae069b x86: Small improvements for wcslen
Just a few QOL changes.
    1. Prefer `add` over `lea` as it can run on more execution units.
    2. Don't break macro-fusion between `test` and `jcc`.
    3. Reduce code size by removing gratuitous padding bytes (-90
       bytes).

geometric_mean(N=20) of all benchmarks New / Original: 0.959

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 244b415d38)
2022-05-16 18:55:09 -07:00
Noah Goldstein
80883f4354 x86: Remove AVX str{n}casecmp
The rationale is:

1. SSE42 has nearly identical logic, so any benefit is minimal (3.4%
   regression on Tigerlake using SSE42 versus AVX across the
   benchtest suite).
2. The AVX2 version covers the majority of targets that previously
   preferred it.
3. The targets where AVX would still be best (SnB and IVB) are
   becoming outdated.

All in all, the code size saving is worth it.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 305769b2a1)
2022-05-16 18:55:02 -07:00
Noah Goldstein
b13a2e68eb x86: Add EVEX optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 84e7c46df4)
2022-05-16 18:54:52 -07:00
Noah Goldstein
3051cf3e74 x86: Add AVX2 optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit bbf8122234)
2022-05-16 18:54:41 -07:00
Noah Goldstein
3605c74407 x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S
A slightly faster method of doing TOLOWER that saves an
instruction.
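
The scalar idea behind the TOLOWER trick, as a hedged sketch (the commit
itself changes SSE4.2 assembly; this only shows the underlying
branch-free transformation):

    /* Convert an ASCII code point to lowercase without a branch: the
       unsigned subtraction pushes every non-uppercase byte outside the
       ['A', 'Z'] range, so the comparison yields 0 or 1 and the shift
       turns that into the 0x20 case bit.  */
    static inline unsigned int
    ascii_tolower (unsigned int c)
    {
      return c + (((c - 'A') <= (unsigned int) ('Z' - 'A')) << 5);
    }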

Also replace the hard-coded 5-byte nop with .p2align 4.  On builds with
CET enabled, the hard-coded nop misaligned the entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .920

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit d154758e61)
2022-05-16 18:54:27 -07:00
Noah Goldstein
5997011826 x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S
A slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard-coded 5-byte nop with .p2align 4.  On builds with
CET enabled, the hard-coded nop misaligned the entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .894

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 670b54bc58)
2022-05-16 18:54:17 -07:00
Noah Goldstein
a4b1cae068 x86: Remove strspn-sse2.S and use the generic implementation
The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .710

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 9c8a6ad620)
2022-05-16 18:54:09 -07:00
Noah Goldstein
3811544655 x86: Remove strpbrk-sse2.S and use the generic implementation
The generic implementation is faster (see strcspn commit).

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 6533585352)
2022-05-16 18:53:59 -07:00
Noah Goldstein
0dafa75e3c x86: Remove strcspn-sse2.S and use the generic implementation
The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .678

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit fe28e7d9d9)
2022-05-16 18:53:48 -07:00
Noah Goldstein
0a2da01110 x86: Optimize strspn in strspn-c.c
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.
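
A standalone sketch of the strlen idea with SSE2 intrinsics (an assumed
helper, not the actual code in strspn-c.c): compare 16 aligned bytes
against zero and use the movemask bits to locate the terminator.

    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    static size_t
    sse2_strlen (const char *s)
    {
      const __m128i zero = _mm_setzero_si128 ();
      /* Round down to a 16-byte boundary so loads never cross a page.  */
      const char *p = (const char *) ((uintptr_t) s & ~(uintptr_t) 15);
      unsigned int mask
        = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 ((const __m128i *) p),
                                             zero));
      mask >>= s - p;            /* ignore bytes before the string start */
      if (mask != 0)
        return __builtin_ctz (mask);

      for (;;)
        {
          p += 16;
          mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 ((const __m128i *) p),
                                                    zero));
          if (mask != 0)
            return (size_t) (p - s) + __builtin_ctz (mask);
        }
    }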

geometric_mean(N=20) of all benchmarks that don't fall back on
sse2; New / Original: .901

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 412d103431)
2022-05-16 18:53:39 -07:00
Noah Goldstein
0ae1006967 x86: Optimize strcspn and strpbrk in strcspn-c.c
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.

geometric_mean(N=20) of all benchmarks that don't fall back on
sse2/strlen; New / Original: .928

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 30d627d477)
2022-05-16 18:53:28 -07:00
Noah Goldstein
dd6d3a0bbc x86: Code cleanup in strchr-evex and comment justifying branch
Small code cleanup for size: -81 bytes.

Add comment justifying using a branch to do NULL/non-null return.

All string/memory tests pass and no regressions in benchtests.

geometric_mean(N=20) of all benchmarks New / Original: .985
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit ec285ea904)
2022-05-16 18:53:19 -07:00
Noah Goldstein
3c55c20756 x86: Code cleanup in strchr-avx2 and comment justifying branch
Small code cleanup for size: -53 bytes.

Add comment justifying using a branch to do NULL/non-null return.

All string/memory tests pass and no regressions in benchtests.

geometric_mean(N=20) of all benchmarks Original / New: 1.00
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit a6fbf4d51e)
2022-05-16 18:53:07 -07:00
Adhemerval Zanella
dd457606ca x86_64: Remove bcopy optimizations
The symbol is not present in the current POSIX specification and the
compiler already generates a memmove call.

(cherry picked from commit bf92893a14)
2022-05-16 18:52:57 -07:00
H.J. Lu
37f373e334 x86-64: Remove bzero weak alias in SSE2 memset
commit 3d9f171bfb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Feb 7 05:55:15 2022 -0800

    x86-64: Optimize bzero

added the optimized bzero.  Remove the bzero weak alias in SSE2 memset
to avoid an undefined __bzero in memset-sse2-unaligned-erms.

(cherry picked from commit 0fb8800029)
2022-05-16 18:52:47 -07:00
H.J. Lu
6cba46c858 x86_64/multiarch: Sort sysdep_routines and put one entry per line
(cherry picked from commit c328d0152d)
2022-05-16 18:52:35 -07:00
H.J. Lu
8de6e4a199 x86: Improve L to support L(XXX_SYMBOL (YYY, ZZZ))
(cherry picked from commit 1283948f23)
2022-05-16 18:52:19 -07:00
Noah Goldstein
e123f08ad5 x86: Fix fallback for wcsncmp_avx2 in strcmp-avx2.S [BZ #28896]
Overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm not
__wcscmp_avx2.

commit ddf0992cf5
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Jan 9 16:02:21 2022 -0600

    x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]

That commit set the wrong fallback function for `__wcsncmp_avx2_rtm`: it
was set to fall back to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm`,
which can cause spurious aborts.

This change will need to be backported.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 9fef7039a7)
2022-05-05 09:13:13 -07:00
Noah Goldstein
5373c90f2e x86: Fix bug in strncmp-evex and strncmp-avx2 [BZ #28895]
The logic can read before the start of `s1` / `s2` if both `s1` and `s2`
are near the start of a page.  To avoid having the result contaminated
by these comparisons, the `strcmp` variants mask them off.  This masking
was missing in the `strncmp` variants, causing the bug.  This commit
adds the masking to `strncmp` so that out-of-range comparisons don't
affect the result.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass, as
does a full xcheck on x86_64 Linux.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit e108c02a5e)
2022-05-05 09:11:49 -07:00
Noah Goldstein
70509f9b48 x86: Set .text section in memset-vec-unaligned-erms
commit 3d9f171bfb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Feb 7 05:55:15 2022 -0800

    x86-64: Optimize bzero

That commit removed setting the .text section for the code.  This commit
adds it back.

(cherry picked from commit 7912236f4a)
2022-05-05 09:11:13 -07:00
H.J. Lu
5cb6329652 x86-64: Optimize bzero
Zero is by far the most common value passed to memset (99%+ of calls
for Python3 and GCC).

bzero can be slightly more optimized for this case by using a zero-idiom
xor to broadcast the set value to a register (vector or GPR); a small
illustration follows.
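
A small illustration with SSE2 intrinsics (assumed helper names, not the
code in memset-vec-unaligned-erms.S): a general fill byte has to be
broadcast, while a zero fill collapses to a zero idiom.

    #include <emmintrin.h>

    /* General memset: broadcast the fill byte across the vector.  */
    static __m128i
    broadcast_fill (unsigned char c)
    {
      return _mm_set1_epi8 ((char) c);
    }

    /* bzero case: pxor %xmm, %xmm is a recognized zero idiom, so no
       broadcast work is needed (and it breaks the dependency on the old
       register value).  */
    static __m128i
    broadcast_zero (void)
    {
      return _mm_setzero_si128 ();
    }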

Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 3d9f171bfb)
2022-05-05 09:10:53 -07:00
Noah Goldstein
190ea5f7e4 x86: Remove SSSE3 instruction for broadcast in memset.S (SSE2 Only)
commit b62ace2740
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Feb 6 00:54:18 2022 -0600

    x86: Improve vec generation in memset-vec-unaligned-erms.S

Revert usage of 'pshufb' in broadcast logic as it is an SSSE3
instruction and memset.S is restricted to only SSE2 instructions.

(cherry picked from commit 1b0c60f95b)
2022-05-05 08:54:23 -07:00
Noah Goldstein
ea19c490a3 x86: Improve vec generation in memset-vec-unaligned-erms.S
No bug.

Split vec generation into multiple steps. This allows the
broadcast in AVX2 to use 'xmm' registers for the L(less_vec)
case. This saves an expensive lane-cross instruction and removes
the need for 'vzeroupper'.

For SSE2 replace 2x 'punpck' instructions with zero-idiom 'pxor' for
byte broadcast.

Results for memset-avx2 small (geomean of N = 20 benchset runs).

size, New Time, Old Time, New / Old
   0,    4.100,    3.831,     0.934
   1,    5.074,    4.399,     0.867
   2,    4.433,    4.411,     0.995
   4,    4.487,    4.415,     0.984
   8,    4.454,    4.396,     0.987
  16,    4.502,    4.443,     0.987

All relevant string/wcsmbs tests are passing.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit b62ace2740)
2022-05-05 08:54:11 -07:00
H.J. Lu
53ddafe917 x86-64: Fix strcmp-evex.S
Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit 8418eb3ff4
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jan 10 15:35:39 2022 -0600

    x86: Optimize strcmp-evex.S

(cherry picked from commit 0e0199a9e0)
2022-05-05 08:54:03 -07:00
H.J. Lu
d299032743 x86-64: Fix strcmp-avx2.S
Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit b77b06e0e2
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jan 10 15:35:38 2022 -0600

    x86: Optimize strcmp-avx2.S

(cherry picked from commit c15efd011c)
2022-05-05 08:53:50 -07:00
Noah Goldstein
c41a66767d x86: Optimize strcmp-evex.S
Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards.  This only affects particular
cases, however, and is in general made up for by more than 10x
improvements on the transition from the page cross -> loop case.

The non-page cross cases are nearly universally improved as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 8418eb3ff4)
2022-05-05 08:53:42 -07:00
Noah Goldstein
0d5b36c8cc x86: Optimize strcmp-avx2.S
Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards.  This only affects particular
cases, however, and is in general made up for by more than 10x
improvements on the transition from the page cross -> loop case.

The non-page cross cases are improved most for smaller sizes [0, 128]
and go about even for (128, 4096]. The loop page cross logic is
improved, so a more significant speedup is seen there as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit b77b06e0e2)
2022-05-05 08:53:34 -07:00
Joseph Myers
97cb8227b8 Add HWCAP2_AFP, HWCAP2_RPRES from Linux 5.17 to AArch64 bits/hwcap.h
Add the new HWCAP2_AFP and HWCAP2_RPRES constants from Linux 5.17.
Tested with build-many-glibcs.py for aarch64-linux-gnu.

(cherry picked from commit 866c599182)
2022-05-03 11:08:52 +02:00