glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-11-25 14:30:06 +00:00

Author	SHA1	Message	Date
Joe Ramsay	7b12776584	aarch64: Improve special-case handling in AdvSIMD double-precision libmvec routines Avoids emitting many saves/restores of vector registers, reduces the amount of code generated around the scalar fallback.	2023-11-29 15:03:36 +00:00
Joe Ramsay	bd70d3bacf	aarch64: Fix libmvec benchmarks These were broken by the new atan2 functions, as they were only set up for univariate functions. Arity is now detected from the input file - this revealed a mistake that the double-precision inputs were being used for both single- and double-precision routines, which is now remedied.	2023-11-22 09:10:43 +00:00
Adhemerval Zanella	55f41ef8de	elf: Remove LD_PROFILE for static binaries The _dl_non_dynamic_init does not parse LD_PROFILE, which does not enable profile for dlopen objects. Since dlopen is deprecated for static objects, it is better to remove the support. It also allows to trim down libc.a of profile support. Checked on x86_64-linux-gnu. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>	2023-11-21 16:15:42 -03:00
Joe Ramsay	a8830c9285	aarch64: Add vector implementations of expm1 routines May discard sign of 0 - auto tests for -0 and -0x1p-10000 updated accordingly.	2023-11-20 17:53:14 +00:00
Wilco Dijkstra	2f5524cc53	AArch64: Remove Falkor memcpy The latest implementations of memcpy are actually faster than the Falkor implementations [1], so remove the falkor/phecda ifuncs for memcpy and the now unused IS_FALKOR/IS_PHECDA defines. [1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2023-11-13 16:52:50 +00:00
Wilco Dijkstra	3d7090f14b	AArch64: Add memset_zva64 Add a specialized memset for the common ZVA size of 64 to avoid the overhead of reading the ZVA size. Since the code is identical to __memset_falkor, remove the latter. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2023-11-13 16:50:44 +00:00
Wilco Dijkstra	9627ab99b5	AArch64: Cleanup emag memset Cleanup emag memset - merge the memset_base64.S file, remove the unused ZVA code (since it is disabled on emag). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2023-11-13 16:45:47 +00:00
Joe Ramsay	3548a4f087	aarch64: Add vector implementations of log1p routines May discard sign of zero.	2023-11-10 17:07:43 +00:00
Joe Ramsay	b07038c5d3	aarch64: Add vector implementations of atan2 routines	2023-11-10 17:07:43 +00:00
Joe Ramsay	d30c39f80d	aarch64: Add vector implementations of atan routines	2023-11-10 17:07:42 +00:00
Joe Ramsay	b5d23367a8	aarch64: Add vector implementations of acos routines	2023-11-10 17:07:42 +00:00
Joe Ramsay	9bed498418	aarch64: Add vector implementations of asin routines	2023-11-10 17:07:42 +00:00
Wilco Dijkstra	9fd3409842	AArch64: Cleanup ifuncs Cleanup ifuncs. Remove uses of libc_hidden_builtin_def, use ENTRY rather than ENTRY_ALIGN, remove unnecessary defines and conditional compilation. Rename strlen_mte to strlen_generic. Remove rtld-memset. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-11-01 13:41:59 +00:00
Wilco Dijkstra	2bd0017988	AArch64: Add support for MOPS memcpy/memmove/memset Add support for MOPS in cpu_features and INIT_ARCH. Add ifuncs using MOPS for memcpy, memmove and memset (use .inst for now so it works with all binutils versions without needing complex configure and conditional compilation). Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-10-24 13:37:48 +01:00
Joe Ramsay	31aaf6fed9	aarch64: Add vector implementations of exp10 routines Double-precision routines either reuse the exp table (AdvSIMD) or use the SVE FEXPA instruction.	2023-10-23 15:00:45 +01:00
Joe Ramsay	067a34156c	aarch64: Add vector implementations of log10 routines A table is also added, which is shared between AdvSIMD and SVE log10.	2023-10-23 15:00:45 +01:00
Joe Ramsay	a8e3ab3074	aarch64: Add vector implementations of log2 routines A table is also added, which is shared between AdvSIMD and SVE log2.	2023-10-23 15:00:45 +01:00
Joe Ramsay	b39e9db5e3	aarch64: Add vector implementations of exp2 routines Some routines reuse table from v_exp_data.c	2023-10-23 15:00:45 +01:00
Joe Ramsay	f554334c05	aarch64: Add vector implementations of tan routines This includes some utility headers for evaluating polynomials using various schemes.	2023-10-23 15:00:44 +01:00
Joe Ramsay	5a4b6f8e4b	aarch64: Optimise vecmath logs * Transpose table layout for improved memory access * Use half-vector special comparisons for AdvSIMD * Improve register use near special-case branches - Due to the presence of a function call, return value would get mov-d out of x0 in order to facilitate PCS. By moving the final computation after the branch this can be avoided Also change SVE routines to use overloaded intrinsics for readability.	2023-10-05 16:54:16 +01:00
Joe Ramsay	480a0dfe1a	aarch64: Cosmetic change in SVE exp routines Use overloaded intrinsics for readability. Codegen does not change, however while we're bringing the routines up-to-date with recent improvements to other routines in AOR it is worth copying this change over as well.	2023-10-05 16:54:00 +01:00
Joe Ramsay	9180160e08	aarch64: Optimize SVE cos & cosf Saves a mov by ensuring return value does not need to be moved out of the way before special-case branch. Also change to use overloaded intrinsics.	2023-10-05 16:53:38 +01:00
Joe Ramsay	8014d1e832	aarch64: Improve vecmath sin routines * Update ULP comment reflecting a new observed max in [-pi/2, pi/2] * Use the same polynomial in AdvSIMD and SVE, rather than FTRIG instructions * Improve register use near special-case branch Also use overloaded intrinsics for SVE.	2023-10-05 16:53:06 +01:00
Wilco Dijkstra	6b695e5c62	AArch64: Remove -0.0 check from vector sin Remove the unnecessary extra checks for sin (-0.0) from vector sin/sinf, improving performance. Passes regress. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-09-26 13:40:07 +01:00
Siddhesh Poyarekar	c6cb8783b5	configure: Use autoconf 2.71 Bump autoconf requirement to 2.71 to allow regenerating configure on more recent distributions. autoconf 2.71 has been in Fedora since F36 and is the current version in Debian stable (bookworm). It appears to be current in Gentoo as well. All sysdeps configure and preconfigure scripts have also been regenerated; all changes are trivial transformations that do not affect functionality. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2023-07-17 10:08:10 -04:00
Joe Ramsay	4a9392ffc2	aarch64: Add vector implementations of exp routines Optimised implementations for single and double precision, Advanced SIMD and SVE, copied from Arm Optimized Routines. As previously, data tables are used via a barrier to prevent overly aggressive constant inlining. Special-case handlers are marked NOINLINE to avoid incurring the penalty of switching call standards unnecessarily. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-06-30 09:04:26 +01:00
Joe Ramsay	78c01a5cbe	aarch64: Add vector implementations of log routines Optimised implementations for single and double precision, Advanced SIMD and SVE, copied from Arm Optimized Routines. Log lookup table added as HIDDEN symbol to allow it to be shared between AdvSIMD and SVE variants. As previously, data tables are used via a barrier to prevent overly aggressive constant inlining. Special-case handlers are marked NOINLINE to avoid incurring the penalty of switching call standards unnecessarily. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-06-30 09:04:22 +01:00
Joe Ramsay	3bb1af2051	aarch64: Add vector implementations of sin routines Optimised implementations for single and double precision, Advanced SIMD and SVE, copied from Arm Optimized Routines. As previously, data tables are used via a barrier to prevent overly aggressive constant inlining. Special-case handlers are marked NOINLINE to avoid incurring the penalty of switching call standards unnecessarily. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-06-30 09:04:16 +01:00
Joe Ramsay	aed39a3aa3	aarch64: Add vector implementations of cos routines Replace the loop-over-scalar placeholder routines with optimised implementations from Arm Optimized Routines (AOR). Also add some headers containing utilities for aarch64 libmvec routines, and update libm-test-ulps. Data tables for new routines are used via a pointer with a barrier on it, in order to prevent overly aggressive constant inlining in GCC. This allows a single adrp, combined with offset loads, to be used for every constant in the table. Special-case handlers are marked NOINLINE in order to confine the save/restore overhead of switching from vector to normal calling standard. This way we only incur the extra memory access in the exceptional cases. NOINLINE definitions have been moved to math_private.h in order to reduce duplication. AOR exposes a config option, WANT_SIMD_EXCEPT, to enable selective masking (and later fixing up) of invalid lanes, in order to trigger fp exceptions correctly (AdvSIMD only). This is tested and maintained in AOR, however it is configured off at source level here for performance reasons. We keep the WANT_SIMD_EXCEPT blocks in routine sources to greatly simplify the upstreaming process from AOR to glibc. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-06-30 09:04:10 +01:00
Paul Pluzhnikov	2cbeda847b	Fix a few more typos I missed in previous round -- BZ 25337	2023-06-02 23:46:32 +00:00
Paul Pluzhnikov	65cc53fe7c	Fix misspellings in sysdeps/ -- BZ 25337	2023-05-30 23:02:29 +00:00
Szabolcs Nagy	642f1b9b3d	aarch64: More configure checks for libmvec Check assembler and linker support too, not just SVE ACLE in the compiler, since variant PCS requires at least binutils 2.32.1.	2023-05-05 11:34:44 +01:00
Szabolcs Nagy	ee68e9cba4	aarch64: SVE ACLE configure test cleanups Use more idiomatic configure test for better autoconf cache and logs.	2023-05-05 10:28:29 +01:00
Szabolcs Nagy	1a62d7e5c3	aarch64: fix SVE ACLE check for bootstrap glibc builds arm_sve.h depends on stdint.h but that relies on libc headers unless compiled in freestanding mode. Without this change a bootstrap glibc build (that uses a compiler without installed libc headers) failed with checking for availability of SVE ACLE... In file included from [...]/arm_sve.h:28, from conftest.c:1: [...]/stdint.h:9:16: fatal error: stdint.h: No such file or directory 9 \| # include_next <stdint.h> \| ^~~~~~~~~~ compilation terminated. configure: error: mathvec is enabled but compiler does not have SVE ACLE. [...]	2023-05-04 10:19:11 +01:00
Joe Ramsay	cd94326a13	Enable libmvec support for AArch64 This patch enables libmvec on AArch64. The proposed change is mainly implementing build infrastructure to add the new routines to ABI, tests and benchmarks. I have demonstrated how this all fits together by adding implementations for vector cos, in both single and double precision, targeting both Advanced SIMD and SVE. The implementations of the routines themselves are just loops over the scalar routine from libm for now, as we are more concerned with getting the plumbing right at this point. We plan to contribute vector routines from the Arm Optimized Routines repo that are compliant with requirements described in the libmvec wiki. Building libmvec requires minimum GCC 10 for SVE ACLE. To avoid raising the minimum GCC by such a big jump, we allow users to disable libmvec if their compiler is too old. Note that at this point users have to manually call the vector math functions. This seems to be acceptable to some downstream users. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-05-03 12:09:49 +01:00
Szabolcs Nagy	2ce48fbd5a	aarch64: update libm test ulps	2023-02-24 10:55:38 +00:00
Jun Tang	311a7e0256	AArch64: Fix HP_TIMING_DIFF computation [BZ# 29329] Fix the computation to allow for cntfrq_el0 being larger than 1GHz. Assume cntfrq_el0 is a multiple of 1MHz to increase the maximum interval (1024 seconds at 1GHz). Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>	2023-02-22 16:45:59 +00:00
Adhemerval Zanella	a9b3b770f5	string: Remove string_private.h Now that _STRING_ARCH_unaligned is not used anymore. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>	2023-02-17 15:56:54 -03:00
Adhemerval Zanella	22999b2f0f	string: Add libc_hidden_proto for memrchr Although static linker can optimize it to local call, it follows the internal scheme to provide hidden proto and definitions. Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>	2023-02-08 17:13:58 -03:00
Adhemerval Zanella	7ea510127e	string: Add libc_hidden_proto for strchrnul Although static linker can optimize it to local call, it follows the internal scheme to provide hidden proto and definitions. Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>	2023-02-08 17:13:56 -03:00
Wilco Dijkstra	d2d3f3720c	AArch64: Improve SVE memcpy and memmove Improve SVE memcpy by copying 2 vectors if the size is small enough. This improves performance of random memcpy by ~9% on Neoverse V1, and 33-64 byte copies are ~16% faster. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-02-06 16:15:34 +00:00
Wilco Dijkstra	55599d4804	AArch64: Improve strrchr Use shrn for narrowing the mask which simplifies code and speeds up small strings. Unroll the first search loop to improve performance on large strings. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Wilco Dijkstra	ad098893ba	AArch64: Optimize strnlen Optimize strnlen using the shrn instruction and improve the main loop. Small strings are around 10% faster, large strings are 40% faster on modern CPUs. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Wilco Dijkstra	03c8ce5000	AArch64: Optimize strlen Optimize strlen by unrolling the main loop. Large strings are 64% faster on modern CPUs. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Wilco Dijkstra	349e48c01e	AArch64: Optimize strcpy Unroll the main loop. Large strings are around 20% faster on modern CPUs. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Wilco Dijkstra	09ebd8549b	AArch64: Improve strchrnul Unroll the main loop, which improves performance slightly. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Wilco Dijkstra	51541a2297	AArch64: Optimize strchr Simplify calculation of the mask using shrn. Unroll the main loop. Small strings are 20% faster on modern CPUs. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Wilco Dijkstra	1bbb1a2022	AArch64: Improve strlen_asimd Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment. Performance improves slightly as a result. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Wilco Dijkstra	0077624177	AArch64: Optimize memrchr Optimize the main loop - large strings are 43% faster on modern CPUs. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Wilco Dijkstra	ce758d4f06	AArch64: Optimize memchr Optimize the main loop - large strings are 40% faster on modern CPUs. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Joseph Myers	6d7e8eda9b	Update copyright dates with scripts/update-copyrights	2023-01-06 21:14:39 +00:00
Vladislav Khmelevsky	eb4181e9f4	elf: Fix rtld-audit trampoline for aarch64 This patch fixes two problems with audit: 1. The DL_OFFSET_RV_VPCS offset was mixed up with DL_OFFSET_RG_VPCS, resulting in x2 register value nulling in RG structure. 2. We need to preserve the x8 register before function call, but don't have to save its new value and restore it before return. Anyway the final restore was using OFFSET_RV instead of OFFSET_RG value which is wrong (although it doesn't affect anything). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-11-21 10:20:40 -03:00
Florian Weimer	1f34a23288	elf: Introduce <dl-call_tls_init_tp.h> and call_tls_init_tp (bug 29249) This makes it more likely that the compiler can compute the strlen argument in _startup_fatal at compile time, which is required to avoid a dependency on strlen this early during process startup. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2022-11-03 17:28:03 +01:00
Adhemerval Zanella	8d98c7c00f	configure: Use -Wno-ignored-attributes if compiler warns about multiple aliases clang emits a warning when a double alias redirection is used, to warn that the original symbol will be used even when the weak definition is overridden. However, this is a common pattern for weak_alias, where multiple aliases are set to the same symbol. Reviewed-by: Fangrui Song <maskray@google.com>	2022-11-01 09:51:06 -03:00
Szabolcs Nagy	dd31f42e39	aarch64: Don't build wordcopy Use an empty wordcopy.c to avoid building the generic one. It does not seem to be used anywhere. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-10-28 11:14:54 +01:00
Wilco Dijkstra	e6f3fe362f	aarch64: Use memcpy_simd as the default memcpy Since __memcpy_simd is the fastest memcpy on almost all cores, replace the generic memcpy with it. If SVE is available, a SVE memcpy will be used by default (including for Neoverse N2).	2022-10-26 14:16:50 +01:00
Wilco Dijkstra	a8e72913fe	aarch64: Cleanup memset ifunc Cleanup memset ifunc selectors. The A64FX memset relies on a ZVA size of 256, so add an explicit check.	2022-10-26 14:12:55 +01:00
Florian Weimer	58548b9d68	Use PTR_MANGLE and PTR_DEMANGLE unconditionally in C sources In the future, this will result in a compilation failure if the macros are unexpectedly undefined (due to header inclusion ordering or header inclusion missing altogether). Assembler sources are more difficult to convert. In many cases, they are hand-optimized for the mangling and no-mangling variants, which is why they are not converted. sysdeps/s390/s390-32/__longjmp.c and sysdeps/s390/s390-64/__longjmp.c are special: These are C sources, but most of the implementation is in assembler, so the PTR_DEMANGLE macro has to be undefined in some cases, to match the assembler style. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-10-18 17:04:10 +02:00
Florian Weimer	88f4b6929c	Introduce <pointer_guard.h>, extracted from <sysdep.h> This allows us to define a generic no-op version of PTR_MANGLE and PTR_DEMANGLE. In the future, we can use PTR_MANGLE and PTR_DEMANGLE unconditionally in C sources, avoiding an unintended loss of hardening due to missing include files or unlucky header inclusion ordering. In i386 and x86_64, we can avoid a <tls.h> dependency in the C code by using the computed constant from <tcb-offsets.h>. <sysdep.h> no longer includes these definitions, so there is no cyclic dependency anymore when computing the <tcb-offsets.h> constants. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-10-18 17:03:55 +02:00
Adhemerval Zanella	5355f9ca7b	elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support Besides the option being gcc specific, this approach is still fragile and not future proof since we do not know if this will be the only optimization option gcc will add that transforms loops to memset (or any libcall). This patch adds a new header, dl-symbol-redir-ifunc.h, that can be used to redirect the compiler generated libcalls to port the generic memset implementation if required. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2022-10-10 10:32:28 -03:00
Wilco Dijkstra	22f4ab2d20	Use atomic_exchange_release/acquire Rename atomic_exchange_rel/acq to use atomic_exchange_release/acquire since these map to the standard C11 atomic builtins. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-09-26 16:58:08 +01:00
Szabolcs Nagy	06d4381dd8	csu: Change start code license to have link exception The start code can get linked into dynamic linked executables where LGPL would require shipping the source or linkable binaries when the executable is distributed. On some targets the license exception was missing in start.S (which is compiled into crt1.o and Scrt1.o which may end up linked into PDE and PIE binaries). I did not review what other code may end up in executables, just fixed the start.S license inconsistency across targets. Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2022-08-26 09:14:53 +01:00
Wilco Dijkstra	12182ba18d	AArch64: Fix typo in sve configure check (BZ# 29394) Fix a typo in the SVE configure check. This fixes [BZ# 29394].	2022-08-11 17:52:00 +01:00
Jason A. Donenfeld	eaad4f9e8f	arc4random: simplify design for better safety Rather than buffering 16 MiB of entropy in userspace (by way of chacha20), simply call getrandom() every time. This approach is doubtlessly slower, for now, but trying to prematurely optimize arc4random appears to be leading toward all sorts of nasty properties and gotchas. Instead, this patch takes a much more conservative approach. The interface is added as a basic loop wrapper around getrandom(), and then later, the kernel and libc together can work together on optimizing that. This prevents numerous issues in which userspace is unaware of when it really must throw away its buffer, since we avoid buffering all together. Future improvements may include userspace learning more from the kernel about when to do that, which might make these sorts of chacha20-based optimizations more possible. The current heuristic of 16 MiB is meaningless garbage that doesn't correspond to anything the kernel might know about. So for now, let's just do something conservative that we know is correct and won't lead to cryptographic issues for users of this function. This patch might be considered along the lines of, "optimization is the root of all evil," in that the much more complex implementation it replaces moves too fast without considering security implications, whereas the incremental approach done here is a much safer way of going about things. Once this lands, we can take our time in optimizing this properly using new interplay between the kernel and userspace. getrandom(0) is used, since that's the one that ensures the bytes returned are cryptographically secure. But on systems without it, we fallback to using /dev/urandom. This is unfortunate because it means opening a file descriptor, but there's not much of a choice. 
Secondly, as part of the fallback, in order to get more or less the same properties of getrandom(0), we poll on /dev/random, and if the poll succeeds at least once, then we assume the RNG is initialized. This is a rough approximation, as the ancient "non-blocking pool" initialized after the "blocking pool", not before, and it may not port back to all ancient kernels, though it does to all kernels supported by glibc (≥3.2), so generally it's the best approximation we can do. The motivation for including arc4random, in the first place, is to have source-level compatibility with existing code. That means this patch doesn't attempt to litigate the interface itself. It does, however, choose a conservative approach for implementing it. Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Cristian Rodríguez <crrodriguez@opensuse.org> Cc: Paul Eggert <eggert@cs.ucla.edu> Cc: Mark Harris <mark.hsj@gmail.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: linux-crypto@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-07-27 08:58:27 -03:00
Adhemerval Zanella Netto	4c128c7823	aarch64: Add optimized chacha20 It adds vectorized ChaCha20 implementation based on libgcrypt cipher/chacha20-aarch64.S. It is used as default and only little-endian is supported (BE uses generic code). As for generic implementation, the last step that XOR with the input is omitted. The final state register clearing is also omitted. On a virtualized Linux on Apple M1 it shows the following improvements (using formatted bench-arc4random data): GENERIC MB/s ----------------------------------------------- arc4random [single-thread] 380.89 arc4random_buf(16) [single-thread] 500.73 arc4random_buf(32) [single-thread] 552.61 arc4random_buf(48) [single-thread] 566.82 arc4random_buf(64) [single-thread] 574.01 arc4random_buf(80) [single-thread] 581.02 arc4random_buf(96) [single-thread] 591.19 arc4random_buf(112) [single-thread] 592.29 arc4random_buf(128) [single-thread] 596.43 ----------------------------------------------- OPTIMIZED MB/s ----------------------------------------------- arc4random [single-thread] 569.60 arc4random_buf(16) [single-thread] 825.78 arc4random_buf(32) [single-thread] 987.03 arc4random_buf(48) [single-thread] 1042.39 arc4random_buf(64) [single-thread] 1075.50 arc4random_buf(80) [single-thread] 1094.68 arc4random_buf(96) [single-thread] 1130.16 arc4random_buf(112) [single-thread] 1129.58 arc4random_buf(128) [single-thread] 1137.91 ----------------------------------------------- Checked on aarch64-linux-gnu.	2022-07-22 11:58:27 -03:00
Danila Kutenin	3c99806989	aarch64: Optimize string functions with shrn instruction We found that string functions were using AND+ADDP to find the nibble/syndrome mask but there is an easier opportunity through `SHRN dst.8b, src.8h, 4` (shift right every 2 bytes by 4 and narrow to 1 byte) and has same latency on all SIMD ARMv8 targets as ADDP. There are also possible gaps for memcmp but that's for another patch. We see 10-20% savings for small-mid size cases (<=128) which are primary cases for general workloads.	2022-07-06 09:26:20 +01:00
Tejas Belagod	05844d18f7	AArch64: Reset HWCAP2_AFP bits in FPCR for default fenv The AFP feature (Alternate floating-point behavior) was added in armv8.7 and introduced new FPCR bits. Currently, HWCAP2_AFP bits (bit 0, 1, 2) in FPCR are preserved when fenv is set to default environment. This is a deviation from standard behaviour. Clear these bits when setting the fenv to default. There is no libc API to modify the new FPCR bits. Restoring those bits matters if the user changed them directly.	2022-07-05 14:01:17 +01:00
Fangrui Song	e89913d0aa	aarch64: Handle fewer relocations for RTLD_BOOTSTRAP The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only needs to handle RELATIVE, GLOB_DAT, and JUMP_SLOT. TLSDESC/TLS_DTPMOD/TLS_DTPREL handling can be removed. Remove `case AARCH64_R(RELATIVE)` as well as elf_machine_rela has checked it. Tested on aarch64-linux-gnu.	2022-06-15 19:21:53 -07:00
Wilco Dijkstra	fdaf78656f	Add bounds check to __libc_ifunc_impl_list Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC redundant and fixes several targets that will write outside the array. To avoid unnecessary large diffs, pass the maximum in the argument 'i' to IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing ones can be updated if desired. Passes buildmanyglibc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-06-10 17:13:29 +01:00
Wilco Dijkstra	eea282d9c6	AArch64: Sort makefile entries Sort makefile entries to reduce conflicts.	2022-06-07 16:58:15 +01:00
Wilco Dijkstra	9f298bfe1f	AArch64: Add SVE memcpy Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE vectors which improves the random memcpy benchmark significantly. Cleanup the memcpy and memmove ifunc selectors.	2022-06-07 16:58:03 +01:00
Fangrui Song	e555954e02	Revert "[AArch64][BZ #17711 ] Fix extern protected data handling" This reverts commit `0910702c4d`. Say both a.so and b.so define protected data symbol `var` and the executable copy relocates var. ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange semantics: a.so accesses the copy in the executable while b.so accesses its own. This behavior requires that (a) the compiler emits GOT-generating relocations (b) the linker produces GLOB_DAT instead of RELATIVE. Without the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA code, b.so's GLOB_DAT will bind to the executable (normal behavior). For aarch64 it makes sense to restore the original behavior and don't pay the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA cost. The behavior is very unlikely used by anyone. * Clang code generator treats STV_PROTECTED the same way as STV_HIDDEN: no GOT-generating relocation in the first place. * gold and lld reject copy relocation on a STV_PROTECTED symbol. * Nowadays -fpie/-fpic modes are popular. GCC/Clang's codegen uses GOT-generating relocation when accessing an default visibility external symbol which avoids copy relocation. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2022-05-23 13:37:05 -07:00
Szabolcs Nagy	1da064c015	aarch64: Move ld.so _start to separate file and drop _dl_skip_args A separate asm file is easier to maintain than a macro that expands to inline asm. The RTLD_START macro is only needed now because _dl_start is local in rtld.c, but _start has to call it, if _dl_start was made hidden then it could be empty. _dl_skip_args is no longer needed. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-05-17 10:14:03 +01:00
Szabolcs Nagy	86147bbeec	rtld: Remove DL_ARGV_NOT_RELRO and make _dl_skip_args const _dl_skip_args is always 0, so the target specific code that modifies argv after relro protection is applied is no longer used. After the patch relro protection is applied to _dl_argv consistently on all targets. Reviewed-by: Florian Weimer <fweimer@redhat.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-05-17 10:14:03 +01:00
Fangrui Song	098a657fe4	elf: Replace PI_STATIC_AND_HIDDEN with opposite HIDDEN_VAR_NEEDS_DYNAMIC_RELOC PI_STATIC_AND_HIDDEN indicates whether accesses to internal linkage variables and hidden visibility variables in a shared object (ld.so) need dynamic relocations (usually R_*_RELATIVE). PI (position independent) in the macro name is a misnomer: a code sequence using GOT is typically position-independent as well, but using dynamic relocations does not meet the requirement. Not defining PI_STATIC_AND_HIDDEN is legacy and we expect that all new ports will define PI_STATIC_AND_HIDDEN. Current ports defining PI_STATIC_AND_HIDDEN are more than the opposite. Change the configure default. No functional change. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-04-26 09:26:22 -07:00
Ben Woodard	ce9a68c57c	elf: Fix runtime linker auditing on aarch64 (BZ #26643 ) The rtld audit support shows two problems on aarch64: 1. _dl_runtime_resolve does not preserve x8, the indirect result location register, which might generate wrong result calls depending of the function signature. 2. The NEON Q registers pushed onto the stack by _dl_runtime_resolve were twice the size of D registers extracted from the stack frame by _dl_runtime_profile. While 2. might result in wrong information passed on the PLT tracing, 1. generates wrong runtime behaviour. The aarch64 rtld audit support is changed to: * Both La_aarch64_regs and La_aarch64_retval are expanded to include both x8 and the full sized NEON V registers, as defined by the ABI. * dl_runtime_profile needed to extract registers saved by _dl_runtime_resolve and put them into the new correctly sized La_aarch64_regs structure. * The LAV_CURRENT check is changed to only accept new audit modules to avoid the undefined behavior of not saving/restoring x8. * Different than other architectures, audit modules older than LAV_CURRENT are rejected (both La_aarch64_regs and La_aarch64_retval changed their layout and there are no requirements to support multiple audit interface with the inherent aarch64 issues). * A new field is also reserved on both La_aarch64_regs and La_aarch64_retval to support variant pcs symbols. Similar to x86, a new La_aarch64_vector type to represent the NEON register is added on the La_aarch64_regs (so each type can be accessed directly). Since LAV_CURRENT was already bumped to support bind-now, there is no need to increase it again. Checked on aarch64-linux-gnu. Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>	2022-02-01 14:49:46 -03:00
Wilco Dijkstra	e5fa62b8db	AArch64: Check for SVE in ifuncs [BZ #28744] Add a check for SVE in the A64FX ifuncs for memcpy, memset and memmove. This fixes BZ #28744.	2022-01-06 14:36:28 +00:00
Paul Eggert	581c785bf3	Update copyright dates with scripts/update-copyrights I used these shell commands: ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright (cd ../glibc && git commit -am"[this commit message]") and then ignored the output, which consisted of lines saying "FOO: warning: copyright statement not found" for each of 7061 files FOO. I then removed trailing white space from math/tgmath.h, support/tst-support-open-dev-null-range.c, and sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following obscure pre-commit check failure diagnostics from Savannah. I don't know why I run into these diagnostics whereas others evidently do not. remote: * 912-#endif remote: * 913: remote: * 914- remote: * error: lines with trailing whitespace found ... remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines	2022-01-01 11:40:24 -08:00
Adhemerval Zanella	83b8d5027d	malloc: Remove memusage.h And use machine-sp.h instead. The Linux implementation is based on already provided CURRENT_STACK_FRAME (used on nptl code) and STACK_GROWS_UPWARD is replaced with _STACK_GROWS_UP.	2021-12-28 14:57:57 -03:00
Adhemerval Zanella	92ff345137	Remove atomic-machine.h atomic typedefs Now that memusage.c uses generic types we can remove them.	2021-12-28 14:57:57 -03:00
Adhemerval Zanella	8c0664e2b8	elf: Add _dl_audit_pltexit It consolidates the code required to call la_pltexit audit callback. Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu. Reviewed-by: Florian Weimer <fweimer@redhat.com>	2021-12-28 08:40:38 -03:00
Adhemerval Zanella	691d9ae9e6	Remove unused tcb-offset Some architectures do not use the auto-generated tcb-offsets.h.	2021-12-17 17:47:29 -03:00
Adhemerval Zanella	ecb94e9587	aarch64: Add math-use-builtins-f{max,min}.h It allows to remove the arch-specific implementations.	2021-12-13 10:08:46 -03:00
Florian Weimer	627f5ede70	Remove TLS_TCB_ALIGN and TLS_INIT_TCB_ALIGN TLS_INIT_TCB_ALIGN is not actually used. TLS_TCB_ALIGN was likely introduced to support a configuration where the thread pointer has not the same alignment as THREAD_SELF. Only ia64 seems to use that, but for the stack/pointer guard, not for storing tcbhead_t. Some ports use TLS_TCB_OFFSET and TLS_PRE_TCB_SIZE to shift the thread pointer, potentially landing in a different residue class modulo the alignment, but the changes should not impact that. In general, given that TLS variables have their own alignment requirements, having different alignment for the (unshifted) thread pointer and struct pthread would potentially result in dynamic offsets, leading to more complexity. hppa had different values before: __alignof__ (tcbhead_t), which seems to be 4, and __alignof__ (struct pthread), which was 8 (old default) and is now 32. However, it defines THREAD_SELF as: /* Return the thread descriptor for the current thread. */ # define THREAD_SELF \ ({ struct pthread *__self; \ __self = __get_cr27(); \ __self - 1; \ }) So the thread pointer points after struct pthread (hence __self - 1), and they have to have the same alignment on hppa as well. Similarly, on ia64, the definitions were different. We have: # define TLS_PRE_TCB_SIZE \ (sizeof (struct pthread) \ + (PTHREAD_STRUCT_END_PADDING < 2 * sizeof (uintptr_t) \ ? ((2 * sizeof (uintptr_t) + __alignof__ (struct pthread) - 1) \ & ~(__alignof__ (struct pthread) - 1)) \ : 0)) # define THREAD_SELF \ ((struct pthread *) ((char *) __thread_self - TLS_PRE_TCB_SIZE)) And TLS_PRE_TCB_SIZE is a multiple of the struct pthread alignment (confirmed by the new _Static_assert in sysdeps/ia64/libc-tls.c). On m68k, we have a larger gap between tcbhead_t and struct pthread. But as far as I can tell, the port is fine with that. The definition of TCB_OFFSET is sufficient to handle the shifted TCB scenario. 
This fixes commit `23c77f6018` ("nptl: Increase default TCB alignment to 32"). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-09 23:47:49 +01:00
Florian Weimer	ce2248ab91	nptl: Introduce <tcb-access.h> for THREAD_* accessors These are common between most architectures. Only the x86 targets are outliers. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-12-09 09:49:32 +01:00
Florian Weimer	23c77f6018	nptl: Increase default TCB alignment to 32 rseq support will use a 32-byte aligned field in struct pthread, so the whole struct needs to have at least that alignment. nptl/tst-tls3mod.c uses TCB_ALIGNMENT, therefore include <descr.h> to obtain the fallback definition. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-03 20:43:31 +01:00
Wilco Dijkstra	b31bd11454	AArch64: Improve A64FX memcpy v2 is a complete rewrite of the A64FX memcpy. Performance is improved by streamlining the code, aligning all large copies and using a single unrolled loop for all sizes. The code size for memcpy and memmove goes down from 1796 bytes to 868 bytes. Performance is better in all cases: bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33% faster for large sizes, bench-memcpy-walk is 25% faster for small sizes and 20% for the largest sizes. The geomean of all tests in bench-memcpy is 5.1% faster, and total time is reduced by 4%. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-12-02 18:36:03 +00:00
Wilco Dijkstra	b51eb35c57	AArch64: Optimize memcmp Rewrite memcmp to improve performance. On small and medium inputs performance is 10-20% better. Large inputs use a SIMD loop processing 64 bytes per iteration, which is 30-50% faster depending on the size. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-12-02 18:35:53 +00:00
Noah Goldstein	9894127d20	String: Add hidden defs for __memcmpeq() to enable internal usage No bug. This commit adds hidden defs for all declarations of __memcmpeq. This enables usage of __memcmpeq without the PLT for usage internal to GLIBC.	2021-10-26 16:51:29 -05:00
Noah Goldstein	44829b3ddb	String: Add support for __memcmpeq() ABI on all targets No bug. This commit adds support for __memcmpeq() as a new ABI for all targets. In this commit __memcmpeq() is implemented only as an alias to the corresponding target's memcmp() implementation. __memcmpeq() is added as a new symbol starting with GLIBC_2.35 and defined in string.h with comments explaining its behavior. Basic tests that it is callable and works were added in string/tester.c. As discussed in the proposal "Add new ABI '__memcmpeq()' to libc", __memcmpeq() is essentially a reserved namespace for bcmp(). This means it shares the same specifications as memcmp() except the return value for non-equal byte sequences is any non-zero value. This is less strict than memcmp()'s return value specification and can be better optimized when a boolean return is all that is needed. __memcmpeq() is meant to only be called by compilers if they can prove that the return value of a memcmp() call is only used for its boolean value. All tests in string/tester.c passed. The build also succeeds on the x86_64-linux-gnu target.	2021-10-26 16:51:29 -05:00
Adhemerval Zanella	d6d89608ac	elf: Fix dynamic-link.h usage on rtld.c The `4af6982e4c` fix does not fully handle RTLD_BOOTSTRAP usage on rtld.c due to two issues: 1. RTLD_BOOTSTRAP is also used on dl-machine.h on various architectures and it changes the semantics of various machine relocation functions. 2. The elf_get_dynamic_info() change was done sideways, previously to `490e6c62aa` get-dynamic-info.h was included by the first dynamic-link.h include without RTLD_BOOTSTRAP being defined. It means that the code within elf_get_dynamic_info() that uses RTLD_BOOTSTRAP is in fact unused. To fix 1. this patch now includes dynamic-link.h only once with RTLD_BOOTSTRAP defined. The ELF_DYNAMIC_RELOCATE call will now have the relocation functions with the expected semantics for the loader. And to fix 2. part of `4af6982e4c` is reverted (the check argument elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP pieces are removed. To reorganize the includes the static TLS definition is moved to its own header to avoid a circular dependency (it is defined on dynamic-link.h and dl-machine.h requires it at the same time other dynamic-link.h definition requires dl-machine.h definitions). Also ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL are moved to their own header. Only ancient ABIs need special values (arm, i386, and mips), so a generic one is used as default. The powerpc Elf64_FuncDesc is also moved to its own header, since csu code required its definition (which would require either include elf/ folder or add a full path with elf/). Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32, and powerpc64le. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-10-14 14:52:07 -03:00
Fangrui Song	490e6c62aa	elf: Avoid nested functions in the loader [BZ #27220] dynamic-link.h is included more than once in some elf/ files (rtld.c, dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested functions. This harms readability and the nested functions usage is the biggest obstacle preventing a Clang build (Clang doesn't support GCC nested functions). The key idea for unnesting is to add extra parameters (struct link_map and struct r_scope_elm []) to RESOLVE_MAP, ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a], elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired by Stan Shebs' ppc64/x86-64 implementation in the google/grte/v5-2.27/master which uses mixed extra parameters and static variables.) Future simplification: * If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM, elf_machine_runtime_setup can drop the `scope` parameter. * If TLSDESC no longer needs to be in elf_machine_lazy_rel, elf_machine_lazy_rel can drop the `scope` parameter. Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32, sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64. In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu and riscv64-linux-gnu-rv64imafdc-lp64d. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2021-10-07 11:55:02 -07:00
Szabolcs Nagy	c75b106145	aarch64: update libm test ulps Update after commit `6bbf729832`. Fixed inaccuracy of j0f (BZ #28185)	2021-10-05 13:44:27 +01:00
Naohiro Tamura	381b29616a	aarch64: Disable A64FX memcpy/memmove BTI unconditionally This patch disables A64FX memcpy/memmove BTI instruction insertion unconditionally such as A64FX memset patch [1] for performance. [1] commit `07b427296b` Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-09-24 13:26:59 +01:00
Sergey Bugaev	c484da9087	elf: Remove THREAD_GSCOPE_IN_TCB All the ports now have THREAD_GSCOPE_IN_TCB set to 1. Remove all support for !THREAD_GSCOPE_IN_TCB, along with the definition itself. Signed-off-by: Sergey Bugaev <bugaevc@gmail.com> Message-Id: <20210915171110.226187-4-bugaevc@gmail.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>	2021-09-16 01:04:20 +02:00
Naohiro Tamura	1d9f99ce1b	AArch64: Update A64FX memset not to degrade at 16KB This patch updates unroll8 code so as not to degrade at the peak performance 16KB for both FX1000 and FX700. Inserted 2 instructions at the beginning of the unroll8 loop, cmp and branch, are a workaround that is found heuristically. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>	2021-09-06 10:23:24 +01:00
Szabolcs Nagy	f873adf3df	Revert "AArch64: Update A64FX memset not to degrade at 16KB" Because of wrong commit author. Will recommit it with right author. This reverts commit `23777232c2`.	2021-09-06 10:23:25 +01:00
Siddhesh Poyarekar	30891f35fa	Remove "Contributed by" lines We stopped adding "Contributed by" or similar lines in sources in 2012 in favour of git logs and keeping the Contributors section of the glibc manual up to date. Removing these lines makes the license header a bit more consistent across files and also removes the possibility of error in attribution when license blocks or files are copied across since the contributed-by lines don't actually reflect reality in those cases. Move all "Contributed by" and similar lines (Written by, Test by, etc.) into a new file CONTRIBUTED-BY to retain record of these contributions. These contributors are also mentioned in manual/contrib.texi, so we just maintain this additional record as a courtesy to the earlier developers. The following scripts were used to filter a list of files to edit in place and to clean up the CONTRIBUTED-BY file respectively. These were not added to the glibc sources because they're not expected to be of any use in future given that this is a one time task: https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dc https://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02 Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2021-09-03 22:06:44 +05:30
Naohiro Tamura via Libc-alpha	23777232c2	AArch64: Update A64FX memset not to degrade at 16KB This patch updates unroll8 code so as not to degrade at the peak performance 16KB for both FX1000 and FX700. Inserted 2 instructions at the beginning of the unroll8 loop, cmp and branch, are a workaround that is found heuristically. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>	2021-09-03 15:59:46 +01:00
Fangrui Song	710ba420fd	Remove sysdeps/*/tls-macros.h They provide TLS_GD/TLS_LD/TLS_IE/TLS_LE macros for TLS testing. Now that we have migrated to __thread and tls_model attributes, these macros are unused and the tls-macros.h files can retire. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-08-18 09:15:20 -07:00
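
The `44829b3ddb` message above describes the __memcmpeq() contract: it behaves like memcmp(), except that for non-equal byte sequences it may return any non-zero value. The following is a minimal sketch of that contract only, not the glibc implementation (in that commit the symbol is just an alias for memcmp()). The names `sketch_memcmpeq`, `buffers_equal`, and `sketch_selfcheck` are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not glibc code: returns 0 when the first n bytes
   match and some non-zero value otherwise.  Unlike memcmp(), the result
   carries no ordering information.  */
int
sketch_memcmpeq (const void *a, const void *b, size_t n)
{
  const unsigned char *pa = a;
  const unsigned char *pb = b;
  unsigned char diff = 0;

  /* Accumulate the XOR of every byte pair; any mismatch sets a bit.  */
  for (size_t i = 0; i < n; i++)
    diff |= (unsigned char) (pa[i] ^ pb[i]);

  return diff;
}

/* Hypothetical caller that only needs a boolean answer, which is the
   situation the commit message says compilers must prove.  */
int
buffers_equal (const void *a, const void *b, size_t n)
{
  return sketch_memcmpeq (a, b, n) == 0;
}

/* Small usage check; call it from main() in a test program.  */
void
sketch_selfcheck (void)
{
  assert (buffers_equal ("glibc", "glibc", 5));
  assert (!buffers_equal ("glibc", "glibd", 5));
}
```

Because only zero versus non-zero matters, an implementation is free to skip finding the first differing byte and comparing its value, which is the optimization opportunity the commit message refers to.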


476 Commits