glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-11-21 12:30:06 +00:00

Author	SHA1	Message	Date
Noah Goldstein	c510681a69	x86/string: Use `movsl` instead of `movsd` in strncpy/strncat [BZ #32344 ] `ld`, starting at 2.40, emits a warning when using `movsd`. There is no change to the actual code produced. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-11-13 10:09:30 -06:00
Noah Goldstein	6754b5becf	x86/string: Use `movsl` instead of `movsd` [BZ #32344 ] `ld`, starting at 2.40, emits a warning when using `movsd`. There is no change to the actual code produced. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-11-08 17:23:05 -06:00
Adhemerval Zanella	6d477b8de8	x86_64: Add exp2m1f with FMA The CORE-MATH exp2m1f implementation showed slight worse latency when using x86_64 baseline ABI. This patch adds a ifunc variant with similar performance for x86_64-v3. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: DJ Delorie <dj@redhat.com>	2024-11-01 11:27:40 -03:00
Adhemerval Zanella	c28f8d7f19	x86_64: Add exp10m1f with FMA The CORE-MATH exp10m1f implementation showed slight worse latency when using x86_64 baseline ABI. This patch adds a ifunc variant with similar performance for x86_64-v3. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: DJ Delorie <dj@redhat.com>	2024-11-01 11:27:40 -03:00
Adhemerval Zanella	f338c7c5f5	math: Use log10p1f from CORE-MATH The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows slight better performance to the generic log10p1f. The code was adapted to glibc style and to use the definition of math_config.h (to handle errno, overflow, and underflow). Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1, gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1): Latency master patched improvement x86_64 68.5251 32.2627 52.92% x86_64v2 68.8912 32.7887 52.41% x86_64v3 59.3427 27.0521 54.41% i686 162.026 103.383 36.19% aarch64 26.8513 14.5695 45.74% power10 12.7426 8.4929 33.35% powerpc 16.6768 9.29135 44.29% reciprocal-throughput master patched improvement x86_64 26.0969 12.4023 52.48% x86_64v2 25.0045 11.0748 55.71% x86_64v3 20.5610 10.2995 49.91% i686 89.8842 78.5211 12.64% aarch64 17.1200 9.4832 44.61% power10 6.7814 6.4258 5.24% powerpc 15.769 7.6825 51.28% Signed-off-by: Alexei Sibidanov <sibid@uvic.ca> Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: DJ Delorie <dj@redhat.com>	2024-11-01 11:27:40 -03:00
Adhemerval Zanella	8ae9e51376	math: Use log1pf from CORE-MATH The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows slight better performance to the generic log1pf. The code was adapted to glibc style and to use the definition of math_config.h (to handle errno, overflow, and underflow). Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1, gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1): Latency master patched improvement x86_64 71.8142 38.9668 45.74% x86_64v2 71.9094 39.1321 45.58% x86_64v3 60.1000 32.4016 46.09% i686 147.105 104.258 29.13% aarch64 26.4439 14.0050 47.04% power10 19.4874 9.4146 51.69% powerpc 17.6145 8.00736 54.54% reciprocal-throughput master patched improvement x86_64 19.7604 12.7254 35.60% x86_64v2 19.0039 11.9455 37.14% x86_64v3 16.8559 11.9317 29.21% i686 82.3426 73.9718 10.17% aarch64 14.4665 7.9614 44.97% power10 11.9974 8.4117 29.89% powerpc 7.15222 6.0914 14.83% Signed-off-by: Alexei Sibidanov <sibid@uvic.ca> Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: DJ Delorie <dj@redhat.com>	2024-11-01 11:27:39 -03:00
Adhemerval Zanella	c369580814	math: Use log2p1f from CORE-MATH The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance compared to the generic log2p1f. The code was adapted to glibc style and to use the definition of math_config.h (to handle errno, overflow, and underflow). Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1): Latency master patched improvement x86_64 70.1462 47.0090 32.98% x86_64v2 70.2513 47.6160 32.22% x86_64v3 60.4840 39.9443 33.96% i686 164.068 122.909 25.09% aarch64 25.9169 16.9207 34.71% power10 18.1261 9.8592 45.61% powerpc 17.2683 9.38665 45.64% reciprocal-throughput master patched improvement x86_64 26.2240 16.4082 37.43% x86_64v2 25.0911 15.7480 37.24% x86_64v3 20.9371 11.7264 43.99% i686 90.4209 95.3073 -5.40% aarch64 16.8537 8.9561 46.86% power10 12.9401 6.5555 49.34% powerpc 9.01763 7.54745 16.30% The performance decrease for i686 is mostly due the use of x87 fpu, when building with '-msse2 -mfpmath=sse: master patched improvement latency 164.068 102.982 37.23% reciprocal-throughput 89.1968 82.5117 7.49% Signed-off-by: Alexei Sibidanov <sibid@uvic.ca> Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: DJ Delorie <dj@redhat.com>	2024-11-01 11:27:39 -03:00
Adhemerval Zanella	bbd578b38d	math: Use expm1f from CORE-MATH The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance compared to the generic expm1f. The code was adapted to glibc style and to use the definition of math_config.h (to handle errno, overflow, and underflow). Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1): Latency master patched improvement x86_64 96.7402 36.4026 62.37% x86_64v2 97.5391 33.4625 65.69% x86_64v3 82.1778 30.8668 62.44% i686 120.58 94.8302 21.35% aarch64 32.3558 12.8881 60.17% power10 23.5087 9.8574 58.07% powerpc 23.4776 9.06325 61.40% reciprocal-throughput master patched improvement x86_64 27.8224 15.9255 42.76% x86_64v2 27.8364 9.6438 65.36% x86_64v3 20.3227 9.6146 52.69% i686 63.5629 59.4718 6.44% aarch64 17.4838 7.1082 59.34% power10 12.4644 8.7829 29.54% powerpc 14.2152 5.94765 58.16% Signed-off-by: Alexei Sibidanov <sibid@uvic.ca> Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: DJ Delorie <dj@redhat.com>	2024-11-01 11:27:35 -03:00
Adhemerval Zanella	5c22fd25c1	math: Use exp2m1f from CORE-MATH The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance compared to the generic exp2m1f. The code was adapted to glibc style and to use the definition of math_config.h (to handle errno, overflow, and underflow). The only change is to handle FLT_MAX_EXP for FE_DOWNWARD or FE_TOWARDZERO. The benchmark inputs are based on exp2f ones. Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1): Latency master patched improvement x86_64 40.6042 48.7104 -19.96% x86_64v2 40.7506 35.9032 11.90% x86_64v3 35.2301 31.7956 9.75% i686 102.094 94.6657 7.28% aarch64 18.2704 15.1387 17.14% power10 11.9444 8.2402 31.01% reciprocal-throughput master patched improvement x86_64 20.8683 16.1428 22.64% x86_64v2 19.5076 10.4474 46.44% x86_64v3 19.2106 10.4014 45.86% i686 56.4054 59.3004 -5.13% aarch64 12.0781 7.3953 38.77% power10 6.5306 5.9388 9.06% The generic implementation calls __ieee754_exp2f and x86_64 provides an optimized ifunc version (built with -mfma -mavx2, not correctly rounded). This explains the performance difference for x86_64. Same for i686, where the ABI provides an optimized __ieee754_exp2f version built with '-msse2 -mfpmath=sse'. When built wth same flags, the new algorithm shows a better performance: master patched improvement latency 102.094 91.2823 10.59% reciprocal-throughput 56.4054 52.7984 6.39% Signed-off-by: Alexei Sibidanov <sibid@uvic.ca> Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: DJ Delorie <dj@redhat.com>	2024-11-01 11:27:35 -03:00
Adhemerval Zanella	5fa89852fa	math: Use exp10m1f from CORE-MATH The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance compared to the generic exp10m1f. The code was adapted to glibc style and to use the definition of math_config.h (to handle errno, overflow, and underflow). I mostly fixed some small issues in corner cases (sNaN handling, -INFINITY, a specific overflow check). Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1): Latency master patched improvement x86_64 45.4690 49.5845 -9.05% x86_64v2 46.1604 36.2665 21.43% x86_64v3 37.8442 31.0359 17.99% i686 121.367 93.0079 23.37% aarch64 21.1126 15.0165 28.87% power10 12.7426 8.4929 33.35% reciprocal-throughput master patched improvement x86_64 19.6005 17.4005 11.22% x86_64v2 19.6008 11.1977 42.87% x86_64v3 17.5427 10.2898 41.34% i686 59.4215 60.9675 -2.60% aarch64 13.9814 7.9173 43.37% power10 6.7814 6.4258 5.24% The generic implementation calls __ieee754_exp10f which has an optimized version, although it is not correctly rounded, which is the main culprit of the the latency difference for x86_64 and throughp for i686. Signed-off-by: Alexei Sibidanov <sibid@uvic.ca> Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: DJ Delorie <dj@redhat.com>	2024-11-01 11:27:26 -03:00
Paul Zimmermann	392b3f0971	replace tgammaf by the CORE-MATH implementation The CORE-MATH implementation is correctly rounded (for any rounding mode). This can be checked by exhaustive tests in a few minutes since there are less than 2^32 values to check against for example GNU MPFR. This patch also adds some bench values for tgammaf. Tested on x86_64 and x86 (cfarm26). With the initial GNU libc code it gave on an Intel(R) Core(TM) i7-8700: "tgammaf": { "": { "duration": 3.50188e+09, "iterations": 2e+07, "max": 602.891, "min": 65.1415, "mean": 175.094 } } With the new code: "tgammaf": { "": { "duration": 3.30825e+09, "iterations": 5e+07, "max": 211.592, "min": 32.0325, "mean": 66.1649 } } With the initial GNU libc code it gave on cfarm26 (i686): "tgammaf": { "": { "duration": 3.70505e+09, "iterations": 6e+06, "max": 2420.23, "min": 243.154, "mean": 617.509 } } With the new code: "tgammaf": { "": { "duration": 3.24497e+09, "iterations": 1.8e+07, "max": 1238.15, "min": 101.155, "mean": 180.276 } } Signed-off-by: Alexei Sibidanov <sibid@uvic.ca> Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> Changes in v2: - include <math.h> (fix the linknamespace failures) - restored original benchtests/strcoll-inputs/filelist#en_US.UTF-8 file - restored original wrapper code (math/w_tgammaf_compat.c), except for the dealing with the sign - removed the tgammaf/float entries in all libm-test-ulps files - address other comments from Joseph Myers (https://sourceware.org/pipermail/libc-alpha/2024-July/158736.html) Changes in v3: - pass NULL argument for signgam from w_tgammaf_compat.c - use of math_narrow_eval - added more comments Changes in v4: - initialize local_signgam to 0 in math/w_tgamma_template.c - replace sysdeps/ieee754/dbl-64/gamma_productf.c by dummy file Changes in v5: - do not mention local_signgam any more in math/w_tgammaf_compat.c - initialize local_signgam to 1 instead of 0 in w_tgamma_template.c and added comment Changes in v6: - pass NULL as 2nd argument of __ieee754_gammaf_r in w_tgammaf_compat.c, and check for NULL in e_gammaf_r.c Changes in v7: - added Signed-off-by line for Alexei Sibidanov (author of the code) Changes in v8: - added Signed-off-by line for Paul Zimmermann (submitted of the patch) Changes in v9: - address comments from review by Adhemerval Zanella Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2024-10-11 11:12:32 +02:00
Carlos O'Donell	cae9944a6c	Fix whitespace related license issues. Several copies of the licenses in files contained whitespace related problems. Two cases are addressed here, the first is two spaces after a period which appears between "PURPOSE." and "See". The other is a space after the last forward slash in the URL. Both issues are corrected and the licenses now match the official textual description of the license (and the other license in the sources). Since these whitespaces changes do not alter the paragraph structure of the license, nor create new sentences, they do not change the license.	2024-10-07 18:08:16 -04:00
Noah Goldstein	483443d321	x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212 ] The loop should be aligned to 32-bytes so that it can ideally run out the DSB. This is particularly important on Skylake-Server where deficiencies in it's DSB implementation make it prone to not being able to run loops out of the DSB. For example running strcmp-evex on 200Mb string: 32-byte aligned loop: - 43,399,578,766 idq.dsb_uops not 32-byte aligned loop: - 6,060,139,704 idq.dsb_uops This results in a 25% performance degradation for the non-aligned version. The fix is to just ensure the code layout is such that the loop is aligned. (Which was previously the case but was accidentally dropped in `84e7c46df`). NB: The fix was actually 64-byte alignment. This is because 64-byte alignment generally produces more stable performance than 32-byte aligned code (cache line crosses can affect perf), so if we are going past 16-byte alignmnent, might as well go to 64. 64-byte alignment also matches most other functions we over-align, so it creates a common point of optimization. Times are reported as ratio of Time_With_Patch / Time_Without_Patch. Lower is better. The values being reported is the geometric mean of the ratio across all tests in bench-strcmp and bench-strncmp. Note this patch is only attempting to improve the Skylake-Server strcmp for long strings. The rest of the numbers are only to test for regressions. Tigerlake Results Strings <= 512: strcmp : 1.026 strncmp: 0.949 Tigerlake Results Strings > 512: strcmp : 0.994 strncmp: 0.998 Skylake-Server Results Strings <= 512: strcmp : 0.945 strncmp: 0.943 Skylake-Server Results Strings > 512: strcmp : 0.778 strncmp: 1.000 The 2.6% regression on TGL-strcmp is due to slowdowns caused by changes in alignment of code handling small sizes (most on the page-cross logic). These should be safe to ignore because 1) We previously only 16-byte aligned the function so this behavior is not new and was essentially up to chance before this patch and 2) this type of alignment related regression on small sizes really only comes up in tight micro-benchmark loops and is unlikely to have any affect on realworld performance. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-09-30 07:40:40 -07:00
Adhemerval Zanella	89b53077d2	nptl: Fix Race conditions in pthread cancellation [BZ#12683] The current racy approach is to enable asynchronous cancellation before making the syscall and restore the previous cancellation type once the syscall returns, and check if cancellation has happen during the cancellation entrypoint. As described in BZ#12683, this approach shows 2 problems: 1. Cancellation can act after the syscall has returned from the kernel, but before userspace saves the return value. It might result in a resource leak if the syscall allocated a resource or a side effect (partial read/write), and there is no way to program handle it with cancellation handlers. 2. If a signal is handled while the thread is blocked at a cancellable syscall, the entire signal handler runs with asynchronous cancellation enabled. This can lead to issues if the signal handler call functions which are async-signal-safe but not async-cancel-safe. For the cancellation to work correctly, there are 5 points at which the cancellation signal could arrive: [ ... )[ ... )[ syscall ]( ... 1 2 3 4 5 1. Before initial testcancel, e.g. [... testcancel) 2. Between testcancel and syscall start, e.g. [testcancel...syscall start) 3. While syscall is blocked and no side effects have yet taken place, e.g. [ syscall ] 4. Same as 3 but with side-effects having occurred (e.g. a partial read or write). 5. After syscall end e.g. (syscall end...] And libc wants to act on cancellation in cases 1, 2, and 3 but not in cases 4 or 5. For the 4 and 5 cases, the cancellation will eventually happen in the next cancellable entrypoint without any further external event. The proposed solution for each case is: 1. Do a conditional branch based on whether the thread has received a cancellation request; 2. It can be caught by the signal handler determining that the saved program counter (from the ucontext_t) is in some address range beginning just before the "testcancel" and ending with the syscall instruction. 3. SIGCANCEL can be caught by the signal handler and determine that the saved program counter (from the ucontext_t) is in the address range beginning just before "testcancel" and ending with the first uninterruptable (via a signal) syscall instruction that enters the kernel. 4. In this case, except for certain syscalls that ALWAYS fail with EINTR even for non-interrupting signals, the kernel will reset the program counter to point at the syscall instruction during signal handling, so that the syscall is restarted when the signal handler returns. So, from the signal handler's standpoint, this looks the same as case 2, and thus it's taken care of. 5. For syscalls with side-effects, the kernel cannot restart the syscall; when it's interrupted by a signal, the kernel must cause the syscall to return with whatever partial result is obtained (e.g. partial read or write). 6. The saved program counter points just after the syscall instruction, so the signal handler won't act on cancellation. This is similar to 4. since the program counter is past the syscall instruction. So The proposed fixes are: 1. Remove the enable_asynccancel/disable_asynccancel function usage in cancellable syscall definition and instead make them call a common symbol that will check if cancellation is enabled (__syscall_cancel at nptl/cancellation.c), call the arch-specific cancellable entry-point (__syscall_cancel_arch), and cancel the thread when required. 2. Provide an arch-specific generic system call wrapper function that contains global markers. These markers will be used in SIGCANCEL signal handler to check if the interruption has been called in a valid syscall and if the syscalls has side-effects. A reference implementation sysdeps/unix/sysv/linux/syscall_cancel.c is provided. However, the markers may not be set on correct expected places depending on how INTERNAL_SYSCALL_NCS is implemented by the architecture. It is expected that all architectures add an arch-specific implementation. 3. Rewrite SIGCANCEL asynchronous handler to check for both canceling type and if current IP from signal handler falls between the global markers and act accordingly. 4. Adjust libc code to replace LIBC_CANCEL_ASYNC/LIBC_CANCEL_RESET to use the appropriate cancelable syscalls. 5. Adjust 'lowlevellock-futex.h' arch-specific implementations to provide cancelable futex calls. Some architectures require specific support on syscall handling: * On i386 the syscall cancel bridge needs to use the old int80 instruction because the optimized vDSO symbol the resulting PC value for an interrupted syscall points to an address outside the expected markers in __syscall_cancel_arch. It has been discussed in LKML [1] on how kernel could help userland to accomplish it, but afaik discussion has stalled. Also, sysenter should not be used directly by libc since its calling convention is set by the kernel depending of the underlying x86 chip (check kernel commit 30bfa7b3488bfb1bb75c9f50a5fcac1832970c60). * mips o32 is the only kABI that requires 7 argument syscall, and to avoid add a requirement on all architectures to support it, mips support is added with extra internal defines. Checked on aarch64-linux-gnu, arm-linux-gnueabihf, powerpc-linux-gnu, powerpc64-linux-gnu, powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu. [1] https://lkml.org/lkml/2016/3/8/1105 Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2024-08-23 14:27:43 -03:00
Matthew Sterrett	294a892769	x86: Unifies 'strnlen-evex' and 'strnlen-evex512' implementations. This commit uses a common implementation 'strnlen-evex-base.S' for both 'strnlen-evex' and 'strnlen-evex512' This patch serves both to reduce the number of implementations, and it also does some small optimizations that benefit strnlen-evex and strnlen-evex512. All tests pass on x86. Benchmarks were taken on SKX. https://www.intel.com/content/www/us/en/products/sku/123613/intel-core-i97900x-xseries-processor-13-75m-cache-up-to-4-30-ghz/specifications.html Geometric mean for strnlen-evex over all benchmarks (N=10) was (new/old) 0.881 Geometric mean for strnlen-evex512 over all benchmarks (N=10) was (new/old) 0.953 Code Size Changes: strnlen-evex : +31 bytes strnlen-evex512 : +156 bytes Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-08-19 07:34:02 -07:00
Noah Goldstein	f446d90fe6	x86: Add `Avoid_STOSB` tunable to allow NT memset without ERMS The goal of this flag is to allow targets which don't prefer/have ERMS to still access the non-temporal memset implementation. There are 4 cases for tuning memset: 1) `Avoid_STOSB && Avoid_Non_Temporal_Memset` - Memset with temporal stores 2) `Avoid_STOSB && !Avoid_Non_Temporal_Memset` - Memset with temporal/non-temporal stores. Non-temporal path goes through `rep stosb` path. We accomplish this by setting `x86_rep_stosb_threshold` to `x86_memset_non_temporal_threshold`. 3) `!Avoid_STOSB && Avoid_Non_Temporal_Memset` - Memset with temporal stores/`rep stosb` 3) `!Avoid_STOSB && !Avoid_Non_Temporal_Memset` - Memset with temporal stores/`rep stosb`/non-temporal stores. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-08-15 08:19:15 -07:00
Noah Goldstein	7da0886247	x86: Fix bug in strchrnul-evex512 [BZ #32078 ] Issue was we were expecting not matches with CHAR before the start of the string in the page cross case. The check code in the page cross case: ``` and $0xffffffffffffffc0,%rax vmovdqa64 (%rax),%zmm17 vpcmpneqb %zmm17,%zmm16,%k1 vptestmb %zmm17,%zmm17,%k0{%k1} kmovq %k0,%rax inc %rax shr %cl,%rax je L(continue) ``` expects that all characters that neither match null nor CHAR will be 1s in `rax` prior to the `inc`. Then the `inc` will overflow all of the 1s where no relevant match was found. This is incorrect in the page-cross case, as the `vmovdqa64 (%rax),%zmm17` loads from before the start of the input string. If there are matches with CHAR before the start of the string, `rax` won't properly overflow. The fix is quite simple. Just replace: ``` inc %rax shr %cl,%rax ``` With: ``` sar %cl,%rax inc %rax ``` The arithmetic shift will clear any matches prior to the start of the string while maintaining the signbit so the 1s can properly overflow to zero in the case of no matches. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-08-15 08:11:33 -07:00
Paul Zimmermann	0797283910	added inputs giving large errors on x86_64 for new C23 functions These functions are exp10m1, exp2m1, log10p1, log2p1. Also regenerated ulps on x86_64. For each format, there are 4 values, one for each rounding mode. (For the intel96 format, there are 8 values, 4 for Intel hardware, and 4 for AMD hardware. However, regen-ulps was only run on Intel. It should be run in a separate patch on a AMD x86_64.) Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2024-08-07 14:28:46 +02:00
H.J. Lu	652c6cf269	x86-64: Remove sysdeps/x86_64/x32/dl-machine.h Remove sysdeps/x86_64/x32/dl-machine.h by folding x32 ARCH_LA_PLTENTER, ARCH_LA_PLTEXIT and RTLD_START into sysdeps/x86_64/dl-machine.h. There are no regressions on x86-64 nor x32. There are no changes in x86-64 _dl_start_user. On x32, _dl_start_user changes are <_dl_start_user>: mov %eax,%r12d + mov %esp,%r13d mov (%rsp),%edx mov %edx,%esi - mov %esp,%r13d and $0xfffffff0,%esp mov 0x0(%rip),%edi # <_dl_start_user+0x14> lea 0x8(%r13,%rdx,4),%ecx Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-07-25 00:17:21 -07:00
Paul Zimmermann	4dc22baa84	This patch adds larger ulp errors for the log2p1 function. Changes in v2: - added larger error for long double on AMD reported by Adhemerval (https://sourceware.org/pipermail/libc-alpha/2024-June/157755.html) Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2024-07-22 08:54:23 +02:00
H.J. Lu	66f2cd6e1a	x32: xfail elf/tst-platform-1 [BZ #22363 ] Xfail elf/tst-platform-1 on x32 since kernel passes i686 in AT_PLATFORM. See https://sourceware.org/bugzilla/show_bug.cgi?id=22363 Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Sam James <sam@gentoo.org>	2024-07-19 10:34:38 +02:00
Florian Weimer	018f0fc3b8	elf: Support recursive use of dynamic TLS in interposed malloc It turns out that quite a few applications use bundled mallocs that have been built to use global-dynamic TLS (instead of the recommended initial-exec TLS). The previous workaround from commit `afe42e935b` ("elf: Avoid some free (NULL) calls in _dl_update_slotinfo") does not fix all encountered cases unfortunatelly. This change avoids the TLS generation update for recursive use of TLS from a malloc that was called during a TLS update. This is possible because an interposed malloc has a fixed module ID and TLS slot. (It cannot be unloaded.) If an initially-loaded module ID is encountered in __tls_get_addr and the dynamic linker is already in the middle of a TLS update, use the outdated DTV, thus avoiding another call into malloc. It's still necessary to update the DTV to the most recent generation, to get out of the slow path, which is why the check for recursion is needed. The bookkeeping is done using a global counter instead of per-thread flag because TLS access in the dynamic linker is tricky. All this will go away once the dynamic linker stops using malloc for TLS, likely as part of a change that pre-allocates all TLS during pthread_create/dlopen. Fixes commit `d2123d6827` ("elf: Fix slow tls access after dlopen [BZ #19924]"). Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2024-07-01 19:02:11 +02:00
MayShao-oc	c19457aec6	x86_64: Optimize large size copy in memmove-ssse3 This patch optimizes large size copy using normal store when src > dst and overlap. Make it the same as the logic in memmove-vec-unaligned-erms.S. Current memmove-ssse3 use '__x86_shared_cache_size_half' as the non- temporal threshold, this patch updates that value to '__x86_shared_non_temporal_threshold'. Currently, the __x86_shared_non_temporal_threshold is cpu-specific, and different CPUs will have different values based on the related nt-benchmark results. However, in memmove-ssse3, the nontemporal threshold uses '__x86_shared_cache_size_half', which sounds unreasonable. The performance is not changed drastically although shows overall improvements without any major regressions or gains. Results on Zhaoxin KX-7000: bench-memcpy geometric_mean(N=20) New / Original: 0.999 bench-memcpy-random geometric_mean(N=20) New / Original: 0.999 bench-memcpy-large geometric_mean(N=20) New / Original: 0.978 bench-memmove geometric_mean(N=20) New / Original: 1.000 bench-memmmove-large geometric_mean(N=20) New / Original: 0.962 Results on Intel Core i5-6600K: bench-memcpy geometric_mean(N=20) New / Original: 1.001 bench-memcpy-random geometric_mean(N=20) New / Original: 0.999 bench-memcpy-large geometric_mean(N=20) New / Original: 1.001 bench-memmove geometric_mean(N=20) New / Original: 0.995 bench-memmmove-large geometric_mean(N=20) New / Original: 0.936 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-06-30 06:26:43 -07:00
Stefan Liebler	343439a31e	elf: Remove _DL_PLATFORMS_COUNT Remove the definitions of _DL_PLATFORMS_COUNT as those are not used anymore after removal in elf/dl-cache.c:search_cache(). Note: On x86, we can also get rid of the definitions HWCAP_PLATFORMS_START and HWCAP_PLATFORMS_COUNT. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2024-06-18 10:45:36 +02:00
Andreas K. Hüttel	98ffc1bfeb	Convert to autoconf 2.72 (vanilla release, no distribution patches) As discussed at the patch review meeting Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org> Reviewed-by: Simon Chopin <simon.chopin@canonical.com>	2024-06-17 21:15:28 +02:00
Joseph Myers	7ec903e028	Implement C23 exp2m1, exp10m1 C23 adds various <math.h> function families originally defined in TS 18661-4. Add the exp2m1 and exp10m1 functions (exp2(x)-1 and exp10(x)-1, like expm1). As with other such functions, these use type-generic templates that could be replaced with faster and more accurate type-specific implementations in future. Test inputs are copied from those for expm1, plus some additions close to the overflow threshold (copied from exp2 and exp10) and also some near the underflow threshold. exp2m1 has the unusual property of having an input (M_MAX_EXP) where whether the function overflows (under IEEE semantics) depends on the rounding mode. Although these could reasonably be XFAILed in the testsuite (as we do in some cases for arguments very close to a function's overflow threshold when an error of a few ulps in the implementation can result in the implementation not agreeing with an ideal one on whether overflow takes place - the testsuite isn't smart enough to handle this automatically), since these functions aren't required to be correctly rounding, I made the implementation check for and handle this case specially. The Makefile ordering expected by lint-makefiles for the new functions is a bit peculiar, but I implemented it in this patch so that the test passes; I don't know why log2 also needed moving in one Makefile variable setting when it didn't in my previous patches, but the failure showed a different place was expected for that function as well. The powerpc64le IFUNC setup seems not to be as self-contained as one might hope; it shouldn't be necessary to add IFUNCs for new functions such as these simply to get them building, but without setting up IFUNCs for the new functions, there were undefined references to __GI___expm1f128 (that IFUNC machinery results in no such function being defined, but doesn't stop include/math.h from doing the redirection resulting in the exp2m1f128 and exp10m1f128 implementations expecting to call it). Tested for x86_64 and x86, and with build-many-glibcs.py.	2024-06-17 16:31:49 +00:00
Joseph Myers	55eb99e9a9	Implement C23 log10p1 C23 adds various <math.h> function families originally defined in TS 18661-4. Add the log10p1 functions (log10(1+x): like log1p, but for base-10 logarithms). This is directly analogous to the log2p1 implementation (except that whereas log2p1 has a smaller underflow range than log1p, log10p1 has a larger underflow range). The test inputs are copied from those for log1p and log2p1, plus a few more inputs in that wider underflow range. Tested for x86_64 and x86, and with build-many-glibcs.py.	2024-06-17 13:48:13 +00:00
Joseph Myers	bb014f50c4	Implement C23 logp1 C23 adds various <math.h> function families originally defined in TS 18661-4. Add the logp1 functions (aliases for log1p functions - the name is intended to be more consistent with the new log2p1 and log10p1, where clearly it would have been very confusing to name those functions log21p and log101p). As aliases rather than new functions, the content of this patch is somewhat different from those actually adding new functions. Tests are shared with log1p, so this patch does mechanically update all affected libm-test-ulps files to expect the same errors for both functions. The vector versions of log1p on aarch64 and x86_64 are not updated to have logp1 aliases (and thus there are no corresponding header, tests, abilist or ulps changes for vector functions either). It would be reasonable for such vector aliases and corresponding changes to other files to be made separately. For now, the log1p tests instead avoid testing logp1 in the vector case (a Makefile change is needed to avoid problems with grep, used in generating the .c files for vector function tests, matching more than one ALL_RM_TEST line in a file testing multiple functions with the same inputs, when it assumes that the .inc file only has a single such line). Tested for x86_64 and x86, and with build-many-glibcs.py.	2024-06-17 13:47:09 +00:00
Noah Goldstein	46b5e98ef6	x86: Add seperate non-temporal tunable for memset The tuning for non-temporal stores for memset vs memcpy is not always the same. This includes both the exact value and whether non-temporal stores are profitable at all for a given arch. This patch add `x86_memset_non_temporal_threshold`. Currently we disable non-temporal stores for non Intel vendors as the only benchmarks showing its benefit have been on Intel hardware. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-05-30 12:36:09 -05:00
Noah Goldstein	5bf0ab8057	x86: Improve large memset perf with non-temporal stores [RHEL-29312] Previously we use `rep stosb` for all medium/large memsets. This is notably worse than non-temporal stores for large (above a few MBs) memsets. See: https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing For data using different stategies for large memset on ICX and SKX. Using non-temporal stores can be up to 3x faster on ICX and 2x faster on SKX. Historically, these numbers would not have been so good because of the zero-over-zero writeback optimization that `rep stosb` is able to do. But, the zero-over-zero writeback optimization has been removed as a potential side-channel attack, so there is no longer any good reason to only rely on `rep stosb` for large memsets. On the flip size, non-temporal writes can avoid data in their RFO requests saving memory bandwidth. All of the other changes to the file are to re-organize the code-blocks to maintain "good" alignment given the new code added in the `L(stosb_local)` case. The results from running the GLIBC memset benchmarks on TGL-client for N=20 runs: Geometric Mean across the suite New / Old EXEX256: 0.979 Geometric Mean across the suite New / Old EXEX512: 0.979 Geometric Mean across the suite New / Old AVX2 : 0.986 Geometric Mean across the suite New / Old SSE2 : 0.979 Most of the cases are essentially unchanged, this is mostly to show that adding the non-temporal case didn't add any regressions to the other cases. The results on the memset-large benchmark suite on TGL-client for N=20 runs: Geometric Mean across the suite New / Old EXEX256: 0.926 Geometric Mean across the suite New / Old EXEX512: 0.925 Geometric Mean across the suite New / Old AVX2 : 0.928 Geometric Mean across the suite New / Old SSE2 : 0.924 So roughly a 7.5% speedup. This is lower than what we see on servers (likely because clients typically have faster single-core bandwidth so saving bandwidth on RFOs is less impactful), but still advantageous. Full test-suite passes on x86_64 w/ and w/o multiarch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-05-30 12:36:09 -05:00
Xin Wang	e0f7f1808f	x86_64: Reformat elf_machine_rela A space is added before the left bracket of the x86_64 elf_machine_rela function, in order to harmonize with the rest of the implementation of the function and to make it easier to retrieve the function. The lines where the function definition is located has been re-indented, as well as its left curly bracket placed in the correct position. Signed-off-by: Xin Wang <yw987194828@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-05-27 13:46:45 -07:00
Joseph Myers	79c52daf47	Implement C23 log2p1 C23 adds various <math.h> function families originally defined in TS 18661-4. Add the log2p1 functions (log2(1+x): like log1p, but for base-2 logarithms). This illustrates the intended structure of implementations of all these function families: define them initially with a type-generic template implementation. If someone wishes to add type-specific implementations, it is likely such implementations can be both faster and more accurate than the type-generic one and can then override it for types for which they are implemented (adding benchmarks would be desirable in such cases to demonstrate that a new implementation is indeed faster). The test inputs are copied from those for log1p. Note that these changes make gen-auto-libm-tests depend on MPFR 4.2 (or later). The bulk of the changes are fairly generic for any such new function. (sysdeps/powerpc/nofpu/Makefile only needs changing for those type-generic templates that use fabs.) Tested for x86_64 and x86, and with build-many-glibcs.py.	2024-05-20 13:41:39 +00:00
Gabi Falk	dd5f891c1a	x86_64: Fix missing wcsncat function definition without multiarch (x86-64-v4) This code expects the WCSCAT preprocessor macro to be predefined in case the evex implementation of the function should be defined with a name different from __wcsncat_evex. However, when glibc is built for x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines WCSNCAT variable instead of WCSCAT to build it as wcsncat. Rename the variable to WCSNCAT, as it is actually a better naming choice for the variable in this case. Reported-by: Kenton Groombridge Link: https://bugs.gentoo.org/921945 Fixes: `64b8b6516b` ("x86: Add evex optimized functions for the wchar_t strcpy family") Signed-off-by: Gabi Falk <gabifalk@gmx.com> Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>	2024-05-08 07:37:59 -07:00
Florian Weimer	aea52e3d2b	Revert "x86_64: Suppress false positive valgrind error" This reverts commit `a1735e0aa8`. The test failure is a real valgrind bug that needs to be fixed before valgrind is usable with a glibc that has been built with CC="gcc -march=x86-64-v3". The proposed valgrind patch teaches valgrind to replace ld.so strcmp with an unoptimized scalar implementation, thus avoiding any AVX2-related problems. Valgrind bug: <https://bugs.kde.org/show_bug.cgi?id=485487> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-04-13 17:42:13 +02:00
H.J. Lu	9e1f4aef86	x86-64: Exclude FMA4 IFUNC functions for -mapxf When -mapxf is used to build glibc, the resulting glibc will never run on FMA4 machines. Exclude FMA4 IFUNC functions when -mapxf is used. This requires GCC which defines __APX_F__ for -mapxf with commit: 1df56719bd8 x86: Define __APX_F__ for -mapxf Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>	2024-04-06 05:03:55 -07:00
Adhemerval Zanella	44ccc2465c	math: x86 trunc traps when FE_INEXACT is enabled (BZ 31603) The implementations of trunc functions using x87 floating point (i386 and x86_64 long double only) traps when FE_INEXACT is enabled. Although this is a GNU extension outside the scope of the C standard, other architectures that also support traps do not show this behavior. The fix moves the implementation to a common one that holds any exceptions with a 'fnclex' (libc_feholdexcept_setround_387). Checked on x86_64-linux-gnu and i686-linux-gnu. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-04-04 14:29:28 -03:00
Adhemerval Zanella	932544efa4	math: x86 floor traps when FE_INEXACT is enabled (BZ 31601) The implementations of floor functions using x87 floating point (i386 and 86_64 long double only) traps when FE_INEXACT is enabled. Although this is a GNU extension outside the scope of the C standard, other architectures that also support traps do not show this behavior. The fix moves the implementation to a common one that holds any exceptions with a 'fnclex' (libc_feholdexcept_setround_387). Checked on x86_64-linux-gnu and i686-linux-gnu. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-04-04 14:29:28 -03:00
Adhemerval Zanella	637bfc392f	math: x86 ceill traps when FE_INEXACT is enabled (BZ 31600) The implementations of ceil functions using x87 floating point (i386 and x86_64 long double only) traps when FE_INEXACT is enabled. Although this is a GNU extension outside the scope of the C standard, other architectures that also support traps do not show this behavior. The fix moves the implementation to a common one that holds any exceptions with a 'fnclex' (libc_feholdexcept_setround_387). Checked on x86_64-linux-gnu and i686-linux-gnu. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-04-04 14:29:28 -03:00
Adhemerval Zanella	721314c980	x86_64: Remove avx512 strstr implementation As indicated in a recent thread, this it is a simple brute-force algorithm that checks the whole needle at a matching character pair (and does so 1 byte at a time after the first 64 bytes of a needle). Also it never skips ahead and thus can match at every haystack position after trying to match all of the needle, which generic implementation avoids. As indicated by Wilco, a 4x larger needle and 16x larger haystack gives a clear 65x slowdown both basic_strstr and __strstr_avx512: "ifuncs": ["basic_strstr", "twoway_strstr", "__strstr_avx512", "__strstr_sse2_unaligned", "__strstr_generic"], { "len_haystack": 65536, "len_needle": 1024, "align_haystack": 0, "align_needle": 0, "fail": 1, "desc": "Difficult bruteforce needle", "timings": [4.0948e+07, 15094.5, 3.20818e+07, 108558, 10839.2] }, { "len_haystack": 1048576, "len_needle": 4096, "align_haystack": 0, "align_needle": 0, "fail": 1, "desc": "Difficult bruteforce needle", "timings": [2.69767e+09, 100797, 2.08535e+09, 495706, 82666.9] } PS: I don't have an AVX512 capable machine to verify this issues, but skimming through the code it does seems to follow what Wilco has described. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-03-27 13:48:16 -03:00
Andreas Schwab	fd7ee2e6c5	Add tst-gnu2-tls2mod1 to test-internal-extras That allows sysdeps/x86_64/tst-gnu2-tls2mod1.S to use internal headers. Fixes: `717ebfa85c` ("x86-64: Allocate state buffer space for RDI, RSI and RBX")	2024-03-19 14:28:28 +01:00
H.J. Lu	717ebfa85c	x86-64: Allocate state buffer space for RDI, RSI and RBX _dl_tlsdesc_dynamic preserves RDI, RSI and RBX before realigning stack. After realigning stack, it saves RCX, RDX, R8, R9, R10 and R11. Define TLSDESC_CALL_REGISTER_SAVE_AREA to allocate space for RDI, RSI and RBX to avoid clobbering saved RDI, RSI and RBX values on stack by xsave to STATE_SAVE_OFFSET(%rsp). +==================+<- stack frame start aligned at 8 or 16 bytes \| \|<- RDI saved in the red zone \| \|<- RSI saved in the red zone \| \|<- RBX saved in the red zone \| \|<- paddings for stack realignment of 64 bytes \|------------------\|<- xsave buffer end aligned at 64 bytes \| \|<- \| \|<- \| \|<- \|------------------\|<- xsave buffer start at STATE_SAVE_OFFSET(%rsp) \| \|<- 8-byte padding for 64-byte alignment \| \|<- 8-byte padding for 64-byte alignment \| \|<- R11 \| \|<- R10 \| \|<- R9 \| \|<- R8 \| \|<- RDX \| \|<- RCX +==================+<- RSP aligned at 64 bytes Define TLSDESC_CALL_REGISTER_SAVE_AREA, the total register save area size for all integer registers by adding 24 to STATE_SAVE_OFFSET since RDI, RSI and RBX are saved onto stack without adjusting stack pointer first, using the red-zone. This fixes BZ #31501. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>	2024-03-18 19:45:13 -07:00
H.J. Lu	9b7091415a	x86-64: Update _dl_tlsdesc_dynamic to preserve AMX registers _dl_tlsdesc_dynamic should also preserve AMX registers which are caller-saved. Add X86_XSTATE_TILECFG_ID and X86_XSTATE_TILEDATA_ID to x86-64 TLSDESC_CALL_STATE_SAVE_MASK. Compute the AMX state size and save it in xsave_state_full_size which is only used by _dl_tlsdesc_dynamic_xsave and _dl_tlsdesc_dynamic_xsavec. This fixes the AMX part of BZ #31372. Tested on AMX processor. AMX test is enabled only for compilers with the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114098 GCC 14 and GCC 11/12/13 branches have the bug fix. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>	2024-02-29 04:30:01 -08:00
H.J. Lu	a1735e0aa8	x86_64: Suppress false positive valgrind error When strcmp-avx2.S is used as the default, elf/tst-valgrind-smoke fails with ==1272761== Conditional jump or move depends on uninitialised value(s) ==1272761== at 0x4022C98: strcmp (strcmp-avx2.S:462) ==1272761== by 0x400B05B: _dl_name_match_p (dl-misc.c:75) ==1272761== by 0x40085F3: _dl_map_object (dl-load.c:1966) ==1272761== by 0x401AEA4: map_doit (rtld.c:644) ==1272761== by 0x4001488: _dl_catch_exception (dl-catch.c:237) ==1272761== by 0x40015AE: _dl_catch_error (dl-catch.c:256) ==1272761== by 0x401B38F: do_preload (rtld.c:816) ==1272761== by 0x401C116: handle_preload_list (rtld.c:892) ==1272761== by 0x401EDF5: dl_main (rtld.c:1842) ==1272761== by 0x401A79E: _dl_sysdep_start (dl-sysdep.c:140) ==1272761== by 0x401BEEE: _dl_start_final (rtld.c:494) ==1272761== by 0x401BEEE: _dl_start (rtld.c:581) ==1272761== by 0x401AD87: ??? (in /elf/ld.so) The assembly codes are: 0x0000000004022c80 <+144>: vmovdqu 0x20(%rdi),%ymm0 0x0000000004022c85 <+149>: vpcmpeqb 0x20(%rsi),%ymm0,%ymm1 0x0000000004022c8a <+154>: vpcmpeqb %ymm0,%ymm15,%ymm2 0x0000000004022c8e <+158>: vpandn %ymm1,%ymm2,%ymm1 0x0000000004022c92 <+162>: vpmovmskb %ymm1,%ecx 0x0000000004022c96 <+166>: inc %ecx => 0x0000000004022c98 <+168>: jne 0x4022c32 <strcmp+66> strcmp-avx2.S has 32-byte vector loads of strings which are shorter than 32 bytes: (gdb) p (char ) ($rdi + 0x20) $6 = 0x1ffeffea20 "memcheck-amd64-linux.so" (gdb) p (char ) ($rsi + 0x20) $7 = 0x4832640 "core-amd64-linux.so" (gdb) call (int) strlen ((char ) ($rsi + 0x20)) $8 = 19 (gdb) call (int) strlen ((char *) ($rdi + 0x20)) $9 = 23 (gdb) It triggers the valgrind error. The above code is safe since the loads don't cross the page boundary. Update tst-valgrind-smoke.sh to accept an optional suppression file and pass a suppression file to valgrind when strcmp-avx2.S is the default implementation of strcmp. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>	2024-02-28 13:40:55 -08:00
H.J. Lu	befe2d3c4d	x86-64: Don't use SSE resolvers for ISA level 3 or above When glibc is built with ISA level 3 or above enabled, SSE resolvers aren't available and glibc fails to build: ld: .../elf/librtld.os: in function `init_cpu_features': .../elf/../sysdeps/x86/cpu-features.c:1200:(.text+0x1445f): undefined reference to `_dl_runtime_resolve_fxsave' ld: .../elf/librtld.os: relocation R_X86_64_PC32 against undefined hidden symbol `_dl_runtime_resolve_fxsave' can not be used when making a shared object /usr/local/bin/ld: final link failed: bad value For ISA level 3 or above, don't use _dl_runtime_resolve_fxsave nor _dl_tlsdesc_dynamic_fxsave. This fixes BZ #31429. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-02-28 11:49:30 -08:00
H.J. Lu	0aac205a81	x86: Update _dl_tlsdesc_dynamic to preserve caller-saved registers Compiler generates the following instruction sequence for GNU2 dynamic TLS access: leaq tls_var@TLSDESC(%rip), %rax call tls_var@TLSCALL(%rax) or leal tls_var@TLSDESC(%ebx), %eax call tls_var@TLSCALL(%eax) CALL instruction is transparent to compiler which assumes all registers, except for EFLAGS and RAX/EAX, are unchanged after CALL. When _dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow path. __tls_get_addr is a normal function which doesn't preserve any caller-saved registers. _dl_tlsdesc_dynamic saved and restored integer caller-saved registers, but didn't preserve any other caller-saved registers. Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE, XSAVE and XSAVEC to save and restore all caller-saved registers. This fixes BZ #31372. Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic) to optimize elf_machine_runtime_setup. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-02-28 09:02:56 -08:00
Sunil K Pandey	9f78a7c1d0	x86_64: Exclude SSE, AVX and FMA4 variants in libm multiarch When glibc is built with ISA level 3 or higher by default, the resulting glibc binaries won't run on SSE or FMA4 processors. Exclude SSE, AVX and FMA4 variants in libm multiarch when ISA level 3 or higher is enabled by default. When glibc is built with ISA level 2 enabled by default, only keep SSE4.1 variant. Fixes BZ 31335. NB: elf/tst-valgrind-smoke test fails with ISA level 4, because valgrind doesn't support AVX512 instructions: https://bugs.kde.org/show_bug.cgi?id=383010 Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-02-25 13:20:51 -08:00
H.J. Lu	ef7f4b1fef	Apply the Makefile sorting fix Apply the Makefile sorting fix generated by sort-makefile-lines.py.	2024-02-15 11:19:56 -08:00
H.J. Lu	71d133c500	sysdeps/x86_64/Makefile (tests): Add the end marker	2024-02-15 11:12:13 -08:00
Adhemerval Zanella	491e55beab	x86: Expand the comment on when REP STOSB is used on memset Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-02-13 08:49:43 -08:00
Michael Jeanson	155bb9d036	x86/cet: fix shadow stack test scripts Some shadow stack test scripts use the '==' operator with the 'test' command to validate exit codes resulting in the following error: sysdeps/x86_64/tst-shstk-legacy-1e.sh: 31: test: 139: unexpected operator The '==' operator is invalid for the 'test' command, use '-eq' like the previous call to 'test'. Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-02-12 06:49:57 -08:00

1 2 3 4 5 ...

1892 Commits