Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency. On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
These functions are exp10m1, exp2m1, log10p1, log2p1.
Also regenerated ulps on x86_64.
For each format, there are 4 values, one for each rounding mode.
(For the intel96 format, there are 8 values, 4 for Intel hardware,
and 4 for AMD hardware. However, regen-ulps was only run on Intel.
It should be run in a separate patch on a AMD x86_64.)
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The old code used l_init_called as an indicator for whether TLS
initialization was complete. However, it is possible that
TLS for an object is initialized, written to, and then dlopen
for this object is called again, and l_init_called is not true at
this point. Previously, this resulted in TLS being initialized
twice, discarding any interim writes (technically introducing a
use-after-free bug even).
This commit introduces an explicit per-object flag, l_tls_in_slotinfo.
It indicates whether _dl_add_to_slotinfo has been called for this
object. This flag is used to avoid double-initialization of TLS.
In update_tls_slotinfo, the first_static_tls micro-optimization
is removed because preserving the initalization flag for subsequent
use by the second loop for static TLS is a bit complicated, and
another per-object flag does not seem to be worth it. Furthermore,
the l_init_called flag is dropped from the second loop (for static
TLS initialization) because l_need_tls_init on its own prevents
double-initialization.
The remaining l_init_called usage in resize_scopes and update_scopes
is just an optimization due to the use of scope_has_map, so it is
not changed in this commit.
The isupper check ensures that libc.so.6 is TLS is not reverted.
Such a revert happens if l_need_tls_init is not cleared in
_dl_allocate_tls_init for the main_thread case, now that
l_init_called is not checked anymore in update_tls_slotinfo
in elf/dl-open.c.
Reported-by: Jonathon Anderson <janderson@rice.edu>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
5476f8cd2e ("htl: move pthread_self info libc.") and
9dfa256216 ("htl: move pthread_equal into libc") to
1dc0bc8f07 ("htl: move pthread_attr_setdetachstate into libc")
moved some pthread_ symbols from libpthread.so to libc.so, but missed
adding the compat version like 5476f8cd2e ("htl: move pthread_self
info libc.") did: libc already had these symbols as forwards,
but versioned GLIBC_2.21, while the symbols in libpthread.so were
versioned GLIBC_2.12.
To fix running executables built before this, we thus have to add the
GLIBC_2.12 version, otherwise execution fails with e.g.
/usr/lib/i386-gnu/libglib-2.0.so: symbol lookup error: /usr/lib/i386-gnu/libglib-2.0.so: undefined symbol: pthread_attr_setinheritsched, version GLIBC_2.12
Add tests for MREMAP_MAYMOVE and MREMAP_FIXED. On Linux, also test
MREMAP_DONTUNMAP.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Update the mremap C implementation to support the optional argument for
MREMAP_DONTUNMAP added in Linux 5.7 since it may not always be correct
to implement a variadic function as a non-variadic function on all Linux
targets. Return MAP_FAILED and set errno to EINVAL for unknown flag bits.
This fixes BZ #31968.
Note: A test must be added when a new flag bit is introduced.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
It was added by commit c62b758bae6af16 as a way for userspace to
check if two file descriptors refer to the same struct file.
Checked on aarch64-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
This patch updates the kernel version in the tests tst-mman-consts.py,
tst-mount-consts.py, and tst-pidfd-consts.py to 6.9.
There are no new constants covered by these tests in 6.10.
Tested with build-many-glibcs.py.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Linux 6.10 changes for syscall are:
* mseal for all architectures.
* map_shadow_stack for x32.
* Replace sync_file_range with sync_file_range2 for csky (which
fixes a broken sync_file_range usage).
Update syscall-names.list and regenerate the arch-syscall.h headers
with build-many-glibcs.py update-syscalls.
Tested with build-many-glibcs.py.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Remove local FAIL macro in favor to FAIL_EXIT1 from <support/check.h>,
which provides equivalent reporting, with the name of the file and the
line number within of the failure site additionally included. Remove
FAIL_ERR altogether and include ": %m" explicitly with the format string
supplied to FAIL_EXIT1 as there seems little value to have a separate
macro just for this.
Reviewed-by: DJ Delorie <dj@redhat.com>
Use RXX_LP in RTLD_START_ENABLE_X86_FEATURES. Support shadow stack during
startup for Linux 6.10:
commit 2883f01ec37dd8668e7222dfdb5980c86fdfe277
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 15 07:04:33 2024 -0700
x86/shstk: Enable shadow stacks for x32
1. Add shadow stack support to x32 signal.
2. Use the 64-bit map_shadow_stack syscall for x32.
3. Set up shadow stack for x32.
Add the map_shadow_stack system call to <fixup-asm-unistd.h> and regenerate
arch-syscall.h. Tested on Intel Tiger Lake with CET enabled x32. There
are no regressions with CET enabled x86-64. There are no changes in CET
enabled x86-64 _dl_start_user.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Remove sysdeps/x86_64/x32/dl-machine.h by folding x32 ARCH_LA_PLTENTER,
ARCH_LA_PLTEXIT and RTLD_START into sysdeps/x86_64/dl-machine.h. There
are no regressions on x86-64 nor x32. There are no changes in x86-64
_dl_start_user. On x32, _dl_start_user changes are
<_dl_start_user>:
mov %eax,%r12d
+ mov %esp,%r13d
mov (%rsp),%edx
mov %edx,%esi
- mov %esp,%r13d
and $0xfffffff0,%esp
mov 0x0(%rip),%edi # <_dl_start_user+0x14>
lea 0x8(%r13,%rdx,4),%ecx
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
The powerpc pkey_get/pkey_set support was only added for 64-bit [1],
and tst-pkey only checks if the support was present with pkey_alloc
(which does not fail on powerpc32, at least running a 64-bit kernel).
Checked on powerpc-linux-gnu.
[1] https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a803367bab167f5ec4fde1f0d0ec447707c29520
Reviewed-By: Andreas K. Huettel <dilfridge@gentoo.org>
Xfail elf/tst-platform-1 on x32 since kernel passes i686 in AT_PLATFORM.
See https://sourceware.org/bugzilla/show_bug.cgi?id=22363
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
0e75c4a463 ("hurd: Fix pthread_self() without libpthread") added a
declaration for ___pthread_init_thread instead of __pthread_init_thread,
and missed defining the external hidden symbol.
5476f8cd2e ("htl: move pthread_self info libc.") moved the htl
pthread_self() function from libpthread to libc, replacing the previous libc
stub that just returns 0. And 53da64d1cf ("htl: Initialize ___pthread_self
early") added initialization code which is needed before being able to
call pthread_self. It is currently in libpthread, and thus never called
before programs can call pthread_self from libc, which then segfaults
when accessing _pthread_self()->thread.
This moves the initialization to libc itself, as initialized variables, so
pthread_self can always be called fine.
In _dl_tlsdesc_dynamic, there are three 'addi.d sp, sp, -size'
instructions to allocate stack size for Float/LSX/LASX registers.
Every 'addi.d sp, sp, -size' needs a cfi_adjust_cfa_offset because
of sp is used to compute CFA. But only one 'addi.d sp, sp, -size'
will be run according to HWCAP value. And all cfi_adjust_cfa_offset
will be executed in stack unwinding, it result in incorrect CFA.
Change _dl_tlsdesc_dynamic to _dl_tlsdesc_dynamic,
_dl_tlsdesc_dynamic_lsx and _dl_tlsdesc_dynamic_lasx.
Conflicting cfi instructions can be distributed to the three functions.
And cfi instructions can correspond to stack down instructions.
The original commit enabling non-temporal memset on Skylake Server had
erroneous benchmarks (actually done on ICX).
Further benchmarks indicate non-temporal stores may in fact by a
regression on Skylake Server.
This commit may be over-cautious in some cases, but should avoid any
regressions for 2.40.
Tested using qemu on all x86_64 cpu arch supported by both qemu +
GLIBC.
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
We use thread_get_name and thread_set_name to get and set the thread
name, so nothing is stored in the thread structure since these functions
are supposed to be called sparingly.
One notable difference with Linux is that the thread name is up to 32
chars, whereas Linux's is 16.
Also added a mach_RPC_CHECK to check for the existing of gnumach RPCs.
The z13/vector-optimized wcsncmp implementation segfaults if n=1
and there is only one character (equal on both strings) before
the page end. Then it loads and compares one character and misses
to check n again. The following load fails.
This patch removes the extra load and compare of the first character
and just start with the loop which uses vector-load-to-block-boundary.
This code-path also checks n.
With this patch both tests are passing:
- the simplified one mentioned in the bugzilla 31934
- the full one in Florian Weimer's patch:
"manual: Document a GNU extension for strncmp/wcsncmp"
(https://patchwork.sourceware.org/project/glibc/patch/874j9eml6y.fsf@oldenburg.str.redhat.com/):
On s390x-linux-gnu (z16), the new wcsncmp test fails due to bug 31934.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The __rseq_size value is now the active area of struct rseq
(so 20 initially), not the full struct size including padding
at the end (32 initially).
Update misc/tst-rseq to print some additional diagnostics.
Reviewed-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
This hasn't been looked at for a loong time (already guessing from
the number of missing entries), and it ain't pretty.
There are some 9-ulps results for float.
- ZaZaZebra (qemu-system-m68k clone of PowerBook 190 system)
- GCC 13.3.1 20240614 (Gentoo 13.3.1_p20240614 p17)
- ld GNU ld (Gentoo 2.42 p6) 2.42.0
- Linux ZaZaZebra 4.19.0-5-m68k #1 Gentoo 4.19.37-5 (2019-06-19) m68k 68040 68040 GNU/Linux
- manual build
- ../glibc/configure --enable-fortify-source --prefix=/usr
- Tested by Immolo (via Andreas K. Hüttel)
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
While working on a patch to add support for the extensible rseq ABI, we
came across an issue where a new 'const' variable would be merged with
the existing '__rseq_size' variable. We tracked this to the use of
'-fmerge-all-constants' which allows the compiler to merge identical
constant variables. This means that all 'const' variables in a compile
unit that are of the same size and are initialized to the same value can
be merged.
In this specific case, on 32 bit systems 'unsigned int' and 'ptrdiff_t'
are both 4 bytes and initialized to 0 which should trigger the merge.
However for reasons we haven't delved into when the attribute 'section
(".data.rel.ro")' is added to the mix, only variables of the same exact
types are merged. As far as we know this behavior is not specified
anywhere and could change with a new compiler version, hence this patch.
Move the definitions of these variables into an assembler file and add
hidden writable aliases for internal use. This has the added bonus of
removing the asm workaround to set the values on rseq registration.
Tested on Debian 12 with GCC 12.2.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Adhemerval noticed that the gettimeofday() and 32-bit clock_gettime()
vDSO calls won't be used by glibc on hppa, so there is no need to
declare them. Both syscalls will be emulated by utilizing return values
of the 64-bit clock_gettime() vDSO instead.
Signed-off-by: Helge Deller <deller@gmx.de>
Suggested-by: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
MIPSr6 has MADDF.s/MADDF.d instructions, which are fused.
In MIPS ISA, double support can be subsetted. Only FMAF is enabled
for this case.
* sysdeps/mips/fpu/math-use-builtins-fma.h
Signed-off-by: YunQiang Su <syq@gcc.gnu.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS). The previous workaround from
commit afe42e935b ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") does not fix all
encountered cases unfortunatelly.
This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update. This
is possible because an interposed malloc has a fixed module ID and
TLS slot. (It cannot be unloaded.) If an initially-loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc. It's still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.
The bookkeeping is done using a global counter instead of per-thread
flag because TLS access in the dynamic linker is tricky.
All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.
Fixes commit d2123d6827 ("elf: Fix slow
tls access after dlopen [BZ #19924]").
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Current 'non_temporal_threshold' set to 'non_temporal_threshold_lowbound'
on Zhaoxin processors without ERMS. The default
'non_temporal_threshold_lowbound' is too small for the KH-40000 and KX-7000
Zhaoxin processors, this patch updates the value to
'shared / cachesize_non_temporal_divisor'.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This patch optimizes large size copy using normal store when src > dst
and overlap. Make it the same as the logic in memmove-vec-unaligned-erms.S.
Current memmove-ssse3 use '__x86_shared_cache_size_half' as the non-
temporal threshold, this patch updates that value to
'__x86_shared_non_temporal_threshold'. Currently, the
__x86_shared_non_temporal_threshold is cpu-specific, and different CPUs
will have different values based on the related nt-benchmark results.
However, in memmove-ssse3, the nontemporal threshold uses
'__x86_shared_cache_size_half', which sounds unreasonable.
The performance is not changed drastically although shows overall
improvements without any major regressions or gains.
Results on Zhaoxin KX-7000:
bench-memcpy geometric_mean(N=20) New / Original: 0.999
bench-memcpy-random geometric_mean(N=20) New / Original: 0.999
bench-memcpy-large geometric_mean(N=20) New / Original: 0.978
bench-memmove geometric_mean(N=20) New / Original: 1.000
bench-memmmove-large geometric_mean(N=20) New / Original: 0.962
Results on Intel Core i5-6600K:
bench-memcpy geometric_mean(N=20) New / Original: 1.001
bench-memcpy-random geometric_mean(N=20) New / Original: 0.999
bench-memcpy-large geometric_mean(N=20) New / Original: 1.001
bench-memmove geometric_mean(N=20) New / Original: 0.995
bench-memmmove-large geometric_mean(N=20) New / Original: 0.936
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Fix code formatting under the Zhaoxin branch and add comments for
different Zhaoxin models.
Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
the AVX_Fast_Unaligned_Load.
Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
use sse2_unaligned version of memset,strcpy and strcat.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Qualcom's new core, oryon-1, has a different characteristics for
memset than the current versions of memset. For non-zero, larger
sizes, using GPRs rather than the SIMD stores is ~30% faster.
For even larger sizes, using the nontemporal stores is needed
not to polute the L1/L2 caches.
For zero values, using `dc zva` should be used. Since we
know the size will always be 64 bytes, we don't need to figure
out the size there.
I started with the emag memset and added back the `dc zva` code.
Changes since v1:
* v3: Fix comment formating
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Qualcomm's new core (oryon-1) has a different performance characteristic
than other cores. For memcpy, it is faster to use the GPRs to
do the copy for large sizes (2x faster). For even larger sizes,
it is better to use the nontemporal load/store instructions so
we don't pollute the L1/L2 caches.
For smaller sizes, the characteristic are very similar to
other cores.
I used the thunderx memcpy as a starting point and expanded from there.
Changes since v1:
* v2: Fix ordering in Makefile.
* v3: Fix comment grammar about the ldnp/stnp instructions.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This recently came up during a cleanup to remove misaligned accesses
from the RISC-V port.
Link: https://sourceware.org/pipermail/libc-alpha/2022-June/139961.html
Suggested-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Fangrui Song <maskray@google.com>
asm volatile ("movfcsr2gr $t0, $fcsr0" ::: "$t0");
asm volatile ("st.d $t0, %0" :"=m"(restore_fcsr));
generate to the following instructions with -Og flag:
movfcsr2gr $t0, $zero
addi.d $t0, $sp, 2047(0x7ff)
addi.d $t0, $t0, 77(0x4d)
st.w $t0, $t0, 0
fcsr0 register and restore_fcsr variable are both stored in t0 register.
Change to:
asm volatile ("movfcsr2gr %0, $fcsr0" :"=r"(restore_fcsr));
to avoid restore_fcsr address in t0.
Comparing float value using memcmp because float value cannot be
directly compared for equality.
Put LOAD_REGISTER_FCSR and SAVE_REGISTER_FCC after LOAD_REGISTER_FLOAT.
Some float instructions may change fcsr register.
If the pidfd_spawn/pidfd_spawnp helper process succeeds, but evecve
fails for some reason (either with an invalid/non-existent, memory
allocation, etc.) the resulting pidfd is never closed, nor returned
to caller (so it can call close).
Since the process creation failed, it should be up to posix_spawn to
also, close the file descriptor in this case (similar to what it
does to reap the process).
This patch also changes the waitpid with waitid (P_PIDFD) for pidfd
case, to avoid a possible pid re-use.
Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The atomic_spin_nop() macro can be used to run arch-specific
code in the body of a spin loop to potentially improve efficiency.
RISC-V's Zihintpause extension includes a PAUSE instruction for
this use-case, which is encoded as a HINT, which means that it
behaves like a NOP on systems that don't implement Zihintpause.
Binutils supports Zihintpause since 2.36, so this patch uses
the ".insn" directive to keep the code compatible with older
toolchains.
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
MIPSr6 has MADDF.s/MADDF.d instructions, which are fused.
In MIPS ISA, double support can be subsetted. Only FMAF is enabled
for this case.
* sysdeps/mips/fpu/math-use-builtins-fma.h
Signed-off-by: YunQiang Su <syq@gcc.gnu.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
The upcoming parisc (hppa) v6.11 Linux kernel will include vDSO
support for gettimeofday(), clock_gettime() and clock_gettime64()
syscalls for 32- and 64-bit userspace.
The patch below adds the necessary glue code for glibc.
Signed-off-by: Helge Deller <deller@gmx.de>
Changes in v2:
- add vsyscalls for 64-bit too
The test is not a run-time check, so update the description.
Also use readelf -W for a more stable output format and fix
an LC_ALL typo.
This avoids garbled configure messages:
checking for s390-specific static PIE requirements (runtime check)... 0x0000000000000017 (JMPREL) 0x280
yes
Based on a -march=x86-64-v4 -mfpmath=sse build, with and without
--disable-multi-arch, running on a Zen 4 CPU. Also used different
-march=x8i6-64-v… settings.
HWCAP value is overwritten at the first comparison of the LASX case.
The second comparison at LSX get incorrect result.
Change to use t0 to save HWCAP value, and use t1 to save comparison
result.
The _dl_sysdep_parse_arguments function contains initalization
of a large on-stack variable:
dl_parse_auxv_t auxv_values = { 0, };
This uses a non-inline version of memset on powerpc64le-linux-gnu,
so it must use the baseline memset.
A desired hugetlb page size can be encoded in the flags parameter of
system calls such as mmap() and shmget(). The Linux UAPI headers have
included explicit definitions for these encodings since v4.14.
This patch adds these definitions that are used along with MAP_HUGETLB
and SHM_HUGETLB flags as specified in the corresponding man pages. This
relieves programs from having to duplicate and/or compute the encodings
manually.
Additionally, the filter on these definitions in tst-mman-consts.py is
removed, as suggested by Florian. I then ran this tests successfully,
confirming the alignment with the kernel headers.
PASS: misc/tst-mman-consts
original exit status 0
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Tested-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Remove the definitions of HWCAP_IMPORTANT after removal of
LD_HWCAP_MASK / tunable glibc.cpu.hwcap_mask. There HWCAP_IMPORTANT
was used as default value.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the environment variable LD_HWCAP_MASK and the tunable
glibc.cpu.hwcap_mask as those are not used anymore in common-code
after removal in elf/dl-cache.c:search_cache().
The only remaining user is sparc32 where it is used in
elf_machine_matches_host(). If sparc32 does not need it anymore,
we can get rid of it at all. Otherwise we could also move
LD_HWCAP_MASK / tunable glibc.cpu.hwcap_mask to be sparc32 specific.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the definitions of _DL_PLATFORMS_COUNT as those are not used
anymore after removal in elf/dl-cache.c:search_cache().
Note: On x86, we can also get rid of the definitions
HWCAP_PLATFORMS_START and HWCAP_PLATFORMS_COUNT.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the definitions of _DL_FIRST_PLATFORM as those were only used
in the _DL_HWCAP_PLATFORM definitions and in _dl_string_platform().
Both were removed.
Note: Removed on every architecture despite of powerpc, where
_dl_string_platform() is still used.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the definitions of _DL_HWCAP_PLATFORM as those are not used
anymore after removal in elf/dl-cache.c:search_cache().
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the platform strings in dl-procinfo.c where also
the implementation of _dl_string_platform() was removed.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Despite of powerpc where the returned integer is stored in tcb,
and the diagnostics output, there is no user anymore.
Thus this patch removes the diagnostics output and
_dl_string_platform for all other platforms.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Both defines are not used anymore. Those were only used for
_dl_string_hwcap(), which itself was removed with commit
ab40f20364
"elf: Remove _dl_string_hwcap"
Just clean up.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
As discussed at the patch review meeting
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Reviewed-by: Simon Chopin <simon.chopin@canonical.com>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the exp2m1 and exp10m1 functions (exp2(x)-1 and
exp10(x)-1, like expm1).
As with other such functions, these use type-generic templates that
could be replaced with faster and more accurate type-specific
implementations in future. Test inputs are copied from those for
expm1, plus some additions close to the overflow threshold (copied
from exp2 and exp10) and also some near the underflow threshold.
exp2m1 has the unusual property of having an input (M_MAX_EXP) where
whether the function overflows (under IEEE semantics) depends on the
rounding mode. Although these could reasonably be XFAILed in the
testsuite (as we do in some cases for arguments very close to a
function's overflow threshold when an error of a few ulps in the
implementation can result in the implementation not agreeing with an
ideal one on whether overflow takes place - the testsuite isn't smart
enough to handle this automatically), since these functions aren't
required to be correctly rounding, I made the implementation check for
and handle this case specially.
The Makefile ordering expected by lint-makefiles for the new functions
is a bit peculiar, but I implemented it in this patch so that the test
passes; I don't know why log2 also needed moving in one Makefile
variable setting when it didn't in my previous patches, but the
failure showed a different place was expected for that function as
well.
The powerpc64le IFUNC setup seems not to be as self-contained as one
might hope; it shouldn't be necessary to add IFUNCs for new functions
such as these simply to get them building, but without setting up
IFUNCs for the new functions, there were undefined references to
__GI___expm1f128 (that IFUNC machinery results in no such function
being defined, but doesn't stop include/math.h from doing the
redirection resulting in the exp2m1f128 and exp10m1f128
implementations expecting to call it).
Tested for x86_64 and x86, and with build-many-glibcs.py.
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the log10p1 functions (log10(1+x): like log1p, but for
base-10 logarithms).
This is directly analogous to the log2p1 implementation (except that
whereas log2p1 has a smaller underflow range than log1p, log10p1 has a
larger underflow range). The test inputs are copied from those for
log1p and log2p1, plus a few more inputs in that wider underflow
range.
Tested for x86_64 and x86, and with build-many-glibcs.py.
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the logp1 functions (aliases for log1p functions - the
name is intended to be more consistent with the new log2p1 and
log10p1, where clearly it would have been very confusing to name those
functions log21p and log101p). As aliases rather than new functions,
the content of this patch is somewhat different from those actually
adding new functions.
Tests are shared with log1p, so this patch *does* mechanically update
all affected libm-test-ulps files to expect the same errors for both
functions.
The vector versions of log1p on aarch64 and x86_64 are *not* updated
to have logp1 aliases (and thus there are no corresponding header,
tests, abilist or ulps changes for vector functions either). It would
be reasonable for such vector aliases and corresponding changes to
other files to be made separately. For now, the log1p tests instead
avoid testing logp1 in the vector case (a Makefile change is needed to
avoid problems with grep, used in generating the .c files for vector
function tests, matching more than one ALL_RM_TEST line in a file
testing multiple functions with the same inputs, when it assumes that
the .inc file only has a single such line).
Tested for x86_64 and x86, and with build-many-glibcs.py.
When we don't want to use non-temporal stores for memset, we set
`x86_memset_non_temporal_threshold` to SIZE_MAX.
The current code, however, we using `maximum_non_temporal_threshold`
as the upper bound which is `SIZE_MAX >> 4` so we ended up with a
value of `0`.
Fix is to just use `SIZE_MAX` as the upper bound for when setting the
tunable.
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Linux pinacolada 6.6.32-gentoo #1 SMP PREEMPT Sun Jun 9 14:18:17 CEST 2024 x86_64 Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz GenuineIntel GNU/Linux
32bit build for multilib environment
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
"ADDI sp, sp, 24" and "ADDI sp, sp, SZFCSREG" (SZFCSREG = 4) are
misaligning the stack: the ABI mandates a 16-byte alignment. Fix it
by changing the first one to "ADDI sp, sp, 32", and reuse the spare 4th
slot for saving fcsr.
Reported-by: Jinyang He <hejinyang@loongson.cn>
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Properly set libc_cv_have_x86_isa_level in shell for MINIMUM_X86_ISA_LEVEL
defined as
(__X86_ISA_V1 + __X86_ISA_V2 + __X86_ISA_V3 + __X86_ISA_V4)
Also set __X86_ISA_V2 to 1 for i386 if __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8
is defined. There are no changes in config.h nor in config.make on x86-64.
On i386, -march=x86-64-v2 with GCC generates
#define MINIMUM_X86_ISA_LEVEL 2
in config.h and
have-x86-isa-level = 2
in config.make. This fixes BZ #31883.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
The __stack_prot is used by Linux to make the stack executable if
a modules requires it. It is also marked as RELRO, which requires
to change the segment permission to RW to update it.
Also, there is no need to keep track of the flags: either the stack
will have the default permission of the ABI or should be change to
PROT_READ | PROT_WRITE | PROT_EXEC. The only additional flag,
PROT_GROWSDOWN or PROT_GROWSUP, is Linux only and can be deducted
from _STACK_GROWS_DOWN/_STACK_GROWS_UP.
Also, the check_consistency function was already removed some time
ago.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
On i386, set the default minimum ISA level to 0, not 1 (baseline which
includes SSE2). There are no changes in config.h nor in config.make on
x86-64. This fixes BZ #31867.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Tested-by: Ian Jordan <immoloism@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
In commit 46b5e98ef6 ("x86: Add seperate non-temporal tunable for
memset") a tunable threshold for enabling non-temporal memset was added,
but only for Intel hardware.
Since that commit, new benchmark results suggest that non-temporal
memset is beneficial on AMD, as well, so allow this tunable to be set
for AMD.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
which has been updated to include data using different stategies for
large memset on AMD Zen2, Zen3, and Zen4.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
As of Linux kernel 6.9, some ioctls and a parameters structure have been
introduced which allow user programs to control whether a particular
epoll context will busy poll.
Update the headers to include these for the convenience of user apps.
The ioctls were added in Linux kernel 6.9 commit 18e2bf0edf4dd
("eventpoll: Add epoll ioctl for epoll_params") [1] to
include/uapi/linux/eventpoll.h.
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?h=v6.9&id=18e2bf0edf4dd
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Left shift of ki is undefined when ki<0, copy the logic from exp,
which uses unsigned arithmetics, to fix it.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The tuning for non-temporal stores for memset vs memcpy is not always
the same. This includes both the exact value and whether non-temporal
stores are profitable at all for a given arch.
This patch add `x86_memset_non_temporal_threshold`. Currently we
disable non-temporal stores for non Intel vendors as the only
benchmarks showing its benefit have been on Intel hardware.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Previously we use `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
For data using different stategies for large memset on ICX and SKX.
Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
size, non-temporal writes can avoid data in their RFO requests saving
memory bandwidth.
All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.
The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:
Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2 : 0.986
Geometric Mean across the suite New / Old SSE2 : 0.979
Most of the cases are essentially unchanged, this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.
The results on the memset-large benchmark suite on TGL-client for N=20
runs:
Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2 : 0.928
Geometric Mean across the suite New / Old SSE2 : 0.924
So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.
Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
A space is added before the left bracket of the x86_64 elf_machine_rela
function, in order to harmonize with the rest of the implementation of
the function and to make it easier to retrieve the function. The lines
where the function definition is located has been re-indented, as well
as its left curly bracket placed in the correct position.
Signed-off-by: Xin Wang <yw987194828@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
pidfd_getpid.c has
/* Ignore invalid large values. */
if (INT_MULTIPLY_WRAPV (10, n, &n)
|| INT_ADD_WRAPV (n, *l++ - '0', &n))
return -1;
For GCC older than GCC 7, INT_ADD_WRAPV(a, b, r) is defined as
_GL_INT_OP_WRAPV (a, b, r, +, _GL_INT_ADD_RANGE_OVERFLOW)
and *l++ - '0' is evaluated twice. Fix BZ #31798 by moving "l++" out of
the if statement. Tested with GCC 6.4 and GCC 14.1.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch updates the kernel version in the tests tst-mman-consts.py
and tst-mount-consts.py to 6.9. (There are no new constants covered
by these tests in 6.9 that need any other header changes;
tst-pidfd-consts.py was updated separately along with adding new
constants relevant to that test.)
Tested with build-many-glibcs.py.
The libc.a for alpha, s390, and sparcv9 does not provide
copysignf64x, copysignf128, frexpf64x, frexpf128, modff64x, and
modff128.
Checked with a static build for the affected ABIs.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Both the generic and POWER6 versions provide definitions of the
symbol, which are already provided by the ifunc resolver.
Checked on powerpc-linux-gnu-power4.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
For powerpc64 the generic version provides a weak definition of
strchrnul, which are already provided by the ifunc resolver. The
powerpc32 version is slight different, where for static case there
is no iFUNC support.
The strncasecmp_l is provided ifunc resolver.
Checked on powerpc-linux-gnu-power4 and powerpc64-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The generic version provides weak definitions of memchr/strlen,
which are already provided by the ifunc resolvers.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Linux 6.9 adds some more PIDFD_* constants. Add them to glibc's
sys/pidfd.h, including updating comments that said FLAGS was reserved
and must be 0, along with updating tst-pidfd-consts.py.
Tested with build-many-glibcs.py.
Don't provide __nexttowardf128_do_not_use, nexttowardf128_do_not_use,
finitef128_do_not_use, isinff128_do_not_use and isnanf128_do_not_use.
This fixes BZ #31757.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Some static implementation of float128 routines might call __isnanf128,
which is not provided by the static object.
Checked on x86_64-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The commit 08ddd26814 removed the static exp10 on i386 and m68k with an
empty w_exp10.c (required for the ABIs that uses the newly
implementation). This patch fixes by adding the required symbols on the
arch-specific w_exp{f}_compat.c implementation.
Checked on i686-linux-gnu and with a build for m68k-linux-gnu.
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
The commit 16439f419b removed the static fmod/fmodf on i386 and m68k
with and empty w_fmod.c (required for the ABIs that uses the newly
implementation). This patch fixes by adding the required symbols on
the arch-specific w_fmod{f}_compat.c implementation.
To statically build fmod fails on some ABI (alpha, s390, sparc) because
it does not export the ldexpf128, this is also fixed by this patch.
Checked on i686-linux-gnu and with a build for m68k-linux-gnu.
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
clone3 isn't exported from glibc and is hidden in libc.so. Fix BZ #31770
by removing clone3 alias.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Plus a small amount of moving includes around in order to be able to
remove duplicate definition of asuint64.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Fix BZ #31768 by not defining stpncpy alias when used in IFUNC.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the log2p1 functions (log2(1+x): like log1p, but for
base-2 logarithms).
This illustrates the intended structure of implementations of all
these function families: define them initially with a type-generic
template implementation. If someone wishes to add type-specific
implementations, it is likely such implementations can be both faster
and more accurate than the type-generic one and can then override it
for types for which they are implemented (adding benchmarks would be
desirable in such cases to demonstrate that a new implementation is
indeed faster).
The test inputs are copied from those for log1p. Note that these
changes make gen-auto-libm-tests depend on MPFR 4.2 (or later).
The bulk of the changes are fairly generic for any such new function.
(sysdeps/powerpc/nofpu/Makefile only needs changing for those
type-generic templates that use fabs.)
Tested for x86_64 and x86, and with build-many-glibcs.py.
Linux 6.9 has no new syscalls. Update the version number in
syscall-names.list to reflect that it is still current for 6.9.
Tested with build-many-glibcs.py.
Fix BZ #31755 by renaming the internal function procutils_read_file to
__libc_procutils_read_file.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Fix BZ #31759 by not defining nearbyint aliases when used in IFUNC.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This supports common coding patterns. The GCC C front end before
version 7 rejects the may_alias attribute on a struct definition
if it was not present in a previous forward declaration, so this
attribute can only be conditionally applied.
This implements the spirit of the change in Austin Group issue 1641.
Suggested-by: Marek Polacek <polacek@redhat.com>
Suggested-by: Jakub Jelinek <jakub@redhat.com>
Reviewed-by: Sam James <sam@gentoo.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This patch ensures that $libc_cv_cc_submachine, which is set from
"--with-cpu", overrides $CFLAGS for configure time tests.
Suggested-by: Peter Bergner <bergner@linux.ibm.com>
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
This is mostly based on AArch64 and RISC-V implementation.
Add R_LARCH_TLS_DESC32 and R_LARCH_TLS_DESC64 relocations.
For _dl_tlsdesc_dynamic function slow path, temporarily save and restore
all vector registers.
Previously many routines used * to load from vector types stored
in the data table. This is emitted as ldr, which byte-swaps the
entire vector register, and causes bugs for big-endian when not
all lanes contain the same value. When a vector is to be used
this way, it has been replaced with an array and the load with an
explicit ld1 intrinsic, which byte-swaps only within lanes.
As well, many routines previously used non-standard GCC syntax
for vector operations such as indexing into vectors types with []
and assembling vectors using {}. This syntax should not be mixed
with ACLE, as the former does not respect endianness whereas the
latter does. Such examples have been replaced with, for instance,
vcombine_* and vgetq_lane* intrinsics. Helpers which only use the
GCC syntax, such as the v_call helpers, do not need changing as
they do not use intrinsics.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
The e68b1151f7 commit changed the
__fesetround_inline_nocheck implementation to use mffscrni
(through __fe_mffscrn) instead of mtfsfi. For generic powerpc
ceil/floor/trunc, the function is supposed to disable the
floating-point inexact exception enable bit, however mffscrni
does not change any exception enable bits.
This patch fixes by reverting the optimization for the
__fesetround_inline_nocheck.
Checked on powerpc-linux-gnu.
Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com>
This code expects the WCSCAT preprocessor macro to be predefined in case
the evex implementation of the function should be defined with a name
different from __wcsncat_evex. However, when glibc is built for
x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines
WCSNCAT variable instead of WCSCAT to build it as wcsncat. Rename the
variable to WCSNCAT, as it is actually a better naming choice for the
variable in this case.
Reported-by: Kenton Groombridge
Link: https://bugs.gentoo.org/921945
Fixes: 64b8b6516b ("x86: Add evex optimized functions for the wchar_t strcpy family")
Signed-off-by: Gabi Falk <gabifalk@gmx.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>