One can be very unlucky to call time_now first just before a second switch,
and mach_msg sleep just a bit more enough for the second time_now call to
count one second too many (or even more if scheduling is really unlucky).
So we have to protect against returning a bogus negative value in such case.
This patch modifies the current Power9 implementation of strcpy and
stpcpy to optimize it for Power9 and Power10.
No new Power10 instructions are used, so the original Power9 strcpy
is modified instead of creating a new implementation for Power10.
The changes also affect stpcpy, which uses the same implementation
with some additional code before returning.
Improvements compared to the old Power9 version:
Use simple comparisons for the first ~512 bytes:
The main loop is good for long strings, but comparing 16B each time is
better for shorter strings. After aligning the address to 16 bytes, we
unroll the loop four times, checking 128 bytes each time. There may be
some overlap with the main loop for unaligned strings, but it is better
for shorter strings.
Loop with 64 bytes for longer bytes:
Use 4 consecutive lxv/stxv instructions.
Showed an average improvement of 13%.
Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com>
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
The current racy approach is to enable asynchronous cancellation
before making the syscall and restore the previous cancellation
type once the syscall returns, and check if cancellation has happen
during the cancellation entrypoint.
As described in BZ#12683, this approach shows 2 problems:
1. Cancellation can act after the syscall has returned from the
kernel, but before userspace saves the return value. It might
result in a resource leak if the syscall allocated a resource or a
side effect (partial read/write), and there is no way to program
handle it with cancellation handlers.
2. If a signal is handled while the thread is blocked at a cancellable
syscall, the entire signal handler runs with asynchronous
cancellation enabled. This can lead to issues if the signal
handler call functions which are async-signal-safe but not
async-cancel-safe.
For the cancellation to work correctly, there are 5 points at which the
cancellation signal could arrive:
[ ... )[ ... )[ syscall ]( ...
1 2 3 4 5
1. Before initial testcancel, e.g. [*... testcancel)
2. Between testcancel and syscall start, e.g. [testcancel...syscall start)
3. While syscall is blocked and no side effects have yet taken
place, e.g. [ syscall ]
4. Same as 3 but with side-effects having occurred (e.g. a partial
read or write).
5. After syscall end e.g. (syscall end...*]
And libc wants to act on cancellation in cases 1, 2, and 3 but not
in cases 4 or 5. For the 4 and 5 cases, the cancellation will eventually
happen in the next cancellable entrypoint without any further external
event.
The proposed solution for each case is:
1. Do a conditional branch based on whether the thread has received
a cancellation request;
2. It can be caught by the signal handler determining that the saved
program counter (from the ucontext_t) is in some address range
beginning just before the "testcancel" and ending with the
syscall instruction.
3. SIGCANCEL can be caught by the signal handler and determine that
the saved program counter (from the ucontext_t) is in the address
range beginning just before "testcancel" and ending with the first
uninterruptable (via a signal) syscall instruction that enters the
kernel.
4. In this case, except for certain syscalls that ALWAYS fail with
EINTR even for non-interrupting signals, the kernel will reset
the program counter to point at the syscall instruction during
signal handling, so that the syscall is restarted when the signal
handler returns. So, from the signal handler's standpoint, this
looks the same as case 2, and thus it's taken care of.
5. For syscalls with side-effects, the kernel cannot restart the
syscall; when it's interrupted by a signal, the kernel must cause
the syscall to return with whatever partial result is obtained
(e.g. partial read or write).
6. The saved program counter points just after the syscall
instruction, so the signal handler won't act on cancellation.
This is similar to 4. since the program counter is past the syscall
instruction.
So The proposed fixes are:
1. Remove the enable_asynccancel/disable_asynccancel function usage in
cancellable syscall definition and instead make them call a common
symbol that will check if cancellation is enabled (__syscall_cancel
at nptl/cancellation.c), call the arch-specific cancellable
entry-point (__syscall_cancel_arch), and cancel the thread when
required.
2. Provide an arch-specific generic system call wrapper function
that contains global markers. These markers will be used in
SIGCANCEL signal handler to check if the interruption has been
called in a valid syscall and if the syscalls has side-effects.
A reference implementation sysdeps/unix/sysv/linux/syscall_cancel.c
is provided. However, the markers may not be set on correct
expected places depending on how INTERNAL_SYSCALL_NCS is
implemented by the architecture. It is expected that all
architectures add an arch-specific implementation.
3. Rewrite SIGCANCEL asynchronous handler to check for both canceling
type and if current IP from signal handler falls between the global
markers and act accordingly.
4. Adjust libc code to replace LIBC_CANCEL_ASYNC/LIBC_CANCEL_RESET to
use the appropriate cancelable syscalls.
5. Adjust 'lowlevellock-futex.h' arch-specific implementations to
provide cancelable futex calls.
Some architectures require specific support on syscall handling:
* On i386 the syscall cancel bridge needs to use the old int80
instruction because the optimized vDSO symbol the resulting PC value
for an interrupted syscall points to an address outside the expected
markers in __syscall_cancel_arch. It has been discussed in LKML [1]
on how kernel could help userland to accomplish it, but afaik
discussion has stalled.
Also, sysenter should not be used directly by libc since its calling
convention is set by the kernel depending of the underlying x86 chip
(check kernel commit 30bfa7b3488bfb1bb75c9f50a5fcac1832970c60).
* mips o32 is the only kABI that requires 7 argument syscall, and to
avoid add a requirement on all architectures to support it, mips
support is added with extra internal defines.
Checked on aarch64-linux-gnu, arm-linux-gnueabihf, powerpc-linux-gnu,
powerpc64-linux-gnu, powerpc64le-linux-gnu, i686-linux-gnu, and
x86_64-linux-gnu.
[1] https://lkml.org/lkml/2016/3/8/1105
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
With ia64 removal, the function descriptor supports is only used
by HPPA and new architectures do not seem leaning towards this
design.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
This commit uses a common implementation 'strnlen-evex-base.S' for both
'strnlen-evex' and 'strnlen-evex512'
This patch serves both to reduce the number of implementations, and it also does some small optimizations that benefit strnlen-evex and strnlen-evex512.
All tests pass on x86.
Benchmarks were taken on SKX.
https://www.intel.com/content/www/us/en/products/sku/123613/intel-core-i97900x-xseries-processor-13-75m-cache-up-to-4-30-ghz/specifications.html
Geometric mean for strnlen-evex over all benchmarks (N=10) was (new/old) 0.881
Geometric mean for strnlen-evex512 over all benchmarks (N=10) was (new/old) 0.953
Code Size Changes:
strnlen-evex : +31 bytes
strnlen-evex512 : +156 bytes
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Fix an issue with commit 8f4632deb3 ("Linux: rseq registration tests")
and prevent testing from being run in the process of the test driver
itself rather than just the test child where one has been forked. The
problem here is the unguarded use of a destructor to call a part of the
testing. The destructor function, 'do_rseq_destructor_test' is called
implicitly at program completion, however because it is associated with
the executable itself rather than an individual process, it is called
both in the test child *and* in the test driver itself.
Prevent this from happening by providing a guard variable that only
enables test invocation from 'do_rseq_destructor_test' in the process
that has first run 'do_test'. Consequently extra testing is invoked
from 'do_rseq_destructor_test' only once and in the correct process,
regardless of the use or the lack of of the '--direct' option. Where
called in the controlling test driver process that has neved called
'do_test' the destructor function silently returns right away without
taking any further actions, letting the test driver fail gracefully
where applicable.
This arrangement prevents 'tst-rseq-nptl' from ever causing testing to
hang forever and never complete, such as currently happening with the
'mips-linux-gnu' (o32 ABI) target.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Previously if the setaffinity wrapper failed the rest of the subtest
would not execute and the current subtest would be reported as passing.
Now if the setaffinity wrapper fails the subtest is correctly reported
as faling. Tested manually by changing the conditions of the affinity
call including setting size to zero, or checking the wrong condition.
No regressions on x86_64.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
The goal of this flag is to allow targets which don't prefer/have ERMS
to still access the non-temporal memset implementation.
There are 4 cases for tuning memset:
1) `Avoid_STOSB && Avoid_Non_Temporal_Memset`
- Memset with temporal stores
2) `Avoid_STOSB && !Avoid_Non_Temporal_Memset`
- Memset with temporal/non-temporal stores. Non-temporal path
goes through `rep stosb` path. We accomplish this by setting
`x86_rep_stosb_threshold` to
`x86_memset_non_temporal_threshold`.
3) `!Avoid_STOSB && Avoid_Non_Temporal_Memset`
- Memset with temporal stores/`rep stosb`
3) `!Avoid_STOSB && !Avoid_Non_Temporal_Memset`
- Memset with temporal stores/`rep stosb`/non-temporal stores.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This is just a refactor and there should be no behavioral change from
this commit.
The goal is to make `Avoid_Non_Temporal_Memset` a more universal knob
for controlling whether we use non-temporal memset rather than having
extra logic based on vendor.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Issue was we were expecting not matches with CHAR before the start of
the string in the page cross case.
The check code in the page cross case:
```
and $0xffffffffffffffc0,%rax
vmovdqa64 (%rax),%zmm17
vpcmpneqb %zmm17,%zmm16,%k1
vptestmb %zmm17,%zmm17,%k0{%k1}
kmovq %k0,%rax
inc %rax
shr %cl,%rax
je L(continue)
```
expects that all characters that neither match null nor CHAR will be
1s in `rax` prior to the `inc`. Then the `inc` will overflow all of
the 1s where no relevant match was found.
This is incorrect in the page-cross case, as the
`vmovdqa64 (%rax),%zmm17` loads from before the start of the input
string.
If there are matches with CHAR before the start of the string, `rax`
won't properly overflow.
The fix is quite simple. Just replace:
```
inc %rax
shr %cl,%rax
```
With:
```
sar %cl,%rax
inc %rax
```
The arithmetic shift will clear any matches prior to the start of the
string while maintaining the signbit so the 1s can properly overflow
to zero in the case of no matches.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
In _dl_tlsdesc_dynamic, there are three 'addi.d sp, sp, -size'
instructions to allocate stack size for Float/LSX/LASX registers.
Every 'addi.d sp, sp, -size' needs a cfi_adjust_cfa_offset because
of sp is used to compute CFA. But only one 'addi.d sp, sp, -size'
will be run according to HWCAP value. And all cfi_adjust_cfa_offset
will be executed in stack unwinding, it result in incorrect CFA.
Change _dl_tlsdesc_dynamic to _dl_tlsdesc_dynamic,
_dl_tlsdesc_dynamic_lsx and _dl_tlsdesc_dynamic_lasx.
Conflicting cfi instructions can be distributed to the three functions.
And cfi instructions can correspond to stack down instructions.
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency. On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
These functions are exp10m1, exp2m1, log10p1, log2p1.
Also regenerated ulps on x86_64.
For each format, there are 4 values, one for each rounding mode.
(For the intel96 format, there are 8 values, 4 for Intel hardware,
and 4 for AMD hardware. However, regen-ulps was only run on Intel.
It should be run in a separate patch on a AMD x86_64.)
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The old code used l_init_called as an indicator for whether TLS
initialization was complete. However, it is possible that
TLS for an object is initialized, written to, and then dlopen
for this object is called again, and l_init_called is not true at
this point. Previously, this resulted in TLS being initialized
twice, discarding any interim writes (technically introducing a
use-after-free bug even).
This commit introduces an explicit per-object flag, l_tls_in_slotinfo.
It indicates whether _dl_add_to_slotinfo has been called for this
object. This flag is used to avoid double-initialization of TLS.
In update_tls_slotinfo, the first_static_tls micro-optimization
is removed because preserving the initalization flag for subsequent
use by the second loop for static TLS is a bit complicated, and
another per-object flag does not seem to be worth it. Furthermore,
the l_init_called flag is dropped from the second loop (for static
TLS initialization) because l_need_tls_init on its own prevents
double-initialization.
The remaining l_init_called usage in resize_scopes and update_scopes
is just an optimization due to the use of scope_has_map, so it is
not changed in this commit.
The isupper check ensures that libc.so.6 is TLS is not reverted.
Such a revert happens if l_need_tls_init is not cleared in
_dl_allocate_tls_init for the main_thread case, now that
l_init_called is not checked anymore in update_tls_slotinfo
in elf/dl-open.c.
Reported-by: Jonathon Anderson <janderson@rice.edu>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
5476f8cd2e ("htl: move pthread_self info libc.") and
9dfa256216 ("htl: move pthread_equal into libc") to
1dc0bc8f07 ("htl: move pthread_attr_setdetachstate into libc")
moved some pthread_ symbols from libpthread.so to libc.so, but missed
adding the compat version like 5476f8cd2e ("htl: move pthread_self
info libc.") did: libc already had these symbols as forwards,
but versioned GLIBC_2.21, while the symbols in libpthread.so were
versioned GLIBC_2.12.
To fix running executables built before this, we thus have to add the
GLIBC_2.12 version, otherwise execution fails with e.g.
/usr/lib/i386-gnu/libglib-2.0.so: symbol lookup error: /usr/lib/i386-gnu/libglib-2.0.so: undefined symbol: pthread_attr_setinheritsched, version GLIBC_2.12
Add tests for MREMAP_MAYMOVE and MREMAP_FIXED. On Linux, also test
MREMAP_DONTUNMAP.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Update the mremap C implementation to support the optional argument for
MREMAP_DONTUNMAP added in Linux 5.7 since it may not always be correct
to implement a variadic function as a non-variadic function on all Linux
targets. Return MAP_FAILED and set errno to EINVAL for unknown flag bits.
This fixes BZ #31968.
Note: A test must be added when a new flag bit is introduced.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
It was added by commit c62b758bae6af16 as a way for userspace to
check if two file descriptors refer to the same struct file.
Checked on aarch64-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
This patch updates the kernel version in the tests tst-mman-consts.py,
tst-mount-consts.py, and tst-pidfd-consts.py to 6.9.
There are no new constants covered by these tests in 6.10.
Tested with build-many-glibcs.py.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Linux 6.10 changes for syscall are:
* mseal for all architectures.
* map_shadow_stack for x32.
* Replace sync_file_range with sync_file_range2 for csky (which
fixes a broken sync_file_range usage).
Update syscall-names.list and regenerate the arch-syscall.h headers
with build-many-glibcs.py update-syscalls.
Tested with build-many-glibcs.py.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Remove local FAIL macro in favor to FAIL_EXIT1 from <support/check.h>,
which provides equivalent reporting, with the name of the file and the
line number within of the failure site additionally included. Remove
FAIL_ERR altogether and include ": %m" explicitly with the format string
supplied to FAIL_EXIT1 as there seems little value to have a separate
macro just for this.
Reviewed-by: DJ Delorie <dj@redhat.com>
Use RXX_LP in RTLD_START_ENABLE_X86_FEATURES. Support shadow stack during
startup for Linux 6.10:
commit 2883f01ec37dd8668e7222dfdb5980c86fdfe277
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 15 07:04:33 2024 -0700
x86/shstk: Enable shadow stacks for x32
1. Add shadow stack support to x32 signal.
2. Use the 64-bit map_shadow_stack syscall for x32.
3. Set up shadow stack for x32.
Add the map_shadow_stack system call to <fixup-asm-unistd.h> and regenerate
arch-syscall.h. Tested on Intel Tiger Lake with CET enabled x32. There
are no regressions with CET enabled x86-64. There are no changes in CET
enabled x86-64 _dl_start_user.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Remove sysdeps/x86_64/x32/dl-machine.h by folding x32 ARCH_LA_PLTENTER,
ARCH_LA_PLTEXIT and RTLD_START into sysdeps/x86_64/dl-machine.h. There
are no regressions on x86-64 nor x32. There are no changes in x86-64
_dl_start_user. On x32, _dl_start_user changes are
<_dl_start_user>:
mov %eax,%r12d
+ mov %esp,%r13d
mov (%rsp),%edx
mov %edx,%esi
- mov %esp,%r13d
and $0xfffffff0,%esp
mov 0x0(%rip),%edi # <_dl_start_user+0x14>
lea 0x8(%r13,%rdx,4),%ecx
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>