Minor clean-up: we need to change this part in a following patch, so clean
it up now to avoid making the same change twice.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The sanity tests were meant to ensure that the default implementation
was only being built without multiarch with the exception of the
multiarch/rtld-*.S files.
The code used IS_IN (rtld) to check if the build was for a
multiarch/rtld-*.S file which is incorrect as IS_IN (rtld) is set for
the non-multiarch build as well.
Most of these don't really matter as there was no dirty upper state
but we should generally avoid stray SSE when it's not needed.
The one case that really matters is in svml_d_tanh4_core_avx2.S:
blendvps %xmm0, %xmm8, %xmm7
When there was a dirty upper state.
Tested on x86_64-linux
1. Refactor files so that all implementations are in the multiarch
directory.
- Essentially moved sse2 {raw|w}memchr.S implementation to
multiarch/{raw|w}memchr-sse2.S
- The non-multiarch {raw|w}memchr.S file now only includes one of
the implementations in the multiarch directory based on the
compiled ISA level (only used for non-multiarch builds.
Otherwise we go through the ifunc selector).
2. Add ISA level build guards to different implementations.
- I.e memchr-avx2.S which is ISA level 3 will only build if
compiled ISA level <= 3. Otherwise there is no reason to include
it as we will always use one of the ISA level 4
implementations (memchr-evex{-rtm}.S).
3. Add new multiarch/rtld-{raw}memchr.S that just include the
non-multiarch {raw}memchr.S which will in turn select the best
implementation based on the compiled ISA level.
4. Refactor the ifunc selector and ifunc implementation list to use
the ISA level aware wrapper macros that allow functions below the
compiled ISA level (with a guaranteed replacement) to be skipped.
- Guaranteed replacement essentially means that for any ISA level
build there must be a function that the baseline of the ISA
supports. So for {raw|w}memchr.S since there is no ISA level 2
function, the ISA level 2 build still includes the ISA level
1 (sse2) function. Once we reach the ISA level 3 build, however,
{raw|w}memchr-avx2{-rtm}.S will always be sufficient so the ISA
level 1 implementation ({raw|w}memchr-sse2.S) will not be built.
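The "guaranteed replacement" rule above can be sketched in plain C. This
is only an illustration of the selection logic, not glibc code: the helper
name should_build and the way available ISA levels are passed in are
assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not glibc code: decide whether an implementation
   written for ISA level IMPL_LEVEL needs to be built when the library is
   compiled for BUILD_LEVEL.  AVAILABLE lists the ISA levels for which an
   implementation of the function exists.  */
bool
should_build (int impl_level, int build_level,
              const int *available, size_t n_available)
{
  /* The guaranteed replacement is the highest available level that the
     build baseline is known to support.  */
  int replacement = 0;
  for (size_t i = 0; i < n_available; i++)
    if (available[i] <= build_level && available[i] > replacement)
      replacement = available[i];

  /* Anything below the guaranteed replacement can be skipped.  */
  return impl_level >= replacement;
}
```

With levels {1, 3, 4} (sse2, avx2, evex), an ISA level 2 build still keeps
the level 1 function, while an ISA level 3 build drops it.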
Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}
And m32 with and without multiarch.
1. Factor out some of the ISA level defines in isa-level.c to
standalone header isa-level.h
2. Add new headers with ISA level dependent macros for handling
ifuncs.
Note, this file does not change any code.
Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}
And m32 with and without multiarch.
Let's use LC_ALL=C as we do elsewhere for consistency.
Tested on s390x-ibm-linux-gnu.
See: 72bd208846
Signed-off-by: Sam James <sam@gentoo.org>
Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
We already check for it in root configure.ac with AC_CHECK_TOOL. Let's
use the result.
Tested on s390x-ibm-linux-gnu.
Signed-off-by: Sam James <sam@gentoo.org>
Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
commit c22eb807b0
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Thu Jun 16 15:07:12 2022 -0700
x86: Rename generic functions with unique postfix for clarity
Changed the names of the strspn-c, strcspn-c, and strpbrk-c files
in a general refactor. It didn't change the include paths for the
i386 files breaking the i386 build. This commit fixes that.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
No functions are changed. It just renames generic implementations from
'{func}_sse2' to '{func}_generic'. This is just because the postfix
"_sse2" was overloaded and was used for files that had hand-optimized
sse2 assembly implementations and files that just redirected back
to the generic implementation.
Full xcheck passed on x86_64.
The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only
needs to handle RELATIVE, GLOB_DAT, and JUMP_SLOT. RELATIVE has been
handled (by _ELF_DYNAMIC_DO_RELOC due to DT_RELACOUNT, or RELR), so the
switch statement only needs to handle GLOB_DAT and JUMP_SLOT.
We can drop these `#if[n]def RTLD_BOOTSTRAP` and add a large
`# ifndef RTLD_BOOTSTRAP` instead.
The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only
needs to handle RELATIVE, GLOB_DAT, and JUMP_SLOT.
TLSDESC/TLS_DTPMOD/TLS_DTPREL handling can be removed. Remove
`case AARCH64_R(RELATIVE)` as well as elf_machine_rela has checked it.
Tested on aarch64-linux-gnu.
The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only
needs to handle RELATIVE, GLOB_DAT, and the symbolic relocation type
(R_RISCV_{32,64}). NONE and IRELATIVE can be removed.
The code relies on ld.so having DT_RELACOUNT so that the RTLD_BOOTSTRAP
branch does not need to handle RELATIVE. Drop this minor size
optimization for clarity.
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
Previously was using `__x86_rep_movsb_threshold` and should
have been using `__x86_shared_non_temporal_threshold`.
2. Avoid reloading __x86_shared_non_temporal_threshold before
the L(large_memcpy_4x) bounds check.
3. Document the second bounds check for L(large_memcpy_4x)
more clearly.
The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.
The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.
The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
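A minimal sketch of the bounds in C, using the two values quoted above.
The macro and function names are hypothetical, and the sketch assumes the
shift multiplies the threshold by 16 (the direction in which overflow is
possible); it is not the actual tunable-handling code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the text above; the names are hypothetical.  */
#define NT_THRESHOLD_MIN ((size_t) 16448)   /* 0x4040 */
#define NT_THRESHOLD_MAX (SIZE_MAX / 16)
#define LOG_4X_MEMCPY_THRESH 4

/* Clamp a requested non-temporal threshold into the supported range.  */
size_t
clamp_non_temporal_threshold (size_t requested)
{
  if (requested < NT_THRESHOLD_MIN)
    return NT_THRESHOLD_MIN;
  if (requested > NT_THRESHOLD_MAX)
    return NT_THRESHOLD_MAX;
  return requested;
}

/* Scale the threshold by 16.  This cannot wrap around as long as the
   argument has been clamped to NT_THRESHOLD_MAX first.  */
size_t
large_memcpy_4x_bound (size_t threshold)
{
  return threshold << LOG_4X_MEMCPY_THRESH;
}
```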
If an executable has copy relocations for extern protected data, that
can only work if the library containing the definition is built with
assumptions (a) the compiler emits GOT-generating relocations (b) the
linker produces R_*_GLOB_DAT instead of R_*_RELATIVE. Otherwise the
library uses its own definition directly and the executable accesses a
stale copy. Note: the GOT relocations defeat the purpose of protected
visibility as an optimization, but allow rtld to make the executable and
library use the same copy when copy relocations are present, but it
turns out this never worked perfectly.
ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange semantics when both
a.so and b.so define protected var and the executable copy relocates
var: b.so accesses its own copy even with GLOB_DAT. The behavior change
comes from commit 62da1e3b00 (x86) and was then copied to nios2
(ae5eae7cfc) and arc (0e7d930c4c).
Without ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA, b.so accesses the copy
relocated data like a.so.
There is now a warning for copy relocation on protected symbol since
commit 7374c02b68. It's extremely
unlikely anyone relies on the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA
behavior, so let's remove it: this removes a check in the symbol lookup
code.
This has been missing since the ifuncs were added.
The performance of SSE4.2 is preferable to SSE2.
Measured on Tigerlake with N = 20 runs.
Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
Move the setting of `rep_movsb_stop_threshold` to after the tunables
have been collected so that the `rep_movsb_stop_threshold` (which
is used to redirect control flow to the non_temporal case) will
use any user value for `non_temporal_threshold` (set using
glibc.cpu.x86_non_temporal_threshold).
Refine commit 349b0441da:
1. Copy relocations for extern protected data do not work properly,
regardless of whether GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS is used.
It makes sense to produce a warning unconditionally.
2. Non-zero value of an undefined function symbol may break pointer
equality, but may be benign in many cases (many programs don't take the
address in the shared object then compare it with the address in the
executable). Reword the diagnostic to be clearer.
3. Remove the unneeded condition !(undef_map->l_1_needed &
GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS). If the executable does
not have GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS (can only
occur in error cases), the diagnostic should be emitted as well.
When the defining shared object has
GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS, report an error to apply
the intended enforcement.
Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC
redundant and fixes several targets that will write outside the array.
To avoid unnecessary large diffs, pass the maximum in the argument 'i' to
IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing
ones can be updated if desired.
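The idea of passing the maximum alongside the index can be sketched as
follows. The struct and function here are simplified, hypothetical
stand-ins written for illustration; they are not the real IFUNC_IMPL_ADD
macro or its data types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an ifunc implementation entry.  */
struct impl_entry
{
  const char *name;
  bool usable;
};

/* Append an entry only while there is room.  Returns the new count, which
   never exceeds MAX, so the caller's array is never written out of
   bounds.  */
size_t
impl_list_add (struct impl_entry *array, size_t count, size_t max,
               const char *name, bool usable)
{
  if (count >= max)
    return count;
  array[count].name = name;
  array[count].usable = usable;
  return count + 1;
}
```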
Passes buildmanyglibc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Optimizations are:
1. Reduce code size (-112 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Reduce rodata size (-4k+ rodata is shared with avx2).
Result is roughly a 15-16% speedup:
Function, New Time, Old Time, New / Old
_ZGVbN4v_tanhf, 3.158, 3.749, 0.842
Optimizations are:
1. Reduce code size (-81 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Reduce rodata size (-32 bytes).
Result is roughly a 17-18% speedup:
Function, New Time, Old Time, New / Old
_ZGVdN8v_tanhf, 1.977, 2.402, 0.823
tanhf-avx2 and tanhf-sse4 use the same data tables so we can save
over 4kb using a shared datatable. This does increase the memory
footprint of the sse4 version (as now all the targets are 32 bytes
instead of 16), but generally it seems worth the code size savings.
NB: This patch doesn't do anything itself, it is setup for future
patches.
Optimizations are:
1. Reduce code size (-67 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Reduce rodata usage (-448 bytes).
Result is roughly a 14% speedup:
Function, New Time, Old Time, New / Old
_ZGVeN16v_tanhf, 0.649, 0.752, 0.863
Improvements are:
1. Reduce code size (-62 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Reduce rodata usage (-16 bytes).
The throughput improvement is not significant as the port 0 bottleneck
is unavoidable.
Function, New Time, Old Time, New / Old
_ZGVbN4v_atanhf, 8.821, 8.903, 0.991
Improvements are:
1. Reduce code size (-60 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Shrink rodata usage (-32 bytes).
The throughput improvement is not that significant (3-5%) as the
port 0 bottleneck is unavoidable.
Function, New Time, Old Time, New / Old
_ZGVdN8v_atanhf, 2.799, 2.923, 0.958
Improvements are:
1. Reduce code size (-64 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Reduce rodata size ([-128, -188] bytes).
The throughput improvement is not significant as the port 0 bottleneck
is unavoidable.
Function, New Time, Old Time, New / Old
_ZGVeN16v_atanhf, 1.39, 1.408, 0.987
commit 6dcbb7d95d
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon Jun 6 21:11:33 2022 -0700
x86: Shrink code size of memchr-avx2.S
Changed how the page cross case aligned string (rdi) in
rawmemchr. This was incompatible with how
`L(cross_page_continue)` expected the pointer to be aligned and
would cause rawmemchr to read data that started before the
beginning of the string. What it would read was in valid memory
but could count CHAR matches resulting in an incorrect return
value.
This commit fixes that issue by essentially reverting the changes to
the L(page_cross) case as they didn't really matter.
Test cases added and all pass with the new code (and were confirmed
to fail with the old code).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Give fall-through path to `vzeroupper` and taken-path to `vzeroall`.
Generally even on machines with RTM the expectation is the
string-library functions will not be called in transactions.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This is not meant as a performance optimization. The previous code was
far too liberal in aligning targets and wasted code size unnecessarily.
The total code size saving is: 64 bytes
There are no non-negligible changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 1.000
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This is not meant as a performance optimization. The previous code was
far too liberal in aligning targets and wasted code size unnecessarily.
The total code size saving is: 59 bytes
There are no major changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 0.967
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
10-20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 15-45% speedup.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 263 bytes
Geometric Mean of all benchmarks New / Old: 0.755
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 35% speedup.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The new code:
1. prioritizes smaller lengths more.
2. optimizes target placement more carefully.
3. reuses logic more.
4. fixes up various inefficiencies in the logic.
The total code size saving is: 394 bytes
Geometric Mean of all benchmarks New / Old: 0.874
Regressions:
1. The page cross case is now colder, especially re-entry from the
page cross case if a match is not found in the first VEC
(roughly 50%). My general opinion with this patch is this is
acceptable given the "coldness" of this case (less than 4%) and
the general performance improvement in the other far more common
cases.
2. There are some regressions 5-15% for medium/large user-arg
lengths that have a match in the first VEC. This is because the
logic was rewritten to optimize finds in the first VEC if the
user-arg length is shorter (where we see roughly 20-50%
performance improvements). It is not always the case this is a
regression. My intuition is some frontend quirk is partially
explaining the data although I haven't been able to find the
root cause.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The RTM vzeroupper mitigation has no way of replacing an inline
vzeroupper that is not directly before a return.
This can be useful when hoisting a vzeroupper to save code size
for example:
```
L(foo):
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
VZEROUPPER_RETURN
L(bar):
xorl %eax, %eax
VZEROUPPER_RETURN
```
Can become:
```
L(foo):
COND_VZEROUPPER
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
ret
L(bar):
xorl %eax, %eax
ret
```
This code does not change any existing functionality.
There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This patch does not touch any existing code and is only meant to be a
tool for future patches so that simple source files can more easily be
maintained to target multiple VEC classes.
There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
__strncpy_power9 initializes VR 18 with zeroes to be used throughout the
code, including when zero-padding the destination string. However, the
v18 reference was mistakenly being used for stxv and stxvl, which take a
VSX vector as operand. The code ended up using the uninitialized VSR 18
register by mistake.
Both occurrences have been changed to use the proper VSX number for VR 18
(i.e. VSR 50).
Tested on powerpc, powerpc64 and powerpc64le.
Signed-off-by: Kewen Lin <linkw@gcc.gnu.org>
Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE
vectors which improves the random memcpy benchmark significantly.
Cleanup the memcpy and memmove ifunc selectors.
Adding a 512-bit EVEX version of strstr. The algorithm works as follows:
(1) We spend a few cycles at the beginning to peek into the needle. We
locate an edge in the needle (first occurrence of 2 consecutive distinct
characters) and also store the first 64-bytes into a zmm register.
(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary which can cause a seg fault.
(3) If an edge is found in the haystack we first compare the first
64-bytes of the needle (already stored in a zmm register) before we
proceed with a full string compare performed byte by byte.
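The three steps can be illustrated with a scalar C sketch. This is not
the AVX-512 implementation: it checks one candidate position at a time
instead of one cache line at a time, and the names find_edge and
edge_strstr are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index of the first position I where needle[I] != needle[I + 1], or 0
   when the needle has no such pair (length < 2 or one repeated
   character).  */
size_t
find_edge (const char *needle, size_t len)
{
  for (size_t i = 0; i + 1 < len; i++)
    if (needle[i] != needle[i + 1])
      return i;
  return 0;
}

/* Scalar sketch: use the two edge bytes as a cheap filter, then verify
   the whole needle at the position they imply.  */
const char *
edge_strstr (const char *haystack, const char *needle)
{
  size_t nlen = strlen (needle);
  size_t hlen = strlen (haystack);
  if (nlen == 0)
    return haystack;
  if (nlen > hlen)
    return NULL;

  size_t edge = find_edge (needle, nlen);
  for (size_t pos = 0; pos + nlen <= hlen; pos++)
    {
      /* First byte of the edge pair.  */
      if (haystack[pos + edge] != needle[edge])
        continue;
      /* Second byte of the pair, if the needle is long enough.  */
      if (edge + 1 < nlen && haystack[pos + edge + 1] != needle[edge + 1])
        continue;
      /* Full compare only for candidates that passed the filter.  */
      if (memcmp (haystack + pos, needle, nlen) == 0)
        return haystack + pos;
    }
  return NULL;
}
```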
Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)
Geometric mean of all benchmarks: new / old = 0.66
Difficult skiptable(0) : new / old = 0.02
Difficult skiptable(1) : new / old = 0.01
Difficult 2-way : new / old = 0.25
Difficult testing first 2 : new / old = 1.26
Difficult skiptable(0) : new / old = 0.05
Difficult skiptable(1) : new / old = 0.06
Difficult 2-way : new / old = 0.26
Difficult testing first 2 : new / old = 1.05
Difficult skiptable(0) : new / old = 0.42
Difficult skiptable(1) : new / old = 0.24
Difficult 2-way : new / old = 0.21
Difficult testing first 2 : new / old = 1.04
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Newer versions of GNU grep (after grep 3.7, not inclusive) will warn on
'egrep' and 'fgrep' invocations.
Convert usages within the tree to their expanded non-aliased counterparts
to avoid irritating warnings during ./configure and the test suite.
Signed-off-by: Sam James <sam@gentoo.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Added in Linux 5.15 (884a7e5964e06ed93c7771c0d7cf19c09a8946f1), the new
syscall allows a caller to free the memory of a dying target process.
Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
It was added on Linux 5.10 (ecb8ac8b1f146915aa6b96449b66dd48984caacc)
with the same functionality as madvise but using a pidfd of the target
process.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This definition is only used as a fallback with old kernel headers.
The change follows kernel commit bfdf4e6208051ed7165b2e92035b4bf11
("rseq: Remove broken uapi field layout on 32-bit little endian").
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
LoongArch is going to be the first architecture supported by Linux that
has neither fstat* nor newfstatat [1], instead exclusively relying on
statx. So in fstatat64's implementation, we need to also enable statx
usage if neither fstatat64 nor newfstatat is present, to prepare for
this new case of kernel ABI.
[1]: https://lore.kernel.org/all/20220518092619.1269111-1-chenhuacai@loongson.cn/
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Linux 5.18 adds a constant MADV_DONTNEED_LOCKED (defined in multiple
header files, but with the same value on all architectures). Add this
constant to bits/mman-linux.h.
Tested for x86_64.
Linux 5.18 defines a new AArch64 HWCAP value HWCAP2_MTE3; add it to
glibc's sysdeps/unix/sysv/linux/aarch64/bits/hwcap.h.
Tested with build-many-glibcs.py for aarch64-linux-gnu.
The compiler may substitute calls to sin or cos with calls to sincos, thus
we should have the same optimized implementations for sincos. The
optimized implementations may produce slightly different results, so this
also makes sure that the sincos call agrees with the sin and cos calls.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on sparc64-linux-gnu and sparcv9-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on s390x-linux-gnu and s390-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0.
The startup code is changed to read the _dl_argc and _dl_argv values,
and envp is calculated from argc and argv.
Checked on ia64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Unlike other architectures, hppa creates an unrelated stack frame, so
the ld.so argc/argv adjustments done by ad43cac44a are not applied to
the argc/argv saved and restored by _dl_start_user.
Load _dl_argc and _dl_argv directly instead of adjusting them
using the _dl_skip_args value.
Checked on hppa-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. It makes the fixup_stack branch unused.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. It makes the _fixup_stack branch unused.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. It makes the fixup_stack branch unused.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we
can ignore their r_addends.
Reviewed-by: Fangrui Song <maskray@google.com>
This patch implements the following evex512 versions of string functions.
Perf gain for evex512 version is up to 50% as compared to evex,
depending on length and alignment.
Placeholder function, not used by any processor at the moment.
- String length function using 512 bit vectors.
- String N length using 512 bit vectors.
- Wide string length using 512 bit vectors.
- Wide string N length using 512 bit vectors.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This patch updates the kernel version in the tests tst-mman-consts.py
and tst-pidfd-consts.py to 5.18. (There are no new constants covered
by these tests in 5.18, or in 5.17 in the case of tst-pidfd-consts.py
that previously used version 5.16, that need any other header
changes.)
Tested with build-many-glibcs.py.
Linux 5.18 has no new syscalls. Update the version number in
syscall-names.list to reflect that it is still current for 5.18.
Tested with build-many-glibcs.py.
In multi-threaded programs, registering via pthread_atfork,
de-registering implicitly via dlclose, or running pthread_atfork
handlers during fork was protected by an internal lock. This meant
that a pthread_atfork handler attempting to register another handler or
dlclose a dynamically loaded library would lead to a deadlock.
This commit fixes the deadlock in the following way:
During the execution of handlers at fork time, the atfork lock is
released prior to the execution of each handler and taken again upon its
return. Any handler registrations or de-registrations that occurred
during the execution of the handler are accounted for before proceeding
with further handler execution.
If a handler that hasn't been executed yet gets de-registered by another
handler during fork, it will not be executed. If a handler gets
registered by another handler during fork, it will not be executed
during that particular fork.
The possibility that handlers may now be registered or deregistered
during handler execution means that identifying the next handler to be
run after a given handler may register/de-register others requires some
bookkeeping. The fork_handler struct has an additional field, 'id',
which is assigned sequentially during registration. Thus, handlers are
executed in descending order of 'id' during 'prepare', and ascending
order of 'id' during parent/child handler execution after the fork.
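The bookkeeping can be sketched as a lookup by 'id' rather than by list
position, so that it still works after entries were added or removed
while the lock was dropped. The struct and function names below are
hypothetical and much simpler than the real fork_handler list; the list
is assumed to stay sorted by ascending 'id'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced handler record: only the sequential id.  */
struct handler
{
  uint64_t id;
};

/* Index of the handler with the smallest id greater than LAST, or N if
   there is none.  LIST is sorted by ascending id.  */
size_t
next_ascending (const struct handler *list, size_t n, uint64_t last)
{
  for (size_t i = 0; i < n; i++)
    if (list[i].id > last)
      return i;
  return n;
}

/* Index of the handler with the largest id smaller than LAST, or N if
   there is none.  LIST is sorted by ascending id.  */
size_t
next_descending (const struct handler *list, size_t n, uint64_t last)
{
  for (size_t i = n; i > 0; i--)
    if (list[i - 1].id < last)
      return i - 1;
  return n;
}
```

Because the search is keyed on the id of the handler that just ran, a
handler that was de-registered in the meantime is simply never found.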
Two tests are included:
* tst-atfork3: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This test exercises calling dlclose from prepare, parent, and child
handlers.
* tst-atfork4: This test exercises calling pthread_atfork and dlclose
from the prepare handler.
[BZ #24595, BZ #27054]
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
float, double, and _Float128 are all assumed to be supported
(float and double already use only builtins). Only long double
is parametrized, due to GCC bug 29253, which prevents its usage on
powerpc.
This allows removing the i686, ia64, x86_64, powerpc, and sparc
arch-specific implementations.
On ia64 it also fixes the sNaN handling:
math/test-float64x-fabs
math/test-ldouble-fabs
Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
powerpc64-linux-gnu, sparc64-linux-gnu, and ia64-linux-gnu.
This reverts commit 0910702c4d.
Say both a.so and b.so define protected data symbol `var` and the executable
copy relocates var. ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange
semantics: a.so accesses the copy in the executable while b.so accesses its
own. This behavior requires that (a) the compiler emits GOT-generating
relocations (b) the linker produces GLOB_DAT instead of RELATIVE.
Without the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA code, b.so's GLOB_DAT
will bind to the executable (normal behavior).
For aarch64 it makes sense to restore the original behavior and not
pay the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA cost. The behavior is very
unlikely to be used by anyone.
* Clang code generator treats STV_PROTECTED the same way as STV_HIDDEN:
no GOT-generating relocation in the first place.
* gold and lld reject copy relocation on a STV_PROTECTED symbol.
* Nowadays -fpie/-fpic modes are popular. GCC/Clang's codegen uses
GOT-generating relocation when accessing a default visibility
external symbol which avoids copy relocation.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Unroll slightly and enforce good instruction scheduling. This improves
performance on out-of-order machines. The unrolling allows for
pipelined multiplies.
As well, as an optional sysdep, reorder the operations and prevent
reassociation for better scheduling and higher ILP. This commit
only adds the barrier for x86, although it should be either no
change or a win for any architecture.
Unrolling further started to induce slowdowns for sizes [0, 4]
but can help the loop so if larger sizes are the target further
unrolling can be beneficial.
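The effect of the unrolling can be shown with a small C sketch. It
assumes the usual GNU hash recurrence (h = h * 33 + c, starting from
5381); the function names are made up and the code is an illustration of
the transformation, not the committed implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Reference form: one dependent multiply-add per byte.  */
uint32_t
hash_simple (const char *s)
{
  uint32_t h = 5381;
  for (const unsigned char *p = (const unsigned char *) s; *p != '\0'; p++)
    h = h * 33 + *p;
  return h;
}

/* Two bytes per iteration.  The term (p[0] * 33 + p[1]) does not depend
   on H, so it can be computed while the multiply of H is in flight.  */
uint32_t
hash_unrolled (const char *s)
{
  const unsigned char *p = (const unsigned char *) s;
  uint32_t h = 5381;
  while (p[0] != '\0' && p[1] != '\0')
    {
      h = h * (33 * 33) + ((uint32_t) p[0] * 33 + p[1]);
      p += 2;
    }
  /* Odd-length tail.  */
  if (p[0] != '\0')
    h = h * 33 + p[0];
  return h;
}
```

Both functions compute the same value modulo 2^32, since
(h * 33 + a) * 33 + b equals h * 33 * 33 + (a * 33 + b).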
Results for _dl_new_hash
Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Time as Geometric Mean of N=30 runs
Geometric Mean of all benchmarks New / Old: 0.674
type, length, New Time, Old Time, New Time / Old Time
fixed, 0, 2.865, 2.72, 1.053
fixed, 1, 3.567, 2.489, 1.433
fixed, 2, 2.577, 3.649, 0.706
fixed, 3, 3.644, 5.983, 0.609
fixed, 4, 4.211, 6.833, 0.616
fixed, 5, 4.741, 9.372, 0.506
fixed, 6, 5.415, 9.561, 0.566
fixed, 7, 6.649, 10.789, 0.616
fixed, 8, 8.081, 11.808, 0.684
fixed, 9, 8.427, 12.935, 0.651
fixed, 10, 8.673, 14.134, 0.614
fixed, 11, 10.69, 15.408, 0.694
fixed, 12, 10.789, 16.982, 0.635
fixed, 13, 12.169, 18.411, 0.661
fixed, 14, 12.659, 19.914, 0.636
fixed, 15, 13.526, 21.541, 0.628
fixed, 16, 14.211, 23.088, 0.616
fixed, 32, 29.412, 52.722, 0.558
fixed, 64, 65.41, 142.351, 0.459
fixed, 128, 138.505, 295.625, 0.469
fixed, 256, 291.707, 601.983, 0.485
random, 2, 12.698, 12.849, 0.988
random, 4, 16.065, 15.857, 1.013
random, 8, 19.564, 21.105, 0.927
random, 16, 23.919, 26.823, 0.892
random, 32, 31.987, 39.591, 0.808
random, 64, 49.282, 71.487, 0.689
random, 128, 82.23, 145.364, 0.566
random, 256, 152.209, 298.434, 0.51
Co-authored-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
To check for the pidfd functions pidfd_open, pidfd_getfd, pidfd_send_signal,
and waitid with P_PIDFD.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
It was added on Linux 5.4 (3695eae5fee0605f316fbaad0b9e3de791d7dfaf)
to extend waitid to wait on pidfd.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
This was added on Linux 5.1 (3eb39f47934f9d5a3027fe00d906a45fe3a15fad)
as a way to avoid the race condition of using kill (where the PID might
be reused by the kernel between obtaining the pid and sending the
signal).
If the siginfo_t argument is NULL then pidfd_send_signal is equivalent
to kill. If it is not NULL pidfd_send_signal is equivalent to
rt_sigqueueinfo.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
This was added on Linux 5.6 (8649c322f75c96e7ced2fec201e123b2b073bf09)
as a way to retrieve a file descriptor from another process through a
pidfd (created either with CLONE_PIDFD or pidfd_open). The
functionality is similar to recvmsg with SCM_RIGHTS.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
This was added on Linux 5.3 (32fcb426ec001cb6d5a4a195091a8486ea77e2df)
as a way to retrieve a pid file descriptor for a process that was not
created with CLONE_PIDFD (that is, by the usual fork/clone).
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
A separate asm file is easier to maintain than a macro that expands to
inline asm.
The RTLD_START macro is only needed now because _dl_start is local in
rtld.c but _start has to call it. If _dl_start were made hidden, the
macro could be empty.
_dl_skip_args is no longer needed.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This is for bug 23293 and it relies on the glibc test system running
tests via explicit ld.so invocation by default.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
_dl_skip_args is always 0, so the target specific code that modifies
argv after relro protection is applied is no longer used.
After the patch relro protection is applied to _dl_argv consistently
on all targets.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
When an executable is invoked as
./ld.so [ld.so-args] ./exe [exe-args]
then the argv is adjusted in ld.so before calling the entry point of
the executable so ld.so args are not visible to it. On most targets
this requires moving argv, env and auxv on the stack to ensure correct
stack alignment at the entry point. This had several issues:
- The code for this adjustment on the stack is written in asm as part
of the target specific ld.so _start code which is hard to maintain.
- The adjustment is done after _dl_start returns, where it's too late
to update GLRO(dl_auxv), as it is already readonly, so it points to
memory that was clobbered by the adjustment. This is bug 23293.
- _environ is also wrong in ld.so after the adjustment, but it is
likely not used after _dl_start returns so this is not user visible.
- _dl_argv was updated, but for this it was moved out of relro, which
changes security properties across targets unnecessarily.
This patch introduces a generic _dl_start_args_adjust function that
handles the argument adjustments after ld.so processed its own args
and before relro protection is applied.
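The data movement involved can be sketched as follows. This is only a
model under stated assumptions, not the glibc function: the name
args_adjust_sketch is hypothetical, the stack is modelled as an array of
words laid out as argc, argv[] plus NULL, envp[] plus NULL, then auxv
(type, value) pairs ending with a type of 0, and skip is assumed not to
exceed argc. The real code also has to update the pointers ld.so keeps
(such as _dl_argv and GLRO(dl_auxv)), which is omitted here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: drop the first SKIP entries of argv and slide
   argv, envp and auxv down so that they again directly follow argc.  */
static void
args_adjust_sketch (uintptr_t *sp, size_t skip)
{
  if (skip == 0)
    return;

  uintptr_t *dst = sp + 1;         /* Slot of the new argv[0].  */
  uintptr_t *src = sp + 1 + skip;  /* First argument that is kept.  */

  sp[0] -= skip;                   /* Adjust argc.  */

  /* Copy argv, including its terminating NULL.  */
  do
    *dst++ = *src;
  while (*src++ != 0);

  /* Copy envp, including its terminating NULL.  */
  do
    *dst++ = *src;
  while (*src++ != 0);

  /* Copy auxv pairs, including the terminating type 0 entry.  */
  for (;;)
    {
      uintptr_t type = src[0];
      dst[0] = src[0];
      dst[1] = src[1];
      dst += 2;
      src += 2;
      if (type == 0)
        break;
    }
}
```

Because the destination is always below the source, a simple forward
copy is safe even though the regions overlap.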
The same algorithm is used on all targets and _dl_skip_args is now 0, so
existing target specific adjustment code is no longer used. The bug
affects aarch64, alpha, arc, arm, csky, ia64, nios2, s390-32 and sparc;
other targets don't need the change in principle, only for consistency.
The GNU Hurd start code relied on _dl_skip_args after dl_main returned;
now it checks directly if args were adjusted and fixes the Hurd startup
data accordingly.
Follow up patches can remove _dl_skip_args and DL_ARGV_NOT_RELRO.
Tested on aarch64-linux-gnu and cross tested on i686-gnu.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The Linux version used by i686 and m68k provides three overrides for
generic code:
1. DISTINGUISH_LIB_VERSIONS to print additional information when
libc5 is used by a dependency.
2. EXTRA_LD_ENVVARS to enable the LD_LIBRARY_VERSION environment
variable.
3. EXTRA_UNSECURE_ENVVARS to add two environment variables related
to aout support.
None is really required: it has been decades since libc5 or aout
support was removed, and Linux has even removed support for aout files.
LD_LIBRARY_VERSION is also dead code: dl_correct_cache_id is not
used anywhere.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
The kernel version check is used to prevent glibc from running on older
kernels where some syscalls are not available and fallback code is
not enabled to handle the failure gracefully. However, it offers no
protection if the kernel does not correctly advertise its version
through the vDSO note, uname or procfs.
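For illustration only, a check of this kind boils down to turning a
release string into a comparable number. The sketch below is an
assumption about the general shape of such code, not the code being
removed: the helper name parse_kernel_version and the 0xMMmmpp encoding
are hypothetical.

```c
/* Hypothetical helper: parse "major.minor.patch..." (for example the
   release field reported by uname) into 0xMMmmpp.  Missing components
   count as 0.  Returns -1 if no leading number is found or a component
   does not fit in one byte.  */
static long
parse_kernel_version (const char *release)
{
  unsigned long version = 0;
  int parts = 0;
  const char *p = release;

  while (parts < 3 && *p >= '0' && *p <= '9')
    {
      unsigned long n = 0;
      while (*p >= '0' && *p <= '9')
        {
          n = n * 10 + (unsigned long) (*p++ - '0');
          if (n > 255)
            return -1;
        }
      version = (version << 8) | n;
      ++parts;
      if (*p != '.')
        break;
      ++p;
    }

  if (parts == 0)
    return -1;

  /* Pad missing minor/patch components with zero.  */
  version <<= 8 * (3 - parts);
  return (long) version;
}
```

A caller would compare the result against a minimum such as 0x030200.
The comparison is only as reliable as the string it is given, which is
the weakness described above.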
Also, kernel version checks are sometimes not desired by users who
want to deploy on different systems with different kernel versions,
knowing that the minimum set of syscalls is always present on
such systems.
The kernel version check has been removed along with the
LD_ASSUME_KERNEL environment variable. The minimum kernel used to
build glibc is still provided through the NT_GNU_ABI_TAG ELF note and
is also printed when libc.so is executed.
Checked on x86_64-linux-gnu.