Remove the LD_PREFER_MAP_32BIT_EXEC environment variable support since
the first PT_LOAD segment is no longer executable due to defaulting to
-z separate-code.
This fixes [BZ #28656].
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Without the bar_ctor_finish barrier, it was possible that thread2
re-locked user_lock before ctor had a chance to lock it. ctor then
blocked in its locking operation, xdlopen from the main thread
did not return, and thread2 was stuck waiting in bar_dtor:
thread 1: started.
thread 2: started.
thread 2: locked user_lock.
constructor started: 0.
thread 1: in ctor: started.
thread 3: started.
thread 3: done.
thread 2: unlocked user_lock.
thread 2: locked user_lock.
Fixes the test in commit 83b5323261
("elf: Avoid deadlock between pthread_create and ctors [BZ #28357]").
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
TLS_INIT_TCB_ALIGN is not actually used. TLS_TCB_ALIGN was likely
introduced to support a configuration where the thread pointer
has not the same alignment as THREAD_SELF. Only ia64 seems to use
that, but for the stack/pointer guard, not for storing tcbhead_t.
Some ports use TLS_TCB_OFFSET and TLS_PRE_TCB_SIZE to shift
the thread pointer, potentially landing in a different residue class
modulo the alignment, but the changes should not impact that.
In general, given that TLS variables have their own alignment
requirements, having different alignment for the (unshifted) thread
pointer and struct pthread would potentially result in dynamic
offsets, leading to more complexity.
hppa had different values before: __alignof__ (tcbhead_t), which
seems to be 4, and __alignof__ (struct pthread), which was 8
(old default) and is now 32. However, it defines THREAD_SELF as:
/* Return the thread descriptor for the current thread. */
# define THREAD_SELF \
({ struct pthread *__self; \
__self = __get_cr27(); \
__self - 1; \
})
So the thread pointer points after struct pthread (hence __self - 1),
and they have to have the same alignment on hppa as well.
Similarly, on ia64, the definitions were different. We have:
# define TLS_PRE_TCB_SIZE \
(sizeof (struct pthread) \
+ (PTHREAD_STRUCT_END_PADDING < 2 * sizeof (uintptr_t) \
? ((2 * sizeof (uintptr_t) + __alignof__ (struct pthread) - 1) \
& ~(__alignof__ (struct pthread) - 1)) \
: 0))
# define THREAD_SELF \
((struct pthread *) ((char *) __thread_self - TLS_PRE_TCB_SIZE))
And TLS_PRE_TCB_SIZE is a multiple of the struct pthread alignment
(confirmed by the new _Static_assert in sysdeps/ia64/libc-tls.c).
On m68k, we have a larger gap between tcbhead_t and struct pthread.
But as far as I can tell, the port is fine with that. The definition
of TCB_OFFSET is sufficient to handle the shifted TCB scenario.
This fixes commit 23c77f6018
("nptl: Increase default TCB alignment to 32").
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The relationship between the thread pointer and the rseq area
is made explicit. The constant offset can be used by JIT compilers
to optimize rseq access (e.g., for really fast sched_getcpu).
Extensibility is provided through __rseq_size and __rseq_flags.
(In the future, the kernel could request a different rseq size
via the auxiliary vector.)
Co-Authored-By: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
This tunable allows applications to register the rseq area instead
of glibc.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
The rseq area is placed directly into struct pthread. rseq
registration failure is not treated as an error, so it is possible
that threads run with inconsistent registration status.
<sys/rseq.h> is not yet installed as a public header.
Co-Authored-By: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
<tls.h> already contains a definition that is quite similar,
but it is not consistent across architectures.
Only architectures for which rseq support is added are covered.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Current binutils defines __executable_start as the lowest text
address, so using the entry point address as a fallback is no
longer necessary. As a result, overriding <entry.h> is only
necessary if the entry point is not called _start.
The previous approach to define __ASSEMBLY__ to suppress the
declaration breaks if headers included by <entry.h> are not
compatible with __ASSEMBLY__. This happens with rseq integration
because it is necessary to include kernel headers in more places.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Programs without dynamic dependencies and without a program
interpreter are now run via execve.
Previously, the dynamic linker either crashed while attempting to
read a non-existing dynamic segment (looking for DT_AUDIT/DT_DEPAUDIT
data), or the self-relocated in the static PIE executable crashed
because the outer dynamic linker had already applied RELRO protection.
<dl-execve.h> is needed because execve is not available in the
dynamic loader on Hurd.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Must use notl %edi here as lower bits are for CHAR comparisons
potentially out of range thus can be 0 without indicating mismatch.
This fixes BZ #28646.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
rseq support will use a 32-byte aligned field in struct pthread,
so the whole struct needs to have at least that alignment.
nptl/tst-tls3mod.c uses TCB_ALIGNMENT, therefore include <descr.h>
to obtain the fallback definition.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
v2 is a complete rewrite of the A64FX memcpy. Performance is improved
by streamlining the code, aligning all large copies and using a single
unrolled loop for all sizes. The code size for memcpy and memmove goes
down from 1796 bytes to 868 bytes. Performance is better in all cases:
bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33%
faster for large sizes, bench-memcpy-walk is 25% faster for small sizes
and 20% for the largest sizes. The geomean of all tests in bench-memcpy
is 5.1% faster, and total time is reduced by 4%.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Rewrite memcmp to improve performance. On small and medium inputs performance
is 10-20% better. Large inputs use a SIMD loop processing 64 bytes per
iteration, which is 30-50% faster depending on the size.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
The syscall pipe2 was added in linux 2.6.27 and glibc requires linux
3.2.0. The patch removes the arch-specific implementation for alpha,
ia64, mips, sh, and sparc which requires a different kernel ABI
than the usual one.
Checked on x86_64-linux-gnu and with a build for the affected ABIs.
Variadic function calls in syscalls.list does not work for all ABIs
(for instance where the argument are passed on the stack instead of
registers) and might have underlying issues depending of the variadic
type (for instance if a 64-bit argument is used).
Checked on x86_64-linux-gnu.
The LFS prlimit64 requires a arch-specific implementation in
syscalls.list. Instead add a generic one that handles the
required symbol alias for __RLIM_T_MATCHES_RLIM64_T.
HPPA is the only outlier which requires a different default
symbol.
Checked on x86_64-linux-gnu and with build for the affected ABIs.
Passing 64-bit arguments on syscalls.list is tricky: it requires
to reimplement the expected kernel abi in each architecture. This
is way to better to represent in C code where we already have
macros for this (SYSCALL_LL64).
Checked on x86_64-linux-gnu.
Add vector sin/sinf and input files to libmvec microbenchmark.
libmvec-sin-inputs:
90% Normal random distribution
range: (-DBL_MAX, DBL_MAX)
mean: 0.0
sigma: 5.0
10% uniform random distribution in range (-1000.0, 1000.0)
libmvec-sinf-inputs:
90% Normal random distribution
range: (-FLT_MAX, FLT_MAX)
mean: 0.0f
sigma: 5.0f
10% uniform random distribution in range (-1000.0f, 1000.0f)
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Add vector pow/powf and input files to libmvec microbenchmark.
libmvec-pow-inputs:
arg1:
90% Normal random distribution
range: (0.0, 256.0)
mean: 0.0
sigma: 32.0
10% uniform random distribution in range (0.0, 256.0)
arg2:
90% Normal random distribution
range: (-127.0, 127.0)
mean: 0.0
sigma: 16.0
10% uniform random distribution in range (-127.0, 127.0)
libmvec-powf-inputs:
arg1:
90% Normal random distribution
range: (0.0f, 100.0f)
mean: 0.0f
sigma: 16.0f
10% uniform random distribution in range (0.0f, 100.0f)
arg2:
90% Normal random distribution
range: (-10.0f, 10.0f)
mean: 0.0f
sigma: 8.0f
10% uniform random distribution in range (-10.0f, 10.0f)
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Add vector log/logf and input files to libmvec microbenchmark.
libmvec-log-inputs:
70% Normal random distribution
range: (0.0, DBL_MAX)
mean: 1.0
sigma: 50.0
30% uniform random distribution in range (0.0, DBL_MAX)
libmvec-logf-inputs:
70% Normal random distribution
range: (0.0f, FLT_MAX)
mean: 1.0f
sigma: 50.0f
30% uniform random distribution in range (0.0f, FLT_MAX)
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Add vector exp/expf and input files to libmvec microbenchmark.
libmvec-exp-inputs:
90% Normal random distribution
range: (-708.0, 709.0)
mean: 0.0
sigma: 16.0
10% uniform random distribution in range (-500.0, 500.0)
libmvec-expf-inputs:
90% Normal random distribution
range: (-87.0f, 88.0f)
mean: 0.0f
sigma: 8.0f
10% uniform random distribution in range (-50.0f, 50.0f)
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Add vector cos/cosf and input files to libmvec microbenchmark.
libmvec-cos-inputs:
90% Normal random distribution
range: (-DBL_MAX, DBL_MAX)
mean: 0.0
sigma: 5.0
10% uniform random distribution in range (-1000.0, 1000.0)
libmvec-cosf-inputs:
90% Normal random distribution
range: (-FLT_MAX, FLT_MAX)
mean: 0.0f
sigma: 5.0f
10% uniform random distribution in range (-1000.0f, 1000.0f)
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Now that Hurd implementis both close_range and closefrom (f2c996597d),
we can make close_range() a base ABI, and make the default closefrom()
implementation on top of close_range().
The generic closefrom() implementation based on __getdtablesize() is
moved to generic close_range(). On Linux it will be overriden by
the auto-generation syscall while on Hurd it will be a system specific
implementation.
The closefrom() now calls close_range() and __closefrom_fallback().
Since on Hurd close_range() does not fail, __closefrom_fallback() is an
empty static inline function set by__ASSUME_CLOSE_RANGE.
The __ASSUME_CLOSE_RANGE also allows optimize Linux
__closefrom_fallback() implementation when --enable-kernel=5.9 or
higher is used.
Finally the Linux specific tst-close_range.c is moved to io and
enabled as default. The Linuxism and CLOSE_RANGE_UNSHARE are
guarded so it can be built for Hurd (I have not actually test it).
Checked on x86_64-linux-gnu, i686-linux-gnu, and with a i686-gnu
build.
__libc_signal_restore_set was in the wrong place: It also ran
when setjmp returned the second time (after pthread_exit or
pthread_cancel). This is observable with blocked pending
signals during thread exit.
Fixes commit b3cae39dcb
("nptl: Start new threads with all signals blocked [BZ #25098]").
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The @notoc usage only yields an advantage on ISA 3.1+ machine (power10)
and for ld.bfd also when it sees pcrel relocations used on the code
(generated if compiler targets ISA 3.1+). On bfd case ISA 3.1+
instruction on stubs are used iff linker also sees the new pc-relative
relocations (for instance R_PPC64_D34), otherwise it generates default
stubs (ppc64_elf_check_relocs:4700).
This patch also help on linkers that do not implement this optimization,
since building for older ISA (such as 3.0 / power9) will also trigger
power10 stubs generation in the assembly code uses the NOTOC imacro.
Checked on powerpc64le-linux-gnu.
Reviewed-by: Fangrui Song <maskray@google.com>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
It requires less boilerplate code for newer ports. The _Static_assert
checks from internal setjmp are moved to its own internal test since
setjmp.h is included early by multiple headers (to generate
rtld-sizes.sym).
The riscv jmp_buf-macros.h check is also redundant, it is already
done by riscv configure.ac.
Checked with a build for the affected architectures.
This patch updates the kernel version in the test tst-mman-consts.py
to 5.15. (There are no new MAP_* constants covered by this test in
5.15 that need any other header changes.)
Tested with build-many-glibcs.py.
And make it an installed header. This addresses a few aliasing
violations (which do not seem to result in miscompilation due to
the use of atomics), and also enables use of wide counters in other
parts of the library.
The debug output in nptl/tst-cond22 has been adjusted to print
the 32-bit values instead because it avoids a big-endian/little-endian
difference.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add python script to generate libmvec microbenchmark from the input
values for each libmvec function using skeleton benchmark template.
Creates double and float benchmarks with vector length 1, 2, 4, 8,
and 16 for each libmvec function. Vector length 1 corresponds to
scalar version of function and is included for vector function perf
comparison.
Co-authored-by: Haochen Jiang <haochen.jiang@intel.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
No bug.
This implementation refactors memcmp-sse4.S primarily with minimizing
code size in mind. It does this by removing the lookup table logic and
removing the unrolled check from (256, 512] bytes.
memcmp-sse4 code size reduction : -3487 bytes
wmemcmp-sse4 code size reduction: -1472 bytes
The current memcmp-sse4.S implementation has a large code size
cost. This has serious adverse affects on the ICache / ITLB. While
in micro-benchmarks the implementations appears fast, traces of
real-world code have shown that the speed in micro benchmarks does not
translate when the ICache/ITLB are not primed, and that the cost
of the code size has measurable negative affects on overall
application performance.
See https://research.google/pubs/pub48320/ for more details.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Linux 5.15 has one new syscall, process_mrelease (and also enables the
clone3 syscall for RV32). It also has a macro __NR_SYSCALL_MASK for
Arm, which is not a syscall but matches the pattern used for syscall
macro names.
Add __NR_SYSCALL_MASK to the names filtered out in the code dealing
with syscall lists, update syscall-names.list for the new syscall and
regenerate the arch-syscall.h headers with build-many-glibcs.py
update-syscalls.
Tested with build-many-glibcs.py.
Depending on the layout chosen by the linker, the 16-bit displacement
of the jh instruction is insufficient to reach the target label.
Analysis of the linker failure was carried out by Nick Clifton.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
Since
commit d73f5331ce
Author: Roland McGrath <roland@gnu.org>
Date: Fri May 2 02:20:45 2003 +0000
2003-05-01 Roland McGrath <roland@redhat.com>
dependency is generated by passing -MD -MF to compiler. Remove the unused
+mkdep, +make-deps, s-proto.S and s-proto-cancel.S.
This fixes BZ #28554.
The include cleanup on dl-minimal.c removed too much for some
targets.
Also for Hurd, __sbrk is removed from localplt.data now that
tunables allocated memory through mmap.
Checked with a build for all affected architectures.
The rtld_malloc functions are moved to its own file so it can be
used on csu code. Also, the functiosn are renamed to __minimal_*
(since there are now used not only on loader code).
Using the __minimal_malloc on tunables_strdup() avoids potential
issues with sbrk() calls while processing the tunables (I see
sporadic elf/tst-dso-ordering9 on powerpc64le with different
tests failing due ASLR).
Also, using __minimal_malloc over plain mmap optimizes the memory
allocation on both static and dynamic case (since it will any unused
space in either the last page of data segments, avoiding mmap() call,
or from the previous mmap() call).
Checked on x86_64-linux-gnu, i686-linux-gnu, and powerpc64le-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
The close_range () function implements the same API as the Linux and
FreeBSD syscalls. It operates atomically and reliably. The specified
upper bound is clamped to the actual size of the file descriptor table;
it is expected that the most common use case is with last = UINT_MAX.
Like in the Linux syscall, it is also possible to pass the
CLOSE_RANGE_CLOEXEC flag to mark the file descriptors in the range
cloexec instead of acually closing them.
Also, add a Hurd version of the closefrom () function. Since unlike on
Linux, close_range () cannot fail due to being unuspported by the
running kernel, a fallback implementation is never necessary.
Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
Message-Id: <20211106153524.82700-1-bugaevc@gmail.com>
No bug.
The optimizations are as follows:
1) Always align entry to 64 bytes. This makes behavior more
predictable and makes other frontend optimizations easier.
2) Make the L(more_8x_vec) cases 4k aliasing aware. This can have
significant benefits in the case that:
0 < (dst - src) < [256, 512]
3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%]
improvement and for FSRM [-10%, 25%].
In addition to these primary changes there is general cleanup
throughout to optimize the aligning routines and control flow logic.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
There are a few places where only known numeric values are acceptable for
`asm` parameters, yet the constraint "i" is used. "i" can include
"symbolic constants whose values will be known only at assembly time or
later."
Use "n" instead of "i" where known numeric values are required.
Suggested-by: Segher Boessenkool <segher@kernel.crashing.org>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
It allows build both glibc and tests with lld (Since lld does not
support R_RISCV_ALIGN linker relaxation).
Checked with a build for riscv32-linux-gnu-rv32imafdc-ilp32d and
riscv64-linux-gnu-rv64imafdc-lp64d.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Clang cannot assemble movzx in the AT&T dialect mode.
../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction
movzx (%rsi), %ecx
^~~~
Change movzx to movzbl, which follows the AT&T dialect and is used
elsewhere in the file.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The lld linker does not support TLSDESC for arm. The have-arm-tls-desc
is a leftover of 56583289b1 to support NaCL.
Reviewed-by: Fangrui Song <maskray@google.com>
The lld does not support R_ARM_GOTOFF32 to preemptible symbol (_dl_argv
has default visibility). Use the internal alias instead (one option
would to use HIDDEN_JUMPTARGET, bu the macro is not defined for
!__ASSEMBLER__ and I made this patch arm-specific to avoid require to
check extensivelly on other architecture it this might break something).
Checked on arm-linux-gnueabihf.
Reviewed-by: Fangrui Song <maskray@google.com>
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
In strcmp-evex.S, to compare 2 32-byte strings, replace
VMOVU (%rdi, %rdx), %YMM0
VMOVU (%rsi, %rdx), %YMM1
/* Each bit in K0 represents a mismatch in YMM0 and YMM1. */
VPCMP $4, %YMM0, %YMM1, %k0
VPCMP $0, %YMMZERO, %YMM0, %k1
VPCMP $0, %YMMZERO, %YMM1, %k2
/* Each bit in K1 represents a NULL in YMM0 or YMM1. */
kord %k1, %k2, %k1
/* Each bit in K1 represents a NULL or a mismatch. */
kord %k0, %k1, %k1
kmovd %k1, %ecx
testl %ecx, %ecx
jne L(last_vector)
with
VMOVU (%rdi, %rdx), %YMM0
VPTESTM %YMM0, %YMM0, %k2
/* Each bit cleared in K1 represents a mismatch or a null CHAR
in YMM0 and 32 bytes at (%rsi, %rdx). */
VPCMP $0, (%rsi, %rdx), %YMM0, %k1{%k2}
kmovd %k1, %ecx
incl %ecx
jne L(last_vector)
It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake
and Ice Lake.
Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
Compiling mmap_internal fails to compile when we use -1 for MMAP2_PAGE_UNIT
on 32 bit architectures. The error is as follows:
../sysdeps/unix/sysv/linux/mmap_internal.h:30:8: error: unknown type
name 'uint64_t'
|
30 | static uint64_t page_unit;
|
| ^~~~~~~~
Fix by adding including stdint.h.
The following commit:
commit cf4fd28ea4
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Tue Oct 26 19:43:18 2021 -0500
Broke --disable-multi-arch build for x86_64 because x86_64/memcmpeq.S
was not defined outside of multiarch and the alias for __memcmpeq in
x86_64/memcmp.S was removed.
This commit fixes that issue by adding x86_64/memcmpeq.S.
make xcheck passes on x86_64 with and without --disable-multi-arch
A non-local STV_DEFAULT defined symbol is by default preemptible in a
shared object. j/jal cannot target a preemptible symbol. On other
architectures, such a jump instruction either causes PLT [BZ #18822], or
if short-ranged, sometimes rejected by the linker (but not by GNU ld's
riscv port [ld PR/28509]).
Use HIDDEN_JUMPTARGET to target a non-preemptible symbol instead.
With this patch, ld.so and libc.so can be linked with LLD if source
files are compiled/assembled with -mno-relax/-Wa,-mno-relax.
Acked-by: Palmer Dabbelt <palmer@dabbelt.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
No bug. This commit adds new optimized __memcmpeq implementation for
evex.
The primary optimizations are:
1) skipping the logic to find the difference of the first mismatched
byte.
2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.
No bug. This commit adds new optimized __memcmpeq implementation for
avx2.
The primary optimizations are:
1) skipping the logic to find the difference of the first mismatched
byte.
2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.
No bug. This commit does not modify any of the memcmp
implementation. It just adds __memcmpeq ifdefs to skip obvious cases
where computing the proper 1/-1 required by memcmp is not needed.
No bug. This commit adds support for __memcmpeq to be implemented
seperately from memcmp. Support is added for versions optimized with
sse2, avx2, and evex.
No bug.
This commit adds support for __memcmpeq() as a new ABI for all
targets. In this commit __memcmpeq() is implemented only as an alias
to the corresponding targets memcmp() implementation. __memcmpeq() is
added as a new symbol starting with GLIBC_2.35 and defined in string.h
with comments explaining its behavior. Basic tests that it is callable
and works where added in string/tester.c
As discussed in the proposal "Add new ABI '__memcmpeq()' to libc"
__memcmpeq() is essentially a reserved namespace for bcmp(). The means
is shares the same specifications as memcmp() except the return value
for non-equal byte sequences is any non-zero value. This is less
strict than memcmp()'s return value specification and can be better
optimized when a boolean return is all that is needed.
__memcmpeq() is meant to only be called by compilers if they can prove
that the return value of a memcmp() call is only used for its boolean
value.
All tests in string/tester.c passed. As well build succeeds on
x86_64-linux-gnu target.
When LIBC_LINKER_FEATURE is used to check a linker option with the equal
sign, it will likely fail because the LD -v --help output may look like
`-z lam-report=[none|warning|error]` while the needle is something like
`-z lam-report=warning`.
The LD -v --help filter doesn't save much time, so just remove it.
This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'.
it could potentially be dangerous to use SSE2 if this function is ever
called without using 'vzeroupper' beforehand. While compilers appear
to use 'vzeroupper' before function calls if AVX2 has been used, using
SSE2 here is more brittle. Since it is not absolutely necessary it
should be avoided.
It costs 2-extra bytes but the extra bytes should only eat into
alignment padding.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This second patch contains the actual implementation of a new sorting algorithm
for shared objects in the dynamic loader, which solves the slow behavior that
the current "old" algorithm falls into when the DSO set contains circular
dependencies.
The new algorithm implemented here is simply depth-first search (DFS) to obtain
the Reverse-Post Order (RPO) sequence, a topological sort. A new l_visited:1
bitfield is added to struct link_map to more elegantly facilitate such a search.
The DFS algorithm is applied to the input maps[nmap-1] backwards towards
maps[0]. This has the effect of a more "shallow" recursion depth in general
since the input is in BFS. Also, when combined with the natural order of
processing l_initfini[] at each node, this creates a resulting output sorting
closer to the intuitive "left-to-right" order in most cases.
Another notable implementation adjustment related to this _dl_sort_maps change
is the removing of two char arrays 'used' and 'done' in _dl_close_worker to
represent two per-map attributes. This has been changed to simply use two new
bit-fields l_map_used:1, l_map_done:1 added to struct link_map. This also allows
discarding the clunky 'used' array sorting that _dl_sort_maps had to sometimes
do along the way.
Tunable support for switching between different sorting algorithms at runtime is
also added. A new tunable 'glibc.rtld.dynamic_sort' with current valid values 1
(old algorithm) and 2 (new DFS algorithm) has been added. At time of commit
of this patch, the default setting is 1 (old algorithm).
Signed-off-by: Chung-Lin Tang <cltang@codesourcery.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
According to C11 6.6p6, `const int` as an operand may not make up a
constant expression. GCC -O0 errors:
../sysdeps/unix/sysv/linux/opendir.c:107:19: error: static_assert expression is not an integral constant expression
_Static_assert (allocation_size >= sizeof (struct dirent64),
-O2 -Wpedantic has a similar warning.
See https://gcc.gnu.org/PR102502 for GCC's inconsistency.
Use enum which is guaranteed to be a constant expression.
This also makes the file compilable with Clang.
Fixes: 4b962c9e85 ("linux: Simplify opendir buffer allocation")
1. Add sysdeps/x86_64/fpu/Makeconfig to auto-generate libmvec.mk, which
contains libmvec ABI test dependencies and CFLAGS, in the build directory.
2. Include libmvec.mk for libmvec ABI test dependencies and CFLAGS.
Tested on SSE4, AVX, AVX2 and AVX512 machines.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
The powerpc optimization to provide a fast stacktrace requires some
ad-hoc code to handle Linux signal frames and the change is fragile
once the kernel decides to slight change its execution sequence [1].
The generic implementation work as-is and it should be future proof
since the kernel provides the expected CFI directives in vDSO shared
page.
Checked on powerpc-linux-gnu, powerpc64le-linux-gnu, and
powerpc64-linux-gnu.
[1] https://sourceware.org/pipermail/libc-alpha/2021-January/122027.html
1. Define DL_RO_DYN_SECTION to initalize bootstrap_map.l_ld_readonly
before calling elf_get_dynamic_info to get dynamic info in bootstrap_map,
2. Define a single
static inline bool
dl_relocate_ld (const struct link_map *l)
{
/* Don't relocate dynamic section if it is readonly */
return !(l->l_ld_readonly || DL_RO_DYN_SECTION);
}
This updates BZ #28340 fix.
This was found when testing the OpenRISC port I am working on. These
two tests fail with SIGSEGV:
FAIL: misc/tst-ntp_gettime
FAIL: misc/tst-ntp_gettimex
This was found to be due to the kernel overwriting the stack space
allocated by the timex structure. The reason for the overwrite being
that the kernel timex has 64-bit fields and user space code only
allocates enough stack space for timex with 32-bit fields.
On 32-bit systems with TIMESIZE=64 __USE_TIME_BITS64 is not defined.
This causes the timex structure to use 32-bit fields with type
__syscall_slong_t.
This patch adjusts the ifdef condition to allow 32-bit systems with
TIMESIZE=64 to use the 64-bit long long timex definition.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
5bf07e1b3a ("Linux: Simplify __opensock and fix race condition [BZ #28353]")
made __opensock try NETLINK then UNIX then INET. On the Hurd, only INET
knows about network interfaces, so better actually specify that in
if_index.
INTR_MSG_TRAP was tinkering with esp to make it point to
_hurd_intr_rpc_mach_msg's parameters, and notably use (&msg)[-1] which is
meaningless in C.
Instead, just push the parameters on the stack, which also avoids leaving
local variables of _hurd_intr_rpc_mach_msg below esp. We now also
properly express that OPTION and TIMEOUT may be updated during the trap
call.
The 4af6982e4c fix does not fully handle RTLD_BOOTSTRAP usage on
rtld.c due two issues:
1. RTLD_BOOTSTRAP is also used on dl-machine.h on various
architectures and it changes the semantics of various machine
relocation functions.
2. The elf_get_dynamic_info() change was done sideways, previously
to 490e6c62aa get-dynamic-info.h was included by the first
dynamic-link.h include *without* RTLD_BOOTSTRAP being defined.
It means that the code within elf_get_dynamic_info() that uses
RTLD_BOOTSTRAP is in fact unused.
To fix 1. this patch now includes dynamic-link.h only once with
RTLD_BOOTSTRAP defined. The ELF_DYNAMIC_RELOCATE call will now have
the relocation fnctions with the expected semantics for the loader.
And to fix 2. part of 4af6982e4c is reverted (the check argument
elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP
pieces are removed.
To reorganize the includes the static TLS definition is moved to
its own header to avoid a circular dependency (it is defined on
dynamic-link.h and dl-machine.h requires it at same time other
dynamic-link.h definition requires dl-machine.h defitions).
Also ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL
are moved to its own header. Only ancient ABIs need special values
(arm, i386, and mips), so a generic one is used as default.
The powerpc Elf64_FuncDesc is also moved to its own header, since
csu code required its definition (which would require either include
elf/ folder or add a full path with elf/).
Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32,
and powerpc64le.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
No bug.
Optimization are
1. change control flow for L(more_2x_vec) to fall through to loop and
jump for L(less_4x_vec) and L(less_8x_vec). This uses less code
size and saves jumps for length > 4x VEC_SIZE.
2. For EVEX/AVX512 move L(less_vec) closer to entry.
3. Avoid complex address mode for length > 2x VEC_SIZE
4. Slightly better aligning code for the loop from the perspective of
code size and uops.
5. Align targets so they make full use of their fetch block and if
possible cache line.
6. Try and reduce total number of icache lines that will need to be
pulled in for a given length.
7. Include "local" version of stosb target. For AVX2/EVEX/AVX512
jumping to the stosb target in the sse2 code section will almost
certainly be to a new page. The new version does increase code size
marginally by duplicating the target but should get better iTLB
behavior as a result.
test-memset, test-wmemset, and test-bzero are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
No bug.
The frontend optimizations are to:
1. Reorganize logically connected basic blocks so they are either in
the same cache line or adjacent cache lines.
2. Avoid cases when basic blocks unnecissarily cross cache lines.
3. Try and 32 byte align any basic blocks possible without sacrificing
code size. Smaller / Less hot basic blocks are used for this.
Overall code size shrunk by 168 bytes. This should make up for any
extra costs due to aligning to 64 bytes.
In general performance before deviated a great deal dependending on
whether entry alignment % 64 was 0, 16, 32, or 48. These changes
essentially make it so that the current implementation is at least
equal to the best alignment of the original for any arguments.
The only additional optimization is in the page cross case. Branch on
equals case was removed from the size == [4, 7] case. As well the [4,
7] and [2, 3] case where swapped as [4, 7] is likely a more hot
argument size.
test-memcmp and test-wmemcmp are both passing.
Before to 490e6c62aa ('elf: Avoid nested functions in the loader
[BZ #27220]'), elf_get_dynamic_info() was defined twice on rtld.c: on
the first dynamic-link.h include and later within _dl_start(). The
former definition did not define DONT_USE_BOOTSTRAP_MAP and it is used
on setup_vdso() (since it is a global definition), while the former does
define DONT_USE_BOOTSTRAP_MAP and it is used on loader self-relocation.
With the commit change, the function is now included and defined once
instead of defined as a nested function. So rtld.c defines without
defining RTLD_BOOTSTRAP and it brokes at least powerpc32.
This patch fixes by moving the get-dynamic-info.h include out of
dynamic-link.h, which then the caller can corirectly set the expected
semantic by defining STATIC_PIE_BOOTSTRAP, RTLD_BOOTSTRAP, and/or
RESOLVE_MAP.
It also required to enable some asserts only for the loader bootstrap
to avoid issues when called from setup_vdso().
As a side note, this is another issues with nested functions: it is
not clear from pre-processed output (-E -dD) how the function will
be build and its semantic (since nested function will be local and
extra C defines may change it).
I checked on x86_64-linux-gnu (w/o --enable-static-pie),
i686-linux-gnu, powerpc64-linux-gnu, powerpc-linux-gnu-power4,
aarch64-linux-gnu, arm-linux-gnu, sparc64-linux-gnu, and
s390x-linux-gnu.
Reviewed-by: Fangrui Song <maskray@google.com>
Building for nios2-linux-gnu has recently started showing a localplt
test failure, arising from a reference to __floatunsidf from
getloadavg after commit b5c8a3aa82
("Linux: implement getloadavg(3) using sysinfo(2)") (this is an
architecture with soft-fp in libc). Add this as a permitted local PLT
reference in localplt.data.
Tested with build-many-glibcs.py for nios2-linux-gnu.
Intel MPX failed to gain wide adoption and has been deprecated for a
while. GCC 9.1 removed Intel MPX support. Linux kernel removed MPX in
2019.
This patch removes the support code from the dynamic loader.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
No bug.
This change adds a new macro ENTRY_P2ALIGN which takes a second
argument, log2 of the desired function alignment.
The old ENTRY(name) macro is just ENTRY_P2ALIGN(name, 4) so this
doesn't affect any existing functionality.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
dynamic-link.h is included more than once in some elf/ files (rtld.c,
dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested
functions. This harms readability and the nested functions usage
is the biggest obstacle prevents Clang build (Clang doesn't support GCC
nested functions).
The key idea for unnesting is to add extra parameters (struct link_map
*and struct r_scope_elm *[]) to RESOLVE_MAP,
ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a],
elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired
by Stan Shebs' ppc64/x86-64 implementation in the
google/grte/v5-2.27/master which uses mixed extra parameters and static
variables.)
Future simplification:
* If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM,
elf_machine_runtime_setup can drop the `scope` parameter.
* If TLSDESC no longer need to be in elf_machine_lazy_rel,
elf_machine_lazy_rel can drop the `scope` parameter.
Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32,
sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64.
In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu
and riscv64-linux-gnu-rv64imafdc-lp64d.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
When performing symbol lookup for references in executable without
indirect external access:
1. Disallow copy relocations in executable against protected data symbols
in a shared object with indirect external access.
2. Disallow non-zero symbol values of undefined function symbols in
executable, which are used as the function pointer, against protected
function symbols in a shared object with indirect external access.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
1. Add GNU_PROPERTY_1_NEEDED:
#define GNU_PROPERTY_1_NEEDED GNU_PROPERTY_UINT32_OR_LO
to indicate the needed properties by the object file.
2. Add GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS:
#define GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS (1U << 0)
to indicate that the object file requires canonical function pointers and
cannot be used with copy relocation.
3. Scan GNU_PROPERTY_1_NEEDED property and store it in l_1_needed.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Both new HWCAPs were introduced in these kernel commits:
- 7e8403ecaf884f307b627f3c371475913dd29292
"s390: add HWCAP_S390_PCI_MIO to ELF hwcaps"
- 7e82523f2583e9813e4109df3656707162541297
"s390/hwcaps: make sie capability regular hwcap"
Also note that the kernel commit 511ad531afd4090625def4d9aba1f5227bd44b8e
"s390/hwcaps: shorten HWCAP defines" has shortened the prefix of the macros
from "HWCAP_S390_" to "HWCAP_". For compatibility reasons, we do not
change the prefix in public glibc header file.
The largest errors over the full binary32 range are after this
patch (on x86_64):
RNDN: libm wrong by up to 9.00e+00 ulp(s) [9] for x=0x1.04c39cp+6
RNDZ: libm wrong by up to 9.00e+00 ulp(s) [9] for x=0x1.04c39cp+6
RNDU: libm wrong by up to 9.00e+00 ulp(s) [9] for x=0x1.04c39cp+6
RNDD: libm wrong by up to 8.98e+00 ulp(s) [9] for x=0x1.4b7066p+7
Inputs that were yielding huge errors have been added to "make check".
Reviewed-by: Adhemeral Zanella <adhemerval.zanella@linaro.org>
The fix for bug 19329 caused a regression such that pthread_create can
deadlock when concurrent ctors from dlopen are waiting for it to finish.
Use a new GL(dl_load_tls_lock) in pthread_create that is not taken
around ctors in dlopen.
The new lock is also used in __tls_get_addr instead of GL(dl_load_lock).
The new lock is held in _dl_open_worker and _dl_close_worker around
most of the logic before/after the init/fini routines. When init/fini
routines are running then TLS is in a consistent, usable state.
In _dl_open_worker the new lock requires catching and reraising dlopen
failures that happen in the critical section.
The new lock is reinitialized in a fork child, to keep the existing
behaviour and it is kept recursive in case malloc interposition or TLS
access from signal handlers can retake it. It is not obvious if this
is necessary or helps, but avoids changing the preexisting behaviour.
The new lock may be more appropriate for dl_iterate_phdr too than
GL(dl_load_write_lock), since TLS state of an incompletely loaded
module may be accessed. If the new lock can replace the old one,
that can be a separate change.
Fixes bug 28357.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The choice between the kill vs tgkill system calls is not just about
the TID reuse race, but also about whether the signal is sent to the
whole process (and any thread in it) or to a specific thread.
This was caught by the openposix test suite:
LTP: openposix test suite - FAIL: SIGUSR1 is member of new thread pendingset.
<https://gitlab.com/cki-project/kernel-tests/-/issues/764>
Fixes commit 526c3cf11e ("nptl: Fix race
between pthread_kill and thread exit (bug 12889)").
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Linux added FUTEX_LOCK_PI2 to support clock selection
(commit bf22a6976897977b0a3f1aeba6823c959fc4fdae). With the new
flag we can now proper support CLOCK_MONOTONIC for
pthread_mutex_clocklock with Priority Inheritance. If kernel
does not support, EINVAL is returned instead.
The difference is the futex operation will be issued and the kernel
will advertise the missing support (instead of hard-code error
return).
Checked on x86_64-linux-gnu and i686-linux-gnu on Linux 5.14, 5.11,
and 4.15.