arm_sve.h depends on stdint.h but that relies on libc headers unless
compiled in freestanding mode. Without this change a bootstrap glibc
build (that uses a compiler without installed libc headers) failed with
checking for availability of SVE ACLE... In file included from [...]/arm_sve.h:28,
from conftest.c:1:
[...]/stdint.h:9:16: fatal error: stdint.h: No such file or directory
9 | # include_next <stdint.h>
| ^~~~~~~~~~
compilation terminated.
configure: error: mathvec is enabled but compiler does not have SVE ACLE. [...]
This patch enables libmvec on AArch64. The proposed change is mainly
implementing build infrastructure to add the new routines to ABI,
tests and benchmarks. I have demonstrated how this all fits together
by adding implementations for vector cos, in both single and double
precision, targeting both Advanced SIMD and SVE.
The implementations of the routines themselves are just loops over the
scalar routine from libm for now, as we are more concerned with
getting the plumbing right at this point. We plan to contribute vector
routines from the Arm Optimized Routines repo that are compliant with
requirements described in the libmvec wiki.
Building libmvec requires minimum GCC 10 for SVE ACLE. To avoid raising
the minimum GCC by such a big jump, we allow users to disable libmvec
if their compiler is too old.
Note that at this point users have to manually call the vector math
functions. This seems to be acceptable to some downstream users.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Fix the computation to allow for cntfrq_el0 being larger than 1GHz.
Assume cntfrq_el0 is a multiple of 1MHz to increase the maximum
interval (1024 seconds at 1GHz).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Although static linker can optimize it to local call, it follows the
internal scheme to provide hidden proto and definitions.
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
Although static linker can optimize it to local call, it follows the
internal scheme to provide hidden proto and definitions.
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
Improve SVE memcpy by copying 2 vectors if the size is small enough.
This improves performance of random memcpy by ~9% on Neoverse V1, and
33-64 byte copies are ~16% faster.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Use shrn for narrowing the mask which simplifies code and speeds up small
strings. Unroll the first search loop to improve performance on large
strings.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Optimize strnlen using the shrn instruction and improve the main loop.
Small strings are around 10% faster, large strings are 40% faster on
modern CPUs.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Simplify calculation of the mask using shrn. Unroll the main loop.
Small strings are 20% faster on modern CPUs.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment.
Performance improves slightly as a result.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
This patch fixes two problems with audit:
1. The DL_OFFSET_RV_VPCS offset was mixed up with DL_OFFSET_RG_VPCS,
resulting in x2 register value nulling in RG structure.
2. We need to preserve the x8 register before function call, but
don't have to save it's new value and restore it before return.
Anyway the final restore was using OFFSET_RV instead of OFFSET_RG value
which is wrong (althoug doesn't affect anything).
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This makes it more likely that the compiler can compute the strlen
argument in _startup_fatal at compile time, which is required to
avoid a dependency on strlen this early during process startup.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
clang emits an warning when a double alias redirection is used, to warn
the the original symbol will be used even when weak definition is
overridden. However, this is a common pattern for weak_alias, where
multiple alias are set to same symbol.
Reviewed-by: Fangrui Song <maskray@google.com>
Use an empty wordcopy.c to avoid building the generic one.
It does not seem to be used anywhere.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Since __memcpy_simd is the fastest memcpy on almost all cores, replace
the generic memcpy with it. If SVE is available, a SVE memcpy will be
used by default (including for Neoverse N2).
In the future, this will result in a compilation failure if the
macros are unexpectedly undefined (due to header inclusion ordering
or header inclusion missing altogether).
Assembler sources are more difficult to convert. In many cases,
they are hand-optimized for the mangling and no-mangling variants,
which is why they are not converted.
sysdeps/s390/s390-32/__longjmp.c and sysdeps/s390/s390-64/__longjmp.c
are special: These are C sources, but most of the implementation is
in assembler, so the PTR_DEMANGLE macro has to be undefined in some
cases, to match the assembler style.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This allows us to define a generic no-op version of PTR_MANGLE and
PTR_DEMANGLE. In the future, we can use PTR_MANGLE and PTR_DEMANGLE
unconditionally in C sources, avoiding an unintended loss of hardening
due to missing include files or unlucky header inclusion ordering.
In i386 and x86_64, we can avoid a <tls.h> dependency in the C
code by using the computed constant from <tcb-offsets.h>. <sysdep.h>
no longer includes these definitions, so there is no cyclic dependency
anymore when computing the <tcb-offsets.h> constants.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Besides the option being gcc specific, this approach is still fragile
and not future proof since we do not know if this will be the only
optimization option gcc will add that transforms loops to memset
(or any libcall).
This patch adds a new header, dl-symbol-redir-ifunc.h, that can b
used to redirect the compiler generated libcalls to port the generic
memset implementation if required.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Rename atomic_exchange_rel/acq to use atomic_exchange_release/acquire
since these map to the standard C11 atomic builtins.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The start code can get linked into dynamic linked executables where
LGPL would require shipping the source or linkable binaries when the
executable is distributed.
On some targets the license exception was missing in start.S (which
is compiled into crt1.o and Scrt1.o which may end up linked into PDE
and PIE binaries).
I did not review what other code may end up in executables, just
fixed the start.S license inconsistency across targets.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Rather than buffering 16 MiB of entropy in userspace (by way of
chacha20), simply call getrandom() every time.
This approach is doubtlessly slower, for now, but trying to prematurely
optimize arc4random appears to be leading toward all sorts of nasty
properties and gotchas. Instead, this patch takes a much more
conservative approach. The interface is added as a basic loop wrapper
around getrandom(), and then later, the kernel and libc together can
work together on optimizing that.
This prevents numerous issues in which userspace is unaware of when it
really must throw away its buffer, since we avoid buffering all
together. Future improvements may include userspace learning more from
the kernel about when to do that, which might make these sorts of
chacha20-based optimizations more possible. The current heuristic of 16
MiB is meaningless garbage that doesn't correspond to anything the
kernel might know about. So for now, let's just do something
conservative that we know is correct and won't lead to cryptographic
issues for users of this function.
This patch might be considered along the lines of, "optimization is the
root of all evil," in that the much more complex implementation it
replaces moves too fast without considering security implications,
whereas the incremental approach done here is a much safer way of going
about things. Once this lands, we can take our time in optimizing this
properly using new interplay between the kernel and userspace.
getrandom(0) is used, since that's the one that ensures the bytes
returned are cryptographically secure. But on systems without it, we
fallback to using /dev/urandom. This is unfortunate because it means
opening a file descriptor, but there's not much of a choice. Secondly,
as part of the fallback, in order to get more or less the same
properties of getrandom(0), we poll on /dev/random, and if the poll
succeeds at least once, then we assume the RNG is initialized. This is a
rough approximation, as the ancient "non-blocking pool" initialized
after the "blocking pool", not before, and it may not port back to all
ancient kernels, though it does to all kernels supported by glibc
(≥3.2), so generally it's the best approximation we can do.
The motivation for including arc4random, in the first place, is to have
source-level compatibility with existing code. That means this patch
doesn't attempt to litigate the interface itself. It does, however,
choose a conservative approach for implementing it.
Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Cristian Rodríguez <crrodriguez@opensuse.org>
Cc: Paul Eggert <eggert@cs.ucla.edu>
Cc: Mark Harris <mark.hsj@gmail.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
It adds vectorized ChaCha20 implementation based on libgcrypt
cipher/chacha20-aarch64.S. It is used as default and only
little-endian is supported (BE uses generic code).
As for generic implementation, the last step that XOR with the
input is omited. The final state register clearing is also
omitted.
On a virtualized Linux on Apple M1 it shows the following
improvements (using formatted bench-arc4random data):
GENERIC MB/s
-----------------------------------------------
arc4random [single-thread] 380.89
arc4random_buf(16) [single-thread] 500.73
arc4random_buf(32) [single-thread] 552.61
arc4random_buf(48) [single-thread] 566.82
arc4random_buf(64) [single-thread] 574.01
arc4random_buf(80) [single-thread] 581.02
arc4random_buf(96) [single-thread] 591.19
arc4random_buf(112) [single-thread] 592.29
arc4random_buf(128) [single-thread] 596.43
-----------------------------------------------
OPTIMIZED MB/s
-----------------------------------------------
arc4random [single-thread] 569.60
arc4random_buf(16) [single-thread] 825.78
arc4random_buf(32) [single-thread] 987.03
arc4random_buf(48) [single-thread] 1042.39
arc4random_buf(64) [single-thread] 1075.50
arc4random_buf(80) [single-thread] 1094.68
arc4random_buf(96) [single-thread] 1130.16
arc4random_buf(112) [single-thread] 1129.58
arc4random_buf(128) [single-thread] 1137.91
-----------------------------------------------
Checked on aarch64-linux-gnu.
We found that string functions were using AND+ADDP
to find the nibble/syndrome mask but there is an easier
opportunity through `SHRN dst.8b, src.8h, 4` (shift
right every 2 bytes by 4 and narrow to 1 byte) and has
same latency on all SIMD ARMv8 targets as ADDP. There
are also possible gaps for memcmp but that's for
another patch.
We see 10-20% savings for small-mid size cases (<=128)
which are primary cases for general workloads.
The AFP feature (Alternate floating-point behavior) was added in armv8.7 and
introduced new FPCR bits.
Currently, HWCAP2_AFP bits (bit 0, 1, 2) in FPCR are preserved when fenv is
set to default environment. This is a deviation from standard behaviour.
Clear these bits when setting the fenv to default.
There is no libc API to modify the new FPCR bits. Restoring those bits matters
if the user changed them directly.
The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only
needs to handle RELATIVE, GLOB_DAT, and JUMP_SLOT.
TLSDESC/TLS_DTPMOD/TLS_DTPREL handling can be removed. Remove
`case AARCH64_R(RELATIVE)` as well as elf_machine_rela has checked it.
Tested on aarch64-linux-gnu.
Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC
redundant and fixes several targets that will write outside the array.
To avoid unnecessary large diffs, pass the maximum in the argument 'i' to
IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing
ones can be updated if desired.
Passes buildmanyglibc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE
vectors which improves the random memcpy benchmark significantly.
Cleanup the memcpy and memmove ifunc selectors.
This reverts commit 0910702c4d.
Say both a.so and b.so define protected data symbol `var` and the executable
copy relocates var. ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange
semantics: a.so accesses the copy in the executable while b.so accesses its
own. This behavior requires that (a) the compiler emits GOT-generating
relocations (b) the linker produces GLOB_DAT instead of RELATIVE.
Without the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA code, b.so's GLOB_DAT
will bind to the executable (normal behavior).
For aarch64 it makes sense to restore the original behavior and don't
pay the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA cost. The behavior is very
unlikely used by anyone.
* Clang code generator treats STV_PROTECTED the same way as STV_HIDDEN:
no GOT-generating relocation in the first place.
* gold and lld reject copy relocation on a STV_PROTECTED symbol.
* Nowadays -fpie/-fpic modes are popular. GCC/Clang's codegen uses
GOT-generating relocation when accessing an default visibility
external symbol which avoids copy relocation.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
A separate asm file is easier to maintain than a macro that expands to
inline asm.
The RTLD_START macro is only needed now because _dl_start is local in
rtld.c, but _start has to call it, if _dl_start was made hidden then it
could be empty.
_dl_skip_args is no longer needed.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
_dl_skip_args is always 0, so the target specific code that modifies
argv after relro protection is applied is no longer used.
After the patch relro protection is applied to _dl_argv consistently
on all targets.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
PI_STATIC_AND_HIDDEN indicates whether accesses to internal linkage
variables and hidden visibility variables in a shared object (ld.so)
need dynamic relocations (usually R_*_RELATIVE). PI (position
independent) in the macro name is a misnomer: a code sequence using GOT
is typically position-independent as well, but using dynamic relocations
does not meet the requirement.
Not defining PI_STATIC_AND_HIDDEN is legacy and we expect that all new
ports will define PI_STATIC_AND_HIDDEN. Current ports defining
PI_STATIC_AND_HIDDEN are more than the opposite. Change the configure
default.
No functional change.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The rtld audit support show two problems on aarch64:
1. _dl_runtime_resolve does not preserve x8, the indirect result
location register, which might generate wrong result calls
depending of the function signature.
2. The NEON Q registers pushed onto the stack by _dl_runtime_resolve
were twice the size of D registers extracted from the stack frame by
_dl_runtime_profile.
While 2. might result in wrong information passed on the PLT tracing,
1. generates wrong runtime behaviour.
The aarch64 rtld audit support is changed to:
* Both La_aarch64_regs and La_aarch64_retval are expanded to include
both x8 and the full sized NEON V registers, as defined by the
ABI.
* dl_runtime_profile needed to extract registers saved by
_dl_runtime_resolve and put them into the new correctly sized
La_aarch64_regs structure.
* The LAV_CURRENT check is change to only accept new audit modules
to avoid the undefined behavior of not save/restore x8.
* Different than other architectures, audit modules older than
LAV_CURRENT are rejected (both La_aarch64_regs and La_aarch64_retval
changed their layout and there are no requirements to support multiple
audit interface with the inherent aarch64 issues).
* A new field is also reserved on both La_aarch64_regs and
La_aarch64_retval to support variant pcs symbols.
Similar to x86, a new La_aarch64_vector type to represent the NEON
register is added on the La_aarch64_regs (so each type can be accessed
directly).
Since LAV_CURRENT was already bumped to support bind-now, there is
no need to increase it again.
Checked on aarch64-linux-gnu.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah. I don't
know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
And use machine-sp.h instead. The Linux implementation is based on
already provided CURRENT_STACK_FRAME (used on nptl code) and
STACK_GROWS_UPWARD is replaced with _STACK_GROWS_UP.
It consolidates the code required to call la_pltexit audit
callback.
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>