Commit Graph

492 Commits

Author SHA1 Message Date
Joe Ramsay
cb5d84f1f8 aarch64/fpu: Add vector variants of erf
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:32:48 +01:00
Wilco Dijkstra
2e94e2f5d2 AArch64: Check kernel version for SVE ifuncs
Old Linux kernels disable SVE after every system call.  Calling the
SVE-optimized memcpy afterwards will then cause a trap to reenable SVE.
As a result, applications with a high use of syscalls may run slower with
the SVE memcpy.  This is true for kernels between 4.15.0 and before 6.2.0,
except for 5.14.0 which was patched.  Avoid this by checking the kernel
version and selecting the SVE ifunc on modern kernels.

Parse the kernel version reported by uname() into a 24-bit kernel.major.minor
value without calling any library functions.  If uname() is not supported or
if the version format is not recognized, assume the kernel is modern.

Tested-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-03-21 16:50:51 +00:00
Adhemerval Zanella
3d53d18fc7 elf: Enable TLS descriptor tests on aarch64
The aarch64 uses 'trad' for traditional tls and 'desc' for tls
descriptors, but unlike other targets it defaults to 'desc'.  The
gnutls2 configure check does not set aarch64 as an ABI that uses
TLS descriptors, which then disable somes stests.

Also rename the internal machinery fron gnu2 to tls descriptors.

Checked on aarch64-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-03-19 14:53:30 -03:00
Szabolcs Nagy
73c26018ed aarch64: fix check for SVE support in assembler
Due to GCC bug 110901 -mcpu can override -march setting when compiling
asm code and thus a compiler targetting a specific cpu can fail the
configure check even when binutils gas supports SVE.

The workaround is that explicit .arch directive overrides both -mcpu
and -march, and since that's what the actual SVE memcpy uses the
configure check should use that too even if the GCC issue is fixed
independently.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-03-14 14:27:56 +00:00
Joe Ramsay
e302e10213 aarch64/fpu: Sync libmvec routines from 2.39 and before with AOR
This includes a fix for big-endian in AdvSIMD log, some cosmetic
changes, and numerous small optimisations mainly around inlining and
using indexed variants of MLA intrinsics.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-02-26 09:45:50 -03:00
Adhemerval Zanella Netto
ae4b8d6a0e string: Use builtins for ffs and ffsll
It allows to remove a lot of arch-specific implementations.

Checked on x86_64, aarch64, powerpc64.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2024-02-01 09:31:33 -03:00
Joseph Myers
42cc619dfb Refer to C23 in place of C2X in glibc
WG14 decided to use the name C23 as the informal name of the next
revision of the C standard (notwithstanding the publication date in
2024).  Update references to C2X in glibc to use the C23 name.

This is intended to update everything *except* where it involves
renaming files (the changes involving renaming tests are intended to
be done separately).  In the case of the _ISOC2X_SOURCE feature test
macro - the only user-visible interface involved - support for that
macro is kept for backwards compatibility, while adding
_ISOC23_SOURCE.

Tested for x86_64.
2024-02-01 11:02:01 +00:00
Sergey Bugaev
520b1df08d aarch64: Make cpu-features definitions not Linux-specific
These describe generic AArch64 CPU features, and are not tied to a
kernel-specific way of determining them. We can share them between
the Linux and Hurd AArch64 ports.

Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
Message-ID: <20240103171502.1358371-13-bugaevc@gmail.com>
2024-01-04 23:48:54 +01:00
Szabolcs Nagy
0c12c8c0cb aarch64: Add longjmp test for SME
Includes test for setcontext too.

The test directly checks after longjmp if ZA got disabled and the
ZA contents got saved following the lazy saving scheme. It does not
use ACLE code to verify that gcc can interoperate with glibc.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-01-02 16:54:21 +00:00
Szabolcs Nagy
a7373e457f aarch64: Add longjmp support for SME
For the ZA lazy saving scheme to work, longjmp has to call
__libc_arm_za_disable.

In ld.so we assume ZA is not used so longjmp does not need
special support there.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-01-02 15:43:30 +00:00
Szabolcs Nagy
d3c32ae207 aarch64: Add SME runtime support
The runtime support routines for the call ABI of the Scalable Matrix
Extension (SME) are mostly in libgcc. Since libc.so cannot depend on
libgcc_s.so have an implementation of __arm_za_disable in libc for
libc internal use in longjmp and similar APIs.

__libc_arm_za_disable follows the same PCS rules as __arm_za_disable,
but it's a hidden symbol so it does not need variant PCS marking.

Using __libc_fatal instead of abort because it can print a message and
works in ld.so too. But for now we don't need SME routines in ld.so.

To check the SME HWCAP in asm, we need the _dl_hwcap2 member offset in
_rtld_global_ro in the shared libc.so, while in libc.a the _dl_hwcap2
object is accessed.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-01-02 15:43:30 +00:00
Paul Eggert
dff8da6b3e Update copyright dates with scripts/update-copyrights 2024-01-01 10:53:40 -08:00
Joe Ramsay
667f277c78 aarch64: Add SIMD attributes to math functions with vector versions
Added annotations for autovec by GCC and GFortran - this enables GCC
>= 9 to autovectorise math calls at -Ofast.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-12-20 08:41:25 +00:00
Joe Ramsay
cc0d77ba94 aarch64: Add half-width versions of AdvSIMD f32 libmvec routines
Compilers may emit calls to 'half-width' routines (two-lane
single-precision variants). These have been added in the form of
wrappers around the full-width versions, where the low half of the
vector is simply duplicated. This will perform poorly when one lane
triggers the special-case handler, as there will be a redundant call
to the scalar version, however this is expected to be rare at Ofast.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-12-20 08:41:25 +00:00
Andreas Schwab
3f79842788 aarch64: correct CFI in rawmemchr (bug 31113)
The .cfi_return_column directive changes the return column for the whole
FDE range.  But the actual intent is to tell the unwinder that the value
in x30 (lr) now resides in x15 after the move, and that is expressed by
the .cfi_register directive.
2023-12-05 12:49:37 +01:00
Szabolcs Nagy
8e755f5bc8 aarch64: fix tested ifunc variants
Don't test a64fx string functions when BTI is enabled since they are
not BTI compatible.
2023-12-04 14:41:26 +00:00
Joe Ramsay
7b12776584 aarch64: Improve special-case handling in AdvSIMD double-precision libmvec routines
Avoids emitting many saves/restores of vector registers, reduces the
amount of code generated around the scalar fallback.
2023-11-29 15:03:36 +00:00
Joe Ramsay
bd70d3bacf aarch64: Fix libmvec benchmarks
These were broken by the new atan2 functions, as they were only
set up for univariate functions. Arity is now detected from the
input file - this revealed a mistake that the double-precision
inputs were being used for both single- and double-precision
routines, which is now remedied.
2023-11-22 09:10:43 +00:00
Adhemerval Zanella
55f41ef8de elf: Remove LD_PROFILE for static binaries
The _dl_non_dynamic_init does not parse LD_PROFILE, which does not
enable profile for dlopen objects.  Since dlopen is deprecated for
static objects, it is better to remove the support.

It also allows to trim down libc.a of profile support.

Checked on x86_64-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2023-11-21 16:15:42 -03:00
Joe Ramsay
a8830c9285 aarch64: Add vector implementations of expm1 routines
May discard sign of 0 - auto tests for -0 and -0x1p-10000 updated accordingly.
2023-11-20 17:53:14 +00:00
Wilco Dijkstra
2f5524cc53 AArch64: Remove Falkor memcpy
The latest implementations of memcpy are actually faster than the Falkor
implementations [1], so remove the falkor/phecda ifuncs for memcpy and
the now unused IS_FALKOR/IS_PHECDA defines.

[1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2023-11-13 16:52:50 +00:00
Wilco Dijkstra
3d7090f14b AArch64: Add memset_zva64
Add a specialized memset for the common ZVA size of 64 to avoid the
overhead of reading the ZVA size.  Since the code is identical to
__memset_falkor, remove the latter.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2023-11-13 16:50:44 +00:00
Wilco Dijkstra
9627ab99b5 AArch64: Cleanup emag memset
Cleanup emag memset - merge the memset_base64.S file, remove
the unused ZVA code (since it is disabled on emag).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2023-11-13 16:45:47 +00:00
Joe Ramsay
3548a4f087 aarch64: Add vector implementations of log1p routines
May discard sign of zero.
2023-11-10 17:07:43 +00:00
Joe Ramsay
b07038c5d3 aarch64: Add vector implementations of atan2 routines 2023-11-10 17:07:43 +00:00
Joe Ramsay
d30c39f80d aarch64: Add vector implementations of atan routines 2023-11-10 17:07:42 +00:00
Joe Ramsay
b5d23367a8 aarch64: Add vector implementations of acos routines 2023-11-10 17:07:42 +00:00
Joe Ramsay
9bed498418 aarch64: Add vector implementations of asin routines 2023-11-10 17:07:42 +00:00
Wilco Dijkstra
9fd3409842 AArch64: Cleanup ifuncs
Cleanup ifuncs.  Remove uses of libc_hidden_builtin_def, use ENTRY rather than
ENTRY_ALIGN, remove unnecessary defines and conditional compilation.  Rename
strlen_mte to strlen_generic.  Remove rtld-memset.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-11-01 13:41:59 +00:00
Wilco Dijkstra
2bd0017988 AArch64: Add support for MOPS memcpy/memmove/memset
Add support for MOPS in cpu_features and INIT_ARCH.  Add ifuncs using MOPS for
memcpy, memmove and memset (use .inst for now so it works with all binutils
versions without needing complex configure and conditional compilation).

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-10-24 13:37:48 +01:00
Joe Ramsay
31aaf6fed9 aarch64: Add vector implementations of exp10 routines
Double-precision routines either reuse the exp table (AdvSIMD) or use
SVE FEXPA intruction.
2023-10-23 15:00:45 +01:00
Joe Ramsay
067a34156c aarch64: Add vector implementations of log10 routines
A table is also added, which is shared between AdvSIMD and SVE log10.
2023-10-23 15:00:45 +01:00
Joe Ramsay
a8e3ab3074 aarch64: Add vector implementations of log2 routines
A table is also added, which is shared between AdvSIMD and SVE log2.
2023-10-23 15:00:45 +01:00
Joe Ramsay
b39e9db5e3 aarch64: Add vector implementations of exp2 routines
Some routines reuse table from v_exp_data.c
2023-10-23 15:00:45 +01:00
Joe Ramsay
f554334c05 aarch64: Add vector implementations of tan routines
This includes some utility headers for evaluating polynomials using
various schemes.
2023-10-23 15:00:44 +01:00
Joe Ramsay
5a4b6f8e4b aarch64: Optimise vecmath logs
* Transpose table layout for improved memory access
* Use half-vector special comparisons for AdvSIMD
* Improve register use near special-case branches
  - Due to the presence of a function call, return value would get
    mov-d out of x0 in order to facilitate PCS. By moving the final
    computation after the branch this can be avoided

Also change SVE routines to use overloaded intrinsics for readability.
2023-10-05 16:54:16 +01:00
Joe Ramsay
480a0dfe1a aarch64: Cosmetic change in SVE exp routines
Use overloaded intrinsics for readability. Codegen does not
change, however while we're bringing the routines up-to-date with
recent improvements to other routines in AOR it is worth copying
this change over as well.
2023-10-05 16:54:00 +01:00
Joe Ramsay
9180160e08 aarch64: Optimize SVE cos & cosf
Saves a mov by ensuring return value does not need to be moved out of
the way before special-case branch. Also change to use overloaded
intrinsics.
2023-10-05 16:53:38 +01:00
Joe Ramsay
8014d1e832 aarch64: Improve vecmath sin routines
* Update ULP comment reflecting a new observed max in [-pi/2, pi/2]
* Use the same polynomial in AdvSIMD and SVE, rather than FTRIG instructions
* Improve register use near special-case branch

Also use overloaded intrinsics for SVE.
2023-10-05 16:53:06 +01:00
Wilco Dijkstra
6b695e5c62 AArch64: Remove -0.0 check from vector sin
Remove the unnecessary extra checks for sin (-0.0) from vector sin/sinf,
improving performance.  Passes regress.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-09-26 13:40:07 +01:00
Siddhesh Poyarekar
c6cb8783b5 configure: Use autoconf 2.71
Bump autoconf requirement to 2.71 to allow regenerating configure on
more recent distributions.  autoconf 2.71 has been in Fedora since F36
and is the current version in Debian stable (bookworm).  It appears to
be current in Gentoo as well.

All sysdeps configure and preconfigure scripts have also been
regenerated; all changes are trivial transformations that do not affect
functionality.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2023-07-17 10:08:10 -04:00
Joe Ramsay
4a9392ffc2 aarch64: Add vector implementations of exp routines
Optimised implementations for single and double precision, Advanced
SIMD and SVE, copied from Arm Optimized Routines.

As previously, data tables are used via a barrier to prevent
overly aggressive constant inlining. Special-case handlers are
marked NOINLINE to avoid incurring the penalty of switching call
standards unnecessarily.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:26 +01:00
Joe Ramsay
78c01a5cbe aarch64: Add vector implementations of log routines
Optimised implementations for single and double precision, Advanced
SIMD and SVE, copied from Arm Optimized Routines. Log lookup table
added as HIDDEN symbol to allow it to be shared between AdvSIMD and
SVE variants.

As previously, data tables are used via a barrier to prevent
overly aggressive constant inlining. Special-case handlers are
marked NOINLINE to avoid incurring the penalty of switching call
standards unnecessarily.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:22 +01:00
Joe Ramsay
3bb1af2051 aarch64: Add vector implementations of sin routines
Optimised implementations for single and double precision, Advanced
SIMD and SVE, copied from Arm Optimized Routines.

As previously, data tables are used via a barrier to prevent
overly aggressive constant inlining. Special-case handlers are
marked NOINLINE to avoid incurring the penalty of switching call
standards unnecessarily.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:16 +01:00
Joe Ramsay
aed39a3aa3 aarch64: Add vector implementations of cos routines
Replace the loop-over-scalar placeholder routines with optimised
implementations from Arm Optimized Routines (AOR).

Also add some headers containing utilities for aarch64 libmvec
routines, and update libm-test-ulps.

Data tables for new routines are used via a pointer with a
barrier on it, in order to prevent overly aggressive constant
inlining in GCC. This allows a single adrp, combined with offset
loads, to be used for every constant in the table.

Special-case handlers are marked NOINLINE in order to confine the
save/restore overhead of switching from vector to normal calling
standard. This way we only incur the extra memory access in the
exceptional cases. NOINLINE definitions have been moved to
math_private.h in order to reduce duplication.

AOR exposes a config option, WANT_SIMD_EXCEPT, to enable
selective masking (and later fixing up) of invalid lanes, in
order to trigger fp exceptions correctly (AdvSIMD only). This is
tested and maintained in AOR, however it is configured off at
source level here for performance reasons. We keep the
WANT_SIMD_EXCEPT blocks in routine sources to greatly simplify
the upstreaming process from AOR to glibc.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:10 +01:00
Paul Pluzhnikov
2cbeda847b Fix a few more typos I missed in previous round -- BZ 25337 2023-06-02 23:46:32 +00:00
Paul Pluzhnikov
65cc53fe7c Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
Szabolcs Nagy
642f1b9b3d aarch64: More configure checks for libmvec
Check assembler and linker support too, not just SVE ACLE in the
compiler, since variant PCS requires at least binutils 2.32.1.
2023-05-05 11:34:44 +01:00
Szabolcs Nagy
ee68e9cba4 aarch64: SVE ACLE configure test cleanups
Use more idiomatic configure test for better autoconf cache and logs.
2023-05-05 10:28:29 +01:00
Szabolcs Nagy
1a62d7e5c3 aarch64: fix SVE ACLE check for bootstrap glibc builds
arm_sve.h depends on stdint.h but that relies on libc headers unless
compiled in freestanding mode.  Without this change a bootstrap glibc
build (that uses a compiler without installed libc headers) failed with

checking for availability of SVE ACLE... In file included from [...]/arm_sve.h:28,
                 from conftest.c:1:
[...]/stdint.h:9:16: fatal error: stdint.h: No such file or directory
    9 | # include_next <stdint.h>
      |                ^~~~~~~~~~
compilation terminated.
configure: error: mathvec is enabled but compiler does not have SVE ACLE. [...]
2023-05-04 10:19:11 +01:00
Joe Ramsay
cd94326a13 Enable libmvec support for AArch64
This patch enables libmvec on AArch64. The proposed change is mainly
implementing build infrastructure to add the new routines to ABI,
tests and benchmarks. I have demonstrated how this all fits together
by adding implementations for vector cos, in both single and double
precision, targeting both Advanced SIMD and SVE.

The implementations of the routines themselves are just loops over the
scalar routine from libm for now, as we are more concerned with
getting the plumbing right at this point. We plan to contribute vector
routines from the Arm Optimized Routines repo that are compliant with
requirements described in the libmvec wiki.

Building libmvec requires minimum GCC 10 for SVE ACLE. To avoid raising
the minimum GCC by such a big jump, we allow users to disable libmvec
if their compiler is too old.

Note that at this point users have to manually call the vector math
functions. This seems to be acceptable to some downstream users.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-05-03 12:09:49 +01:00
Szabolcs Nagy
2ce48fbd5a aarch64: update libm test ulps 2023-02-24 10:55:38 +00:00
Jun Tang
311a7e0256 AArch64: Fix HP_TIMING_DIFF computation [BZ# 29329]
Fix the computation to allow for cntfrq_el0 being larger than 1GHz.
Assume cntfrq_el0 is a multiple of 1MHz to increase the maximum
interval (1024 seconds at 1GHz).

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2023-02-22 16:45:59 +00:00
Adhemerval Zanella
a9b3b770f5 string: Remove string_private.h
Now that _STRING_ARCH_unaligned is not used anymore.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2023-02-17 15:56:54 -03:00
Adhemerval Zanella
22999b2f0f string: Add libc_hidden_proto for memrchr
Although static linker can optimize it to local call, it follows the
internal scheme to provide hidden proto and definitions.

Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
2023-02-08 17:13:58 -03:00
Adhemerval Zanella
7ea510127e string: Add libc_hidden_proto for strchrnul
Although static linker can optimize it to local call, it follows the
internal scheme to provide hidden proto and definitions.

Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
2023-02-08 17:13:56 -03:00
Wilco Dijkstra
d2d3f3720c AArch64: Improve SVE memcpy and memmove
Improve SVE memcpy by copying 2 vectors if the size is small enough.
This improves performance of random memcpy by ~9% on Neoverse V1, and
33-64 byte copies are ~16% faster.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-02-06 16:15:34 +00:00
Wilco Dijkstra
55599d4804 AArch64: Improve strrchr
Use shrn for narrowing the mask which simplifies code and speeds up small
strings.  Unroll the first search loop to improve performance on large
strings.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
ad098893ba AArch64: Optimize strnlen
Optimize strnlen using the shrn instruction and improve the main loop.
Small strings are around 10% faster, large strings are 40% faster on
modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
03c8ce5000 AArch64: Optimize strlen
Optimize strlen by unrolling the main loop.  Large strings are 64% faster on
modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
349e48c01e AArch64: Optimize strcpy
Unroll the main loop.  Large strings are around 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
09ebd8549b AArch64: Improve strchrnul
Unroll the main loop, which improves performance slightly.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
51541a2297 AArch64: Optimize strchr
Simplify calculation of the mask using shrn.  Unroll the main loop.
Small strings are 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
1bbb1a2022 AArch64: Improve strlen_asimd
Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment.
Performance improves slightly as a result.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
0077624177 AArch64: Optimize memrchr
Optimize the main loop - large strings are 43% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
ce758d4f06 AArch64: Optimize memchr
Optimize the main loop - large strings are 40% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Joseph Myers
6d7e8eda9b Update copyright dates with scripts/update-copyrights 2023-01-06 21:14:39 +00:00
Vladislav Khmelevsky
eb4181e9f4 elf: Fix rtld-audit trampoline for aarch64
This patch fixes two problems with audit:

  1. The DL_OFFSET_RV_VPCS offset was mixed up with DL_OFFSET_RG_VPCS,
     resulting in x2 register value nulling in RG structure.

  2. We need to preserve the x8 register before function call, but
     don't have to save it's new value and restore it before return.

Anyway the final restore was using OFFSET_RV instead of OFFSET_RG value
which is wrong (althoug doesn't affect anything).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-11-21 10:20:40 -03:00
Florian Weimer
1f34a23288 elf: Introduce <dl-call_tls_init_tp.h> and call_tls_init_tp (bug 29249)
This makes it more likely that the compiler can compute the strlen
argument in _startup_fatal at compile time, which is required to
avoid a dependency on strlen this early during process startup.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2022-11-03 17:28:03 +01:00
Adhemerval Zanella
8d98c7c00f configure: Use -Wno-ignored-attributes if compiler warns about multiple aliases
clang emits an warning when a double alias redirection is used, to warn
the the original symbol will be used even when weak definition is
overridden.  However, this is a common pattern for weak_alias, where
multiple alias are set to same symbol.

Reviewed-by: Fangrui Song <maskray@google.com>
2022-11-01 09:51:06 -03:00
Szabolcs Nagy
dd31f42e39 aarch64: Don't build wordcopy
Use an empty wordcopy.c to avoid building the generic one.
It does not seem to be used anywhere.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-10-28 11:14:54 +01:00
Wilco Dijkstra
e6f3fe362f aarch64: Use memcpy_simd as the default memcpy
Since __memcpy_simd is the fastest memcpy on almost all cores, replace
the generic memcpy with it.  If SVE is available, a SVE memcpy will be
used by default (including for Neoverse N2).
2022-10-26 14:16:50 +01:00
Wilco Dijkstra
a8e72913fe aarch64: Cleanup memset ifunc
Cleanup memset ifunc selectors. The A64FX memset relies on a ZVA size of
256, so add an explicit check.
2022-10-26 14:12:55 +01:00
Florian Weimer
58548b9d68 Use PTR_MANGLE and PTR_DEMANGLE unconditionally in C sources
In the future, this will result in a compilation failure if the
macros are unexpectedly undefined (due to header inclusion ordering
or header inclusion missing altogether).

Assembler sources are more difficult to convert.  In many cases,
they are hand-optimized for the mangling and no-mangling variants,
which is why they are not converted.

sysdeps/s390/s390-32/__longjmp.c and sysdeps/s390/s390-64/__longjmp.c
are special: These are C sources, but most of the implementation is
in assembler, so the PTR_DEMANGLE macro has to be undefined in some
cases, to match the assembler style.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-10-18 17:04:10 +02:00
Florian Weimer
88f4b6929c Introduce <pointer_guard.h>, extracted from <sysdep.h>
This allows us to define a generic no-op version of PTR_MANGLE and
PTR_DEMANGLE.  In the future, we can use PTR_MANGLE and PTR_DEMANGLE
unconditionally in C sources, avoiding an unintended loss of hardening
due to missing include files or unlucky header inclusion ordering.

In i386 and x86_64, we can avoid a <tls.h> dependency in the C
code by using the computed constant from <tcb-offsets.h>.  <sysdep.h>
no longer includes these definitions, so there is no cyclic dependency
anymore when computing the <tcb-offsets.h> constants.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-10-18 17:03:55 +02:00
Adhemerval Zanella
5355f9ca7b elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support
Besides the option being gcc specific, this approach is still fragile
and not future proof since we do not know if this will be the only
optimization option gcc will add that transforms loops to memset
(or any libcall).

This patch adds a new header, dl-symbol-redir-ifunc.h, that can b
used to redirect the compiler generated libcalls to port the generic
memset implementation if required.

Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-10-10 10:32:28 -03:00
Wilco Dijkstra
22f4ab2d20 Use atomic_exchange_release/acquire
Rename atomic_exchange_rel/acq to use atomic_exchange_release/acquire
since these map to the standard C11 atomic builtins.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-09-26 16:58:08 +01:00
Szabolcs Nagy
06d4381dd8 csu: Change start code license to have link exception
The start code can get linked into dynamic linked executables where
LGPL would require shipping the source or linkable binaries when the
executable is distributed.

On some targets the license exception was missing in start.S (which
is compiled into crt1.o and Scrt1.o which may end up linked into PDE
and PIE binaries).

I did not review what other code may end up in executables, just
fixed the start.S license inconsistency across targets.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-08-26 09:14:53 +01:00
Wilco Dijkstra
12182ba18d AArch64: Fix typo in sve configure check (BZ# 29394)
Fix a typo in the SVE configure check. This fixes [BZ# 29394].
2022-08-11 17:52:00 +01:00
Jason A. Donenfeld
eaad4f9e8f arc4random: simplify design for better safety
Rather than buffering 16 MiB of entropy in userspace (by way of
chacha20), simply call getrandom() every time.

This approach is doubtlessly slower, for now, but trying to prematurely
optimize arc4random appears to be leading toward all sorts of nasty
properties and gotchas. Instead, this patch takes a much more
conservative approach. The interface is added as a basic loop wrapper
around getrandom(), and then later, the kernel and libc together can
work together on optimizing that.

This prevents numerous issues in which userspace is unaware of when it
really must throw away its buffer, since we avoid buffering all
together. Future improvements may include userspace learning more from
the kernel about when to do that, which might make these sorts of
chacha20-based optimizations more possible. The current heuristic of 16
MiB is meaningless garbage that doesn't correspond to anything the
kernel might know about. So for now, let's just do something
conservative that we know is correct and won't lead to cryptographic
issues for users of this function.

This patch might be considered along the lines of, "optimization is the
root of all evil," in that the much more complex implementation it
replaces moves too fast without considering security implications,
whereas the incremental approach done here is a much safer way of going
about things. Once this lands, we can take our time in optimizing this
properly using new interplay between the kernel and userspace.

getrandom(0) is used, since that's the one that ensures the bytes
returned are cryptographically secure. But on systems without it, we
fallback to using /dev/urandom. This is unfortunate because it means
opening a file descriptor, but there's not much of a choice. Secondly,
as part of the fallback, in order to get more or less the same
properties of getrandom(0), we poll on /dev/random, and if the poll
succeeds at least once, then we assume the RNG is initialized. This is a
rough approximation, as the ancient "non-blocking pool" initialized
after the "blocking pool", not before, and it may not port back to all
ancient kernels, though it does to all kernels supported by glibc
(≥3.2), so generally it's the best approximation we can do.

The motivation for including arc4random, in the first place, is to have
source-level compatibility with existing code. That means this patch
doesn't attempt to litigate the interface itself. It does, however,
choose a conservative approach for implementing it.

Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Cristian Rodríguez <crrodriguez@opensuse.org>
Cc: Paul Eggert <eggert@cs.ucla.edu>
Cc: Mark Harris <mark.hsj@gmail.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-07-27 08:58:27 -03:00
Adhemerval Zanella Netto
4c128c7823 aarch64: Add optimized chacha20
It adds vectorized ChaCha20 implementation based on libgcrypt
cipher/chacha20-aarch64.S.  It is used as default and only
little-endian is supported (BE uses generic code).

As for generic implementation, the last step that XOR with the
input is omited.  The final state register clearing is also
omitted.

On a virtualized Linux on Apple M1 it shows the following
improvements (using formatted bench-arc4random data):

GENERIC                                    MB/s
-----------------------------------------------
arc4random [single-thread]               380.89
arc4random_buf(16) [single-thread]       500.73
arc4random_buf(32) [single-thread]       552.61
arc4random_buf(48) [single-thread]       566.82
arc4random_buf(64) [single-thread]       574.01
arc4random_buf(80) [single-thread]       581.02
arc4random_buf(96) [single-thread]       591.19
arc4random_buf(112) [single-thread]      592.29
arc4random_buf(128) [single-thread]      596.43
-----------------------------------------------

OPTIMIZED                                  MB/s
-----------------------------------------------
arc4random [single-thread]               569.60
arc4random_buf(16) [single-thread]       825.78
arc4random_buf(32) [single-thread]       987.03
arc4random_buf(48) [single-thread]      1042.39
arc4random_buf(64) [single-thread]      1075.50
arc4random_buf(80) [single-thread]      1094.68
arc4random_buf(96) [single-thread]      1130.16
arc4random_buf(112) [single-thread]     1129.58
arc4random_buf(128) [single-thread]     1137.91
-----------------------------------------------

Checked on aarch64-linux-gnu.
2022-07-22 11:58:27 -03:00
Danila Kutenin
3c99806989 aarch64: Optimize string functions with shrn instruction
We found that string functions were using AND+ADDP
to find the nibble/syndrome mask but there is an easier
opportunity through `SHRN dst.8b, src.8h, 4` (shift
right every 2 bytes by 4 and narrow to 1 byte) and has
same latency on all SIMD ARMv8 targets as ADDP. There
are also possible gaps for memcmp but that's for
another patch.

We see 10-20% savings for small-mid size cases (<=128)
which are primary cases for general workloads.
2022-07-06 09:26:20 +01:00
Tejas Belagod
05844d18f7 AArch64: Reset HWCAP2_AFP bits in FPCR for default fenv
The AFP feature (Alternate floating-point behavior) was added in armv8.7 and
introduced new FPCR bits.

Currently, HWCAP2_AFP bits (bit 0, 1, 2) in FPCR are preserved when fenv is
set to default environment.  This is a deviation from standard behaviour.
Clear these bits when setting the fenv to default.

There is no libc API to modify the new FPCR bits.  Restoring those bits matters
if the user changed them directly.
2022-07-05 14:01:17 +01:00
Fangrui Song
e89913d0aa aarch64: Handle fewer relocations for RTLD_BOOTSTRAP
The RTLD_BOOTSTRAP branch is used to relocate ld.so itself.  It only
needs to handle RELATIVE, GLOB_DAT, and JUMP_SLOT.
TLSDESC/TLS_DTPMOD/TLS_DTPREL handling can be removed.  Remove
`case AARCH64_R(RELATIVE)` as well as elf_machine_rela has checked it.

Tested on aarch64-linux-gnu.
2022-06-15 19:21:53 -07:00
Wilco Dijkstra
fdaf78656f Add bounds check to __libc_ifunc_impl_list
Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC
redundant and fixes several targets that will write outside the array.
To avoid unnecessary large diffs, pass the maximum in the argument 'i' to
IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing
ones can be updated if desired.

Passes buildmanyglibc.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-06-10 17:13:29 +01:00
Wilco Dijkstra
eea282d9c6 AArch64: Sort makefile entries
Sort makefile entries to reduce conflicts.
2022-06-07 16:58:15 +01:00
Wilco Dijkstra
9f298bfe1f AArch64: Add SVE memcpy
Add an initial SVE memcpy implementation.  Copies up to 32 bytes use SVE
vectors which improves the random memcpy benchmark significantly.
Cleanup the memcpy and memmove ifunc selectors.
2022-06-07 16:58:03 +01:00
Fangrui Song
e555954e02 Revert "[AArch64][BZ #17711] Fix extern protected data handling"
This reverts commit 0910702c4d.

Say both a.so and b.so define protected data symbol `var` and the executable
copy relocates var.  ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange
semantics: a.so accesses the copy in the executable while b.so accesses its
own.  This behavior requires that (a) the compiler emits GOT-generating
relocations (b) the linker produces GLOB_DAT instead of RELATIVE.

Without the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA code, b.so's GLOB_DAT
will bind to the executable (normal behavior).

For aarch64 it makes sense to restore the original behavior and don't
pay the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA cost.  The behavior is very
unlikely used by anyone.

* Clang code generator treats STV_PROTECTED the same way as STV_HIDDEN:
  no GOT-generating relocation in the first place.
* gold and lld reject copy relocation on a STV_PROTECTED symbol.
* Nowadays -fpie/-fpic modes are popular.  GCC/Clang's codegen uses
  GOT-generating relocation when accessing an default visibility
  external symbol which avoids copy relocation.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2022-05-23 13:37:05 -07:00
Szabolcs Nagy
1da064c015 aarch64: Move ld.so _start to separate file and drop _dl_skip_args
A separate asm file is easier to maintain than a macro that expands to
inline asm.

The RTLD_START macro is only needed now because _dl_start is local in
rtld.c, but _start has to call it, if _dl_start was made hidden then it
could be empty.

_dl_skip_args is no longer needed.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2022-05-17 10:14:03 +01:00
Szabolcs Nagy
86147bbeec rtld: Remove DL_ARGV_NOT_RELRO and make _dl_skip_args const
_dl_skip_args is always 0, so the target specific code that modifies
argv after relro protection is applied is no longer used.

After the patch relro protection is applied to _dl_argv consistently
on all targets.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2022-05-17 10:14:03 +01:00
Fangrui Song
098a657fe4 elf: Replace PI_STATIC_AND_HIDDEN with opposite HIDDEN_VAR_NEEDS_DYNAMIC_RELOC
PI_STATIC_AND_HIDDEN indicates whether accesses to internal linkage
variables and hidden visibility variables in a shared object (ld.so)
need dynamic relocations (usually R_*_RELATIVE). PI (position
independent) in the macro name is a misnomer: a code sequence using GOT
is typically position-independent as well, but using dynamic relocations
does not meet the requirement.

Not defining PI_STATIC_AND_HIDDEN is legacy and we expect that all new
ports will define PI_STATIC_AND_HIDDEN. Current ports defining
PI_STATIC_AND_HIDDEN are more than the opposite. Change the configure
default.

No functional change.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-04-26 09:26:22 -07:00
Ben Woodard
ce9a68c57c elf: Fix runtime linker auditing on aarch64 (BZ #26643)
The rtld audit support show two problems on aarch64:

  1. _dl_runtime_resolve does not preserve x8, the indirect result
      location register, which might generate wrong result calls
      depending of the function signature.

  2. The NEON Q registers pushed onto the stack by _dl_runtime_resolve
     were twice the size of D registers extracted from the stack frame by
     _dl_runtime_profile.

While 2. might result in wrong information passed on the PLT tracing,
1. generates wrong runtime behaviour.

The aarch64 rtld audit support is changed to:

  * Both La_aarch64_regs and La_aarch64_retval are expanded to include
    both x8 and the full sized NEON V registers, as defined by the
    ABI.

  * dl_runtime_profile needed to extract registers saved by
    _dl_runtime_resolve and put them into the new correctly sized
    La_aarch64_regs structure.

  * The LAV_CURRENT check is change to only accept new audit modules
    to avoid the undefined behavior of not save/restore x8.

  * Different than other architectures, audit modules older than
    LAV_CURRENT are rejected (both La_aarch64_regs and La_aarch64_retval
    changed their layout and there are no requirements to support multiple
    audit interface with the inherent aarch64 issues).

  * A new field is also reserved on both La_aarch64_regs and
    La_aarch64_retval to support variant pcs symbols.

Similar to x86, a new La_aarch64_vector type to represent the NEON
register is added on the La_aarch64_regs (so each type can be accessed
directly).

Since LAV_CURRENT was already bumped to support bind-now, there is
no need to increase it again.

Checked on aarch64-linux-gnu.

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
2022-02-01 14:49:46 -03:00
Wilco Dijkstra
e5fa62b8db AArch64: Check for SVE in ifuncs [BZ #28744]
Add a check for SVE in the A64FX ifuncs for memcpy, memset and memmove.
This fixes BZ #28744.
2022-01-06 14:36:28 +00:00
Paul Eggert
581c785bf3 Update copyright dates with scripts/update-copyrights
I used these shell commands:

../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.

I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah.  I don't
know why I run into these diagnostics whereas others evidently do not.

remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
2022-01-01 11:40:24 -08:00
Adhemerval Zanella
83b8d5027d malloc: Remove memusage.h
And use machine-sp.h instead.  The Linux implementation is based on
already provided CURRENT_STACK_FRAME (used on nptl code) and
STACK_GROWS_UPWARD is replaced with _STACK_GROWS_UP.
2021-12-28 14:57:57 -03:00
Adhemerval Zanella
92ff345137 Remove atomic-machine.h atomic typedefs
Now that memusage.c uses generic types we can remove them.
2021-12-28 14:57:57 -03:00
Adhemerval Zanella
8c0664e2b8 elf: Add _dl_audit_pltexit
It consolidates the code required to call la_pltexit audit
callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
2021-12-28 08:40:38 -03:00
Adhemerval Zanella
691d9ae9e6 Remove ununsed tcb-offset
Some architectures do not use the auto-generated tcb-offsets.h.
2021-12-17 17:47:29 -03:00
Adhemerval Zanella
ecb94e9587 aarch64: Add math-use-builtins-f{max,min}.h
It allows to remove the arch-specific implementations.
2021-12-13 10:08:46 -03:00
Florian Weimer
627f5ede70 Remove TLS_TCB_ALIGN and TLS_INIT_TCB_ALIGN
TLS_INIT_TCB_ALIGN is not actually used.  TLS_TCB_ALIGN was likely
introduced to support a configuration where the thread pointer
has not the same alignment as THREAD_SELF.  Only ia64 seems to use
that, but for the stack/pointer guard, not for storing tcbhead_t.
Some ports use TLS_TCB_OFFSET and TLS_PRE_TCB_SIZE to shift
the thread pointer, potentially landing in a different residue class
modulo the alignment, but the changes should not impact that.

In general, given that TLS variables have their own alignment
requirements, having different alignment for the (unshifted) thread
pointer and struct pthread would potentially result in dynamic
offsets, leading to more complexity.

hppa had different values before: __alignof__ (tcbhead_t), which
seems to be 4, and __alignof__ (struct pthread), which was 8
(old default) and is now 32.  However, it defines THREAD_SELF as:

/* Return the thread descriptor for the current thread.  */
# define THREAD_SELF \
  ({ struct pthread *__self;			\
	__self = __get_cr27();			\
	__self - 1;				\
   })

So the thread pointer points after struct pthread (hence __self - 1),
and they have to have the same alignment on hppa as well.

Similarly, on ia64, the definitions were different.  We have:

# define TLS_PRE_TCB_SIZE \
  (sizeof (struct pthread)						\
   + (PTHREAD_STRUCT_END_PADDING < 2 * sizeof (uintptr_t)		\
      ? ((2 * sizeof (uintptr_t) + __alignof__ (struct pthread) - 1)	\
	 & ~(__alignof__ (struct pthread) - 1))				\
      : 0))
# define THREAD_SELF \
  ((struct pthread *) ((char *) __thread_self - TLS_PRE_TCB_SIZE))

And TLS_PRE_TCB_SIZE is a multiple of the struct pthread alignment
(confirmed by the new _Static_assert in sysdeps/ia64/libc-tls.c).

On m68k, we have a larger gap between tcbhead_t and struct pthread.
But as far as I can tell, the port is fine with that.  The definition
of TCB_OFFSET is sufficient to handle the shifted TCB scenario.

This fixes commit 23c77f6018
("nptl: Increase default TCB alignment to 32").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-12-09 23:47:49 +01:00