Commit Graph

410 Commits

Author SHA1 Message Date
Joe Ramsay
a8e3ab3074 aarch64: Add vector implementations of log2 routines
A table is also added, which is shared between AdvSIMD and SVE log2.
2023-10-23 15:00:45 +01:00
Joe Ramsay
b39e9db5e3 aarch64: Add vector implementations of exp2 routines
Some routines reuse table from v_exp_data.c
2023-10-23 15:00:45 +01:00
Joe Ramsay
f554334c05 aarch64: Add vector implementations of tan routines
This includes some utility headers for evaluating polynomials using
various schemes.
2023-10-23 15:00:44 +01:00
Joe Ramsay
5a4b6f8e4b aarch64: Optimise vecmath logs
* Transpose table layout for improved memory access
* Use half-vector special comparisons for AdvSIMD
* Improve register use near special-case branches
  - Due to the presence of a function call, return value would get
    mov-d out of x0 in order to facilitate PCS. By moving the final
    computation after the branch this can be avoided

Also change SVE routines to use overloaded intrinsics for readability.
2023-10-05 16:54:16 +01:00
Joe Ramsay
480a0dfe1a aarch64: Cosmetic change in SVE exp routines
Use overloaded intrinsics for readability. Codegen does not
change, however while we're bringing the routines up-to-date with
recent improvements to other routines in AOR it is worth copying
this change over as well.
2023-10-05 16:54:00 +01:00
Joe Ramsay
9180160e08 aarch64: Optimize SVE cos & cosf
Saves a mov by ensuring return value does not need to be moved out of
the way before special-case branch. Also change to use overloaded
intrinsics.
2023-10-05 16:53:38 +01:00
Joe Ramsay
8014d1e832 aarch64: Improve vecmath sin routines
* Update ULP comment reflecting a new observed max in [-pi/2, pi/2]
* Use the same polynomial in AdvSIMD and SVE, rather than FTRIG instructions
* Improve register use near special-case branch

Also use overloaded intrinsics for SVE.
2023-10-05 16:53:06 +01:00
Wilco Dijkstra
6b695e5c62 AArch64: Remove -0.0 check from vector sin
Remove the unnecessary extra checks for sin (-0.0) from vector sin/sinf,
improving performance.  Passes regress.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-09-26 13:40:07 +01:00
Siddhesh Poyarekar
c6cb8783b5 configure: Use autoconf 2.71
Bump autoconf requirement to 2.71 to allow regenerating configure on
more recent distributions.  autoconf 2.71 has been in Fedora since F36
and is the current version in Debian stable (bookworm).  It appears to
be current in Gentoo as well.

All sysdeps configure and preconfigure scripts have also been
regenerated; all changes are trivial transformations that do not affect
functionality.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2023-07-17 10:08:10 -04:00
Joe Ramsay
4a9392ffc2 aarch64: Add vector implementations of exp routines
Optimised implementations for single and double precision, Advanced
SIMD and SVE, copied from Arm Optimized Routines.

As previously, data tables are used via a barrier to prevent
overly aggressive constant inlining. Special-case handlers are
marked NOINLINE to avoid incurring the penalty of switching call
standards unnecessarily.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:26 +01:00
Joe Ramsay
78c01a5cbe aarch64: Add vector implementations of log routines
Optimised implementations for single and double precision, Advanced
SIMD and SVE, copied from Arm Optimized Routines. Log lookup table
added as HIDDEN symbol to allow it to be shared between AdvSIMD and
SVE variants.

As previously, data tables are used via a barrier to prevent
overly aggressive constant inlining. Special-case handlers are
marked NOINLINE to avoid incurring the penalty of switching call
standards unnecessarily.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:22 +01:00
Joe Ramsay
3bb1af2051 aarch64: Add vector implementations of sin routines
Optimised implementations for single and double precision, Advanced
SIMD and SVE, copied from Arm Optimized Routines.

As previously, data tables are used via a barrier to prevent
overly aggressive constant inlining. Special-case handlers are
marked NOINLINE to avoid incurring the penalty of switching call
standards unnecessarily.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:16 +01:00
Joe Ramsay
aed39a3aa3 aarch64: Add vector implementations of cos routines
Replace the loop-over-scalar placeholder routines with optimised
implementations from Arm Optimized Routines (AOR).

Also add some headers containing utilities for aarch64 libmvec
routines, and update libm-test-ulps.

Data tables for new routines are used via a pointer with a
barrier on it, in order to prevent overly aggressive constant
inlining in GCC. This allows a single adrp, combined with offset
loads, to be used for every constant in the table.

Special-case handlers are marked NOINLINE in order to confine the
save/restore overhead of switching from vector to normal calling
standard. This way we only incur the extra memory access in the
exceptional cases. NOINLINE definitions have been moved to
math_private.h in order to reduce duplication.

AOR exposes a config option, WANT_SIMD_EXCEPT, to enable
selective masking (and later fixing up) of invalid lanes, in
order to trigger fp exceptions correctly (AdvSIMD only). This is
tested and maintained in AOR, however it is configured off at
source level here for performance reasons. We keep the
WANT_SIMD_EXCEPT blocks in routine sources to greatly simplify
the upstreaming process from AOR to glibc.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:10 +01:00
Paul Pluzhnikov
2cbeda847b Fix a few more typos I missed in previous round -- BZ 25337 2023-06-02 23:46:32 +00:00
Paul Pluzhnikov
65cc53fe7c Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
Szabolcs Nagy
642f1b9b3d aarch64: More configure checks for libmvec
Check assembler and linker support too, not just SVE ACLE in the
compiler, since variant PCS requires at least binutils 2.32.1.
2023-05-05 11:34:44 +01:00
Szabolcs Nagy
ee68e9cba4 aarch64: SVE ACLE configure test cleanups
Use more idiomatic configure test for better autoconf cache and logs.
2023-05-05 10:28:29 +01:00
Szabolcs Nagy
1a62d7e5c3 aarch64: fix SVE ACLE check for bootstrap glibc builds
arm_sve.h depends on stdint.h but that relies on libc headers unless
compiled in freestanding mode.  Without this change a bootstrap glibc
build (that uses a compiler without installed libc headers) failed with

checking for availability of SVE ACLE... In file included from [...]/arm_sve.h:28,
                 from conftest.c:1:
[...]/stdint.h:9:16: fatal error: stdint.h: No such file or directory
    9 | # include_next <stdint.h>
      |                ^~~~~~~~~~
compilation terminated.
configure: error: mathvec is enabled but compiler does not have SVE ACLE. [...]
2023-05-04 10:19:11 +01:00
Joe Ramsay
cd94326a13 Enable libmvec support for AArch64
This patch enables libmvec on AArch64. The proposed change is mainly
implementing build infrastructure to add the new routines to ABI,
tests and benchmarks. I have demonstrated how this all fits together
by adding implementations for vector cos, in both single and double
precision, targeting both Advanced SIMD and SVE.

The implementations of the routines themselves are just loops over the
scalar routine from libm for now, as we are more concerned with
getting the plumbing right at this point. We plan to contribute vector
routines from the Arm Optimized Routines repo that are compliant with
requirements described in the libmvec wiki.

Building libmvec requires minimum GCC 10 for SVE ACLE. To avoid raising
the minimum GCC by such a big jump, we allow users to disable libmvec
if their compiler is too old.

Note that at this point users have to manually call the vector math
functions. This seems to be acceptable to some downstream users.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-05-03 12:09:49 +01:00
Szabolcs Nagy
2ce48fbd5a aarch64: update libm test ulps 2023-02-24 10:55:38 +00:00
Jun Tang
311a7e0256 AArch64: Fix HP_TIMING_DIFF computation [BZ# 29329]
Fix the computation to allow for cntfrq_el0 being larger than 1GHz.
Assume cntfrq_el0 is a multiple of 1MHz to increase the maximum
interval (1024 seconds at 1GHz).

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2023-02-22 16:45:59 +00:00
Adhemerval Zanella
a9b3b770f5 string: Remove string_private.h
Now that _STRING_ARCH_unaligned is not used anymore.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2023-02-17 15:56:54 -03:00
Adhemerval Zanella
22999b2f0f string: Add libc_hidden_proto for memrchr
Although static linker can optimize it to local call, it follows the
internal scheme to provide hidden proto and definitions.

Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
2023-02-08 17:13:58 -03:00
Adhemerval Zanella
7ea510127e string: Add libc_hidden_proto for strchrnul
Although static linker can optimize it to local call, it follows the
internal scheme to provide hidden proto and definitions.

Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
2023-02-08 17:13:56 -03:00
Wilco Dijkstra
d2d3f3720c AArch64: Improve SVE memcpy and memmove
Improve SVE memcpy by copying 2 vectors if the size is small enough.
This improves performance of random memcpy by ~9% on Neoverse V1, and
33-64 byte copies are ~16% faster.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-02-06 16:15:34 +00:00
Wilco Dijkstra
55599d4804 AArch64: Improve strrchr
Use shrn for narrowing the mask which simplifies code and speeds up small
strings.  Unroll the first search loop to improve performance on large
strings.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
ad098893ba AArch64: Optimize strnlen
Optimize strnlen using the shrn instruction and improve the main loop.
Small strings are around 10% faster, large strings are 40% faster on
modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
03c8ce5000 AArch64: Optimize strlen
Optimize strlen by unrolling the main loop.  Large strings are 64% faster on
modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
349e48c01e AArch64: Optimize strcpy
Unroll the main loop.  Large strings are around 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
09ebd8549b AArch64: Improve strchrnul
Unroll the main loop, which improves performance slightly.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
51541a2297 AArch64: Optimize strchr
Simplify calculation of the mask using shrn.  Unroll the main loop.
Small strings are 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
1bbb1a2022 AArch64: Improve strlen_asimd
Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment.
Performance improves slightly as a result.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
0077624177 AArch64: Optimize memrchr
Optimize the main loop - large strings are 43% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Wilco Dijkstra
ce758d4f06 AArch64: Optimize memchr
Optimize the main loop - large strings are 40% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Joseph Myers
6d7e8eda9b Update copyright dates with scripts/update-copyrights 2023-01-06 21:14:39 +00:00
Vladislav Khmelevsky
eb4181e9f4 elf: Fix rtld-audit trampoline for aarch64
This patch fixes two problems with audit:

  1. The DL_OFFSET_RV_VPCS offset was mixed up with DL_OFFSET_RG_VPCS,
     resulting in x2 register value nulling in RG structure.

  2. We need to preserve the x8 register before function call, but
     don't have to save it's new value and restore it before return.

Anyway the final restore was using OFFSET_RV instead of OFFSET_RG value
which is wrong (althoug doesn't affect anything).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-11-21 10:20:40 -03:00
Florian Weimer
1f34a23288 elf: Introduce <dl-call_tls_init_tp.h> and call_tls_init_tp (bug 29249)
This makes it more likely that the compiler can compute the strlen
argument in _startup_fatal at compile time, which is required to
avoid a dependency on strlen this early during process startup.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2022-11-03 17:28:03 +01:00
Adhemerval Zanella
8d98c7c00f configure: Use -Wno-ignored-attributes if compiler warns about multiple aliases
clang emits an warning when a double alias redirection is used, to warn
the the original symbol will be used even when weak definition is
overridden.  However, this is a common pattern for weak_alias, where
multiple alias are set to same symbol.

Reviewed-by: Fangrui Song <maskray@google.com>
2022-11-01 09:51:06 -03:00
Szabolcs Nagy
dd31f42e39 aarch64: Don't build wordcopy
Use an empty wordcopy.c to avoid building the generic one.
It does not seem to be used anywhere.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-10-28 11:14:54 +01:00
Wilco Dijkstra
e6f3fe362f aarch64: Use memcpy_simd as the default memcpy
Since __memcpy_simd is the fastest memcpy on almost all cores, replace
the generic memcpy with it.  If SVE is available, a SVE memcpy will be
used by default (including for Neoverse N2).
2022-10-26 14:16:50 +01:00
Wilco Dijkstra
a8e72913fe aarch64: Cleanup memset ifunc
Cleanup memset ifunc selectors. The A64FX memset relies on a ZVA size of
256, so add an explicit check.
2022-10-26 14:12:55 +01:00
Florian Weimer
58548b9d68 Use PTR_MANGLE and PTR_DEMANGLE unconditionally in C sources
In the future, this will result in a compilation failure if the
macros are unexpectedly undefined (due to header inclusion ordering
or header inclusion missing altogether).

Assembler sources are more difficult to convert.  In many cases,
they are hand-optimized for the mangling and no-mangling variants,
which is why they are not converted.

sysdeps/s390/s390-32/__longjmp.c and sysdeps/s390/s390-64/__longjmp.c
are special: These are C sources, but most of the implementation is
in assembler, so the PTR_DEMANGLE macro has to be undefined in some
cases, to match the assembler style.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-10-18 17:04:10 +02:00
Florian Weimer
88f4b6929c Introduce <pointer_guard.h>, extracted from <sysdep.h>
This allows us to define a generic no-op version of PTR_MANGLE and
PTR_DEMANGLE.  In the future, we can use PTR_MANGLE and PTR_DEMANGLE
unconditionally in C sources, avoiding an unintended loss of hardening
due to missing include files or unlucky header inclusion ordering.

In i386 and x86_64, we can avoid a <tls.h> dependency in the C
code by using the computed constant from <tcb-offsets.h>.  <sysdep.h>
no longer includes these definitions, so there is no cyclic dependency
anymore when computing the <tcb-offsets.h> constants.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-10-18 17:03:55 +02:00
Adhemerval Zanella
5355f9ca7b elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support
Besides the option being gcc specific, this approach is still fragile
and not future proof since we do not know if this will be the only
optimization option gcc will add that transforms loops to memset
(or any libcall).

This patch adds a new header, dl-symbol-redir-ifunc.h, that can b
used to redirect the compiler generated libcalls to port the generic
memset implementation if required.

Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-10-10 10:32:28 -03:00
Wilco Dijkstra
22f4ab2d20 Use atomic_exchange_release/acquire
Rename atomic_exchange_rel/acq to use atomic_exchange_release/acquire
since these map to the standard C11 atomic builtins.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-09-26 16:58:08 +01:00
Szabolcs Nagy
06d4381dd8 csu: Change start code license to have link exception
The start code can get linked into dynamic linked executables where
LGPL would require shipping the source or linkable binaries when the
executable is distributed.

On some targets the license exception was missing in start.S (which
is compiled into crt1.o and Scrt1.o which may end up linked into PDE
and PIE binaries).

I did not review what other code may end up in executables, just
fixed the start.S license inconsistency across targets.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-08-26 09:14:53 +01:00
Wilco Dijkstra
12182ba18d AArch64: Fix typo in sve configure check (BZ# 29394)
Fix a typo in the SVE configure check. This fixes [BZ# 29394].
2022-08-11 17:52:00 +01:00
Jason A. Donenfeld
eaad4f9e8f arc4random: simplify design for better safety
Rather than buffering 16 MiB of entropy in userspace (by way of
chacha20), simply call getrandom() every time.

This approach is doubtlessly slower, for now, but trying to prematurely
optimize arc4random appears to be leading toward all sorts of nasty
properties and gotchas. Instead, this patch takes a much more
conservative approach. The interface is added as a basic loop wrapper
around getrandom(), and then later, the kernel and libc together can
work together on optimizing that.

This prevents numerous issues in which userspace is unaware of when it
really must throw away its buffer, since we avoid buffering all
together. Future improvements may include userspace learning more from
the kernel about when to do that, which might make these sorts of
chacha20-based optimizations more possible. The current heuristic of 16
MiB is meaningless garbage that doesn't correspond to anything the
kernel might know about. So for now, let's just do something
conservative that we know is correct and won't lead to cryptographic
issues for users of this function.

This patch might be considered along the lines of, "optimization is the
root of all evil," in that the much more complex implementation it
replaces moves too fast without considering security implications,
whereas the incremental approach done here is a much safer way of going
about things. Once this lands, we can take our time in optimizing this
properly using new interplay between the kernel and userspace.

getrandom(0) is used, since that's the one that ensures the bytes
returned are cryptographically secure. But on systems without it, we
fallback to using /dev/urandom. This is unfortunate because it means
opening a file descriptor, but there's not much of a choice. Secondly,
as part of the fallback, in order to get more or less the same
properties of getrandom(0), we poll on /dev/random, and if the poll
succeeds at least once, then we assume the RNG is initialized. This is a
rough approximation, as the ancient "non-blocking pool" initialized
after the "blocking pool", not before, and it may not port back to all
ancient kernels, though it does to all kernels supported by glibc
(≥3.2), so generally it's the best approximation we can do.

The motivation for including arc4random, in the first place, is to have
source-level compatibility with existing code. That means this patch
doesn't attempt to litigate the interface itself. It does, however,
choose a conservative approach for implementing it.

Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Cristian Rodríguez <crrodriguez@opensuse.org>
Cc: Paul Eggert <eggert@cs.ucla.edu>
Cc: Mark Harris <mark.hsj@gmail.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-07-27 08:58:27 -03:00
Adhemerval Zanella Netto
4c128c7823 aarch64: Add optimized chacha20
It adds vectorized ChaCha20 implementation based on libgcrypt
cipher/chacha20-aarch64.S.  It is used as default and only
little-endian is supported (BE uses generic code).

As for generic implementation, the last step that XOR with the
input is omited.  The final state register clearing is also
omitted.

On a virtualized Linux on Apple M1 it shows the following
improvements (using formatted bench-arc4random data):

GENERIC                                    MB/s
-----------------------------------------------
arc4random [single-thread]               380.89
arc4random_buf(16) [single-thread]       500.73
arc4random_buf(32) [single-thread]       552.61
arc4random_buf(48) [single-thread]       566.82
arc4random_buf(64) [single-thread]       574.01
arc4random_buf(80) [single-thread]       581.02
arc4random_buf(96) [single-thread]       591.19
arc4random_buf(112) [single-thread]      592.29
arc4random_buf(128) [single-thread]      596.43
-----------------------------------------------

OPTIMIZED                                  MB/s
-----------------------------------------------
arc4random [single-thread]               569.60
arc4random_buf(16) [single-thread]       825.78
arc4random_buf(32) [single-thread]       987.03
arc4random_buf(48) [single-thread]      1042.39
arc4random_buf(64) [single-thread]      1075.50
arc4random_buf(80) [single-thread]      1094.68
arc4random_buf(96) [single-thread]      1130.16
arc4random_buf(112) [single-thread]     1129.58
arc4random_buf(128) [single-thread]     1137.91
-----------------------------------------------

Checked on aarch64-linux-gnu.
2022-07-22 11:58:27 -03:00
Danila Kutenin
3c99806989 aarch64: Optimize string functions with shrn instruction
We found that string functions were using AND+ADDP
to find the nibble/syndrome mask but there is an easier
opportunity through `SHRN dst.8b, src.8h, 4` (shift
right every 2 bytes by 4 and narrow to 1 byte) and has
same latency on all SIMD ARMv8 targets as ADDP. There
are also possible gaps for memcmp but that's for
another patch.

We see 10-20% savings for small-mid size cases (<=128)
which are primary cases for general workloads.
2022-07-06 09:26:20 +01:00