Commit Graph

125 Commits

Author SHA1 Message Date
Joe Ramsay
16a59571e4 AArch64: Simplify rounding-multiply pattern in several AdvSIMD routines
This operation can be simplified to use simpler multiply-round-convert
sequence, which uses fewer instructions and constants.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:08 +01:00
Joe Ramsay
7900ac490d AArch64: Improve codegen in users of ADVSIMD expm1f helper
Rearrange operations so MOV is not necessary in reduction or around
the special-case handler.  Reduce memory access by using more indexed
MLAs in polynomial.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay
5bc100bd4b AArch64: Improve codegen in users of AdvSIMD log1pf helper
log1pf is quite register-intensive - use fewer registers for the
polynomial, and make various changes to shorten dependency chains in
parent routines.  There is now no spilling with GCC 14.  Accuracy moves
around a little - comments adjusted accordingly but does not require
regen-ulps.

Use the helper in log1pf as well, instead of having separate
implementations.  The more accurate polynomial means special-casing can
be simplified, and the shorter dependency chain avoids the usual dance
around v0, which is otherwise difficult.

There is a small duplication of vectors containing 1.0f (or 0x3f800000) -
GCC is not currently able to efficiently handle values which fit in FMOV
but not MOVI, and are reinterpreted to integer.  There may be potential
for more optimisation if this is fixed.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay
a15b1394b5 AArch64: Improve codegen in SVE F32 logs
Reduce MOVPRFXs by using unpredicated (non-destructive) instructions
where possible.  Similar to the recent change to AdvSIMD F32 logs,
adjust special-case arguments and bounds to allow for more optimal
register usage.  For all 3 routines one MOVPRFX remains in the
reduction, which cannot be avoided as immediate AND and ASR are both
destructive.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay
7b8c134b54 AArch64: Improve codegen in SVE expf & related routines
Reduce MOV and MOVPRFX by improving special-case handling.  Use inline
helper to duplicate the entire computation between the special- and
non-special case branches, removing the contention for z0 between x
and the return value.

Also rearrange some MLAs and MLSs - by making the multiplicand the
destination we can avoid a MOVPRFX in several cases.  Also change which
constants go in the vector used for lanewise ops - the last lane is no
longer wasted.

Spotted that shift was incorrect in exp2f and exp10f, w.r.t. to the
comment that explains it.  Fixed - worst-case ULP for exp2f moves
around but it doesn't change significantly for either routine.

Worst-case error for coshf increases due to passing x to exp rather
than abs(x) - updated the comment, but does not require regen-ulps.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay
751a5502be AArch64: Add vector logp1 alias for log1p
This enables vectorisation of C23 logp1, which is an alias for log1p.
There are no new tests or ulp entries because the new symbols are simply
aliases.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-19 17:53:34 +01:00
Joe Ramsay
8b09af572b aarch64: Avoid redundant MOVs in AdvSIMD F32 logs
Since the last operation is destructive, the first argument to the FMA
also has to be the first argument to the special-case in order to
avoid unnecessary MOVs. Reorder arguments and adjust special-case
bounds to facilitate this.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-09 13:03:49 +01:00
Joe Ramsay
0fed0b250f aarch64/fpu: Add vector variants of pow
Plus a small amount of moving includes around in order to be able to
remove duplicate definition of asuint64.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-21 14:38:49 +01:00
Joe Ramsay
75207bde68 aarch64/fpu: Add vector variants of cbrt
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-16 14:35:06 +01:00
Joe Ramsay
157f89fa3d aarch64/fpu: Add vector variants of hypot
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-16 14:34:43 +01:00
Joe Ramsay
90a6ca8b28 aarch64: Fix AdvSIMD libmvec routines for big-endian
Previously many routines used * to load from vector types stored
in the data table. This is emitted as ldr, which byte-swaps the
entire vector register, and causes bugs for big-endian when not
all lanes contain the same value. When a vector is to be used
this way, it has been replaced with an array and the load with an
explicit ld1 intrinsic, which byte-swaps only within lanes.

As well, many routines previously used non-standard GCC syntax
for vector operations such as indexing into vectors types with []
and assembling vectors using {}. This syntax should not be mixed
with ACLE, as the former does not respect endianness whereas the
latter does. Such examples have been replaced with, for instance,
vcombine_* and vgetq_lane* intrinsics. Helpers which only use the
GCC syntax, such as the v_call helpers, do not need changing as
they do not use intrinsics.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-14 13:10:33 +01:00
Joe Ramsay
87cb1dfcd6 aarch64/fpu: Add vector variants of erfc
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:24 +01:00
Joe Ramsay
3d3a4fb8e4 aarch64/fpu: Add vector variants of tanh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:20 +01:00
Joe Ramsay
eedbbca0bf aarch64/fpu: Add vector variants of sinh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:16 +01:00
Joe Ramsay
8b67920528 aarch64/fpu: Add vector variants of atanh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:12 +01:00
Joe Ramsay
81406ea3c5 aarch64/fpu: Add vector variants of asinh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:02 +01:00
Joe Ramsay
b09fee1d21 aarch64/fpu: Add vector variants of acosh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:32:58 +01:00
Joe Ramsay
bdb5705b7b aarch64/fpu: Add vector variants of cosh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:32:52 +01:00
Joe Ramsay
cb5d84f1f8 aarch64/fpu: Add vector variants of erf
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:32:48 +01:00
Joe Ramsay
e302e10213 aarch64/fpu: Sync libmvec routines from 2.39 and before with AOR
This includes a fix for big-endian in AdvSIMD log, some cosmetic
changes, and numerous small optimisations mainly around inlining and
using indexed variants of MLA intrinsics.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-02-26 09:45:50 -03:00
Paul Eggert
dff8da6b3e Update copyright dates with scripts/update-copyrights 2024-01-01 10:53:40 -08:00
Joe Ramsay
667f277c78 aarch64: Add SIMD attributes to math functions with vector versions
Added annotations for autovec by GCC and GFortran - this enables GCC
>= 9 to autovectorise math calls at -Ofast.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-12-20 08:41:25 +00:00
Joe Ramsay
cc0d77ba94 aarch64: Add half-width versions of AdvSIMD f32 libmvec routines
Compilers may emit calls to 'half-width' routines (two-lane
single-precision variants). These have been added in the form of
wrappers around the full-width versions, where the low half of the
vector is simply duplicated. This will perform poorly when one lane
triggers the special-case handler, as there will be a redundant call
to the scalar version, however this is expected to be rare at Ofast.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-12-20 08:41:25 +00:00
Joe Ramsay
7b12776584 aarch64: Improve special-case handling in AdvSIMD double-precision libmvec routines
Avoids emitting many saves/restores of vector registers, reduces the
amount of code generated around the scalar fallback.
2023-11-29 15:03:36 +00:00
Joe Ramsay
bd70d3bacf aarch64: Fix libmvec benchmarks
These were broken by the new atan2 functions, as they were only
set up for univariate functions. Arity is now detected from the
input file - this revealed a mistake that the double-precision
inputs were being used for both single- and double-precision
routines, which is now remedied.
2023-11-22 09:10:43 +00:00
Joe Ramsay
a8830c9285 aarch64: Add vector implementations of expm1 routines
May discard sign of 0 - auto tests for -0 and -0x1p-10000 updated accordingly.
2023-11-20 17:53:14 +00:00
Joe Ramsay
3548a4f087 aarch64: Add vector implementations of log1p routines
May discard sign of zero.
2023-11-10 17:07:43 +00:00
Joe Ramsay
b07038c5d3 aarch64: Add vector implementations of atan2 routines 2023-11-10 17:07:43 +00:00
Joe Ramsay
d30c39f80d aarch64: Add vector implementations of atan routines 2023-11-10 17:07:42 +00:00
Joe Ramsay
b5d23367a8 aarch64: Add vector implementations of acos routines 2023-11-10 17:07:42 +00:00
Joe Ramsay
9bed498418 aarch64: Add vector implementations of asin routines 2023-11-10 17:07:42 +00:00
Joe Ramsay
31aaf6fed9 aarch64: Add vector implementations of exp10 routines
Double-precision routines either reuse the exp table (AdvSIMD) or use
SVE FEXPA intruction.
2023-10-23 15:00:45 +01:00
Joe Ramsay
067a34156c aarch64: Add vector implementations of log10 routines
A table is also added, which is shared between AdvSIMD and SVE log10.
2023-10-23 15:00:45 +01:00
Joe Ramsay
a8e3ab3074 aarch64: Add vector implementations of log2 routines
A table is also added, which is shared between AdvSIMD and SVE log2.
2023-10-23 15:00:45 +01:00
Joe Ramsay
b39e9db5e3 aarch64: Add vector implementations of exp2 routines
Some routines reuse table from v_exp_data.c
2023-10-23 15:00:45 +01:00
Joe Ramsay
f554334c05 aarch64: Add vector implementations of tan routines
This includes some utility headers for evaluating polynomials using
various schemes.
2023-10-23 15:00:44 +01:00
Joe Ramsay
5a4b6f8e4b aarch64: Optimise vecmath logs
* Transpose table layout for improved memory access
* Use half-vector special comparisons for AdvSIMD
* Improve register use near special-case branches
  - Due to the presence of a function call, return value would get
    mov-d out of x0 in order to facilitate PCS. By moving the final
    computation after the branch this can be avoided

Also change SVE routines to use overloaded intrinsics for readability.
2023-10-05 16:54:16 +01:00
Joe Ramsay
480a0dfe1a aarch64: Cosmetic change in SVE exp routines
Use overloaded intrinsics for readability. Codegen does not
change, however while we're bringing the routines up-to-date with
recent improvements to other routines in AOR it is worth copying
this change over as well.
2023-10-05 16:54:00 +01:00
Joe Ramsay
9180160e08 aarch64: Optimize SVE cos & cosf
Saves a mov by ensuring return value does not need to be moved out of
the way before special-case branch. Also change to use overloaded
intrinsics.
2023-10-05 16:53:38 +01:00
Joe Ramsay
8014d1e832 aarch64: Improve vecmath sin routines
* Update ULP comment reflecting a new observed max in [-pi/2, pi/2]
* Use the same polynomial in AdvSIMD and SVE, rather than FTRIG instructions
* Improve register use near special-case branch

Also use overloaded intrinsics for SVE.
2023-10-05 16:53:06 +01:00
Wilco Dijkstra
6b695e5c62 AArch64: Remove -0.0 check from vector sin
Remove the unnecessary extra checks for sin (-0.0) from vector sin/sinf,
improving performance.  Passes regress.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-09-26 13:40:07 +01:00
Joe Ramsay
4a9392ffc2 aarch64: Add vector implementations of exp routines
Optimised implementations for single and double precision, Advanced
SIMD and SVE, copied from Arm Optimized Routines.

As previously, data tables are used via a barrier to prevent
overly aggressive constant inlining. Special-case handlers are
marked NOINLINE to avoid incurring the penalty of switching call
standards unnecessarily.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:26 +01:00
Joe Ramsay
78c01a5cbe aarch64: Add vector implementations of log routines
Optimised implementations for single and double precision, Advanced
SIMD and SVE, copied from Arm Optimized Routines. Log lookup table
added as HIDDEN symbol to allow it to be shared between AdvSIMD and
SVE variants.

As previously, data tables are used via a barrier to prevent
overly aggressive constant inlining. Special-case handlers are
marked NOINLINE to avoid incurring the penalty of switching call
standards unnecessarily.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:22 +01:00
Joe Ramsay
3bb1af2051 aarch64: Add vector implementations of sin routines
Optimised implementations for single and double precision, Advanced
SIMD and SVE, copied from Arm Optimized Routines.

As previously, data tables are used via a barrier to prevent
overly aggressive constant inlining. Special-case handlers are
marked NOINLINE to avoid incurring the penalty of switching call
standards unnecessarily.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:16 +01:00
Joe Ramsay
aed39a3aa3 aarch64: Add vector implementations of cos routines
Replace the loop-over-scalar placeholder routines with optimised
implementations from Arm Optimized Routines (AOR).

Also add some headers containing utilities for aarch64 libmvec
routines, and update libm-test-ulps.

Data tables for new routines are used via a pointer with a
barrier on it, in order to prevent overly aggressive constant
inlining in GCC. This allows a single adrp, combined with offset
loads, to be used for every constant in the table.

Special-case handlers are marked NOINLINE in order to confine the
save/restore overhead of switching from vector to normal calling
standard. This way we only incur the extra memory access in the
exceptional cases. NOINLINE definitions have been moved to
math_private.h in order to reduce duplication.

AOR exposes a config option, WANT_SIMD_EXCEPT, to enable
selective masking (and later fixing up) of invalid lanes, in
order to trigger fp exceptions correctly (AdvSIMD only). This is
tested and maintained in AOR, however it is configured off at
source level here for performance reasons. We keep the
WANT_SIMD_EXCEPT blocks in routine sources to greatly simplify
the upstreaming process from AOR to glibc.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-06-30 09:04:10 +01:00
Paul Pluzhnikov
2cbeda847b Fix a few more typos I missed in previous round -- BZ 25337 2023-06-02 23:46:32 +00:00
Joe Ramsay
cd94326a13 Enable libmvec support for AArch64
This patch enables libmvec on AArch64. The proposed change is mainly
implementing build infrastructure to add the new routines to ABI,
tests and benchmarks. I have demonstrated how this all fits together
by adding implementations for vector cos, in both single and double
precision, targeting both Advanced SIMD and SVE.

The implementations of the routines themselves are just loops over the
scalar routine from libm for now, as we are more concerned with
getting the plumbing right at this point. We plan to contribute vector
routines from the Arm Optimized Routines repo that are compliant with
requirements described in the libmvec wiki.

Building libmvec requires minimum GCC 10 for SVE ACLE. To avoid raising
the minimum GCC by such a big jump, we allow users to disable libmvec
if their compiler is too old.

Note that at this point users have to manually call the vector math
functions. This seems to be acceptable to some downstream users.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-05-03 12:09:49 +01:00
Joseph Myers
6d7e8eda9b Update copyright dates with scripts/update-copyrights 2023-01-06 21:14:39 +00:00
Tejas Belagod
05844d18f7 AArch64: Reset HWCAP2_AFP bits in FPCR for default fenv
The AFP feature (Alternate floating-point behavior) was added in armv8.7 and
introduced new FPCR bits.

Currently, HWCAP2_AFP bits (bit 0, 1, 2) in FPCR are preserved when fenv is
set to default environment.  This is a deviation from standard behaviour.
Clear these bits when setting the fenv to default.

There is no libc API to modify the new FPCR bits.  Restoring those bits matters
if the user changed them directly.
2022-07-05 14:01:17 +01:00
Paul Eggert
581c785bf3 Update copyright dates with scripts/update-copyrights
I used these shell commands:

../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.

I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah.  I don't
know why I run into these diagnostics whereas others evidently do not.

remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
2022-01-01 11:40:24 -08:00