The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic tanf.
The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, and to use a generic
128 bit routine for ABIs that do not support it natively.
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (neoverse1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):
latency master patched improvement
x86_64 82.3961 54.8052 33.49%
x86_64v2 82.3415 54.8052 33.44%
x86_64v3 69.3661 50.4864 27.22%
i686 219.271 45.5396 79.23%
aarch64 29.2127 19.1951 34.29%
power10 19.5060 16.2760 16.56%
reciprocal-throughput master patched improvement
x86_64 28.3976 19.7334 30.51%
x86_64v2 28.4568 19.7334 30.65%
x86_64v3 21.1815 16.1811 23.61%
i686 105.016 15.1426 85.58%
aarch64 18.1573 10.7681 40.70%
power10 8.7207 8.7097 0.13%
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
The hardware architects have a new recommendation not to use
non-temporal load/stores for memset. This patch removes this path.
I found there was no difference in the memset speed with/without
non-temporal load/stores either.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The hardware architects have a new recommendation not to use
non-temporal load/stores for memcpy. This patch removes this path.
I found there was no difference in the memcpy speed with/without
non-temporal load/stores either.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
ThunderX1 and ThunderX2 have been retired for a few years now.
So let's remove the thunderx{,2} specific versions of memcpy.
The performance gain or them was for medium and large sizes
while the generic (aarch64) memcpy will handle just slightly worse.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
By using a combination of mask-and-add instead of the shift-based
index calculation the routines can share the same table as other
variants with no performance degradation.
The tables change name because of other changes in downstream AOR.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp2m1f.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow). The
only change is to handle FLT_MAX_EXP for FE_DOWNWARD or FE_TOWARDZERO.
The benchmark inputs are based on exp2f ones.
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 40.6042 48.7104 -19.96%
x86_64v2 40.7506 35.9032 11.90%
x86_64v3 35.2301 31.7956 9.75%
i686 102.094 94.6657 7.28%
aarch64 18.2704 15.1387 17.14%
power10 11.9444 8.2402 31.01%
reciprocal-throughput master patched improvement
x86_64 20.8683 16.1428 22.64%
x86_64v2 19.5076 10.4474 46.44%
x86_64v3 19.2106 10.4014 45.86%
i686 56.4054 59.3004 -5.13%
aarch64 12.0781 7.3953 38.77%
power10 6.5306 5.9388 9.06%
The generic implementation calls __ieee754_exp2f and x86_64 provides
an optimized ifunc version (built with -mfma -mavx2, not correctly
rounded). This explains the performance difference for x86_64.
Same for i686, where the ABI provides an optimized __ieee754_exp2f
version built with '-msse2 -mfpmath=sse'. When built wth same
flags, the new algorithm shows a better performance:
master patched improvement
latency 102.094 91.2823 10.59%
reciprocal-throughput 56.4054 52.7984 6.39%
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp10m1f.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow). I mostly
fixed some small issues in corner cases (sNaN handling, -INFINITY,
a specific overflow check).
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 45.4690 49.5845 -9.05%
x86_64v2 46.1604 36.2665 21.43%
x86_64v3 37.8442 31.0359 17.99%
i686 121.367 93.0079 23.37%
aarch64 21.1126 15.0165 28.87%
power10 12.7426 8.4929 33.35%
reciprocal-throughput master patched improvement
x86_64 19.6005 17.4005 11.22%
x86_64v2 19.6008 11.1977 42.87%
x86_64v3 17.5427 10.2898 41.34%
i686 59.4215 60.9675 -2.60%
aarch64 13.9814 7.9173 43.37%
power10 6.7814 6.4258 5.24%
The generic implementation calls __ieee754_exp10f which has an
optimized version, although it is not correctly rounded, which is
the main culprit of the the latency difference for x86_64 and
throughp for i686.
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
In both routines, reduce register pressure such that GCC 14 emits no
spills for erf and fewer spills for erfc. Also use more efficient
comparison for the special-case in erf.
Benchtests show erf improves by 6.4%, erfc by 1.0%.
The CORE-MATH implementation is correctly rounded (for any rounding mode).
This can be checked by exhaustive tests in a few minutes since there are
less than 2^32 values to check against for example GNU MPFR.
This patch also adds some bench values for tgammaf.
Tested on x86_64 and x86 (cfarm26).
With the initial GNU libc code it gave on an Intel(R) Core(TM) i7-8700:
"tgammaf": {
"": {
"duration": 3.50188e+09,
"iterations": 2e+07,
"max": 602.891,
"min": 65.1415,
"mean": 175.094
}
}
With the new code:
"tgammaf": {
"": {
"duration": 3.30825e+09,
"iterations": 5e+07,
"max": 211.592,
"min": 32.0325,
"mean": 66.1649
}
}
With the initial GNU libc code it gave on cfarm26 (i686):
"tgammaf": {
"": {
"duration": 3.70505e+09,
"iterations": 6e+06,
"max": 2420.23,
"min": 243.154,
"mean": 617.509
}
}
With the new code:
"tgammaf": {
"": {
"duration": 3.24497e+09,
"iterations": 1.8e+07,
"max": 1238.15,
"min": 101.155,
"mean": 180.276
}
}
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Changes in v2:
- include <math.h> (fix the linknamespace failures)
- restored original benchtests/strcoll-inputs/filelist#en_US.UTF-8 file
- restored original wrapper code (math/w_tgammaf_compat.c),
except for the dealing with the sign
- removed the tgammaf/float entries in all libm-test-ulps files
- address other comments from Joseph Myers
(https://sourceware.org/pipermail/libc-alpha/2024-July/158736.html)
Changes in v3:
- pass NULL argument for signgam from w_tgammaf_compat.c
- use of math_narrow_eval
- added more comments
Changes in v4:
- initialize local_signgam to 0 in math/w_tgamma_template.c
- replace sysdeps/ieee754/dbl-64/gamma_productf.c by dummy file
Changes in v5:
- do not mention local_signgam any more in math/w_tgammaf_compat.c
- initialize local_signgam to 1 instead of 0 in w_tgamma_template.c
and added comment
Changes in v6:
- pass NULL as 2nd argument of __ieee754_gammaf_r in
w_tgammaf_compat.c, and check for NULL in e_gammaf_r.c
Changes in v7:
- added Signed-off-by line for Alexei Sibidanov (author of the code)
Changes in v8:
- added Signed-off-by line for Paul Zimmermann (submitted of the patch)
Changes in v9:
- address comments from review by Adhemerval Zanella
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This operation can be simplified to use simpler multiply-round-convert
sequence, which uses fewer instructions and constants.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Rearrange operations so MOV is not necessary in reduction or around
the special-case handler. Reduce memory access by using more indexed
MLAs in polynomial.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
log1pf is quite register-intensive - use fewer registers for the
polynomial, and make various changes to shorten dependency chains in
parent routines. There is now no spilling with GCC 14. Accuracy moves
around a little - comments adjusted accordingly but does not require
regen-ulps.
Use the helper in log1pf as well, instead of having separate
implementations. The more accurate polynomial means special-casing can
be simplified, and the shorter dependency chain avoids the usual dance
around v0, which is otherwise difficult.
There is a small duplication of vectors containing 1.0f (or 0x3f800000) -
GCC is not currently able to efficiently handle values which fit in FMOV
but not MOVI, and are reinterpreted to integer. There may be potential
for more optimisation if this is fixed.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reduce MOVPRFXs by using unpredicated (non-destructive) instructions
where possible. Similar to the recent change to AdvSIMD F32 logs,
adjust special-case arguments and bounds to allow for more optimal
register usage. For all 3 routines one MOVPRFX remains in the
reduction, which cannot be avoided as immediate AND and ASR are both
destructive.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reduce MOV and MOVPRFX by improving special-case handling. Use inline
helper to duplicate the entire computation between the special- and
non-special case branches, removing the contention for z0 between x
and the return value.
Also rearrange some MLAs and MLSs - by making the multiplicand the
destination we can avoid a MOVPRFX in several cases. Also change which
constants go in the vector used for lanewise ops - the last lane is no
longer wasted.
Spotted that shift was incorrect in exp2f and exp10f, w.r.t. to the
comment that explains it. Fixed - worst-case ULP for exp2f moves
around but it doesn't change significantly for either routine.
Worst-case error for coshf increases due to passing x to exp rather
than abs(x) - updated the comment, but does not require regen-ulps.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
This enables vectorisation of C23 logp1, which is an alias for log1p.
There are no new tests or ulp entries because the new symbols are simply
aliases.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes. Remove unnecessary code for ZVA sizes
other than 64 and 128. Performance of random memset benchmark improves by 24%
on Neoverse N1.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Since the last operation is destructive, the first argument to the FMA
also has to be the first argument to the special-case in order to
avoid unnecessary MOVs. Reorder arguments and adjust special-case
bounds to facilitate this.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency. On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Qualcom's new core, oryon-1, has a different characteristics for
memset than the current versions of memset. For non-zero, larger
sizes, using GPRs rather than the SIMD stores is ~30% faster.
For even larger sizes, using the nontemporal stores is needed
not to polute the L1/L2 caches.
For zero values, using `dc zva` should be used. Since we
know the size will always be 64 bytes, we don't need to figure
out the size there.
I started with the emag memset and added back the `dc zva` code.
Changes since v1:
* v3: Fix comment formating
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Qualcomm's new core (oryon-1) has a different performance characteristic
than other cores. For memcpy, it is faster to use the GPRs to
do the copy for large sizes (2x faster). For even larger sizes,
it is better to use the nontemporal load/store instructions so
we don't pollute the L1/L2 caches.
For smaller sizes, the characteristic are very similar to
other cores.
I used the thunderx memcpy as a starting point and expanded from there.
Changes since v1:
* v2: Fix ordering in Makefile.
* v3: Fix comment grammar about the ldnp/stnp instructions.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
As discussed at the patch review meeting
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Reviewed-by: Simon Chopin <simon.chopin@canonical.com>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the logp1 functions (aliases for log1p functions - the
name is intended to be more consistent with the new log2p1 and
log10p1, where clearly it would have been very confusing to name those
functions log21p and log101p). As aliases rather than new functions,
the content of this patch is somewhat different from those actually
adding new functions.
Tests are shared with log1p, so this patch *does* mechanically update
all affected libm-test-ulps files to expect the same errors for both
functions.
The vector versions of log1p on aarch64 and x86_64 are *not* updated
to have logp1 aliases (and thus there are no corresponding header,
tests, abilist or ulps changes for vector functions either). It would
be reasonable for such vector aliases and corresponding changes to
other files to be made separately. For now, the log1p tests instead
avoid testing logp1 in the vector case (a Makefile change is needed to
avoid problems with grep, used in generating the .c files for vector
function tests, matching more than one ALL_RM_TEST line in a file
testing multiple functions with the same inputs, when it assumes that
the .inc file only has a single such line).
Tested for x86_64 and x86, and with build-many-glibcs.py.
The generic version provides weak definitions of memchr/strlen,
which are already provided by the ifunc resolvers.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Plus a small amount of moving includes around in order to be able to
remove duplicate definition of asuint64.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Previously many routines used * to load from vector types stored
in the data table. This is emitted as ldr, which byte-swaps the
entire vector register, and causes bugs for big-endian when not
all lanes contain the same value. When a vector is to be used
this way, it has been replaced with an array and the load with an
explicit ld1 intrinsic, which byte-swaps only within lanes.
As well, many routines previously used non-standard GCC syntax
for vector operations such as indexing into vectors types with []
and assembling vectors using {}. This syntax should not be mixed
with ACLE, as the former does not respect endianness whereas the
latter does. Such examples have been replaced with, for instance,
vcombine_* and vgetq_lane* intrinsics. Helpers which only use the
GCC syntax, such as the v_call helpers, do not need changing as
they do not use intrinsics.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
The 680c597e9c commit made loader reject ill-formatted strings by
first tracking all set tunables and then applying them. However, it does
not take into consideration if the same tunable is set multiple times,
where parse_tunables_string appends the found tunable without checking
if it was already in the list. It leads to a stack-based buffer overflow
if the tunable is specified more than the total number of tunables. For
instance:
GLIBC_TUNABLES=glibc.malloc.check=2:... (repeat over the number of
total support for different tunable).
Instead, use the index of the tunable list to get the expected tunable
entry. Since now the initial list is zero-initialized, the compiler
might emit an extra memset and this requires some minor adjustment
on some ports.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reported-by: Yuto Maeda <maeda@cyberdefense.jp>
Reported-by: Yutaro Shimizu <shimizu@cyberdefense.jp>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
This prints some information from struct cpu_features, and the midr_el1
and dczid_el0 system register contents on every CPU.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>