Commit Graph

492 Commits

Author SHA1 Message Date
Adhemerval Zanella
bccb0648ea math: Use tanf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic tanf.

The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, and to use a generic
128 bit routine for ABIs that do not support it natively.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (neoverse1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       82.3961       54.8052       33.49%
x86_64v2                     82.3415       54.8052       33.44%
x86_64v3                     69.3661       50.4864       27.22%
i686                         219.271       45.5396       79.23%
aarch64                      29.2127       19.1951       34.29%
power10                      19.5060       16.2760       16.56%

reciprocal-throughput         master       patched  improvement
x86_64                       28.3976       19.7334       30.51%
x86_64v2                     28.4568       19.7334       30.65%
x86_64v3                     21.1815       16.1811       23.61%
i686                         105.016       15.1426       85.58%
aarch64                      18.1573       10.7681       40.70%
power10                       8.7207        8.7097        0.13%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella
d846f4c12d math: Use lgammaf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic lgammaf.

The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, to use math_narrow_eval
on overflow usage, and to adapt to make it reentrant.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       86.5609       70.3278       18.75%
x86_64v2                     78.3030       69.9709       10.64%
x86_64v3                     74.7470       59.8457       19.94%
i686                         387.355       229.761       40.68%
aarch64                      40.8341       33.7563       17.33%
power10                      26.5520       16.1672       39.11%
powerpc                      28.3145       17.0625       39.74%

reciprocal-throughput         master       patched  improvement
x86_64                       68.0461       48.3098       29.00%
x86_64v2                     55.3256       47.2476       14.60%
x86_64v3                     52.3015       38.9028       25.62%
i686                         340.848       195.707       42.58%
aarch64                      36.8000       30.5234       17.06%
power10                      20.4043       12.6268       38.12%
powerpc                      22.6588       13.8866       38.71%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella
994fec2397 math: Use erff from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic erff.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       85.7363       45.1372       47.35%
x86_64v2                     86.6337       38.5816       55.47%
x86_64v3                     71.3810       34.0843       52.25%
i686                         190.143       97.5014       48.72%
aarch64                      34.9091       14.9320       57.23%
power10                      38.6160        8.5188       77.94%
powerpc                      39.7446       8.45781       78.72%

reciprocal-throughput         master       patched  improvement
x86_64                       35.1739       14.7603       58.04%
x86_64v2                     34.5976       11.2283       67.55%
x86_64v3                     27.3260        9.8550       63.94%
i686                         91.0282       30.8840       66.07%
aarch64                      22.5831        6.9615       69.17%
power10                      18.0386        3.0918       82.86%
powerpc                      20.7277       3.63396       82.47%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella
c5d241f06b math: Use cbrtf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic cbrtf.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master        patched       improvement
x86_64                       68.6348        36.8908            46.25%
x86_64v2                     67.3418        36.6968            45.51%
x86_64v3                     63.4981        32.7859            48.37%
aarch64                      29.3172        12.1496            58.56%
power10                      18.0845         8.8893            50.85%
powerpc                      18.0859        8.79527            51.37%

reciprocal-throughput         master        patched       improvement
x86_64                       36.4369        13.3565            63.34%
x86_64v2                     37.3611        13.1149            64.90%
x86_64v3                     31.6024        11.2102            64.53%
aarch64                      18.6866        7.3474             60.68%
power10                       9.4758        3.6329             61.66%
powerpc                      9.58896        3.90439            59.28%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-11-22 10:01:03 -03:00
Andrew Pinski
e6590f0c86 aarch64: Remove non-temporal load/stores from oryon-1's memset
The hardware architects have a new recommendation not to use
non-temporal load/stores for memset. This patch removes this path.
I found there was no difference in the memset speed with/without
non-temporal load/stores either.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-21 11:32:23 -03:00
Andrew Pinski
eb5eeb4740 aarch64: Remove non-temporal load/stores from oryon-1's memcpy
The hardware architects have a new recommendation not to use
non-temporal load/stores for memcpy. This patch removes this path.
I found there was no difference in the memcpy speed with/without
non-temporal load/stores either.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-21 11:32:17 -03:00
Andrew Pinski
e162ab2bf1 AArch64: Remove thunderx{,2} memcpy
ThunderX1 and ThunderX2 have been retired for a few years now.
So let's remove the thunderx{,2} specific versions of memcpy.
The performance gain or them was for medium and large sizes
while the generic (aarch64) memcpy will handle just slightly worse.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-11-20 11:23:53 +00:00
Joe Ramsay
2d82d781a5 AArch64: Remove SVE erf and erfc tables
By using a combination of mask-and-add instead of the shift-based
index calculation the routines can share the same table as other
variants with no performance degradation.

The tables change name because of other changes in downstream AOR.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-11-01 16:10:41 +00:00
Adhemerval Zanella
f338c7c5f5 math: Use log10p1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic log10p1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      68.5251        32.2627        52.92%
x86_64v2                    68.8912        32.7887        52.41%
x86_64v3                    59.3427        27.0521        54.41%
i686                        162.026        103.383        36.19%
aarch64                     26.8513        14.5695        45.74%
power10                     12.7426         8.4929        33.35%
powerpc                     16.6768        9.29135        44.29%

reciprocal-throughput        master        patched   improvement
x86_64                      26.0969        12.4023        52.48%
x86_64v2                    25.0045        11.0748        55.71%
x86_64v3                    20.5610        10.2995        49.91%
i686                        89.8842        78.5211        12.64%
aarch64                     17.1200         9.4832        44.61%
power10                      6.7814         6.4258         5.24%
powerpc                      15.769         7.6825        51.28%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:40 -03:00
Adhemerval Zanella
8ae9e51376 math: Use log1pf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic log1pf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      71.8142        38.9668        45.74%
x86_64v2                    71.9094        39.1321        45.58%
x86_64v3                    60.1000        32.4016        46.09%
i686                        147.105        104.258        29.13%
aarch64                     26.4439        14.0050        47.04%
power10                     19.4874         9.4146        51.69%
powerpc                     17.6145        8.00736        54.54%

reciprocal-throughput        master        patched   improvement
x86_64                      19.7604        12.7254        35.60%
x86_64v2                    19.0039        11.9455        37.14%
x86_64v3                    16.8559        11.9317        29.21%
i686                        82.3426        73.9718        10.17%
aarch64                     14.4665         7.9614        44.97%
power10                     11.9974         8.4117        29.89%
powerpc                     7.15222         6.0914        14.83%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00
Adhemerval Zanella
c369580814 math: Use log2p1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic log2p1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      70.1462        47.0090        32.98%
x86_64v2                    70.2513        47.6160        32.22%
x86_64v3                    60.4840        39.9443        33.96%
i686                        164.068        122.909        25.09%
aarch64                     25.9169        16.9207        34.71%
power10                     18.1261        9.8592         45.61%
powerpc                     17.2683        9.38665        45.64%

reciprocal-throughput        master        patched   improvement
x86_64                      26.2240        16.4082        37.43%
x86_64v2                    25.0911        15.7480        37.24%
x86_64v3                    20.9371        11.7264        43.99%
i686                        90.4209        95.3073        -5.40%
aarch64                     16.8537        8.9561         46.86%
power10                     12.9401        6.5555         49.34%
powerpc                     9.01763        7.54745        16.30%

The performance decrease for i686 is mostly due the use of x87 fpu,
when building with '-msse2 -mfpmath=sse:

                             master        patched   improvement
latency                     164.068        102.982        37.23%
reciprocal-throughput       89.1968        82.5117         7.49%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00
Adhemerval Zanella
bbd578b38d math: Use expm1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic expm1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      96.7402        36.4026        62.37%
x86_64v2                    97.5391        33.4625        65.69%
x86_64v3                    82.1778        30.8668        62.44%
i686                         120.58        94.8302        21.35%
aarch64                     32.3558        12.8881        60.17%
power10                     23.5087        9.8574         58.07%
powerpc                     23.4776        9.06325        61.40%

reciprocal-throughput        master        patched   improvement
x86_64                      27.8224        15.9255        42.76%
x86_64v2                    27.8364        9.6438         65.36%
x86_64v3                    20.3227        9.6146         52.69%
i686                        63.5629        59.4718         6.44%
aarch64                     17.4838        7.1082         59.34%
power10                     12.4644        8.7829         29.54%
powerpc                     14.2152        5.94765        58.16%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:35 -03:00
Adhemerval Zanella
5c22fd25c1 math: Use exp2m1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp2m1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).  The
only change is to handle FLT_MAX_EXP for FE_DOWNWARD or FE_TOWARDZERO.

The benchmark inputs are based on exp2f ones.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      40.6042        48.7104       -19.96%
x86_64v2                    40.7506        35.9032        11.90%
x86_64v3                    35.2301        31.7956        9.75%
i686                        102.094        94.6657        7.28%
aarch64                     18.2704        15.1387        17.14%
power10                     11.9444         8.2402        31.01%

reciprocal-throughput        master        patched   improvement
x86_64                      20.8683        16.1428        22.64%
x86_64v2                    19.5076        10.4474        46.44%
x86_64v3                    19.2106        10.4014        45.86%
i686                        56.4054        59.3004        -5.13%
aarch64                     12.0781         7.3953        38.77%
power10                      6.5306         5.9388         9.06%

The generic implementation calls __ieee754_exp2f and x86_64 provides
an optimized ifunc version (built with -mfma -mavx2, not correctly
rounded).  This explains the performance difference for x86_64.

Same for i686, where the ABI provides an optimized __ieee754_exp2f
version built with '-msse2 -mfpmath=sse'.  When built wth same
flags, the new algorithm shows a better performance:

                            master        patched    improvement
latency                    102.094        91.2823         10.59%
reciprocal-throughput      56.4054        52.7984          6.39%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:35 -03:00
Adhemerval Zanella
5fa89852fa math: Use exp10m1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp10m1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).  I mostly
fixed some small issues in corner cases (sNaN handling, -INFINITY,
a specific overflow check).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      45.4690        49.5845        -9.05%
x86_64v2                    46.1604        36.2665        21.43%
x86_64v3                    37.8442        31.0359        17.99%
i686                        121.367        93.0079        23.37%
aarch64                     21.1126        15.0165        28.87%
power10                     12.7426        8.4929         33.35%

reciprocal-throughput        master        patched   improvement
x86_64                      19.6005        17.4005        11.22%
x86_64v2                    19.6008        11.1977        42.87%
x86_64v3                    17.5427        10.2898        41.34%
i686                        59.4215        60.9675        -2.60%
aarch64                     13.9814        7.9173         43.37%
power10                      6.7814        6.4258          5.24%

The generic implementation calls __ieee754_exp10f which has an
optimized version, although it is not correctly rounded, which is
the main culprit of the the latency difference for x86_64 and
throughp for i686.

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:26 -03:00
Joe Ramsay
1cf29fbc5b AArch64: Small optimisation in AdvSIMD erf and erfc
In both routines, reduce register pressure such that GCC 14 emits no
spills for erf and fewer spills for erfc.  Also use more efficient
comparison for the special-case in erf.

Benchtests show erf improves by 6.4%, erfc by 1.0%.
2024-10-28 15:01:37 +00:00
Paul Zimmermann
392b3f0971 replace tgammaf by the CORE-MATH implementation
The CORE-MATH implementation is correctly rounded (for any rounding mode).
This can be checked by exhaustive tests in a few minutes since there are
less than 2^32 values to check against for example GNU MPFR.
This patch also adds some bench values for tgammaf.

Tested on x86_64 and x86 (cfarm26).

With the initial GNU libc code it gave on an Intel(R) Core(TM) i7-8700:

      "tgammaf": {
       "": {
        "duration": 3.50188e+09,
        "iterations": 2e+07,
        "max": 602.891,
        "min": 65.1415,
        "mean": 175.094
       }
      }

With the new code:

      "tgammaf": {
       "": {
        "duration": 3.30825e+09,
        "iterations": 5e+07,
        "max": 211.592,
        "min": 32.0325,
        "mean": 66.1649
       }
      }

With the initial GNU libc code it gave on cfarm26 (i686):

  "tgammaf": {
   "": {
    "duration": 3.70505e+09,
    "iterations": 6e+06,
    "max": 2420.23,
    "min": 243.154,
    "mean": 617.509
   }
  }

With the new code:

  "tgammaf": {
   "": {
    "duration": 3.24497e+09,
    "iterations": 1.8e+07,
    "max": 1238.15,
    "min": 101.155,
    "mean": 180.276
   }
  }

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>

Changes in v2:
    - include <math.h> (fix the linknamespace failures)
    - restored original benchtests/strcoll-inputs/filelist#en_US.UTF-8 file
    - restored original wrapper code (math/w_tgammaf_compat.c),
      except for the dealing with the sign
    - removed the tgammaf/float entries in all libm-test-ulps files
    - address other comments from Joseph Myers
      (https://sourceware.org/pipermail/libc-alpha/2024-July/158736.html)

Changes in v3:
    - pass NULL argument for signgam from w_tgammaf_compat.c
    - use of math_narrow_eval
    - added more comments

Changes in v4:
    - initialize local_signgam to 0 in math/w_tgamma_template.c
    - replace sysdeps/ieee754/dbl-64/gamma_productf.c by dummy file

Changes in v5:
    - do not mention local_signgam any more in math/w_tgammaf_compat.c
    - initialize local_signgam to 1 instead of 0 in w_tgamma_template.c
      and added comment

Changes in v6:
    - pass NULL as 2nd argument of __ieee754_gammaf_r in
      w_tgammaf_compat.c, and check for NULL in e_gammaf_r.c

Changes in v7:
    - added Signed-off-by line for Alexei Sibidanov (author of the code)

Changes in v8:
    - added Signed-off-by line for Paul Zimmermann (submitted of the patch)

Changes in v9:
    - address comments from review by Adhemerval Zanella
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-10-11 11:12:32 +02:00
Joe Ramsay
16a59571e4 AArch64: Simplify rounding-multiply pattern in several AdvSIMD routines
This operation can be simplified to use simpler multiply-round-convert
sequence, which uses fewer instructions and constants.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:08 +01:00
Joe Ramsay
7900ac490d AArch64: Improve codegen in users of ADVSIMD expm1f helper
Rearrange operations so MOV is not necessary in reduction or around
the special-case handler.  Reduce memory access by using more indexed
MLAs in polynomial.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay
5bc100bd4b AArch64: Improve codegen in users of AdvSIMD log1pf helper
log1pf is quite register-intensive - use fewer registers for the
polynomial, and make various changes to shorten dependency chains in
parent routines.  There is now no spilling with GCC 14.  Accuracy moves
around a little - comments adjusted accordingly but does not require
regen-ulps.

Use the helper in log1pf as well, instead of having separate
implementations.  The more accurate polynomial means special-casing can
be simplified, and the shorter dependency chain avoids the usual dance
around v0, which is otherwise difficult.

There is a small duplication of vectors containing 1.0f (or 0x3f800000) -
GCC is not currently able to efficiently handle values which fit in FMOV
but not MOVI, and are reinterpreted to integer.  There may be potential
for more optimisation if this is fixed.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay
a15b1394b5 AArch64: Improve codegen in SVE F32 logs
Reduce MOVPRFXs by using unpredicated (non-destructive) instructions
where possible.  Similar to the recent change to AdvSIMD F32 logs,
adjust special-case arguments and bounds to allow for more optimal
register usage.  For all 3 routines one MOVPRFX remains in the
reduction, which cannot be avoided as immediate AND and ASR are both
destructive.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay
7b8c134b54 AArch64: Improve codegen in SVE expf & related routines
Reduce MOV and MOVPRFX by improving special-case handling.  Use inline
helper to duplicate the entire computation between the special- and
non-special case branches, removing the contention for z0 between x
and the return value.

Also rearrange some MLAs and MLSs - by making the multiplicand the
destination we can avoid a MOVPRFX in several cases.  Also change which
constants go in the vector used for lanewise ops - the last lane is no
longer wasted.

Spotted that shift was incorrect in exp2f and exp10f, w.r.t. to the
comment that explains it.  Fixed - worst-case ULP for exp2f moves
around but it doesn't change significantly for either routine.

Worst-case error for coshf increases due to passing x to exp rather
than abs(x) - updated the comment, but does not require regen-ulps.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay
751a5502be AArch64: Add vector logp1 alias for log1p
This enables vectorisation of C23 logp1, which is an alias for log1p.
There are no new tests or ulp entries because the new symbols are simply
aliases.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-19 17:53:34 +01:00
Wilco Dijkstra
8ecb477ea1 AArch64: Remove memset-reg.h
Remove memset-reg.h by moving register definitions into the memset
implementations.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-09-10 14:18:03 +01:00
Wilco Dijkstra
cec3aef324 AArch64: Optimize memset
Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes.  Remove unnecessary code for ZVA sizes
other than 64 and 128.  Performance of random memset benchmark improves by 24%
on Neoverse N1.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-09-09 15:30:00 +01:00
Joe Ramsay
8b09af572b aarch64: Avoid redundant MOVs in AdvSIMD F32 logs
Since the last operation is destructive, the first argument to the FMA
also has to be the first argument to the special-case in order to
avoid unnecessary MOVs. Reorder arguments and adjust special-case
bounds to facilitate this.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-09 13:03:49 +01:00
Adhemerval Zanella
e2f88d8524 aarch64: Regenerate ULPs
From new tests added by 0797283910.
2024-08-07 11:02:03 -03:00
Wilco Dijkstra
3dc426b642 AArch64: Improve generic strlen
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates.  Change final
size computation to avoid increasing latency.  On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-08-07 14:58:46 +01:00
Adhemerval Zanella
cfc9b07346 aarch64: Regenerate ULPs
From new tests added by 4dc22baa84.
2024-07-25 10:41:30 -03:00
Andrew Pinski
2f1f7a5f8a
Aarch64: Add new memset for Qualcomm's oryon-1 core
Qualcom's new core, oryon-1, has a different characteristics for
memset than the current versions of memset. For non-zero, larger
sizes, using GPRs rather than the SIMD stores is ~30% faster.
For even larger sizes, using the nontemporal stores is needed
not to polute the L1/L2 caches.

For zero values, using `dc zva` should be used. Since we
know the size will always be 64 bytes, we don't need to figure
out the size there.

I started with the emag memset and added back the `dc zva` code.

Changes since v1:
* v3: Fix comment formating

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-30 13:47:17 +02:00
Andrew Pinski
4dc83cac78
Aarch64: Add memcpy for qualcomm's oryon-1 core
Qualcomm's new core (oryon-1) has a different performance characteristic
than other cores. For memcpy, it is faster to use the GPRs to
do the copy for large sizes (2x faster). For even larger sizes,
it is better to use the nontemporal load/store instructions so
we don't pollute the L1/L2 caches.

For smaller sizes, the characteristic are very similar to
other cores.
I used the thunderx memcpy as a starting point and expanded from there.

Changes since v1:
* v2: Fix ordering in Makefile.
* v3: Fix comment grammar about the ldnp/stnp instructions.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-30 13:46:33 +02:00
Adhemerval Zanella
45f5f51b85 aarch64: Update ulps
For the exp10m1, exp2m1, and log10p1 implementations.
2024-06-18 17:31:10 -03:00
Andreas K. Hüttel
98ffc1bfeb
Convert to autoconf 2.72 (vanilla release, no distribution patches)
As discussed at the patch review meeting

Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Reviewed-by: Simon Chopin <simon.chopin@canonical.com>
2024-06-17 21:15:28 +02:00
Joseph Myers
bb014f50c4 Implement C23 logp1
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the logp1 functions (aliases for log1p functions - the
name is intended to be more consistent with the new log2p1 and
log10p1, where clearly it would have been very confusing to name those
functions log21p and log101p).  As aliases rather than new functions,
the content of this patch is somewhat different from those actually
adding new functions.

Tests are shared with log1p, so this patch *does* mechanically update
all affected libm-test-ulps files to expect the same errors for both
functions.

The vector versions of log1p on aarch64 and x86_64 are *not* updated
to have logp1 aliases (and thus there are no corresponding header,
tests, abilist or ulps changes for vector functions either).  It would
be reasonable for such vector aliases and corresponding changes to
other files to be made separately.  For now, the log1p tests instead
avoid testing logp1 in the vector case (a Makefile change is needed to
avoid problems with grep, used in generating the .c files for vector
function tests, matching more than one ALL_RM_TEST line in a file
testing multiple functions with the same inputs, when it assumes that
the .inc file only has a single such line).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-06-17 13:47:09 +00:00
Adhemerval Zanella
ef9596352b aarch64: Remove duplicate memchr/strlen in libc.a (BZ 31777)
The generic version provides weak definitions of memchr/strlen,
which are already provided by the ifunc resolvers.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-05-23 09:36:08 -03:00
Joe Ramsay
0fed0b250f aarch64/fpu: Add vector variants of pow
Plus a small amount of moving includes around in order to be able to
remove duplicate definition of asuint64.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-21 14:38:49 +01:00
Adhemerval Zanella
241338bd6f aarch64: Update ulps
For the log2p1 implementation.
2024-05-20 13:12:23 -03:00
Joe Ramsay
75207bde68 aarch64/fpu: Add vector variants of cbrt
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-16 14:35:06 +01:00
Joe Ramsay
157f89fa3d aarch64/fpu: Add vector variants of hypot
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-16 14:34:43 +01:00
Joe Ramsay
90a6ca8b28 aarch64: Fix AdvSIMD libmvec routines for big-endian
Previously many routines used * to load from vector types stored
in the data table. This is emitted as ldr, which byte-swaps the
entire vector register, and causes bugs for big-endian when not
all lanes contain the same value. When a vector is to be used
this way, it has been replaced with an array and the load with an
explicit ld1 intrinsic, which byte-swaps only within lanes.

As well, many routines previously used non-standard GCC syntax
for vector operations such as indexing into vectors types with []
and assembling vectors using {}. This syntax should not be mixed
with ACLE, as the former does not respect endianness whereas the
latter does. Such examples have been replaced with, for instance,
vcombine_* and vgetq_lane* intrinsics. Helpers which only use the
GCC syntax, such as the v_call helpers, do not need changing as
they do not use intrinsics.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-14 13:10:33 +01:00
Adhemerval Zanella
bcae44ea85 elf: Only process multiple tunable once (BZ 31686)
The 680c597e9c commit made loader reject ill-formatted strings by
first tracking all set tunables and then applying them. However, it does
not take into consideration if the same tunable is set multiple times,
where parse_tunables_string appends the found tunable without checking
if it was already in the list. It leads to a stack-based buffer overflow
if the tunable is specified more than the total number of tunables.  For
instance:

  GLIBC_TUNABLES=glibc.malloc.check=2:... (repeat over the number of
  total support for different tunable).

Instead, use the index of the tunable list to get the expected tunable
entry.  Since now the initial list is zero-initialized, the compiler
might emit an extra memset and this requires some minor adjustment
on some ports.

Checked on x86_64-linux-gnu and aarch64-linux-gnu.

Reported-by: Yuto Maeda <maeda@cyberdefense.jp>
Reported-by: Yutaro Shimizu <shimizu@cyberdefense.jp>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2024-05-07 12:16:36 -03:00
Wilco Dijkstra
6dae61567f AArch64: Remove unused defines of CPU names
Remove unused defines of CPU names in cpu-features.h.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-04-30 13:32:29 +01:00
Florian Weimer
f8d8b1b1e6 aarch64: Enhanced CPU diagnostics for ld.so
This prints some information from struct cpu_features, and the midr_el1
and dczid_el0 system register contents on every CPU.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-08 16:48:55 +02:00
Adhemerval Zanella
50c2be2390 aarch64: Remove ld.so __tls_get_addr plt usage
Use the hidden alias instead.

Checked on aarch64-linux-gnu.
2024-04-04 17:02:32 -03:00
Joe Ramsay
87cb1dfcd6 aarch64/fpu: Add vector variants of erfc
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:24 +01:00
Joe Ramsay
3d3a4fb8e4 aarch64/fpu: Add vector variants of tanh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:20 +01:00
Joe Ramsay
eedbbca0bf aarch64/fpu: Add vector variants of sinh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:16 +01:00
Joe Ramsay
8b67920528 aarch64/fpu: Add vector variants of atanh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:12 +01:00
Joe Ramsay
81406ea3c5 aarch64/fpu: Add vector variants of asinh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:33:02 +01:00
Joe Ramsay
b09fee1d21 aarch64/fpu: Add vector variants of acosh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:32:58 +01:00
Joe Ramsay
bdb5705b7b aarch64/fpu: Add vector variants of cosh
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04 10:32:52 +01:00