Commit Graph

16449 Commits

Siddhesh Poyarekar
713d6d7e78 x86/string: Use movsl instead of movsd in strncat [BZ #32344]
The previous patch missed strncat; fix that here.

Resolves: BZ #32344

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2024-11-21 17:11:01 -05:00
Andrew Pinski
e6590f0c86 aarch64: Remove non-temporal load/stores from oryon-1's memset
The hardware architects now recommend against using non-temporal
load/stores for memset. This patch removes that path. I also found no
difference in memset speed with or without non-temporal load/stores.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-21 11:32:23 -03:00
Andrew Pinski
eb5eeb4740 aarch64: Remove non-temporal load/stores from oryon-1's memcpy
The hardware architects now recommend against using non-temporal
load/stores for memcpy. This patch removes that path. I also found no
difference in memcpy speed with or without non-temporal load/stores.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-21 11:32:17 -03:00
Sachin Monga
3051f3495c powerpc64le: _init/_fini file changes for ROP
The ROP instructions were added in ISA 3.1 (i.e., Power10); however,
they were defined so that they behave as nops when executed on older
CPUs. This allows us to emit them on older CPUs, where they are simply
ignored, while a binary run on a Power10 is ROP protected.

Hash instructions use negative offsets, so the default position of the
ROP pointer is FRAME_ROP_SAVE from the caller's SP.

Modified FRAME_MIN_SIZE_PARM to 112 for ELFv2 to reserve an additional
16 bytes for the ROP save slot and padding.

Signed-off-by: Sachin Monga <smonga@linux.ibm.com>
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2024-11-20 16:50:34 -05:00
Yury Khrustalev
f4d00dd60d AArch64: Add support for memory protection keys
This patch adds support for memory protection keys on AArch64 systems
that enable the Stage 1 permission overlay feature (FEAT_S1POE)
introduced in Armv8.9 / 9.4 [1]. It includes:

 1. Internal functions "pkey_read" and "pkey_write" to access data
    associated with memory protection keys.
 2. Implementation of API functions "pkey_get" and "pkey_set" for
    the AArch64 target.
 3. AArch64-specific PKEY flags for READ and EXECUTE (see below).
 4. New target-specific test that checks behaviour of pkeys on
    AArch64 targets.
 5. An extension of the existing generic test for pkeys.
 6. HWCAP constant for Permission Overlay Extension feature.

To support more accurate mapping of underlying permissions to the
PKEY flags, we introduce additional AArch64-specific flags. The full
list of flags is:

 - PKEY_UNRESTRICTED: 0x0 (for completeness)
 - PKEY_DISABLE_ACCESS: 0x1 (existing flag)
 - PKEY_DISABLE_WRITE: 0x2 (existing flag)
 - PKEY_DISABLE_EXECUTE: 0x4 (new flag, AArch64 specific)
 - PKEY_DISABLE_READ: 0x8 (new flag, AArch64 specific)

The problem here is that PKEY_DISABLE_ACCESS has unusual semantics, as
it overlaps with the existing PKEY_DISABLE_WRITE and the new
PKEY_DISABLE_READ. For this reason, the mapping between the permission
bits RWX and the "restriction" bits awxr (a for disable access, etc.)
becomes complicated:

 - PKEY_DISABLE_ACCESS disables both R and W
 - PKEY_DISABLE_{WRITE,READ} disables W and R respectively
 - PKEY_DISABLE_EXECUTE disables X

Combinations like the one below are accepted although they are redundant:

 - PKEY_DISABLE_ACCESS | PKEY_DISABLE_READ | PKEY_DISABLE_WRITE

Reverse mapping tries to retain backward compatibility and ORs
PKEY_DISABLE_ACCESS whenever both flags PKEY_DISABLE_READ and
PKEY_DISABLE_WRITE would be present.

This will break code that compares pkey_get output with == instead of
using bitwise operations. The latter is more correct, since the PKEY_*
constants are essentially bit flags.

It should be noted that PKEY_DISABLE_ACCESS does not prevent execution.

[1] https://developer.arm.com/documentation/ddi0487/ka/ section D8.4.1.4

Co-authored-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-20 11:30:58 +00:00
Andrew Pinski
e162ab2bf1 AArch64: Remove thunderx{,2} memcpy
ThunderX1 and ThunderX2 have been retired for a few years now, so
remove the thunderx{,2}-specific versions of memcpy. Their performance
gain was for medium and large sizes, while the generic (aarch64)
memcpy handles those only slightly worse.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-11-20 11:23:53 +00:00
Joseph Myers
d899b48a30 Fix femode_t conditionals for arc and or1k
Two of the architecture bits/fenv.h headers define femode_t if
__GLIBC_USE (IEC_60559_BFP_EXT), instead of the correct condition
__GLIBC_USE (IEC_60559_BFP_EXT_C23) (both were added after commit
0175c9e9be, but were probably first
developed before it and then not updated to take account of its
changes).  This results in failures of the installed headers check for
fenv.h when building with GCC 15 (defaults to -std=gnu23 - we don't
yet have an installed-headers test specifically for C23 mode and don't
yet require a compiler with such a mode for building glibc) together
with a combination of options leaving C23 features enabled, since the
declarations of functions using femode_t use the correct conditions;
see
<https://sourceware.org/pipermail/libc-testresults/2024q4/013163.html>.
Fix the conditionals to get <fenv.h> to work correctly in C23 mode
again.
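
A minimal sketch of the corrected guard (the typedef shown is
illustrative; the real underlying type is architecture-specific):

  /* bits/fenv.h (arc, or1k): use the C23 condition so femode_t matches
     the declarations of the functions that use it.  */
  #if __GLIBC_USE (IEC_60559_BFP_EXT_C23)
  typedef unsigned int femode_t;   /* illustrative type */
  #endif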

Tested with build-many-glibcs.py (arc-linux-gnu, arc-linux-gnuhf,
or1k-linux-gnu-hard, or1k-linux-gnu-soft).
2024-11-19 22:25:39 +00:00
Mahesh Bodapati
3ef7e42861 powerpc64le: Optimized strcat for POWER10
This patch adds an optimized strcat that uses the default strcat
implementation, which in turn calls the POWER10 strcpy and strlen
routines.
2024-11-19 15:59:15 -05:00
Peter Bergner
229265cc2c powerpc: Improve the inline asm for syscall wrappers
Update the inline asm syscall wrappers to match the newer register
constraint usage in INTERNAL_VSYSCALL_CALL_TYPE.  Use the faster
mfocrf instruction when available, rather than the slower, microcoded
mfcr instruction.
2024-11-19 12:43:57 -05:00
gfleury
7f045c0b48 htl: move pthread_attr_init into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:37:35 +01:00
gfleury
1a1cedd635 htl: move pthread_attr_setguardsize into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:37:35 +01:00
gfleury
f26b272a75 htl: move pthread_attr_setschedparam into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:37:35 +01:00
gfleury
32aa498ceb htl: move pthread_attr_setscope into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:37:35 +01:00
gfleury
4a8b7d7e62 htl: move pthread_attr_setstackaddr into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:37:35 +01:00
gfleury
d69a010e7b htl: move pthread_attr_setstacksize into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:37:35 +01:00
gfleury
330c1fad5b htl: move pthread_attr_getstack into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:37:35 +01:00
gfleury
1428ae39e8 htl: move pthread_attr_getstackaddr into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:37:35 +01:00
gfleury
993440a260 htl: move pthread_attr_getstacksize into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:34:34 +01:00
gfleury
4bcda927fe htl: move pthread_attr_getscope into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:19:00 +01:00
gfleury
6caf24c972 htl: move pthread_attr_getguardsize into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:18:59 +01:00
gfleury
f55cf584ff htl: move __pthread_default_attr into libc
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:08:27 +01:00
gfleury
736befab6c htl: move pthread_attr_destroy into libc.
Signed-off-by: gfleury <gfleury@disroot.org>
2024-11-19 01:08:14 +01:00
Noah Goldstein
c510681a69 x86/string: Use movsl instead of movsd in strncpy/strncat [BZ #32344]
`ld`, starting with version 2.40, emits a warning when `movsd` is
used. There is no change to the actual code produced.
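A hedged illustration (not the glibc source) of the spelling change,
here via GCC inline assembly; both spellings encode the same
instruction:

  #include <stddef.h>

  /* Copy 'n' 32-bit words.  Spelling the string move 'movsl' avoids
     the assembler's ambiguity warning with the SSE2 'movsd' mnemonic. */
  static void
  copy_words (void *dst, const void *src, size_t n)
  {
    __asm__ volatile ("rep movsl"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :: "memory");
  }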
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-11-13 10:09:30 -06:00
John David Anglin
b919fe1f6d hppa: Update libm-test-ulps
Update imaginary part of csin.

Signed-off-by: John David Anglin <dave.anglin@bell.net>
2024-11-12 21:32:54 -05:00
Samuel Thibault
e5c2738f17 Revert "hurd: Stop depending on the default_pager stubs provided by gnumach"
This reverts commit f7f7dd8009.

default_pager is actually also used in e.g. xosview.
2024-11-13 01:34:09 +01:00
Adhemerval Zanella
461cab1de7 linux: Add support for getrandom vDSO
Linux 6.11 provides getrandom() in the vDSO. It operates on a
thread-local opaque state allocated with mmap, using flags specified
by the vDSO.

Multiple states are allocated at once, as many as fit into a page, and
these are held in an array of available states to be doled out to each
thread upon first use, and recycled when a thread terminates. As these
states run low, more are allocated.

To make this procedure async-signal-safe, a simple guard is used in the
LSB of the opaque state address, falling back to the syscall if there's
reentrancy contention.

Also, _Fork() is handled by blocking signals on opaque state allocation
(so _Fork() always sees a consistent state even if it interrupts a
getrandom() call) and by iterating over the thread stack cache on
reclaim_stack. Each opaque state will be in the free states list
(grnd_alloc.states) or allocated to a running thread.

The cancellation is handled by always using GRND_NONBLOCK flags while
calling the vDSO, and falling back to the cancellable syscall if the
kernel returns EAGAIN (would block). Since getrandom is not defined by
POSIX and cancellation is supported as an extension, the cancellation is
handled as 'may occur' instead of 'shall occur' [1], meaning that if
the vDSO does not block (the expected behavior), getrandom will not
act as a cancellation entrypoint. This avoids a pthread_testcancel
call on the fast path (unlike 'shall occur' functions such as
sem_wait()).
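
A minimal sketch of the guard-and-fallback idea (hypothetical names:
'vgetrandom_fn' stands in for the resolved vDSO symbol, and state
allocation is omitted; this is not the glibc implementation):

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/random.h>        /* getrandom, GRND_NONBLOCK */
  #include <sys/types.h>

  /* Hypothetical pointer resolved from the vDSO; NULL until set up.  */
  static ssize_t (*vgetrandom_fn) (void *state, void *buf, size_t len,
                                   unsigned int flags);

  /* Per-thread opaque state; its LSB doubles as the reentrancy guard. */
  static _Thread_local uintptr_t grnd_state;

  static ssize_t
  getrandom_sketch (void *buf, size_t len, unsigned int flags)
  {
    uintptr_t s = grnd_state;
    /* No vDSO, no state yet, or a signal handler interrupted a call in
       flight (guard bit set): use the syscall directly.  */
    if (vgetrandom_fn == NULL || s == 0 || (s & 1) != 0)
      return getrandom (buf, len, flags);
    grnd_state = s | 1;                        /* set guard bit */
    ssize_t r = vgetrandom_fn ((void *) s, buf, len,
                               flags | GRND_NONBLOCK);
    grnd_state = s;                            /* clear guard bit */
    if (r < 0)                                 /* e.g. EAGAIN */
      r = getrandom (buf, len, flags);         /* cancellable fallback */
    return r;
  }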

It is currently enabled for x86_64, where vDSO support is available in
Linux 6.11, and for aarch64, powerpc32, powerpc64, loongarch64, and
s390x, where it is available in Linux 6.12.

Link: https://pubs.opengroup.org/onlinepubs/9799919799/nframe.html [1]
Co-developed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com> # x86_64
Tested-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> # x86_64, aarch64
Tested-by: Xi Ruoyao <xry111@xry111.site> # x86_64, aarch64, loongarch64
Tested-by: Stefan Liebler <stli@linux.ibm.com> # s390x
2024-11-12 14:42:12 -03:00
caiyinyu
ab4388f91c LoongArch: Update ulps
Needed for test-float-cacosh, test-float-csin, test-float32-cacosh and
test-float32-csin.

Signed-off-by: caiyinyu <caiyinyu@loongson.cn>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-11-12 09:19:23 +08:00
Samuel Thibault
d2e65aa7d6 mach: Fix __xpg_strerror_r on in-range but undefined errors [BZ #32350]
For instance, 1073741906 leads to system 16, subsystem 0 and code 82,
which is in range (max_code is 122) but not defined. Return EINVAL in
that case.
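
For reference, the example value decomposes with the macros from
<mach/error.h> (Mach/Hurd only):

  #include <mach/error.h>
  #include <stdio.h>

  int
  main (void)
  {
    int err = 1073741906;   /* 0x40000052 */
    /* Prints "system 16, sub 0, code 82".  */
    printf ("system %d, sub %d, code %d\n",
            err_get_system (err), err_get_sub (err), err_get_code (err));
    return 0;
  }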
2024-11-09 20:00:40 +01:00
Noah Goldstein
6754b5becf x86/string: Use movsl instead of movsd [BZ #32344]
`ld`, starting with version 2.40, emits a warning when `movsd` is
used. There is no change to the actual code produced.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-11-08 17:23:05 -06:00
Joseph Myers
c7dcf594f4 Rename new tst-sem17 test to tst-sem18
As noted by Adhemerval, we already have a tst-sem17 in nptl.

Tested for x86_64.
2024-11-08 17:08:09 +00:00
Joseph Myers
f745d78e26 Avoid uninitialized result in sem_open when file does not exist
A static analyzer apparently reported an uninitialized use of the
variable result in sem_open in the case where the file is required to
exist but does not exist.

The report appears to be correct; set result to SEM_FAILED in that
case, and add a test for it.
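
A sketch of the test's idea (hypothetical semaphore name; not the
actual tst-sem18 source):

  #include <assert.h>
  #include <errno.h>
  #include <semaphore.h>

  int
  main (void)
  {
    errno = 0;
    /* Opening a nonexistent semaphore without O_CREAT must fail
       cleanly with SEM_FAILED and ENOENT, not return an uninitialized
       value.  */
    sem_t *s = sem_open ("/tst-sem18-missing", 0);
    assert (s == SEM_FAILED && errno == ENOENT);
    return 0;
  }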

Note: the test passes for me even without the sem_open fix, presumably
because result happens to get the value SEM_FAILED (i.e. 0) when
uninitialized.

Tested for x86_64.
2024-11-08 01:53:48 +00:00
Michael Jeanson
97f60abd25 nptl: initialize rseq area prior to registration
Per the rseq syscall documentation, three fields are required to be
initialized by userspace prior to registration: 'cpu_id', 'rseq_cs'
and 'flags'. Since we have no guarantee that 'struct pthread' is
cleared on all architectures, explicitly set those three fields prior
to registration.
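
A minimal stand-alone sketch of the required initialization (the
RSEQ_SIG value shown is the one used for x86 in the kernel selftests
and is architecture-specific; note that glibc normally performs this
registration itself):

  #include <linux/rseq.h>    /* struct rseq, RSEQ_CPU_ID_UNINITIALIZED */
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Architecture-specific signature; this value is used on x86.  */
  #define RSEQ_SIG 0x53053053

  static _Thread_local struct rseq rseq_area
    __attribute__ ((aligned (32)));

  static int
  register_rseq (void)
  {
    /* Zeroing the area initializes 'rseq_cs' and 'flags'; 'cpu_id'
       must then be marked as uninitialized before registering.  */
    memset (&rseq_area, 0, sizeof rseq_area);
    rseq_area.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
    return syscall (SYS_rseq, &rseq_area, sizeof rseq_area, 0, RSEQ_SIG);
  }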

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-11-07 22:23:49 +01:00
Mark Wielaard
c18de3b76a s390x: Update ulps
Needed for test-float-cacosh, test-float-csin, test-float32-cacosh and
test-float32-csin.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-11-07 20:58:05 +01:00
Adhemerval Zanella
12b8dd7718 math: Fix log10f on some ABIs
The commit 9247f53219 triggered some regressions on loongarch and
riscv:

math/test-float-log10
math/test-float32-log10

They are due to an incorrect sync with CORE-MATH for special 0.0/-0.0
inputs.

Checked on aarch64-linux-gnu and loongarch64-linux-gnu-lp64d.
2024-11-07 07:59:43 -03:00
caiyinyu
1b70a0a024 nptl: fix __builtin_thread_pointer detection on LoongArch
Signed-off-by: caiyinyu <caiyinyu@loongson.cn>
2024-11-07 14:08:30 +08:00
Florian Weimer
ba60be8735 math: Fix incorrect results of exp10m1f with some GCC versions
On GCC 11 (x86-64), the previous code produced test failures like
this one:

Failure: Test: exp10m1_towardzero (-0x1.1p+4)
Result:
 is:         -1.00000000e+00  -0x1.000000p+0
 should be:  -9.99999940e-01  -0x1.fffffep-1
 difference:  5.96046447e-08   0x1.000000p-24
 ulp       :  1.0000
 max.ulp   :  0.0000

Apply a similar fix to exp2m1f.

Co-authored-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-06 16:09:05 +01:00
Yury Khrustalev
ff254cabd6 misc: Align argument name for pkey_*() functions with the manual
Change the name of the access_rights argument to access_restrictions
for the following functions:

 - pkey_alloc()
 - pkey_set()

as this argument refers to access restrictions rather than access
rights and previous name might have been misleading.
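
A short usage sketch under the new naming (illustrative program, not
from the glibc tree):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  int
  main (void)
  {
    /* The second argument expresses access *restrictions*: here,
       writes through pages tagged with this key are denied.  */
    int key = pkey_alloc (0, PKEY_DISABLE_WRITE);
    if (key < 0)
      {
        perror ("pkey_alloc");
        return 1;
      }
    pkey_free (key);
    return 0;
  }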

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-06 13:11:33 +00:00
Florian Weimer
f2326c2ec0 elf: Introduce _dl_relocate_object_no_relro
And make _dl_protect_relro apply RELRO conditionally.

Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-06 10:33:44 +01:00
Aurelien Jarno
273694cd78 Add Arm HWCAP2_* constants from Linux 3.15 and 6.2 to <bits/hwcap.h>
Linux 3.15 and 6.2 added HWCAP2_* values for Arm. These bits have
already been added to dl-procinfo.{c,h} in commits 9aea0cb842 and
8ebe9c0b38. Also add them to <bits/hwcap.h> so that they can be used
in user code, for example to check bits in the value returned by
getauxval(AT_HWCAP2), as in the sketch below.
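
For example (HWCAP2_AES is one of the Arm bits added in Linux 3.15):

  #include <stdio.h>
  #include <sys/auxv.h>    /* getauxval; pulls in <bits/hwcap.h> */

  int
  main (void)
  {
  #ifdef HWCAP2_AES
    if (getauxval (AT_HWCAP2) & HWCAP2_AES)
      puts ("AES instructions supported");
  #endif
    return 0;
  }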

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2024-11-05 21:03:37 +01:00
Joe Ramsay
2d82d781a5 AArch64: Remove SVE erf and erfc tables
By using a combination of mask-and-add instead of the shift-based
index calculation, the routines can share the same table as other
variants with no performance degradation.

The tables change name because of other changes in downstream AOR.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-11-01 16:10:41 +00:00
Adhemerval Zanella
6d477b8de8 x86_64: Add exp2m1f with FMA
The CORE-MATH exp2m1f implementation showed slightly worse latency
when using the x86_64 baseline ABI.  This patch adds an ifunc variant
with similar performance for x86_64-v3.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:40 -03:00
Adhemerval Zanella
c28f8d7f19 x86_64: Add exp10m1f with FMA
The CORE-MATH exp10m1f implementation showed slightly worse latency
when using the x86_64 baseline ABI.  This patch adds an ifunc variant
with similar performance for x86_64-v3.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:40 -03:00
Adhemerval Zanella
f338c7c5f5 math: Use log10p1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding
mode) and shows slightly better performance than the generic log10p1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      68.5251        32.2627        52.92%
x86_64v2                    68.8912        32.7887        52.41%
x86_64v3                    59.3427        27.0521        54.41%
i686                        162.026        103.383        36.19%
aarch64                     26.8513        14.5695        45.74%
power10                     12.7426         8.4929        33.35%
powerpc                     16.6768        9.29135        44.29%

reciprocal-throughput        master        patched   improvement
x86_64                      26.0969        12.4023        52.48%
x86_64v2                    25.0045        11.0748        55.71%
x86_64v3                    20.5610        10.2995        49.91%
i686                        89.8842        78.5211        12.64%
aarch64                     17.1200         9.4832        44.61%
power10                      6.7814         6.4258         5.24%
powerpc                      15.769         7.6825        51.28%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:40 -03:00
Adhemerval Zanella
8ae9e51376 math: Use log1pf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding
mode) and shows slightly better performance than the generic log1pf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      71.8142        38.9668        45.74%
x86_64v2                    71.9094        39.1321        45.58%
x86_64v3                    60.1000        32.4016        46.09%
i686                        147.105        104.258        29.13%
aarch64                     26.4439        14.0050        47.04%
power10                     19.4874         9.4146        51.69%
powerpc                     17.6145        8.00736        54.54%

reciprocal-throughput        master        patched   improvement
x86_64                      19.7604        12.7254        35.60%
x86_64v2                    19.0039        11.9455        37.14%
x86_64v3                    16.8559        11.9317        29.21%
i686                        82.3426        73.9718        10.17%
aarch64                     14.4665         7.9614        44.97%
power10                     11.9974         8.4117        29.89%
powerpc                     7.15222         6.0914        14.83%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00
Adhemerval Zanella
c369580814 math: Use log2p1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic log2p1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      70.1462        47.0090        32.98%
x86_64v2                    70.2513        47.6160        32.22%
x86_64v3                    60.4840        39.9443        33.96%
i686                        164.068        122.909        25.09%
aarch64                     25.9169        16.9207        34.71%
power10                     18.1261        9.8592         45.61%
powerpc                     17.2683        9.38665        45.64%

reciprocal-throughput        master        patched   improvement
x86_64                      26.2240        16.4082        37.43%
x86_64v2                    25.0911        15.7480        37.24%
x86_64v3                    20.9371        11.7264        43.99%
i686                        90.4209        95.3073        -5.40%
aarch64                     16.8537        8.9561         46.86%
power10                     12.9401        6.5555         49.34%
powerpc                     9.01763        7.54745        16.30%

The performance decrease for i686 is mostly due to the use of the x87
FPU; when building with '-msse2 -mfpmath=sse':

                             master        patched   improvement
latency                     164.068        102.982        37.23%
reciprocal-throughput       89.1968        82.5117         7.49%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00
Adhemerval Zanella
9247f53219 math: Use log10f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic log10f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      49.9017        33.5143        32.84%
x86_64v2                    50.4878        33.5623        33.52%
x86_64v3                    50.0991        27.6078        44.89%
i686                        140.874        106.086        24.69%
aarch64                     19.2846        11.3573        41.11%
power10                     14.0994        7.7739        44.86%
powerpc                     14.2898        7.92497        44.54%

reciprocal-throughput        master        patched   improvement
x86_64                      17.8336        12.9074        27.62%
x86_64v2                    16.4418        11.3220        31.14%
x86_64v3                    15.6002        10.5158        32.59%
i686                        66.0678        80.2287        -21.43%
aarch64                      9.4906        6.8393        27.94%
power10                      7.5255        5.5084        26.80%
powerpc                      9.5204        6.98055        26.68%

The performance decrease for i686 is mostly due to the use of the x87
FPU; when building with '-msse2 -mfpmath=sse':

                             master        patched   improvement
latency                     140.874        77.1137        45.26%
reciprocal-throughput        64.481        56.4397        12.47%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00
Adhemerval Zanella
bbd578b38d math: Use expm1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic expm1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      96.7402        36.4026        62.37%
x86_64v2                    97.5391        33.4625        65.69%
x86_64v3                    82.1778        30.8668        62.44%
i686                         120.58        94.8302        21.35%
aarch64                     32.3558        12.8881        60.17%
power10                     23.5087        9.8574         58.07%
powerpc                     23.4776        9.06325        61.40%

reciprocal-throughput        master        patched   improvement
x86_64                      27.8224        15.9255        42.76%
x86_64v2                    27.8364        9.6438         65.36%
x86_64v3                    20.3227        9.6146         52.69%
i686                        63.5629        59.4718         6.44%
aarch64                     17.4838        7.1082         59.34%
power10                     12.4644        8.7829         29.54%
powerpc                     14.2152        5.94765        58.16%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:35 -03:00
Adhemerval Zanella
5c22fd25c1 math: Use exp2m1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp2m1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).  The
only change is to handle FLT_MAX_EXP for FE_DOWNWARD or FE_TOWARDZERO.

The benchmark inputs are based on exp2f ones.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      40.6042        48.7104       -19.96%
x86_64v2                    40.7506        35.9032        11.90%
x86_64v3                    35.2301        31.7956        9.75%
i686                        102.094        94.6657        7.28%
aarch64                     18.2704        15.1387        17.14%
power10                     11.9444         8.2402        31.01%

reciprocal-throughput        master        patched   improvement
x86_64                      20.8683        16.1428        22.64%
x86_64v2                    19.5076        10.4474        46.44%
x86_64v3                    19.2106        10.4014        45.86%
i686                        56.4054        59.3004        -5.13%
aarch64                     12.0781         7.3953        38.77%
power10                      6.5306         5.9388         9.06%

The generic implementation calls __ieee754_exp2f, and x86_64 provides
an optimized ifunc version (built with -mfma -mavx2, not correctly
rounded).  This explains the performance difference for x86_64.

The same holds for i686, where the ABI provides an optimized
__ieee754_exp2f version built with '-msse2 -mfpmath=sse'.  When built
with the same flags, the new algorithm shows better performance:

                            master        patched    improvement
latency                    102.094        91.2823         10.59%
reciprocal-throughput      56.4054        52.7984          6.39%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:35 -03:00
Adhemerval Zanella
5fa89852fa math: Use exp10m1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp10m1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).  I mostly
fixed some small issues in corner cases (sNaN handling, -INFINITY,
a specific overflow check).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      45.4690        49.5845        -9.05%
x86_64v2                    46.1604        36.2665        21.43%
x86_64v3                    37.8442        31.0359        17.99%
i686                        121.367        93.0079        23.37%
aarch64                     21.1126        15.0165        28.87%
power10                     12.7426        8.4929         33.35%

reciprocal-throughput        master        patched   improvement
x86_64                      19.6005        17.4005        11.22%
x86_64v2                    19.6008        11.1977        42.87%
x86_64v3                    17.5427        10.2898        41.34%
i686                        59.4215        60.9675        -2.60%
aarch64                     13.9814        7.9173         43.37%
power10                      6.7814        6.4258          5.24%

The generic implementation calls __ieee754_exp10f, which has an
optimized version that is not correctly rounded; this is the main
culprit of the latency difference for x86_64 and the throughput
difference for i686.

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:26 -03:00
Adhemerval Zanella
345e9c7d0b math: Add e_gammaf_r to glibc code and style
Also remove the use of builtins in favor of the standard names; the
compiler already inlines them (if supported) with the current compiler
options.  It also fixes an issue where __builtin_roundeven is not
supported on GCC older than version 10.

Checked on x86_64-linux-gnu and i686-linux-gnu.

Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:17:04 -03:00