Implement vectorized erf/erff containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector erf/erff with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized acosh/acoshf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector acosh/acoshf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized atanh/atanhf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector atanh/atanhf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized log1p/log1pf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector log1p/log1pf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized log2/log2f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector log2/log2f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized log10/log10f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector log10/log10f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized atan2/atan2f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector atan2/atan2f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized cbrt/cbrtf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector cbrt/cbrtf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized sinh/sinhf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector sinh/sinhf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized expm1/expm1f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector expm1/expm1f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized cosh/coshf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector cosh/coshf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized exp10/exp10f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector exp10/exp10f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized exp2/exp2f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector exp2/exp2f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized hypot/hypotf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector hypot/hypotf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized asin/asinf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector asin/asinf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized atan/atanf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector atan/atanf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized acos/acosf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector acos/acosf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
1. Add sysdeps/x86_64/fpu/Makeconfig to auto-generate libmvec.mk, which
contains libmvec ABI test dependencies and CFLAGS, in the build directory.
2. Include libmvec.mk for libmvec ABI test dependencies and CFLAGS.
Tested on SSE4, AVX, AVX2 and AVX512 machines.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This patch adds the narrowing fused multiply-add functions from TS
18661-1 / TS 18661-3 / C2X to glibc's libm: ffma, ffmal, dfmal,
f32fmaf64, f32fmaf32x, f32xfmaf64 for all configurations; f32fmaf64x,
f32fmaf128, f64fmaf64x, f64fmaf128, f32xfmaf64x, f32xfmaf128,
f64xfmaf128 for configurations with _Float64x and _Float128;
__f32fmaieee128 and __f64fmaieee128 aliases in the powerpc64le case
(for calls to ffmal and dfmal when long double is IEEE binary128).
Corresponding tgmath.h macro support is also added.
The changes are mostly similar to those for the other narrowing
functions previously added, especially that for sqrt, so the
description of those generally applies to this patch as well. As with
sqrt, I reused the same test inputs in auto-libm-test-in as for
non-narrowing fma rather than adding extra or separate inputs for
narrowing fma. The tests in libm-test-narrow-fma.inc also follow
those for non-narrowing fma.
The non-narrowing fma has a known bug (bug 6801) that it does not set
errno on errors (overflow, underflow, Inf * 0, Inf - Inf). Rather
than fixing this or having narrowing fma check for errors when
non-narrowing does not (complicating the cases when narrowing fma can
otherwise be an alias for a non-narrowing function), this patch does
not attempt to check for errors from narrowing fma and set errno; the
CHECK_NARROW_FMA macro is still present, but as a placeholder that
does nothing, and this missing errno setting is considered to be
covered by the existing bug rather than needing a separate open bug.
missing-errno annotations are duly added to many of the
auto-libm-test-in test inputs for fma.
This completes adding all the new functions from TS 18661-1 to glibc,
so will be followed by corresponding stdc-predef.h changes to define
__STDC_IEC_60559_BFP__ and __STDC_IEC_60559_COMPLEX__, as the support
for TS 18661-1 will be at a similar level to that for C standard
floating-point facilities up to C11 (pragmas not implemented, but
library functions done). (There are still further changes to be done
to implement changes to the types of fromfp functions from N2548.)
Tested as followed: natively with the full glibc testsuite for x86_64
(GCC 11, 7, 6) and x86 (GCC 11); with build-many-glibcs.py with GCC
11, 7 and 6; cross testing of math/ tests for powerpc64le, powerpc32
hard float, mips64 (all three ABIs, both hard and soft float). The
different GCC versions are to cover the different cases in tgmath.h
and tgmath.h tests properly (GCC 6 has _Float* only as typedefs in
glibc headers, GCC 7 has proper _Float* support, GCC 8 adds
__builtin_tgmath).
include/math.h has a mechanism to redirect internal calls to various
libm functions, that can often be inlined by the compiler, to call
non-exported __* names for those functions in the case when the calls
aren't inlined, with the redirection being disabled when
NO_MATH_REDIRECT. Add fma to the functions to which this mechanism is
applied.
At present, libm-internal fma calls (generally to __builtin_fma*
functions) are only done when it's known the call will be inlined,
with alternative code not relying on an fma operation being used in
the caller otherwise. This patch is in preparation for adding the TS
18661 / C2X narrowing fma functions to glibc; it will be natural for
the narrowing function implementations to call the underlying fma
functions unconditionally, with this either being inlined or resulting
in an __fma* call. (Using two levels of round-to-odd computation like
that, in the case where there isn't an fma hardware instruction, isn't
optimal but is certainly a lot simpler for the initial implementation
than writing different narrowing fma implementations for all the
various pairs of formats.)
Tested with build-many-glibcs.py that installed stripped shared
libraries are unchanged by the patch (using
<https://sourceware.org/pipermail/libc-alpha/2021-September/130991.html>
to fix installed library stripping in build-many-glibcs.py). Also
tested for x86_64.
We stopped adding "Contributed by" or similar lines in sources in 2012
in favour of git logs and keeping the Contributors section of the
glibc manual up to date. Removing these lines makes the license
header a bit more consistent across files and also removes the
possibility of error in attribution when license blocks or files are
copied across since the contributed-by lines don't actually reflect
reality in those cases.
Move all "Contributed by" and similar lines (Written by, Test by,
etc.) into a new file CONTRIBUTED-BY to retain record of these
contributions. These contributors are also mentioned in
manual/contrib.texi, so we just maintain this additional record as a
courtesy to the earlier developers.
The following scripts were used to filter a list of files to edit in
place and to clean up the CONTRIBUTED-BY file respectively. These
were not added to the glibc sources because they're not expected to be
of any use in future given that this is a one time task:
https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dchttps://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Optimize loads of all bits set into ZMM register in AVX512 SVML codes
by replacing
vpbroadcastq .L_2il0floatpacket.16(%rip), %zmmX
and
vmovups .L_2il0floatpacket.13(%rip), %zmmX
with
vpternlogd $0xff, %zmmX, %zmmX, %zmmX
This fixes BZ #28252.
Finally remove all mpa related files, headers, declarations, probes, unused
tables and update makefiles.
Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
A typo in commit 107e6a3c22 causes the
FMA4 code path to be taken on systems that support FMA, even if they do
not support FMA4. Fix this to detect FMA4.
Support usable check for all CPU features with the following changes:
1. Change struct cpu_features to
struct cpuid_features
{
struct cpuid_registers cpuid;
struct cpuid_registers usable;
};
struct cpu_features
{
struct cpu_features_basic basic;
struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
unsigned int preferred[PREFERRED_FEATURE_INDEX_MAX];
...
};
so that there is a usable bit for each cpuid bit.
2. After the cpuid bits have been initialized, copy the known bits to the
usable bits. EAX/EBX from INDEX_1 and EAX from INDEX_7 aren't used for
CPU feature detection.
3. Clear the usable bits which require OS support.
4. If the feature is supported by OS, copy its cpuid bit to its usable
bit.
5. Replace HAS_CPU_FEATURE and CPU_FEATURES_CPU_P with CPU_FEATURE_USABLE
and CPU_FEATURE_USABLE_P to check if a feature is usable.
6. Add DEPR_FPU_CS_DS for INDEX_7_EBX_13.
7. Unset MPX feature since it has been deprecated.
The results are
1. If the feature is known and doesn't requre OS support, its usable bit
is copied from the cpuid bit.
2. Otherwise, its usable bit is copied from the cpuid bit only if the
feature is known to supported by OS.
3. CPU_FEATURE_USABLE/CPU_FEATURE_USABLE_P are used to check if the
feature can be used.
4. HAS_CPU_FEATURE/CPU_FEATURE_CPU_P are used to check if CPU supports
the feature.
This patch adds a new macro, libm_alias_finite, to define all _finite
symbol. It sets all _finite symbol as compat symbol based on its first
version (obtained from the definition at built generated first-versions.h).
The <fn>f128_finite symbols were introduced in GLIBC 2.26 and so need
special treatment in code that is shared between long double and float128.
It is done by adding a list, similar to internal symbol redifinition,
on sysdeps/ieee754/float128/float128_private.h.
Alpha also needs some tricky changes to ensure we still emit 2 compat
symbols for sqrt(f).
Passes buildmanyglibc.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
This patch replaces s_trunc.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 and sparc64 files.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch replaces s_ceil.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 and sparc64 files.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch replaces s_floor.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 and sparc64 files.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch replaces s_rint.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 file.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch replaces s_nearbyint.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 file.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove _finite tests and references from x86_64. Rather than calling
__exp_finite, use exp directly (since it's the same entry point).
x86_64 builds and passes testsuite.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The current s_sincosf.c is faster than s_sincosf-sse2.S. On Broadwell
with FMA disabled, bench-sincosf shows:
Before After Improvement
max 154.032 114.517 34%
min 6.25 5.609 11%
mean 14.8728 12.8589 15%
* sysdeps/x86_64/fpu/s_sincosf.S: Removed.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-sse2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-sse2.c: New file.
Add <sincosf_poly.h> and include it in s_sincosf.h to allow vectorized
sincosf_poly. Add x86 sincosf_poly.h to vectorize sincosf_poly. On
Broadwell, bench-sincosf shows:
Before After Improvement
max 160.273 114.198 40%
min 6.25 5.625 11%
mean 13.0325 10.6462 22%
Vectorized sincosf_poly shows
Before After Improvement
max 138.653 114.198 21%
min 5.004 5.625 -11%
mean 11.5934 10.6462 9%
Tested on x86-64 and i686 as well as with build-many-glibcs.py.
* sysdeps/ieee754/flt-32/s_sincosf.h: Include <sincosf_poly.h>.
(sincos_t, sincosf_poly, sinf_poly): Moved to ...
* sysdeps/ieee754/flt-32/sincosf_poly.h: Here. New file.
* sysdeps/x86/fpu/s_sincosf_data.c: New file.
* sysdeps/x86/fpu/sincosf_poly.h: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c: Just include
<sysdeps/ieee754/flt-32/s_sincosf.c>.
Introduce new pow symbol version that doesn't do SVID compatible error
handling. The standard errno and fp exception based error handling is
inline in the new code and does not have significant overhead.
The wrapper is disabled for sysdeps/ieee754/dbl-64 by using empty
w_pow.c and enabled for targets with their own pow implementation or
ifunc dispatch on __ieee754_pow by including math/w_pow.c.
The compatibility symbol version still uses the wrapper with SVID error
handling around the new code. There is no new symbol version nor
compatibility code on !LIBM_SVID_COMPAT targets (e.g. riscv).
On targets where previously powl was an alias of pow, now it points to
the compatibility symbol with the wrapper, because it still need the
SVID compatible error handling. This affects NO_LONG_DOUBLE (e.g. arm)
and LONG_DOUBLE_COMPAT (e.g. alpha) targets as well.
The __pow_finite symbol is now an alias of pow. Both __pow_finite and
pow set errno and thus not const functions.
The ia64 asm is changed so the compat and new symbol versions map to the
same address.
On x86_64 #include <math.h> was added before macro definitions that
may affect that header.
Tested with build-many-glibcs.py.
* math/Versions (GLIBC_2.29): Add pow.
* math/w_pow_compat.c (__pow_compat): Change to versioned compat
symbol.
* math/w_pow.c: New file.
* sysdeps/i386/fpu/w_pow.c: New file.
* sysdeps/ia64/fpu/e_pow.S: Add versioned symbols.
* sysdeps/ieee754/dbl-64/e_pow.c (__ieee754_pow): Rename to __pow
and add necessary aliases.
* sysdeps/ieee754/dbl-64/w_pow.c: New file.
* sysdeps/m68k/m680x0/fpu/w_pow.c: New file.
* sysdeps/mach/hurd/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/aarch64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/alpha/libm.abilist: Update.
* sysdeps/unix/sysv/linux/arm/libm.abilist: Update.
* sysdeps/unix/sysv/linux/hppa/libm.abilist: Update.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/ia64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Update.
* sysdeps/unix/sysv/linux/microblaze/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/nios2/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sh/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Update.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma.c (__ieee754_pow): Rename to
__pow.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma4.c (__ieee754_pow): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_pow.c (__ieee754_pow): Likewise.
* sysdeps/x86_64/fpu/multiarch/w_pow.c: New file.
Introduce new log symbol version that doesn't do SVID compatible error
handling. The standard errno and fp exception based error handling is
inline in the new code and does not have significant overhead.
The wrapper is disabled for sysdeps/ieee754/dbl-64 by using empty
w_log.c and enabled for targets with their own log implementation by
including math/w_log.c.
The compatibility symbol version still uses the wrapper with SVID error
handling around the new code. There is no new symbol version nor
compatibility code on !LIBM_SVID_COMPAT targets (e.g. riscv).
On targets where previously logl was an alias of log, now it points to
the compatibility symbol with the wrapper, because it still need the
SVID compatible error handling. This affects NO_LONG_DOUBLE (e.g. arm)
and LONG_DOUBLE_COMPAT (e.g. alpha) targets as well.
The __log_finite symbol is now an alias of log. Both __log_finite and
log set errno and thus not const functions.
The ia64 asm is changed so the compat and new symbol versions map to the
same address.
On x86_64 #include <math.h> was added before macro definitions that may
affect that header.
Tested with build-many-glibcs.py.
* math/Versions (GLIBC_2.29): Add log.
* math/w_log_compat.c (__log_compat): Change to versioned compat
symbol.
* math/w_log.c: New file.
* sysdeps/i386/fpu/w_log.c: New file.
* sysdeps/ia64/fpu/e_log.S: Update.
* sysdeps/ieee754/dbl-64/e_log.c (__ieee754_log): Rename to __log
and add necessary aliases.
* sysdeps/ieee754/dbl-64/w_log.c: New file.
* sysdeps/m68k/m680x0/fpu/w_log.c: New file.
* sysdeps/mach/hurd/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/aarch64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/alpha/libm.abilist: Update.
* sysdeps/unix/sysv/linux/arm/libm.abilist: Update.
* sysdeps/unix/sysv/linux/hppa/libm.abilist: Update.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/ia64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Update.
* sysdeps/unix/sysv/linux/microblaze/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/nios2/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sh/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Update.
* sysdeps/x86_64/fpu/multiarch/e_log-avx.c (__ieee754_log): Rename to
__log.
* sysdeps/x86_64/fpu/multiarch/e_log-fma.c (__ieee754_log): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_log-fma4.c (__ieee754_log): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_log.c (__ieee754_log): Likewise.
* sysdeps/x86_64/fpu/multiarch/w_log.c: New file.
Introduce new exp and exp2 symbol version that don't do SVID compatible
error handling. The standard errno and fp exception based error handling
is inline in the new code and does not have significant overhead.
The double precision wrappers are disabled for sysdeps/ieee754/dbl-64
by using empty w_exp.c and w_exp2.c files, the math/w_exp.c and
math/w_exp2.c files use the wrapper template and can be included by
targets that have their own exp and exp2 implementations or use ifunc
on the glibc internal __ieee754_exp symbol.
The compatibility symbol versions still use the wrapper with SVID error
handling around the new code. There is no new symbol version nor
compatibility code on !LIBM_SVID_COMPAT targets (e.g. riscv).
On targets where previously expl and exp2l were aliases of exp and exp2,
now they point to the compatibility symbols with the wrapper, because
they still need the SVID compatible error handling. This affects
NO_LONG_DOUBLE (e.g arm) and LONG_DOUBLE_COMPAT (e.g. alpha) targets
as well.
The _finite symbols are now aliases of the standard symbols (they have
no performance advantage anymore). Both the standard symbols and
_finite symbols set errno and thus not const functions.
The ia64 asm is changed so the compat and new symbol versions map to the
same address.
On x86_64 #include <math.h> was added before macro definitions that may
affect that header (the new macro name is __exp instead of __ieee754_exp
which breaks some math.h macros).
Tested with build-many-glibcs.py.
* math/Versions (GLIBC_2.29): Add exp and exp2.
* math/w_exp2_compat.c (__exp2_compat): Change to versioned compat
symbol, handle NO_LONG_DOUBLE and LONG_DOUBLE_COMPAT explicitly.
* math/w_exp_compat.c (__exp_compat): Likewise.
* math/w_exp.c: New file.
* math/w_exp2.c: New file.
* sysdeps/i386/fpu/w_exp.c: New file.
* sysdeps/i386/fpu/w_exp2.c: New file.
* sysdeps/ia64/fpu/e_exp.S: Add versioned symbols.
* sysdeps/ia64/fpu/e_exp2.S: Likewise.
* sysdeps/ieee754/dbl-64/e_exp.c (__ieee754_exp): Rename to __exp
and add necessary aliases.
* sysdeps/ieee754/dbl-64/e_exp2.c (__ieee754_exp2): Rename to __exp2
and add necessary aliases.
* sysdeps/ieee754/dbl-64/w_exp.c: New file.
* sysdeps/ieee754/dbl-64/w_exp2.c: New file.
* sysdeps/m68k/m680x0/fpu/w_exp.c: New file.
* sysdeps/m68k/m680x0/fpu/w_exp2.c: New file.
* sysdeps/mach/hurd/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/aarch64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/alpha/libm.abilist: Update.
* sysdeps/unix/sysv/linux/arm/libm.abilist: Update.
* sysdeps/unix/sysv/linux/hppa/libm.abilist: Update.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/ia64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Update.
* sysdeps/unix/sysv/linux/microblaze/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/nios2/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sh/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Update.
* sysdeps/x86_64/fpu/multiarch/e_exp-avx.c (__exp1): Remove.
(__ieee754_exp): Rename to __exp.
* sysdeps/x86_64/fpu/multiarch/e_exp-fma.c (__exp1): Remove.
(__ieee754_exp): Rename to __exp.
* sysdeps/x86_64/fpu/multiarch/e_exp-fma4.c (__exp1): Remove.
(__ieee754_exp): Rename to __exp.
* sysdeps/x86_64/fpu/multiarch/e_exp.c (__ieee754_exp): Rename to
__exp.
* sysdeps/x86_64/fpu/multiarch/w_exp.c: New file.
The algorithm is exp(y * log(x)), where log(x) is computed with about
1.3*2^-68 relative error (1.5*2^-68 without fma), returning the result
in two doubles, and the exp part uses the same algorithm (and lookup
tables) as exp, but takes the input as two doubles and a sign (to handle
negative bases with odd integer exponent). The __exp1 internal symbol
is no longer necessary.
There is separate code path when fma is not available but the worst case
error is about 0.54 ULP in both cases. The lookup table and consts for
log are 4168 bytes. The .rodata+.text is decreased by 37908 bytes on
aarch64. The non-nearest rounding error is less than 1 ULP.
Improvements on Cortex-A72 compared to current glibc master:
pow thruput: 2.40x in [0.01 11.1]x[0.01 11.1]
pow latency: 1.84x in [0.01 11.1]x[0.01 11.1]
Tested on
aarch64-linux-gnu (defined __FP_FAST_FMA, TOINT_INTRINSICS) and
arm-linux-gnueabihf (!defined __FP_FAST_FMA, !TOINT_INTRINSICS) and
x86_64-linux-gnu (!defined __FP_FAST_FMA, !TOINT_INTRINSICS) and
powerpc64le-linux-gnu (defined __FP_FAST_FMA, !TOINT_INTRINSICS) targets.
* NEWS: Mention pow improvements.
* math/Makefile (type-double-routines): Add e_pow_log_data.
* sysdeps/generic/math_private.h (__exp1): Remove.
* sysdeps/i386/fpu/e_pow_log_data.c: New file.
* sysdeps/ia64/fpu/e_pow_log_data.c: New file.
* sysdeps/ieee754/dbl-64/Makefile (CFLAGS-e_pow.c): Allow fma
contraction.
* sysdeps/ieee754/dbl-64/e_exp.c (__exp1): Remove.
(exp_inline): Remove.
(__ieee754_exp): Only single double input is handled.
* sysdeps/ieee754/dbl-64/e_pow.c: Rewrite.
* sysdeps/ieee754/dbl-64/e_pow_log_data.c: New file.
* sysdeps/ieee754/dbl-64/math_config.h (issignaling_inline): Define.
(__pow_log_data): Define.
* sysdeps/ieee754/dbl-64/upow.h: Remove.
* sysdeps/ieee754/dbl-64/upow.tbl: Remove.
* sysdeps/m68k/m680x0/fpu/e_pow_log_data.c: New file.
* sysdeps/x86_64/fpu/multiarch/Makefile (CFLAGS-e_pow-fma.c): Allow fma
contraction.
(CFLAGS-e_pow-fma4.c): Likewise.
Continuing the move to use, within libm, public names for libm
functions that can be inlined as built-in functions on many
architectures, this patch moves calls to __rint functions to call the
corresponding rint names instead, with asm redirection to __rint when
the calls are not inlined. The x86_64 math_private.h is removed as no
longer useful after this patch.
This patch is relative to a tree with my floor patch
<https://sourceware.org/ml/libc-alpha/2018-09/msg00148.html> applied,
and much the same considerations arise regarding possibly replacing an
IFUNC call with a direct inline expansion.
Tested for x86_64, and with build-many-glibcs.py.
* include/math.h [!_ISOMAC && !(__FINITE_MATH_ONLY__ &&
__FINITE_MATH_ONLY__ > 0) && !NO_MATH_REDIRECT] (rint): Redirect
using MATH_REDIRECT.
* sysdeps/aarch64/fpu/s_rint.c: Define NO_MATH_REDIRECT before
header inclusion.
* sysdeps/aarch64/fpu/s_rintf.c: Likewise.
* sysdeps/alpha/fpu/s_rint.c: Likewise.
* sysdeps/alpha/fpu/s_rintf.c: Likewise.
* sysdeps/i386/fpu/s_rintl.c: Likewise.
* sysdeps/ieee754/dbl-64/s_rint.c: Likewise.
* sysdeps/ieee754/dbl-64/wordsize-64/s_rint.c: Likewise.
* sysdeps/ieee754/float128/s_rintf128.c: Likewise.
* sysdeps/ieee754/flt-32/s_rintf.c: Likewise.
* sysdeps/ieee754/ldbl-128/s_rintl.c: Likewise.
* sysdeps/ieee754/ldbl-128ibm/s_rintl.c: Likewise.
* sysdeps/m68k/coldfire/fpu/s_rint.c: Likewise.
* sysdeps/m68k/coldfire/fpu/s_rintf.c: Likewise.
* sysdeps/m68k/m680x0/fpu/s_rint.c: Likewise.
* sysdeps/m68k/m680x0/fpu/s_rintf.c: Likewise.
* sysdeps/m68k/m680x0/fpu/s_rintl.c: Likewise.
* sysdeps/powerpc/fpu/s_rint.c: Likewise.
* sysdeps/powerpc/fpu/s_rintf.c: Likewise.
* sysdeps/riscv/rv64/rvd/s_rint.c: Likewise.
* sysdeps/riscv/rvf/s_rintf.c: Likewise.
* sysdeps/sparc/sparc32/sparcv9/fpu/multiarch/s_rint.c: Likewise.
* sysdeps/sparc/sparc32/sparcv9/fpu/multiarch/s_rintf.c: Likewise.
* sysdeps/sparc/sparc64/fpu/multiarch/s_rint.c: Likewise.
* sysdeps/sparc/sparc64/fpu/multiarch/s_rintf.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_rint.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_rintf.c: Likewise.
* sysdeps/x86_64/fpu/math_private.h: Remove file.
* math/e_scalb.c (invalid_fn): Use rint functions instead of
__rint variants.
* math/e_scalbf.c (invalid_fn): Likewise.
* math/e_scalbl.c (invalid_fn): Likewise.
* sysdeps/ieee754/dbl-64/e_gamma_r.c (__ieee754_gamma_r):
Likewise.
* sysdeps/ieee754/flt-32/e_gammaf_r.c (__ieee754_gammaf_r):
Likewise.
* sysdeps/ieee754/k_standard.c (__kernel_standard): Likewise.
* sysdeps/ieee754/k_standardl.c (__kernel_standard_l): Likewise.
* sysdeps/ieee754/ldbl-128/e_gammal_r.c (__ieee754_gammal_r):
Likewise.
* sysdeps/ieee754/ldbl-128ibm/e_gammal_r.c (__ieee754_gammal_r):
Likewise.
* sysdeps/ieee754/ldbl-96/e_gammal_r.c (__ieee754_gammal_r):
Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_llrint.c (__llrint): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_llrintf.c (__llrintf): Likewise.
Similar to the changes that were made to call sqrt functions directly
in glibc, instead of __ieee754_sqrt variants, so that the compiler
could inline them automatically without needing special inline
definitions in lots of math_private.h headers, this patch makes libm
code call floor functions directly instead of __floor variants,
removing the inlines / macros for x86_64 (SSE4.1) and powerpc
(POWER5).
The redirection used to ensure that __ieee754_sqrt does still get
called when the compiler doesn't inline a built-in function expansion
is refactored so it can be applied to other functions; the refactoring
is arranged so it's not limited to unary functions either (it would be
reasonable to use this mechanism for copysign - removing the inline in
math_private_calls.h but also eliminating unnecessary local PLT entry
use in the cases (powerpc soft-float and e500v1, for IBM long double)
where copysign calls don't get inlined).
The point of this change is that more architectures can get floor
calls inlined where they weren't previously (AArch64, for example),
without needing special inline definitions in their math_private.h,
and existing such definitions in math_private.h headers can be
removed.
Note that it's possible that in some cases an inline may be used where
an IFUNC call was previously used - this is the case on x86_64, for
example. I think the direct calls to floor are still appropriate; if
there's any significant performance cost from inline SSE2 floor
instead of an IFUNC call ending up with SSE4.1 floor, that indicates
that either the function should be doing something else that's faster
than using floor at all, or it should itself have IFUNC variants, or
that the compiler choice of inlining for generic tuning should change
to allow for the possibility that, by not inlining, an SSE4.1 IFUNC
might be called at runtime - but not that glibc should avoid calling
floor internally. (After all, all the same considerations would apply
to any user program calling floor, where it might either be inlined or
left as an out-of-line call allowing for a possible IFUNC.)
Tested for x86_64, and with build-many-glibcs.py.
* include/math.h [!_ISOMAC && !(__FINITE_MATH_ONLY__ &&
__FINITE_MATH_ONLY__ > 0) && !NO_MATH_REDIRECT] (MATH_REDIRECT):
New macro.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (MATH_REDIRECT_LDBL): Likewise.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (MATH_REDIRECT_F128): Likewise.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (MATH_REDIRECT_UNARY_ARGS): Likewise.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (sqrt): Redirect using MATH_REDIRECT.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (floor): Likewise.
* sysdeps/aarch64/fpu/s_floor.c: Define NO_MATH_REDIRECT before
header inclusion.
* sysdeps/aarch64/fpu/s_floorf.c: Likewise.
* sysdeps/ieee754/dbl-64/s_floor.c: Likewise.
* sysdeps/ieee754/dbl-64/wordsize-64/s_floor.c: Likewise.
* sysdeps/ieee754/float128/s_floorf128.c: Likewise.
* sysdeps/ieee754/flt-32/s_floorf.c: Likewise.
* sysdeps/ieee754/ldbl-128/s_floorl.c: Likewise.
* sysdeps/ieee754/ldbl-128ibm/s_floorl.c: Likewise.
* sysdeps/m68k/m680x0/fpu/s_floor_template.c: Likewise.
* sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_floor.c: Likewise.
* sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_floorf.c: Likewise.
* sysdeps/powerpc/powerpc64/fpu/multiarch/s_floor.c: Likewise.
* sysdeps/powerpc/powerpc64/fpu/multiarch/s_floorf.c: Likewise.
* sysdeps/riscv/rv64/rvd/s_floor.c: Likewise.
* sysdeps/riscv/rvf/s_floorf.c: Likewise.
* sysdeps/sparc/sparc64/fpu/multiarch/s_floor.c: Likewise.
* sysdeps/sparc/sparc64/fpu/multiarch/s_floorf.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_floor.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_floorf.c: Likewise.
* sysdeps/powerpc/fpu/math_private.h [_ARCH_PWR5X] (__floor):
Remove macro.
[_ARCH_PWR5X] (__floorf): Likewise.
* sysdeps/x86_64/fpu/math_private.h [__SSE4_1__] (__floor): Remove
inline function.
[__SSE4_1__] (__floorf): Likewise.
* math/w_lgamma_main.c (LGFUNC (__lgamma)): Use floor functions
instead of __floor variants.
* math/w_lgamma_r_compat.c (__lgamma_r): Likewise.
* math/w_lgammaf_main.c (LGFUNC (__lgammaf)): Likewise.
* math/w_lgammaf_r_compat.c (__lgammaf_r): Likewise.
* math/w_lgammal_main.c (LGFUNC (__lgammal)): Likewise.
* math/w_lgammal_r_compat.c (__lgammal_r): Likewise.
* math/w_tgamma_compat.c (__tgamma): Likewise.
* math/w_tgamma_template.c (M_DECL_FUNC (__tgamma)): Likewise.
* math/w_tgammaf_compat.c (__tgammaf): Likewise.
* math/w_tgammal_compat.c (__tgammal): Likewise.
* sysdeps/ieee754/dbl-64/e_lgamma_r.c (sin_pi): Likewise.
* sysdeps/ieee754/dbl-64/k_rem_pio2.c (__kernel_rem_pio2):
Likewise.
* sysdeps/ieee754/dbl-64/lgamma_neg.c (__lgamma_neg): Likewise.
* sysdeps/ieee754/flt-32/e_lgammaf_r.c (sin_pif): Likewise.
* sysdeps/ieee754/flt-32/lgamma_negf.c (__lgamma_negf): Likewise.
* sysdeps/ieee754/ldbl-128/e_lgammal_r.c (__ieee754_lgammal_r):
Likewise.
* sysdeps/ieee754/ldbl-128/e_powl.c (__ieee754_powl): Likewise.
* sysdeps/ieee754/ldbl-128/lgamma_negl.c (__lgamma_negl):
Likewise.
* sysdeps/ieee754/ldbl-128/s_expm1l.c (__expm1l): Likewise.
* sysdeps/ieee754/ldbl-128ibm/e_lgammal_r.c (__ieee754_lgammal_r):
Likewise.
* sysdeps/ieee754/ldbl-128ibm/e_powl.c (__ieee754_powl): Likewise.
* sysdeps/ieee754/ldbl-128ibm/lgamma_negl.c (__lgamma_negl):
Likewise.
* sysdeps/ieee754/ldbl-128ibm/s_expm1l.c (__expm1l): Likewise.
* sysdeps/ieee754/ldbl-128ibm/s_truncl.c (__truncl): Likewise.
* sysdeps/ieee754/ldbl-96/e_lgammal_r.c (sin_pi): Likewise.
* sysdeps/ieee754/ldbl-96/lgamma_negl.c (__lgamma_negl): Likewise.
* sysdeps/powerpc/power5+/fpu/s_modf.c (__modf): Likewise.
* sysdeps/powerpc/power5+/fpu/s_modff.c (__modff): Likewise.
The second patch improves performance of sinf and cosf using the same
algorithms and polynomials. The returned values are identical to sincosf
for the same input. ULP definitions for AArch64 and x64 are updated.
sinf/cosf througput gains on Cortex-A72:
* |x| < 0x1p-12 : 1.2x
* |x| < M_PI_4 : 1.8x
* |x| < 2 * M_PI: 1.7x
* |x| < 120.0 : 2.3x
* |x| < Inf : 3.0x
* NEWS: Mention sinf, cosf, sincosf.
* sysdeps/aarch64/libm-test-ulps: Update ULP for sinf, cosf, sincosf.
* sysdeps/x86_64/fpu/libm-test-ulps: Update ULP for sinf and cosf.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c: Add definitions of
constants rather than including generic sincosf.h.
* sysdeps/x86_64/fpu/s_sincosf_data.c: Remove.
* sysdeps/ieee754/flt-32/s_cosf.c (cosf): Rewrite.
* sysdeps/ieee754/flt-32/s_sincosf.h (reduced_sin): Remove.
(reduced_cos): Remove.
(sinf_poly): New function.
* sysdeps/ieee754/flt-32/s_sinf.c (sinf): Rewrite.
There is no need to include <init-arch.h> in assembly codes since all
x86 IFUNC selector functions are written in C. Tested on i686 and
x86-64. There is no code change in libc.so, ld.so and libmvec.so.
* sysdeps/i386/i686/multiarch/bzero-ia32.S: Don't include
<init-arch.h>.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core-avx2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
Remove the slow paths from pow. Like several other double precision math
functions, pow is exactly rounded. This is not required from math functions
and causes major overheads as it requires multiple fallbacks using higher
precision arithmetic if a result is close to 0.5ULP. Ridiculous slowdowns
of up to 100000x have been reported when the highest precision path triggers.
All GLIBC math tests pass on AArch64 and x64 (with ULP of pow set to 1).
The worst case error is ~0.506ULP. A simple test over a few hundred million
values shows pow is 10% faster on average. This fixes BZ #13932.
[BZ #13932]
* sysdeps/ieee754/dbl-64/uexp.h (err_1): Remove.
* benchtests/pow-inputs: Update comment for slow path cases.
* manual/probes.texi (slowpow_p10): Delete removed probe.
(slowpow_p10): Likewise.
* math/Makefile: Remove halfulp.c and slowpow.c.
* sysdeps/aarch64/libm-test-ulps: Set ULP of pow to 1.
* sysdeps/generic/math_private.h (__exp1): Remove error argument.
(__halfulp): Remove.
(__slowpow): Remove.
* sysdeps/i386/fpu/halfulp.c: Delete file.
* sysdeps/i386/fpu/slowpow.c: Likewise.
* sysdeps/ia64/fpu/halfulp.c: Likewise.
* sysdeps/ia64/fpu/slowpow.c: Likewise.
* sysdeps/ieee754/dbl-64/e_exp.c (__exp1): Remove error argument,
improve comments and add error analysis.
* sysdeps/ieee754/dbl-64/e_pow.c (__ieee754_pow): Add error analysis.
(power1): Remove function:
(log1): Remove error argument, add error analysis.
(my_log2): Remove function.
* sysdeps/ieee754/dbl-64/halfulp.c: Delete file.
* sysdeps/ieee754/dbl-64/slowpow.c: Likewise.
* sysdeps/m68k/m680x0/fpu/halfulp.c: Likewise.
* sysdeps/m68k/m680x0/fpu/slowpow.c: Likewise.
* sysdeps/powerpc/power4/fpu/Makefile: Remove CPPFLAGS-slowpow.c.
* sysdeps/x86_64/fpu/libm-test-ulps: Set ULP of pow to 1.
* sysdeps/x86_64/fpu/multiarch/Makefile: Remove slowpow-fma.c,
slowpow-fma4.c, halfulp-fma.c, halfulp-fma4.c.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma.c (__slowpow): Remove define.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma4.c (__slowpow): Likewise.
* sysdeps/x86_64/fpu/multiarch/halfulp-fma.c: Delete file.
* sysdeps/x86_64/fpu/multiarch/halfulp-fma4.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowpow-fma.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowpow-fma4.c: Likewise.
Since the x86-64 assembly version of sincosf is higly optimized with
vector instructions, there isn't much room for improvement. However
s_sincosf.c written in C with vector math and intrinsics can be
optimized by GCC with FMA.
On Skylake, bench-sincosf reports performance improvement:
Assembly FMA improvement
max 104.042 101.008 3%
min 9.426 8.586 10%
mean 20.6209 18.2238 13%
* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
Add s_sincosf-sse2 and s_sincosf-fma.
(CFLAGS-s_sincosf-fma.c): New.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c: New file.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-sse2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_sincosf.c: Likewise.
* sysdeps/x86_64/fpu/s_sincosf.S: Don't add alias if
__sincosf is defined.
These changes will be active for all platforms that don't provide
their own exp() routines. They will also be active for ieee754
versions of ccos, ccosh, cosh, csin, csinh, sinh, exp10, gamma, and
erf.
Typical performance gains is typically around 5x when measured on
Sparc s7 for common values between exp(1) and exp(40).
Using the glibc perf tests on sparc,
sparc (nsec) x86 (nsec)
old new old new
max 17629 395 5173 144
min 399 54 15 13
mean 5317 200 1349 23
The extreme max times for the old (ieee754) exp are due to the
multiprecision computation in the old algorithm when the true value is
very near 0.5 ulp away from an value representable in double
precision. The new algorithm does not take special measures for those
cases. The current glibc exp perf tests overrepresent those values.
Informal testing suggests approximately one in 200 cases might
invoke the high cost computation. The performance advantage of the new
algorithm for other values is still large but not as large as indicated
by the chart above.
Glibc correctness tests for exp() and expf() were run. Within the
test suite 3 input values were found to cause 1 bit differences (ulp)
when "FE_TONEAREST" rounding mode is set. No differences in exp() were
seen for the tested values for the other rounding modes.
Typical example:
exp(-0x1.760cd2p+0) (-1.46113312244415283203125)
new code: 2.31973271630014299393707e-01 0x1.db14cd799387ap-3
old code: 2.31973271630014271638132e-01 0x1.db14cd7993879p-3
exp = 2.31973271630014285508337 (high precision)
Old delta: off by 0.49 ulp
New delta: off by 0.51 ulp
In addition, because ieee754_exp() is used by other routines, cexp()
showed test results with very small imaginary input values where the
imaginary portion of the result was off by 3 ulp when in upward
rounding mode, but not in the other rounding modes. For x86, tgamma
showed a few values where the ulp increased to 6 (max ulp for tgamma
is 5). Sparc tgamma did not show these failures. I presume the tgamma
differences are due to compiler optimization differences within the
gamma function.The gamma function is known to be difficult to compute
accurately.
* sysdeps/ieee754/dbl-64/e_exp.c: Include <math-svid-compat.h> and
<errno.h>. Include "eexp.tbl".
(half): New constant.
(one): Likewise.
(__ieee754_exp): Rewrite.
(__slowexp): Remove prototype.
* sysdeps/ieee754/dbl-64/eexp.tbl: New file.
* sysdeps/ieee754/dbl-64/slowexp.c: Remove file.
* sysdeps/i386/fpu/slowexp.c: Likewise.
* sysdeps/ia64/fpu/slowexp.c: Likewise.
* sysdeps/m68k/m680x0/fpu/slowexp.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowexp-avx.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowexp-fma.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowexp-fma4.c: Likewise.
* sysdeps/generic/math_private.h (__slowexp): Remove prototype.
* sysdeps/ieee754/dbl-64/e_pow.c: Remove mention of slowexp.c in
comment.
* sysdeps/powerpc/power4/fpu/Makefile [$(subdir) = math]
(CPPFLAGS-slowexp.c): Remove variable.
* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
Remove slowexp-fma, slowexp-fma4 and slowexp-avx.
(CFLAGS-slowexp-fma.c): Remove variable.
(CFLAGS-slowexp-fma4.c): Likewise.
(CFLAGS-slowexp-avx.c): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_exp-avx.c (__slowexp): Do not
define as macro.
* sysdeps/x86_64/fpu/multiarch/e_exp-fma.c (__slowexp): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_exp-fma4.c (__slowexp): Likewise.
* math/Makefile (type-double-routines): Remove slowexp.
* manual/probes.texi (slowexp_p6): Remove.
(slowexp_p32): Likewise.
On Skylake, bench-sinf reports performance improvement:
Before After Improvement
max 153.996 100.094 54%
min 8.546 6.852 25%
mean 18.1223 11.802 54%
* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
Add s_sinf-sse2 and s_sinf-fma.
(CFLAGS-s_sinf-fma.c): New.
* sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
* sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
Continuing the preparation for additional _FloatN / _FloatNx function
aliases, this patch makes x86_64 libm function implementations use
libm_alias_float to define function aliases, or libm_alias_float_other
where the main name is defined with versioned_symbol.
Tested with the glibc testsuite for x86_64, and tested with
build-many-glibcs.py for all its x86_64 configurations that installed
stripped shared libraries are unchanged by the patch.
* sysdeps/x86_64/fpu/multiarch/e_exp2f.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/e_expf.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/e_log2f.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/e_logf.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/e_powf.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/s_ceilf.c: Include
<libm-alias-float.h>.
(ceilf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_floorf.c: Include
<libm-alias-float.h>.
(floorf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_fmaf.c: Include
<libm-alias-float.h>.
(fmaf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_nearbyintf.c: Include
<libm-alias-float.h>.
(nearbyintf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_rintf.c: Include
<libm-alias-float.h>.
(rintf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_truncf.c: Include
<libm-alias-float.h>.
(truncf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_copysignf.S: Include <libm-alias-float.h>.
(copysignf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_cosf.S: Include <libm-alias-float.h>.
(cosf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_fabsf.c: Include <libm-alias-float.h>.
(fabsf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_fmaxf.S: Include <libm-alias-float.h>.
(fmaxf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_fminf.S: Include <libm-alias-float.h>.
(fminf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_llrintf.S: Include <libm-alias-float.h>.
(llrintf): Define using libm_alias_float.
[!__ILP32__] (lrintf): Likewise.
* sysdeps/x86_64/fpu/s_sincosf.S: Include <libm-alias-float.h>.
(sincosf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_sinf.S: Include <libm-alias-float.h>.
(sinf): Define using libm_alias_float.
* sysdeps/x86_64/x32/fpu/s_lrintf.S: Include <libm-alias-float.h>.
(lrintf): Define using libm_alias_float.
Continuing the preparation for additional _FloatN / _FloatNx function
aliases, this patch makes x86_64 libm function implementations use
libm_alias_double to define function aliases.
Tested with the glibc testsuite for x86_64, and tested with
build-many-glibcs.py for all its x86_64 configurations that installed
stripped shared libraries are unchanged by the patch.
* sysdeps/x86_64/fpu/multiarch/s_atan.c: Include
<libm-alias-double.h>.
(atan): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_ceil.c: Include
<libm-alias-double.h>.
(ceil): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_floor.c: Include
<libm-alias-double.h>.
(floor): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_fma.c: Include
<libm-alias-double.h>.
(fma): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_nearbyint.c: Include
<libm-alias-double.h>.
(nearbyint): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_rint.c: Include
<libm-alias-double.h>.
(rint): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_sin.c: Include
<libm-alias-double.h>.
(sin): Define using libm_alias_double.
(cos): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_tan.c: Include
<libm-alias-double.h>.
(tan): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_trunc.c: Include
<libm-alias-double.h>.
(trunc): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_copysign.S: Include <libm-alias-double.h>.
(copysign): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_fabs.c: Include <libm-alias-double.h>.
(fabs): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_fmax.S: Include <libm-alias-double.h>.
(fmax): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_fmin.S: Include <libm-alias-double.h>.
(fmin): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_llrint.S: Include <libm-alias-double.h>.
(llrint): Define using libm_alias_double.
[!__ILP32__] (lrint): Likewise.
* sysdeps/x86_64/x32/fpu/s_lrint.S: Include <libm-alias-double.h>.
(lrint): Define using libm_alias_double.
This patch converts the dbl-64 implementations of atan and tan into
weak aliases of __atan and __tan, in preparation for making them use
libm_alias_double. Consequent changes are made to the x86_64
multiarch versions wrapping round them (with the dbl-64 functions,
like other such functions, being made not to define their aliases at
all if __atan or __tan are defined as macros by an including file).
Tested for x86_64, and with build-many-glibcs.py.
* sysdeps/ieee754/dbl-64/s_atan.c (atan): Rename to __atan and
define as weak alias of __atan. Do not define any aliases if
[__atan].
[NO_LONG_DOUBLE] (__atanl): Define as strong alias of __atan.
[NO_LONG_DOUBLE] (atanl): Define as weak alias of __atanl.
* sysdeps/ieee754/dbl-64/s_tan.c (tan): Rename to __tan and define
as weak alias of __tan. Do not define any aliases if [__tan].
[NO_LONG_DOUBLE] (__tanl): Define as strong alias of __tan.
[NO_LONG_DOUBLE] (tanl): Define as weak alias of __tanl.
* sysdeps/x86_64/fpu/multiarch/s_atan-avx.c (atan): Rename to
__atan.
* sysdeps/x86_64/fpu/multiarch/s_atan-fma.c (atan): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_atan-fma4.c (atan): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_atan.c (atan): Rename to __atan
and define as weak alias of __atan.
* sysdeps/x86_64/fpu/multiarch/s_tan-avx.c (tan): Rename to
__atan.
* sysdeps/x86_64/fpu/multiarch/s_tan-fma.c (tan): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_tan-fma4.c (tan): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_tan.c (tan): Rename to __tan and
define as weak alias of __tan.
This patch adds SSE4.1 versions of trunc and truncf, using the roundsd
/ roundss instructions, similar to the versions of ceil, floor, rint
and nearbyint functions we already have. In my testing with the glibc
benchtests these are about 30% faster than the C versions for double,
20% faster for float.
Tested for x86_64.
[BZ #20142]
* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
Add s_trunc-c, s_truncf-c, s_trunc-sse4_1 and s_truncf-sse4_1.
* sysdeps/x86_64/fpu/multiarch/s_trunc-c.c: New file.
* sysdeps/x86_64/fpu/multiarch/s_trunc-sse4_1.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_trunc.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_truncf-c.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_truncf-sse4_1.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_truncf.c: Likewise.
AVX512 functions in mathvec are used on machines with AVX512. An AVX2
wrapper is also provided and it can be used when the AVX512 version
isn't profitable. MathVec_Prefer_No_AVX512 is addded to cpu-features.
If glibc.tune.hwcaps=MathVec_Prefer_No_AVX512 is set in GLIBC_TUNABLES
environment variable, the AVX2 wrapper will be used.
Tested on x86-64 machines with and without AVX512. Also verified
glibc.tune.hwcaps=MathVec_Prefer_No_AVX512 on AVX512 machine.
[BZ #21967]
* sysdeps/x86/cpu-features.h (bit_arch_MathVec_Prefer_No_AVX512):
New.
(index_arch_MathVec_Prefer_No_AVX512): Likewise.
* sysdeps/x86/cpu-tunables.c (TUNABLE_CALLBACK (set_hwcaps)):
Handle MathVec_Prefer_No_AVX512.
* sysdeps/x86_64/fpu/multiarch/ifunc-mathvec-avx512.h
(IFUNC_SELECTOR): Return AVX2 version if MathVec_Prefer_No_AVX512
is set.
Since binutils 2.25 or later is required to build glibc, we can replace
AVX512F .byte sequences with AVX512F instructions.
Tested on x86-64 and x32. There are no code differences in libmvec.so
and libmvec.a.
* sysdeps/x86_64/fpu/svml_d_sincos8_core.S: Replace AVX512F
.byte sequences with AVX512F instructions.
* sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf16_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_wrapper_impl.h: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S:
Likewise.
Since the AVX2 version of mathvec functions uses FMA, it can only be
used when FMA is usable.
[BZ #21966]
* sysdeps/x86_64/fpu/multiarch/ifunc-mathvec-avx2.h
(IFUNC_SELECTOR): Don't use the AVX2 version if FMA isn't
usable.
If assembler doesn't support AVX512DQ, _dl_runtime_resolve_avx is used
to save the first 8 vector registers, which only saves the lower 256
bits of vector register, for lazy binding. When it is called on AVX512
platform, the upper 256 bits of ZMM registers are clobbered. Parameters
passed in ZMM registers will be wrong when the function is called the
first time. This patch requires binutils 2.24, whose assembler can store
and load ZMM registers, to build x86-64 glibc. Since mathvec library
needs assembler support for AVX512DQ, we disable mathvec if assembler
doesn't support AVX512DQ.
[BZ #20139]
* config.h.in (HAVE_AVX512_ASM_SUPPORT): Renamed to ...
(HAVE_AVX512DQ_ASM_SUPPORT): This.
* sysdeps/x86_64/configure.ac: Require assembler from binutils
2.24 or above.
(HAVE_AVX512_ASM_SUPPORT): Removed.
(HAVE_AVX512DQ_ASM_SUPPORT): New.
* sysdeps/x86_64/configure: Regenerated.
* sysdeps/x86_64/dl-trampoline.S: Make HAVE_AVX512_ASM_SUPPORT
check unconditional.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise.
* sysdeps/x86_64/multiarch/memcpy.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S:
Likewise.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
Likewise.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memset.S: Likewise.
* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Check
HAVE_AVX512DQ_ASM_SUPPORT instead of HAVE_AVX512_ASM_SUPPORT.
* sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_pow8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx51:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S:
Likewise.
current vector function declaration "#pragma omp declare simd notinbranch",
according to which vector sincos should have vector of pointers for second and
third parameters. It is fixed with implementation as wrapper to version
having second and third parameters as pointers.
[BZ #20024]
* sysdeps/x86/fpu/test-math-vector-sincos.h: New.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos2_core_sse4.S: Fixed ABI
of this implementation of vector function.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos4_core_avx2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf4_core_sse4.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf8_core_avx2.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos2_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos4_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos4_core_avx.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos8_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf16_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf4_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf8_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf8_core_avx.S: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c: Use another wrapper
for testing vector sincos with fixed ABI.
* sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx.c: New test.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx2.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx512.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx2.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx512.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf.c: Likewise.
* sysdeps/x86_64/fpu/Makefile: Added new tests.
Continuing fixes for ceil and floor functions not to raise the
"inexact" exception, this patch fixes the x86_64 SSE4.1 versions. The
roundss / roundsd instructions take an immediate operand that
determines the rounding mode and whether to raise "inexact"; this just
needs bit 3 set to disable "inexact", which this patch does.
Remark: we don't have an SSE4.1 version of trunc / truncf (using this
instruction with operand 11); I'd expect one to make sense, but of
course it should be benchmarked against the existing C code. I'll
file a bug in Bugzilla for the lack of such a version.
Tested for x86_64.
[BZ #15479]
* sysdeps/x86_64/fpu/multiarch/s_ceil.S (__ceil_sse41): Set bit 3
of immediate operand to rounding instruction.
* sysdeps/x86_64/fpu/multiarch/s_ceilf.S (__ceilf_sse41):
Likewise.
* sysdeps/x86_64/fpu/multiarch/s_floor.S (__floor_sse41):
Likewise.
* sysdeps/x86_64/fpu/multiarch/s_floorf.S (__floorf_sse41):
Likewise.
The x86_64 fma4 version of pow fails to disable contraction of
operations other than those explicitly intended to use fma
instructions, so resulting in large ulps errors on processors with
fma4 instructions, as in bug 18104 (165ulp for the test added for that
bug; error originally reported by "blaaa" on #glibc). This patch adds
$(config-cflags-nofma) for e_pow-fma4.c, corresponding to the use for
e_pow.c in sysdeps/ieee754/dbl-64/Makefile.
Tested for x86_64 on a processor with fma4.
[BZ #19003]
* sysdeps/x86_64/fpu/multiarch/Makefile (CFLAGS-e_pow-fma4.c): Add
$(config-cflags-nofma).
The change in 0b5395f052 replaced calls
to __get_cpu_features@plt followed by a mov from rax to rdx, with a
single macro LOAD_RTLD_GLOBAL_RO_RDX. It is pretty clear that there
was a typo in s_floorf and __nearbyint due to which the (now incorrect)
mov was not removed. This patch removes that mov.
* sysdeps/x86_64/fpu/multiarch/s_floorf.S (__floorf): Remove
unnecessary movq.
* sysdeps/x86_64/fpu/multiarch/s_nearbyint.S (__nearbyint):
Likewise.
Here is implementation of vectorized sin containing SSE, AVX,
AVX2 and AVX512 versions according to Vector ABI
<https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4>.
* bits/libm-simd-decl-stubs.h: Added stubs for sin.
* math/bits/mathcalls.h: Added sin declaration with __MATHCALL_VEC.
* sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New versions added.
* sysdeps/x86/fpu/bits/math-vector.h: SIMD declaration for sin.
* sysdeps/x86_64/fpu/Makefile (libmvec-support): Added new files.
* sysdeps/x86_64/fpu/Versions: New versions added.
* sysdeps/x86_64/fpu/libm-test-ulps: Regenerated.
* sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-sysdep_routines): Added
build of SSE, AVX2 and AVX512 IFUNC versions.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin2_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin2_core_sse4.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin4_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin4_core_avx2.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S: New file.
* sysdeps/x86_64/fpu/svml_d_sin2_core.S: New file.
* sysdeps/x86_64/fpu/svml_d_sin4_core.S: New file.
* sysdeps/x86_64/fpu/svml_d_sin4_core_avx.S: New file.
* sysdeps/x86_64/fpu/svml_d_sin8_core.S: New file.
* sysdeps/x86_64/fpu/svml_d_sin_data.S: New file.
* sysdeps/x86_64/fpu/svml_d_sin_data.h: New file.
* sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c: Added vector sin test.
* sysdeps/x86_64/fpu/test-double-vlen2.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen4-avx2.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen4.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen8.c: Likewise.
* NEWS: Mention addition of x86_64 vector sin.
Here is implementation of vectorized cosf containing SSE, AVX,
AVX2 and AVX512 versions according to Vector ABI
<https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4>.
* sysdeps/x86_64/fpu/Makefile (libmvec-support): Added new files.
* sysdeps/x86_64/fpu/Versions: New versions added.
* sysdeps/x86_64/fpu/svml_s_cosf4_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_s_cosf4_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_s_cosf4_core_sse4.S: New file.
* sysdeps/x86_64/fpu/svml_s_cosf8_core_avx.S: New file.
* sysdeps/x86_64/fpu/svml_s_cosf8_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_s_cosf8_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_s_cosf8_core_avx2.S: New file.
* sysdeps/x86_64/fpu/svml_s_cosf16_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S: New file.
* sysdeps/x86_64/fpu/svml_s_wrapper_impl.h: New file.
* sysdeps/x86_64/fpu/svml_s_cosf_data.S: New file.
* sysdeps/x86_64/fpu/svml_s_cosf_data.h: New file.
* sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-sysdep_routines): Added
build of SSE, AVX2 and AVX512 IFUNC versions.
* sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New versions added.
* sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cosf.
* NEWS: Mention addition of x86_64 vector cosf.
Here is implementation of cos containing SSE, AVX, AVX2 and AVX512
versions according to Vector ABI which had been discussed in
<https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4>.
Vector math library build and ABI testing enabled by default for x86_64.
* sysdeps/x86_64/fpu/Makefile: New file.
* sysdeps/x86_64/fpu/Versions: New file.
* sysdeps/x86_64/fpu/svml_d_cos_data.S: New file.
* sysdeps/x86_64/fpu/svml_d_cos_data.h: New file.
* sysdeps/x86_64/fpu/svml_d_cos2_core.S: New file.
* sysdeps/x86_64/fpu/svml_d_cos4_core.S: New file.
* sysdeps/x86_64/fpu/svml_d_cos4_core_avx.S: New file.
* sysdeps/x86_64/fpu/svml_d_cos8_core.S: New file.
* sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core_sse4.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core_avx2.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core.S: New file.
* sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: New file.
* sysdeps/x86_64/fpu/multiarch/Makefile (libmvec-sysdep_routines): Added
build of SSE, AVX2 and AVX512 IFUNC versions.
* sysdeps/x86/fpu/bits/math-vector.h: Added SIMD declaration for cos.
* math/bits/mathcalls.h: Added cos declaration with __MATHCALL_VEC.
* sysdeps/x86_64/configure.ac: Options for libmvec build.
* sysdeps/x86_64/configure: Regenerated.
* sysdeps/x86_64/sysdep.h (cfi_offset_rel_rsp): New macro.
* sysdeps/unix/sysv/linux/x86_64/libmvec.abilist: New file.
* manual/install.texi (Configuring and compiling): Document
--disable-mathvec.
* INSTALL: Regenerated.
* NEWS: Mention addition of libmvec and x86_64 vector cos.
Similar to various other bugs in this area, some asin implementations
do not raise the underflow exception for subnormal arguments, when the
result is tiny and inexact. This patch forces the exception in a
similar way to previous fixes.
Tested for x86_64, x86, powerpc and mips64.
[BZ #16351]
* sysdeps/i386/fpu/e_asin.S (dbl_min): New object.
(MO): New macro.
(__ieee754_asin): Force underflow exception for results with small
absolute value.
* sysdeps/i386/fpu/e_asinf.S (flt_min): New object.
(MO): New macro.
(__ieee754_asinf): Force underflow exception for results with
small absolute value.
* sysdeps/ieee754/dbl-64/e_asin.c: Include <float.h> and <math.h>.
(__ieee754_asin): Force underflow exception for results with small
absolute value.
* sysdeps/ieee754/flt-32/e_asinf.c: Include <float.h>.
(__ieee754_asinf): Force underflow exception for results with
small absolute value.
* sysdeps/ieee754/ldbl-128/e_asinl.c: Include <float.h>.
(__ieee754_asinl): Force underflow exception for results with
small absolute value.
* sysdeps/ieee754/ldbl-128ibm/e_asinl.c: Include <float.h>.
(__ieee754_asinl): Force underflow exception for results with
small absolute value.
* sysdeps/ieee754/ldbl-96/e_asinl.c: Include <float.h>.
(__ieee754_asinl): Force underflow exception for results with
small absolute value.
* sysdeps/x86_64/fpu/multiarch/e_asin.c [HAVE_FMA4_SUPPORT]:
Include <math.h>.
* math/auto-libm-test-in: Do not mark underflow exceptions as
possibly missing for bug 16351.
* math/auto-libm-test-out: Regenerated.
This patch fixes bug 15319, missing underflows from atan / atan2 when
the result of atan is very close to its small argument (or that of
atan2 is very close to the ratio of its arguments, which may be an
exact division).
The usual approach of doing an underflowing computation if the
computed result is subnormal is followed. For 32-bit x86, there are
extra complications: the inline __ieee754_atan2 in bits/mathinline.h
needs to be disabled for float and double because other libm functions
using it generally rely on getting proper underflow exceptions from
it, while the out-of-line functions have to remove excess range and
precision from the underflowing result so as to return an exact 0 in
the case where errno should be set for underflow to 0. (The failures
I saw without that are similar to those Carlos reported for other
functions, where I haven't seen a response to
<https://sourceware.org/ml/libc-alpha/2015-01/msg00485.html>
confirming if my diagnosis is correct. Arguably all libm functions
with float and double returns should remove excess range and
precision, but that's a separate matter.)
The x86_64 long double case reported in a comment in bug 15319 is not
a bug (it's an argument of LDBL_MIN, and x86_64 is an after-rounding
architecture so the correct IEEE result is not to raise underflow in
the given rounding mode, in addition to treating the result as an
exact LDBL_MIN being within the newly clarified documentation of
accuracy goals). I'm presuming that the fpatan instruction can be
trusted to raise appropriate exceptions when the (long double) result
underflows (after rounding) and so no changes are needed for x86 /
x86_64 long double functions here; empirically this is the case for
the cases covered in the testsuite, on my system.
Tested for x86_64, x86, powerpc and mips64. Only 32-bit x86 needs
ulps updates (for the changes to inlines meaning some functions no
longer get excess precision from their __ieee754_atan2* calls).
[BZ #15319]
* sysdeps/i386/fpu/e_atan2.S (dbl_min): New object.
(MO): New macro.
(__ieee754_atan2): For results with small absolute value, force
underflow exception and remove excess range and precision from
return value.
* sysdeps/i386/fpu/e_atan2f.S (flt_min): New object.
(MO): New macro.
(__ieee754_atan2f): For results with small absolute value, force
underflow exception and remove excess range and precision from
return value.
* sysdeps/i386/fpu/s_atan.S (dbl_min): New object.
(MO): New macro.
(__atan): For results with small absolute value, force underflow
exception and remove excess range and precision from return value.
* sysdeps/i386/fpu/s_atanf.S (flt_min): New object.
(MO): New macro.
(__atanf): For results with small absolute value, force underflow
exception and remove excess range and precision from return value.
* sysdeps/ieee754/dbl-64/e_atan2.c: Include <float.h> and
<math.h>.
(__ieee754_atan2): Force underflow exception for results with
small absolute value.
* sysdeps/ieee754/dbl-64/s_atan.c: Include <float.h> and
<math_private.h>.
(atan): Force underflow exception for results with small absolute
value.
* sysdeps/ieee754/flt-32/s_atanf.c: Include <float.h>.
(__atanf): Force underflow exception for results with small
absolute value.
* sysdeps/ieee754/ldbl-128/s_atanl.c: Include <float.h> and
<math.h>.
(__atanl): Force underflow exception for results with small
absolute value.
* sysdeps/ieee754/ldbl-128ibm/s_atanl.c: Include <float.h>.
(__atanl): Force underflow exception for results with small
absolute value.
* sysdeps/x86/fpu/bits/mathinline.h
[!__SSE2_MATH__ && !__x86_64__ && __LIBC_INTERNAL_MATH_INLINES]
(__ieee754_atan2): Only define inline for long double.
* sysdeps/x86_64/fpu/multiarch/e_atan2.c
[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Include <math.h>.
* math/auto-libm-test-in: Do not mark underflow exceptions as
possibly missing for bug 15319. Add more tests of atan2.
* math/auto-libm-test-out: Regenerated.
* math/libm-test.inc (casin_test_data): Do not mark underflow
exceptions as possibly missing for bug 15319.
(casinh_test_data): Likewise.
* sysdeps/i386/fpu/libm-test-ulps: Update.
This patch fixes bug 16315, bad pow handling of overflow/underflow in
non-default rounding modes. Tests of pow are duly converted to
ALL_RM_TEST to run all tests in all rounding modes.
There are two main issues here. First, various implementations
compute a negative result by negating a positive result, but this
yields inappropriate overflow / underflow values for directed
rounding, so either overflow / underflow results need recomputing in
the correct sign, or the relevant overflowing / underflowing operation
needs to be made to have a result of the correct sign. Second, the
dbl-64 implementation sets FE_TONEAREST internally; in the overflow /
underflow case, the result needs recomputing in the original rounding
mode.
Tested x86_64 and x86 and ulps updated accordingly.
[BZ #16315]
* sysdeps/i386/fpu/e_pow.S (__ieee754_pow): Ensure possibly
overflowing or underflowing operations take place with sign of
result.
* sysdeps/i386/fpu/e_powf.S (__ieee754_powf): Likewise.
* sysdeps/i386/fpu/e_powl.S (__ieee754_powl): Likewise.
* sysdeps/ieee754/dbl-64/e_pow.c: Include <math.h>.
(__ieee754_pow): Recompute overflowing and underflowing results in
original rounding mode.
* sysdeps/x86/fpu/powl_helper.c: Include <stdbool.h>.
(__powl_helper): Allow negative argument X and scale negated value
as needed. Avoid passing value outside [-1, 1] to f2xm1.
* sysdeps/x86_64/fpu/e_powl.S (__ieee754_powl): Ensure possibly
overflowing or underflowing operations take place with sign of
result.
* sysdeps/x86_64/fpu/multiarch/e_pow.c [HAVE_FMA4_SUPPORT]:
Include <math.h>.
* math/auto-libm-test-in: Add more tests of pow.
* math/auto-libm-test-out: Regenerated.
* math/libm-test.inc (pow_test): Use ALL_RM_TEST.
(pow_tonearest_test_data): Remove.
(pow_test_tonearest): Likewise.
(pow_towardzero_test_data): Likewise.
(pow_test_towardzero): Likewise.
(pow_downward_test_data): Likewise.
(pow_test_downward): Likewise.
(pow_upward_test_data): Likewise.
(pow_test_upward): Likewise.
(main): Don't call removed functions.
* sysdeps/i386/fpu/libm-test-ulps: Update.
* sysdeps/x86_64/fpu/libm-test-ulps: Likewise.
My recent exp patch introduced warnings about implicit __isinf
declarations in exp because e_exp.c didn't include <math.h>. This
patch fixes this. Because <math.h> can't be included after
<math_private.h> (because of macro definitions of __nan*), it was
necessary to put an include in sysdeps/x86_64/fpu/multiarch/e_exp.c as
well.
Tested x86_64.
* sysdeps/ieee754/dbl-64/e_exp.c: Include <math.h>.
* sysdeps/x86_64/fpu/multiarch/e_exp.c
[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
While these instructions accept memory operands, only one operand
may be a memory operand. Giving two operands xm constraints gives
the compiler the option of using memory for both operands, which
would result in invalid assembly code. Using x for all operands is
more appropriate, as most x86_64 calling conventions will pass the
arguments in registers anyway.
2013-05-15 Peter Collingbourne <pcc@google.com>
* sysdeps/x86_64/fpu/multiarch/s_fma.c (__fma_fma4): Replace xm
constraints with x constraints.
* sysdeps/x86_64/fpu/multiarch/s_fmaf.c (__fmaf_fma4): Likewise.
There is no problem with strcmp, it doesn't use the YMM registers.
The math routines might since gcc perhaps generates such code.
Introduce bit_YMM_USBALE and use it in the math routines.
Provide macros so that the internal users can, if possible, directly use
the new instructions.
Also fix up the mathinline.h header when compiling with SSE4.1 enabled.