This is a new implementation for condition variables, required
after http://austingroupbugs.net/view.php?id=609 to fix bug 13165. In
essence, we need to be stricter in which waiters a signal or broadcast
is required to wake up; this couldn't be solved using the old algorithm.
ISO C++ made a similar clarification, so this also fixes a bug in
current libstdc++, for example.
We can't use the old algorithm anymore because futexes do not guarantee
to wake in FIFO order. Thus, when we wake, we can't simply let any
waiter grab a signal, but we need to ensure that one of the waiters
happening before the signal is woken up. This is something the previous
algorithm violated (see bug 13165).
There's another issue specific to condvars: ABA issues on the underlying
futexes. Unlike mutexes that have just three states, or semaphores that
have no tokens or a limited number of them, the state of a condvar is
the *order* of the waiters. A waiter on a semaphore can grab a token
whenever one is available; a condvar waiter must only consume a signal
if it is eligible to do so as determined by the relative order of the
waiter and the signal.
Therefore, this new algorithm maintains two groups of waiters: Those
eligible to consume signals (G1), and those that have to wait until
previous waiters have consumed signals (G2). Once G1 is empty, G2
becomes the new G1. 64b counters are used to avoid ABA issues.
This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
waking just one thread and requeueing all others on the futex of the
mutex supplied by the program). I don't think doing the requeue is
necessarily the right approach (but I haven't done real measurements
yet):
* If a program expects to wake many threads at the same time and make
that scalable, a condvar isn't great anyway because of how it requires
waiters to operate mutually exclusive (due to the mutex usage). Thus, a
thundering herd problem is a scalability problem with or without the
optimization. Using something like a semaphore might be more
appropriate in such a case.
* The scalability problem is actually at the mutex side; the condvar
could help (and it tries to with the requeue optimization), but it
should be the mutex who decides how that is done, and whether it is done
at all.
* Forcing all but one waiter into the kernel-side wait queue of the
mutex prevents/avoids the use of lock elision on the mutex. Thus, it
prevents the only cure against the underlying scalability problem
inherent to condvars.
* If condvars use short critical sections (ie, hold the mutex just to
check a binary flag or such), which they should do ideally, then forcing
all those waiter to proceed serially with kernel-based hand-off (ie,
futex ops in the mutex' contended state, via the futex wait queues) will
be less efficient than just letting a scalable mutex implementation take
care of it. Our current mutex impl doesn't employ spinning at all, but
if critical sections are short, spinning can be much better.
* Doing the requeue stuff requires all waiters to always drive the mutex
into the contended state. This leads to each waiter having to call
futex_wake after lock release, even if this wouldn't be necessary.
[BZ #13165]
* nptl/pthread_cond_broadcast.c (__pthread_cond_broadcast): Rewrite to
use new algorithm.
* nptl/pthread_cond_destroy.c (__pthread_cond_destroy): Likewise.
* nptl/pthread_cond_init.c (__pthread_cond_init): Likewise.
* nptl/pthread_cond_signal.c (__pthread_cond_signal): Likewise.
* nptl/pthread_cond_wait.c (__pthread_cond_wait): Likewise.
(__pthread_cond_timedwait): Move here from pthread_cond_timedwait.c.
(__condvar_confirm_wakeup, __condvar_cancel_waiting,
__condvar_cleanup_waiting, __condvar_dec_grefs,
__pthread_cond_wait_common): New.
(__condvar_cleanup): Remove.
* npt/pthread_condattr_getclock.c (pthread_condattr_getclock): Adapt.
* npt/pthread_condattr_setclock.c (pthread_condattr_setclock):
Likewise.
* npt/pthread_condattr_getpshared.c (pthread_condattr_getpshared):
Likewise.
* npt/pthread_condattr_init.c (pthread_condattr_init): Likewise.
* nptl/tst-cond1.c: Add comment.
* nptl/tst-cond20.c (do_test): Adapt.
* nptl/tst-cond22.c (do_test): Likewise.
* sysdeps/aarch64/nptl/bits/pthreadtypes.h (pthread_cond_t): Adapt
structure.
* sysdeps/arm/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
* sysdeps/ia64/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
* sysdeps/m68k/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
* sysdeps/microblaze/nptl/bits/pthreadtypes.h (pthread_cond_t):
Likewise.
* sysdeps/mips/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
* sysdeps/nios2/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
* sysdeps/s390/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
* sysdeps/sh/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
* sysdeps/tile/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
* sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h (pthread_cond_t):
Likewise.
* sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h (pthread_cond_t):
Likewise.
* sysdeps/x86/bits/pthreadtypes.h (pthread_cond_t): Likewise.
* sysdeps/nptl/internaltypes.h (COND_NWAITERS_SHIFT): Remove.
(COND_CLOCK_BITS): Adapt.
* sysdeps/nptl/pthread.h (PTHREAD_COND_INITIALIZER): Adapt.
* nptl/pthreadP.h (__PTHREAD_COND_CLOCK_MONOTONIC_MASK,
__PTHREAD_COND_SHARED_MASK): New.
* nptl/nptl-printers.py (CLOCK_IDS): Remove.
(ConditionVariablePrinter, ConditionVariableAttributesPrinter): Adapt.
* nptl/nptl_lock_constants.pysym: Adapt.
* nptl/test-cond-printers.py: Adapt.
* sysdeps/unix/sysv/linux/hppa/internaltypes.h (cond_compat_clear,
cond_compat_check_and_clear): Adapt.
* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Remove file ...
* sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
(__pthread_cond_timedwait): ... and move here.
* nptl/DESIGN-condvar.txt: Remove file.
* nptl/lowlevelcond.sym: Likewise.
* nptl/pthread_cond_timedwait.c: Likewise.
* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S: Likewise.
* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S: Likewise.
* sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S: Likewise.
* sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S: Likewise.
* sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: Likewise.
* sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: Likewise.
Patch disables Intel TSX on some Haswell processors to avoid TSX
on kernels that weren't updated with the latest microcode package
(which disables broken feature by default).
* sysdeps/x86/cpu-features.c (get_common_indeces): Add
stepping identification.
(init_cpu_features): Add handle of Haswell.
Information about whether the ABI of long double is the same as that
of double is split between bits/mathdef.h and bits/wordsize.h.
When the ABIs are the same, bits/mathdef.h defines
__NO_LONG_DOUBLE_MATH. In addition, in the case where the same glibc
binary supports both -mlong-double-64 and -mlong-double-128,
bits/wordsize.h defines __LONG_DOUBLE_MATH_OPTIONAL, along with
__NO_LONG_DOUBLE_MATH if this particular compilation is with
-mlong-double-64.
As part of the refactoring I proposed in
<https://sourceware.org/ml/libc-alpha/2016-11/msg00745.html>, this
patch puts all that information in a single header,
bits/long-double.h. It is included from sys/cdefs.h alongside the
include of bits/wordsize.h, so other headers generally do not need to
include bits/long-double.h directly.
Previously, various bits/mathdef.h headers and bits/wordsize.h headers
had this long double information (including implicitly in some
bits/mathdef.h headers through not having the defines present in the
default version). After the patch, it's all in six bits/long-double.h
headers. Furthermore, most of those new headers are not
architecture-specific. Architectures with optional long double all
use the ldbl-opt sysdeps directory, either in the order (ldbl-64-128,
ldbl-opt, ldbl-128) or (ldbl-128ibm, ldbl-opt). Thus a generic header
for the case where long double = double, and headers in ldbl-128,
ldbl-96 and ldbl-opt, suffices to cover every architecture except for
cases where long double properties vary between different ABIs sharing
a set of installed headers; fortunately all the ldbl-opt cases share a
single compiler-predefined macro __LONG_DOUBLE_128__ that can be used
to tell whether this compilation is -mlong-double-64 or
-mlong-double-128.
The two cases where a set of headers is shared between ABIs with
different long double properties, MIPS (o32 has long double = double,
other ABIs use ldbl-128) and SPARC (32-bit has optional long double,
64-bit has required long double), need their own bits/long-double.h
headers.
As with bits/wordsize.h, multiple-include protection for this header
is generally implicit through the include guards on sys/cdefs.h, and
multiple inclusion is harmless in any case. There is one subtlety:
the header must not define __LONG_DOUBLE_MATH_OPTIONAL if
__NO_LONG_DOUBLE_MATH was defined before its inclusion, because doing
so breaks how sysdeps/ieee754/ldbl-opt/nldbl-compat.h defines
__NO_LONG_DOUBLE_MATH itself before including system headers. Subject
to keeping that working, it would be reasonable to move these macros
from defined/undefined #ifdef to always-defined 1/0 #if semantics, but
this patch does not attempt to do so, just rearranges where the macros
are defined.
After this patch, the only use of bits/mathdef.h is the alpha one for
modifying complex function ABIs for old GCC. Thus, all versions of
the header other than the default and alpha versions are removed, as
is the include from math.h.
Tested for x86_64 and x86. Also did compilation-only testing with
build-many-glibcs.py.
* bits/long-double.h: New file.
* sysdeps/ieee754/ldbl-128/bits/long-double.h: Likewise.
* sysdeps/ieee754/ldbl-96/bits/long-double.h: Likewise.
* sysdeps/ieee754/ldbl-opt/bits/long-double.h: Likewise.
* sysdeps/mips/bits/long-double.h: Likewise.
* sysdeps/unix/sysv/linux/sparc/bits/long-double.h: Likewise.
* math/Makefile (headers): Add bits/long-double.h.
* misc/sys/cdefs.h: Include <bits/long-double.h>.
* stdlib/strtold.c: Include <bits/long-double.h> instead of
<bits/wordsize.h>.
* bits/mathdef.h [!_COMPLEX_H]: Do not allow inclusion.
[!__NO_LONG_DOUBLE_MATH]: Remove conditional code.
* math/math.h: Do not include <bits/mathdef.h>.
* sysdeps/aarch64/bits/mathdef.h: Remove file.
* sysdeps/alpha/bits/mathdef.h [!_COMPLEX_H]: Do not allow
inclusion.
* sysdeps/ia64/bits/mathdef.h: Remove file.
* sysdeps/m68k/m680x0/bits/mathdef.h: Likewise.
* sysdeps/mips/bits/mathdef.h: Likewise.
* sysdeps/powerpc/bits/mathdef.h: Likewise.
* sysdeps/s390/bits/mathdef.h: Likewise.
* sysdeps/sparc/bits/mathdef.h: Likewise.
* sysdeps/x86/bits/mathdef.h: Likewise.
* sysdeps/s390/s390-32/bits/wordsize.h
[!__NO_LONG_DOUBLE_MATH && !__LONG_DOUBLE_MATH_OPTIONAL]: Remove
conditional code.
* sysdeps/s390/s390-64/bits/wordsize.h
[!__NO_LONG_DOUBLE_MATH && !__LONG_DOUBLE_MATH_OPTIONAL]:
Likewise.
* sysdeps/unix/sysv/linux/alpha/bits/wordsize.h
[!__NO_LONG_DOUBLE_MATH && !__LONG_DOUBLE_MATH_OPTIONAL]:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/bits/wordsize.h
[!__NO_LONG_DOUBLE_MATH && !__LONG_DOUBLE_MATH_OPTIONAL]:
Likewise.
* sysdeps/unix/sysv/linux/sparc/bits/wordsize.h
[!__NO_LONG_DOUBLE_MATH && !__LONG_DOUBLE_MATH_OPTIONAL]:
Likewise.
This uses atomic operations to access lock elision metadata that is accessed
concurrently (ie, adapt_count fields). The size of the data is less than a
word but accessed only with atomic loads and stores; therefore, we add
support for shorter-size atomic load and stores too.
* include/atomic.h (__atomic_check_size_ls): New.
(atomic_load_relaxed, atomic_load_acquire, atomic_store_relaxed,
atomic_store_release): Use it.
* sysdeps/x86/elide.h (ACCESS_ONCE): Remove.
(elision_adapt, ELIDE_LOCK): Use atomics.
* sysdeps/unix/sysv/linux/x86/elision-lock.c (__lll_lock_elision): Use
atomics and improve code comments.
* sysdeps/unix/sysv/linux/x86/elision-trylock.c
(__lll_trylock_elision): Likewise.
Continuing the refactoring of bits/mathdef.h, this patch stops it
defining FP_ILOGB0 and FP_ILOGBNAN, moving the required information to
a new header bits/fp-logb.h.
There are only two possible values of each of those macros permitted
by ISO C. TS 18661-1 adds corresponding macros for llogb, and their
values are required to correspond to those of the ilogb macros in the
obvious way. Thus two boolean values - for which the same choices are
correct for most architectures - suffice to determine the value of all
these macros, and by defining macros for those boolean values in
bits/fp-logb.h we can then define the public FP_* macros in math.h and
avoid the present duplication of the associated feature test macro
logic.
This patch duly moves to bits/fp-logb.h defining __FP_LOGB0_IS_MIN and
__FP_LOGBNAN_IS_MIN. Default definitions of those to 0 are correct
for both architectures, while ia64, m68k and x86 get their own
versions of bits/fp-logb.h to reflect their use of values different
from the defaults.
The patch renders many copies of bits/mathdef.h trivial (needed only
to avoid the default __NO_LONG_DOUBLE_MATH). I'll revise
<https://sourceware.org/ml/libc-alpha/2016-11/msg00865.html>
accordingly so that it removes all bits/mathdef.h headers except the
default one and the alpha one, and arranges for the header to be
included only by complex.h as the only remaining use at that point
will be for the alpha ABI issues there.
Tested for x86_64 and x86. Also did compile-only testing with
build-many-glibcs.py (using glibc sources from before the commit that
introduced many build failures with undefined __GI___sigsetjmp).
* bits/fp-logb.h: New file.
* sysdeps/ia64/bits/fp-logb.h: Likewise.
* sysdeps/m68k/m680x0/bits/fp-logb.h: Likewise.
* sysdeps/x86/bits/fp-logb.h: Likewise.
* math/Makefile (headers): Add bits/fp-logb.h.
* math/math.h: Include <bits/fp-logb.h>.
[__USE_ISOC99] (FP_ILOGB0): Define based on __FP_LOGB0_IS_MIN.
[__USE_ISOC99] (FP_ILOGBNAN): Define based on __FP_LOGBNAN_IS_MIN.
* bits/mathdef.h (FP_ILOGB0): Remove.
(FP_ILOGBNAN): Likewise.
* sysdeps/aarch64/bits/mathdef.h (FP_ILOGB0): Likewise.
(FP_ILOGBNAN): Likewise.
* sysdeps/alpha/bits/mathdef.h (FP_ILOGB0): Likewise.
(FP_ILOGBNAN): Likewise.
* sysdeps/ia64/bits/mathdef.h (FP_ILOGB0): Likewise.
(FP_ILOGBNAN): Likewise.
* sysdeps/m68k/m680x0/bits/mathdef.h (FP_ILOGB0): Likewise.
(FP_ILOGBNAN): Likewise.
* sysdeps/mips/bits/mathdef.h (FP_ILOGB0): Likewise.
(FP_ILOGBNAN): Likewise.
* sysdeps/powerpc/bits/mathdef.h (FP_ILOGB0): Likewise.
(FP_ILOGBNAN): Likewise.
* sysdeps/s390/bits/mathdef.h (FP_ILOGB0): Likewise.
(FP_ILOGBNAN): Likewise.
* sysdeps/sparc/bits/mathdef.h (FP_ILOGB0): Likewise.
(FP_ILOGBNAN): Likewise.
* sysdeps/x86/bits/mathdef.h (FP_ILOGB0): Likewise.
(FP_ILOGBNAN): Likewise.
Continuing the refactoring of bits/mathdef.h, this patch moves the
FP_FAST_* definitions into a new bits/fp-fast.h header. Currently
this is only for FP_FAST_FMA*, but in future it would be the
appropriate place for the FP_FAST_* macros from TS 18661-1 as well.
The generic bits/mathdef.h header defines these macros based on
whether the compiler defines __FP_FAST_*. Most architecture-specific
headers, however, fail to do so, meaning that if the architecture (or
some particular processors) does in fact have fused operations, and
GCC knows to use them inline, the FP_FAST_* macros will still not be
defined.
By refactoring, this patch causes the generic version (based on
__FP_FAST_*) to be used in more cases, and so the macro definitions to
be more accurate. Architectures that already defined some or all of
these macros other than based on the predefines have their own
versions of fp-fast.h, which are arranged so they define FP_FAST_* if
either the architecture-specific conditions are true or __FP_FAST_*
are defined.
After this refactoring, various bits/mathdef.h headers for
architectures with long double = double are semantically identical to
the generic version. The patch removes those headers that are
redundant. (In fact two of the four removed were already redundant
before this patch because they did use __FP_FAST_*.)
Tested for x86_64 and x86, and compilation-only with
build-many-glibcs.py.
* bits/fp-fast.h: New file.
* sysdeps/aarch64/bits/fp-fast.h: Likewise.
* sysdeps/powerpc/bits/fp-fast.h: Likewise.
* math/Makefile (headers): Add bits/fp-fast.h.
* math/math.h: Include <bits/fp-fast.h>.
* bits/mathdef.h (FP_FAST_FMA): Remove.
(FP_FAST_FMAF): Likewise.
(FP_FAST_FMAL): Likewise.
* sysdeps/aarch64/bits/mathdef.h (FP_FAST_FMA): Likewise.
(FP_FAST_FMAF): Likewise.
* sysdeps/powerpc/bits/mathdef.h (FP_FAST_FMA): Likewise.
(FP_FAST_FMAF): Likewise.
* sysdeps/x86/bits/mathdef.h (FP_FAST_FMA): Likewise.
(FP_FAST_FMAF): Likewise.
(FP_FAST_FMAL): Likewise.
* sysdeps/arm/bits/mathdef.h: Remove file.
* sysdeps/hppa/fpu/bits/mathdef.h: Likewise.
* sysdeps/sh/sh4/bits/mathdef.h: Likewise.
* sysdeps/tile/bits/mathdef.h: Likewise.
At present, definitions of float_t and double_t are split among many
bits/mathdef.h headers.
For all but three architectures, these types are float and double.
Furthermore, if you assume __FLT_EVAL_METHOD__ to be defined, that
provides a more generic way of determining the correct values of these
typedefs. Defining these typedefs more generally based on
__FLT_EVAL_METHOD__ was previously proposed by Paul Eggert in
<https://sourceware.org/ml/libc-alpha/2012-02/msg00002.html>.
This patch refactors things in the way I proposed in
<https://sourceware.org/ml/libc-alpha/2016-11/msg00745.html>. A new
header bits/flt-eval-method.h defines a single macro,
__GLIBC_FLT_EVAL_METHOD, which is then used by math.h to define
float_t and double_t. The default is based on __FLT_EVAL_METHOD__
(although actually a default to 0 would have the same effect for
current ports, because ports where values other than 0 or 16 are
possible all have their own headers).
To avoid changing the existing semantics in any case, including for
compilers not defining __FLT_EVAL_METHOD__, architecture-specific
files are then added for m68k, s390, x86 which replicate the existing
semantics. At least with __FLT_EVAL_METHOD__ values possible with
GCC, there should be no change to the choices of float_t and double_t
for any supported configuration.
Architecture maintainer notes:
* m68k: sysdeps/m68k/m680x0/bits/flt-eval-method.h always defines
__GLIBC_FLT_EVAL_METHOD to 2 to replicate the existing logic. But
actually GCC defines __FLT_EVAL_METHOD__ to 0 if TARGET_68040. It
might make sense to make the header prefer to base things on
__FLT_EVAL_METHOD__ if defined, like the x86 version, and so make
the choices of these types more accurate (with a NEWS entry as for
the other changes to these types on particular architectures).
* s390: sysdeps/s390/bits/flt-eval-method.h always defines
__GLIBC_FLT_EVAL_METHOD to 1 to replicate the existing logic. As
previously discussed, it might make sense in coordination with GCC
to eliminate the historic mistake, avoid excess precision in the
-fexcess-precision=standard case and make the typedefs match (with a
NEWS entry, again).
Tested for x86-64 and x86. Also did compilation-only testing with
build-many-glibcs.py.
* bits/flt-eval-method.h: New file.
* sysdeps/m68k/m680x0/bits/flt-eval-method.h: Likewise.
* sysdeps/s390/bits/flt-eval-method.h: Likewise.
* sysdeps/x86/bits/flt-eval-method.h: Likewise.
* math/Makefile (headers): Add bits/flt-eval-method.h.
* math/math.h: Include <bits/flt-eval-method.h>.
[__USE_ISOC99] (float_t): Define based on __GLIBC_FLT_EVAL_METHOD.
[__USE_ISOC99] (double_t): Likewise.
* bits/mathdef.h (float_t): Remove.
(double_t): Likewise.
* sysdeps/aarch64/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/alpha/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/arm/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/hppa/fpu/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/ia64/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/m68k/m680x0/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/mips/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/powerpc/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/s390/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/sh/sh4/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/sparc/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/tile/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
* sysdeps/x86/bits/mathdef.h (float_t): Likewise.
(double_t): Likewise.
Bug 20787 reports that, while float_t and double_t for 32-bit x86
properly respect -mfpmath=sse, for x86_64 they fail to reflect
-mfpmath=387, which is valid if unusual and results in FLT_EVAL_METHOD
being 2. This patch fixes the definitions to respect
__FLT_EVAL_METHOD__ in that case, arranging for the test that the
types correspond with FLT_EVAL_METHOD to be run with both -mfpmath=387
and -mfpmath=sse.
Note: this patch will also have the effect of making float_t and
double_t be long double for x86_64 with -mfpmath=sse+387, when
FLT_EVAL_METHOD is -1. It seems reasonable for x86_64 to be
consistent with 32-bit x86 in this case (and that definition is
conservatively safe, in that it makes the types correspond to the
widest evaluation format that might be used).
Tested for x86-64 and x86.
[BZ #20787]
* sysdeps/x86/bits/mathdef.h (float_t): Do not define to float if
[__x86_64__] when __FLT_EVAL_METHOD__ is nonzero.
(double_t): Do not define to double if [__x86_64__] when
__FLT_EVAL_METHOD__ is nonzero.
* sysdeps/x86/fpu/test-flt-eval-method-387.c: New file.
* sysdeps/x86/fpu/test-flt-eval-method-sse.c: Likewise.
* sysdeps/x86/fpu/Makefile [$(subdir) = math] (tests): Add
test-flt-eval-method-387 and test-flt-eval-method-sse.
[$(subdir) = math] (CFLAGS-test-flt-eval-method-387.c): New
variable.
[$(subdir) = math] (CFLAGS-test-flt-eval-method-sse.c): Likewise.
In the Intel Architecture Instruction Set Extensions Programming
reference the recommended way to test for FMA in section
'2.2.1 Detection of FMA' is:
"Application Software must identify that hardware supports AVX as
explained in ... after that it must also detect support for FMA..."
We don't do that in glibc. We use osxsave to detect the use of xgetbv,
and after that we check for AVX and FMA orthogonally. It is conceivable
that you could have the AVX bit clear and the FMA bit in an undefined
state.
This commit fixes FMA and AVX2 detection to depend on usable AVX
as required by the recommended Intel sequences.
v1: https://www.sourceware.org/ml/libc-alpha/2016-10/msg00241.html
v2: https://www.sourceware.org/ml/libc-alpha/2016-10/msg00265.html
Since the maximum CPUID level of older Intel CPUs is 1, change
handle_intel to return -1, instead of assert, when the maximum
CPUID level is less than 2.
[BZ #20647]
* sysdeps/x86/cacheinfo.c (handle_intel): Return -1 if the
maximum CPUID level is less than 2.
TS 18661-1 adds an iseqsig type-generic comparison macro to <math.h>.
This macro is like the == operator except that unordered operands
result in the "invalid" exception and errno being set to EDOM.
This patch implements this macro for glibc. Given the need to set
errno, this is implemented with out-of-line functions __iseqsigf,
__iseqsig and __iseqsigl (of which the last only exists at all if long
double is ABI-distinct from double, so no function aliases or compat
support are needed). The present patch ignores excess precision
issues; I intend to deal with those in a followup patch. (Like
comparison operators, type-generic comparison macros should *not*
convert operands to their semantic types but should preserve excess
range and precision, meaning that for some argument types and values
of FLT_EVAL_METHOD, an underlying function should be called for a
wider type than that of the arguments.)
The underlying functions are implemented with the type-generic
template machinery. Comparing x <= y && x >= y is sufficient in ISO C
to achieve an equality comparison with "invalid" raised for unordered
operands (and the results of those two comparisons can also be used to
tell whether errno needs to be set). However, some architectures have
GCC bugs meaning that unordered comparison instructions are used
instead of ordered ones. Thus, a mechanism is provided for
architectures to use an explicit call to feraiseexcept to raise
exceptions if required. If your architecture has such a bug you
should add a fix-fp-int-compare-invalid.h header for it, with a
comment pointing to the relevant GCC bug report; if such a GCC bug is
fixed, that header's contents should have a __GNUC_PREREQ conditional
added so that the workaround can eventually be removed for that
architecture.
Tested for x86_64, x86, mips64, arm and powerpc.
* math/math.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (iseqsig): New
macro.
* math/bits/mathcalls.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(__iseqsig): New declaration.
* math/s_iseqsig_template.c: New file.
* math/Versions (__iseqsigf): New libm symbol at version
GLIBC_2.25.
(__iseqsig): Likewise.
(__iseqsigl): Likewise.
* math/libm-test.inc (iseqsig_test_data): New array.
(iseqsig_test): New function.
(main): Call iseqsig_test.
* math/Makefile (gen-libm-calls): Add s_iseqsigF.
* manual/arith.texi (FP Comparison Functions): Document iseqsig.
* manual/libm-err-tab.pl: Update comment on interfaces without
ulps tabulated.
* sysdeps/generic/fix-fp-int-compare-invalid.h: New file.
* sysdeps/powerpc/fpu/fix-fp-int-compare-invalid.h: Likewise.
* sysdeps/x86/fpu/fix-fp-int-compare-invalid.h: Likewise.
* sysdeps/nacl/libm.abilist: Update.
* sysdeps/unix/sysv/linux/aarch64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/alpha/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/arm/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/hppa/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/ia64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/microblaze/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/nios2/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/sh/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/tile/tilegx/tilegx32/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/tile/tilegx/tilegx64/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/tile/tilepro/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Likewise.
This adds a test to ensure that the problems fixed in the last several
patches do not recur. Each directory checks the headers that it
installs for two properties: first, each header must be compilable in
isolation, as both C and C++, under a representative combination of
language and library conformance levels; second, there is a blacklist
of identifiers that may not appear in any installed header, currently
consisting of the legacy BSD typedefs. (There is an exemption for the
headers that define those typedefs, and for the RPC headers. It may be
necessary to make this more sophisticated if we add more stuff to the
blacklist in the future.)
In order for this test to work correctly, every wrapper header
that actually defines something must guard those definitions with
#ifndef _ISOMAC. This is the existing mechanism used by the conform/
tests to tell wrapper headers not to define anything that the public
header wouldn't, and not to use anything from libc-symbols.h. conform/
only cares for headers that we need to check for standards conformance,
whereas this test applies to *every* header. (Headers in include/ that
are either installed directly, or are internal-use-only and do *not*
correspond to any installed header, are not affected.)
* scripts/check-installed-headers.sh: New script.
* Rules: In each directory that defines header files to be installed,
run check-installed-headers.sh on them as a special test.
* Makefile: Likewise for the headers installed at top level.
* include/aliases.h, include/alloca.h, include/argz.h
* include/arpa/nameser.h, include/arpa/nameser_compat.h
* include/elf.h, include/envz.h, include/err.h
* include/execinfo.h, include/fpu_control.h, include/getopt.h
* include/gshadow.h, include/ifaddrs.h, include/libintl.h
* include/link.h, include/malloc.h, include/mcheck.h
* include/mntent.h, include/netinet/ether.h
* include/nss.h, include/obstack.h, include/printf.h
* include/pty.h, include/resolv.h, include/rpc/auth.h
* include/rpc/auth_des.h, include/rpc/auth_unix.h
* include/rpc/clnt.h, include/rpc/des_crypt.h
* include/rpc/key_prot.h, include/rpc/netdb.h
* include/rpc/pmap_clnt.h, include/rpc/pmap_prot.h
* include/rpc/pmap_rmt.h, include/rpc/rpc.h
* include/rpc/rpc_msg.h, include/rpc/svc.h
* include/rpc/svc_auth.h, include/rpc/xdr.h
* include/rpcsvc/nis_callback.h, include/rpcsvc/nislib.h
* include/rpcsvc/yp.h, include/rpcsvc/ypclnt.h
* include/rpcsvc/ypupd.h, include/shadow.h
* include/stdio_ext.h, include/sys/epoll.h
* include/sys/file.h, include/sys/gmon.h, include/sys/ioctl.h
* include/sys/prctl.h, include/sys/profil.h
* include/sys/statfs.h, include/sys/sysctl.h
* include/sys/sysinfo.h, include/ttyent.h, include/utmp.h
* sysdeps/arm/nacl/include/bits/setjmp.h
* sysdeps/mips/include/sys/asm.h
* sysdeps/unix/sysv/linux/include/sys/sysinfo.h
* sysdeps/unix/sysv/linux/include/sys/timex.h
* sysdeps/x86/fpu/include/bits/fenv.h:
Add #ifndef _ISOMAC guard around internal declarations.
Add multiple-inclusion guard if not already present.
TS 18661-1 defines a type femode_t to represent the set of dynamic
floating-point control modes (such as the rounding mode and trap
enablement modes), and functions fegetmode and fesetmode to manipulate
those modes (without affecting other state such as the raised
exception flags) and a corresponding macro FE_DFL_MODE.
This patch series implements those interfaces for glibc. This first
patch adds the architecture-independent pieces, the x86 and x86_64
implementations, and the <bits/fenv.h> and ABI baseline updates for
all architectures so glibc keeps building and passing the ABI tests on
all architectures. Subsequent patches add the fegetmode and fesetmode
implementations for other architectures.
femode_t is generally an integer type - the same type as fenv_t, or as
the single element of fenv_t where fenv_t is a structure containing a
single integer (or the single relevant element, where it has elements
for both status and control registers) - except where architecture
properties or consistency with the fenv_t implementation indicate
otherwise. FE_DFL_MODE follows FE_DFL_ENV in whether it's a magic
pointer value (-1 cast to const femode_t *), a value that can be
distinguished from valid pointers by its high bits but otherwise
contains a representation of the desired register contents, or a
pointer to a constant variable (the powerpc case; __fe_dfl_mode is
added as an exported constant object, an alias to __fe_dfl_env).
Note that where architectures (that share a register between control
and status bits) gain definitions of new floating-point control or
status bits in future, the implementations of fesetmode for those
architectures may need updating (depending on whether the new bits are
control or status bits and what the implementation does with
previously unknown bits), just like existing implementations of
<fenv.h> functions that take care not to touch reserved bits may need
updating when the set of reserved bits changes. (As any new bits are
outside the scope of ISO C, that's just a quality-of-implementation
issue for supporting them, not a conformance issue.)
As with fenv_t, femode_t should properly include any software DFP
rounding mode (and for both fenv_t and femode_t I'd consider that
fragment of DFP support appropriate for inclusion in glibc even in the
absence of the rest of libdfp; hardware DFP rounding modes should
already be included if the definitions of which bits are status /
control bits are correct).
Tested for x86_64, x86, mips64 (hard float, and soft float to test the
fallback version), arm (hard float) and powerpc (hard float, soft
float and e500). Other architecture versions are untested.
* math/fegetmode.c: New file.
* math/fesetmode.c: Likewise.
* sysdeps/i386/fpu/fegetmode.c: Likewise.
* sysdeps/i386/fpu/fesetmode.c: Likewise.
* sysdeps/x86_64/fpu/fegetmode.c: Likewise.
* sysdeps/x86_64/fpu/fesetmode.c: Likewise.
* math/fenv.h: Update comment on inclusion of <bits/fenv.h>.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (fegetmode): New function
declaration.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (fesetmode): Likewise.
* bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New
typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/aarch64/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/alpha/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/arm/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/hppa/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/ia64/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/m68k/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/microblaze/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/mips/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/nios2/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/powerpc/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (__fe_dfl_mode): New variable
declaration.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/s390/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/sh/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/sparc/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/tile/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* sysdeps/x86/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)]
(femode_t): New typedef.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro.
* manual/arith.texi (FE_DFL_MODE): Document macro.
(fegetmode): Document function.
(fesetmode): Likewise.
* math/Versions (fegetmode): New libm symbol at version
GLIBC_2.25.
(fesetmode): Likewise.
* math/Makefile (libm-support): Add fegetmode and fesetmode.
(tests): Add test-femode and test-femode-traps.
* math/test-femode-traps.c: New file.
* math/test-femode.c: Likewise.
* sysdeps/powerpc/fpu/fenv_const.c (__fe_dfl_mode): Declare as
alias for __fe_dfl_env.
* sysdeps/powerpc/nofpu/fenv_const.c (__fe_dfl_mode): Likewise.
* sysdeps/powerpc/powerpc32/e500/nofpu/fenv_const.c
(__fe_dfl_mode): Likewise.
* sysdeps/powerpc/Versions (__fe_dfl_mode): New libm symbol at
version GLIBC_2.25.
* sysdeps/nacl/libm.abilist: Update.
* sysdeps/unix/sysv/linux/aarch64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/alpha/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/arm/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/hppa/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/ia64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/microblaze/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/nios2/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/sh/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/tile/tilegx/tilegx32/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/tile/tilegx/tilegx64/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/tile/tilepro/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Likewise.
There is transition penalty when SSE instructions are mixed with 256-bit
AVX or 512-bit AVX512 load instructions. Since _dl_runtime_resolve_avx
and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
registers, there is transition penalty when SSE instructions are used
with lazy binding on AVX and AVX512 processors.
To avoid SSE transition penalty, if only the lower 128 bits of the first
8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
with the zero upper bits.
For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
or the upper 256 bits of ZMM registers are zero. We can restore only the
non-zero portion of vector registers with AVX/AVX512 load instructions
which will zero-extend upper bits of vector registers.
This patch adds _dl_runtime_resolve_sse_vex which saves and restores
XMM registers with 128-bit AVX store/load instructions. It is used to
preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
_dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
that we store and load only the non-zero portion of vector registers.
This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
_dl_runtime_profile_avx512 when only the lower 128 bits of vector
registers are used.
_dl_runtime_resolve_avx_slow is added and used for AVX processors which
don't support XGETBV with ECX == 1. Since there is no SSE transition
penalty on AVX512 processors which don't support XGETBV with ECX == 1,
_dl_runtime_resolve_avx512_slow isn't provided.
[BZ #20495]
[BZ #20508]
* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
processors, set Use_dl_runtime_resolve_slow and set
Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
New.
(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
(index_arch_Use_dl_runtime_resolve_opt): Likewise.
(index_arch_Use_dl_runtime_resolve_slow): Likewise.
* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
if Use_dl_runtime_resolve_opt is set. Use
_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
(_dl_runtime_resolve_opt): New. Defined for AVX and AVX512.
(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
New.
(_dl_runtime_resolve_opt): Likewise.
(_dl_runtime_profile): Define only if _dl_runtime_profile is
defined.
All other state bits, except for bit_YMM_state, are defined as (1 << N).
This patch changes bit_YMM_state from (2 << 1) to (1 << 2).
* sysdeps/x86/cpu-features.h (bit_YMM_state): Set to (1 << 2).
current vector function declaration "#pragma omp declare simd notinbranch",
according to which vector sincos should have vector of pointers for second and
third parameters. It is fixed with implementation as wrapper to version
having second and third parameters as pointers.
[BZ #20024]
* sysdeps/x86/fpu/test-math-vector-sincos.h: New.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos2_core_sse4.S: Fixed ABI
of this implementation of vector function.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos4_core_avx2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf4_core_sse4.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf8_core_avx2.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos2_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos4_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos4_core_avx.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos8_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf16_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf4_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf8_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf8_core_avx.S: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c: Use another wrapper
for testing vector sincos with fixed ABI.
* sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx.c: New test.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx2.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx512.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx2.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx512.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf.c: Likewise.
* sysdeps/x86_64/fpu/Makefile: Added new tests.
Although the Enhanced REP MOVSB/STOSB (ERMS) implementations of memmove,
memcpy, mempcpy and memset aren't used by the current processors, this
patch adds Prefer_ERMS check in memmove, memcpy, mempcpy and memset so
that they can be used in the future.
* sysdeps/x86/cpu-features.h (bit_arch_Prefer_ERMS): New.
(index_arch_Prefer_ERMS): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Return
__memcpy_erms for Prefer_ERMS.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
(__memmove_erms): Enabled for libc.a.
* ysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Return
__memmove_erms or Prefer_ERMS.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Return
__mempcpy_erms for Prefer_ERMS.
* sysdeps/x86_64/multiarch/memset.S (memset): Return
__memset_erms for Prefer_ERMS.
Since the FMA4 bit is in COMMON_CPUID_INDEX_80000001 and FMA4 requires
AVX, determine if FMA4 is usable after COMMON_CPUID_INDEX_80000001 is
available and if AVX is usable.
[BZ #20195]
* sysdeps/x86/cpu-features.c (get_common_indeces): Move FMA4
check to ...
(init_cpu_features): Here.
For Intel processors, when there are both L2 and L3 caches, SMT level
type should be ued to count number of available logical processors
sharing L2 cache. If there is only L2 cache, core level type should
be used to count number of available logical processors sharing L2
cache. Number of available logical processors sharing L2 cache should
be used for non-inclusive L2 and L3 caches.
* sysdeps/x86/cacheinfo.c (init_cacheinfo): Count number of
available logical processors with SMT level type sharing L2
cache for Intel processors.
L2 cache is shared by 2 cores on Knights Landing, which has 4 threads
per core:
https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing
So L2 cache is shared by 8 threads on Knights Landing as reported by
CPUID. We should remove special L2 cache case for Knights Landing.
[BZ #18185]
* sysdeps/x86/cacheinfo.c (init_cacheinfo): Don't limit threads
sharing L2 cache to 2 for Knights Landing.
Intel CPUID with EAX == 11 returns:
ECX Bits 07 - 00: Level number. Same value in ECX input.
Bits 15 - 08: Level type.
^^^^^^^^^^^^^^^^^^^^^^^^ This is level type.
Bits 31 - 16: Reserved.
Intel processor level type mask should be 0xff00, not 0xff0.
[BZ #20119]
* sysdeps/x86/cacheinfo.c (init_cacheinfo): Correct Intel
processor level type mask for CPUID with EAX == 11.
Skip counting logical threads for Intel processors if the HTT bit is 0
which indicates there is only a single logical processor.
* sysdeps/x86/cacheinfo.c (init_cacheinfo): Skip counting
logical threads if the HTT bit is 0.
* sysdeps/x86/cpu-features.h (bit_cpu_HTT): New.
(index_cpu_HTT): Likewise.
(reg_HTT): Likewise.
Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym. Remove
x86 ifunc-defines.sym and rtld-global-offsets.sym. No code changes on
i686 and x86-64.
* sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers):
Remove ifunc-defines.sym.
* sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers):
Likewise.
* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed.
* sysdeps/x86/rtld-global-offsets.sym: Likewise.
* sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise.
* sysdeps/x86/Makefile (gen-as-const-headers): Remove
rtld-global-offsets.sym.
* sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ...
* sysdeps/x86/cpu-features-offsets.sym: This.
* sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h>
instead of <ifunc-defines.h> and <rtld-global-offsets.h>.
Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86. No code changes on x86
and x86_64.
* sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c>
instead of <sysdeps/x86_64/cacheinfo.c>.
* sysdeps/x86_64/cacheinfo.c: Moved to ...
* sysdeps/x86/cacheinfo.c: Here.
Updated from the model numbers of Goldmont and Airmont processors in
Intel64 And IA-32 Processor Architectures Software Developer's Manual
Volume 3 Revision 058.
* sysdeps/x86/cpu-features.c (init_cpu_features): Detect Intel
Goldmont and Airmont processors.
Intel Core i3, i5 and i7 processors have fast unaligned copy and
copy backward is ignored. Remove Fast_Copy_Backward from Intel Core
processors to avoid confusion.
* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
bit_arch_Fast_Copy_Backward for Intel Core proessors.
The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB
(ERMS) bit to x86 cpu-features.
* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
(index_cpu_ERMS): Likewise.
(reg_ERMS): Likewise.
On AMD processors, memcpy optimized with unaligned SSE load is
slower than emcpy optimized with aligned SSSE3 while other string
functions are faster with unaligned SSE load. A feature bit,
Fast_Unaligned_Copy, is added to select memcpy optimized with
unaligned SSE load.
[BZ #19583]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set
Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
processors. Set Fast_Copy_Backward for AMD Excavator
processors.
* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
New.
(index_arch_Fast_Unaligned_Copy): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
Since only Intel processors with AVX2 have fast unaligned load, we
should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
and call get_common_indeces for other processors.
Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to aoid loading
GLRO(dl_x86_cpu_features) in cpu-features.c.
[BZ #19583]
* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
inline. Check family before setting family, model and
extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable
bits here.
(init_cpu_features): Replace HAS_CPU_FEATURE and
HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load
for Intel processors with usable AVX2. Call get_common_indeces
for other processors with family == NULL.
* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
(CPU_FEATURES_ARCH_P): Likewise.
(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
index_* and bit_* macros are used to access cpuid and feature arrays o
struct cpu_features. It is very easy to use bits and indices of cpuid
array on feature array, especially in assembly codes. For example,
sysdeps/i386/i686/multiarch/bcopy.S has
HAS_CPU_FEATURE (Fast_Rep_String)
which should be
HAS_ARCH_FEATURE (Fast_Rep_String)
We change index_* and bit_* to index_cpu_*/index_arch_* and
bit_cpu_*/bit_arch_* so that we can catch such error at build time.
[BZ #19762]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
(bit_arch_*): This for feature array.
(bit_*): Renamed to ...
(bit_cpu_*): This for cpu array.
(index_*): Renamed to ...
(index_arch_*): This for feature array.
(index_*): Renamed to ...
(index_cpu_*): This for cpu array.
[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
[__ASSEMBLER__] (HAS_CPU_FEATURE)): Pass cpu to HAS_FEATURE.
[__ASSEMBLER__] (HAS_ARCH_FEATURE)): Pass arch to HAS_FEATURE.
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
bit_##name with index_cpu_##name and bit_cpu_##name.
[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
bit_##name with index_arch_##name and bit_arch_##name.
Since x86 has an optimized mempcpy and GCC can inline mempcpy on x86,
define _HAVE_STRING_ARCH_mempcpy to 1 for x86.
[BZ #19759]
* sysdeps/x86/bits/string.h (_HAVE_STRING_ARCH_mempcpy): New.
As discussed in
https://sourceware.org/ml/libc-alpha/2015-10/msg00403.html
the setting of _STRING_ARCH_unaligned currently controls the external
GLIBC ABI as well as selecting the use of unaligned accesses withing
GLIBC.
Since _STRING_ARCH_unaligned was recently changed for AArch64, this
would potentially break the ABI in GLIBC 2.23, so split the uses and add
_STRING_INLINE_unaligned to select the string ABI. This setting must be
fixed for each target, while _STRING_ARCH_unaligned may be changed from
release to release. _STRING_ARCH_unaligned is used unconditionally in
glibc. But <bits/string.h>, which defines _STRING_ARCH_unaligned, isn't
included with -Os. Since _STRING_ARCH_unaligned is internal to glibc and
may change between glibc releases, it should be made private to glibc.
_STRING_ARCH_unaligned should defined in the new string_private.h heade
file which is included unconditionally from internal <string.h> for glibc
build.
[BZ #19462]
* bits/string.h (_STRING_ARCH_unaligned): Renamed to ...
(_STRING_INLINE_unaligned): This.
* include/string.h: Include <string_private.h>.
* string/bits/string2.h: Replace _STRING_ARCH_unaligned with
_STRING_INLINE_unaligned.
* sysdeps/aarch64/bits/string.h (_STRING_ARCH_unaligned): Removed.
(_STRING_INLINE_unaligned): New.
* sysdeps/aarch64/string_private.h: New file.
* sysdeps/generic/string_private.h: Likewise.
* sysdeps/m68k/m680x0/m68020/string_private.h: Likewise.
* sysdeps/s390/string_private.h: Likewise.
* sysdeps/x86/string_private.h: Likewise.
* sysdeps/m68k/m680x0/m68020/bits/string.h
(_STRING_ARCH_unaligned): Renamed to ...
(_STRING_INLINE_unaligned): This.
* sysdeps/s390/bits/string.h (_STRING_ARCH_unaligned): Renamed
to ...
(_STRING_INLINE_unaligned): This.
* sysdeps/sparc/bits/string.h (_STRING_ARCH_unaligned): Renamed
to ...
(_STRING_INLINE_unaligned): This.
* sysdeps/x86/bits/string.h (_STRING_ARCH_unaligned): Renamed
to ...
(_STRING_INLINE_unaligned): This.
GLIBC benchtest testcases shows SSE2_Unaligned based implementations
are performing faster compare to SSE2 based implementations for
routines: strcmp, strcat, strncat, stpcpy, stpncpy, strcpy, strncpy
and strstr. Flag index_Fast_Unaligned_Load is set for Excavator family
0x15h CPU's. This makes SSE2_Unaligned based implementations as
default for these routines.
[BZ #19467]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set
index_Fast_Unaligned_Load flag for Excavator family CPUs.
It shows improvement up to 28% over AVX2 memset (performance results
attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
* sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
* sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
index_Prefer_No_VZEROUPPER): New.
* sysdeps/x86/cpu-features.c (init_cpu_features): Set the
Prefer_No_VZEROUPPER for Knights Landing.
According to Silvermont software optimization guide, for 64-bit
applications, branch prediction performance can be negatively impacted
when the target of a branch is more than 4GB away from the branch. Add
the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
pages with MAP_32BIT first. NB: MAP_32BIT will map to lower 2GB, not
lower 4GB, address. Prefer_MAP_32BIT_EXEC reduces bits available for
address space layout randomization (ASLR), which is always disabled for
SUID programs and can only be enabled by setting environment variable,
LD_PREFER_MAP_32BIT_EXEC.
On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
[BZ #19367]
* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
(index_Prefer_MAP_32BIT_EXEC): Likewise.
Knights Landing processor is based on Silvermont. This patch enables
Silvermont optimizations for Knights Landing.
* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
Silvermont optimizations for Knights Landing.
This patch allows to use x86_64 vector math functions with GCC 6.*
without OpenMP SIMD constructs. For additional details please visit
<https://sourceware.org/glibc/wiki/libmvec#Example_2>.
* sysdeps/x86/fpu/bits/math-vector.h: W/o -fopenmp declare vector math
functions with GCC 6.* __attribute__ ((__simd__)).
AMD CPUs uses the similar encoding scheme for extended family and model
as Intel CPUs as shown in:
http://support.amd.com/TechDocs/25481.pdf
This patch updates get_common_indeces to get family and model for both
Intel and AMD CPUs when family == 0x0f.
[BZ #19214]
* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
argument to return extended model. Update family and model
with extended family and model when family == 0x0f.
(init_cpu_features): Updated.
Old workaround based on assembly aliases can lead to link fail (bug 19058).
This patch makes workaround in another way to avoid it.
[BZ #19058]
* math/Makefile ($(inst_libdir)/libm.so): Added libmvec_nonshared.a
to AS_NEEDED.
* sysdeps/x86/fpu/bits/math-vector.h: Removed code with old workaround.
* sysdeps/x86_64/fpu/Makefile (libmvec-support,
libmvec-static-only-routines): Added new file.
* sysdeps/x86_64/fpu/svml_finite_alias.S: New file.
This test applies to i386 and x86_64 which set R_386_GLOB_DAT and
R_X86_64_GLOB_DAT to ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA.
[BZ #19178]
* sysdeps/x86/Makefile (tests): Add tst-prelink.
(tst-prelink-ENV): New.
($(objpfx)tst-prelink-conflict.out): Likewise.
($(objpfx)tst-prelink-cmp.out): Likewise.
(tests-special): Add $(objpfx)tst-prelink-cmp.out.
* sysdeps/x86/tst-prelink.c: New file.
* sysdeps/x86/tst-prelink.exp: Likewise.
fenv_t should include architecture-specific floating-point modes and
status flags. i386 and x86_64 fesetenv limit which bits they use from
the x87 status and control words, when using saved state, and limit
which parts of the state they set to fixed values, when using
FE_DFL_ENV / FE_NOMASK_ENV. The following should be included but are
excluded in at least some cases: status and masking for the "denormal
operand" exception (which isn't part of FE_ALL_EXCEPT); precision
control (explicitly mentioned in Annex F as something that counts as
part of the floating-point environment); MXCSR FZ and DAZ bits (for
FE_DFL_ENV and FE_NOMASK_ENV). This patch arranges for this extra
state to be handled by fesetenv (and thereby by feupdateenv, which
calls fesetenv).
(Note that glibc functions using floating point are not generally
expected to work correctly with non-default values of this state,
especially precision control, but it is still logically part of the
floating-point environment and should be handled as such by fesetenv.
Changes to the state relating to subnormals ought generally to work
with libm functions when the arguments aren't subnormal and neither
are the expected results; that's a consequence of functions avoiding
spurious internal underflows.)
A question arising from this is whether FE_NOMASK_ENV should or should
not mask the "denormal operand" exception. I decided it should mask
that exception. This is the status quo - previously that exception
could only be unmasked by direct manipulation of control registers
(possibly via <fpu_control.h>). In addition, it means that use of
FE_NOMASK_ENV leaves a floating-point environment the same as could be
obtained by fesetenv (FE_DFL_ENV); feenableexcept (FE_ALL_EXCEPT);,
rather than an environment in which an exception is unmasked that
could only be masked again by using fesetenv with FE_DFL_ENV (or a
previously saved environment) - this exception not being usable with
other <fenv.h> functions because it's outside FE_ALL_EXCEPT.
Tested for x86_64 and x86.
[BZ #16068]
* sysdeps/i386/fpu/fesetenv.c: Include <fpu_control.h>.
(FE_ALL_EXCEPT_X86): New macro.
(__fesetenv): Use FE_ALL_EXCEPT_X86 in most places instead of
FE_ALL_EXCEPT. Ensure precision control is included in
floating-point state. Ensure that FE_DFL_ENV and FE_NOMASK_ENV
handle "denormal operand exception" and clear FZ and DAZ bits.
* sysdeps/x86_64/fpu/fesetenv.c: Include <fpu_control.h>.
(FE_ALL_EXCEPT_X86): New macro.
(__fesetenv): Use FE_ALL_EXCEPT_X86 in most places instead of
FE_ALL_EXCEPT. Ensure precision control is included in
floating-point state. Ensure that FE_DFL_ENV and FE_NOMASK_ENV
handle "denormal operand exception" and clear FZ and DAZ bits.
* sysdeps/x86/fpu/test-fenv-sse-2.c: New file.
* sysdeps/x86/fpu/test-fenv-x87.c: Likewise.
* sysdeps/x86/fpu/Makefile [$(subdir) = math] (tests): Add
test-fenv-x87 and test-fenv-sse-2.
[$(subdir) = math] (CFLAGS-test-fenv-sse-2.c): New variable.
The i386 and x86_64 versions of fesetenv, when called with FE_DFL_ENV
or FE_NOMASK_ENV as argument, do not clear SSE exceptions raised in
MXCSR. These arguments should, like other fenv_t values, represent
the whole of the floating-point state, so such exceptions should be
cleared; this patch adds the required clearing. (Discovered while
working on bug 16068.)
Tested for x86_64 and x86.
[BZ #19181]
* sysdeps/i386/fpu/fesetenv.c (__fesetenv): Clear already-raised
SSE exceptions when argument is FE_DFL_ENV or FE_NOMASK_ENV.
* sysdeps/x86_64/fpu/fesetenv.c (__fesetenv): Likewise.
* math/test-fenv-clear-main.c: New file.
* math/test-fenv-clear.c: Likewise.
* math/Makefile (tests): Add test-fenv-clear.
* sysdeps/x86/fpu/test-fenv-clear-sse.c: New file.
* sysdeps/x86/fpu/Makefile [$(subdir) = math] (tests): Add
test-fenv-clear-sse.
[$(subdir) = math] (CFLAGS-test-fenv-clear-sse.c): New variable.