Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM
abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
Update ifunc-memmove.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to
select the function optimized with 256-bit EVEX instructions using
YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW
and BMI2 since VZEROUPPER isn't needed at function exit.
For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP
is set.
Finally remove all mpa related files, headers, declarations, probes, unused
tables and update makefiles.
Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
This patch series removes all remaining slow paths and related code.
First asin/acos, tan, atan, atan2 implementations are updated, and the final
patch removes the unused mpa files, headers and probes. Passes buildmanyglibc.
Remove slow paths from asin/acos. Add ULP annotations based on previous slow
path checks (which are approximate). Update AArch64 and x86_64 libm-test-ulps.
Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
It turns out the startup code in csu/elf-init.c has a perfect pair of
ROP gadgets (see Marco-Gisbert and Ripoll-Ripoll, "return-to-csu: A
New Method to Bypass 64-bit Linux ASLR"). These functions are not
needed in dynamically-linked binaries because DT_INIT/DT_INIT_ARRAY
are already processed by the dynamic linker. However, the dynamic
linker skipped the main program for some reason. For maximum
backwards compatibility, this is not changed, and instead, the main
map is consulted from __libc_start_main if the init function argument
is a NULL pointer.
For statically linked binaries, the old approach based on linker
symbols is still used because there is nothing else available.
A new symbol version __libc_start_main@@GLIBC_2.34 is introduced because
new binaries running on an old libc would not run their ELF
constructors, leading to difficult-to-debug issues.
No bug. Just seemed the performance could be improved a bit. Observed
and expected behavior are unchanged. Optimized body of main
loop. Updated page cross logic and optimized accordingly. Made a few
minor instruction selection modifications. No regressions in test
suite. Both test-strchrnul and test-strchr passed.
In the process of optimizing memcpy for AMD machines, we have found the
vector move operations are outperforming enhanced REP MOVSB for data
transfers above the L2 cache size on Zen3 architectures.
To handle this use case, we are adding an upper bound parameter on
enhanced REP MOVSB:'__x86_rep_movsb_stop_threshold'.
As per large-bench results, we are configuring this parameter to the
L2 cache size for AMD machines and applicable from Zen3 architecture
supporting the ERMS feature.
For architectures other than AMD, it is the computed value of
non-temporal threshold parameter.
Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
Add SUPPORT_STATIC_PIE that targets can define if they support
static PIE. This requires PI_STATIC_AND_HIDDEN support and various
linker features as described in
commit 9d7a3741c9
Add --enable-static-pie configure option to build static PIE [BZ #19574]
Currently defined on x86_64, i386 and aarch64 where static PIE is
known to work.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
In <sys/platform/x86.h>, define CPU features as enum instead of using
the C preprocessor magic to make it easier to wrap this functionality
in other languages. Move the C preprocessor magic to internal header
for better GCC codegen when more than one features are checked in a
single expression as in x86-64 dl-hwcaps-subdirs.c.
1. Rename COMMON_CPUID_INDEX_XXX to CPUID_INDEX_XXX.
2. Move CPUID_INDEX_MAX to sysdeps/x86/include/cpu-features.h.
3. Remove struct cpu_features and __x86_get_cpu_features from
<sys/platform/x86.h>.
4. Add __x86_get_cpuid_feature_leaf to <sys/platform/x86.h> and put it
in libc.
5. Make __get_cpu_features() private to glibc.
6. Replace __x86_get_cpu_features(N) with __get_cpu_features().
7. Add _dl_x86_get_cpu_features to GLIBC_PRIVATE.
8. Use a single enum index for each CPU feature detection.
9. Pass the CPUID feature leaf to __x86_get_cpuid_feature_leaf.
10. Return zero struct cpuid_feature for the older glibc binary with a
smaller CPUID_INDEX_MAX [BZ #27104].
11. Inside glibc, use the C preprocessor magic so that cpu_features data
can be loaded just once leading to more compact code for glibc.
256 bits are used for each CPUID leaf. Some leaves only contain a few
features. We can add exceptions to such leaves. But it will increase
code sizes and it is harder to provide backward/forward compatibilities
when new features are added to such leaves in the future.
When new leaves are added, _rtld_global_ro offsets will change which
leads to race condition during in-place updates. We may avoid in-place
updates by
1. Rename the old glibc.
2. Install the new glibc.
3. Remove the old glibc.
NB: A function, __x86_get_cpuid_feature_leaf , is used to avoid the copy
relocation issue with IFUNC resolver as shown in IFUNC resolver tests.
Add extra-test-objs to test-extras so that they are compiled with
-DMODULE_NAME=testsuite instead of -DMODULE_NAME=libc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
GCC 11 supports -march=x86-64-v[234] to enable x86 micro-architecture ISA
levels:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97250
and -mneeded to emit GNU_PROPERTY_X86_ISA_1_NEEDED property with
GNU_PROPERTY_X86_ISA_1_V[234] marker:
https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/13
Binutils support for GNU_PROPERTY_X86_ISA_1_V[234] marker were added by
commit b0ab06937385e0ae25cebf1991787d64f439bf12
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Oct 30 06:49:57 2020 -0700
x86: Support GNU_PROPERTY_X86_ISA_1_BASELINE marker
and
commit 32930e4edbc06bc6f10c435dbcc63131715df678
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Oct 9 05:05:57 2020 -0700
x86: Support GNU_PROPERTY_X86_ISA_1_V[234] marker
GNU_PROPERTY_X86_ISA_1_NEEDED property in x86 ELF binaries indicate the
micro-architecture ISA level required to execute the binary. The marker
must be added by programmers explicitly in one of 3 ways:
1. Pass -mneeded to GCC.
2. Add the marker in the linker inputs as this patch does.
3. Pass -z x86-64-v[234] to the linker.
Add GNU_PROPERTY_X86_ISA_1_BASELINE and GNU_PROPERTY_X86_ISA_1_V[234]
marker support to ld.so if binutils 2.32 or newer is used to build glibc:
1. Add GNU_PROPERTY_X86_ISA_1_BASELINE and GNU_PROPERTY_X86_ISA_1_V[234]
markers to elf.h.
2. Add GNU_PROPERTY_X86_ISA_1_BASELINE and GNU_PROPERTY_X86_ISA_1_V[234]
marker to abi-note.o based on the ISA level used to compile abi-note.o,
assuming that the same ISA level is used to compile the whole glibc.
3. Add isa_1 to cpu_features to record the supported x86 ISA level.
4. Rename _dl_process_cet_property_note to _dl_process_property_note and
add GNU_PROPERTY_X86_ISA_1_V[234] marker detection.
5. Update _rtld_main_check and _dl_open_check to check loaded objects
with the incompatible ISA level.
6. Add a testcase to verify that dlopen an x86-64-v4 shared object fails
on lesser platforms.
7. Use <get-isa-level.h> in dl-hwcaps-subdirs.c and tst-glibc-hwcaps.c.
Tested under i686, x32 and x86-64 modes on x86-64-v2, x86-64-v3 and
x86-64-v4 machines.
Marked elf/tst-isa-level-1 with x86-64-v4, ran it on x86-64-v3 machine
and got:
[hjl@gnu-cfl-2 build-x86_64-linux]$ ./elf/tst-isa-level-1
./elf/tst-isa-level-1: CPU ISA level is lower than required
[hjl@gnu-cfl-2 build-x86_64-linux]$
Remove the wordsize-64 implementations by merging them into the main dbl-64
directory. The second patch just moves all wordsize-64 files and removes a
few wordsize-64 uses in comments and Implies files.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Calling an IFUNC function defined in unrelocated executable also leads to
segfault. Issue a fatal error message when calling IFUNC function defined
in the unrelocated executable from a shared library.
When copying with "rep movsb", if the distance between source and
destination is N*4GB + [1..63] with N >= 0, performance may be very
slow. This patch updates memmove-vec-unaligned-erms.S for AVX and
AVX512 versions with the distance in RCX:
cmpl $63, %ecx
// Don't use "rep movsb" if ECX <= 63
jbe L(Don't use rep movsb")
Use "rep movsb"
Benchtests data with bench-memcpy, bench-memcpy-large, bench-memcpy-random
and bench-memcpy-walk on Skylake, Ice Lake and Tiger Lake show that its
performance impact is within noise range as "rep movsb" is only used for
data size >= 4KB.
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
This syncs up isnanl behaviour with gcc. Also move the isnanl
implementation to sysdeps/x86 and remove the sysdeps/x86_64 version.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Also move sysdeps/i386/fpu/s_fpclassifyl.c to
sysdeps/x86/fpu/s_fpclassifyl.c and remove
sysdeps/x86_64/fpu/s_fpclassifyl.c
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The previous definition of THREAD_SELF did not tell the compiler
that %fs (or %gs) usage is invalid for the !DL_LOOKUP_GSCOPE_LOCK
case in _dl_lookup_symbol_x. As a result, ld.so could try to use the
TCB before it was initialized.
As the comment in tls.h explains, asm volatile is undesirable here.
Using the __seg_fs (or __seg_gs) namespace does not interfere with
optimization, and expresses that THREAD_SELF is potentially trapping.
Now __thread_gscope_wait (the function behind THREAD_GSCOPE_WAIT,
formerly __wait_lookup_done) can be implemented directly in ld.so,
eliminating the unprotected GL (dl_wait_lookup_done) function
pointer.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The tls.h inclusion is not really required and limits possible
definition on more arch specific headers.
This is a cleanup to allow inline functions on sysdep.h, more
specifically on i386 and ia64 which requires to access some tls
definitions its own.
No semantic changes expected, checked with a build against all
affected ABIs.
The UP macro is never defined. Also define LOCK_PREFIX
unconditionally, to the same string.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
We need NO_RTLD_HIDDEN because of the need for PLT calls in ld.so.
See Roland's comment in
https://sourceware.org/bugzilla/show_bug.cgi?id=15605
"in the Hurd it's crucial that calls like __mmap be the libc ones
instead of the rtld-local ones after the bootstrap phase, when the
dynamic linker is being used for dlopen and the like."
We used to just avoid all hidden use in the rtld ; this commit switches to
keeping only those that should use PLT calls, i.e. essentially those defined in
sysdeps/mach/hurd/dl-sysdep.c:
__assert_fail
__assert_perror_fail
__*stat64
_exit
This fixes a few startup issues, notably the call to __tunable_get_val that is
made before PLTs are set up.
X86 CPU features in ld.so are initialized by init_cpu_features, which is
invoked by DL_PLATFORM_INIT from _dl_sysdep_start. But when ld.so is
loaded by static executable, DL_PLATFORM_INIT is never called. Also
x86 cache info in libc.o and libc.a is initialized by a constructor
which may be called too late. Since some fields in _rtld_global_ro
in ld.so are initialized by dynamic relocation, we can also initialize
x86 CPU features in _rtld_global_ro in ld.so and cache info in libc.so
by initializing dummy function pointers in ld.so and libc.so via IFUNC
relocation.
Key points:
1. IFUNC is always supported, independent of --enable-multi-arch or
--disable-multi-arch. Linker generates IFUNC relocations from input
IFUNC objects and ld.so performs IFUNC relocations.
2. There are no IFUNC dependencies in ld.so before dynamic relocation
have been performed,
3. The x86 CPU features in ld.so is initialized by DL_PLATFORM_INIT
in dynamic executable and by IFUNC relocation in dlopen in static
executable.
4. The x86 cache info in libc.o is initialized by IFUNC relocation.
5. In libc.a, both x86 CPU features and cache info are initialized from
ARCH_INIT_CPU_FEATURES, not by IFUNC relocation, before __libc_early_init
is called.
Note: _dl_x86_init_cpu_features can be called more than once from
DL_PLATFORM_INIT and during relocation in ld.so.
There are several compiler implementations that allow large stack
allocations to jump over the guard page at the end of the stack and
corrupt memory beyond that. See CVE-2017-1000364.
Compilers can emit code to probe the stack such that the guard page
cannot be skipped, but on aarch64 the probe interval is 64K by default
instead of the minimum supported page size (4K).
This patch enforces at least 64K guard on aarch64 unless the guard
is disabled by setting its size to 0. For backward compatibility
reasons the increased guard is not reported, so it is only observable
by exhausting the address space or parsing /proc/self/maps on linux.
On other targets the patch has no effect. If the stack probe interval
is larger than a page size on a target then ARCH_MIN_GUARD_SIZE can
be defined to get large enough stack guard on libc allocated stacks.
The patch does not affect threads with user allocated stacks.
Fixes bug 26691.
Install <sys/platform/x86.h> so that programmers can do
#if __has_include(<sys/platform/x86.h>)
#include <sys/platform/x86.h>
#endif
...
if (CPU_FEATURE_USABLE (SSE2))
...
if (CPU_FEATURE_USABLE (AVX2))
...
<sys/platform/x86.h> exports only:
enum
{
COMMON_CPUID_INDEX_1 = 0,
COMMON_CPUID_INDEX_7,
COMMON_CPUID_INDEX_80000001,
COMMON_CPUID_INDEX_D_ECX_1,
COMMON_CPUID_INDEX_80000007,
COMMON_CPUID_INDEX_80000008,
COMMON_CPUID_INDEX_7_ECX_1,
/* Keep the following line at the end. */
COMMON_CPUID_INDEX_MAX
};
struct cpuid_features
{
struct cpuid_registers cpuid;
struct cpuid_registers usable;
};
struct cpu_features
{
struct cpu_features_basic basic;
struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
};
/* Get a pointer to the CPU features structure. */
extern const struct cpu_features *__x86_get_cpu_features
(unsigned int max) __attribute__ ((const));
Since all feature checks are done through macros, programs compiled with
a newer <sys/platform/x86.h> are compatible with the older glibc binaries
as long as the layout of struct cpu_features is identical. The features
array can be expanded with backward binary compatibility for both .o and
.so files. When COMMON_CPUID_INDEX_MAX is increased to support new
processor features, __x86_get_cpu_features in the older glibc binaries
returns NULL and HAS_CPU_FEATURE/CPU_FEATURE_USABLE return false on the
new processor feature. No new symbol version is neeeded.
Both CPU_FEATURE_USABLE and HAS_CPU_FEATURE are provided. HAS_CPU_FEATURE
can be used to identify processor features.
Note: Although GCC has __builtin_cpu_supports, it only supports a subset
of <sys/platform/x86.h> and it is equivalent to CPU_FEATURE_USABLE. It
doesn't support HAS_CPU_FEATURE.
A typo in commit 107e6a3c22 causes the
FMA4 code path to be taken on systems that support FMA, even if they do
not support FMA4. Fix this to detect FMA4.
Support usable check for all CPU features with the following changes:
1. Change struct cpu_features to
struct cpuid_features
{
struct cpuid_registers cpuid;
struct cpuid_registers usable;
};
struct cpu_features
{
struct cpu_features_basic basic;
struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
unsigned int preferred[PREFERRED_FEATURE_INDEX_MAX];
...
};
so that there is a usable bit for each cpuid bit.
2. After the cpuid bits have been initialized, copy the known bits to the
usable bits. EAX/EBX from INDEX_1 and EAX from INDEX_7 aren't used for
CPU feature detection.
3. Clear the usable bits which require OS support.
4. If the feature is supported by OS, copy its cpuid bit to its usable
bit.
5. Replace HAS_CPU_FEATURE and CPU_FEATURES_CPU_P with CPU_FEATURE_USABLE
and CPU_FEATURE_USABLE_P to check if a feature is usable.
6. Add DEPR_FPU_CS_DS for INDEX_7_EBX_13.
7. Unset MPX feature since it has been deprecated.
The results are
1. If the feature is known and doesn't requre OS support, its usable bit
is copied from the cpuid bit.
2. Otherwise, its usable bit is copied from the cpuid bit only if the
feature is known to supported by OS.
3. CPU_FEATURE_USABLE/CPU_FEATURE_USABLE_P are used to check if the
feature can be used.
4. HAS_CPU_FEATURE/CPU_FEATURE_CPU_P are used to check if CPU supports
the feature.
Since
commit c867597bff
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Jun 8 13:57:50 2016 -0700
X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
removed the only usage of __x86_prefetchw, we can remove the unused
__x86_prefetchw.
Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.
Note that the user specified threshold for "rep movsb" smaller than
the minimum threshold will be ignored.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
strcmp-avx2.S: In avx2 strncmp function, strings are compared in
chunks of 4 vector size(i.e. 32x4=128 byte for avx2). After first 4
vector size comparison, code must check whether it already passed
the given offset. This patch implement avx2 offset check condition
for strncmp function, if both string compare same for first 4 vector
size.
The existing macros are fragile and expect local variables with a
certain name. Fix this by defining them as functions with default
implementation in a new header dl-runtime.h which arches can override
if need be.
This came up during ARC port review, hence the need for argument pltgot
in reloc_index() which is not needed by existing ports.
This patch potentially only affects hppa/x86 ports,
build tested for both those configs and a few more.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The getcpu cache was removed from the kernel in Linux 2.6.24. glibc
support from the sched_getcpu implementation was removed in commit
dd26c44403 ("Consolidate sched_getcpu").
Since __x86_shared_non_temporal_threshold is defined as
long int __x86_shared_non_temporal_threshold;
and long int is 4 bytes for x32, use RDX_LP to compare against
__x86_shared_non_temporal_threshold in assembly code.
This patch removes the IEEE_DOUBLE_BIG_ENDIAN and
IEEE_DOUBLE_MIXED_ENDIAN macros from gmp-impl.h and gmp-mparam.h, and
the ieee_double_extract union from gmp-impl.h. The macros were used
only in defining the union, which was used nowhere in glibc. As GMP's
gmp-impl.h is over 5000 lines, the file in glibc is so far from the
GMP version that it doesn't seem to make sense to keep things there
that are not relevant in glibc. (I expect there is plenty more in the
header after this patch that is also not relevant in glibc and can be
cleaned up later.)
Tested with build-many-glibcs.py that installed stripped shared
libraries are unchanged by this patch.
The __sfp_handle_exceptions is not fully correct regarding raising
exceptions, since there is no direct way to raise only FP_EX_OVERFLOW
nor FP_EX_UNDERFLOW for SSE mode. Both libgcc and feraiseexcept rely
on x87 mode to accomplish it.
This reverts commit 460ee50de0.
Checked on x86_64.
The exported x86_64 fenv.h functions operate on both i387 and SSE (since
they should work on both float, double, and long double) while the
internal libc_fe* set either SSE (float, double, and float128) or
i387 (long double).
The libgcc __sfp_handle_exceptions (used on float128 implementation),
however, will set either SEE or i387 exception depending of the
exception to raise. This broke the internal assumption of float128
where only SSE operations will be used.
This patch reimplements the libgcc __sfp_handle_exceptions to use only
SSE operations and sets libgcc to use it instead of its own
implementation.
And I think we should fix libgcc in a similar manner, since checking on
config/i386/64/sfp-machine.h it already only supports SSE rounding mode
and x86_64 ABI also expectes float128 to use SSE registers [1]
(although it is not clear on how future implementation might implement
it).
Checked on x86_64-linux-gnu.
[1] https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI
All cancellable syscalls are done by C implementations, so there is no
no need to use a specialized implementation to optimize register usage.
It fixes BZ #25765.
Checked on x86_64-linux-gnu.
The corner cases included were generated using exhaustive search
for all float/binary32 values on x86_64 (comparing to MPFR for
correct rounding to nearest).
For the j0/j1/y0 functions, only cases with ulp error <= 9 were
included.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
With mathinline removal there is no need to keep building and testing
inline math tests.
The gen-libm-tests.py support to generate ULP_I_* is removed and all
libm-test-ulps files are updated to longer have the
i{float,double,ldouble} entries. The support for no-test-inline is
also removed from both gen-auto-libm-tests and the
auto-libm-test-out-* were regenerated.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Exporting functions and relying on symbol interposition from libc.so
makes the choice of implementation dependent on DT_NEEDED order, which
is not what some compiler drivers expect.
This commit replaces one magic mechanism (symbol interposition) with
another one (preprocessor-/compiler-based redirection). This makes
the hand-over from the minimal malloc to the full malloc more
explicit.
Removing the ABI symbols is backwards-compatible because libc.so is
always in scope, and the dynamic loader will find the malloc-related
symbols there since commit f0b2132b35
("ld.so: Support moving versioned symbols between sonames
[BZ #24741]").
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
With all Linux ABIs using the expected Linux kABI to indicate
syscalls errors, there is no need to replicate the INLINE_SYSCALL.
The generic Linux sysdep.h includes errno.h even for !__ASSEMBLER__,
which is ok now and it allows cleanup some archaic code that assume
otherwise.
Checked with a build against all affected ABIs.
This patch adds a new macro, libm_alias_finite, to define all _finite
symbol. It sets all _finite symbol as compat symbol based on its first
version (obtained from the definition at built generated first-versions.h).
The <fn>f128_finite symbols were introduced in GLIBC 2.26 and so need
special treatment in code that is shared between long double and float128.
It is done by adding a list, similar to internal symbol redifinition,
on sysdeps/ieee754/float128/float128_private.h.
Alpha also needs some tricky changes to ensure we still emit 2 compat
symbols for sqrt(f).
Passes buildmanyglibc.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
This patch replaces s_trunc.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 and sparc64 files.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch replaces s_ceil.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 and sparc64 files.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch replaces s_floor.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 and sparc64 files.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch replaces s_rint.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 file.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch replaces s_nearbyint.c in sysdeps/dbl-64 with the one in
sysdeps/dbl-64/wordsize-64 and removes the latter one.
The code is not changed except changes in code style.
Also adjusted the include path in x86_64 file.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This commit adds missing skip_ifunc checks to aarch64, arm, i386,
sparc, and x86_64. A new test case ensures that IRELATIVE IFUNC
resolvers do not run in various diagnostic modes of the dynamic
loader.
Reviewed-By: Szabolcs Nagy <szabolcs.nagy@arm.com>
This patch new build tests to check for internal fields offsets for
internal pthread_rwlock_t definition. Althoug the '__data.__flags'
field layout should be preserved due static initializators, the patch
also adds tests for the futexes that may be used in a shared memory
(although using different libc version in such scenario is not really
supported).
Checked with a build against all affected ABIs.
Change-Id: Iccc103d557de13d17e4a3f59a0cad2f4a640c148
The offsets of pthread_mutex_t __data.__nusers, __data.__spins,
__data.elision, __data.list are not required to be constant over
the releases. Only the __data.__kind is used for static
initializers.
This patch also adds an additional size check for __data.__kind.
Checked with a build against affected ABIs.
Change-Id: I7a4e48cc91b4c4ada57e9a5d1b151fb702bfaa9f
Remove _finite tests and references from x86_64. Rather than calling
__exp_finite, use exp directly (since it's the same entry point).
x86_64 builds and passes testsuite.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Use the generic C memset/memcpy/memmove in benchtests since comparing
against a slow byte-oriented implementation makes no sense.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2019-08-29 Wilco Dijkstra <wdijkstr@arm.com>
* benchtests/bench-memcpy.c (simple_memcpy): Remove.
(generic_memcpy): Include generic C memcpy.
* benchtests/bench-memmove.c (simple_memmove): Remove.
(generic_memmove): Include generic C memmove.
* benchtests/bench-memset.c (simple_memset): Remove.
(generic_memset): Include generic C memset.
* benchtests/bench-memset-large.c (simple_memset): Remove.
(generic_memset): Include generic C memset.
* benchtests/bench-memset-walk.c (simple_memset): Remove.
(generic_memset): Include generic C memset.
* string/memcpy.c (MEMCPY): Add defines to enable redirection.
* string/memset.c (MEMSET): Likewise.
* sysdeps/x86_64/memcopy.h: Remove empty file.
When compiled with -O3 and AVX, GCC 8 and 9 optimize some loops in
sysdeps/ieee754/dbl-64/branred.c with 256-bit vector instructions,
which leads to store forward stall:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
There is no easy fix in compiler. This patch limits vector width to
128 bits to work around this issue. It improves performance of sin
and cos by more than 40% on Skylake compiled with -O3 -march=skylake.
Tested with GCC 7/8/9 on x86-64.
[BZ #24603]
* sysdeps/x86_64/configure.ac: Check if -mprefer-vector-width=128
works.
* sysdeps/x86_64/configure: Regenerated.
* sysdeps/x86_64/fpu/Makefile (CFLAGS-branred.c): New. Set
to -mprefer-vector-width=128 if supported.
Since sysdeps/i386/dl-lookupcfg.h and sysdeps/x86_64/dl-lookupcfg.h are
identical, we can replace them with sysdeps/x86/dl-lookupcfg.h.
* sysdeps/i386/dl-lookupcfg.h: Moved to ...
* sysdeps/x86/dl-lookupcfg.h: Here.
* sysdeps/x86_64/dl-lookupcfg.h: Removed.
This patch rewrites wcscat using wcslen and wcscpy. This is similar to
the optimization done on strcat by 6e46de42fe.
The strcpy changes are mainly to add the internal alias to avoid PLT
calls.
Checked on x86_64-linux-gnu and a build against the affected
architectures.
* include/wchar.h (__wcscpy): New prototype.
* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-ppc32.c
(__wcscpy): Route internal symbol to generic implementation.
* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy.c (wcscpy):
Add internal __wcscpy alias.
* sysdeps/powerpc/powerpc64/multiarch/wcscpy.c (wcscpy): Likewise.
* sysdeps/s390/wcscpy.c (wcscpy): Likewise.
* sysdeps/x86_64/multiarch/wcscpy.c (wcscpy): Likewise.
* wcsmbs/wcscpy.c (wcscpy): Add
* sysdeps/x86_64/multiarch/wcscpy-c.c (WCSCPY): Adjust macro to
use generic implementation.
* wcsmbs/wcscat.c (wcscat): Rewrite using wcslen and wcscpy.
This patch adds fall-through comments in some cases where -Wextra
produces implicit-fallthrough warnings.
The patch is non-exhaustive. Apart from architecture-specific code
for non-x86_64 architectures, it does not change sunrpc/xdr.c (legacy
code, probably should have such changes, but left to be dealt with
separately), or places that already had comments about the
fall-through but not matching the form expected by
-Wimplicit-fallthrough=3 (the default level with -Wextra; my
inclination is to adjust those comments to match rather than
downgrading to -Wimplicit-fallthrough=1 to allow any comment), or one
place where I thought the implicit fallthrough was not correct and so
should be handled separately as a bug fix. I think the key thing to
consider in review of this patch is whether the fall-through is indeed
intended and correct in each place where such a comment is added.
Tested for x86_64.
* elf/dl-exception.c (_dl_exception_create_format): Add
fall-through comments.
* elf/ldconfig.c (parse_conf_include): Likewise.
* elf/rtld.c (print_statistics): Likewise.
* locale/programs/charmap.c (parse_charmap): Likewise.
* misc/mntent_r.c (__getmntent_r): Likewise.
* posix/wordexp.c (parse_arith): Likewise.
(parse_backtick): Likewise.
* resolv/ns_ttl.c (ns_parse_ttl): Likewise.
* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86_64/dl-machine.h (elf_machine_rela): Likewise.
This fixes the same bug in fnmatch that was fixed by commit 7e2f0d2d77 for
regexp matching. As a side effect it also removes the use of an unbound
VLA.
Since the size argument is unsigned. we should use unsigned Jcc
instructions, instead of signed, to check size.
Tested on x86-64 and x32, with and without --disable-multi-arch.
[BZ #24155]
CVE-2019-7309
* NEWS: Updated for CVE-2019-7309.
* sysdeps/x86_64/memcmp.S: Use RDX_LP for size. Clear the
upper 32 bits of RDX register for x32. Use unsigned Jcc
instructions, instead of signed.
* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-memcmp-2.
* sysdeps/x86_64/x32/tst-size_t-memcmp-2.c: New test.
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes strnlen/wcsnlen for x32. Tested on x86-64 and x32. On
x86-64, libc.so is the same with and withou the fix.
[BZ# 24097]
CVE-2019-6488
* sysdeps/x86_64/multiarch/strlen-avx2.S: Use RSI_LP for length.
Clear the upper 32 bits of RSI register.
* sysdeps/x86_64/strlen.S: Use RSI_LP for length.
* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-strnlen
and tst-size_t-wcsnlen.
* sysdeps/x86_64/x32/tst-size_t-strnlen.c: New file.
* sysdeps/x86_64/x32/tst-size_t-wcsnlen.c: Likewise.
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes strncpy for x32. Tested on x86-64 and x32. On x86-64,
libc.so is the same with and withou the fix.
[BZ# 24097]
CVE-2019-6488
* sysdeps/x86_64/multiarch/strcpy-avx2.S: Use RDX_LP for length.
* sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/strcpy-ssse3.S: Likewise.
* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-strncpy.
* sysdeps/x86_64/x32/tst-size_t-strncpy.c: New file.
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes the strncmp family for x32. Tested on x86-64 and x32.
On x86-64, libc.so is the same with and withou the fix.
[BZ# 24097]
CVE-2019-6488
* sysdeps/x86_64/multiarch/strcmp-avx2.S: Use RDX_LP for length.
* sysdeps/x86_64/multiarch/strcmp-sse42.S: Likewise.
* sysdeps/x86_64/strcmp.S: Likewise.
* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-strncasecmp,
tst-size_t-strncmp and tst-size_t-wcsncmp.
* sysdeps/x86_64/x32/tst-size_t-strncasecmp.c: New file.
* sysdeps/x86_64/x32/tst-size_t-strncmp.c: Likewise.
* sysdeps/x86_64/x32/tst-size_t-wcsncmp.c: Likewise.
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes memset/wmemset for x32. Tested on x86-64 and x32. On
x86-64, libc.so is the same with and withou the fix.
[BZ# 24097]
CVE-2019-6488
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Use
RDX_LP for length. Clear the upper 32 bits of RDX register.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-wmemset.
* sysdeps/x86_64/x32/tst-size_t-memset.c: New file.
* sysdeps/x86_64/x32/tst-size_t-wmemset.c: Likewise.
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes memrchr for x32. Tested on x86-64 and x32. On x86-64,
libc.so is the same with and withou the fix.
[BZ# 24097]
CVE-2019-6488
* sysdeps/x86_64/memrchr.S: Use RDX_LP for length.
* sysdeps/x86_64/multiarch/memrchr-avx2.S: Likewise.
* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-memrchr.
* sysdeps/x86_64/x32/tst-size_t-memrchr.c: New file.
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes memcpy for x32. Tested on x86-64 and x32. On x86-64,
libc.so is the same with and withou the fix.
[BZ# 24097]
CVE-2019-6488
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: Use RDX_LP for
length. Clear the upper 32 bits of RDX register.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S:
Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-memcpy.
tst-size_t-wmemchr.
* sysdeps/x86_64/x32/tst-size_t-memcpy.c: New file.
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes memcmp/wmemcmp for x32. Tested on x86-64 and x32. On
x86-64, libc.so is the same with and withou the fix.
[BZ# 24097]
CVE-2019-6488
* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S: Use RDX_LP for
length. Clear the upper 32 bits of RDX register.
* sysdeps/x86_64/multiarch/memcmp-sse4.S: Likewise.
* sysdeps/x86_64/multiarch/memcmp-ssse3.S: Likewise.
* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-memcmp and
tst-size_t-wmemcmp.
* sysdeps/x86_64/x32/tst-size_t-memcmp.c: New file.
* sysdeps/x86_64/x32/tst-size_t-wmemcmp.c: Likewise.
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes memchr/wmemchr for x32. Tested on x86-64 and x32. On
x86-64, libc.so is the same with and withou the fix.
[BZ# 24097]
CVE-2019-6488
* sysdeps/x86_64/memchr.S: Use RDX_LP for length. Clear the
upper 32 bits of RDX register.
* sysdeps/x86_64/multiarch/memchr-avx2.S: Likewise.
* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-memchr and
tst-size_t-wmemchr.
* sysdeps/x86_64/x32/test-size_t.h: New file.
* sysdeps/x86_64/x32/tst-size_t-memchr.c: Likewise.
* sysdeps/x86_64/x32/tst-size_t-wmemchr.c: Likewise.
Optimize x86-64 strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2.
It uses vector comparison as much as possible. In general, the larger the
source string, the greater performance gain observed, reaching speedups of
1.6x compared to SSE2 unaligned routines. Select AVX2 strcat/strncat,
strcpy/strncpy and stpcpy/stpncpy on AVX2 machines where vzeroupper is
preferred and AVX unaligned load is fast.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
strcat-avx2, strncat-avx2, strcpy-avx2, strncpy-avx2,
stpcpy-avx2 and stpncpy-avx2.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c:
(__libc_ifunc_impl_list): Add tests for __strcat_avx2,
__strncat_avx2, __strcpy_avx2, __strncpy_avx2, __stpcpy_avx2
and __stpncpy_avx2.
* sysdeps/x86_64/multiarch/{ifunc-unaligned-ssse3.h =>
ifunc-strcpy.h}: rename header for a more generic name.
* sysdeps/x86_64/multiarch/ifunc-strcpy.h:
(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if
AVX unaligned load is fast and vzeroupper is preferred.
* sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file
* sysdeps/x86_64/multiarch/stpncpy-avx2.S: Likewise
* sysdeps/x86_64/multiarch/strcat-avx2.S: Likewise
* sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise
* sysdeps/x86_64/multiarch/strncat-avx2.S: Likewise
* sysdeps/x86_64/multiarch/strncpy-avx2.S: Likewise
The x86 defines optimized THREAD_ATOMIC_* macros where reference always
the current thread instead of the one indicated by input 'descr' argument.
It work as long the input is the self thread pointer, however it generates
wrong code if the semantic is to set a bit atomicialy from another thread.
This is not an issue for current GLIBC usage, however the new cancellation
code expects that some synchronization code to atomically set bits from
different threads.
The generic code generates an additional load to reference to TLS segment,
for instance the code:
THREAD_ATOMIC_BIT_SET (THREAD_SELF, cancelhandling, CANCELED_BIT);
Compiles to:
lock;orl $4, %fs:776
Where with patch changes it now compiles to:
mov %fs:16,%rax
lock;orl $4, 776(%rax)
If some usage indeed proves to be a hotspot we can add an extra macro
with a more descriptive name (THREAD_ATOMIC_BIT_SET_SELF for instance)
where x86_64 might optimize it.
Checked on x86_64-linux-gnu.
* sysdeps/x86_64/nptl/tls.h (THREAD_ATOMIC_CMPXCHG_VAL,
THREAD_ATOMIC_AND, THREAD_ATOMIC_BIT_SET): Remove macros.
The current s_sincosf.c is faster than s_sincosf-sse2.S. On Broadwell
with FMA disabled, bench-sincosf shows:
Before After Improvement
max 154.032 114.517 34%
min 6.25 5.609 11%
mean 14.8728 12.8589 15%
* sysdeps/x86_64/fpu/s_sincosf.S: Removed.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-sse2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-sse2.c: New file.
Add <sincosf_poly.h> and include it in s_sincosf.h to allow vectorized
sincosf_poly. Add x86 sincosf_poly.h to vectorize sincosf_poly. On
Broadwell, bench-sincosf shows:
Before After Improvement
max 160.273 114.198 40%
min 6.25 5.625 11%
mean 13.0325 10.6462 22%
Vectorized sincosf_poly shows
Before After Improvement
max 138.653 114.198 21%
min 5.004 5.625 -11%
mean 11.5934 10.6462 9%
Tested on x86-64 and i686 as well as with build-many-glibcs.py.
* sysdeps/ieee754/flt-32/s_sincosf.h: Include <sincosf_poly.h>.
(sincos_t, sincosf_poly, sinf_poly): Moved to ...
* sysdeps/ieee754/flt-32/sincosf_poly.h: Here. New file.
* sysdeps/x86/fpu/s_sincosf_data.c: New file.
* sysdeps/x86/fpu/sincosf_poly.h: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c: Just include
<sysdeps/ieee754/flt-32/s_sincosf.c>.
Merge i386 and x86_64 atomic-machine.h to x86 atomic-machine.h.
Tested on i686 and x86_64 as well as with build-many-glibcs.py.
* sysdeps/i386/atomic-machine.h: Merged with ...
* sysdeps/x86_64/atomic-machine.h: To ...
* sysdeps/x86/atomic-machine.h: This. New file.
Extend CPUID support for all feature bits from CPUID. Add a new macro,
CPU_FEATURE_USABLE, which can be used to check if a feature is usable at
run-time, instead of HAS_CPU_FEATURE and HAS_ARCH_FEATURE.
Add COMMON_CPUID_INDEX_D_ECX_1, COMMON_CPUID_INDEX_80000007 and
COMMON_CPUID_INDEX_80000008 to check CPU feature bits in them.
Tested on i686 and x86-64 as well as using build-many-glibcs.py with
x86 targets.
* sysdeps/x86/cacheinfo.c (intel_check_word): Updated for
cpu_features_basic.
(__cache_sysconf): Likewise.
(init_cacheinfo): Likewise.
* sysdeps/x86/cpu-features.c (get_extended_indeces): Also
populate COMMON_CPUID_INDEX_80000007 and
COMMON_CPUID_INDEX_80000008.
(get_common_indices): Also populate COMMON_CPUID_INDEX_D_ECX_1.
Use CPU_FEATURES_CPU_P (cpu_features, XSAVEC) to check if
XSAVEC is available. Set the bit_arch_XXX_Usable bits.
(init_cpu_features): Use _Static_assert on
index_arch_Fast_Unaligned_Load.
__get_cpuid_registers and __get_arch_feature. Updated for
cpu_features_basic. Set stepping in cpu_features.
* sysdeps/x86/cpu-features.h: (FEATURE_INDEX_1): Changed to enum.
(FEATURE_INDEX_2): New.
(FEATURE_INDEX_MAX): Changed to enum.
(COMMON_CPUID_INDEX_D_ECX_1): New.
(COMMON_CPUID_INDEX_80000007): Likewise.
(COMMON_CPUID_INDEX_80000008): Likewise.
(cpuid_registers): Likewise.
(cpu_features_basic): Likewise.
(CPU_FEATURE_USABLE): Likewise.
(bit_arch_XXX_Usable): Likewise.
(cpu_features): Use cpuid_registers and cpu_features_basic.
(bit_arch_XXX): Reweritten.
(bit_cpu_XXX): Likewise.
(index_cpu_XXX): Likewise.
(reg_XXX): Likewise.
* sysdeps/x86/tst-get-cpu-features.c: Include <stdio.h> and
<support/check.h>.
(CHECK_CPU_FEATURE): New.
(CHECK_CPU_FEATURE_USABLE): Likewise.
(cpu_kinds): Likewise.
(do_test): Print vendor, family, model and stepping. Check
HAS_CPU_FEATURE and CPU_FEATURE_USABLE.
(TEST_FUNCTION): Removed.
Include <support/test-driver.c> instead of
"../../test-skeleton.c".
* sysdeps/x86_64/multiarch/sched_cpucount.c (__sched_cpucount):
Check POPCNT instead of POPCOUNT.
* sysdeps/x86_64/multiarch/test-multiarch.c (do_test): Likewise.
Introduce new pow symbol version that doesn't do SVID compatible error
handling. The standard errno and fp exception based error handling is
inline in the new code and does not have significant overhead.
The wrapper is disabled for sysdeps/ieee754/dbl-64 by using empty
w_pow.c and enabled for targets with their own pow implementation or
ifunc dispatch on __ieee754_pow by including math/w_pow.c.
The compatibility symbol version still uses the wrapper with SVID error
handling around the new code. There is no new symbol version nor
compatibility code on !LIBM_SVID_COMPAT targets (e.g. riscv).
On targets where previously powl was an alias of pow, now it points to
the compatibility symbol with the wrapper, because it still need the
SVID compatible error handling. This affects NO_LONG_DOUBLE (e.g. arm)
and LONG_DOUBLE_COMPAT (e.g. alpha) targets as well.
The __pow_finite symbol is now an alias of pow. Both __pow_finite and
pow set errno and thus not const functions.
The ia64 asm is changed so the compat and new symbol versions map to the
same address.
On x86_64 #include <math.h> was added before macro definitions that
may affect that header.
Tested with build-many-glibcs.py.
* math/Versions (GLIBC_2.29): Add pow.
* math/w_pow_compat.c (__pow_compat): Change to versioned compat
symbol.
* math/w_pow.c: New file.
* sysdeps/i386/fpu/w_pow.c: New file.
* sysdeps/ia64/fpu/e_pow.S: Add versioned symbols.
* sysdeps/ieee754/dbl-64/e_pow.c (__ieee754_pow): Rename to __pow
and add necessary aliases.
* sysdeps/ieee754/dbl-64/w_pow.c: New file.
* sysdeps/m68k/m680x0/fpu/w_pow.c: New file.
* sysdeps/mach/hurd/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/aarch64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/alpha/libm.abilist: Update.
* sysdeps/unix/sysv/linux/arm/libm.abilist: Update.
* sysdeps/unix/sysv/linux/hppa/libm.abilist: Update.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/ia64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Update.
* sysdeps/unix/sysv/linux/microblaze/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/nios2/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sh/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Update.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma.c (__ieee754_pow): Rename to
__pow.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma4.c (__ieee754_pow): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_pow.c (__ieee754_pow): Likewise.
* sysdeps/x86_64/fpu/multiarch/w_pow.c: New file.
Introduce new log symbol version that doesn't do SVID compatible error
handling. The standard errno and fp exception based error handling is
inline in the new code and does not have significant overhead.
The wrapper is disabled for sysdeps/ieee754/dbl-64 by using empty
w_log.c and enabled for targets with their own log implementation by
including math/w_log.c.
The compatibility symbol version still uses the wrapper with SVID error
handling around the new code. There is no new symbol version nor
compatibility code on !LIBM_SVID_COMPAT targets (e.g. riscv).
On targets where previously logl was an alias of log, now it points to
the compatibility symbol with the wrapper, because it still need the
SVID compatible error handling. This affects NO_LONG_DOUBLE (e.g. arm)
and LONG_DOUBLE_COMPAT (e.g. alpha) targets as well.
The __log_finite symbol is now an alias of log. Both __log_finite and
log set errno and thus not const functions.
The ia64 asm is changed so the compat and new symbol versions map to the
same address.
On x86_64 #include <math.h> was added before macro definitions that may
affect that header.
Tested with build-many-glibcs.py.
* math/Versions (GLIBC_2.29): Add log.
* math/w_log_compat.c (__log_compat): Change to versioned compat
symbol.
* math/w_log.c: New file.
* sysdeps/i386/fpu/w_log.c: New file.
* sysdeps/ia64/fpu/e_log.S: Update.
* sysdeps/ieee754/dbl-64/e_log.c (__ieee754_log): Rename to __log
and add necessary aliases.
* sysdeps/ieee754/dbl-64/w_log.c: New file.
* sysdeps/m68k/m680x0/fpu/w_log.c: New file.
* sysdeps/mach/hurd/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/aarch64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/alpha/libm.abilist: Update.
* sysdeps/unix/sysv/linux/arm/libm.abilist: Update.
* sysdeps/unix/sysv/linux/hppa/libm.abilist: Update.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/ia64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Update.
* sysdeps/unix/sysv/linux/microblaze/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/nios2/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sh/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Update.
* sysdeps/x86_64/fpu/multiarch/e_log-avx.c (__ieee754_log): Rename to
__log.
* sysdeps/x86_64/fpu/multiarch/e_log-fma.c (__ieee754_log): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_log-fma4.c (__ieee754_log): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_log.c (__ieee754_log): Likewise.
* sysdeps/x86_64/fpu/multiarch/w_log.c: New file.
Introduce new exp and exp2 symbol version that don't do SVID compatible
error handling. The standard errno and fp exception based error handling
is inline in the new code and does not have significant overhead.
The double precision wrappers are disabled for sysdeps/ieee754/dbl-64
by using empty w_exp.c and w_exp2.c files, the math/w_exp.c and
math/w_exp2.c files use the wrapper template and can be included by
targets that have their own exp and exp2 implementations or use ifunc
on the glibc internal __ieee754_exp symbol.
The compatibility symbol versions still use the wrapper with SVID error
handling around the new code. There is no new symbol version nor
compatibility code on !LIBM_SVID_COMPAT targets (e.g. riscv).
On targets where previously expl and exp2l were aliases of exp and exp2,
now they point to the compatibility symbols with the wrapper, because
they still need the SVID compatible error handling. This affects
NO_LONG_DOUBLE (e.g arm) and LONG_DOUBLE_COMPAT (e.g. alpha) targets
as well.
The _finite symbols are now aliases of the standard symbols (they have
no performance advantage anymore). Both the standard symbols and
_finite symbols set errno and thus not const functions.
The ia64 asm is changed so the compat and new symbol versions map to the
same address.
On x86_64 #include <math.h> was added before macro definitions that may
affect that header (the new macro name is __exp instead of __ieee754_exp
which breaks some math.h macros).
Tested with build-many-glibcs.py.
* math/Versions (GLIBC_2.29): Add exp and exp2.
* math/w_exp2_compat.c (__exp2_compat): Change to versioned compat
symbol, handle NO_LONG_DOUBLE and LONG_DOUBLE_COMPAT explicitly.
* math/w_exp_compat.c (__exp_compat): Likewise.
* math/w_exp.c: New file.
* math/w_exp2.c: New file.
* sysdeps/i386/fpu/w_exp.c: New file.
* sysdeps/i386/fpu/w_exp2.c: New file.
* sysdeps/ia64/fpu/e_exp.S: Add versioned symbols.
* sysdeps/ia64/fpu/e_exp2.S: Likewise.
* sysdeps/ieee754/dbl-64/e_exp.c (__ieee754_exp): Rename to __exp
and add necessary aliases.
* sysdeps/ieee754/dbl-64/e_exp2.c (__ieee754_exp2): Rename to __exp2
and add necessary aliases.
* sysdeps/ieee754/dbl-64/w_exp.c: New file.
* sysdeps/ieee754/dbl-64/w_exp2.c: New file.
* sysdeps/m68k/m680x0/fpu/w_exp.c: New file.
* sysdeps/m68k/m680x0/fpu/w_exp2.c: New file.
* sysdeps/mach/hurd/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/aarch64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/alpha/libm.abilist: Update.
* sysdeps/unix/sysv/linux/arm/libm.abilist: Update.
* sysdeps/unix/sysv/linux/hppa/libm.abilist: Update.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Update.
* sysdeps/unix/sysv/linux/ia64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Update.
* sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Update.
* sysdeps/unix/sysv/linux/microblaze/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/nios2/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sh/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Update.
* sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Update.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Update.
* sysdeps/x86_64/fpu/multiarch/e_exp-avx.c (__exp1): Remove.
(__ieee754_exp): Rename to __exp.
* sysdeps/x86_64/fpu/multiarch/e_exp-fma.c (__exp1): Remove.
(__ieee754_exp): Rename to __exp.
* sysdeps/x86_64/fpu/multiarch/e_exp-fma4.c (__exp1): Remove.
(__ieee754_exp): Rename to __exp.
* sysdeps/x86_64/fpu/multiarch/e_exp.c (__ieee754_exp): Rename to
__exp.
* sysdeps/x86_64/fpu/multiarch/w_exp.c: New file.
GCC 9 has gained an enhancement to help detect attribute mismatches
between alias declarations and their targets. It consists of a new
warning, -Wattribute-alias, an enhancement to an existing warning,
-Wmissing-attributes, and a new attribute called copy.
The purpose of the warnings is to help identify either possible bugs
(an alias declared with more restrictive attributes than its target
promises) or optimization or diagnostic opportunities (an alias target
missing some attributes that it could be declared with that might
benefit analysis and code generation). The purpose of the new
attribute is to easily apply (almost) the same set of attributes
to one declaration as those already present on another.
As expected (and intended) the enhancement triggers warnings for
many alias declarations in Glibc code. This change, tested on
x86_64-linux, avoids all instances of the new warnings by making
use of the attribute where appropriate. To fully benefit from
the enhancement Glibc will need to be compiled with
-Wattribute-alias=2 and remaining warnings reviewed and dealt with
(there are a couple of thousand but most should be straightforward
to deal with).
ChangeLog:
* include/libc-symbols.h (__attribute_copy__): Define macro unless
it's already defined.
(_strong_alias): Use __attribute_copy__.
(_weak_alias, __hidden_ver1, __hidden_nolink2): Same.
* misc/sys/cdefs.h (__attribute_copy__): New macro.
* sysdeps/x86_64/multiarch/memchr.c (memchr): Use __attribute_copy__.
* sysdeps/x86_64/multiarch/memcmp.c (memcmp): Same.
* sysdeps/x86_64/multiarch/mempcpy.c (mempcpy): Same.
* sysdeps/x86_64/multiarch/memset.c (memset): Same.
* sysdeps/x86_64/multiarch/stpcpy.c (stpcpy): Same.
* sysdeps/x86_64/multiarch/strcat.c (strcat): Same.
* sysdeps/x86_64/multiarch/strchr.c (strchr): Same.
* sysdeps/x86_64/multiarch/strcmp.c (strcmp): Same.
* sysdeps/x86_64/multiarch/strcpy.c (strcpy): Same.
* sysdeps/x86_64/multiarch/strcspn.c (strcspn): Same.
* sysdeps/x86_64/multiarch/strlen.c (strlen): Same.
* sysdeps/x86_64/multiarch/strncmp.c (strncmp): Same.
* sysdeps/x86_64/multiarch/strncpy.c (strncpy): Same.
* sysdeps/x86_64/multiarch/strnlen.c (strnlen): Same.
* sysdeps/x86_64/multiarch/strpbrk.c (strpbrk): Same.
* sysdeps/x86_64/multiarch/strrchr.c (strrchr): Same.
* sysdeps/x86_64/multiarch/strspn.c (strspn): Same.
Since _rdtsc intrinsic is supported in GCC 4.9, we can use it for
HP_TIMING_NOW. This patch
1. Create x86 hp-timing.h to replace i686 and x86_64 hp-timing.h.
2. Move MINIMUM_ISA from init-arch.h to isa.h so that x86 hp-timing.h
can check minimum x86 ISA to decide if _rdtsc can be used.
NB: Checking if __i686__ isn't sufficient since __i686__ may not be
defined when building for i686 class processors.
* sysdeps/i386/init-arch.h: Removed.
* sysdeps/i386/i586/init-arch.h: Likewise.
* sysdeps/i386/i686/init-arch.h: Likewise.
* sysdeps/i386/i686/hp-timing.h: Likewise.
* sysdeps/x86_64/hp-timing.h: Likewise.
* sysdeps/i386/isa.h: New file.
* sysdeps/i386/i586/isa.h: Likewise.
* sysdeps/i386/i686/isa.h: Likewise.
* sysdeps/x86_64/isa.h: Likewise.
* sysdeps/x86/hp-timing.h: New file.
* sysdeps/x86/init-arch.h: Include <isa.h>.
The algorithm is exp(y * log(x)), where log(x) is computed with about
1.3*2^-68 relative error (1.5*2^-68 without fma), returning the result
in two doubles, and the exp part uses the same algorithm (and lookup
tables) as exp, but takes the input as two doubles and a sign (to handle
negative bases with odd integer exponent). The __exp1 internal symbol
is no longer necessary.
There is separate code path when fma is not available but the worst case
error is about 0.54 ULP in both cases. The lookup table and consts for
log are 4168 bytes. The .rodata+.text is decreased by 37908 bytes on
aarch64. The non-nearest rounding error is less than 1 ULP.
Improvements on Cortex-A72 compared to current glibc master:
pow thruput: 2.40x in [0.01 11.1]x[0.01 11.1]
pow latency: 1.84x in [0.01 11.1]x[0.01 11.1]
Tested on
aarch64-linux-gnu (defined __FP_FAST_FMA, TOINT_INTRINSICS) and
arm-linux-gnueabihf (!defined __FP_FAST_FMA, !TOINT_INTRINSICS) and
x86_64-linux-gnu (!defined __FP_FAST_FMA, !TOINT_INTRINSICS) and
powerpc64le-linux-gnu (defined __FP_FAST_FMA, !TOINT_INTRINSICS) targets.
* NEWS: Mention pow improvements.
* math/Makefile (type-double-routines): Add e_pow_log_data.
* sysdeps/generic/math_private.h (__exp1): Remove.
* sysdeps/i386/fpu/e_pow_log_data.c: New file.
* sysdeps/ia64/fpu/e_pow_log_data.c: New file.
* sysdeps/ieee754/dbl-64/Makefile (CFLAGS-e_pow.c): Allow fma
contraction.
* sysdeps/ieee754/dbl-64/e_exp.c (__exp1): Remove.
(exp_inline): Remove.
(__ieee754_exp): Only single double input is handled.
* sysdeps/ieee754/dbl-64/e_pow.c: Rewrite.
* sysdeps/ieee754/dbl-64/e_pow_log_data.c: New file.
* sysdeps/ieee754/dbl-64/math_config.h (issignaling_inline): Define.
(__pow_log_data): Define.
* sysdeps/ieee754/dbl-64/upow.h: Remove.
* sysdeps/ieee754/dbl-64/upow.tbl: Remove.
* sysdeps/m68k/m680x0/fpu/e_pow_log_data.c: New file.
* sysdeps/x86_64/fpu/multiarch/Makefile (CFLAGS-e_pow-fma.c): Allow fma
contraction.
(CFLAGS-e_pow-fma4.c): Likewise.
Continuing the move to use, within libm, public names for libm
functions that can be inlined as built-in functions on many
architectures, this patch moves calls to __rint functions to call the
corresponding rint names instead, with asm redirection to __rint when
the calls are not inlined. The x86_64 math_private.h is removed as no
longer useful after this patch.
This patch is relative to a tree with my floor patch
<https://sourceware.org/ml/libc-alpha/2018-09/msg00148.html> applied,
and much the same considerations arise regarding possibly replacing an
IFUNC call with a direct inline expansion.
Tested for x86_64, and with build-many-glibcs.py.
* include/math.h [!_ISOMAC && !(__FINITE_MATH_ONLY__ &&
__FINITE_MATH_ONLY__ > 0) && !NO_MATH_REDIRECT] (rint): Redirect
using MATH_REDIRECT.
* sysdeps/aarch64/fpu/s_rint.c: Define NO_MATH_REDIRECT before
header inclusion.
* sysdeps/aarch64/fpu/s_rintf.c: Likewise.
* sysdeps/alpha/fpu/s_rint.c: Likewise.
* sysdeps/alpha/fpu/s_rintf.c: Likewise.
* sysdeps/i386/fpu/s_rintl.c: Likewise.
* sysdeps/ieee754/dbl-64/s_rint.c: Likewise.
* sysdeps/ieee754/dbl-64/wordsize-64/s_rint.c: Likewise.
* sysdeps/ieee754/float128/s_rintf128.c: Likewise.
* sysdeps/ieee754/flt-32/s_rintf.c: Likewise.
* sysdeps/ieee754/ldbl-128/s_rintl.c: Likewise.
* sysdeps/ieee754/ldbl-128ibm/s_rintl.c: Likewise.
* sysdeps/m68k/coldfire/fpu/s_rint.c: Likewise.
* sysdeps/m68k/coldfire/fpu/s_rintf.c: Likewise.
* sysdeps/m68k/m680x0/fpu/s_rint.c: Likewise.
* sysdeps/m68k/m680x0/fpu/s_rintf.c: Likewise.
* sysdeps/m68k/m680x0/fpu/s_rintl.c: Likewise.
* sysdeps/powerpc/fpu/s_rint.c: Likewise.
* sysdeps/powerpc/fpu/s_rintf.c: Likewise.
* sysdeps/riscv/rv64/rvd/s_rint.c: Likewise.
* sysdeps/riscv/rvf/s_rintf.c: Likewise.
* sysdeps/sparc/sparc32/sparcv9/fpu/multiarch/s_rint.c: Likewise.
* sysdeps/sparc/sparc32/sparcv9/fpu/multiarch/s_rintf.c: Likewise.
* sysdeps/sparc/sparc64/fpu/multiarch/s_rint.c: Likewise.
* sysdeps/sparc/sparc64/fpu/multiarch/s_rintf.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_rint.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_rintf.c: Likewise.
* sysdeps/x86_64/fpu/math_private.h: Remove file.
* math/e_scalb.c (invalid_fn): Use rint functions instead of
__rint variants.
* math/e_scalbf.c (invalid_fn): Likewise.
* math/e_scalbl.c (invalid_fn): Likewise.
* sysdeps/ieee754/dbl-64/e_gamma_r.c (__ieee754_gamma_r):
Likewise.
* sysdeps/ieee754/flt-32/e_gammaf_r.c (__ieee754_gammaf_r):
Likewise.
* sysdeps/ieee754/k_standard.c (__kernel_standard): Likewise.
* sysdeps/ieee754/k_standardl.c (__kernel_standard_l): Likewise.
* sysdeps/ieee754/ldbl-128/e_gammal_r.c (__ieee754_gammal_r):
Likewise.
* sysdeps/ieee754/ldbl-128ibm/e_gammal_r.c (__ieee754_gammal_r):
Likewise.
* sysdeps/ieee754/ldbl-96/e_gammal_r.c (__ieee754_gammal_r):
Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_llrint.c (__llrint): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_llrintf.c (__llrintf): Likewise.
Similar to the changes that were made to call sqrt functions directly
in glibc, instead of __ieee754_sqrt variants, so that the compiler
could inline them automatically without needing special inline
definitions in lots of math_private.h headers, this patch makes libm
code call floor functions directly instead of __floor variants,
removing the inlines / macros for x86_64 (SSE4.1) and powerpc
(POWER5).
The redirection used to ensure that __ieee754_sqrt does still get
called when the compiler doesn't inline a built-in function expansion
is refactored so it can be applied to other functions; the refactoring
is arranged so it's not limited to unary functions either (it would be
reasonable to use this mechanism for copysign - removing the inline in
math_private_calls.h but also eliminating unnecessary local PLT entry
use in the cases (powerpc soft-float and e500v1, for IBM long double)
where copysign calls don't get inlined).
The point of this change is that more architectures can get floor
calls inlined where they weren't previously (AArch64, for example),
without needing special inline definitions in their math_private.h,
and existing such definitions in math_private.h headers can be
removed.
Note that it's possible that in some cases an inline may be used where
an IFUNC call was previously used - this is the case on x86_64, for
example. I think the direct calls to floor are still appropriate; if
there's any significant performance cost from inline SSE2 floor
instead of an IFUNC call ending up with SSE4.1 floor, that indicates
that either the function should be doing something else that's faster
than using floor at all, or it should itself have IFUNC variants, or
that the compiler choice of inlining for generic tuning should change
to allow for the possibility that, by not inlining, an SSE4.1 IFUNC
might be called at runtime - but not that glibc should avoid calling
floor internally. (After all, all the same considerations would apply
to any user program calling floor, where it might either be inlined or
left as an out-of-line call allowing for a possible IFUNC.)
Tested for x86_64, and with build-many-glibcs.py.
* include/math.h [!_ISOMAC && !(__FINITE_MATH_ONLY__ &&
__FINITE_MATH_ONLY__ > 0) && !NO_MATH_REDIRECT] (MATH_REDIRECT):
New macro.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (MATH_REDIRECT_LDBL): Likewise.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (MATH_REDIRECT_F128): Likewise.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (MATH_REDIRECT_UNARY_ARGS): Likewise.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (sqrt): Redirect using MATH_REDIRECT.
[!_ISOMAC && !(__FINITE_MATH_ONLY__ && __FINITE_MATH_ONLY__ > 0)
&& !NO_MATH_REDIRECT] (floor): Likewise.
* sysdeps/aarch64/fpu/s_floor.c: Define NO_MATH_REDIRECT before
header inclusion.
* sysdeps/aarch64/fpu/s_floorf.c: Likewise.
* sysdeps/ieee754/dbl-64/s_floor.c: Likewise.
* sysdeps/ieee754/dbl-64/wordsize-64/s_floor.c: Likewise.
* sysdeps/ieee754/float128/s_floorf128.c: Likewise.
* sysdeps/ieee754/flt-32/s_floorf.c: Likewise.
* sysdeps/ieee754/ldbl-128/s_floorl.c: Likewise.
* sysdeps/ieee754/ldbl-128ibm/s_floorl.c: Likewise.
* sysdeps/m68k/m680x0/fpu/s_floor_template.c: Likewise.
* sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_floor.c: Likewise.
* sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_floorf.c: Likewise.
* sysdeps/powerpc/powerpc64/fpu/multiarch/s_floor.c: Likewise.
* sysdeps/powerpc/powerpc64/fpu/multiarch/s_floorf.c: Likewise.
* sysdeps/riscv/rv64/rvd/s_floor.c: Likewise.
* sysdeps/riscv/rvf/s_floorf.c: Likewise.
* sysdeps/sparc/sparc64/fpu/multiarch/s_floor.c: Likewise.
* sysdeps/sparc/sparc64/fpu/multiarch/s_floorf.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_floor.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_floorf.c: Likewise.
* sysdeps/powerpc/fpu/math_private.h [_ARCH_PWR5X] (__floor):
Remove macro.
[_ARCH_PWR5X] (__floorf): Likewise.
* sysdeps/x86_64/fpu/math_private.h [__SSE4_1__] (__floor): Remove
inline function.
[__SSE4_1__] (__floorf): Likewise.
* math/w_lgamma_main.c (LGFUNC (__lgamma)): Use floor functions
instead of __floor variants.
* math/w_lgamma_r_compat.c (__lgamma_r): Likewise.
* math/w_lgammaf_main.c (LGFUNC (__lgammaf)): Likewise.
* math/w_lgammaf_r_compat.c (__lgammaf_r): Likewise.
* math/w_lgammal_main.c (LGFUNC (__lgammal)): Likewise.
* math/w_lgammal_r_compat.c (__lgammal_r): Likewise.
* math/w_tgamma_compat.c (__tgamma): Likewise.
* math/w_tgamma_template.c (M_DECL_FUNC (__tgamma)): Likewise.
* math/w_tgammaf_compat.c (__tgammaf): Likewise.
* math/w_tgammal_compat.c (__tgammal): Likewise.
* sysdeps/ieee754/dbl-64/e_lgamma_r.c (sin_pi): Likewise.
* sysdeps/ieee754/dbl-64/k_rem_pio2.c (__kernel_rem_pio2):
Likewise.
* sysdeps/ieee754/dbl-64/lgamma_neg.c (__lgamma_neg): Likewise.
* sysdeps/ieee754/flt-32/e_lgammaf_r.c (sin_pif): Likewise.
* sysdeps/ieee754/flt-32/lgamma_negf.c (__lgamma_negf): Likewise.
* sysdeps/ieee754/ldbl-128/e_lgammal_r.c (__ieee754_lgammal_r):
Likewise.
* sysdeps/ieee754/ldbl-128/e_powl.c (__ieee754_powl): Likewise.
* sysdeps/ieee754/ldbl-128/lgamma_negl.c (__lgamma_negl):
Likewise.
* sysdeps/ieee754/ldbl-128/s_expm1l.c (__expm1l): Likewise.
* sysdeps/ieee754/ldbl-128ibm/e_lgammal_r.c (__ieee754_lgammal_r):
Likewise.
* sysdeps/ieee754/ldbl-128ibm/e_powl.c (__ieee754_powl): Likewise.
* sysdeps/ieee754/ldbl-128ibm/lgamma_negl.c (__lgamma_negl):
Likewise.
* sysdeps/ieee754/ldbl-128ibm/s_expm1l.c (__expm1l): Likewise.
* sysdeps/ieee754/ldbl-128ibm/s_truncl.c (__truncl): Likewise.
* sysdeps/ieee754/ldbl-96/e_lgammal_r.c (sin_pi): Likewise.
* sysdeps/ieee754/ldbl-96/lgamma_negl.c (__lgamma_negl): Likewise.
* sysdeps/powerpc/power5+/fpu/s_modf.c (__modf): Likewise.
* sysdeps/powerpc/power5+/fpu/s_modff.c (__modff): Likewise.
The x86_64 math_private.h has asm versions of the macros to
reinterpret between floating-point and integer types.
This is the sort of thing we now strongly discourage; the expectation
in such cases, where the generic C code gives the compiler all the
information needed about the required semantics, is that you should
get the compiler to do the right thing for the generic C code rather
than writing an asm version.
Trivial tests showed GCC generates the expected single instructions
for reinterpretation from floating point to integer. In the other
direction, it goes via memory when the asms don't; I asked about this
in GCC bug 87236 and was advised this was deliberate for generic
tuning because it was faster that way on some AMD processors (but
-mtune=intel, and -Os with the latest GCC, avoid going via memory).
The asms don't and can't know about those tuning details, so that's
evidence that they are actually making the code worse.
This patch removes the asms accordingly. Tested for x86_64.
* sysdeps/x86_64/fpu/math_private.h (MOVD): Remove macro.
(MOVQ): Likewise.
(EXTRACT_WORDS64): Likewise.
(INSERT_WORDS64): Likewise.
(GET_FLOAT_WORD): Likewise.
(SET_FLOAT_WORD): Likewise.
Optimized exp and exp2 implementations using a lookup table for
fractional powers of 2. There are several variants, see e_exp_data.c,
they can be selected by modifying math_config.h allowing different
tradeoffs.
The default selection should be acceptable as generic libm code.
Worst case error is 0.509 ULP for exp and 0.507 ULP for exp2, on
aarch64 the rodata size is 2160 bytes, shared between exp and exp2.
On aarch64 .text + .rodata size decreased by 24912 bytes.
The non-nearest rounding error is less than 1 ULP even on targets
without efficient round implementation (although the error rate is
higher in that case). Targets with single instruction, rounding mode
independent, to nearest integer rounding and conversion can use them
by setting TOINT_INTRINSICS and adding the necessary code to their
math_private.h.
The __exp1 code uses the same algorithm, so the error bound of pow
increased a bit.
New double precision error handling code was added following the
style of the single precision error handling code.
Improvements on Cortex-A72 compared to current glibc master:
exp thruput: 1.61x in [-9.9 9.9]
exp latency: 1.53x in [-9.9 9.9]
exp thruput: 1.13x in [0.5 1]
exp latency: 1.30x in [0.5 1]
exp2 thruput: 2.03x in [-9.9 9.9]
exp2 latency: 1.64x in [-9.9 9.9]
For small (< 1) inputs the current exp code uses a separate algorithm
so the speed up there is less.
Was tested on
aarch64-linux-gnu (TOINT_INTRINSICS, fma contraction) and
arm-linux-gnueabihf (!TOINT_INTRINSICS, no fma contraction) and
x86_64-linux-gnu (!TOINT_INTRINSICS, no fma contraction) and
powerpc64le-linux-gnu (!TOINT_INTRINSICS, fma contraction) targets,
only non-nearest rounding ulp errors increase and they are within
acceptable bounds (ulp updates are in separate patches).
* NEWS: Mention exp and exp2 improvements.
* math/Makefile (libm-support): Remove t_exp.
(type-double-routines): Add math_err and e_exp_data.
* sysdeps/aarch64/libm-test-ulps: Update.
* sysdeps/arm/libm-test-ulps: Update.
* sysdeps/i386/fpu/e_exp_data.c: New file.
* sysdeps/i386/fpu/math_err.c: New file.
* sysdeps/i386/fpu/t_exp.c: Remove.
* sysdeps/ia64/fpu/e_exp_data.c: New file.
* sysdeps/ia64/fpu/math_err.c: New file.
* sysdeps/ia64/fpu/t_exp.c: Remove.
* sysdeps/ieee754/dbl-64/e_exp.c: Rewrite.
* sysdeps/ieee754/dbl-64/e_exp2.c: Rewrite.
* sysdeps/ieee754/dbl-64/e_exp_data.c: New file.
* sysdeps/ieee754/dbl-64/e_pow.c (__ieee754_pow): Update error bound.
* sysdeps/ieee754/dbl-64/eexp.tbl: Remove.
* sysdeps/ieee754/dbl-64/math_config.h: New file.
* sysdeps/ieee754/dbl-64/math_err.c: New file.
* sysdeps/ieee754/dbl-64/t_exp.c: Remove.
* sysdeps/ieee754/dbl-64/t_exp2.h: Remove.
* sysdeps/ieee754/dbl-64/uexp.h: Remove.
* sysdeps/ieee754/dbl-64/uexp.tbl: Remove.
* sysdeps/m68k/m680x0/fpu/e_exp_data.c: New file.
* sysdeps/m68k/m680x0/fpu/math_err.c: New file.
* sysdeps/m68k/m680x0/fpu/t_exp.c: Remove.
* sysdeps/powerpc/fpu/libm-test-ulps: Update.
* sysdeps/x86_64/fpu/libm-test-ulps: Update.
On some architectures, the parts of math_private.h relating to the
floating-point environment are in a separate file fenv_private.h
included from math_private.h. As this is purely an
architecture-specific convention used by several architectures,
however, all such architectures still need their own math_private.h,
even if it has nothing to do beyond #include <fenv_private.h> and
peculiarity of including the i386 file directly instead of having a
shared file in sysdeps/x86.
This patch makes the fenv_private.h name an architecture-independent
convention in glibc. The include of fenv_private.h from
math_private.h becomes architecture-independent (until callers are
updated to include fenv_private.h directly so the include from
math_private.h is no longer needed). Some architecture math_private.h
headers are removed if no longer needed, or renamed to fenv_private.h
if all they define belongs in that header; architecture fenv_private.h
headers now do require #include_next <fenv_private.h>. The i386
fenv_private.h file moves to sysdeps/x86/fpu/ to reflect how it is
actually shared with x86_64. The generic math_private.h gets a new
include of <stdbool.h>, as needed for bool in some prototypes in that
header (previously that was indirectly included via include/fenv.h,
which now only gets included too late in math_private.h, after those
prototypes).
Tested for x86_64 and x86, and tested with build-many-glibcs.py that
installed stripped shared libraries are unchanged by the patch.
* sysdeps/aarch64/fpu/fenv_private.h: New file. Based on ....
* sysdeps/aarch64/fpu/math_private.h: ... this file. All contents
moved to fenv_private.h except for ...
(TOINT_INTRINSICS): Kept in math_private.h.
(roundtoint): Likewise.
(converttoint): Likewise.
* sysdeps/arm/fenv_private.h: Change multiple-include guard to
[ARM_FENV_PRIVATE_H]. Include next <fenv_private.h>.
* sysdeps/arm/math_private.h: Remove.
* sysdeps/generic/fenv_private.h: New file. Contents moved from
....
* sysdeps/generic/math_private.h: ... this file. Include
<stdbool.h>. Do not include <fenv.h> or <get-rounding-mode.h>.
Include <fenv_private.h>. Remove functions and macros moved to
fenv_private.h.
* sysdeps/i386/fpu/math_private.h: Remove.
* sysdeps/mips/math_private.h: Move to ....
* sysdeps/mips/fpu/fenv_private.h: ... here. Change
multiple-include guard to [MIPS_FENV_PRIVATE_H]. Remove
[__mips_hard_float] conditional. Include next <fenv_private.h>.
* sysdeps/powerpc/fpu/fenv_private.h: Change multiple-include
guard to [POWERPC_FENV_PRIVATE_H]. Include next <fenv_private.h>.
* sysdeps/powerpc/fpu/math_private.h: Do not include
<fenv_private.h>.
* sysdeps/riscv/rvf/math_private.h: Move to ....
* sysdeps/riscv/rvf/fenv_private.h: ... here. Change
multiple-include guard to [RISCV_FENV_PRIVATE_H]. Include next
<fenv_private.h>.
* sysdeps/sparc/fpu/fenv_private.h: Change multiple-include guard
to [SPARC_FENV_PRIVATE_H]. Include next <fenv_private.h>.
* sysdeps/sparc/fpu/math_private.h: Remove.
* sysdeps/i386/fpu/fenv_private.h: Move to ....
* sysdeps/x86/fpu/fenv_private.h: ... here. Change
multiple-include guard to [X86_FENV_PRIVATE_H]. Include next
<fenv_private.h>.
* sysdeps/x86_64/fpu/math_private.h: Do not include
<sysdeps/i386/fpu/fenv_private.h>.
Fix a few missing spaces, it's now identical to the regenerated version.
Passes GLIBC tests on x64.
* sysdeps/x86_64/fpu/libm-test-ulps: Regenerate to fix spaces.
The second patch improves performance of sinf and cosf using the same
algorithms and polynomials. The returned values are identical to sincosf
for the same input. ULP definitions for AArch64 and x64 are updated.
sinf/cosf througput gains on Cortex-A72:
* |x| < 0x1p-12 : 1.2x
* |x| < M_PI_4 : 1.8x
* |x| < 2 * M_PI: 1.7x
* |x| < 120.0 : 2.3x
* |x| < Inf : 3.0x
* NEWS: Mention sinf, cosf, sincosf.
* sysdeps/aarch64/libm-test-ulps: Update ULP for sinf, cosf, sincosf.
* sysdeps/x86_64/fpu/libm-test-ulps: Update ULP for sinf and cosf.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c: Add definitions of
constants rather than including generic sincosf.h.
* sysdeps/x86_64/fpu/s_sincosf_data.c: Remove.
* sysdeps/ieee754/flt-32/s_cosf.c (cosf): Rewrite.
* sysdeps/ieee754/flt-32/s_sincosf.h (reduced_sin): Remove.
(reduced_cos): Remove.
(sinf_poly): New function.
* sysdeps/ieee754/flt-32/s_sinf.c (sinf): Rewrite.
Continuing moving macros out of math-tests.h to smaller headers
following typo-proof conventions instead of using #ifndef, this patch
moves the SNAN_TESTS_* macros for individual types out to their own
sysdeps header (while the type-generic SNAN_TESTS wrapper for those
macros remains in math-tests.h).
Tested for x86_64 and x86, and with build-many-glibcs.py.
* sysdeps/generic/math-tests-snan.h: New file.
* sysdeps/generic/math-tests.h: Include <math-tests-snan.h>.
(SNAN_TESTS_float): Do not define here.
(SNAN_TESTS_double): Likewise.
(SNAN_TESTS_long_double): Likewise.
(SNAN_TESTS_float128): Likewise.
* sysdeps/i386/fpu/math-tests-snan.h: New file.
* sysdeps/i386/fpu/math-tests.h: Remove file.
* sysdeps/ia64/math-tests-snan.h: New file.
* sysdeps/ia64/math-tests.h: Remove file.
* sysdeps/x86/math-tests.h: Likewise.
* sysdeps/x86_64/fpu/math-tests-snan.h: New file.
This patch is a complete rewrite of sincosf. The new version is
significantly faster, as well as simple and accurate.
The worst-case ULP is 0.5607, maximum relative error is 0.5303 * 2^-23 over
all 4 billion inputs. In non-nearest rounding modes the error is 1ULP.
The algorithm uses 3 main cases: small inputs which don't need argument
reduction, small inputs which need a simple range reduction and large inputs
requiring complex range reduction. The code uses approximate integer
comparisons to quickly decide between these cases.
The small range reducer uses a single reduction step to handle values up to
120.0. It is fastest on targets which support inlined round instructions.
The large range reducer uses integer arithmetic for simplicity. It does a
32x96 bit multiply to compute a 64-bit modulo result. This is more than
accurate enough to handle the worst-case cancellation for values close to
an integer multiple of PI/4. It could be further optimized, however it is
already much faster than necessary.
sincosf throughput gains on Cortex-A72:
* |x| < 0x1p-12 : 1.6x
* |x| < M_PI_4 : 1.7x
* |x| < 2 * M_PI: 1.5x
* |x| < 120.0 : 1.8x
* |x| < Inf : 2.3x
* math/Makefile: Add s_sincosf_data.c.
* sysdeps/ia64/fpu/s_sincosf_data.c: New file.
* sysdeps/ieee754/flt-32/s_sincosf.h (abstop12): Add new function.
(sincosf_poly): Likewise.
(reduce_small): Likewise.
(reduce_large): Likewise.
* sysdeps/ieee754/flt-32/s_sincosf.c (sincosf): Rewrite.
* sysdeps/ieee754/flt-32/s_sincosf_data.c: New file with sincosf data.
* sysdeps/m68k/m680x0/fpu/s_sincosf_data.c: New file.
* sysdeps/x86_64/fpu/s_sincosf_data.c: New file.
__fentry__ symbol is currently not defined for other architectures.
Attempts to introduce it cause abicheck to fail, because it will be
available since 2.29 earliest, and not 2.13, which is the case for
Intel. With the new code, abicheck passes for i686-linux-gnu,
x86_64-linux-gnu and x86_64-linux-gnu32 triples.
ChangeLog:
* stdlib/Versions: Remove __fentry__.
* sysdeps/i386/Versions: Add __fentry__.
* sysdeps/x86_64/Versions: Add __fentry__.
Move STATE_SAVE_OFFSET and STATE_SAVE_MASK to sysdep.h to make
sysdeps/x86/cpu-features.h a C header file.
* sysdeps/x86/cpu-features.h (STATE_SAVE_OFFSET): Removed.
(STATE_SAVE_MASK): Likewise.
Don't check __ASSEMBLER__ to include <cpu-features-offsets.h>.
* sysdeps/x86/sysdep.h (STATE_SAVE_OFFSET): New.
(STATE_SAVE_MASK): Likewise.
* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features-offsets.h>
instead of <cpu-features.h>.
There is no need to include <init-arch.h> in assembly codes since all
x86 IFUNC selector functions are written in C. Tested on i686 and
x86-64. There is no code change in libc.so, ld.so and libmvec.so.
* sysdeps/i386/i686/multiarch/bzero-ia32.S: Don't include
<init-arch.h>.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core-avx2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
The glibc.tune namespace is vaguely named since it is a 'tunable', so
give it a more specific name that describes what it refers to. Rename
the tunable namespace to 'cpu' to more accurately reflect what it
encompasses. Also rename glibc.tune.cpu to glibc.cpu.name since
glibc.cpu.cpu is weird.
* NEWS: Mention the change.
* elf/dl-tunables.list: Rename tune namespace to cpu.
* sysdeps/powerpc/dl-tunables.list: Likewise.
* sysdeps/x86/dl-tunables.list: Likewise.
* sysdeps/aarch64/dl-tunables.list: Rename tune.cpu to
cpu.name.
* elf/dl-hwcaps.c (_dl_important_hwcaps): Adjust.
* elf/dl-hwcaps.h (GET_HWCAP_MASK): Likewise.
* manual/README.tunables: Likewise.
* manual/tunables.texi: Likewise.
* sysdeps/powerpc/cpu-features.c: Likewise.
* sysdeps/unix/sysv/linux/aarch64/cpu-features.c
(init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.c: Likewise.
* sysdeps/x86/cpu-features.h: Likewise.
* sysdeps/x86/cpu-tunables.c: Likewise.
* sysdeps/x86_64/Makefile: Likewise.
* sysdeps/x86/dl-cet.c: Likewise.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This will be used to record the current shadow stack base for shadow
stack switching by getcontext, makecontext, setcontext and swapcontext.
If the target shadow stack base is the same as the current shadow stack
base, we unwind the shadow stack. Otherwise it is a stack switch and
we look for a restore token to restore the target shadow stack.
* sysdeps/i386/nptl/tcb-offsets.sym (SSP_BASE_OFFSET): New.
* sysdeps/i386/nptl/tls.h (tcbhead_t): Replace __glibc_reserved2
with ssp_base.
* sysdeps/x86_64/nptl/tcb-offsets.sym (SSP_BASE_OFFSET): New.
* sysdeps/x86_64/nptl/tls.h (tcbhead_t): Replace __glibc_reserved2
with ssp_base.
Add endbr64 to tst-quadmod1.S and tst-quadmod2.S so that func and foo
can be called indirectly.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* sysdeps/x86_64/tst-quadmod1.S (func): Add endbr64 if IBT is
enabled.
(foo): Likewise.
* sysdeps/x86_64/tst-quadmod2.S (func) : Likewise.
(foo): Likewise.
Add _CET_ENDBR to STRCMP_SSE42, which is called indirectly, to support
IBT.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* sysdeps/x86_64/multiarch/strcmp-sse42.S (STRCMP_SSE42): Add
_CET_ENDBR.
Add _CET_ENDBR to functions in crti.S, which are called indirectly, to
support IBT.
Tested on i686 and x86-64.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* sysdeps/i386/crti.S (_init): Add _CET_ENDBR.
(_fini): Likewise.
* sysdeps/x86_64/crti.S (_init): Likewise.
(_fini): Likewise.
Intel Control-flow Enforcement Technology (CET) instructions:
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-en
forcement-technology-preview.pdf
includes Indirect Branch Tracking (IBT) and Shadow Stack (SHSTK).
GNU_PROPERTY_X86_FEATURE_1_IBT is added to GNU program property to
indicate that all executable sections are compatible with IBT when
ENDBR instruction starts each valid target where an indirect branch
instruction can land. Linker sets GNU_PROPERTY_X86_FEATURE_1_IBT on
output only if it is set on all relocatable inputs.
On an IBT capable processor, the following steps should be taken:
1. When loading an executable without an interpreter, enable IBT and
lock IBT if GNU_PROPERTY_X86_FEATURE_1_IBT is set on the executable.
2. When loading an executable with an interpreter, enable IBT if
GNU_PROPERTY_X86_FEATURE_1_IBT is set on the interpreter.
a. If GNU_PROPERTY_X86_FEATURE_1_IBT isn't set on the executable,
disable IBT.
b. Lock IBT.
3. If IBT is enabled, when loading a shared object without
GNU_PROPERTY_X86_FEATURE_1_IBT:
a. If legacy interwork is allowed, then mark all pages in executable
PT_LOAD segments in legacy code page bitmap. Failure of legacy code
page bitmap allocation causes an error.
b. If legacy interwork isn't allowed, it causes an error.
GNU_PROPERTY_X86_FEATURE_1_SHSTK is added to GNU program property to
indicate that all executable sections are compatible with SHSTK where
return address popped from shadow stack always matches return address
popped from normal stack. Linker sets GNU_PROPERTY_X86_FEATURE_1_SHSTK
on output only if it is set on all relocatable inputs.
On a SHSTK capable processor, the following steps should be taken:
1. When loading an executable without an interpreter, enable SHSTK if
GNU_PROPERTY_X86_FEATURE_1_SHSTK is set on the executable.
2. When loading an executable with an interpreter, enable SHSTK if
GNU_PROPERTY_X86_FEATURE_1_SHSTK is set on interpreter.
a. If GNU_PROPERTY_X86_FEATURE_1_SHSTK isn't set on the executable
or any shared objects loaded via the DT_NEEDED tag, disable SHSTK.
b. Otherwise lock SHSTK.
3. After SHSTK is enabled, it is an error to load a shared object
without GNU_PROPERTY_X86_FEATURE_1_SHSTK.
To enable CET support in glibc, --enable-cet is required to configure
glibc. When CET is enabled, both compiler and assembler must support
CET. Otherwise, it is a configure-time error.
To support CET run-time control,
1. _dl_x86_feature_1 is added to the writable ld.so namespace to indicate
if IBT or SHSTK are enabled at run-time. It should be initialized by
init_cpu_features.
2. For dynamic executables:
a. A l_cet field is added to struct link_map to indicate if IBT or
SHSTK is enabled in an ELF module. _dl_process_pt_note or
_rtld_process_pt_note is called to process PT_NOTE segment for
GNU program property and set l_cet.
b. _dl_open_check is added to check IBT and SHSTK compatibilty when
dlopening a shared object.
3. Replace i386 _dl_runtime_resolve and _dl_runtime_profile with
_dl_runtime_resolve_shstk and _dl_runtime_profile_shstk, respectively if
SHSTK is enabled.
CET run-time control can be changed via GLIBC_TUNABLES with
$ export GLIBC_TUNABLES=glibc.tune.x86_shstk=[permissive|on|off]
$ export GLIBC_TUNABLES=glibc.tune.x86_ibt=[permissive|on|off]
1. permissive: SHSTK is disabled when dlopening a legacy ELF module.
2. on: IBT or SHSTK are always enabled, regardless if there are IBT or
SHSTK bits in GNU program property.
3. off: IBT or SHSTK are always disabled, regardless if there are IBT or
SHSTK bits in GNU program property.
<cet.h> from CET-enabled GCC is automatically included by assembly codes
to add GNU_PROPERTY_X86_FEATURE_1_IBT and GNU_PROPERTY_X86_FEATURE_1_SHSTK
to GNU program property. _CET_ENDBR is added at the entrance of all
assembly functions whose address may be taken. _CET_NOTRACK is used to
insert NOTRACK prefix with indirect jump table to support IBT. It is
defined as notrack when _CET_NOTRACK is defined in <cet.h>.
[BZ #21598]
* configure.ac: Add --enable-cet.
* configure: Regenerated.
* elf/Makefille (all-built-dso): Add a comment.
* elf/dl-load.c (filebuf): Moved before "dynamic-link.h".
Include <dl-prop.h>.
(_dl_map_object_from_fd): Call _dl_process_pt_note on PT_NOTE
segment.
* elf/dl-open.c: Include <dl-prop.h>.
(dl_open_worker): Call _dl_open_check.
* elf/rtld.c: Include <dl-prop.h>.
(dl_main): Call _rtld_process_pt_note on PT_NOTE segment. Call
_rtld_main_check.
* sysdeps/generic/dl-prop.h: New file.
* sysdeps/i386/dl-cet.c: Likewise.
* sysdeps/unix/sysv/linux/x86/cpu-features.c: Likewise.
* sysdeps/unix/sysv/linux/x86/dl-cet.h: Likewise.
* sysdeps/x86/cet-tunables.h: Likewise.
* sysdeps/x86/check-cet.awk: Likewise.
* sysdeps/x86/configure: Likewise.
* sysdeps/x86/configure.ac: Likewise.
* sysdeps/x86/dl-cet.c: Likewise.
* sysdeps/x86/dl-procruntime.c: Likewise.
* sysdeps/x86/dl-prop.h: Likewise.
* sysdeps/x86/libc-start.h: Likewise.
* sysdeps/x86/link_map.h: Likewise.
* sysdeps/i386/dl-trampoline.S (_dl_runtime_resolve): Add
_CET_ENDBR.
(_dl_runtime_profile): Likewise.
(_dl_runtime_resolve_shstk): New.
(_dl_runtime_profile_shstk): Likewise.
* sysdeps/linux/x86/Makefile (sysdep-dl-routines): Add dl-cet
if CET is enabled.
(CFLAGS-.o): Add -fcf-protection if CET is enabled.
(CFLAGS-.os): Likewise.
(CFLAGS-.op): Likewise.
(CFLAGS-.oS): Likewise.
(asm-CPPFLAGS): Add -fcf-protection -include cet.h if CET
is enabled.
(tests-special): Add $(objpfx)check-cet.out.
(cet-built-dso): New.
(+$(cet-built-dso:=.note)): Likewise.
(common-generated): Add $(cet-built-dso:$(common-objpfx)%=%.note).
($(objpfx)check-cet.out): New.
(generated): Add check-cet.out.
* sysdeps/x86/cpu-features.c: Include <dl-cet.h> and
<cet-tunables.h>.
(TUNABLE_CALLBACK (set_x86_ibt)): New prototype.
(TUNABLE_CALLBACK (set_x86_shstk)): Likewise.
(init_cpu_features): Call get_cet_status to check CET status
and update dl_x86_feature_1 with CET status. Call
TUNABLE_CALLBACK (set_x86_ibt) and TUNABLE_CALLBACK
(set_x86_shstk). Disable and lock CET in libc.a.
* sysdeps/x86/cpu-tunables.c: Include <cet-tunables.h>.
(TUNABLE_CALLBACK (set_x86_ibt)): New function.
(TUNABLE_CALLBACK (set_x86_shstk)): Likewise.
* sysdeps/x86/sysdep.h (_CET_NOTRACK): New.
(_CET_ENDBR): Define if not defined.
(ENTRY): Add _CET_ENDBR.
* sysdeps/x86/dl-tunables.list (glibc.tune): Add x86_ibt and
x86_shstk.
* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve): Add
_CET_ENDBR.
(_dl_runtime_profile): Likewise.
Save and restore shadow stack pointer in setjmp and longjmp to support
shadow stack in Intel CET. Use feature_1 in tcbhead_t to check if
shadow stack is enabled before saving and restoring shadow stack pointer.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* sysdeps/i386/__longjmp.S: Include <jmp_buf-ssp.h>.
(__longjmp): Restore shadow stack pointer if shadow stack is
enabled, SHADOW_STACK_POINTER_OFFSET is defined and __longjmp
isn't defined for __longjmp_cancel.
* sysdeps/i386/bsd-_setjmp.S: Include <jmp_buf-ssp.h>.
(_setjmp): Save shadow stack pointer if shadow stack is enabled
and SHADOW_STACK_POINTER_OFFSET is defined.
* sysdeps/i386/bsd-setjmp.S: Include <jmp_buf-ssp.h>.
(setjmp): Save shadow stack pointer if shadow stack is enabled
and SHADOW_STACK_POINTER_OFFSET is defined.
* sysdeps/i386/setjmp.S: Include <jmp_buf-ssp.h>.
(__sigsetjmp): Save shadow stack pointer if shadow stack is
enabled and SHADOW_STACK_POINTER_OFFSET is defined.
* sysdeps/unix/sysv/linux/i386/____longjmp_chk.S: Include
<jmp_buf-ssp.h>.
(____longjmp_chk): Restore shadow stack pointer if shadow stack
is enabled and SHADOW_STACK_POINTER_OFFSET is defined.
* sysdeps/unix/sysv/linux/x86/Makefile (gen-as-const-headers):
Remove jmp_buf-ssp.sym.
* sysdeps/unix/sysv/linux/x86_64/____longjmp_chk.S: Include
<jmp_buf-ssp.h>.
(____longjmp_chk): Restore shadow stack pointer if shadow stack
is enabled and SHADOW_STACK_POINTER_OFFSET is defined.
* sysdeps/x86/Makefile (gen-as-const-headers): Add
jmp_buf-ssp.sym.
* sysdeps/x86/jmp_buf-ssp.sym: New dummy file.
* sysdeps/x86_64/__longjmp.S: Include <jmp_buf-ssp.h>.
(__longjmp): Restore shadow stack pointer if shadow stack is
enabled, SHADOW_STACK_POINTER_OFFSET is defined and __longjmp
isn't defined for __longjmp_cancel.
* sysdeps/x86_64/setjmp.S: Include <jmp_buf-ssp.h>.
(__sigsetjmp): Save shadow stack pointer if shadow stack is
enabled and SHADOW_STACK_POINTER_OFFSET is defined.
feature_1 has X86_FEATURE_1_IBT and X86_FEATURE_1_SHSTK bits for CET
run-time control.
CET_ENABLED, IBT_ENABLED and SHSTK_ENABLED are defined to 1 or 0 to
indicate that if CET, IBT and SHSTK are enabled.
<tls-setup.h> is added to set up thread-local data.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
[BZ #22563]
* nptl/pthread_create.c: Include <tls-setup.h>.
(__pthread_create_2_1): Call tls_setup_tcbhead.
* sysdeps/generic/tls-setup.h: New file.
* sysdeps/x86/nptl/tls-setup.h: Likewise.
* sysdeps/i386/nptl/tcb-offsets.sym (FEATURE_1_OFFSET): New.
* sysdeps/x86_64/nptl/tcb-offsets.sym (FEATURE_1_OFFSET):
Likewise.
* sysdeps/i386/nptl/tls.h (tcbhead_t): Rename __glibc_reserved1
to feature_1.
* sysdeps/x86_64/nptl/tls.h (tcbhead_t): Likewise.
* sysdeps/x86/sysdep.h (X86_FEATURE_1_IBT): New.
(X86_FEATURE_1_SHSTK): Likewise.
(CET_ENABLED): Likewise.
(IBT_ENABLED): Likewise.
(SHSTK_ENABLED): Likewise.
sysdeps/i386/nptl/tls.h has
typedef struct
{
void *tcb; /* Pointer to the TCB. Not necessarily the
thread descriptor used by libpthread. */
dtv_t *dtv;
void *self; /* Pointer to the thread descriptor. */
int multiple_threads;
uintptr_t sysinfo;
uintptr_t stack_guard;
uintptr_t pointer_guard;
int gscope_flag;
int __glibc_reserved1;
/* Reservation of some values for the TM ABI. */
void *__private_tm[4];
/* GCC split stack support. */
void *__private_ss;
} tcbhead_t;
The offset of __private_ss is 0x34. But GCC defines
/* We steal the last transactional memory word. */
#define TARGET_THREAD_SPLIT_STACK_OFFSET 0x30
and libgcc/config/i386/morestack.S has
cmpl %gs:0x30,%eax # See if we have enough space.
movl %eax,%gs:0x30 # Save the new stack boundary.
movl %eax,%gs:0x30 # Save the new stack boundary.
movl %ecx,%gs:0x30 # Save new stack boundary.
movl %eax,%gs:0x30
movl %gs:0x30,%eax
movl %eax,%gs:0x30
Since update TARGET_THREAD_SPLIT_STACK_OFFSET changes split stack ABI,
this patch updates tcbhead_t to match GCC.
[BZ #23250]
[BZ #10686]
* sysdeps/i386/nptl/tls.h (tcbhead_t): Change __private_tm[4]
to _private_tm[3] and add __glibc_reserved2.
Add _Static_assert of offset of __private_ss == 0x30.
* sysdeps/x86_64/nptl/tls.h: Add _Static_assert of offset of
__private_ss == 0x40 for ILP32 and == 0x70 for LP64.
Due to the way the conditions were written, the rtld build of strncmp
ended up with no definition of the strncmp symbol at all: The
implementations were renamed for use within an IFUNC resolver, but the
IFUNC resolver itself was missing (because rtld does not use IFUNCs).
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
_init and _fini are special functions provided by glibc for linker to
define DT_INIT and DT_FINI in executable and shared library. They
should never be put in dynamic symbol table. This patch marks them as
hidden to remove them from dynamic symbol table.
Tested with build-many-glibcs.py.
[BZ #23145]
* elf/Makefile (tests-special): Add $(objpfx)check-initfini.out.
($(all-built-dso:=.dynsym): New target.
(common-generated): Add $(all-built-dso:$(common-objpfx)%=%.dynsym).
($(objpfx)check-initfini.out): New target.
(generated): Add check-initfini.out.
* scripts/check-initfini.awk: New file.
* sysdeps/aarch64/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/alpha/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/arm/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/hppa/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/i386/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/ia64/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/m68k/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/microblaze/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/mips/mips32/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/mips/mips64/n32/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/mips/mips64/n64/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/nios2/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/powerpc/powerpc32/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/powerpc/powerpc64/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/s390/s390-32/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/s390/s390-64/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/sh/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/sparc/crti.S (_init): Mark as hidden.
(_fini): Likewise.
* sysdeps/x86_64/crti.S (_init): Mark as hidden.
(_fini): Likewise.
Optimize x86-64 strcmp/wcscmp and strncmp/wcsncmp with AVX2. It uses vector
comparison as much as possible. Peak performance observed on a SkyLake
machine: 9x, 3x, 2.5x and 5.5x for strcmp, strncmp, wcscmp and wcsncmp,
respectively. The larger the comparison length, the more benefit using
avx2 functions, except on the strcmp, where peak is observed at length
== 32 bytes. Select AVX2 strcmp/wcscmp on AVX2 machines where vzeroupper
is preferred and AVX unaligned load is fast.
NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input. TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
strcmp-avx2, strncmp-avx2, wcscmp-avx2, wcscmp-sse2, wcsncmp-avx2 and
wcsncmp-sse2.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add tests for __strcmp_avx2,
__strncmp_avx2, __wcscmp_avx2, __wcsncmp_avx2, __wcscmp_sse2
and __wcsncmp_sse2.
* sysdeps/x86_64/multiarch/strcmp.c (OPTIMIZE (avx2)):
(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if
AVX unaligned load is fast and vzeroupper is preferred.
* sysdeps/x86_64/multiarch/strncmp.c: Likewise.
* sysdeps/x86_64/multiarch/strcmp-avx2.S: New file.
* sysdeps/x86_64/multiarch/strncmp-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/wcscmp-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/wcscmp-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/wcscmp.c: Likewise.
* sysdeps/x86_64/multiarch/wcsncmp-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/wcsncmp-sse2.c: Likewise.
* sysdeps/x86_64/multiarch/wcsncmp.c: Likewise.
* sysdeps/x86_64/wcscmp.S (__wcscmp): Add alias only if __wcscmp
is undefined.
This patch skips zero length in __mempcpy_erms, __memmove_erms and
__memset_erms.
Tested on x86-64.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
(__mempcpy_erms): Skip zero length.
(__memmove_erms): Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
(__memset_erms): Likewise.
Although the REP MOVSB implementations of memmove, memcpy and mempcpy
aren't used by the current processors, this patch adds Prefer_FSRM
check in ifunc-memmove.h so that they can be used in the future.
* sysdeps/x86/cpu-features.h (bit_arch_Prefer_FSRM): New.
(index_arch_Prefer_FSRM): Likewise.
* sysdeps/x86/cpu-tunables.c (TUNABLE_CALLBACK (set_hwcaps)):
Also check Prefer_FSRM.
* sysdeps/x86_64/multiarch/ifunc-memmove.h (IFUNC_SELECTOR):
Also return OPTIMIZE (erms) for Prefer_FSRM.
Unlike i386, we can call hidden IFUNC functions inside libc.so since
x86-64 PLT is always PIC.
Tested on x86-64.
* sysdeps/x86_64/multiarch/strncat-c.c (STRNCAT_PRIMARY): Removed.
Include <string/strncat.c>.
* sysdeps/x86_64/multiarch/strncat.c (__strncat): New strong
alias.
(__GI___strncat): New hidden alias.
Since the result of testl is never used, this patch removes it.
Tested on 64-bit AVX2 machine.
* sysdeps/x86_64/multiarch/strlen-avx2.S (STRLEN): Remove the
unnecessary testl.
When MEMSET_SYMBOL (__memset, erms) is provided for debugger, mark it
as hidden so that it will be local to the library.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
(MEMSET_SYMBOL (__memset, erms)): Mark the debugger symbol as
hidden.
Wrap symbol address run-time calculation into a macro and use it
throughout, replacing inline calculations.
There are a couple of variants, most of them different in a functionally
insignificant way. Most calculations are right following RESOLVE_MAP,
at which point either the map or the symbol returned can be checked for
validity as the macro sets either both or neither. In some places both
the symbol and the map has to be checked however.
My initial implementation therefore always checked both, however that
resulted in code larger by as much as 0.3%, as many places know from
elsewhere that no check is needed. I have decided the size growth was
unacceptable.
Having looked closer I realized that it's the map that is the culprit.
Therefore I have modified LOOKUP_VALUE_ADDRESS to accept an additional
boolean argument telling it to access the map without checking it for
validity. This in turn has brought quite nice results, with new code
actually being smaller for i686, and MIPS o32, n32 and little-endian n64
targets, unchanged in size for x86-64 and, unusually, marginally larger
for big-endian MIPS n64, as follows:
i686:
text data bss dec hex filename
152255 4052 192 156499 26353 ld-2.27.9000-base.so
152159 4052 192 156403 262f3 ld-2.27.9000-elf-symbol-value.so
MIPS/o32/el:
text data bss dec hex filename
142906 4396 260 147562 2406a ld-2.27.9000-base.so
142890 4396 260 147546 2405a ld-2.27.9000-elf-symbol-value.so
MIPS/n32/el:
text data bss dec hex filename
142267 4404 260 146931 23df3 ld-2.27.9000-base.so
142171 4404 260 146835 23d93 ld-2.27.9000-elf-symbol-value.so
MIPS/n64/el:
text data bss dec hex filename
149835 7376 408 157619 267b3 ld-2.27.9000-base.so
149787 7376 408 157571 26783 ld-2.27.9000-elf-symbol-value.so
MIPS/o32/eb:
text data bss dec hex filename
142870 4396 260 147526 24046 ld-2.27.9000-base.so
142854 4396 260 147510 24036 ld-2.27.9000-elf-symbol-value.so
MIPS/n32/eb:
text data bss dec hex filename
142019 4404 260 146683 23cfb ld-2.27.9000-base.so
141923 4404 260 146587 23c9b ld-2.27.9000-elf-symbol-value.so
MIPS/n64/eb:
text data bss dec hex filename
149763 7376 408 157547 2676b ld-2.27.9000-base.so
149779 7376 408 157563 2677b ld-2.27.9000-elf-symbol-value.so
x86-64:
text data bss dec hex filename
148462 6452 400 155314 25eb2 ld-2.27.9000-base.so
148462 6452 400 155314 25eb2 ld-2.27.9000-elf-symbol-value.so
[BZ #19818]
* sysdeps/generic/ldsodefs.h (LOOKUP_VALUE_ADDRESS): Add `set'
parameter.
(SYMBOL_ADDRESS): New macro.
[!ELF_FUNCTION_PTR_IS_SPECIAL] (DL_SYMBOL_ADDRESS): Use
SYMBOL_ADDRESS for symbol address calculation.
* elf/dl-runtime.c (_dl_fixup): Likewise.
(_dl_profile_fixup): Likewise.
* elf/dl-symaddr.c (_dl_symbol_address): Likewise.
* elf/rtld.c (dl_main): Likewise.
* sysdeps/aarch64/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/alpha/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/arm/dl-machine.h (elf_machine_rel): Likewise.
(elf_machine_rela): Likewise.
* sysdeps/hppa/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/hppa/dl-symaddr.c (_dl_symbol_address): Likewise.
* sysdeps/i386/dl-machine.h (elf_machine_rel): Likewise.
(elf_machine_rela): Likewise.
* sysdeps/ia64/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/m68k/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/microblaze/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/mips/dl-machine.h (ELF_MACHINE_BEFORE_RTLD_RELOC):
Likewise.
(elf_machine_reloc): Likewise.
(elf_machine_got_rel): Likewise.
* sysdeps/mips/dl-trampoline.c (__dl_runtime_resolve): Likewise.
* sysdeps/nios2/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/powerpc/powerpc32/dl-machine.h (elf_machine_rela):
Likewise.
* sysdeps/powerpc/powerpc64/dl-machine.h (elf_machine_rela):
Likewise.
* sysdeps/riscv/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/s390/s390-32/dl-machine.h (elf_machine_rela):
Likewise.
* sysdeps/s390/s390-64/dl-machine.h (elf_machine_rela):
Likewise.
* sysdeps/sh/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/sparc/sparc32/dl-machine.h (elf_machine_rela):
Likewise.
* sysdeps/sparc/sparc64/dl-machine.h (elf_machine_rela):
Likewise.
* sysdeps/tile/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/x86_64/dl-machine.h (elf_machine_rela): Likewise.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This series of patches removes the slow patchs from sin, cos and sincos.
Besides greatly simplifying the implementation, the new version is also much
faster for inputs up to PI (41% faster) and for large inputs needing range
reduction (27% faster).
ULP is ~0.55 with no errors found after testing 1.6 billion inputs across most
of the range with mpsin and mpcos. The number of incorrectly rounded results
(ie. ULP >0.5) is at most ~2750 per million inputs between 0.125 and 0.5,
the average is ~850 per million between 0 and PI.
Tested on AArch64 and x86_64 with no regressions.
The first patch removes the slow paths for the cases where the input is small
and doesn't require range reduction. Update ULP tables for sin, cos and sincos
on AArch64 and x86_64.
* sysdeps/aarch64/libm-test-ulps: Update ULP for sin, cos, sincos.
* sysdeps/ieee754/dbl-64/s_sin.c (__sin): Remove slow paths for small
inputs.
(__cos): Likewise.
* sysdeps/x86_64/fpu/libm-test-ulps: Update ULP for sin, cos, sincos.
No glibc configuration uses the present debug/backtrace.c, whereas
several #include the x86_64 version. The x86_64 version is
effectively a generic one (using _Unwind_Backtrace from libgcc, which
works much more reliably than the built-in functions used by
debug/backtrace.c). This patch moves it to debug/backtrace.c and
removes all the #includes of the x86_64 version from other
architectures which are no longer required.
I do not know whether all the other architecture-specific backtrace
implementations that are based on _Unwind_Backtrace are required, or
whether, where their differences from the generic version do something
useful, suitable hooks could be added to the generic version to reduce
the duplication involved.
Tested with build-many-glibcs.py that installed stripped shared
libraries are unchanged by this patch.
* sysdeps/x86_64/backtrace.c: Move to ....
* debug/backtrace.c: ... here.
* sysdeps/aarch64/backtrace.c: Remove file.
* sysdeps/alpha/backtrace.c: Likewise.
* sysdeps/hppa/backtrace.c: Likewise.
* sysdeps/ia64/backtrace.c: Likewise.
* sysdeps/mips/backtrace.c: Likewise.
* sysdeps/nios2/backtrace.c: Likewise.
* sysdeps/riscv/backtrace.c: Likewise.
* sysdeps/sh/backtrace.c: Likewise.
* sysdeps/tile/backtrace.c: Likewise.
Remove the now unused target specific__ieee754_sqrt(f/l) inlines.
Also remove inlines of sqrt which are for really old GCC versions.
Removing these is desirable, under the general principle of leaving
such inlining to the compiler rather than trying to do it in installed
headers, especially when only very old compilers are affected.
Note that removing inlines for __ieee754_sqrt disables inlining in the
sqrt wrapper functions. Given the sqrt function will typically only be
called for negative arguments, it doesn't matter whether the inlining
happens or not.
* sysdeps/aarch64/fpu/math_private.h (__ieee754_sqrt): Remove.
(__ieee754_sqrtf): Remove.
* sysdeps/alpha/fpu/math_private.h (__ieee754_sqrt): Remove.
(__ieee754_sqrtf): Remove.
* sysdeps/generic/math-type-macros.h (M_SQRT): Use sqrt.
* sysdeps/m68k/m680x0/fpu/mathimpl.h (__ieee754_sqrt): Remove.
* sysdeps/powerpc/fpu/math_private.h (__ieee754_sqrt): Remove.
(__ieee754_sqrtf): Remove.
* sysdeps/s390/fpu/bits/mathinline.h: Remove file.
* sysdeps/sparc/fpu/bits/mathinline.h (sqrt) Remove.
(sqrtf): Remove.
(sqrtl): Remove.
(__ieee754_sqrt): Remove.
(__ieee754_sqrtf): Remove.
(__ieee754_sqrtl): Remove.
* sysdeps/m68k/m680x0/fpu/mathimpl.h (__ieee754_sqrt): Remove.
* sysdeps/x86/fpu/math_private.h (__ieee754_sqrt): Remove.
* sysdeps/x86_64/fpu/math_private.h (__ieee754_sqrt): Remove.
(__ieee754_sqrtf): Remove.
(__ieee754_sqrtl): Remove.
Remove the slow paths from pow. Like several other double precision math
functions, pow is exactly rounded. This is not required from math functions
and causes major overheads as it requires multiple fallbacks using higher
precision arithmetic if a result is close to 0.5ULP. Ridiculous slowdowns
of up to 100000x have been reported when the highest precision path triggers.
All GLIBC math tests pass on AArch64 and x64 (with ULP of pow set to 1).
The worst case error is ~0.506ULP. A simple test over a few hundred million
values shows pow is 10% faster on average. This fixes BZ #13932.
[BZ #13932]
* sysdeps/ieee754/dbl-64/uexp.h (err_1): Remove.
* benchtests/pow-inputs: Update comment for slow path cases.
* manual/probes.texi (slowpow_p10): Delete removed probe.
(slowpow_p10): Likewise.
* math/Makefile: Remove halfulp.c and slowpow.c.
* sysdeps/aarch64/libm-test-ulps: Set ULP of pow to 1.
* sysdeps/generic/math_private.h (__exp1): Remove error argument.
(__halfulp): Remove.
(__slowpow): Remove.
* sysdeps/i386/fpu/halfulp.c: Delete file.
* sysdeps/i386/fpu/slowpow.c: Likewise.
* sysdeps/ia64/fpu/halfulp.c: Likewise.
* sysdeps/ia64/fpu/slowpow.c: Likewise.
* sysdeps/ieee754/dbl-64/e_exp.c (__exp1): Remove error argument,
improve comments and add error analysis.
* sysdeps/ieee754/dbl-64/e_pow.c (__ieee754_pow): Add error analysis.
(power1): Remove function:
(log1): Remove error argument, add error analysis.
(my_log2): Remove function.
* sysdeps/ieee754/dbl-64/halfulp.c: Delete file.
* sysdeps/ieee754/dbl-64/slowpow.c: Likewise.
* sysdeps/m68k/m680x0/fpu/halfulp.c: Likewise.
* sysdeps/m68k/m680x0/fpu/slowpow.c: Likewise.
* sysdeps/powerpc/power4/fpu/Makefile: Remove CPPFLAGS-slowpow.c.
* sysdeps/x86_64/fpu/libm-test-ulps: Set ULP of pow to 1.
* sysdeps/x86_64/fpu/multiarch/Makefile: Remove slowpow-fma.c,
slowpow-fma4.c, halfulp-fma.c, halfulp-fma4.c.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma.c (__slowpow): Remove define.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma4.c (__slowpow): Likewise.
* sysdeps/x86_64/fpu/multiarch/halfulp-fma.c: Delete file.
* sysdeps/x86_64/fpu/multiarch/halfulp-fma4.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowpow-fma.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowpow-fma4.c: Likewise.
In commit cba595c350 and commit
f81ddabffd, ABI compatibility with
applications was broken by increasing the size of the on-stack
allocated __pthread_unwind_buf_t beyond the oringal size.
Applications only have the origianl space available for
__pthread_unwind_register, and __pthread_unwind_next to use,
any increase in the size of __pthread_unwind_buf_t causes these
functions to write beyond the original structure into other
on-stack variables leading to segmentation faults in common
applications like vlc. The only workaround is to version those
functions which operate on the old sized objects, but this must
happen in glibc 2.28.
Thank you to Andrew Senkevich, H.J. Lu, and Aurelien Jarno, for
submitting reports and tracking the issue down.
The commit reverts the above mentioned commits and testing on
x86_64 shows that the ABI compatibility is restored. A tst-cleanup1
regression test linked with an older glibc now passes when run
with the newly built glibc. Previously a tst-cleanup1 linked with
an older glibc would segfault when run with an affected glibc build.
Tested on x86_64 with no regressions.
Signed-off-by: Carlos O'Donell <carlos@redhat.com>
_dl_runtime_profile calls _dl_call_pltexit, passing a pointer to
La_x86_64_retval which is allocated on stack. The lrv_vector0
field in La_x86_64_retval must be aligned to size of vector register.
When allocating stack space for La_x86_64_retval, we need to make sure
that the address of La_x86_64_retval + RV_VECTOR0_OFFSET is aligned to
VEC_SIZE. This patch checks the alignment of the lrv_vector0 field
and pads the stack space if needed.
Tested with x32 and x86-64 on SSE4, AVX and AVX512 machines. It fixed
FAIL: elf/tst-audit10
FAIL: elf/tst-audit4
FAIL: elf/tst-audit5
FAIL: elf/tst-audit6
FAIL: elf/tst-audit7
on x32 AVX512 machine.
[BZ #22715]
* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_profile): Properly
align La_x86_64_retval to VEC_SIZE.
The x86_64 backtrace implementation is used as a generic
implementation (unwinding via unwind info and _Unwind_Backtrace) by
various other architectures. This patch makes it more generic by
making it use LIBGCC_S_SO from gnu/lib-names.h instead of hardcoding
the libgcc_s.so.1 name, so that it can also be used on hppa which uses
libgcc_s.so.4.
Tested for x86_64.
* sysdeps/x86_64/backtrace.c: Include <gnu/lib-names.h>.
(init): Use LIBGCC_S_SO not hardcoded "libgcc_s.so.1".
Since the x86-64 assembly version of sincosf is higly optimized with
vector instructions, there isn't much room for improvement. However
s_sincosf.c written in C with vector math and intrinsics can be
optimized by GCC with FMA.
On Skylake, bench-sincosf reports performance improvement:
Assembly FMA improvement
max 104.042 101.008 3%
min 9.426 8.586 10%
mean 20.6209 18.2238 13%
* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
Add s_sincosf-sse2 and s_sincosf-fma.
(CFLAGS-s_sincosf-fma.c): New.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c: New file.
* sysdeps/x86_64/fpu/multiarch/s_sincosf-sse2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_sincosf.c: Likewise.
* sysdeps/x86_64/fpu/s_sincosf.S: Don't add alias if
__sincosf is defined.
These changes will be active for all platforms that don't provide
their own exp() routines. They will also be active for ieee754
versions of ccos, ccosh, cosh, csin, csinh, sinh, exp10, gamma, and
erf.
Typical performance gains is typically around 5x when measured on
Sparc s7 for common values between exp(1) and exp(40).
Using the glibc perf tests on sparc,
sparc (nsec) x86 (nsec)
old new old new
max 17629 395 5173 144
min 399 54 15 13
mean 5317 200 1349 23
The extreme max times for the old (ieee754) exp are due to the
multiprecision computation in the old algorithm when the true value is
very near 0.5 ulp away from an value representable in double
precision. The new algorithm does not take special measures for those
cases. The current glibc exp perf tests overrepresent those values.
Informal testing suggests approximately one in 200 cases might
invoke the high cost computation. The performance advantage of the new
algorithm for other values is still large but not as large as indicated
by the chart above.
Glibc correctness tests for exp() and expf() were run. Within the
test suite 3 input values were found to cause 1 bit differences (ulp)
when "FE_TONEAREST" rounding mode is set. No differences in exp() were
seen for the tested values for the other rounding modes.
Typical example:
exp(-0x1.760cd2p+0) (-1.46113312244415283203125)
new code: 2.31973271630014299393707e-01 0x1.db14cd799387ap-3
old code: 2.31973271630014271638132e-01 0x1.db14cd7993879p-3
exp = 2.31973271630014285508337 (high precision)
Old delta: off by 0.49 ulp
New delta: off by 0.51 ulp
In addition, because ieee754_exp() is used by other routines, cexp()
showed test results with very small imaginary input values where the
imaginary portion of the result was off by 3 ulp when in upward
rounding mode, but not in the other rounding modes. For x86, tgamma
showed a few values where the ulp increased to 6 (max ulp for tgamma
is 5). Sparc tgamma did not show these failures. I presume the tgamma
differences are due to compiler optimization differences within the
gamma function.The gamma function is known to be difficult to compute
accurately.
* sysdeps/ieee754/dbl-64/e_exp.c: Include <math-svid-compat.h> and
<errno.h>. Include "eexp.tbl".
(half): New constant.
(one): Likewise.
(__ieee754_exp): Rewrite.
(__slowexp): Remove prototype.
* sysdeps/ieee754/dbl-64/eexp.tbl: New file.
* sysdeps/ieee754/dbl-64/slowexp.c: Remove file.
* sysdeps/i386/fpu/slowexp.c: Likewise.
* sysdeps/ia64/fpu/slowexp.c: Likewise.
* sysdeps/m68k/m680x0/fpu/slowexp.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowexp-avx.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowexp-fma.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowexp-fma4.c: Likewise.
* sysdeps/generic/math_private.h (__slowexp): Remove prototype.
* sysdeps/ieee754/dbl-64/e_pow.c: Remove mention of slowexp.c in
comment.
* sysdeps/powerpc/power4/fpu/Makefile [$(subdir) = math]
(CPPFLAGS-slowexp.c): Remove variable.
* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
Remove slowexp-fma, slowexp-fma4 and slowexp-avx.
(CFLAGS-slowexp-fma.c): Remove variable.
(CFLAGS-slowexp-fma4.c): Likewise.
(CFLAGS-slowexp-avx.c): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_exp-avx.c (__slowexp): Do not
define as macro.
* sysdeps/x86_64/fpu/multiarch/e_exp-fma.c (__slowexp): Likewise.
* sysdeps/x86_64/fpu/multiarch/e_exp-fma4.c (__slowexp): Likewise.
* math/Makefile (type-double-routines): Remove slowexp.
* manual/probes.texi (slowexp_p6): Remove.
(slowexp_p32): Likewise.
On x86, padding in struct __jmp_buf_tag is used for shadow stack pointer
to support Shadow Stack in Intel Control-flow Enforcemen Technology.
cancel_jmp_buf has been updated to include saved_mask so that it is as
large as struct __jmp_buf_tag. We must suport the old cancel_jmp_buf
in existing binaries. Since symbol versioning doesn't work on
cancel_jmp_buf, feature_1 is added to tcbhead_t so that setjmp and
longjmp can check if shadow stack is enabled. NB: Shadow stack is
enabled only if all modules are shadow stack enabled.
[BZ #22563]
* sysdeps/i386/nptl/tcb-offsets.sym (FEATURE_1_OFFSET): New.
* sysdeps/i386/nptl/tls.h (tcbhead_t): Add feature_1.
* sysdeps/x86_64/nptl/tcb-offsets.sym (FEATURE_1_OFFSET): New.
* sysdeps/x86_64/nptl/tls.h (tcbhead_t): Rename __glibc_unused1
to feature_1.
Static PIE extends address space layout randomization to static
executables. It provides additional security hardening benefits at
the cost of some memory and performance.
Dynamic linker, ld.so, is a standalone program which can be loaded at
any address. This patch adds a configure option, --enable-static-pie,
to embed the part of ld.so in static executable to create static position
independent executable (static PIE). A static PIE is similar to static
executable, but can be loaded at any address without help from a dynamic
linker. When --enable-static-pie is used to configure glibc, libc.a is
built as PIE and all static executables, including tests, are built as
static PIE. The resulting libc.a can be used together with GCC 8 or
above to build static PIE with the compiler option, -static-pie. But
GCC 8 isn't required to build glibc with --enable-static-pie. Only GCC
with PIE support is needed. When an older GCC is used to build glibc
with --enable-static-pie, proper input files are passed to linker to
create static executables as static PIE, together with "-z text" to
prevent dynamic relocations in read-only segments, which are not allowed
in static PIE.
The following changes are made for static PIE:
1. Add a new function, _dl_relocate_static_pie, to:
a. Get the run-time load address.
b. Read the dynamic section.
c. Perform dynamic relocations.
Dynamic linker also performs these steps. But static PIE doesn't load
any shared objects.
2. Call _dl_relocate_static_pie at entrance of LIBC_START_MAIN in
libc.a. crt1.o, which is used to create dynamic and non-PIE static
executables, is updated to include a dummy _dl_relocate_static_pie.
rcrt1.o is added to create static PIE, which will link in the real
_dl_relocate_static_pie. grcrt1.o is also added to create static PIE
with -pg. GCC 8 has been updated to support rcrt1.o and grcrt1.o for
static PIE.
Static PIE can work on all architectures which support PIE, provided:
1. Target must support accessing of local functions without dynamic
relocations, which is needed in start.S to call __libc_start_main with
function addresses of __libc_csu_init, __libc_csu_fini and main. All
functions in static PIE are local functions. If PIE start.S can't reach
main () defined in a shared object, the code sequence:
pass address of local_main to __libc_start_main
...
local_main:
tail call to main via PLT
can be used.
2. start.S is updated to check PIC instead SHARED for PIC code path and
avoid dynamic relocation, when PIC is defined and SHARED isn't defined,
to support static PIE.
3. All assembly codes are updated check PIC instead SHARED for PIC code
path to avoid dynamic relocations in read-only sections.
4. All assembly codes are updated check SHARED instead PIC for static
symbol name.
5. elf_machine_load_address in dl-machine.h are updated to support static
PIE.
6. __brk works without TLS nor dynamic relocations in read-only section
so that it can be used by __libc_setup_tls to initializes TLS in static
PIE.
NB: When glibc is built with GCC defaulted to PIE, libc.a is compiled
with -fPIE, regardless if --enable-static-pie is used to configure glibc.
When glibc is configured with --enable-static-pie, libc.a is compiled
with -fPIE, regardless whether GCC defaults to PIE or not. The same
libc.a can be used to build both static executable and static PIE.
There is no need for separate PIE copy of libc.a.
On x86-64, the normal static sln:
text data bss dec hex filename
625425 8284 5456 639165 9c0bd elf/sln
the static PIE sln:
text data bss dec hex filename
657626 20636 5392 683654 a6e86 elf/sln
The code size is increased by 5% and the binary size is increased by 7%.
Linker requirements to build glibc with --enable-static-pie:
1. Linker supports --no-dynamic-linker to remove PT_INTERP segment from
static PIE.
2. Linker can create working static PIE. The x86-64 linker needs the
fix for
https://sourceware.org/bugzilla/show_bug.cgi?id=21782
The i386 linker needs to be able to convert "movl main@GOT(%ebx), %eax"
to "leal main@GOTOFF(%ebx), %eax" if main is defined locally.
Binutils 2.29 or above are OK for i686 and x86-64. But linker status for
other targets need to be verified.
3. Linker should resolve undefined weak symbols to 0 in static PIE:
https://sourceware.org/bugzilla/show_bug.cgi?id=22269
4. Many ELF backend linkers incorrectly check bfd_link_pic for TLS
relocations, which should check bfd_link_executable instead:
https://sourceware.org/bugzilla/show_bug.cgi?id=22263
Tested on aarch64, i686 and x86-64.
Using GCC 7 and binutils master branch, build-many-glibcs.py with
--enable-static-pie with all patches for static PIE applied have the
following build successes:
PASS: glibcs-aarch64_be-linux-gnu build
PASS: glibcs-aarch64-linux-gnu build
PASS: glibcs-armeb-linux-gnueabi-be8 build
PASS: glibcs-armeb-linux-gnueabi build
PASS: glibcs-armeb-linux-gnueabihf-be8 build
PASS: glibcs-armeb-linux-gnueabihf build
PASS: glibcs-arm-linux-gnueabi build
PASS: glibcs-arm-linux-gnueabihf build
PASS: glibcs-arm-linux-gnueabihf-v7a build
PASS: glibcs-arm-linux-gnueabihf-v7a-disable-multi-arch build
PASS: glibcs-m68k-linux-gnu build
PASS: glibcs-microblazeel-linux-gnu build
PASS: glibcs-microblaze-linux-gnu build
PASS: glibcs-mips64el-linux-gnu-n32 build
PASS: glibcs-mips64el-linux-gnu-n32-nan2008 build
PASS: glibcs-mips64el-linux-gnu-n32-nan2008-soft build
PASS: glibcs-mips64el-linux-gnu-n32-soft build
PASS: glibcs-mips64el-linux-gnu-n64 build
PASS: glibcs-mips64el-linux-gnu-n64-nan2008 build
PASS: glibcs-mips64el-linux-gnu-n64-nan2008-soft build
PASS: glibcs-mips64el-linux-gnu-n64-soft build
PASS: glibcs-mips64-linux-gnu-n32 build
PASS: glibcs-mips64-linux-gnu-n32-nan2008 build
PASS: glibcs-mips64-linux-gnu-n32-nan2008-soft build
PASS: glibcs-mips64-linux-gnu-n32-soft build
PASS: glibcs-mips64-linux-gnu-n64 build
PASS: glibcs-mips64-linux-gnu-n64-nan2008 build
PASS: glibcs-mips64-linux-gnu-n64-nan2008-soft build
PASS: glibcs-mips64-linux-gnu-n64-soft build
PASS: glibcs-mipsel-linux-gnu build
PASS: glibcs-mipsel-linux-gnu-nan2008 build
PASS: glibcs-mipsel-linux-gnu-nan2008-soft build
PASS: glibcs-mipsel-linux-gnu-soft build
PASS: glibcs-mips-linux-gnu build
PASS: glibcs-mips-linux-gnu-nan2008 build
PASS: glibcs-mips-linux-gnu-nan2008-soft build
PASS: glibcs-mips-linux-gnu-soft build
PASS: glibcs-nios2-linux-gnu build
PASS: glibcs-powerpc64le-linux-gnu build
PASS: glibcs-powerpc64-linux-gnu build
PASS: glibcs-tilegxbe-linux-gnu-32 build
PASS: glibcs-tilegxbe-linux-gnu build
PASS: glibcs-tilegx-linux-gnu-32 build
PASS: glibcs-tilegx-linux-gnu build
PASS: glibcs-tilepro-linux-gnu build
and the following build failures:
FAIL: glibcs-alpha-linux-gnu build
elf/sln is failed to link due to:
assertion fail bfd/elf64-alpha.c:4125
This is caused by linker bug and/or non-PIC code in PIE libc.a.
FAIL: glibcs-hppa-linux-gnu build
elf/sln is failed to link due to:
collect2: fatal error: ld terminated with signal 11 [Segmentation fault]
https://sourceware.org/bugzilla/show_bug.cgi?id=22537
FAIL: glibcs-ia64-linux-gnu build
elf/sln is failed to link due to:
collect2: fatal error: ld terminated with signal 11 [Segmentation fault]
FAIL: glibcs-powerpc-linux-gnu build
FAIL: glibcs-powerpc-linux-gnu-soft build
FAIL: glibcs-powerpc-linux-gnuspe build
FAIL: glibcs-powerpc-linux-gnuspe-e500v1 build
elf/sln is failed to link due to:
ld: read-only segment has dynamic relocations.
This is caused by linker bug and/or non-PIC code in PIE libc.a. See:
https://sourceware.org/bugzilla/show_bug.cgi?id=22264
FAIL: glibcs-powerpc-linux-gnu-power4 build
elf/sln is failed to link due to:
findlocale.c:96:(.text+0x22c): @local call to ifunc memchr
This is caused by linker bug and/or non-PIC code in PIE libc.a.
FAIL: glibcs-s390-linux-gnu build
elf/sln is failed to link due to:
collect2: fatal error: ld terminated with signal 11 [Segmentation fault], core dumped
assertion fail bfd/elflink.c:14299
This is caused by linker bug and/or non-PIC code in PIE libc.a.
FAIL: glibcs-sh3eb-linux-gnu build
FAIL: glibcs-sh3-linux-gnu build
FAIL: glibcs-sh4eb-linux-gnu build
FAIL: glibcs-sh4eb-linux-gnu-soft build
FAIL: glibcs-sh4-linux-gnu build
FAIL: glibcs-sh4-linux-gnu-soft build
elf/sln is failed to link due to:
ld: read-only segment has dynamic relocations.
This is caused by linker bug and/or non-PIC code in PIE libc.a. See:
https://sourceware.org/bugzilla/show_bug.cgi?id=22263
Also TLS code sequence in SH assembly syscalls in glibc doesn't match TLS
code sequence expected by ld:
https://sourceware.org/bugzilla/show_bug.cgi?id=22270
FAIL: glibcs-sparc64-linux-gnu build
FAIL: glibcs-sparcv9-linux-gnu build
FAIL: glibcs-tilegxbe-linux-gnu build
FAIL: glibcs-tilegxbe-linux-gnu-32 build
FAIL: glibcs-tilegx-linux-gnu build
FAIL: glibcs-tilegx-linux-gnu-32 build
FAIL: glibcs-tilepro-linux-gnu build
elf/sln is failed to link due to:
ld: read-only segment has dynamic relocations.
This is caused by linker bug and/or non-PIC code in PIE libc.a. See:
https://sourceware.org/bugzilla/show_bug.cgi?id=22263
[BZ #19574]
* INSTALL: Regenerated.
* Makeconfig (real-static-start-installed-name): New.
(pic-default): Updated for --enable-static-pie.
(pie-default): New for --enable-static-pie.
(default-pie-ldflag): Likewise.
(+link-static-before-libc): Replace $(DEFAULT-LDFLAGS-$(@F))
with $(if $($(@F)-no-pie),$(no-pie-ldflag),$(default-pie-ldflag)).
Replace $(static-start-installed-name) with
$(real-static-start-installed-name).
(+prectorT): Updated for --enable-static-pie.
(+postctorT): Likewise.
(CFLAGS-.o): Add $(pie-default).
(CFLAGS-.op): Likewise.
* NEWS: Mention --enable-static-pie.
* config.h.in (ENABLE_STATIC_PIE): New.
* configure.ac (--enable-static-pie): New configure option.
(have-no-dynamic-linker): New LIBC_CONFIG_VAR.
(have-static-pie): Likewise.
Enable static PIE if linker supports --no-dynamic-linker.
(ENABLE_STATIC_PIE): New AC_DEFINE.
(enable-static-pie): New LIBC_CONFIG_VAR.
* configure: Regenerated.
* csu/Makefile (omit-deps): Add r$(start-installed-name) and
gr$(start-installed-name) for --enable-static-pie.
(extra-objs): Likewise.
(install-lib): Likewise.
(extra-objs): Add static-reloc.o and static-reloc.os
($(objpfx)$(start-installed-name)): Also depend on
$(objpfx)static-reloc.o.
($(objpfx)r$(start-installed-name)): New.
($(objpfx)g$(start-installed-name)): Also depend on
$(objpfx)static-reloc.os.
($(objpfx)gr$(start-installed-name)): New.
* csu/libc-start.c (LIBC_START_MAIN): Call _dl_relocate_static_pie
in libc.a.
* csu/libc-tls.c (__libc_setup_tls): Add main_map->l_addr to
initimage.
* csu/static-reloc.c: New file.
* elf/Makefile (routines): Add dl-reloc-static-pie.
(elide-routines.os): Likewise.
(DEFAULT-LDFLAGS-tst-tls1-static-non-pie): Removed.
(tst-tls1-static-non-pie-no-pie): New.
* elf/dl-reloc-static-pie.c: New file.
* elf/dl-support.c (_dl_get_dl_main_map): New function.
* elf/dynamic-link.h (ELF_DURING_STARTUP): Also check
STATIC_PIE_BOOTSTRAP.
* elf/get-dynamic-info.h (elf_get_dynamic_info): Likewise.
* gmon/Makefile (tests): Add tst-gmon-static-pie.
(tests-static): Likewise.
(DEFAULT-LDFLAGS-tst-gmon-static): Removed.
(tst-gmon-static-no-pie): New.
(CFLAGS-tst-gmon-static-pie.c): Likewise.
(CRT-tst-gmon-static-pie): Likewise.
(tst-gmon-static-pie-ENV): Likewise.
(tests-special): Likewise.
($(objpfx)tst-gmon-static-pie.out): Likewise.
(clean-tst-gmon-static-pie-data): Likewise.
($(objpfx)tst-gmon-static-pie-gprof.out): Likewise.
* gmon/tst-gmon-static-pie.c: New file.
* manual/install.texi: Document --enable-static-pie.
* sysdeps/generic/ldsodefs.h (_dl_relocate_static_pie): New.
(_dl_get_dl_main_map): Likewise.
* sysdeps/i386/configure.ac: Check if linker supports static PIE.
* sysdeps/x86_64/configure.ac: Likewise.
* sysdeps/i386/configure: Regenerated.
* sysdeps/x86_64/configure: Likewise.
* sysdeps/mips/Makefile (ASFLAGS-.o): Add $(pie-default).
(ASFLAGS-.op): Likewise.
On Skylake, bench-sinf reports performance improvement:
Before After Improvement
max 153.996 100.094 54%
min 8.546 6.852 25%
mean 18.1223 11.802 54%
* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
Add s_sinf-sse2 and s_sinf-fma.
(CFLAGS-s_sinf-fma.c): New.
* sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
* sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
Continuing the preparation for additional _FloatN / _FloatNx function
aliases, this patch makes x86_64 libm function implementations use
libm_alias_float to define function aliases, or libm_alias_float_other
where the main name is defined with versioned_symbol.
Tested with the glibc testsuite for x86_64, and tested with
build-many-glibcs.py for all its x86_64 configurations that installed
stripped shared libraries are unchanged by the patch.
* sysdeps/x86_64/fpu/multiarch/e_exp2f.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/e_expf.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/e_log2f.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/e_logf.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/e_powf.c: Include
<libm-alias-float.h>.
(exp2f): Define using libm_alias_float, or libm_alias_float_other
if [SHARED].
* sysdeps/x86_64/fpu/multiarch/s_ceilf.c: Include
<libm-alias-float.h>.
(ceilf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_floorf.c: Include
<libm-alias-float.h>.
(floorf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_fmaf.c: Include
<libm-alias-float.h>.
(fmaf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_nearbyintf.c: Include
<libm-alias-float.h>.
(nearbyintf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_rintf.c: Include
<libm-alias-float.h>.
(rintf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/multiarch/s_truncf.c: Include
<libm-alias-float.h>.
(truncf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_copysignf.S: Include <libm-alias-float.h>.
(copysignf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_cosf.S: Include <libm-alias-float.h>.
(cosf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_fabsf.c: Include <libm-alias-float.h>.
(fabsf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_fmaxf.S: Include <libm-alias-float.h>.
(fmaxf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_fminf.S: Include <libm-alias-float.h>.
(fminf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_llrintf.S: Include <libm-alias-float.h>.
(llrintf): Define using libm_alias_float.
[!__ILP32__] (lrintf): Likewise.
* sysdeps/x86_64/fpu/s_sincosf.S: Include <libm-alias-float.h>.
(sincosf): Define using libm_alias_float.
* sysdeps/x86_64/fpu/s_sinf.S: Include <libm-alias-float.h>.
(sinf): Define using libm_alias_float.
* sysdeps/x86_64/x32/fpu/s_lrintf.S: Include <libm-alias-float.h>.
(lrintf): Define using libm_alias_float.
Continuing the preparation for additional _FloatN / _FloatNx function
aliases, this patch makes x86_64 libm function implementations use
libm_alias_double to define function aliases.
Tested with the glibc testsuite for x86_64, and tested with
build-many-glibcs.py for all its x86_64 configurations that installed
stripped shared libraries are unchanged by the patch.
* sysdeps/x86_64/fpu/multiarch/s_atan.c: Include
<libm-alias-double.h>.
(atan): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_ceil.c: Include
<libm-alias-double.h>.
(ceil): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_floor.c: Include
<libm-alias-double.h>.
(floor): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_fma.c: Include
<libm-alias-double.h>.
(fma): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_nearbyint.c: Include
<libm-alias-double.h>.
(nearbyint): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_rint.c: Include
<libm-alias-double.h>.
(rint): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_sin.c: Include
<libm-alias-double.h>.
(sin): Define using libm_alias_double.
(cos): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_tan.c: Include
<libm-alias-double.h>.
(tan): Define using libm_alias_double.
* sysdeps/x86_64/fpu/multiarch/s_trunc.c: Include
<libm-alias-double.h>.
(trunc): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_copysign.S: Include <libm-alias-double.h>.
(copysign): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_fabs.c: Include <libm-alias-double.h>.
(fabs): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_fmax.S: Include <libm-alias-double.h>.
(fmax): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_fmin.S: Include <libm-alias-double.h>.
(fmin): Define using libm_alias_double.
* sysdeps/x86_64/fpu/s_llrint.S: Include <libm-alias-double.h>.
(llrint): Define using libm_alias_double.
[!__ILP32__] (lrint): Likewise.
* sysdeps/x86_64/x32/fpu/s_lrint.S: Include <libm-alias-double.h>.
(lrint): Define using libm_alias_double.
This patch continues the preparation for additional _FloatN / _FloatNx
function aliases by using libm_alias_ldouble for sysdeps/x86_64/fpu
long double functions, so that they can have _Float64x aliases added
in future.
Tested for x86_64, including build-many-glibcs.py tests that installed
stripped shared libraries are unchanged by the patch.
* sysdeps/x86_64/fpu/e_expl.S: Include <libm-alias-ldouble.h>.
[USE_AS_EXPM1L] (expm1l): Define using libm_alias_ldouble.
* sysdeps/x86_64/fpu/s_ceill.S: Include <libm-alias-ldouble.h>.
(ceill): Define using libm_alias_ldouble.
* sysdeps/x86_64/fpu/s_copysignl.S: Include
<libm-alias-ldouble.h>.
(copysignl): Define using libm_alias_ldouble.
* sysdeps/x86_64/fpu/s_fabsl.S: Include <libm-alias-ldouble.h>.
(fabsl): Define using libm_alias_ldouble.
* sysdeps/x86_64/fpu/s_floorl.S: Include <libm-alias-ldouble.h>.
(floorl): Define using libm_alias_ldouble.
* sysdeps/x86_64/fpu/s_fmaxl.S: Include <libm-alias-ldouble.h>.
(fmaxl): Define using libm_alias_ldouble.
* sysdeps/x86_64/fpu/s_fminl.S: Include <libm-alias-ldouble.h>.
(fminl): Define using libm_alias_ldouble.
* sysdeps/x86_64/fpu/s_llrintl.S: Include <libm-alias-ldouble.h>.
(llrintl): Define using libm_alias_ldouble.
(lrintl): Likewise.
* sysdeps/x86_64/fpu/s_nearbyintl.S: Include
<libm-alias-ldouble.h>.
(nearbyintl): Define using libm_alias_ldouble.
* sysdeps/x86_64/fpu/s_truncl.S: Include <libm-alias-ldouble.h>.
(truncl): Define using libm_alias_ldouble.
* sysdeps/x86_64/x32/fpu/s_lrintl.S: Include
<libm-alias-ldouble.h>.
(lrintl): Define using libm_alias_ldouble.
This patch adds a new build test to check for internal fields
offsets for user visible internal field. Although currently
the only field which is statically initialized to a non zero value
is pthread_mutex_t.__data.__kind value, the tests also check the
offset of __kind, __spins, __elision (if supported), and __list
internal member. A internal header (pthread-offset.h) is added
to each major ABI with the reference value.
Checked on x86_64-linux-gnu and with a build check for all affected
ABIs (aarch64-linux-gnu, alpha-linux-gnu, arm-linux-gnueabihf,
hppa-linux-gnu, i686-linux-gnu, ia64-linux-gnu, m68k-linux-gnu,
microblaze-linux-gnu, mips64-linux-gnu, mips64-n32-linux-gnu,
mips-linux-gnu, powerpc64le-linux-gnu, powerpc-linux-gnu,
s390-linux-gnu, s390x-linux-gnu, sh4-linux-gnu, sparc64-linux-gnu,
sparcv9-linux-gnu, tilegx-linux-gnu, tilegx-linux-gnu-x32,
tilepro-linux-gnu, x86_64-linux-gnu, and x86_64-linux-x32).
* nptl/pthreadP.h (ASSERT_PTHREAD_STRING,
ASSERT_PTHREAD_INTERNAL_OFFSET): New macro.
* nptl/pthread_mutex_init.c (__pthread_mutex_init): Add build time
checks for internal pthread_mutex_t offsets.
* sysdeps/aarch64/nptl/pthread-offsets.h
(__PTHREAD_MUTEX_NUSERS_OFFSET, __PTHREAD_MUTEX_KIND_OFFSET,
__PTHREAD_MUTEX_SPINS_OFFSET, __PTHREAD_MUTEX_ELISION_OFFSET,
__PTHREAD_MUTEX_LIST_OFFSET): New macro.
* sysdeps/alpha/nptl/pthread-offsets.h: Likewise.
* sysdeps/arm/nptl/pthread-offsets.h: Likewise.
* sysdeps/hppa/nptl/pthread-offsets.h: Likewise.
* sysdeps/i386/nptl/pthread-offsets.h: Likewise.
* sysdeps/ia64/nptl/pthread-offsets.h: Likewise.
* sysdeps/m68k/nptl/pthread-offsets.h: Likewise.
* sysdeps/microblaze/nptl/pthread-offsets.h: Likewise.
* sysdeps/mips/nptl/pthread-offsets.h: Likewise.
* sysdeps/nios2/nptl/pthread-offsets.h: Likewise.
* sysdeps/powerpc/nptl/pthread-offsets.h: Likewise.
* sysdeps/s390/nptl/pthread-offsets.h: Likewise.
* sysdeps/sh/nptl/pthread-offsets.h: Likewise.
* sysdeps/sparc/nptl/pthread-offsets.h: Likewise.
* sysdeps/tile/nptl/pthread-offsets.h: Likewise.
* sysdeps/x86_64/nptl/pthread-offsets.h: Likewise.
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add a new header file, sysdeps/x86/sysdep.h, for common assembly code
macros between i386 and x86-64. Tested on i686 and x86-64. There are
no differences in outputs of "readelf -a" and "objdump -dw" on all glibc
shared objects before and after the patch.
* sysdeps/i386/sysdep.h: Include <sysdeps/x86/sysdep.h> instead
of <sysdeps/generic/sysdep.h>.
(ALIGNARG): Removed.
(ASM_SIZE_DIRECTIVE): Likewise.
(ENTRY): Likewise.
(END): Likewise.
(ENTRY_CHK): Likewise.
(END_CHK): Likewise.
(syscall_error): Likewise.
(mcount): Likewise.
(PSEUDO_END): Likewise.
(L): Likewise.
(atom_text_section): Likewise.
* sysdeps/x86/sysdep.h: New file.
* sysdeps/x86_64/sysdep.h: Include <sysdeps/x86/sysdep.h> instead
of <sysdeps/generic/sysdep.h>.
(ALIGNARG): Removed.
(ASM_SIZE_DIRECTIVE): Likewise.
(ENTRY): Likewise.
(END): Likewise.
(ENTRY_CHK): Likewise.
(END_CHK): Likewise.
(syscall_error): Likewise.
(mcount): Likewise.
(PSEUDO_END): Likewise.
(L): Likewise.
(atom_text_section): Likewise.
In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
mask and bound registers. It simplifies _dl_runtime_resolve and supports
different calling conventions. ld.so code size is reduced by more than
1 KB. However, use fxsave/xsave/xsavec takes a little bit more cycles
than saving and restoring vector and bound registers individually.
Latency for _dl_runtime_resolve to lookup the function, foo, from one
shared library plus libc.so:
Before After Change
Westmere (SSE)/fxsave 345 866 151%
IvyBridge (AVX)/xsave 420 643 53%
Haswell (AVX)/xsave 713 1252 75%
Skylake (AVX+MPX)/xsavec 559 719 28%
Skylake (AVX512+MPX)/xsavec 145 272 87%
Ryzen (AVX)/xsavec 280 553 97%
This is the worst case where portion of time spent for saving and
restoring registers is bigger than majority of cases. With smaller
_dl_runtime_resolve code size, overall performance impact is negligible.
On IvyBridge, differences in build and test time of binutils with lazy
binding GCC and binutils are noises. On Westmere, differences in
bootstrap and "makc check" time of GCC 7 with lazy binding GCC and
binutils are also noises.
[BZ #21265]
* sysdeps/x86/cpu-features-offsets.sym (XSAVE_STATE_SIZE_OFFSET):
New.
* sysdeps/x86/cpu-features.c: Include <libc-pointer-arith.h>.
(get_common_indeces): Set xsave_state_size, xsave_state_full_size
and bit_arch_XSAVEC_Usable if needed.
(init_cpu_features): Remove bit_arch_Use_dl_runtime_resolve_slow
and bit_arch_Use_dl_runtime_resolve_opt.
* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
Removed.
(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
(bit_arch_Prefer_No_AVX512): Updated.
(bit_arch_MathVec_Prefer_No_AVX512): Likewise.
(bit_arch_XSAVEC_Usable): New.
(STATE_SAVE_OFFSET): Likewise.
(STATE_SAVE_MASK): Likewise.
[__ASSEMBLER__]: Include <cpu-features-offsets.h>.
(cpu_features): Add xsave_state_size and xsave_state_full_size.
(index_arch_Use_dl_runtime_resolve_opt): Removed.
(index_arch_Use_dl_runtime_resolve_slow): Likewise.
(index_arch_XSAVEC_Usable): New.
* sysdeps/x86/cpu-tunables.c (TUNABLE_CALLBACK (set_hwcaps)):
Support XSAVEC_Usable. Remove Use_dl_runtime_resolve_slow.
* sysdeps/x86_64/Makefile (tst-x86_64-1-ENV): New if tunables
is enabled.
* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup):
Replace _dl_runtime_resolve_sse, _dl_runtime_resolve_avx,
_dl_runtime_resolve_avx_slow, _dl_runtime_resolve_avx_opt,
_dl_runtime_resolve_avx512 and _dl_runtime_resolve_avx512_opt
with _dl_runtime_resolve_fxsave, _dl_runtime_resolve_xsave and
_dl_runtime_resolve_xsavec.
* sysdeps/x86_64/dl-trampoline.S (DL_RUNTIME_UNALIGNED_VEC_SIZE):
Removed.
(DL_RUNTIME_RESOLVE_REALIGN_STACK): Check STATE_SAVE_ALIGNMENT
instead of VEC_SIZE.
(REGISTER_SAVE_BND0): Removed.
(REGISTER_SAVE_BND1): Likewise.
(REGISTER_SAVE_BND3): Likewise.
(REGISTER_SAVE_RAX): Always defined to 0.
(VMOV): Removed.
(_dl_runtime_resolve_avx): Likewise.
(_dl_runtime_resolve_avx_slow): Likewise.
(_dl_runtime_resolve_avx_opt): Likewise.
(_dl_runtime_resolve_avx512): Likewise.
(_dl_runtime_resolve_avx512_opt): Likewise.
(_dl_runtime_resolve_sse): Likewise.
(_dl_runtime_resolve_sse_vex): Likewise.
(USE_FXSAVE): New.
(_dl_runtime_resolve_fxsave): Likewise.
(USE_XSAVE): Likewise.
(_dl_runtime_resolve_xsave): Likewise.
(USE_XSAVEC): Likewise.
(_dl_runtime_resolve_xsavec): Likewise.
* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx512):
Removed.
(_dl_runtime_resolve_avx512_opt): Likewise.
(_dl_runtime_resolve_avx): Likewise.
(_dl_runtime_resolve_avx_opt): Likewise.
(_dl_runtime_resolve_sse): Likewise.
(_dl_runtime_resolve_sse_vex): Likewise.
(_dl_runtime_resolve_fxsave): New.
(_dl_runtime_resolve_xsave): Likewise.
(_dl_runtime_resolve_xsavec): Likewise.
This code is used in non-PIE static executable and static PIE. It checks
if _DYNAMIC is undefined before using it to compute load address. But
not all targets can convert access _DYNAMIC via GOT, which needs dynamic
relocation, to PC-relative at link-time.
* sysdeps/i386/dl-machine.h (elf_machine_load_address): Don't
allow undefined _DYNAMIC in PIE libc.a.
* sysdeps/x86_64/dl-machine.h (elf_machine_load_address):
Likewse.
This patch converts the dbl-64 implementations of atan and tan into
weak aliases of __atan and __tan, in preparation for making them use
libm_alias_double. Consequent changes are made to the x86_64
multiarch versions wrapping round them (with the dbl-64 functions,
like other such functions, being made not to define their aliases at
all if __atan or __tan are defined as macros by an including file).
Tested for x86_64, and with build-many-glibcs.py.
* sysdeps/ieee754/dbl-64/s_atan.c (atan): Rename to __atan and
define as weak alias of __atan. Do not define any aliases if
[__atan].
[NO_LONG_DOUBLE] (__atanl): Define as strong alias of __atan.
[NO_LONG_DOUBLE] (atanl): Define as weak alias of __atanl.
* sysdeps/ieee754/dbl-64/s_tan.c (tan): Rename to __tan and define
as weak alias of __tan. Do not define any aliases if [__tan].
[NO_LONG_DOUBLE] (__tanl): Define as strong alias of __tan.
[NO_LONG_DOUBLE] (tanl): Define as weak alias of __tanl.
* sysdeps/x86_64/fpu/multiarch/s_atan-avx.c (atan): Rename to
__atan.
* sysdeps/x86_64/fpu/multiarch/s_atan-fma.c (atan): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_atan-fma4.c (atan): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_atan.c (atan): Rename to __atan
and define as weak alias of __atan.
* sysdeps/x86_64/fpu/multiarch/s_tan-avx.c (tan): Rename to
__atan.
* sysdeps/x86_64/fpu/multiarch/s_tan-fma.c (tan): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_tan-fma4.c (tan): Likewise.
* sysdeps/x86_64/fpu/multiarch/s_tan.c (tan): Rename to __tan and
define as weak alias of __tan.
The new generic expf and exp2f code don't need wrappers any more, they
set errno inline, so only use the wrappers on targets that need it.
(If the wrapper is needed, then the top level wrapper code is included,
otherwise empty w_exp*f.c is used to suppress the wrapper.)
A powerpc64 expf implementation includes the expf c code directly which
needed some changes.
* sysdeps/ieee754/flt-32/e_exp2f.c (__exp2f): Define without wrapper.
* sysdeps/ieee754/flt-32/e_expf.c (__expf): Likewise
* sysdeps/ieee754/flt-32/w_exp2f.c: New file.
* sysdeps/ieee754/flt-32/w_expf.c: New file.
* sysdeps/powerpc/powerpc64/fpu/multiarch/e_expf-ppc64.c: Update for
the new expf code.
* sysdeps/powerpc/powerpc64/fpu/multiarch/w_expf.c: New file.
* sysdeps/powerpc/powerpc64/power8/fpu/w_expf.c: New file.
* sysdeps/m68k/m680x0/fpu/w_exp2f.c: New file.
* sysdeps/m68k/m680x0/fpu/w_expf.c: New file.
* sysdeps/i386/fpu/w_exp2f.c: New file.
* sysdeps/i386/fpu/w_expf.c: New file.
* sysdeps/i386/i686/fpu/multiarch/w_expf.c: New file.
* sysdeps/x86_64/fpu/w_expf.c: New file.
When --enable-static-pie is used to build static PIE, _DYNAMIC is used
to compute the load address of static PIE. But _DYNAMIC is undefined
when creating static executable. This patch makes _DYNAMIC weak in PIE
libc.a so that it can be undefined.
* sysdeps/i386/dl-machine.h (elf_machine_load_address): Allow
undefined _DYNAMIC in PIE libc.a.
* sysdeps/x86_64/dl-machine.h (elf_machine_load_address):
Likewse.
This patch adds SSE4.1 versions of trunc and truncf, using the roundsd
/ roundss instructions, similar to the versions of ceil, floor, rint
and nearbyint functions we already have. In my testing with the glibc
benchtests these are about 30% faster than the C versions for double,
20% faster for float.
Tested for x86_64.
[BZ #20142]
* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
Add s_trunc-c, s_truncf-c, s_trunc-sse4_1 and s_truncf-sse4_1.
* sysdeps/x86_64/fpu/multiarch/s_trunc-c.c: New file.
* sysdeps/x86_64/fpu/multiarch/s_trunc-sse4_1.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_trunc.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_truncf-c.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_truncf-sse4_1.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/s_truncf.c: Likewise.
AVX512 functions in mathvec are used on machines with AVX512. An AVX2
wrapper is also provided and it can be used when the AVX512 version
isn't profitable. MathVec_Prefer_No_AVX512 is addded to cpu-features.
If glibc.tune.hwcaps=MathVec_Prefer_No_AVX512 is set in GLIBC_TUNABLES
environment variable, the AVX2 wrapper will be used.
Tested on x86-64 machines with and without AVX512. Also verified
glibc.tune.hwcaps=MathVec_Prefer_No_AVX512 on AVX512 machine.
[BZ #21967]
* sysdeps/x86/cpu-features.h (bit_arch_MathVec_Prefer_No_AVX512):
New.
(index_arch_MathVec_Prefer_No_AVX512): Likewise.
* sysdeps/x86/cpu-tunables.c (TUNABLE_CALLBACK (set_hwcaps)):
Handle MathVec_Prefer_No_AVX512.
* sysdeps/x86_64/fpu/multiarch/ifunc-mathvec-avx512.h
(IFUNC_SELECTOR): Return AVX2 version if MathVec_Prefer_No_AVX512
is set.
Before glibc 2.26, ld.so set dl_platform to "x86_64" and searched the
"x86_64" subdirectory when loading a shared library. ld.so in glibc
2.26 was changed to set dl_platform to "haswell" or "xeon_phi", based
on supported ISAs. This led to shared library loading failure for
shared libraries placed under the "x86_64" subdirectory.
This patch adds "x86_64" to x86-64 dl_hwcap so that ld.so will always
search the "x86_64" subdirectory when loading a shared library.
NB: We can't set x86-64 dl_platform to "x86-64" since ld.so will skip
the "haswell" and "xeon_phi" subdirectories on "haswell" and "xeon_phi"
machines.
Tested on i686 and x86-64.
[BZ #22093]
* sysdeps/x86/cpu-features.c (init_cpu_features): Initialize
GLRO(dl_hwcap) to HWCAP_X86_64 for x86-64.
* sysdeps/x86/dl-hwcap.h (HWCAP_COUNT): Updated.
(HWCAP_IMPORTANT): Likewise.
(HWCAP_X86_64): New enum.
(HWCAP_X86_AVX512_1): Updated.
* sysdeps/x86/dl-procinfo.c (_dl_x86_hwcap_flags): Add "x86_64".
* sysdeps/x86_64/Makefile (tests): Add tst-x86_64-1.
(modules-names): Add x86_64/tst-x86_64mod-1.
(LDFLAGS-tst-x86_64mod-1.so): New.
($(objpfx)tst-x86_64-1): Likewise.
($(objpfx)x86_64/tst-x86_64mod-1.os): Likewise.
(tst-x86_64-1-clean): Likewise.
* sysdeps/x86_64/tst-x86_64-1.c: New file.
* sysdeps/x86_64/tst-x86_64mod-1.c: Likewise.
This patch obsoletes the pow10, pow10f and pow10l functions (makes
them into compat symbols, not available for new ports or static
linking). The exp10 names for these functions are standardized (in TS
18661-4) and were added in the same glibc version (2.1) as pow10 so
source code can change to use them without any loss of portability.
Since pow10 is deliberately not provided for _Float128, only exp10,
this slightly simplifies moving to the new wrapper templates in the
!LIBM_SVID_COMPAT case, by avoiding needing to arrange for pow10,
pow10f and pow10l to be defined by those templates.
Tested for x86_64, and with build-many-glibcs.py.
* manual/math.texi (pow10): Do not document.
(pow10f): Likewise.
(pow10l): Likewise.
* math/bits/mathcalls.h [__USE_GNU] (pow10): Do not declare.
* math/bits/math-finite.h [__USE_GNU] (pow10): Likewise.
* math/libm-test-exp10.inc (pow10_test): Remove.
(do_test): Do not call pow10.
* math/w_exp10_compat.c (pow10): Make into compat symbol.
[NO_LONG_DOUBLE] (pow10l): Likewise.
* math/w_exp10f_compat.c (pow10f): Likewise.
* math/w_exp10l_compat.c (pow10l): Likewise.
* sysdeps/ia64/fpu/e_exp10.S: Include <shlib-compat.h>.
(pow10): Make into compat symbol.
* sysdeps/ia64/fpu/e_exp10f.S: Include <shlib-compat.h>.
(pow10f): Make into compat symbol.
* sysdeps/ia64/fpu/e_exp10l.S: Include <shlib-compat.h>.
(pow10l): Make into compat symbol.
* sysdeps/ieee754/ldbl-opt/Makefile (libnldbl-calls): Remove
pow10.
(CFLAGS-nldbl-pow10.c): Remove variable..
* sysdeps/ieee754/ldbl-opt/nldbl-pow10.c: Remove file.
* sysdeps/ieee754/ldbl-opt/w_exp10_compat.c (pow10l): Condition on
[SHLIB_COMPAT (libm, GLIBC_2_1, GLIBC_2_27)].
* sysdeps/ieee754/ldbl-opt/w_exp10l_compat.c (compat_symbol):
Undefine and redefine.
(pow10l): Make into compat symbol.
* sysdeps/aarch64/libm-test-ulps: Remove pow10 ulps.
* sysdeps/alpha/fpu/libm-test-ulps: Likewise.
* sysdeps/arm/libm-test-ulps: Likewise.
* sysdeps/hppa/fpu/libm-test-ulps: Likewise.
* sysdeps/i386/fpu/libm-test-ulps: Likewise.
* sysdeps/i386/i686/fpu/multiarch/libm-test-ulps: Likewise.
* sysdeps/microblaze/libm-test-ulps: Likewise.
* sysdeps/mips/mips32/libm-test-ulps: Likewise.
* sysdeps/mips/mips64/libm-test-ulps: Likewise.
* sysdeps/nios2/libm-test-ulps: Likewise.
* sysdeps/powerpc/fpu/libm-test-ulps: Likewise.
* sysdeps/powerpc/nofpu/libm-test-ulps: Likewise.
* sysdeps/s390/fpu/libm-test-ulps: Likewise.
* sysdeps/sh/libm-test-ulps: Likewise.
* sysdeps/sparc/fpu/libm-test-ulps: Likewise.
* sysdeps/tile/libm-test-ulps: Likewise.
* sysdeps/x86_64/fpu/libm-test-ulps: Likewise.
Update libm-test-ulps for AVX512 mathvec tests by running
“make regen-ulps” on Intel Xeon processor with AVX512.
* sysdeps/x86_64/fpu/libm-test-ulps: Regenerated.
Since binutils 2.25 or later is required to build glibc, we can replace
AVX512F .byte sequences with AVX512F instructions.
Tested on x86-64 and x32. There are no code differences in libmvec.so
and libmvec.a.
* sysdeps/x86_64/fpu/svml_d_sincos8_core.S: Replace AVX512F
.byte sequences with AVX512F instructions.
* sysdeps/x86_64/fpu/svml_d_wrapper_impl.h: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf16_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_wrapper_impl.h: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S:
Likewise.
Mark internal SSE2 functions with attribute_hidden to allow direct
access within libc.so and libc.a without using GOT nor PLT.
[BZ #18822]
* sysdeps/x86_64/multiarch/strcspn-c.c (STRCSPN_SSE2): Add
attribute_hidden.
(__strspn_sse2): Likewise.
Since the AVX2 version of mathvec functions uses FMA, it can only be
used when FMA is usable.
[BZ #21966]
* sysdeps/x86_64/fpu/multiarch/ifunc-mathvec-avx2.h
(IFUNC_SELECTOR): Don't use the AVX2 version if FMA isn't
usable.
sysdeps/x86_64/fpu/e_expf.S has
lea L(SP_RANGE)(%rip), %rdx /* load over/underflow bound */
cmpl (%rdx,%rax,4), %ecx /* |x|<under/overflow bound ? */
...
/* Here if |x| is Inf */
lea L(SP_INF_0)(%rip), %rdx /* depending on sign of x: */
movss (%rdx,%rax,4), %xmm0 /* return zero or Inf */
ret
...
.section .rodata.cst8,"aM",@progbits,8
...
.p2align 2
L(SP_RANGE): /* single precision overflow/underflow bounds */
.long 0x42b17217 /* if x>this bound, then result overflows */
.long 0x42cff1b4 /* if x<this bound, then result underflows */
.type L(SP_RANGE), @object
ASM_SIZE_DIRECTIVE(L(SP_RANGE))
.p2align 2
L(SP_INF_0):
.long 0x7f800000 /* single precision Inf */
.long 0 /* single precision zero */
.type L(SP_INF_0), @object
ASM_SIZE_DIRECTIVE(L(SP_INF_0))
Since L(SP_RANGE) and L(SP_INF_0) are in .rodata.cst8 section, they must
be aligned to 8 bytes.
[BZ #21955]
* sysdeps/x86_64/fpu/e_expf.S (L(SP_RANGE)): Aligned to 8 bytes.
(L(SP_INF_0)): Likewise.
This commit separates allocating and raising exceptions. This
simplifies catching and re-raising them because it is no longer
necessary to make a temporary, on-stack copy of the exception message.
Since apply_irel is called before memcpy and mempcpy are called, we
can use IFUNC memcpy and mempcpy in libc.a.
* sysdeps/x86_64/memmove.S (MEMCPY_SYMBOL): Don't check SHARED.
(MEMPCPY_SYMBOL): Likewise.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Test memcpy and mempcpy in libc.a.
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: Also include
in libc.a.
* sysdeps/x86_64/multiarch/memcpy-ssse3.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S:
Likewise.
* sysdeps/x86_64/multiarch/memcpy.c: Also include in libc.a.
(__hidden_ver1): Don't use in libc.a.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
(__mempcpy): Don't create a weak alias in libc.a.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Support
libc.a.
* sysdeps/x86_64/multiarch/mempcpy.c: Also include in libc.a.
(__hidden_ver1): Don't use in libc.a.
Since gold doesn't support INSERT in linker script:
https://sourceware.org/bugzilla/show_bug.cgi?id=21676
tst-split-dynreloc fails to link with gold. Check if linker supports
INSERT in linker script before using it.
* config.make.in (have-insert): New.
* configure.ac (libc_cv_insert): New. Set to yes if linker
supports INSERT in linker script.
(AC_SUBST(libc_cv_insert): New.
* configure: Regenerated.
* sysdeps/x86_64/Makefile (tests): Add tst-split-dynreloc only
if $(have-insert) == yes.
__memset_zero_constant_len_parameter should be removed by
commit 61062f5630
Author: Ulrich Drepper <drepper@redhat.com>
Date: Tue Mar 1 00:35:23 2005 +0000
2005-02-24 Roland McGrath <roland@redhat.com>
* debug/Versions (libc: GLIBC_2.4): Remove
__memset_zero_constant_len_parameter.
* sysdeps/generic/memset_chk.c: Remove alias and warning.
* misc/sys/cdefs.h (__warndecl): New macro.
* debug/warning-nop.c: New file.
* string/bits/string3.h (memset): Call __warn_memset_zero_len with no
arguments, instead of calling __memset_zero_constant_len_parameter.
Use __warndecl for __warn_memset_zero_len.
* debug/Makefile (routines): Add $(static-only-routines).
(static-only-routines): New variable.
This patch removes the last emaining pieces of it. Tested it on i586,
i686 and x86-64.
[BZ #21790]
* sysdeps/i386/i586/memset.S
(__memset_zero_constant_len_parameter): Removed.
* sysdeps/i386/i686/memset.S
(__memset_zero_constant_len_parameter): Likewise.
* sysdeps/i386/i686/multiarch/memset_chk.S
(__memset_zero_constant_len_parameter): Likewise.
* sysdeps/x86_64/memset.S (__memset_zero_constant_len_parameter):
Likewise.
Since there are no multiarch versions of memmove_chk and memset_chk,
test multiarch versions of memmove_chk and memset_chk only in libc.so.
[BZ #21741]
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Test memmove_chk and memset_chk only
in libc.so.
This change forces realignment of the stack pointer in __tls_get_addr, so
that binaries compiled by GCCs older than GCC 4.9:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
continue to work even if vector instructions are used in glibc which
require the ABI stack realignment.
__tls_get_addr_slow is added to handle the slow paths in the default
implementation of__tls_get_addr in elf/dl-tls.c. The new __tls_get_addr
calls __tls_get_addr_slow after realigning the stack. Internal calls
within ld.so go directly to the default implementation of __tls_get_addr
because they do not need stack realignment.
[BZ #21609]
* sysdeps/x86_64/Makefile (sysdep-dl-routines): Add tls_get_addr.
(gen-as-const-headers): Add rtld-offsets.sym.
* sysdeps/x86_64/dl-tls.c: New file.
* sysdeps/x86_64/rtld-offsets.sym: Likwise.
* sysdeps/x86_64/tls_get_addr.S: Likewise.
* sysdeps/x86_64/dl-tls.h: Add multiple inclusion guards.
* sysdeps/x86_64/tlsdesc.sym (TI_MODULE_OFFSET): New.
(TI_OFFSET_OFFSET): Likwise.
This patch implements a requirement of binutils >= 2.25 (up from 2.22)
to build glibc. Tests for 2.24 or later on x86_64 and s390 are
removed. It was already the case, as indicated by buildbot results,
that 2.24 was too old for building tests for 32-bit x86 (produced
internal linker errors linking elf/tst-gnu2-tls1mod.so). I don't know
if any configure tests for binutils features are obsolete given the
increased version requirement.
Tested for x86_64.
* configure.ac (AS): Require binutils 2.25 or later.
(LD): Likewise.
* configure: Regenerated.
* sysdeps/s390/configure.ac (AS): Remove version check.
* sysdeps/s390/configure: Regenerated.
* sysdeps/x86_64/configure.ac (AS): Remove version check.
* sysdeps/x86_64/configure: Regenerated.
* manual/install.texi (Tools for Compilation): Document
requirement for binutils 2.25 or later.
* INSTALL: Regenerated.
Check the first 32 bytes before checking size when size >= 32 bytes
to avoid unnecessary branch if the difference is in the first 32 bytes.
Replace vpmovmskb/subl/jnz with vptest/jnc.
On Haswell, the new version is as fast as the previous one. On Skylake,
the new version is a little bit faster.
* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check
the first 32 bytes before checking size when size >= 32 bytes.
Replace vpmovmskb/subl/jnz with vptest/jnc.
This patch enables float128 support for x86_64 and x86. All GCC
versions that can build glibc provide the required support, but since
GCC 6 and before don't provide __builtin_nanq / __builtin_nansq, sNaN
tests and some tests of NaN payloads need to be disabled with such
compilers (this does not affect the generated glibc binaries at all,
just the tests). bits/floatn.h declares float128 support to be
available for GCC versions that provide the required libgcc support
(4.3 for x86_64, 4.4 for i386 GNU/Linux, 4.5 for i386 GNU/Hurd);
compilation-only support was present some time before then, but not
really useful without the libgcc functions.
fenv_private.h needed updating to avoid trying to put _Float128 values
in registers. I make no assertion of optimality of the
math_opt_barrier / math_force_eval definitions for this case; they are
simply intended to be sufficient to work correctly.
Tested for x86_64 and x86, with GCC 7 and GCC 6. (Testing for x32 was
compilation tests only with build-many-glibcs.py to verify the ABI
baseline updates. I have not done any testing for Hurd, although the
float128 support is enabled there as for GNU/Linux.)
* sysdeps/i386/Implies: Add ieee754/float128.
* sysdeps/x86_64/Implies: Likewise.
* sysdeps/x86/bits/floatn.h: New file.
* sysdeps/x86/float128-abi.h: Likewise.
* manual/math.texi (Mathematics): Document support for _Float128
on x86_64 and x86.
* sysdeps/i386/fpu/fenv_private.h: Include <bits/floatn.h>.
(math_opt_barrier): Do not put _Float128 values in floating-point
registers.
(math_force_eval): Likewise.
[__x86_64__] (SET_RESTORE_ROUNDF128): New macro.
* sysdeps/x86/fpu/Makefile [$(subdir) = math] (CPPFLAGS): Append
to Makefile variable.
* sysdeps/x86/fpu/e_sqrtf128.c: New file.
* sysdeps/x86/fpu/sfp-machine.h: Likewise. Based on libgcc.
* sysdeps/x86/math-tests.h: New file.
* math/libm-test-support.h (XFAIL_FLOAT128_PAYLOAD): New macro.
* math/libm-test-getpayload.inc (getpayload_test_data): Use
XFAIL_FLOAT128_PAYLOAD.
* math/libm-test-setpayload.inc (setpayload_test_data): Likewise.
* math/libm-test-totalorder.inc (totalorder_test_data): Likewise.
* math/libm-test-totalordermag.inc (totalordermag_test_data):
Likewise.
* sysdeps/unix/sysv/linux/i386/libc.abilist: Update.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/64/libc.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Likewise.
* sysdeps/i386/fpu/libm-test-ulps: Likewise.
* sysdeps/i386/i686/fpu/multiarch/libm-test-ulps: Likewise.
* sysdeps/x86_64/fpu/libm-test-ulps: Likewise.
This code:
L(between_2_3):
/* Load as big endian with overlapping loads and bswap to avoid
branches. */
movzwl -2(%rdi, %rdx), %eax
movzwl -2(%rsi, %rdx), %ecx
shll $16, %eax
shll $16, %ecx
movzwl (%rdi), %edi
movzwl (%rsi), %esi
orl %edi, %eax
orl %esi, %ecx
bswap %eax
bswap %ecx
subl %ecx, %eax
ret
needs a saturating subtract because the full register is used.
With this commit, only the lower 24 bits of the register are used,
so a regular subtraction suffices.
The test case change adds coverage for these kinds of bugs.
Implement strcmp family IFUNC selectors in C.
All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register. For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized. Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcmp family functions within libc.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
strcmp-sse2, strcmp-sse4_2, strncmp-sse2, strncmp-sse4_2,
strcasecmp_l-sse2, strcasecmp_l-sse4_2, strcasecmp_l-avx,
strncase_l-sse2, strncase_l-sse4_2 and strncase_l-avx.
* sysdeps/x86_64/multiarch/ifunc-strcasecmp.h: New file.
* sysdeps/x86_64/multiarch/strcasecmp.c: Likewise.
* sysdeps/x86_64/multiarch/strcasecmp_l-avx.S: Likewise.
* sysdeps/x86_64/multiarch/strcasecmp_l-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strcasecmp_l-sse4_2.S: Likewise.
* sysdeps/x86_64/multiarch/strcasecmp_l.c: Likewise.
* sysdeps/x86_64/multiarch/strcmp-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strcmp-sse4_2.S: Likewise.
* sysdeps/x86_64/multiarch/strcmp.c: Likewise.
* sysdeps/x86_64/multiarch/strncase.c: Likewise.
* sysdeps/x86_64/multiarch/strncase_l-avx.S : Likewise.
* sysdeps/x86_64/multiarch/strncase_l-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strncase_l-sse4_2.S: Likewise.
* sysdeps/x86_64/multiarch/strncase_l.c: Likewise.
* sysdeps/x86_64/multiarch/strncmp-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strncmp-sse4_2.S: Likewise.
* sysdeps/x86_64/multiarch/strncmp.c: Likewise.
* sysdeps/x86_64/multiarch/strcasecmp_l.S: Removed.
* sysdeps/x86_64/multiarch/strcmp.S: Likewise.
* sysdeps/x86_64/multiarch/strncase_l.S: Likewise.
* sysdeps/x86_64/multiarch/strncmp.S: Likewise.
* sysdeps/x86_64/multiarch/strcmp-sse42.S: Include <sysdep.h>.
(STRCMP_SSE42): New. Defined to __strcmp_sse42 if not defined.
[USE_AS_STRCASECMP_L || USE_AS_STRNCASECMP_L]: Include
"locale-defines.h".
(UPDATE_STRNCMP_COUNTER): New.
(SECTION): Likewise.
(GLABEL): Likewise.
(LABEL): Likewise.
* sysdeps/x86_64/multiarch/strncmp-ssse3.S: Rewrite and enable
for libc.a.
<locale.h> is specified to define locale_t in POSIX.1-2008, and so are
all of the headers that define functions that take locale_t arguments.
Under _GNU_SOURCE, the additional headers that define such functions
have also always defined locale_t. Therefore, there is no need to use
__locale_t in public function prototypes, nor in any internal code.
* ctype/ctype-c99_l.c, ctype/ctype.h, ctype/ctype_l.c
* include/monetary.h, include/stdlib.h, include/time.h
* include/wchar.h, locale/duplocale.c, locale/freelocale.c
* locale/global-locale.c, locale/langinfo.h, locale/locale.h
* locale/localeinfo.h, locale/newlocale.c
* locale/nl_langinfo_l.c, locale/uselocale.c
* localedata/bug-usesetlocale.c, localedata/tst-xlocale2.c
* stdio-common/vfscanf.c, stdlib/monetary.h, stdlib/stdlib.h
* stdlib/strfmon_l.c, stdlib/strtod_l.c, stdlib/strtof_l.c
* stdlib/strtol.c, stdlib/strtol_l.c, stdlib/strtold_l.c
* stdlib/strtoll_l.c, stdlib/strtoul_l.c, stdlib/strtoull_l.c
* string/strcasecmp.c, string/strcoll_l.c, string/string.h
* string/strings.h, string/strncase.c, string/strxfrm_l.c
* sysdeps/ieee754/float128/strtof128_l.c
* sysdeps/ieee754/float128/wcstof128.c
* sysdeps/ieee754/float128/wcstof128_l.c
* sysdeps/ieee754/ldbl-128ibm/strtold_l.c
* sysdeps/ieee754/ldbl-64-128/strtold_l.c
* sysdeps/ieee754/ldbl-opt/nldbl-compat.c
* sysdeps/ieee754/ldbl-opt/nldbl-strfmon_l.c
* sysdeps/ieee754/ldbl-opt/nldbl-strtold_l.c
* sysdeps/ieee754/ldbl-opt/nldbl-wcstold_l.c
* sysdeps/powerpc/powerpc32/power7/strcasecmp.S
* sysdeps/powerpc/powerpc64/power7/strcasecmp.S
* sysdeps/x86_64/strcasecmp_l-nonascii.c
* sysdeps/x86_64/strncase_l-nonascii.c, time/strftime_l.c
* time/strptime_l.c, time/time.h, wcsmbs/mbsrtowcs_l.c
* wcsmbs/wchar.h, wcsmbs/wcscasecmp.c, wcsmbs/wcsncase.c
* wcsmbs/wcstod.c, wcsmbs/wcstod_l.c, wcsmbs/wcstof.c
* wcsmbs/wcstof_l.c, wcsmbs/wcstol_l.c, wcsmbs/wcstold.c
* wcsmbs/wcstold_l.c, wcsmbs/wcstoll_l.c, wcsmbs/wcstoul_l.c
* wcsmbs/wcstoull_l.c, wctype/iswctype_l.c
* wctype/towctrans_l.c, wctype/wcfuncs_l.c
* wctype/wctrans_l.c, wctype/wctype.h, wctype/wctype_l.c:
Change all uses of __locale_t to locale_t.
Remove one more string inline that was defined directly in string.h;
in the absence of the rest of the inlines, it broke the build.
Like other ifunc shims for these functions,
x86_64/multiarch/{mem,st}pcpy.c need to define __NO_STRING_INLINES and
NO_MEMPCPY_STPCPY_REDIRECT.
* string/string.h (__mempcpy_inline): Delete.
* sysdeps/x86_64/multiarch/mempcpy.c
* sysdeps/x86_64/multiarch/stpcpy.c:
Define NO_MEMPCPY_STPCPY_REDIRECT and __NO_STRING_INLINES
before including string.h.
These machine-dependent inline string functions have never been on by
default, and even if they were a good idea at the time they were
introduced, they haven't really been touched in ten to fifteen years
and probably aren't a good idea on current-gen processors. Current
thinking is that this class of optimization is best left to the
compiler.
* bits/string.h, string/bits/string.h
* sysdeps/aarch64/bits/string.h
* sysdeps/m68k/m680x0/m68020/bits/string.h
* sysdeps/s390/bits/string.h, sysdeps/sparc/bits/string.h
* sysdeps/x86/bits/string.h: Delete file.
* string/string.h: Don't include bits/string.h.
* string/bits/string3.h: Rename to bits/string_fortified.h.
No need to undef various symbols that the removed headers
might have defined as macros.
* string/Makefile (headers): Remove bits/string.h, change
bits/string3.h to bits/string_fortified.h.
* string/string-inlines.c: Update commentary. Remove definitions
of various macros that nothing looks at anymore. Don't directly
include bits/string.h. Set _STRING_INLINE_unaligned here, based on
compiler-predefined macros.
* string/strncat.c: If STRNCAT is not defined, or STRNCAT_PRIMARY
_is_ defined, provide internal hidden alias __strncat.
* include/string.h: Declare internal hidden alias __strncat.
Only forward __stpcpy to __builtin_stpcpy if __NO_STRING_INLINES is
not defined.
* include/bits/string3.h: Rename to bits/string_fortified.h,
update to match above.
* sysdeps/i386/string-inlines.c: Define compat symbols for
everything formerly defined by sysdeps/x86/bits/string.h.
Make existing definitions into compat symbols as well.
Remove some no-longer-necessary messing around with macros.
* sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c
* sysdeps/powerpc/powerpc64/multiarch/mempcpy.c
* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
* sysdeps/s390/multiarch/mempcpy.c
No need to define _HAVE_STRING_ARCH_mempcpy.
Do define __NO_STRING_INLINES and NO_MEMPCPY_STPCPY_REDIRECT.
* sysdeps/i386/i686/multiarch/strncat-c.c
* sysdeps/s390/multiarch/strncat-c.c
* sysdeps/x86_64/multiarch/strncat-c.c
Define STRNCAT_PRIMARY. Don't change definition of libc_hidden_def.
Implement strcspn/strpbrk/strspn IFUNC selectors in C
All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register. For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized. Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcspn/strpbrk/strspn functions within libc.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
strcspn-sse2, strpbrk-sse2 and strspn-sse2.
* sysdeps/x86_64/strcspn.S (STRPBRK_P): Removed.
Check USE_AS_STRPBRK instead of STRPBRK_P.
* sysdeps/x86_64/strpbrk.S (USE_AS_STRPBRK): New.
* sysdeps/x86_64/multiarch/ifunc-sse4_2.h: New file.
* sysdeps/x86_64/multiarch/strcspn-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strcspn.c: Likewise.
* sysdeps/x86_64/multiarch/strpbrk-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strpbrk.c: Likewise.
* sysdeps/x86_64/multiarch/strspn-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strspn.c: Likewise.
* sysdeps/x86_64/multiarch/strcspn.S: Removed.
* sysdeps/x86_64/multiarch/strpbrk.S: Likewise.
* sysdeps/x86_64/multiarch/strspn.S: Likewise.
* sysdeps/x86_64/multiarch/strpbrk-c.c: Remove "#ifdef SHARED"
and "#endif".
Implement strcat family IFUNC selectors in C.
All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register. For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized. Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcat family functions within libc.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
strcat-sse2.
* sysdeps/x86_64/multiarch/strcat-sse2.S: New file.
* sysdeps/x86_64/multiarch/strcat.c: Likewise.
* sysdeps/x86_64/multiarch/strncat.c: Likewise.
* sysdeps/x86_64/multiarch/strcat.S: Removed.
* sysdeps/x86_64/multiarch/strncat.S: Likewise.
Implement memcmp family IFUNC selectors in C.
All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register. For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized. Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memcmp family functions within libc.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
memcmp-sse2.
* sysdeps/x86_64/multiarch/ifunc-memcmp.h: New file.
* sysdeps/x86_64/multiarch/memcmp-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/memcmp.c: Likewise.
* sysdeps/x86_64/multiarch/wmemcmp.c: Likewise.
* sysdeps/x86_64/multiarch/memcmp.S: Removed.
* sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
Implement memset family IFUNC selectors in C.
All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register. For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized. Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memset functions within libc.
2017-06-07 H.J. Lu <hongjiu.lu@intel.com>
Erich Elsen <eriche@google.com>
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
memset-sse2-unaligned-erms, and memset_chk-nonshared.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add test for __memset_chk_erms.
Update comments.
* sysdeps/x86_64/multiarch/ifunc-memset.h: New file.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memset.c: Likewise.
* sysdeps/x86_64/multiarch/memset_chk-nonshared.S: Likewise.
* sysdeps/x86_64/multiarch/memset_chk.c: Likewise.
* sysdeps/x86_64/multiarch/memset.S: Removed.
* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
(__memset_chk_erms): New function.
Implement memmove family IFUNC selectors in C.
All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register. For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized. Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memmove family functions within libc.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
memmove-sse2-unaligned-erms, memcpy_chk-nonshared,
mempcpy_chk-nonshared and memmove_chk-nonshared.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add tests for __memmove_chk_erms,
__memcpy_chk_erms and __mempcpy_chk_erms. Update comments.
* sysdeps/x86_64/multiarch/ifunc-memmove.h: New file.
* sysdeps/x86_64/multiarch/memcpy.c: Likewise.
* sysdeps/x86_64/multiarch/memcpy_chk-nonshared.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy_chk.c: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk-nonshared.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/mempcpy.c: Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk-nonshared.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.c: Likewise.
* sysdeps/x86_64/multiarch/memcpy.S: Removed.
* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
(__mempcpy_chk_erms): New function.
(__memmove_chk_erms): Likewise.
(__memcpy_chk_erms): New alias.
This is fairly complicated, not because the users of __need_Emath and
__need_error_t have complicated requirements, but because the core
changes had a lot of fallout.
__need_error_t exists for gnulib compatibility in argz.h and argp.h.
error_t itself is a Hurdism, an enum containing all the E-constants,
so you can do 'p (error_t) errno' in gdb and get a symbolic value.
argz.h and argp.h use it for function return values, and they want to
fall back to 'int' when that's not available. There is no reason why
these nonstandard headers cannot just go ahead and include all of
errno.h; so we do that.
__need_Emath is defined only by .S files; what they _really_ need is
for errno.h to avoid declaring anything other than the E-constants
(e.g. 'extern int __errno_location(void);' is a syntax error in
assembly language). This is replaced with a check for __ASSEMBLER__ in
errno.h, plus a carefully documented requirement for bits/errno.h not
to define anything other than macros. That in turn has the
consequence that bits/errno.h must not define errno - fortunately, all
live ports use the same definition of errno, so I've moved it to
errno.h. The Hurd bits/errno.h must also take care not to define
error_t when __ASSEMBLER__ is defined, which involves repeating all of
the definitions twice, but it's a generated file so that's okay.
* stdlib/errno.h: Remove __need_Emath and __need_error_t logic.
Reorganize file. Declare errno here. When __ASSEMBLER__ is
defined, don't declare anything other than the E-constants.
* include/errno.h: Change conditional for exposing internal
declarations to (not _ISOMAC and not __ASSEMBLER__).
* bits/errno.h: Remove logic for __need_Emath. Document
requirements for a port-specific bits/errno.h.
* sysdeps/unix/sysv/linux/bits/errno.h
* sysdeps/unix/sysv/linux/alpha/bits/errno.h
* sysdeps/unix/sysv/linux/hppa/bits/errno.h
* sysdeps/unix/sysv/linux/mips/bits/errno.h
* sysdeps/unix/sysv/linux/sparc/bits/errno.h:
Add multiple-include guard and check against improper inclusion.
Remove __need_Emath logic. Don't declare errno here. Ensure all
constants are defined as simple integer literals. Consistent
formatting.
* sysdeps/mach/hurd/errnos.awk: Likewise. Only define error_t and
enum __error_t_codes if __ASSEMBLER__ is not defined.
* sysdeps/mach/hurd/bits/errno.h: Regenerate.
* argp/argp.h, string/argz.h: Don't define __need_error_t before
including errno.h.
* sysdeps/i386/i686/fpu/multiarch/s_cosf-sse2.S
* sysdeps/i386/i686/fpu/multiarch/s_sincosf-sse2.S
* sysdeps/i386/i686/fpu/multiarch/s_sinf-sse2.S
* sysdeps/x86_64/fpu/s_cosf.S
* sysdeps/x86_64/fpu/s_sincosf.S
* sysdeps/x86_64/fpu/s_sinf.S:
Just include errno.h; don't define __need_Emath or include
bits/errno.h directly.
ELFv2 functions with localentry:0 are those with a single entry point,
ie. global entry == local entry, that have no requirement on r2 or
r12 and guarantee r2 is unchanged on return. Such an external
function can be called via the PLT without saving r2 or restoring it
on return, avoiding a common load-hit-store for small functions.
This patch implements the ld.so changes necessary for this
optimization. ld.so needs to check that an optimized plt call
sequence is in fact calling a function implemented with localentry:0,
end emit a fatal error otherwise.
The elf/testobj6.c change is to stop "error while loading shared
libraries: expected localentry:0 `preload'" when running
elf/preloadtest, which we'd get otherwise.
* elf/elf.h (PPC64_OPT_LOCALENTRY): Define.
* sysdeps/alpha/dl-machine.h (elf_machine_fixup_plt): Add
refsym and sym parameters. Adjust callers.
* sysdeps/aarch64/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/arm/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/generic/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/hppa/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/i386/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/ia64/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/m68k/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/microblaze/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/mips/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/nios2/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/powerpc/powerpc32/dl-machine.h (elf_machine_fixup_plt):
Likewise.
* sysdeps/s390/s390-32/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/s390/s390-64/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/sh/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/sparc/sparc32/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/sparc/sparc64/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/tile/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/x86_64/dl-machine.h (elf_machine_fixup_plt): Likewise.
* sysdeps/powerpc/powerpc64/dl-machine.c (_dl_error_localentry): New.
(_dl_reloc_overflow): Increase buffser size. Formatting.
* sysdeps/powerpc/powerpc64/dl-machine.h (ppc64_local_entry_offset):
Delete reloc param, add refsym and sym. Check optimized plt
call stubs for localentry:0 functions. Adjust callers.
(elf_machine_fixup_plt, elf_machine_plt_conflict): Add refsym
and sym parameters. Adjust callers.
(_dl_reloc_overflow): Move attribute.
(_dl_error_localentry): Declare.
* elf/dl-runtime.c (_dl_fixup): Save original sym. Pass
refsym and sym to elf_machine_fixup_plt.
* elf/testobj6.c (preload): Call printf.
Implement strcpy family IFUNC selectors in C.
All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register. For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized. Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcpy family functions within libc.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
strcpy-sse2 and stpcpy-sse2.
* sysdeps/x86_64/multiarch/ifunc-unaligned-ssse3.h: New file.
* sysdeps/x86_64/multiarch/stpcpy-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/stpcpy.c: Likewise.
* sysdeps/x86_64/multiarch/stpncpy.c: Likewise.
* sysdeps/x86_64/multiarch/strcpy-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strcpy.c: Likewise.
* sysdeps/x86_64/multiarch/strncpy.c: Likewise.
* sysdeps/x86_64/multiarch/stpcpy.S: Removed.
* sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
* sysdeps/x86_64/multiarch/strcpy.S: Likewise.
* sysdeps/x86_64/multiarch/strncpy.S: Likewise.
* sysdeps/x86_64/multiarch/stpncpy-c.c (weak_alias): New.
(libc_hidden_def): Always defined as empty.
* sysdeps/x86_64/multiarch/strncpy-c.c (libc_hidden_builtin_def):
Always Defined as empty.
Optimize strrchr/wcsrchr with AVX2 to check 32 bytes with vector
instructions. It is as fast as SSE2 version for small data sizes
and up to 1X faster for large data sizes on Haswell. Select AVX2
version on AVX2 machines where vzeroupper is preferred and AVX
unaligned load is fast.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
strrchr-sse2, strrchr-avx2, wcsrchr-sse2 and wcsrchr-avx2.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add tests for __strrchr_avx2,
__strrchr_sse2, __wcsrchr_avx2 and __wcsrchr_sse2.
* sysdeps/x86_64/multiarch/strrchr-avx2.S: New file.
* sysdeps/x86_64/multiarch/strrchr-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strrchr.c: Likewise.
* sysdeps/x86_64/multiarch/wcsrchr-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/wcsrchr-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/wcsrchr.c: Likewise.
Optimize memrchr with AVX2 to search 32 bytes with a single vector
compare instruction. It is as fast as SSE2 memrchr for small data
sizes and up to 1X faster for large data sizes on Haswell. Select
AVX2 memrchr on AVX2 machines where vzeroupper is preferred and AVX
unaligned load is fast.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
memrchr-sse2 and memrchr-avx2.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add tests for __memrchr_avx2 and
__memrchr_sse2.
* sysdeps/x86_64/multiarch/memrchr-avx2.S: New file.
* sysdeps/x86_64/multiarch/memrchr-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/memrchr.c: Likewise.
Optimize strchr/strchrnul/wcschr with AVX2 to search 32 bytes with vector
instructions. It is as fast as SSE2 versions for size <= 16 bytes and up
to 1X faster for or size > 16 bytes on Haswell. Select AVX2 version on
AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast.
NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input. TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
strchr-sse2, strchrnul-sse2, strchr-avx2, strchrnul-avx2,
wcschr-sse2 and wcschr-avx2.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add tests for __strchr_avx2,
__strchrnul_avx2, __strchrnul_sse2, __wcschr_avx2 and
__wcschr_sse2.
* sysdeps/x86_64/multiarch/strchr-avx2.S: New file.
* sysdeps/x86_64/multiarch/strchr-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strchr.c: Likewise.
* sysdeps/x86_64/multiarch/strchrnul-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/strchrnul-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strchrnul.c: Likewise.
* sysdeps/x86_64/multiarch/wcschr-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/wcschr-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/wcschr.c: Likewise.
* sysdeps/x86_64/multiarch/strchr.S: Removed.
Optimize strlen/strnlen/wcslen/wcsnlen with AVX2 to check 32 bytes with
a single vector compare instruction. It is as fast as SSE2 versions for
size <= 16 bytes and up to 1X faster for or size > 16 bytes on Haswell.
Select AVX2 version on AVX2 machines where vzeroupper is preferred and
AVX unaligned load is fast.
NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input. TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
strlen-sse2, strnlen-sse2, strlen-avx2, strnlen-avx2,
wcslen-sse2, wcslen-avx2 and wcsnlen-avx2.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add tests for __strlen_avx2,
__strlen_sse2, __strnlen_avx2, __strnlen_sse2, __wcslen_avx2,
__wcslen_sse2 and __wcsnlen_avx2.
* sysdeps/x86_64/multiarch/strlen-avx2.S: New file.
* sysdeps/x86_64/multiarch/strlen-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strlen.c: Likewise.
* sysdeps/x86_64/multiarch/strnlen-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/strnlen-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/strnlen.c: Likewise.
* sysdeps/x86_64/multiarch/wcslen-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/wcslen-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/wcslen.c: Likewise.
* sysdeps/x86_64/multiarch/wcsnlen-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/wcsnlen.c (OPTIMIZE (avx2)): New.
(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX2 machines where
vzeroupper is preferred and AVX unaligned load is fast.
SSE2 memchr is extended to support wmemchr. AVX2 memchr/rawmemchr/wmemchr
are added to search 32 bytes with a single vector compare instruction.
AVX2 memchr/rawmemchr/wmemchr are as fast as SSE2 memchr/rawmemchr/wmemchr
for small sizes and up to 1.5X faster for larger sizes on Haswell and
Skylake. Select AVX2 memchr/rawmemchr/wmemchr on AVX2 machines where
vzeroupper is preferred and AVX unaligned load is fast.
NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input. TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.
* sysdeps/x86_64/memchr.S (MEMCHR): New. Depending on if
USE_AS_WMEMCHR is defined.
(PCMPEQ): Likewise.
(memchr): Renamed to ...
(MEMCHR): This. Support wmemchr if USE_AS_WMEMCHR is defined.
Replace pcmpeqb with PCMPEQ.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
memchr-sse2, rawmemchr-sse2, memchr-avx2, rawmemchr-avx2,
wmemchr-sse4_1, wmemchr-avx2 and wmemchr-c.
* sysdeps/x86_64/multiarch/ifunc-avx2.h: New file.
* sysdeps/x86_64/multiarch/memchr-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/memchr-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/memchr.c: Likewise.
* sysdeps/x86_64/multiarch/rawmemchr-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/rawmemchr.c: Likewise.
* sysdeps/x86_64/multiarch/wmemchr-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/wmemchr-sse2.S: Likewise.
* sysdeps/x86_64/multiarch/wmemchr.c: Likewise.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Test __memchr_avx2, __memchr_sse2,
__rawmemchr_avx2, __rawmemchr_sse2, __wmemchr_avx2 and
__wmemchr_sse2.
Since ifunc-sse4_1.h is included only by wcsnlen.c, we can fold it
into wcsnlen.c. No code changes in wcsnlen.o.
2017-06-07 H.J. Lu <hongjiu.lu@intel.com>
* sysdeps/x86_64/multiarch/ifunc-sse4_1.h: Removed and folded
into ...
* sysdeps/x86_64/multiarch/wcsnlen.c: Here. Don't include
ifunc-sse4_1.h.
Since wcsnlen.S uses pminud which is the part of SSE4.1, move wcsnlen.S
to multiarch/wcsnlen-sse4_1.S.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
wcsnlen-sse4_1 and wcsnlen-c.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Test __wcsnlen_sse4_1 and
__wcsnlen_sse2.
* sysdeps/x86_64/multiarch/ifunc-sse4_1.h: New file.
* sysdeps/x86_64/multiarch/wcsnlen-c.c: Likewise.
* sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S: Likewise.
* sysdeps/x86_64/multiarch/wcsnlen.c: Likewise.
* sysdeps/x86_64/wcsnlen.S: Removed.
This patch optimizes the generic spinlock code.
The type pthread_spinlock_t is a typedef to volatile int on all archs.
Passing a volatile pointer to the atomic macros which are not mapped to the
C11 atomic builtins can lead to extra stores and loads to stack if such
a macro creates a temporary variable by using "__typeof (*(mem)) tmp;".
Thus, those macros which are used by spinlock code - atomic_exchange_acquire,
atomic_load_relaxed, atomic_compare_exchange_weak - have to be adjusted.
According to the comment from Szabolcs Nagy, the type of a cast expression is
unqualified (see http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_423.htm):
__typeof ((__typeof (*(mem)) *(mem)) tmp;
Thus from spinlock perspective the variable tmp is of type int instead of
type volatile int. This patch adjusts those macros in include/atomic.h.
With this construct GCC >= 5 omits the extra stores and loads.
The atomic macros are replaced by the C11 like atomic macros and thus
the code is aligned to it. The pthread_spin_unlock implementation is now
using release memory order instead of sequentially consistent memory order.
The issue with passed volatile int pointers applies to the C11 like atomic
macros as well as the ones used before.
I've added a glibc_likely hint to the first atomic exchange in
pthread_spin_lock in order to return immediately to the caller if the lock is
free. Without the hint, there is an additional jump if the lock is free.
I've added the atomic_spin_nop macro within the loop of plain reads.
The plain reads are also realized by C11 like atomic_load_relaxed macro.
The new define ATOMIC_EXCHANGE_USES_CAS determines if the first try to acquire
the spinlock in pthread_spin_lock or pthread_spin_trylock is an exchange
or a CAS. This is defined in atomic-machine.h for all architectures.
The define SPIN_LOCK_READS_BETWEEN_CMPXCHG is now removed.
There is no technical reason for throwing in a CAS every now and then,
and so far we have no evidence that it can improve performance.
If that would be the case, we have to adjust other spin-waiting loops
elsewhere, too! Using a CAS loop without plain reads is not a good idea
on many targets and wasn't used by one. Thus there is now no option to
do so.
Architectures are now using the generic spinlock automatically if they
do not provide an own implementation. Thus the pthread_spin_lock.c files
in sysdeps folder are deleted.
ChangeLog:
* NEWS: Mention new spinlock implementation.
* include/atomic.h:
(__atomic_val_bysize): Cast type to omit volatile qualifier.
(atomic_exchange_acq): Likewise.
(atomic_load_relaxed): Likewise.
(ATOMIC_EXCHANGE_USES_CAS): Check definition.
* nptl/pthread_spin_init.c (pthread_spin_init):
Use atomic_store_relaxed.
* nptl/pthread_spin_lock.c (pthread_spin_lock):
Use C11-like atomic macros.
* nptl/pthread_spin_trylock.c (pthread_spin_trylock):
Likewise.
* nptl/pthread_spin_unlock.c (pthread_spin_unlock):
Use atomic_store_release.
* sysdeps/aarch64/nptl/pthread_spin_lock.c: Delete File.
* sysdeps/arm/nptl/pthread_spin_lock.c: Likewise.
* sysdeps/hppa/nptl/pthread_spin_lock.c: Likewise.
* sysdeps/m68k/nptl/pthread_spin_lock.c: Likewise.
* sysdeps/microblaze/nptl/pthread_spin_lock.c: Likewise.
* sysdeps/mips/nptl/pthread_spin_lock.c: Likewise.
* sysdeps/nios2/nptl/pthread_spin_lock.c: Likewise.
* sysdeps/aarch64/atomic-machine.h (ATOMIC_EXCHANGE_USES_CAS): Define.
* sysdeps/alpha/atomic-machine.h: Likewise.
* sysdeps/arm/atomic-machine.h: Likewise.
* sysdeps/i386/atomic-machine.h: Likewise.
* sysdeps/ia64/atomic-machine.h: Likewise.
* sysdeps/m68k/coldfire/atomic-machine.h: Likewise.
* sysdeps/m68k/m680x0/m68020/atomic-machine.h: Likewise.
* sysdeps/microblaze/atomic-machine.h: Likewise.
* sysdeps/mips/atomic-machine.h: Likewise.
* sysdeps/powerpc/powerpc32/atomic-machine.h: Likewise.
* sysdeps/powerpc/powerpc64/atomic-machine.h: Likewise.
* sysdeps/s390/atomic-machine.h: Likewise.
* sysdeps/sparc/sparc32/atomic-machine.h: Likewise.
* sysdeps/sparc/sparc32/sparcv9/atomic-machine.h: Likewise.
* sysdeps/sparc/sparc64/atomic-machine.h: Likewise.
* sysdeps/tile/tilegx/atomic-machine.h: Likewise.
* sysdeps/tile/tilepro/atomic-machine.h: Likewise.
* sysdeps/unix/sysv/linux/hppa/atomic-machine.h: Likewise.
* sysdeps/unix/sysv/linux/m68k/coldfire/atomic-machine.h: Likewise.
* sysdeps/unix/sysv/linux/nios2/atomic-machine.h: Likewise.
* sysdeps/unix/sysv/linux/sh/atomic-machine.h: Likewise.
* sysdeps/x86_64/atomic-machine.h: Likewise.
Optimize x86-64 memcmp/wmemcmp with AVX2. It uses vector compare as
much as possible. It is as fast as SSE4 memcmp for size <= 16 bytes
and up to 2X faster for size > 16 bytes on Haswell and Skylake. Select
AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
AVX unaligned load is fast.
NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input. TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.
Key features:
1. For size from 2 to 7 bytes, load as big endian with movbe and bswap
to avoid branches.
2. Use overlapping compare to avoid branch.
3. Use vector compare when size >= 4 bytes for memcmp or size >= 8
bytes for wmemcmp.
4. If size is 8 * VEC_SIZE or less, unroll the loop.
5. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
8. Use 8 vector compares when size is 8 * VEC_SIZE or less.
* sysdeps/x86/cpu-features.h (index_cpu_MOVBE): New.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
memcmp-avx2 and wmemcmp-avx2.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Test __memcmp_avx2 and __wmemcmp_avx2.
* sysdeps/x86_64/multiarch/memcmp-avx2.S: New file.
* sysdeps/x86_64/multiarch/wmemcmp-avx2.S: Likewise.
* sysdeps/x86_64/multiarch/memcmp.S: Use __memcmp_avx2 on AVX
2 machines if AVX unaligned load is fast and vzeroupper is
preferred.
* sysdeps/x86_64/multiarch/wmemcmp.S: Use __wmemcmp_avx2 on AVX
2 machines if AVX unaligned load is fast and vzeroupper is
preferred.
The difference between strlen and wcslen is byte vs int. We can
replace pminub and pcmpeqb with pminud and pcmpeqd to turn strlen
into wcslen.
* sysdeps/x86_64/strlen.S (PMINU): New.
(PCMPEQ): Likewise.
(SHIFT_RETURN): Likewise.
(FIND_ZERO): Replace pcmpeqb with PCMPEQ.
(strlen): Add SHIFT_RETURN before ret. Replace pcmpeqb and
pminub with PCMPEQ and PMINU.
* sysdeps/x86_64/wcsnlen.S: New file.
By x86-64 specification, 32-bit destination registers are zero-extended
to 64 bits. There is no need to use 64-bit registers when only the lower
32 bits are non-zero. Also 2 instructions in:
mov %rdi, %rcx
and $15, %rcx
jz L(length_less16_offset0)
mov %rdi, %rcx <<< redundant
and $15, %rcx <<< redundant
are redundant.
* sysdeps/x86_64/memrchr.S (__memrchr): Use 32-bit registers for
the lower 32 bits. Remove redundant instructions.
By x86-64 specification, 32-bit destination registers are zero-extended
to 64 bits. There is no need to use 64-bit registers when only the lower
32 bits are non-zero.
* sysdeps/x86_64/memchr.S (MEMCHR): Use 32-bit registers for
the lower 32 bits.
SSE2 memchr computes "edx + ecx - 16" where ecx is less than 16. Use
"edx - (16 - ecx)", instead of satured math, to avoid possible addition
overflow. This replaces
add %ecx, %edx
sbb %eax, %eax
or %eax, %edx
sub $16, %edx
with
neg %ecx
add $16, %ecx
sub %ecx, %edx
It is the same for x86_64, except for rcx/rdx, instead of ecx/edx.
* sysdeps/i386/i686/multiarch/memchr-sse2.S (MEMCHR): Use
"edx + ecx - 16" to avoid possible addition overflow.
* sysdeps/x86_64/memchr.S (memchr): Likewise.
This patch adds a new build module called 'testsuite'.
IS_IN (testsuite) implies _ISOMAC, as do IS_IN_build and __cplusplus
(which means several ad-hoc tests for __cplusplus can go away).
libc-symbols.h now suppresses almost all of *itself* when _ISOMAC is
defined; in particular, _ISOMAC mode does not get config.h
automatically anymore.
There are still quite a few tests that need to see internal gunk of
one variety or another. For them, we now have 'tests-internal' and
'test-internal-extras'; files in this category will still be compiled
with MODULE_NAME=nonlib, and everything proceeds as it always has.
The bulk of this patch is moving tests from 'tests' to
'tests-internal'. There is also 'tests-static-internal', which has
the same effect on files in 'tests-static', and 'modules-names-tests',
which has the *inverse* effect on files in 'modules-names' (it's
inverted because most of the things in modules-names are *not* tests).
For both of these, the file must appear in *both* the new variable and
the old one.
There is also now a special case for when libc-symbols.h is included
without MODULE_NAME being defined at all. (This happens during the
creation of libc-modules.h, and also when preprocessing Versions
files.) When this happens, IS_IN is set to be always false and
_ISOMAC is *not* defined, which was the status quo, but now it's
explicit.
The remaining changes to C source files in this patch seemed likely to
cause problems in the absence of the main change. They should be
relatively self-explanatory. In a few cases I duplicated a definition
from an internal header rather than move the test to tests-internal;
this was a judgement call each time and I'm happy to change those
however reviewers feel is more appropriate.
* Makerules: New subdir configuration variables 'tests-internal'
and 'test-internal-extras'. Test files in these categories will
still be compiled with MODULE_NAME=nonlib. Test files in the
existing categories (tests, xtests, test-srcs, test-extras) are
now compiled with MODULE_NAME=testsuite.
New subdir configuration variable 'modules-names-tests'. Files
which are in both 'modules-names' and 'modules-names-tests' will
be compiled with MODULE_NAME=testsuite instead of
MODULE_NAME=extramodules.
(gen-as-const-headers): Move to tests-internal.
(do-tests-clean, common-mostlyclean): Support tests-internal.
* Makeconfig (built-modules): Add testsuite.
* Makefile: Change libof-check-installed-headers-c and
libof-check-installed-headers-cxx to 'testsuite'.
* Rules: Likewise. Support tests-internal.
* benchtests/strcoll-inputs/filelist#en_US.UTF-8:
Remove extra-modules.mk.
* config.h.in: Don't check for __OPTIMIZE__ or __FAST_MATH__ here.
* include/libc-symbols.h: Move definitions of _GNU_SOURCE,
PASTE_NAME, PASTE_NAME1, IN_MODULE, IS_IN, and IS_IN_LIB to the
very top of the file and rationalize their order.
If MODULE_NAME is not defined at all, define IS_IN to always be
false, and don't define _ISOMAC.
If any of IS_IN (testsuite), IS_IN_build, or __cplusplus are
true, define _ISOMAC and suppress everything else in this file,
starting with the inclusion of config.h.
Do check for inappropriate definitions of __OPTIMIZE__ and
__FAST_MATH__ here, but only if _ISOMAC is not defined.
Correct some out-of-date commentary.
* include/math.h: If _ISOMAC is defined, undefine NO_LONG_DOUBLE
and _Mlong_double_ before including math.h.
* include/string.h: If _ISOMAC is defined, don't expose
_STRING_ARCH_unaligned. Move a comment to a more appropriate
location.
* include/errno.h, include/stdio.h, include/stdlib.h, include/string.h
* include/time.h, include/unistd.h, include/wchar.h: No need to
check __cplusplus nor use __BEGIN_DECLS/__END_DECLS.
* misc/sys/cdefs.h (__NTHNL): New macro.
* sysdeps/m68k/m680x0/fpu/bits/mathinline.h
(__m81_defun): Use __NTHNL to avoid errors with GCC 6.
* elf/tst-env-setuid-tunables.c: Include config.h with _LIBC
defined, for HAVE_TUNABLES.
* inet/tst-checks-posix.c: No need to define _ISOMAC.
* intl/tst-gettext2.c: Provide own definition of N_.
* math/test-signgam-finite-c99.c: No need to define _ISOMAC.
* math/test-signgam-main.c: No need to define _ISOMAC.
* stdlib/tst-strtod.c: Convert to test-driver. Split locale_test to...
* stdlib/tst-strtod1i.c: ...this new file.
* stdlib/tst-strtod5.c: Convert to test-driver and add copyright notice.
Split tests of __strtod_internal to...
* stdlib/tst-strtod5i.c: ...this new file.
* string/test-string.h: Include stdint.h. Duplicate definition of
inhibit_loop_to_libcall here (from libc-symbols.h).
* string/test-strstr.c: Provide dummy definition of
libc_hidden_builtin_def when including strstr.c.
* sysdeps/ia64/fpu/libm-symbols.h: Suppress entire file in _ISOMAC
mode; no need to test __STRICT_ANSI__ nor __cplusplus as well.
* sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h.
Don't include init-arch.h.
* sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h.
Don't include init-arch.h.
* elf/Makefile: Move tst-ptrguard1-static, tst-stackguard1-static,
tst-tls1-static, tst-tls2-static, tst-tls3-static, loadtest,
unload, unload2, circleload1, neededtest, neededtest2,
neededtest3, neededtest4, tst-tls1, tst-tls2, tst-tls3,
tst-tls6, tst-tls7, tst-tls8, tst-dlmopen2, tst-ptrguard1,
tst-stackguard1, tst-_dl_addr_inside_object, and all of the
ifunc tests to tests-internal.
Don't add $(modules-names) to test-extras.
* inet/Makefile: Move tst-inet6_scopeid_pton to tests-internal.
Add tst-deadline to tests-static-internal.
* malloc/Makefile: Move tst-mallocstate and tst-scratch_buffer to
tests-internal.
* misc/Makefile: Move tst-atomic and tst-atomic-long to tests-internal.
* nptl/Makefile: Move tst-typesizes, tst-rwlock19, tst-sem11,
tst-sem12, tst-sem13, tst-barrier5, tst-signal7, tst-tls3,
tst-tls3-malloc, tst-tls5, tst-stackguard1, tst-sem11-static,
tst-sem12-static, and tst-stackguard1-static to tests-internal.
Link tests-internal with libpthread also.
Don't add $(modules-names) to test-extras.
* nss/Makefile: Move tst-field to tests-internal.
* posix/Makefile: Move bug-regex5, bug-regex20, bug-regex33,
tst-rfc3484, tst-rfc3484-2, and tst-rfc3484-3 to tests-internal.
* stdlib/Makefile: Move tst-strtod1i, tst-strtod3, tst-strtod4,
tst-strtod5i, tst-tls-atexit, and tst-tls-atexit-nodelete to
tests-internal.
* sunrpc/Makefile: Move tst-svc_register to tests-internal.
* sysdeps/powerpc/Makefile: Move test-get_hwcap and
test-get_hwcap-static to tests-internal.
* sysdeps/unix/sysv/linux/Makefile: Move tst-setgetname to
tests-internal.
* sysdeps/x86_64/fpu/Makefile: Add all libmvec test modules to
modules-names-tests.