Rename x86_cpu_INDEX_7_ECX_1 to x86_cpu_INDEX_7_ECX_15 for the unused bit
15 in ECX from CPUID with EAX == 0x7 and ECX == 0.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Linux kernel uses AT_HWCAP2 to indicate if FSGSBASE instructions are
enabled. If the HWCAP2_FSGSBASE bit in AT_HWCAP2 is set, FSGSBASE
instructions can be used in user space. Define dl_check_hwcap2 to set
the FSGSBASE feature to active on Linux when the HWCAP2_FSGSBASE bit is
set.
Add a test to verify that FSGSBASE is active on current kernels.
NB: This test will fail if the kernel doesn't set the HWCAP2_FSGSBASE
bit in AT_HWCAP2 while fsgsbase shows up in /proc/cpuinfo.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
And make always supported. The configure option was added on glibc 2.25
and some features require it (such as hwcap mask, huge pages support, and
lock elisition tuning). It also simplifies the build permutations.
Changes from v1:
* Remove glibc.rtld.dynamic_sort changes, it is orthogonal and needs
more discussion.
* Cleanup more code.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Recorded in [BZ #30183]:
1. export GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX512
2. Add _dl_printf("p -- %s\n", p); just before switch(nl) in
sysdeps/x86/cpu-tunables.c
3. compiled and run ./testrun.sh /usr/bin/ls
you will get:
p -- -AVX512
p -- LC_ADDRESS=en_US.UTF-8
p -- LC_NUMERIC=C
...
The function, TUNABLE_CALLBACK (set_hwcaps)
(tunable_val_t *valp), checks far more than it should and it
should stop at end of "-AVX512".
Crossing 2GB boundaries with indirect calls and jumps can use more
branch prediction resources on Intel Golden Cove CPU (see the
"Misprediction for Branches >2GB" section in Intel 64 and IA-32
Architectures Optimization Reference Manual.) There is visible
performance improvement on workloads with many PLT calls when executable
and shared libraries are mmapped below 2GB. Add the Prefer_MAP_32BIT_EXEC
bit so that mmap will try to map executable or denywrite pages in shared
libraries with MAP_32BIT first.
NB: Prefer_MAP_32BIT_EXEC reduces bits available for address space
layout randomization (ASLR), which is always disabled for SUID programs
and can only be enabled by the tunable, glibc.cpu.prefer_map_32bit_exec,
or the environment variable, LD_PREFER_MAP_32BIT_EXEC. This works only
between shared libraries or between shared libraries and executables with
addresses below 2GB. PIEs are usually loaded at a random address above
4GB by the kernel.
All AMD architectures cache details will be computed based on
__cpuid__ `0x8000_001D` and the reference to __cpuid__ `0x8000_0006` will be
zeroed out for future architectures.
Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
The minimum non_temporal_threshold is 0x4040. non_temporal_threshold may
be set to less than the minimum value when the shared cache size isn't
available (e.g., in an emulator) or by the tunable. Add checks for
minimum and maximum of non_temporal_threshold.
This fixes BZ #29953.
The assembler is not issued directly, but rather always through CC
wrapper. The binutils version check if done with LD instead.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Removal of legacy hwcaps support from the dynamic loader left
no users of _dl_string_hwcap.
Signed-off-by: Javier Pello <devel@otheo.eu>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The AVX2 strrchr and wcsrchr implementation uses the 'blsmsk'
instruction which belongs to the BMI1 CPU feature and the 'shrx'
instruction, which belongs to the BMI2 CPU feature.
Fixes: df7e295d18 ("x86: Optimize {str|wcs}rchr-avx2")
Partially resolves: BZ #29611
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
The AVX2 memrchr implementation uses the 'shlxl' instruction, which
belongs to the BMI2 CPU feature and uses the 'lzcnt' instruction, which
belongs to the LZCNT CPU feature.
Fixes: af5306a735 ("x86: Optimize memrchr-avx2.S")
Partially resolves: BZ #29611
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
The "System V Application Binary Interface AMD64 Architecture Processor
Supplement" mandates the BMI1 and BMI2 CPU features for the x86-64-v3
level.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
GCC 13 adds support for _FloatN and _FloatNx types in C++, so breaking
the installed glibc headers that assume such support is not present.
GCC mostly works around this with fixincludes, but that doesn't help
for building glibc and its tests (glibc doesn't itself contain C++
code, but there's C++ code built for tests). Update glibc's
bits/floatn-common.h and bits/floatn.h headers to handle the GCC 13
support directly.
In general the changes match those made by fixincludes, though I think
the ones in sysdeps/powerpc/bits/floatn.h, where the header tests
__LDBL_MANT_DIG__ == 113 or uses #elif, wouldn't match the existing
fixincludes patterns.
Some places involving special C++ handling in relation to _FloatN
support are not changed. There's no need to change the
__HAVE_FLOATN_NOT_TYPEDEF definition (also in a form that wouldn't be
matched by the fixincludes fixes) because it's only used in relation
to macro definitions using features not supported for C++
(__builtin_types_compatible_p and _Generic). And there's no need to
change the inline function overloads for issignaling, iszero and
iscanonical in C++ because cases where types have the same format but
are no longer compatible types are handled automatically by the C++
overload resolution rules.
This patch also does not change the overload handling for iseqsig, and
there I think changes *are* needed, beyond those in this patch or made
by fixincludes. The way that overload is defined, via a template
parameter to a structure type, requires overloads whenever the types
are incompatible, even if they have the same format. So I think we
need to add overloads with GCC 13 for every supported _FloatN and
_FloatNx type, rather than just having one for _Float128 when it has a
different ABI to long double as at present (but for older GCC, such
overloads must not be defined for types that end up defined as
typedefs for another type).
Tested with build-many-glibcs.py: compilers build for
aarch64-linux-gnu ia64-linux-gnu mips64-linux-gnu powerpc-linux-gnu
powerpc64le-linux-gnu x86_64-linux-gnu; glibcs build for
aarch64-linux-gnu ia64-linux-gnu i686-linux-gnu mips-linux-gnu
mips64-linux-gnu-n32 powerpc-linux-gnu powerpc64le-linux-gnu
x86_64-linux-gnu.
Not all compilers support the inline asm prefix '%v' to emit the avx
instruction if AVX is enable. Use a prefix instead.
Checked on x86_64-linux-gnu and i686-linux-gnu.
It avoids the possible warning of uninitialized 'frame' variable when
building with clang:
../sysdeps/nptl/jmp-unwind.c:27:42: error: variable 'frame' is
uninitialized when used here [-Werror,-Wuninitialized]
__pthread_cleanup_upto (env->__jmpbuf, CURRENT_STACK_FRAME);
The resulting code is similar to CURRENT_STACK_FRAME.
Checked on x86_64-linux-gnu.
1. Add default ISA level selection in non-multiarch/rtld
implementations.
2. Add ISA level build guards to different implementations.
- I.e strcmp-avx2.S which is ISA level 3 will only build if
compiled ISA level <= 3. Otherwise there is no reason to
include it as we will always use one of the ISA level 4
implementations (strcmp-evex.S).
3. Refactor the ifunc selector and ifunc implementation list to use
the ISA level aware wrapper macros that allow functions below the
compiled ISA level (with a guranteed replacement) to be skipped.
Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}
And m32 with and without multiarch.
1. Refactor files so that all implementations are in the multiarch
directory
- Moved the implementation portion of memcmp sse2 from memcmp.S to
multiarch/memcmp-sse2.S
- The non-multiarch file now only includes one of the
implementations in the multiarch directory based on the compiled
ISA level (only used for non-multiarch builds. Otherwise we go
through the ifunc selector).
2. Add ISA level build guards to different implementations.
- I.e memcmp-avx2-movsb.S which is ISA level 3 will only build if
compiled ISA level <= 3. Otherwise there is no reason to include
it as we will always use one of the ISA level 4
implementations (memcmp-evex-movbe.S).
3. Add new multiarch/rtld-{w}memcmp{eq}.S that just include the
non-multiarch {w}memcmp{eq}.S which will in turn select the best
implementation based on the compiled ISA level.
4. Refactor the ifunc selector and ifunc implementation list to use
the ISA level aware wrapper macros that allow functions below the
compiled ISA level (with a guranteed replacement) to be skipped.
Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}
And m32 with and without multiarch.
When glibc is built with x86-64 ISA level v3, SSE run-time resolvers
aren't used. For x86-64 ISA level v4 build, both SSE and AVX resolvers
are unused. Check the minimum x86-64 ISA level to exclude the unused
run-time resolvers.
Add third argument to X86_ISA_CPU_FEATURES_ARCH_P macro so the runtime
CPU_FEATURES_ARCH_P check can be inverted if the
MINIMUM_X86_ISA_LEVEL is not high enough to constantly evaluate
the check.
Use this new macro to correct the backwards check in ifunc-evex.h
1. Factor out some of the ISA level defines in isa-level.c to
standalone header isa-level.h
2. Add new headers with ISA level dependent macros for handling
ifuncs.
Note, this file does not change any code.
Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}
And m32 with and without multiarch.
The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.
The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memset_4x) case.
The upper-bound is needed because memmove-vec-unaligned-erms
right-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
If an executable has copy relocations for extern protected data, that
can only work if the library containing the definition is built with
assumptions (a) the compiler emits GOT-generating relocations (b) the
linker produces R_*_GLOB_DAT instead of R_*_RELATIVE. Otherwise the
library uses its own definition directly and the executable accesses a
stale copy. Note: the GOT relocations defeat the purpose of protected
visibility as an optimization, but allow rtld to make the executable and
library use the same copy when copy relocations are present, but it
turns out this never worked perfectly.
ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange semantics when both
a.so and b.so define protected var and the executable copy relocates
var: b.so accesses its own copy even with GLOB_DAT. The behavior change
is from commit 62da1e3b00 (x86) and then
copied to nios2 (ae5eae7cfc) and arc
(0e7d930c4c).
Without ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA, b.so accesses the copy
relocated data like a.so.
There is now a warning for copy relocation on protected symbol since
commit 7374c02b68. It's extremely
unlikely anyone relies on the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA
behavior, so let's remove it: this removes a check in the symbol lookup
code.
Move the setting of `rep_movsb_stop_threshold` to after the tunables
have been collected so that the `rep_movsb_stop_threshold` (which
is used to redirect control flow to the non_temporal case) will
use any user value for `non_temporal_threshold` (set using
glibc.cpu.x86_non_temporal_threshold)
Unroll slightly and enforce good instruction scheduling. This improves
performance on out-of-order machines. The unrolling allows for
pipelined multiplies.
As well, as an optional sysdep, reorder the operations and prevent
reassosiation for better scheduling and higher ILP. This commit
only adds the barrier for x86, although it should be either no
change or a win for any architecture.
Unrolling further started to induce slowdowns for sizes [0, 4]
but can help the loop so if larger sizes are the target further
unrolling can be beneficial.
Results for _dl_new_hash
Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Time as Geometric Mean of N=30 runs
Geometric of all benchmark New / Old: 0.674
type, length, New Time, Old Time, New Time / Old Time
fixed, 0, 2.865, 2.72, 1.053
fixed, 1, 3.567, 2.489, 1.433
fixed, 2, 2.577, 3.649, 0.706
fixed, 3, 3.644, 5.983, 0.609
fixed, 4, 4.211, 6.833, 0.616
fixed, 5, 4.741, 9.372, 0.506
fixed, 6, 5.415, 9.561, 0.566
fixed, 7, 6.649, 10.789, 0.616
fixed, 8, 8.081, 11.808, 0.684
fixed, 9, 8.427, 12.935, 0.651
fixed, 10, 8.673, 14.134, 0.614
fixed, 11, 10.69, 15.408, 0.694
fixed, 12, 10.789, 16.982, 0.635
fixed, 13, 12.169, 18.411, 0.661
fixed, 14, 12.659, 19.914, 0.636
fixed, 15, 13.526, 21.541, 0.628
fixed, 16, 14.211, 23.088, 0.616
fixed, 32, 29.412, 52.722, 0.558
fixed, 64, 65.41, 142.351, 0.459
fixed, 128, 138.505, 295.625, 0.469
fixed, 256, 291.707, 601.983, 0.485
random, 2, 12.698, 12.849, 0.988
random, 4, 16.065, 15.857, 1.013
random, 8, 19.564, 21.105, 0.927
random, 16, 23.919, 26.823, 0.892
random, 32, 31.987, 39.591, 0.808
random, 64, 49.282, 71.487, 0.689
random, 128, 82.23, 145.364, 0.566
random, 256, 152.209, 298.434, 0.51
Co-authored-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
PI_STATIC_AND_HIDDEN indicates whether accesses to internal linkage
variables and hidden visibility variables in a shared object (ld.so)
need dynamic relocations (usually R_*_RELATIVE). PI (position
independent) in the macro name is a misnomer: a code sequence using GOT
is typically position-independent as well, but using dynamic relocations
does not meet the requirement.
Not defining PI_STATIC_AND_HIDDEN is legacy and we expect that all new
ports will define PI_STATIC_AND_HIDDEN. Current ports defining
PI_STATIC_AND_HIDDEN are more than the opposite. Change the configure
default.
No functional change.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm not
__wcscmp_avx2.
commit ddf0992cf5
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Sun Jan 9 16:02:21 2022 -0600
x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
Set the wrong fallback function for `__wcsncmp_avx2_rtm`. It was set
to fallback on to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm` which
can cause spurious aborts.
This change will need to be backported.
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Previously TEST_NAME was passing a function pointer. This didn't fail
because of the -Wno-error flag (to allow for overflow sizes passed
to strncmp/wcsncmp)
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
In some cases (e.g QEMU, non-Intel/AMD CPU) the cache information can
not be retrieved and the corresponding values are set to 0.
Commit 2d651eb926 ("x86: Move x86 processor cache info to
cpu_features") changed the behaviour in such case by defining the
__x86_shared_cache_size and __x86_data_cache_size variables to 0 instead
of using the default values. This cause an issue with the i686 SSE2
optimized bzero/routine which assumes that the cache size is at least
128 bytes, and otherwise tries to zero/set the whole address space minus
128 bytes.
Fix that by restoring the original code to only update
__x86_shared_cache_size and __x86_data_cache_size variables if the
corresponding cache sizes are not zero.
Fixes bug 28784
Fixes commit 2d651eb926
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This ISA level covers the glibc build itself. <dl-hwcap-check.h>
cannot be used because this check (by design) happens before
DL_PLATFORM_INIT and the x86 CPU flags initialization.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
HAVE_X86_LAHF_SAHF is implied by x86-64-v2, and HAVE_X86_MOVBE by
x86-64-v3.
The individual flag does not appear in -fverbose-asm flag output
even if the ISA level implies it.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah. I don't
know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
Implement vectorized tan/tanf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector tan/tanf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized erfc/erfcf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector erfc/erfcf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized asinh/asinhf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector asinh/asinhf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized tanh/tanhf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector tanh/tanhf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized erf/erff containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector erf/erff with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized acosh/acoshf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector acosh/acoshf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized atanh/atanhf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector atanh/atanhf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized log1p/log1pf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector log1p/log1pf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized log2/log2f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector log2/log2f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized log10/log10f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector log10/log10f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized atan2/atan2f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector atan2/atan2f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized cbrt/cbrtf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector cbrt/cbrtf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized sinh/sinhf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector sinh/sinhf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized expm1/expm1f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector expm1/expm1f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized cosh/coshf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector cosh/coshf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized exp10/exp10f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector exp10/exp10f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized exp2/exp2f containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector exp2/exp2f with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized hypot/hypotf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector hypot/hypotf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized asin/asinf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector asin/asinf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Implement vectorized atan/atanf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector atan/atanf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
It can be used to speed up the libgcc unwinder, and the internal
_dl_find_dso_for_object function (which is used for caller
identification in dlopen and related functions, and in dladdr).
_dl_find_object is in the internal namespace due to bug 28503.
If libgcc switches to _dl_find_object, this namespace issue will
be fixed. It is located in libc for two reasons: it is necessary
to forward the call to the static libc after static dlopen, and
there is a link ordering issue with -static-libgcc and libgcc_eh.a
because libc.so is not a linker script that includes ld.so in the
glibc build tree (so that GCC's internal -lc after libgcc_eh.a does
not pick up ld.so).
It is necessary to do the i386 customization in the
sysdeps/x86/bits/dl_find_object.h header shared with x86-64 because
otherwise, multilib installations are broken.
The implementation uses software transactional memory, as suggested
by Torvald Riegel. Two copies of the supporting data structures are
used, also achieving full async-signal-safety.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Implement vectorized acos/acosf containing SSE, AVX, AVX2 and
AVX512 versions for libmvec as per vector ABI. It also contains
accuracy and ABI tests for vector acos/acosf with regenerated ulps.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
On KVM guests running on some AMD systems, the IBRS feature is reported
as a synthetic feature using the Intel feature, while the cpuinfo entry
keeps the same. Handle that by first checking the presence of the Intel
feature on AMD systems.
Fixes bug 28704.
Remove the LD_PREFER_MAP_32BIT_EXEC environment variable support since
the first PT_LOAD segment is no longer executable due to defaulting to
-z separate-code.
This fixes [BZ #28656].
Reviewed-by: Florian Weimer <fweimer@redhat.com>
<tls.h> already contains a definition that is quite similar,
but it is not consistent across architectures.
Only architectures for which rseq support is added are covered.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
Intel MPX failed to gain wide adoption and has been deprecated for a
while. GCC 9.1 removed Intel MPX support. Linux kernel removed MPX in
2019.
This patch removes the support code from the dynamic loader.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
No bug.
This change adds a new macro ENTRY_P2ALIGN which takes a second
argument, log2 of the desired function alignment.
The old ENTRY(name) macro is just ENTRY_P2ALIGN(name, 4) so this
doesn't affect any existing functionality.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
1. Add GNU_PROPERTY_1_NEEDED:
#define GNU_PROPERTY_1_NEEDED GNU_PROPERTY_UINT32_OR_LO
to indicate the needed properties by the object file.
2. Add GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS:
#define GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS (1U << 0)
to indicate that the object file requires canonical function pointers and
cannot be used with copy relocation.
3. Scan GNU_PROPERTY_1_NEEDED property and store it in l_1_needed.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
It turns out the __SSE2_MATH__ conditional in sysdeps/x86/fpu/s_ffma.c
does not cover all cases where the x86 fenv_private.h macros might
manipulate one of the SSE and 387 floating-point state, while the
actual fma implementation uses the other. Specifically, in the 32-bit
case, with a compiler not defaulting to -mfpmath=sse, but testing on a
processor with hardware FMA support, the multiarch fma function
implementations will end up using SSE, while the fenv_private.h macros
will use the 387 state for double. Change the conditional to use the
default macros rather than the optimized ones in all cases except when
the compiler inlines an fma instruction (in which case, since all
those instructions are SSE instructions and -mfpmath=sse must be in
effect for them to be inlined, the optimized macros will only use the
SSE state and it's OK for them to only use the SSE state).
Tested for x86_64 and x86. H.J. reports in
<https://sourceware.org/pipermail/libc-alpha/2021-September/131367.html>
that it fixes the problems he observed.
On 32-bit x86 with -mfpmath=sse, and on x86_64 with
--disable-multi-arch, the tests of ffma and its aliases (fma narrowing
from binary64 to binary32) fail. This is probably the issue reported
by H.J. in
<https://sourceware.org/pipermail/libc-alpha/2021-September/131277.html>.
The problem is the use of fenv_private.h macros in the round-to-odd
implementation. Those macros are set up to manipulate only one of the
SSE and 387 floating-point state, whichever is relevant for the type
indicated by the suffix on the macro name. But x86 configurations
sometimes use the ldbl-96 implementation of binary64 fma (that's where
--disable-multi-arch is relevant for x86_64: it causes the ldbl-96
implementation to be used, instead of an IFUNC implementation that
falls back to the dbl-64 version), contrary to the expectations of
those macros for functions operating on double when __SSE2_MATH__ is
defined.
This can be addressed by using the default versions of those macros
(giving x86 its own version of s_ffma.c), as is done for the *f128
macro variants where it depends on the details of how GCC was
configured when building libgcc which floating-point state is affected
by _Float128 arithmetic. The issue only applies when __SSE2_MATH__ is
defined, and doesn't apply when __FP_FAST_FMA is defined (because in
that case, fma will be inlined by the compiler, meaning it's
definitely an SSE operation; for the same reason, this is not an issue
for narrowing sqrt, as hardware sqrt is always inlined in that
implementation for x86), but in other cases it's safest to use the
default versions of the fenv_private.h macros to ensure things work
whichever fma implementation is used.
Tested for x86_64 (with and without --disable-multi-arch) and x86
(with and without -mfpmath=sse).
We stopped adding "Contributed by" or similar lines in sources in 2012
in favour of git logs and keeping the Contributors section of the
glibc manual up to date. Removing these lines makes the license
header a bit more consistent across files and also removes the
possibility of error in attribution when license blocks or files are
copied across since the contributed-by lines don't actually reflect
reality in those cases.
Move all "Contributed by" and similar lines (Written by, Test by,
etc.) into a new file CONTRIBUTED-BY to retain record of these
contributions. These contributors are also mentioned in
manual/contrib.texi, so we just maintain this additional record as a
courtesy to the earlier developers.
The following scripts were used to filter a list of files to edit in
place and to clean up the CONTRIBUTED-BY file respectively. These
were not added to the glibc sources because they're not expected to be
of any use in future given that this is a one time task:
https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dchttps://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
When using LLD (LLVM linker) as the linker, configure prints a confusing
message.
*** These critical programs are missing or too old: GNU ld
LLD>=13.0.0 can build glibc --enable-static-pie. (8.0.0 needs one
workaround for -Wl,-defsym=_begin=0. 9.0.0 works with
--disable-static-pie).
XFAIL two tests sysdeps/x86/tst-ifunc-isa-* which have the BZ #28154
issue (LLD follows the PowerPC port of GNU ld for ifunc by placing
IRELATIVE relocations in .rela.dyn, triggering a glibc ifunc fragility).
The set of dynamic symbols is the same with GNU ld and LLD,
modulo unused SHN_ABS version node symbols.
For comparison, gold does not support --enable-static-pie
yet (--no-dynamic-linker is unsupported BZ #22221), yet
has 6 failures more than LLD. gold linked libc.so has
larger .dynsym differences with GNU ld and LLD
(non-default version symbols are changed to default versions
by a version script BZ #28196).
The Autoconf documentation for the AC_CACHE_CHECK macro states:
The commands-to-set-it must have no side effects except for setting
the variable cache-id, see below.
However, the tests for support of -msahf and -mmovbe were embedded in
the commands-to-set-it for lib_cv_include_x86_isa_level. This had the
consequence that libc_cv_have_x86_lahf_sahf and libc_cv_have_x86_movbe
were not defined whenever lib_cv_include_x86_isa_level was read from
cache. These variables' being undefined meant that their unquoted use
in later test expressions led to the 'test' built-in's misparsing its
arguments and emitting errors like "test: =: unexpected operator" or
"test: =: unary operator expected", depending on the particular shell.
This commit refactors the tests for LAHF/SAHF and MOVBE instruction
support into their own AC_CACHE_CHECK macro invocations to obey the
rule that the commands-to-set-it must have no side effects other than
setting the variable named by cache-id.
Signed-off-by: Matt Whitlock <sourceware@mattwhitlock.name>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
commit 3ec5d83d2a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sat Jan 25 14:19:40 2020 -0800
x86-64: Avoid rep movsb with short distance [BZ #27130]
introduced some regressions on Intel processors without Fast Short REP
MOV (FSRM). Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with
short distance only on Intel processors with FSRM. bench-memmove-large
on Skylake server shows that cycles of __memmove_evex_unaligned_erms
improves for the following data size:
before after Improvement
length=4127, align1=3, align2=0: 479.38 349.25 27%
length=4223, align1=9, align2=5: 405.62 333.25 18%
length=8223, align1=3, align2=0: 786.12 496.38 37%
length=8319, align1=9, align2=5: 727.50 501.38 31%
length=16415, align1=3, align2=0: 1436.88 840.00 41%
length=16511, align1=9, align2=5: 1375.50 836.38 39%
length=32799, align1=3, align2=0: 2890.00 1860.12 36%
length=32895, align1=9, align2=5: 2891.38 1931.88 33%
1. Install <bits/platform/x86.h> for <sys/platform/x86.h> which includes
<bits/platform/x86.h>.
2. Rename HAS_CPU_FEATURE to CPU_FEATURE_PRESENT which checks if the
processor has the feature.
3. Rename CPU_FEATURE_USABLE to CPU_FEATURE_ACTIVE which checks if the
feature is active. There may be other preconditions, like sufficient
stack space or further setup for AMX, which must be satisfied before the
feature can be used.
This fixes BZ #27958.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Remove unused code and declare __libc_mallopt when !IS_IN (libc) to
allow the debug hook to build with --disable-tunables.
Also, run tst-ifunc-isa-2* tests only when tunables are enabled since
the result depends on it.
Tested on x86_64.
Reported-by: Matheus Castanho <msc@linux.ibm.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The SSBD feature is implemented in 2 different ways on AMD processors:
newer systems (Zen3) provides AMD_SSBD (function 8000_0008, EBX[24]),
while older system provides AMD_VIRT_SSBD (function 8000_0008, EBX[25]).
However for AMD_VIRT_SSBD, kernel shows both 'ssdb' and 'virt_ssdb' on
/proc/cpuinfo; while for AMD_SSBD only 'ssdb' is provided.
This now check is AMD_SSBD is set to check for 'ssbd', otherwise check
if AMD_VIRT_SSDB is set to check for 'virt_ssbd'.
Checked on x86_64-linux-gnu on a Ryzen 9 5900x.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
From
https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html
* Intel TSX will be disabled by default.
* The processor will force abort all Restricted Transactional Memory (RTM)
transactions by default.
* A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated,
which is set to indicate to updated software that the loaded microcode is
forcing RTM abort.
* On processors that enumerate support for RTM, the CPUID enumeration bits
for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to
be set by default after microcode update.
* Workloads that were benefited from Intel TSX might experience a change
in performance.
* System software may use a new bit in Model-Specific Register (MSR) 0x10F
TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock
Elision (HLE) and RTM bits to indicate to software that Intel TSX is
disabled.
1. Add RTM_ALWAYS_ABORT to CPUID features.
2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set. This skips the
string/tst-memchr-rtm etc. testcases on the affected processors, which
always fail after a microcde update.
3. Check RTM feature, instead of usability, against /proc/cpuinfo.
This fixes BZ #28033.