Overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm not
__wcscmp_avx2.
commit ddf0992cf5
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Sun Jan 9 16:02:21 2022 -0600
x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
Set the wrong fallback function for `__wcsncmp_avx2_rtm`. It was set
to fallback on to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm` which
can cause spurious aborts.
This change will need to be backported.
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 9fef7039a7)
Don't set Prefer_No_AVX512 on processors with AVX512 and AVX-VNNI since
they won't lower CPU frequency when ZMM load and store instructions are
used.
(cherry picked from commit ceeffe968c)
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
(cherry picked from commit 14dbbf46a0)
No bug.
This change adds a new macro ENTRY_P2ALIGN which takes a second
argument, log2 of the desired function alignment.
The old ENTRY(name) macro is just ENTRY_P2ALIGN(name, 4) so this
doesn't affect any existing functionality.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit fc5bd179ef)
Previously TEST_NAME was passing a function pointer. This didn't fail
because of the -Wno-error flag (to allow for overflow sizes passed
to strncmp/wcsncmp)
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b98d0bbf74)
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7835d611af)
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c627209832)
In some cases (e.g QEMU, non-Intel/AMD CPU) the cache information can
not be retrieved and the corresponding values are set to 0.
Commit 2d651eb926 ("x86: Move x86 processor cache info to
cpu_features") changed the behaviour in such case by defining the
__x86_shared_cache_size and __x86_data_cache_size variables to 0 instead
of using the default values. This cause an issue with the i686 SSE2
optimized bzero/routine which assumes that the cache size is at least
128 bytes, and otherwise tries to zero/set the whole address space minus
128 bytes.
Fix that by restoring the original code to only update
__x86_shared_cache_size and __x86_data_cache_size variables if the
corresponding cache sizes are not zero.
Fixes bug 28784
Fixes commit 2d651eb926
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c242fcce06)
On KVM guests running on some AMD systems, the IBRS feature is reported
as a synthetic feature using the Intel feature, while the cpuinfo entry
keeps the same. Handle that by first checking the presence of the Intel
feature on AMD systems.
Fixes bug 28704.
(cherry picked from commit 94058f6cde)
commit 3ec5d83d2a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sat Jan 25 14:19:40 2020 -0800
x86-64: Avoid rep movsb with short distance [BZ #27130]
introduced some regressions on Intel processors without Fast Short REP
MOV (FSRM). Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with
short distance only on Intel processors with FSRM. bench-memmove-large
on Skylake server shows that cycles of __memmove_evex_unaligned_erms
improves for the following data size:
before after Improvement
length=4127, align1=3, align2=0: 479.38 349.25 27%
length=4223, align1=9, align2=5: 405.62 333.25 18%
length=8223, align1=3, align2=0: 786.12 496.38 37%
length=8319, align1=9, align2=5: 727.50 501.38 31%
length=16415, align1=3, align2=0: 1436.88 840.00 41%
length=16511, align1=9, align2=5: 1375.50 836.38 39%
length=32799, align1=3, align2=0: 2890.00 1860.12 36%
length=32895, align1=9, align2=5: 2891.38 1931.88 33%
1. Install <bits/platform/x86.h> for <sys/platform/x86.h> which includes
<bits/platform/x86.h>.
2. Rename HAS_CPU_FEATURE to CPU_FEATURE_PRESENT which checks if the
processor has the feature.
3. Rename CPU_FEATURE_USABLE to CPU_FEATURE_ACTIVE which checks if the
feature is active. There may be other preconditions, like sufficient
stack space or further setup for AMX, which must be satisfied before the
feature can be used.
This fixes BZ #27958.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Remove unused code and declare __libc_mallopt when !IS_IN (libc) to
allow the debug hook to build with --disable-tunables.
Also, run tst-ifunc-isa-2* tests only when tunables are enabled since
the result depends on it.
Tested on x86_64.
Reported-by: Matheus Castanho <msc@linux.ibm.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The SSBD feature is implemented in 2 different ways on AMD processors:
newer systems (Zen3) provides AMD_SSBD (function 8000_0008, EBX[24]),
while older system provides AMD_VIRT_SSBD (function 8000_0008, EBX[25]).
However for AMD_VIRT_SSBD, kernel shows both 'ssdb' and 'virt_ssdb' on
/proc/cpuinfo; while for AMD_SSBD only 'ssdb' is provided.
This now check is AMD_SSBD is set to check for 'ssbd', otherwise check
if AMD_VIRT_SSDB is set to check for 'virt_ssbd'.
Checked on x86_64-linux-gnu on a Ryzen 9 5900x.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
From
https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html
* Intel TSX will be disabled by default.
* The processor will force abort all Restricted Transactional Memory (RTM)
transactions by default.
* A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated,
which is set to indicate to updated software that the loaded microcode is
forcing RTM abort.
* On processors that enumerate support for RTM, the CPUID enumeration bits
for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to
be set by default after microcode update.
* Workloads that were benefited from Intel TSX might experience a change
in performance.
* System software may use a new bit in Model-Specific Register (MSR) 0x10F
TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock
Elision (HLE) and RTM bits to indicate to software that Intel TSX is
disabled.
1. Add RTM_ALWAYS_ABORT to CPUID features.
2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set. This skips the
string/tst-memchr-rtm etc. testcases on the affected processors, which
always fail after a microcde update.
3. Check RTM feature, instead of usability, against /proc/cpuinfo.
This fixes BZ #28033.
AMD define different flags for IRPB, IBRS, and STIPBP [1], so new
x86_64_cpu are added and IBRS_IBPB is only tested for Intel.
The SSDB is also defined and implemented different on AMD [2],
and also a new AMD_SSDB flag is added. It should map to the
cpuinfo 'ssdb' on recent AMD cpus.
It fixes tst-cpu-features-cpuinfo and tst-cpu-features-cpuinfo-static
on recent AMD cpus.
Checked on x86_64-linux-gnu on AMD Ryzen 9 5900X.
[1] https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf
[2] https://bugzilla.kernel.org/show_bug.cgi?id=199889
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
IBT and SHSTK usable bits are copied from CPUID feature bits and later
cleared if kernel doesn't support CET. Copy IBT and SHSTK usable only
if CET is enabled so that they aren't set on CET capable processors
with non-CET enabled glibc.
This commit removes the ELF constructor and internal variables from
dlfcn/dlfcn.c. The file now serves the same purpose as
nptl/libpthread-compat.c, so it is renamed to dlfcn/libdl-compat.c.
The use of libdl-shared-only-routines ensures that libdl.a is empty.
This commit adjusts the test suite not to use $(libdl). The libdl.so
symbolic link is no longer installed.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
1. Replace
if ((((uintptr_t) &_d) & (__alignof (double) - 1)) != 0)
which may be optimized out by compiler, with
int
__attribute__ ((weak, noclone, noinline))
is_aligned (void *p, int align)
{
return (((uintptr_t) p) & (align - 1)) != 0;
}
2. Add TEST_STACK_ALIGN_INIT to TEST_STACK_ALIGN.
3. Add a common TEST_STACK_ALIGN_INIT to check 16-byte stack alignment
for both i386 and x86-64.
4. Update powerpc to use TEST_STACK_ALIGN_INIT.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The definitions in libc are sufficient, the forwarders are no longer
needed.
The symbols have been moved using scripts/move-symbol-to-libc.py.
s390-linux-gnu and s390x-linux-gnu need a new version placeholder
to keep the GLIBC_2.19 symbol version in libpthread.
Tested on i386-linux-gnu, powerpc64le-linux-gnu, s390x-linux-gnu,
x86_64-linux-gnu. Built with build-many-glibcs.py.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
All of the isnan functions are in libc.so due to printf_fp, so move
__isnanf128 there too for consistency.
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@ascii.art.br>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
At function exit, AVX optimized string/memory functions have VZEROUPPER
which triggers RTM abort. When such functions are called inside a
transactionally executing RTM region, RTM abort causes severe performance
degradation. Add tests to verify that string/memory functions won't
cause RTM abort in RTM region.
1. Set Prefer_No_VZEROUPPER if RTM is usable to avoid RTM abort triggered
by VZEROUPPER inside a transactionally executing RTM region.
2. Since to compare 2 32-byte strings, 256-bit EVEX strcmp requires 2
loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp requires 1 load, 2 VPCMPEQs,
1 VPMINU and 1 VPMOVMSKB, AVX2 strcmp is faster than EVEX strcmp. Add
Prefer_AVX2_STRCMP to prefer AVX2 strcmp family functions.
1. Support GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVE.
2. Disable all features which depend on XSAVE:
a. If OSXSAVE is disabled by glibc tunables. Or
b. If both XSAVE and XSAVEC aren't usable.
__x86_get_cpuid_feature_leaf is called during early startup, before
the stack check guard is initialized and is hence not safe to build
with stack-protector.
Additionally, IFUNC resolvers for static tst-ifunc-isa tests get
called too early for stack protector to be useful, so fix them to
disable stack protector for the resolver functions.
This fixes all failures seen with --enable-stack-protector=all
configuration.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
commit 2d651eb926
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Sep 18 07:55:14 2020 -0700
x86: Move x86 processor cache info to cpu_features
missed _SC_LEVEL1_ICACHE_LINESIZE.
1. Add level1_icache_linesize to struct cpu_features.
2. Initialize level1_icache_linesize by calling handle_intel,
handle_zhaoxin and handle_amd with _SC_LEVEL1_ICACHE_LINESIZE.
3. Return level1_icache_linesize for _SC_LEVEL1_ICACHE_LINESIZE.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since the full ISA set used in an ELF binary is unknown to compiler,
an x86-64 ISA level marker indicates the minimum, not maximum, ISA set
required to run such an ELF binary. We never guarantee a library with
an x86-64 ISA level v3 marker doesn't contain other ISAs beyond x86-64
ISA level v3, like AVX VNNI. We check the x86-64 ISA level marker for
the minimum ISA set. Since -march=sandybridge enables only some ISAs
in x86-64 ISA level v3, we should set the needed ISA marker to v2.
Otherwise, libc is compiled with -march=sandybridge will fail to run on
Sandy Bridge:
$ ./elf/ld.so ./libc.so
./libc.so: (p) CPU ISA level is lower than required: needed: 7; got: 3
Set the minimum, instead of maximum, x86-64 ISA level marker should have
no impact on the glibc-hwcaps directory assignment logic in ldconfig nor
ld.so.
Remove the extra space between "# endif" left over from
commit f380868f6d
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Dec 24 15:43:34 2020 -0800
Remove _ISOMAC check from <cpu-features.h>
The TUNABLE_SET interface took a primitive C type argument, which
resulted in inconsistent type conversions internally due to incorrect
dereferencing of types, especialy on 32-bit architectures. This
change simplifies the TUNABLE setting logic along with the interfaces.
Now all numeric tunable values are stored as signed numbers in
tunable_num_t, which is intmax_t. All calls to set tunables cast the
input value to its primitive type and then to tunable_num_t for
storage. This relies on gcc-specific (although I suspect other
compilers woul also do the same) unsigned to signed integer conversion
semantics, i.e. the bit pattern is conserved. The reverse conversion
is guaranteed by the standard.
1. Add CPUID_INDEX_14_ECX_0 for CPUID leaf 0x14 to detect PTWRITE feature
in EBX of CPUID leaf 0x14 with ECX == 0.
2. Add PTWRITE detection to CPU feature tests.
3. Add 2 static CPU feature tests.
In the process of optimizing memcpy for AMD machines, we have found the
vector move operations are outperforming enhanced REP MOVSB for data
transfers above the L2 cache size on Zen3 architectures.
To handle this use case, we are adding an upper bound parameter on
enhanced REP MOVSB:'__x86_rep_movsb_stop_threshold'.
As per large-bench results, we are configuring this parameter to the
L2 cache size for AMD machines and applicable from Zen3 architecture
supporting the ERMS feature.
For architectures other than AMD, it is the computed value of
non-temporal threshold parameter.
Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
Add _SC_MINSIGSTKSZ for the minimum signal stack size derived from
AT_MINSIGSTKSZ, which is the minimum number of bytes of free stack
space required in order to gurantee successful, non-nested handling
of a single signal whose handler is an empty function, and _SC_SIGSTKSZ
which is the suggested minimum number of bytes of stack space required
for a signal stack.
If AT_MINSIGSTKSZ isn't available, sysconf (_SC_MINSIGSTKSZ) returns
MINSIGSTKSZ. On Linux/x86 with XSAVE, the signal frame used by kernel
is composed of the following areas and laid out as:
------------------------------
| alignment padding |
------------------------------
| xsave buffer |
------------------------------
| fsave header (32-bit only) |
------------------------------
| siginfo + ucontext |
------------------------------
Compute AT_MINSIGSTKSZ value as size of xsave buffer + size of fsave
header (32-bit only) + size of siginfo and ucontext + alignment padding.
If _SC_SIGSTKSZ_SOURCE or _GNU_SOURCE are defined, MINSIGSTKSZ and SIGSTKSZ
are redefined as
/* Default stack size for a signal handler: sysconf (SC_SIGSTKSZ). */
# undef SIGSTKSZ
# define SIGSTKSZ sysconf (_SC_SIGSTKSZ)
/* Minimum stack size for a signal handler: SIGSTKSZ. */
# undef MINSIGSTKSZ
# define MINSIGSTKSZ SIGSTKSZ
Compilation will fail if the source assumes constant MINSIGSTKSZ or
SIGSTKSZ.
The reason for not simply increasing the kernel's MINSIGSTKSZ #define
(apart from the fact that it is rarely used, due to glibc's shadowing
definitions) was that userspace binaries will have baked in the old
value of the constant and may be making assumptions about it.
For example, the type (char [MINSIGSTKSZ]) changes if this #define
changes. This could be a problem if an newly built library tries to
memcpy() or dump such an object defined by and old binary.
Bounds-checking and the stack sizes passed to things like sigaltstack()
and makecontext() could similarly go wrong.
commit 94cd37ebb2
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Sep 16 05:27:32 2020 -0700
x86: Use HAS_CPU_FEATURE with IBT and SHSTK [BZ #26625]
broke
GLIBC_TUNABLES=glibc.cpu.hwcaps=-IBT,-SHSTK
since it can no longer disable IBT nor SHSTK. Handle IBT and SHSTK with:
1. Revert commit 94cd37ebb2.
2. Clears the usable CET feature bits if kernel doesn't support CET.
3. Add GLIBC_TUNABLES tests without dlopen.
4. Add tests to verify that CPU_FEATURE_USABLE on IBT and SHSTK matches
_get_ssp.
5. Update GLIBC_TUNABLES tests with dlopen to verify that CET is disabled
with GLIBC_TUNABLES.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Check ifunc resolver with CPU_FEATURE_USABLE and tunables in dynamic and
static executables to verify that CPUID features are initialized early in
static PIE.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Extern symbol access in position independent code usually involves GOT
indirection which needs RELATIVE reloc in a static linked PIE. (On
some targets this is avoided e.g. because the linker can relax a GOT
access to a pc-relative access, but this is not generally true.) Code
that runs before static PIE self relocation must avoid relying on
dynamic relocations which can be ensured by using hidden visibility.
However we cannot just make all symbols hidden:
On i386, all calls to IFUNC functions must go through PLT and calls to
hidden functions CANNOT go through PLT in PIE since EBX used in PIE PLT
may not be set up for local calls to hidden IFUNC functions.
This patch aims to make symbol references hidden in code that is used
before and by _dl_relocate_static_pie when building a static PIE libc.
Note: for an object that is used in the startup code, its references
and definition may not have consistent visibility: it is only forced
hidden in the startup code.
This is needed for fixing bug 27072.
Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
In <sys/platform/x86.h>, define CPU features as enum instead of using
the C preprocessor magic to make it easier to wrap this functionality
in other languages. Move the C preprocessor magic to internal header
for better GCC codegen when more than one features are checked in a
single expression as in x86-64 dl-hwcaps-subdirs.c.
1. Rename COMMON_CPUID_INDEX_XXX to CPUID_INDEX_XXX.
2. Move CPUID_INDEX_MAX to sysdeps/x86/include/cpu-features.h.
3. Remove struct cpu_features and __x86_get_cpu_features from
<sys/platform/x86.h>.
4. Add __x86_get_cpuid_feature_leaf to <sys/platform/x86.h> and put it
in libc.
5. Make __get_cpu_features() private to glibc.
6. Replace __x86_get_cpu_features(N) with __get_cpu_features().
7. Add _dl_x86_get_cpu_features to GLIBC_PRIVATE.
8. Use a single enum index for each CPU feature detection.
9. Pass the CPUID feature leaf to __x86_get_cpuid_feature_leaf.
10. Return zero struct cpuid_feature for the older glibc binary with a
smaller CPUID_INDEX_MAX [BZ #27104].
11. Inside glibc, use the C preprocessor magic so that cpu_features data
can be loaded just once leading to more compact code for glibc.
256 bits are used for each CPUID leaf. Some leaves only contain a few
features. We can add exceptions to such leaves. But it will increase
code sizes and it is harder to provide backward/forward compatibilities
when new features are added to such leaves in the future.
When new leaves are added, _rtld_global_ro offsets will change which
leads to race condition during in-place updates. We may avoid in-place
updates by
1. Rename the old glibc.
2. Install the new glibc.
3. Remove the old glibc.
NB: A function, __x86_get_cpuid_feature_leaf , is used to avoid the copy
relocation issue with IFUNC resolver as shown in IFUNC resolver tests.