GCC 11 supports -march=x86-64-v[234] to enable x86 micro-architecture ISA
levels:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97250
and -mneeded to emit GNU_PROPERTY_X86_ISA_1_NEEDED property with
GNU_PROPERTY_X86_ISA_1_V[234] marker:
https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/13
Binutils support for GNU_PROPERTY_X86_ISA_1_V[234] marker was added by
commit b0ab06937385e0ae25cebf1991787d64f439bf12
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Oct 30 06:49:57 2020 -0700
x86: Support GNU_PROPERTY_X86_ISA_1_BASELINE marker
and
commit 32930e4edbc06bc6f10c435dbcc63131715df678
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Oct 9 05:05:57 2020 -0700
x86: Support GNU_PROPERTY_X86_ISA_1_V[234] marker
GNU_PROPERTY_X86_ISA_1_NEEDED property in x86 ELF binaries indicates the
micro-architecture ISA level required to execute the binary. The marker
must be added by programmers explicitly in one of 3 ways:
1. Pass -mneeded to GCC.
2. Add the marker in the linker inputs as this patch does.
3. Pass -z x86-64-v[234] to the linker.
Add GNU_PROPERTY_X86_ISA_1_BASELINE and GNU_PROPERTY_X86_ISA_1_V[234]
marker support to ld.so if binutils 2.32 or newer is used to build glibc:
1. Add GNU_PROPERTY_X86_ISA_1_BASELINE and GNU_PROPERTY_X86_ISA_1_V[234]
markers to elf.h.
2. Add GNU_PROPERTY_X86_ISA_1_BASELINE and GNU_PROPERTY_X86_ISA_1_V[234]
marker to abi-note.o based on the ISA level used to compile abi-note.o,
assuming that the same ISA level is used to compile the whole glibc.
3. Add isa_1 to cpu_features to record the supported x86 ISA level.
4. Rename _dl_process_cet_property_note to _dl_process_property_note and
add GNU_PROPERTY_X86_ISA_1_V[234] marker detection.
5. Update _rtld_main_check and _dl_open_check to check loaded objects
with the incompatible ISA level.
6. Add a testcase to verify that dlopen of an x86-64-v4 shared object
fails on lesser platforms.
7. Use <get-isa-level.h> in dl-hwcaps-subdirs.c and tst-glibc-hwcaps.c.
Tested under i686, x32 and x86-64 modes on x86-64-v2, x86-64-v3 and
x86-64-v4 machines.
Marked elf/tst-isa-level-1 with x86-64-v4, ran it on x86-64-v3 machine
and got:
[hjl@gnu-cfl-2 build-x86_64-linux]$ ./elf/tst-isa-level-1
./elf/tst-isa-level-1: CPU ISA level is lower than required
[hjl@gnu-cfl-2 build-x86_64-linux]$
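The compatibility check behind the "CPU ISA level is lower than required"
diagnostic can be sketched as a bit mask comparison. This is only an
illustration: the macro names and the helper isa_level_ok are
hypothetical, and the bit values are assumed to match the
GNU_PROPERTY_X86_ISA_1_* markers in elf.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values of the ISA level markers (one bit per level).  */
#define ISA_1_BASELINE (1U << 0)
#define ISA_1_V2       (1U << 1)
#define ISA_1_V3       (1U << 2)
#define ISA_1_V4       (1U << 3)

/* Hypothetical helper: return true if every ISA level bit that the
   object marks as needed is also set in the levels the CPU supports.  */
bool
isa_level_ok (uint32_t needed, uint32_t supported)
{
  return (needed & supported) == needed;
}
```

An object marked x86-64-v4 would be rejected on an x86-64-v3 machine
because the V4 bit is missing from the supported mask.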
Make the tests use TEST_COND_intel96 to decide on whether to build the
unnormal tests instead of the macro in nan-pseudo-number.h and then
drop the header inclusion. This unbreaks test runs on all
architectures that do not have ldbl-96.
Also drop the HANDLE_PSEUDO_NUMBERS macro since it is not used
anywhere.
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted of lines saying "FOO:
warning: copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
Add support to treat pseudo-numbers specially and implement x86
version to consider all of them as signaling.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
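A minimal sketch of how a pseudo-number can be recognized in the x87
80-bit format, assuming the value has already been split into its biased
exponent and 64-bit significand. The function name is hypothetical and
this is not the glibc implementation: an encoding with a nonzero exponent
and a clear explicit integer bit (pseudo-NaN, pseudo-infinity or
unnormal) is treated as a pseudo-number.

```c
#include <stdbool.h>
#include <stdint.h>

/* BIASED_EXP holds the sign and 15-bit exponent word; SIGNIFICAND is the
   64-bit significand whose top bit is the explicit integer bit.  */
bool
is_pseudo_number (uint16_t biased_exp, uint64_t significand)
{
  bool integer_bit = ((significand >> 63) & 1) != 0;
  /* Mask off the sign bit before testing the exponent.  */
  return (biased_exp & 0x7fff) != 0 && !integer_bit;
}
```

Denormals (zero exponent) and ordinary normal numbers, infinities and
NaNs (integer bit set) are not flagged by this test.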
Do not attempt to fix the significand top bit in long double input
received in printf. The code should never reach here because isnan
should now detect unnormals as NaN. This is already a NOP for glibc
since it uses the gcc __builtin_isnan, which detects unnormals as NaN.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
This syncs up isnanl behaviour with gcc. Also move the isnanl
implementation to sysdeps/x86 and remove the sysdeps/x86_64 version.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Also move sysdeps/i386/fpu/s_fpclassifyl.c to
sysdeps/x86/fpu/s_fpclassifyl.c and remove
sysdeps/x86_64/fpu/s_fpclassifyl.c
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add Intel Linear Address Masking (LAM) support to <sys/platform/x86.h>.
HAS_CPU_FEATURE (LAM) can be used to detect if LAM is enabled in CPU.
LAM modifies the checking that is applied to 64-bit linear addresses,
allowing software to use the untranslated address bits for metadata.
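As an illustration of carrying metadata in untranslated address bits, the
following sketch packs a small tag into a 64-bit address value. The
layout is an assumption (a LAM_U57-style mode in which bits 62:57 of a
user address are ignored), and all names are hypothetical. The code only
manipulates integers; it does not enable LAM or dereference anything.

```c
#include <stdint.h>

/* Assumed layout: a 6-bit tag stored in address bits 62:57.  */
#define TAG_SHIFT 57
#define TAG_MASK  (UINT64_C (0x3f) << TAG_SHIFT)

/* Store TAG in the metadata bits of ADDR.  */
uint64_t
tag_address (uint64_t addr, unsigned int tag)
{
  return (addr & ~TAG_MASK) | (((uint64_t) tag & 0x3f) << TAG_SHIFT);
}

/* Read the tag back from a tagged address.  */
unsigned int
address_tag (uint64_t addr)
{
  return (unsigned int) ((addr & TAG_MASK) >> TAG_SHIFT);
}

/* Clear the metadata bits, as needed when LAM is not enabled.  */
uint64_t
strip_tag (uint64_t addr)
{
  return addr & ~TAG_MASK;
}
```

Without LAM enabled, a tagged address would have to be passed through
strip_tag before it is used as a pointer.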
Since we can't tell if the tunable value is set by user or not:
https://sourceware.org/bugzilla/show_bug.cgi?id=27069
remove the default REP MOVSB threshold tunable value so that the correct
default value will be set by init_cacheinfo ().
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
To handle GNU property notes on aarch64 some segments need to
be mmaped again, so the fd of the loaded ELF module is needed.
When the fd is not available (kernel loaded modules), then -1
is passed.
The fd is passed to both _dl_process_pt_gnu_property and
_dl_process_pt_note for consistency. Target specific note
processing functions are updated accordingly.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Check HAS_CPU_FEATURE instead of CPU_FEATURE_USABLE for FSGSBASE, IBT,
LM, SHSTK and XSAVES since FSGSBASE requires kernel support, IBT/SHSTK/LM
require OS support and XSAVES is supervisor-mode only.
The UP macro is never defined. Also define LOCK_PREFIX
unconditionally, to the same string.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Modify the shareable cache size '__x86_shared_cache_size', which is a
factor in computing the non-temporal threshold parameter
'__x86_shared_non_temporal_threshold', to optimize memcpy for AMD Zen
architectures.
In the existing implementation, the shareable cache is computed as 'L3
per thread, L2 per core'. Recomputing this shareable cache as 'L3 per
CCX(Core-Complex)' has brought in performance gains.
As per the large bench variant results, this patch also addresses the
regression problem on AMD Zen architectures.
Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
X86 CPU features in ld.so are initialized by init_cpu_features, which is
invoked by DL_PLATFORM_INIT from _dl_sysdep_start. But when ld.so is
loaded by a static executable, DL_PLATFORM_INIT is never called. Also
x86 cache info in libc.o and libc.a is initialized by a constructor
which may be called too late. Since some fields in _rtld_global_ro
in ld.so are initialized by dynamic relocation, we can also initialize
x86 CPU features in _rtld_global_ro in ld.so and cache info in libc.so
by initializing dummy function pointers in ld.so and libc.so via IFUNC
relocation.
Key points:
1. IFUNC is always supported, independent of --enable-multi-arch or
--disable-multi-arch. Linker generates IFUNC relocations from input
IFUNC objects and ld.so performs IFUNC relocations.
2. There are no IFUNC dependencies in ld.so before dynamic relocations
have been performed.
3. The x86 CPU features in ld.so are initialized by DL_PLATFORM_INIT
in a dynamic executable and by IFUNC relocation in dlopen in a static
executable.
4. The x86 cache info in libc.o is initialized by IFUNC relocation.
5. In libc.a, both x86 CPU features and cache info are initialized from
ARCH_INIT_CPU_FEATURES, not by IFUNC relocation, before __libc_early_init
is called.
Note: _dl_x86_init_cpu_features can be called more than once from
DL_PLATFORM_INIT and during relocation in ld.so.
The __x86_shared_non_temporal_threshold determines when memcpy on x86
uses non-temporal stores to avoid pushing other data out of the last
level cache.
This patch proposes to revert the calculation change made by H.J. Lu's
patch of June 2, 2017.
H.J. Lu's patch selected a threshold suitable for a single thread
getting maximum performance. It was tuned using the single threaded
large memcpy micro benchmark on an 8 core processor. That change moved
the threshold from using 3/4 of one thread's share of the
cache to using 3/4 of the entire cache of a multi-threaded system
before switching to non-temporal stores. Multi-threaded systems with
more than a few threads are server-class and typically have many
active threads. If one thread consumes 3/4 of the available cache for
all threads, it will cause other active threads to have data removed
from the cache. Two examples show the range of the effect. John
McCalpin's widely parallel Stream benchmark, which runs in parallel
and fetches data sequentially, saw a 20% slowdown with this patch on
an internal system test of 128 threads. This regression was discovered
when comparing OL8 performance to OL7. An example that compares
normal stores to non-temporal stores may be found at
https://vgatherps.github.io/2018-09-02-nontemporal/. A simple test
shows performance loss of 400 to 500% due to a failure to use
nontemporal stores. These performance losses are most likely to occur
when the system load is heaviest and good performance is critical.
The tunable x86_non_temporal_threshold can be used to override the
default for the knowledgeable user who really wants maximum cache
allocation to a single thread in a multi-threaded system.
The manual entry for the tunable has been expanded to provide
more information about its purpose.
modified: sysdeps/x86/cacheinfo.c
modified: manual/tunables.texi
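The two threshold policies discussed above can be compared with a small
sketch. The function and parameter names are hypothetical and the code
is not taken from sysdeps/x86/cacheinfo.c. It only restates the
arithmetic: 3/4 of one thread's share of the shared cache versus 3/4 of
the entire shared cache.

```c
#include <stddef.h>

/* Threshold based on one thread's share of the shared cache.  */
size_t
threshold_per_thread (size_t shared_cache_size, unsigned int threads)
{
  if (threads == 0)
    threads = 1;   /* Avoid dividing by zero on odd topologies.  */
  return shared_cache_size / threads * 3 / 4;
}

/* Threshold based on the entire shared cache.  */
size_t
threshold_whole_cache (size_t shared_cache_size)
{
  return shared_cache_size * 3 / 4;
}
```

For a 32 MiB cache shared by 16 threads, the first policy gives 1.5 MiB
and the second gives 24 MiB, which shows how much more cache a single
large memcpy may displace under the second policy.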
The behavior of isnan/__builtin_isnan on bit patterns that do not
correspond to something that the CPU would produce from valid inputs
is currently under-defined in the toolchain. (The GCC built-in and
glibc disagree.)
The isnan check in PRINTF_FP_FETCH in stdio-common/printf_fp.c
assumes the GCC behavior that returns true for non-normal numbers
which are not specified as NaN. (The glibc implementation returns
false for such numbers.)
At present, passing non-normal numbers to __mpn_extract_long_double
causes this function to produce irregularly shaped multi-precision
integers, triggering undefined behavior in __printf_fp_l.
With GCC 10 and glibc 2.32, this behavior is not visible because
__builtin_isnan is used, which avoids calling
__mpn_extract_long_double in this case. This commit updates the
implementation of __mpn_extract_long_double so that regularly shaped
multi-precision integers are produced in this case, avoiding
undefined behavior in __printf_fp_l.
commit 04bba1e5d8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Aug 5 13:51:56 2020 -0700
x86: Set CPU usable feature bits conservatively [BZ #26552]
Set CPU usable feature bits only for CPU features which are usable in
user space and whose usability can be detected from user space, excluding
features like FSGSBASE whose enable bit can only be checked in the kernel.
no longer turns on the usable bits of IBT and SHSTK since we don't know
if IBT and SHSTK are usable until much later. Use HAS_CPU_FEATURE to
check if the processor supports IBT and SHSTK.
Add Intel Key Locker:
https://software.intel.com/content/www/us/en/develop/download/intel-key-locker-specification.html
support to <sys/platform/x86.h>. Intel Key Locker has
1. KL: AES Key Locker instructions.
2. WIDE_KL: AES wide Key Locker instructions.
3. AESKLE: AES Key Locker instructions are enabled by OS.
Applications should use
if (CPU_FEATURE_USABLE (KL))
and
if (CPU_FEATURE_USABLE (WIDE_KL))
to check if AES Key Locker instructions and AES wide Key Locker
instructions are usable.
Install <sys/platform/x86.h> so that programmers can do
#if __has_include(<sys/platform/x86.h>)
#include <sys/platform/x86.h>
#endif
...
if (CPU_FEATURE_USABLE (SSE2))
...
if (CPU_FEATURE_USABLE (AVX2))
...
<sys/platform/x86.h> exports only:
enum
{
COMMON_CPUID_INDEX_1 = 0,
COMMON_CPUID_INDEX_7,
COMMON_CPUID_INDEX_80000001,
COMMON_CPUID_INDEX_D_ECX_1,
COMMON_CPUID_INDEX_80000007,
COMMON_CPUID_INDEX_80000008,
COMMON_CPUID_INDEX_7_ECX_1,
/* Keep the following line at the end. */
COMMON_CPUID_INDEX_MAX
};
struct cpuid_features
{
struct cpuid_registers cpuid;
struct cpuid_registers usable;
};
struct cpu_features
{
struct cpu_features_basic basic;
struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
};
/* Get a pointer to the CPU features structure. */
extern const struct cpu_features *__x86_get_cpu_features
(unsigned int max) __attribute__ ((const));
Since all feature checks are done through macros, programs compiled with
a newer <sys/platform/x86.h> are compatible with the older glibc binaries
as long as the layout of struct cpu_features is identical. The features
array can be expanded with backward binary compatibility for both .o and
.so files. When COMMON_CPUID_INDEX_MAX is increased to support new
processor features, __x86_get_cpu_features in the older glibc binaries
returns NULL and HAS_CPU_FEATURE/CPU_FEATURE_USABLE return false on the
new processor feature. No new symbol version is needed.
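The compatibility scheme can be sketched with a toy accessor. Everything
here is hypothetical (names, sample data, and the choice of passing the
leaf index plus one as the max argument). It only illustrates how a
library that knows fewer leaves than the caller can answer NULL, so that
the caller treats the newer feature as absent.

```c
#include <stddef.h>

/* Number of feature leaves known to the (older) library.  */
enum { KNOWN_LEAVES = 7 };

struct features
{
  unsigned int leaf[KNOWN_LEAVES];
};

/* Sample data: only bit 2 of leaf 0 is set.  */
static const struct features library_features = { { 0x4u } };

/* Return the feature table, or NULL if the caller expects more leaves
   than this library provides.  */
const struct features *
get_features (unsigned int max)
{
  if (max > KNOWN_LEAVES)
    return NULL;
  return &library_features;
}

/* Caller-side check: a NULL table means the feature is reported as
   unavailable.  BIT is assumed to be less than 32.  */
int
feature_bit (unsigned int leaf, unsigned int bit)
{
  const struct features *f = get_features (leaf + 1);
  return f != NULL && ((f->leaf[leaf] >> bit) & 1u) != 0;
}
```

A query for leaf 7, which this library does not know, returns 0 without
reading past the end of the array.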
Both CPU_FEATURE_USABLE and HAS_CPU_FEATURE are provided. HAS_CPU_FEATURE
can be used to identify processor features.
Note: Although GCC has __builtin_cpu_supports, it only supports a subset
of <sys/platform/x86.h> and it is equivalent to CPU_FEATURE_USABLE. It
doesn't support HAS_CPU_FEATURE.
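A guarded usage sketch follows. The wrapper avx2_usable is hypothetical.
It returns -1 when <sys/platform/x86.h> is missing or does not define
CPU_FEATURE_USABLE, so the same source builds on other targets and on
C libraries that spell the macro differently.

```c
#if defined __has_include
# if __has_include(<sys/platform/x86.h>)
#  include <sys/platform/x86.h>
# endif
#endif

/* Return 1 if AVX2 is usable, 0 if it is not, and -1 if the header or
   the CPU_FEATURE_USABLE macro is unavailable at build time.  */
int
avx2_usable (void)
{
#ifdef CPU_FEATURE_USABLE
  return CPU_FEATURE_USABLE (AVX2) ? 1 : 0;
#else
  return -1;
#endif
}
```

Callers should treat -1 as "unknown" and fall back to a baseline code
path.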
Set CPU usable feature bits only for CPU features which are usable in
user space and whose usability can be detected from user space, excluding
features like FSGSBASE whose enable bit can only be checked in the kernel.
Intel64 and IA-32 Architectures Software Developer’s Manual has changed
the following CPU feature names:
1. The CPU feature of Enhanced Intel SpeedStep Technology is renamed
from EST to EIST.
2. The CPU feature which supports Platform Quality of Service Monitoring
(PQM) capability is changed to Intel Resource Director Technology
(Intel RDT) Monitoring capability, i.e. PQM is renamed to RDT_M.
3. The CPU feature which supports Platform Quality of Service
Enforcement (PQE) capability is changed to Intel Resource Director
Technology (Intel RDT) Allocation capability, i.e. PQE is renamed to
RDT_A.
Support usable check for all CPU features with the following changes:
1. Change struct cpu_features to
struct cpuid_features
{
struct cpuid_registers cpuid;
struct cpuid_registers usable;
};
struct cpu_features
{
struct cpu_features_basic basic;
struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
unsigned int preferred[PREFERRED_FEATURE_INDEX_MAX];
...
};
so that there is a usable bit for each cpuid bit.
2. After the cpuid bits have been initialized, copy the known bits to the
usable bits. EAX/EBX from INDEX_1 and EAX from INDEX_7 aren't used for
CPU feature detection.
3. Clear the usable bits which require OS support.
4. If the feature is supported by OS, copy its cpuid bit to its usable
bit.
5. Replace HAS_CPU_FEATURE and CPU_FEATURES_CPU_P with CPU_FEATURE_USABLE
and CPU_FEATURE_USABLE_P to check if a feature is usable.
6. Add DEPR_FPU_CS_DS for INDEX_7_EBX_13.
7. Unset MPX feature since it has been deprecated.
The results are
1. If the feature is known and doesn't require OS support, its usable bit
is copied from the cpuid bit.
2. Otherwise, its usable bit is copied from the cpuid bit only if the
feature is known to be supported by the OS.
3. CPU_FEATURE_USABLE/CPU_FEATURE_USABLE_P are used to check if the
feature can be used.
4. HAS_CPU_FEATURE/CPU_FEATURE_CPU_P are used to check if CPU supports
the feature.
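The derivation of the usable bits can be sketched for a single 32-bit
register. The function and mask names are hypothetical. The real code
works on struct cpuid_registers and decides OS support per feature, but
the bit logic follows the steps listed above.

```c
#include <stdint.h>

/* CPUID: bits reported by the processor.
   KNOWN: bits that correspond to known features.
   NEEDS_OS: known bits that also require OS support.
   OS_SUPPORTED: bits whose OS support has been confirmed.  */
uint32_t
compute_usable (uint32_t cpuid, uint32_t known, uint32_t needs_os,
                uint32_t os_supported)
{
  /* Copy the known cpuid bits to the usable bits.  */
  uint32_t usable = cpuid & known;
  /* Clear the usable bits which require OS support.  */
  usable &= ~needs_os;
  /* Copy back the bits whose OS support has been confirmed.  */
  usable |= cpuid & known & needs_os & os_supported;
  return usable;
}
```

With cpuid 0xf, known 0x7, needs_os 0x6 and os_supported 0x2, only bits 0
and 1 end up usable: bit 2 lacks OS support and bit 3 is not a known
feature.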
Since
commit 430388d5dc
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Aug 3 08:04:49 2018 -0700
x86: Don't include <init-arch.h> in assembly codes
removed all usages of <init-arch.h> from assembly codes, we can remove
__ASSEMBLER__ check in init-arch.h.
Since
commit c867597bff
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Jun 8 13:57:50 2016 -0700
X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
removed the only usage of __x86_prefetchw, we can remove the unused
__x86_prefetchw.
Add generic code to handle PT_GNU_PROPERTY notes. Invalid
content is ignored, _dl_process_pt_gnu_property is always called
after PT_LOAD segments are mapped and it has no failure modes.
Currently only one NT_GNU_PROPERTY_TYPE_0 note is handled, which
contains target specific properties: the _dl_process_gnu_property
hook is called for each property.
The old _dl_process_pt_note and _rtld_process_pt_note differ in how
the program header is read. The old _dl_process_pt_note is called
before PT_LOAD segments are mapped and _rtld_process_pt_note is called
after PT_LOAD segments are mapped. The old _rtld_process_pt_note is
removed and _dl_process_pt_note is always called after PT_LOAD
segments are mapped and now it has no failure modes.
The program headers are scanned backwards so that PT_NOTE can be
skipped if PT_GNU_PROPERTY exists.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
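The backward scan can be sketched as follows. The struct, the macro
names and select_headers are hypothetical simplifications (a real
program header has more fields), and the sketch only records which
entries would be handed to the note processing. PT_NOTE entries are
skipped once a PT_GNU_PROPERTY entry has been seen.

```c
#include <stddef.h>
#include <stdint.h>

/* Program header types from the ELF specification and GNU extensions.  */
#define PT_NOTE_TYPE         4u
#define PT_GNU_PROPERTY_TYPE 0x6474e553u

/* Simplified program header: only the type field is modeled.  */
struct phdr
{
  uint32_t type;
};

/* Scan PH[0..N-1] backwards and store in OUT the indices of the entries
   to process.  At most CAP indices are stored; the count is returned.  */
size_t
select_headers (const struct phdr *ph, size_t n, size_t *out, size_t cap)
{
  size_t count = 0;
  int seen_property = 0;
  for (size_t i = n; i-- > 0; )
    {
      if (ph[i].type == PT_GNU_PROPERTY_TYPE)
        seen_property = 1;
      else if (ph[i].type != PT_NOTE_TYPE || seen_property)
        continue;   /* Unrelated header, or PT_NOTE made redundant.  */
      if (count < cap)
        out[count++] = i;
    }
  return count;
}
```

Because PT_GNU_PROPERTY usually follows PT_NOTE in the table, scanning
from the end finds it first and the PT_NOTE entries need not be parsed.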
Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.
Note that the user specified threshold for "rep movsb" smaller than
the minimum threshold will be ignored.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
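The rule for the "rep movsb" threshold can be sketched as below. The
function and parameter names are hypothetical and the sketch does not
reproduce the tunables machinery. It only shows that a requested value
below the minimum leaves the default in place.

```c
#include <stddef.h>

/* Return the threshold to use: the tunable value when it is at least
   MINIMUM, otherwise the built-in default.  */
size_t
effective_rep_movsb_threshold (size_t default_value, size_t minimum,
                               size_t tunable_value)
{
  if (tunable_value < minimum)
    return default_value;   /* Too small: ignore the request.  */
  return tunable_value;
}
```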
Extended feature disable (XFD) is an extension to the XSAVE feature set,
added for Intel AMX, that allows an operating system to enable a feature
while preventing specific user threads from using the feature.
Intel Advanced Matrix Extensions (Intel AMX) is a new programming
paradigm consisting of two components: a set of 2-dimensional registers
(tiles) representing sub-arrays from a larger 2-dimensional memory image,
and accelerators able to operate on tiles. Intel AMX is an extensible
architecture. New accelerators can be added and the existing accelerator
may be enhanced to provide higher performance. The initial features are
AMX-BF16, AMX-TILE and AMX-INT8, which are usable only if the operating
system supports both XTILECFG state and XTILEDATA state.
Add AMX-BF16, AMX-TILE and AMX-INT8 support to HAS_CPU_FEATURE and
CPU_FEATURE_USABLE.
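The usability condition for AMX can be sketched as a test on the XCR0
value. The helper name and macro names are hypothetical, and the bit
positions (17 for XTILECFG, 18 for XTILEDATA) are assumptions taken from
the XSAVE state component numbering. Reading CPUID and XCR0 is left to
the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed XCR0 state component bits for the AMX tile state.  */
#define XSTATE_XTILECFG  (UINT64_C (1) << 17)
#define XSTATE_XTILEDATA (UINT64_C (1) << 18)

/* AMX is usable only if CPUID reports it and the operating system has
   enabled both tile state components.  */
bool
amx_usable (bool cpuid_reports_amx, uint64_t xcr0)
{
  uint64_t required = XSTATE_XTILECFG | XSTATE_XTILEDATA;
  return cpuid_reports_amx && (xcr0 & required) == required;
}
```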
1. Divide architecture features into the usable features and the preferred
features. The usable features are for correctness and can be exported in
a stable ABI. The preferred features are for performance and only for
glibc internal use.
2. Change struct cpu_features to
struct cpu_features
{
struct cpu_features_basic basic;
unsigned int *usable_p;
struct cpuid_registers cpuid[COMMON_CPUID_INDEX_MAX];
unsigned int usable[USABLE_FEATURE_INDEX_MAX];
unsigned int preferred[PREFERRED_FEATURE_INDEX_MAX];
...
};
and initialize usable_p to point to the usable array so that
struct cpu_features
{
struct cpu_features_basic basic;
unsigned int *usable_p;
struct cpuid_registers cpuid[COMMON_CPUID_INDEX_MAX];
};
can be exported via a stable ABI. The cpuid and usable arrays can be
expanded with backward binary compatibility for both .o and .so files.
3. Add COMMON_CPUID_INDEX_7_ECX_1 for AVX512_BF16.
4. Detect ENQCMD, PKS, AVX512_VP2INTERSECT, MD_CLEAR, SERIALIZE, HYBRID,
TSXLDTRK, L1D_FLUSH, CORE_CAPABILITIES and AVX512_BF16.
5. Rename CAPABILITIES to ARCH_CAPABILITIES.
6. Check if AVX512_VP2INTERSECT, AVX512_BF16 and PKU are usable.
7. Update CPU feature detection test.
When CET is enabled, it is an error to dlopen a non CET enabled shared
library in CET enabled application. It may be desirable to make CET
permissive, that is disable CET when dlopening a non CET enabled shared
library. With the new --enable-cet=permissive configure option, CET is
disabled when dlopening a non CET enabled shared library.
Add DEFAULT_DL_X86_CET_CONTROL to config.h.in:
/* The default value of x86 CET control. */
#define DEFAULT_DL_X86_CET_CONTROL cet_elf_property
which enables CET features based on ELF property note.
--enable-cet=permissive changes it to
/* The default value of x86 CET control. */
#define DEFAULT_DL_X86_CET_CONTROL cet_permissive
which enables CET features permissively.
Update tst-cet-legacy-5a, tst-cet-legacy-5b, tst-cet-legacy-6a and
tst-cet-legacy-6b to check --enable-cet and --enable-cet=permissive.
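The difference between the two defaults can be sketched as a decision
function. The enum dlopen_action and on_dlopen are hypothetical, and only
the two control values named above are modeled.

```c
/* The two CET control values named in this change.  */
enum cet_control { cet_elf_property, cet_permissive };

/* Hypothetical outcomes of dlopening a shared library.  */
enum dlopen_action { DLOPEN_OK, DLOPEN_ERROR, DLOPEN_DISABLE_CET };

/* Decide what happens when a library is dlopened.  CET_ACTIVE is nonzero
   if CET is enabled in the application; LIBRARY_SUPPORTS_CET is nonzero
   if the library is CET enabled.  */
enum dlopen_action
on_dlopen (enum cet_control control, int cet_active,
           int library_supports_cet)
{
  if (!cet_active || library_supports_cet)
    return DLOPEN_OK;
  return control == cet_permissive ? DLOPEN_DISABLE_CET : DLOPEN_ERROR;
}
```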
1. Include <dl-procruntime.c> to get architecture specific initializer in
rtld_global.
2. Change _dl_x86_feature_1[2] to _dl_x86_feature_1.
3. Add _dl_x86_feature_control after _dl_x86_feature_1, which is a
struct of 2 bitfields for IBT and SHSTK control.
This fixes [BZ #25887].
This consolidates the copy-pasted arch specific semaphore header into
single version (based on s390) which suffices for 32-bit and 64-bit
arch/ABI based on the canonical WORDSIZE.
For now I've left out arches which use alternate defines to choose for
32 vs 64-bit builds (aarch64, mips) which in theory can also use the same
header.
Passes build-many for
aarch64-linux-gnu arm-linux-gnueabi arm-linux-gnueabihf
riscv64-linux-gnu-rv64imac-lp64 riscv64-linux-gnu-rv64imafdc-lp64
x86_64-linux-gnu microblaze-linux-gnu nios2-linux-gnu
Suggested-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Commit a98dc92dd1 ("x86: Add cache information support for Zhaoxin
processors") introduced an unused variable warning in the default
i686-linux-gnu build:
In file included from ../sysdeps/i386/cacheinfo.c:3:
../sysdeps/x86/cacheinfo.c: In function 'init_cacheinfo':
../sysdeps/x86/cacheinfo.c:762:16: error: unused variable 'eax' [-Werror=unused-variable]
762 | unsigned int eax;
| ^~~
To obtain Zhaoxin CPU cache information, add a new function
handle_zhaoxin().
Add a new function get_common_cache_info() that extracts the code
in init_cacheinfo() to get the values of the variables shared and threads.
Add Zhaoxin branch in init_cacheinfo() for initializing variables,
such as __x86_shared_cache_size.