The original commit enabling non-temporal memset on Skylake Server had
erroneous benchmarks (actually done on ICX).
Further benchmarks indicate non-temporal stores may in fact by a
regression on Skylake Server.
This commit may be over-cautious in some cases, but should avoid any
regressions for 2.40.
Tested using qemu on all x86_64 cpu arch supported by both qemu +
GLIBC.
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Current 'non_temporal_threshold' set to 'non_temporal_threshold_lowbound'
on Zhaoxin processors without ERMS. The default
'non_temporal_threshold_lowbound' is too small for the KH-40000 and KX-7000
Zhaoxin processors, this patch updates the value to
'shared / cachesize_non_temporal_divisor'.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Fix code formatting under the Zhaoxin branch and add comments for
different Zhaoxin models.
Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
the AVX_Fast_Unaligned_Load.
Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
use sse2_unaligned version of memset,strcpy and strcat.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Remove the definitions of HWCAP_IMPORTANT after removal of
LD_HWCAP_MASK / tunable glibc.cpu.hwcap_mask. There HWCAP_IMPORTANT
was used as default value.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the definitions of _DL_PLATFORMS_COUNT as those are not used
anymore after removal in elf/dl-cache.c:search_cache().
Note: On x86, we can also get rid of the definitions
HWCAP_PLATFORMS_START and HWCAP_PLATFORMS_COUNT.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the definitions of _DL_FIRST_PLATFORM as those were only used
in the _DL_HWCAP_PLATFORM definitions and in _dl_string_platform().
Both were removed.
Note: Removed on every architecture despite of powerpc, where
_dl_string_platform() is still used.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the definitions of _DL_HWCAP_PLATFORM as those are not used
anymore after removal in elf/dl-cache.c:search_cache().
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the platform strings in dl-procinfo.c where also
the implementation of _dl_string_platform() was removed.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Despite of powerpc where the returned integer is stored in tcb,
and the diagnostics output, there is no user anymore.
Thus this patch removes the diagnostics output and
_dl_string_platform for all other platforms.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Both defines are not used anymore. Those were only used for
_dl_string_hwcap(), which itself was removed with commit
ab40f20364
"elf: Remove _dl_string_hwcap"
Just clean up.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
As discussed at the patch review meeting
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Reviewed-by: Simon Chopin <simon.chopin@canonical.com>
When we don't want to use non-temporal stores for memset, we set
`x86_memset_non_temporal_threshold` to SIZE_MAX.
The current code, however, we using `maximum_non_temporal_threshold`
as the upper bound which is `SIZE_MAX >> 4` so we ended up with a
value of `0`.
Fix is to just use `SIZE_MAX` as the upper bound for when setting the
tunable.
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Properly set libc_cv_have_x86_isa_level in shell for MINIMUM_X86_ISA_LEVEL
defined as
(__X86_ISA_V1 + __X86_ISA_V2 + __X86_ISA_V3 + __X86_ISA_V4)
Also set __X86_ISA_V2 to 1 for i386 if __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8
is defined. There are no changes in config.h nor in config.make on x86-64.
On i386, -march=x86-64-v2 with GCC generates
#define MINIMUM_X86_ISA_LEVEL 2
in config.h and
have-x86-isa-level = 2
in config.make. This fixes BZ #31883.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
On i386, set the default minimum ISA level to 0, not 1 (baseline which
includes SSE2). There are no changes in config.h nor in config.make on
x86-64. This fixes BZ #31867.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Tested-by: Ian Jordan <immoloism@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
In commit 46b5e98ef6 ("x86: Add seperate non-temporal tunable for
memset") a tunable threshold for enabling non-temporal memset was added,
but only for Intel hardware.
Since that commit, new benchmark results suggest that non-temporal
memset is beneficial on AMD, as well, so allow this tunable to be set
for AMD.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
which has been updated to include data using different stategies for
large memset on AMD Zen2, Zen3, and Zen4.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
The tuning for non-temporal stores for memset vs memcpy is not always
the same. This includes both the exact value and whether non-temporal
stores are profitable at all for a given arch.
This patch add `x86_memset_non_temporal_threshold`. Currently we
disable non-temporal stores for non Intel vendors as the only
benchmarks showing its benefit have been on Intel hardware.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
At this point, this is mainly a tool for testing the early ld.so
CPU compatibility diagnostics: GCC uses the new instructions in most
functions, so it's easy to spot if some of the early code is not
built correctly.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Define MINIMUM_X86_ISA_LEVEL at configure time to avoid
/usr/bin/ld: …/build/elf/librtld.os: in function `init_cpu_features':
…/git/elf/../sysdeps/x86/cpu-features.c:1202: undefined reference to `_dl_runtime_resolve_fxsave'
/usr/bin/ld: …/build/elf/librtld.os: relocation R_X86_64_PC32 against undefined hidden symbol `_dl_runtime_resolve_fxsave' can not be used when making a shared object
/usr/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status
when glibc is built with -march=x86-64-v3 and configured with
--with-rtld-early-cflags=-march=x86-64, which is used to allow ld.so to
print an error message on unsupported CPUs:
Fatal glibc error: CPU does not support x86-64-v3
This fixes BZ #31676.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
These structs describe file formats under /var/log, and should not
depend on the definition of _TIME_BITS. This is achieved by
defining __WORDSIZE_TIME64_COMPAT32 to 1 on 32-bit ports that
support 32-bit time_t values (where __time_t is 32 bits).
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The default <utmp-size.h> is for ports with a 64-bit time_t.
Ports with a 32-bit time_t or with __WORDSIZE_TIME64_COMPAT32=1
need to override it.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This is surprisingly difficult to implement if the goal is to produce
reasonably sized output. With the current approaches to output
compression (suppressing zeros and repeated results between CPUs,
folding ranges of identical subleaves, dealing with the %ecx
reflection issue), the output is less than 600 KiB even for systems
with 256 logical CPUs.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The implementations of trunc functions using x87 floating point (i386 and
x86_64 long double only) traps when FE_INEXACT is enabled. Although
this is a GNU extension outside the scope of the C standard, other
architectures that also support traps do not show this behavior.
The fix moves the implementation to a common one that holds any
exceptions with a 'fnclex' (libc_feholdexcept_setround_387).
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The implementations of floor functions using x87 floating point (i386 and
86_64 long double only) traps when FE_INEXACT is enabled. Although
this is a GNU extension outside the scope of the C standard, other
architectures that also support traps do not show this behavior.
The fix moves the implementation to a common one that holds any
exceptions with a 'fnclex' (libc_feholdexcept_setround_387).
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The implementations of ceil functions using x87 floating point (i386 and
x86_64 long double only) traps when FE_INEXACT is enabled. Although
this is a GNU extension outside the scope of the C standard, other
architectures that also support traps do not show this behavior.
The fix moves the implementation to a common one that holds any
exceptions with a 'fnclex' (libc_feholdexcept_setround_387).
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
_dl_tlsdesc_dynamic preserves RDI, RSI and RBX before realigning stack.
After realigning stack, it saves RCX, RDX, R8, R9, R10 and R11. Define
TLSDESC_CALL_REGISTER_SAVE_AREA to allocate space for RDI, RSI and RBX
to avoid clobbering saved RDI, RSI and RBX values on stack by xsave to
STATE_SAVE_OFFSET(%rsp).
+==================+<- stack frame start aligned at 8 or 16 bytes
| |<- RDI saved in the red zone
| |<- RSI saved in the red zone
| |<- RBX saved in the red zone
| |<- paddings for stack realignment of 64 bytes
|------------------|<- xsave buffer end aligned at 64 bytes
| |<-
| |<-
| |<-
|------------------|<- xsave buffer start at STATE_SAVE_OFFSET(%rsp)
| |<- 8-byte padding for 64-byte alignment
| |<- 8-byte padding for 64-byte alignment
| |<- R11
| |<- R10
| |<- R9
| |<- R8
| |<- RDX
| |<- RCX
+==================+<- RSP aligned at 64 bytes
Define TLSDESC_CALL_REGISTER_SAVE_AREA, the total register save area size
for all integer registers by adding 24 to STATE_SAVE_OFFSET since RDI, RSI
and RBX are saved onto stack without adjusting stack pointer first, using
the red-zone. This fixes BZ #31501.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
Replace minimum ISA check ifdef conditional with if. Since
MINIMUM_X86_ISA_LEVEL and AVX_X86_ISA_LEVEL are compile time constants,
compiler will perform constant folding optimization, getting same
results.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
_dl_tlsdesc_dynamic should also preserve AMX registers which are
caller-saved. Add X86_XSTATE_TILECFG_ID and X86_XSTATE_TILEDATA_ID
to x86-64 TLSDESC_CALL_STATE_SAVE_MASK. Compute the AMX state size
and save it in xsave_state_full_size which is only used by
_dl_tlsdesc_dynamic_xsave and _dl_tlsdesc_dynamic_xsavec. This fixes
the AMX part of BZ #31372. Tested on AMX processor.
AMX test is enabled only for compilers with the fix for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114098
GCC 14 and GCC 11/12/13 branches have the bug fix.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
When glibc is built with ISA level 3 or above enabled, SSE resolvers
aren't available and glibc fails to build:
ld: .../elf/librtld.os: in function `init_cpu_features':
.../elf/../sysdeps/x86/cpu-features.c:1200:(.text+0x1445f): undefined reference to `_dl_runtime_resolve_fxsave'
ld: .../elf/librtld.os: relocation R_X86_64_PC32 against undefined hidden symbol `_dl_runtime_resolve_fxsave' can not be used when making a shared object
/usr/local/bin/ld: final link failed: bad value
For ISA level 3 or above, don't use _dl_runtime_resolve_fxsave nor
_dl_tlsdesc_dynamic_fxsave.
This fixes BZ #31429.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Compiler generates the following instruction sequence for GNU2 dynamic
TLS access:
leaq tls_var@TLSDESC(%rip), %rax
call *tls_var@TLSCALL(%rax)
or
leal tls_var@TLSDESC(%ebx), %eax
call *tls_var@TLSCALL(%eax)
CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS and RAX/EAX, are unchanged after CALL. When
_dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow
path. __tls_get_addr is a normal function which doesn't preserve any
caller-saved registers. _dl_tlsdesc_dynamic saved and restored integer
caller-saved registers, but didn't preserve any other caller-saved
registers. Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE,
XSAVE and XSAVEC to save and restore all caller-saved registers. This
fixes BZ #31372.
Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic)
to optimize elf_machine_runtime_setup.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
When glibc is built with ISA level 3 or higher by default, the resulting
glibc binaries won't run on SSE or FMA4 processors. Exclude SSE, AVX and
FMA4 variants in libm multiarch when ISA level 3 or higher is enabled by
default.
When glibc is built with ISA level 2 enabled by default, only keep SSE4.1
variant.
Fixes BZ 31335.
NB: elf/tst-valgrind-smoke test fails with ISA level 4, because valgrind
doesn't support AVX512 instructions:
https://bugs.kde.org/show_bug.cgi?id=383010
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Add APX registers to STATE_SAVE_MASK so that APX registers are saved in
ld.so trampoline. This fixes BZ #31371.
Also update STATE_SAVE_OFFSET and STATE_SAVE_MASK for i386 which will
be used by i386 _dl_tlsdesc_dynamic.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
For AMD Zen3+ architecture, the performance of the vectorized loop is
slightly better than ERMS.
Checked on x86_64-linux-gnu on Zen3.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The REP MOVSB usage on memcpy/memmove does not show much performance
improvement on Zen3/Zen4 cores compared to the vectorized loops. Also,
as from BZ 30994, if the source is aligned and the destination is not
the performance can be 20x slower.
The performance difference is noticeable with small buffer sizes, closer
to the lower bounds limits when memcpy/memmove starts to use ERMS. The
performance of REP MOVSB is similar to vectorized instruction on the
size limit (the L2 cache). Also, there is no drawback to multiple cores
sharing the cache.
Checked on x86_64-linux-gnu on Zen3.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
WG14 decided to use the name C23 as the informal name of the next
revision of the C standard (notwithstanding the publication date in
2024). Update references to C2X in glibc to use the C23 name.
This is intended to update everything *except* where it involves
renaming files (the changes involving renaming tests are intended to
be done separately). In the case of the _ISOC2X_SOURCE feature test
macro - the only user-visible interface involved - support for that
macro is kept for backwards compatibility, while adding
_ISOC23_SOURCE.
Tested for x86_64.
Systemd execution environment configuration may prohibit changing a memory
mapping to become executable:
MemoryDenyWriteExecute=
Takes a boolean argument. If set, attempts to create memory mappings
that are writable and executable at the same time, or to change existing
memory mappings to become executable, or mapping shared memory segments
as executable, are prohibited.
When it is set, systemd service stops working if PLT rewrite is enabled.
Check if mprotect works before rewriting PLT. This fixes BZ #31230.
This also works with SELinux when deny_execmem is on.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
CET feature bits in TCB, which are Linux specific, are used to check if
CET features are active. Move CET feature check to Linux/x86 directory.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
1. Remove _dl_runtime_resolve_shstk and _dl_runtime_profile_shstk.
2. Move CET offsets from x86 cpu-features-offsets.sym to x86-64
features-offsets.sym.
3. Rename x86 cet-control.h to x86-64 feature-control.h since it is only
for x86-64 and also used for PLT rewrite.
4. Add x86-64 ldsodefs.h to include feature-control.h.
5. Change TUNABLE_CALLBACK (set_plt_rewrite) to x86-64 only.
6. Move x86 dl-procruntime.c to x86-64.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Move sysdeps/x86/libc-start.h to sysdeps/x86_64/libc-start.h and use
sysdeps/generic/libc-start.h for i386.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
CET is only support for x86_64, this patch reverts:
- faaee1f07e x86: Support shadow stack pointer in setjmp/longjmp.
- be9ccd27c0 i386: Add _CET_ENDBR to indirect jump targets in
add_n.S/sub_n.S
- c02695d776 x86/CET: Update vfork to prevent child return
- 5d844e1b72 i386: Enable CET support in ucontext functions
- 124bcde683 x86: Add _CET_ENDBR to functions in crti.S
- 562837c002 x86: Add _CET_ENDBR to functions in dl-tlsdesc.S
- f753fa7dea x86: Support IBT and SHSTK in Intel CET [BZ #21598]
- 825b58f3fb i386-mcount.S: Add _CET_ENDBR to _mcount and __fentry__
- 7e119cd582 i386: Use _CET_NOTRACK in i686/memcmp.S
- 177824e232 i386: Use _CET_NOTRACK in memcmp-sse4.S
- 0a899af097 i386: Use _CET_NOTRACK in memcpy-ssse3-rep.S
- 7fb613361c i386: Use _CET_NOTRACK in memcpy-ssse3.S
- 77a8ae0948 i386: Use _CET_NOTRACK in memset-sse2-rep.S
- 00e7b76a8f i386: Use _CET_NOTRACK in memset-sse2.S
- 90d15dc577 i386: Use _CET_NOTRACK in strcat-sse2.S
- f1574581c7 i386: Use _CET_NOTRACK in strcpy-sse2.S
- 4031d7484a i386/sub_n.S: Add a missing _CET_ENDBR to indirect jump
- target
-
Checked on i686-linux-gnu.
The CET is only supported for x86_64 and there is no plan to add
kernel support for i386. Move the Makefile rules and files from the
generic x86 folder to x86_64 one.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Add ELF_DYNAMIC_AFTER_RELOC to allow target specific processing after
relocation.
For x86-64, add
#define DT_X86_64_PLT (DT_LOPROC + 0)
#define DT_X86_64_PLTSZ (DT_LOPROC + 1)
#define DT_X86_64_PLTENT (DT_LOPROC + 3)
1. DT_X86_64_PLT: The address of the procedure linkage table.
2. DT_X86_64_PLTSZ: The total size, in bytes, of the procedure linkage
table.
3. DT_X86_64_PLTENT: The size, in bytes, of a procedure linkage table
entry.
With the r_addend field of the R_X86_64_JUMP_SLOT relocation set to the
memory offset of the indirect branch instruction.
Define ELF_DYNAMIC_AFTER_RELOC for x86-64 to rewrite the PLT section
with direct branch after relocation when the lazy binding is disabled.
PLT rewrite is disabled by default since SELinux may disallow modifying
code pages and ld.so can't detect it in all cases. Use
$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=1
to enable PLT rewrite with 32-bit direct jump at run-time or
$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=2
to enable PLT rewrite with 32-bit direct jump and on APX processors with
64-bit absolute jump at run-time.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
setcontext and swapcontext put a restore token on the old shadow stack
which is used to restore the target shadow stack when switching user
contexts. When longjmp from a user context, the target shadow stack
can be different from the current shadow stack and INCSSP can't be
used to restore the shadow stack pointer to the target shadow stack.
Update longjmp to search for a restore token. If found, use the token
to restore the shadow stack pointer before using INCSSP to pop the
shadow stack. Stop the token search and use INCSSP if the shadow stack
entry value is the same as the current shadow stack pointer.
It is a user error if there is a shadow stack switch without leaving a
restore token on the old shadow stack.
The only difference between __longjmp.S and __longjmp_chk.S is that
__longjmp_chk.S has a check for invalid longjmp usages. Merge
__longjmp.S and __longjmp_chk.S by adding the CHECK_INVALID_LONGJMP
macro.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>