When shadow stack is enabled, some CET tests failed when compiled with
GCC 14:
FAIL: elf/tst-cet-legacy-4
FAIL: elf/tst-cet-legacy-5a
FAIL: elf/tst-cet-legacy-6a
which are caused by
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113039
These tests use -fcf-protection -fcf-protection=branch and assume that
-fcf-protection=branch will override -fcf-protection. But this GCC 14
commit:
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=1c6231c05bdcca
changed the -fcf-protection behavior such that
-fcf-protection -fcf-protection=branch
is treated the same as
-fcf-protection
Use
-fcf-protection -fcf-protection=none -fcf-protection=branch
as the workaround. This fixes BZ #31187.
Tested with GCC 13 and GCC 14 on Intel Tiger Lake.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Not all CET enabled applications and libraries have been properly tested
in CET enabled environments. Some CET enabled applications or libraries
will crash or misbehave when CET is enabled. Don't set CET active by
default so that all applications and libraries will run normally regardless
of whether CET is active or not. Shadow stack can be enabled by
$ export GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK
at run-time if shadow stack can be enabled by kernel.
NB: This commit can be reverted if it is OK to enable CET by default for
all applications and libraries.
Initially, IBT and SHSTK are marked as active when the CPU supports them
and CET is enabled in glibc. They can be disabled early by tunables
before relocation. Since after relocation, GLRO(dl_x86_cpu_features)
becomes read-only, we can't update GLRO(dl_x86_cpu_features) to mark
IBT and SHSTK as inactive. Instead, check the feature_1 field in TCB
to decide if IBT and SHSTK are active.
Previously, CET was enabled by kernel before passing control to user
space and the startup code must disable CET if applications or shared
libraries aren't CET enabled. Since the current kernel only supports
shadow stack and won't enable shadow stack before passing control to
user space, we need to enable shadow stack during startup if the
application and all shared libraries are shadow stack enabled. There
is no need to disable shadow stack at startup. Shadow stack can only
be enabled in a function which will never return. Otherwise, shadow
stack will underflow at the function return.
1. GL(dl_x86_feature_1) is set to the CET features which are supported
by the processor and are not disabled by the tunable. Only non-zero
features in GL(dl_x86_feature_1) should be enabled. After enabling
shadow stack with ARCH_SHSTK_ENABLE, ARCH_SHSTK_STATUS is used to check
if shadow stack is really enabled.
2. Use ARCH_SHSTK_ENABLE in RTLD_START in dynamic executable. It is
safe since RTLD_START never returns.
3. Call arch_prctl (ARCH_SHSTK_ENABLE) from ARCH_SETUP_TLS in static
executable. Since the start function using ARCH_SETUP_TLS never returns,
it is safe to enable shadow stack in ARCH_SETUP_TLS.
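The selection rule in item 1 can be sketched in plain C. This is only an
illustration of the mask arithmetic, not the glibc code: the DEMO_ names
and both functions are hypothetical, and the per-object marker array
stands in for the feature bits recorded for the application and each
shared library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the bit positions mirror the x86 feature-1
   property bits for IBT and SHSTK.  */
#define DEMO_FEATURE_1_IBT   (1U << 0)
#define DEMO_FEATURE_1_SHSTK (1U << 1)

/* Features that may be enabled at all: supported by the processor and
   not disabled by the tunable.  */
unsigned int
demo_usable_features (unsigned int cpu_supported,
                      unsigned int tunable_disabled)
{
  return cpu_supported & ~tunable_disabled;
}

/* Features to enable during startup: usable, and marked as enabled in
   the application and in every shared library.  */
unsigned int
demo_startup_features (unsigned int usable,
                       const unsigned int *object_marks, size_t nobjects)
{
  assert (object_marks != NULL || nobjects == 0);
  unsigned int enabled = usable;
  for (size_t i = 0; i < nobjects; i++)
    enabled &= object_marks[i];
  return enabled;
}
```

Only the bits left non-zero by this computation would then be requested
from the kernel, from a function that never returns.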
Sync with Linux kernel 6.6 shadow stack interface. Since only x86-64 is
supported, i386 shadow stack codes are unchanged and CET shouldn't be
enabled for i386.
1. When the shadow stack base in TCB is unset, the default shadow stack
is in use. Use the current shadow stack pointer as the marker for the
default shadow stack. It is used to identify if the current shadow stack
is the same as the target shadow stack when switching ucontexts. If yes,
INCSSP will be used to unwind shadow stack. Otherwise, shadow stack
restore token will be used.
2. Allocate shadow stack with the map_shadow_stack syscall. Since there
is no function to explicitly release ucontext, there is no place to
release shadow stack allocated by map_shadow_stack in ucontext functions.
Such shadow stacks will be leaked.
3. Rename arch_prctl CET commands to ARCH_SHSTK_XXX.
4. Rewrite the CET control functions with the current kernel shadow stack
interface.
Since CET is no longer enabled by kernel, a separate patch will enable
shadow stack during startup.
Code is mostly inspired by the LoongArch one, which has a similar ABI,
with minor changes to support riscv32 and register differences.
This fixes elf/tst-sprof-basic. This also fixes elf/tst-audit1,
elf/tst-audit2 and elf/tst-audit8 with recent binutils snapshots when
--enable-bind-now is used.
Resolves: BZ #31151
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
_dl_tlsdesc_undefweak and _dl_tlsdesc_dynamic access the thread pointer
via the tcb field in TCB:
_dl_tlsdesc_undefweak:
_CET_ENDBR
movq 8(%rax), %rax
subq %fs:0, %rax
ret
_dl_tlsdesc_dynamic:
...
subq %fs:0, %rax
movq -8(%rsp), %rdi
ret
Since the tcb field in TCB is a pointer, %fs:0 is a 32-bit location on
x32, not 64-bit. It should use "sub %fs:0, %RAX_LP" instead. Since
_dl_tlsdesc_undefweak returns ptrdiff_t and _dl_make_tlsdesc_dynamic
returns void *, RAX_LP is appropriate here for x32 and x86-64. This
fixes BZ #31185.
On x32, I got
FAIL: elf/tst-tlsgap
$ gdb elf/tst-tlsgap
...
open tst-tlsgap-mod1.so
Thread 2 "tst-tlsgap" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 2268754]
_dl_tlsdesc_dynamic () at ../sysdeps/x86_64/dl-tlsdesc.S:108
108 movq (%rsi), %rax
(gdb) p/x $rsi
$4 = 0xf7dbf9005655fb18
(gdb)
This is caused by
_dl_tlsdesc_dynamic:
_CET_ENDBR
/* Preserve call-clobbered registers that we modify.
We need two scratch regs anyway. */
movq %rsi, -16(%rsp)
movq %fs:DTV_OFFSET, %rsi
Since the dtv field in TCB is a pointer, %fs:DTV_OFFSET is a 32-bit
location on x32, not 64-bit. Load the dtv field to RSI_LP instead of rsi.
This fixes BZ #31184.
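The effect of the too-wide load can be reproduced in portable C. This is
a minimal sketch under stated assumptions: the struct below is a
hypothetical three-field layout with 4-byte pointer fields, the field
values are chosen to match the gdb output above, and a little-endian
machine is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical x32-style TCB head: every pointer field is 4 bytes.  */
struct demo_tcb_x32
{
  uint32_t tcb;
  uint32_t dtv;
  uint32_t self;
};

/* Load WIDTH bytes (4 or 8) starting at the dtv field.  A width of 8
   models the 64-bit movq, a width of 4 models the pointer-sized load.  */
uint64_t
demo_load_dtv (const struct demo_tcb_x32 *t, size_t width)
{
  assert (width == 4 || width == 8);
  uint64_t value = 0;
  memcpy (&value,
          (const char *) t + offsetof (struct demo_tcb_x32, dtv),
          width);
  return value;
}
```

With the 8-byte width the neighbouring field ends up in the upper half
of the result, which is the kind of bogus pointer seen in $rsi above.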
In permissive mode, don't disable IBT nor SHSTK when dlopening a legacy
shared library if not single threaded since IBT and SHSTK may be still
enabled in other threads. Other threads with IBT or SHSTK enabled will
crash when calling functions in the legacy shared library. Instead, an
error will be issued.
Improve readability and make maintenance easier for dl-feature.c by
modularizing sysdeps/x86/dl-cet.c:
1. Support processors with:
a. Only IBT. Or
b. Only SHSTK. Or
c. Both IBT and SHSTK.
2. Lock CET features only if IBT or SHSTK are enabled and are not
enabled permissively.
Added annotations for autovec by GCC and GFortran - this enables GCC
>= 9 to autovectorise math calls at -Ofast.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Compilers may emit calls to 'half-width' routines (two-lane
single-precision variants). These have been added in the form of
wrappers around the full-width versions, where the low half of the
vector is simply duplicated. This will perform poorly when one lane
triggers the special-case handler, as there will be a redundant call
to the scalar version; however, this is expected to be rare at -Ofast.
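The wrapper shape described above can be sketched with plain arrays
instead of vector types. Everything here is hypothetical:
demo_full_width is a stand-in for a four-lane routine (it just squares
its input) and the DEMO_ names do not exist in glibc.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FULL_LANES 4
#define DEMO_HALF_LANES 2

/* Stand-in for a full-width (four-lane) vector routine.  */
void
demo_full_width (const float in[DEMO_FULL_LANES],
                 float out[DEMO_FULL_LANES])
{
  for (int i = 0; i < DEMO_FULL_LANES; i++)
    out[i] = in[i] * in[i];
}

/* Half-width wrapper: duplicate the two input lanes into the high
   half, call the full-width routine, keep only the low half.  */
void
demo_half_width (const float in[DEMO_HALF_LANES],
                 float out[DEMO_HALF_LANES])
{
  assert (in != NULL && out != NULL);
  float wide_in[DEMO_FULL_LANES];
  float wide_out[DEMO_FULL_LANES];
  for (int i = 0; i < DEMO_FULL_LANES; i++)
    wide_in[i] = in[i % DEMO_HALF_LANES];
  demo_full_width (wide_in, wide_out);
  for (int i = 0; i < DEMO_HALF_LANES; i++)
    out[i] = wide_out[i];
}
```

Because each input value appears in two lanes, a lane that needs the
special-case path needs it twice, which is the redundant scalar call
mentioned above.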
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
The expression
(excepts & FE_ALL_EXCEPT) << 27
produces a signed integer overflow when 'excepts' is specified as
FE_INVALID (= 0x10), because
- excepts is of type 'int',
- FE_ALL_EXCEPT is of type 'int',
- thus (excepts & FE_ALL_EXCEPT) is (int) 0x10,
- 'int' is 32 bits wide, so 0x10 << 27 = 0x80000000 exceeds INT_MAX.
The patched code produces the same instruction sequence as
previously.
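One common way to make such a shift well defined is to convert the
masked value to an unsigned type before shifting. The sketch below
assumes that approach; it is not taken from the patch, and the DEMO_
constants are hypothetical stand-ins for the <fenv.h> values quoted
above.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-ins for FE_INVALID and FE_ALL_EXCEPT.  */
#define DEMO_FE_INVALID    0x10
#define DEMO_FE_ALL_EXCEPT 0x1f

/* The reasoning above relies on a 32-bit 'int'.  */
static_assert (sizeof (int) * CHAR_BIT == 32, "int must be 32 bits wide");

/* Shift the masked flags in unsigned arithmetic, so that setting
   bit 31 is well defined instead of a signed overflow.  */
uint32_t
demo_shift_excepts (int excepts)
{
  return (uint32_t) (excepts & DEMO_FE_ALL_EXCEPT) << 27;
}
```

The bit pattern is the same as in the signed version, which is
consistent with the generated instructions not changing.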
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
It clears some exception flags that are outside the EXCEPTS argument.
It fixes math/test-fexcept on qemu-user.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
libc_feupdateenv_riscv should check for FE_DFL_ENV, similar to
libc_fesetenv_riscv.
Also extend test-fenv.c to test feupdateenv.
Checked on riscv under qemu-system.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
According to ISO C23 (7.6.4.4), fesetexcept is supposed to set
floating-point exception flags without raising a trap (unlike
feraiseexcept, which is supposed to raise a trap if feenableexcept
was called with the appropriate argument).
The flags can be set in the 387 unit or in the SSE unit. When we need
to clear a flag, we need to do so in both units, due to the way
fetestexcept is implemented.
When we need to set a flag, it is sufficient to do it in the SSE unit,
because that is guaranteed to not trap. However, on i386 CPUs that have
only a 387 unit, set the flags in the 387, as long as this cannot trap.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
According to ISO C23 (7.6.4.4), fesetexcept is supposed to set
floating-point exception flags without raising a trap (unlike
feraiseexcept, which is supposed to raise a trap if feenableexcept
was called with the appropriate argument).
The flags can be set in the 387 unit or in the SSE unit. To set
a flag, it is sufficient to do it in the SSE unit, because that is
guaranteed to not trap. However, on i386 CPUs that have only a
387 unit, set the flags in the 387, as long as this cannot trap.
Checked on i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
According to ISO C23 (7.6.4.4), fesetexcept is supposed to set
floating-point exception flags without raising a trap (unlike
feraiseexcept, which is supposed to raise a trap if feenableexcept was
called with the appropriate argument).
This is a side-effect of how we implement the GNU extension
feenableexcept, where feenableexcept/fesetenv/fesetmode/feupdateenv
might issue prctl (PR_SET_FPEXC, PR_FP_EXC_PRECISE) depending on the
argument. And on PR_FP_EXC_PRECISE, setting a floating-point exception
flag triggers a trap.
To make both functions follow C23, fesetexcept and
fesetexceptflag now fail if the argument may trigger a trap.
The math tests now check for a value different from 0, instead
of bailing out as unsupported for EXCEPTION_SET_FORCES_TRAP.
Checked on powerpc64le-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The tunable parsing duplicates the tunable environment variable so that
it can null-terminate each tunable, which simplifies the later parsing.
This has the drawback of adding another point of failure
(__minimal_malloc failing), and the memory copy requires tuning the
compiler to avoid calls to mem operations.
The parsing now tracks the tunable start and its size. The
dl-tunable-parse.h adds helper functions to help parsing, like a strcmp
that also checks for size and an iterator for suboptions that are
comma-separated (used on hwcap parsing by x86, powerpc, and s390x).
Since the environment variable is allocated on the stack by the kernel,
it is safe to keep the references to the suboptions for later parsing
of string tunables (as done by set_hwcaps by multiple architectures).
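A minimal sketch of the copy-free approach, assuming a simple
(pointer, length) view type. The names below are hypothetical and are
not the helpers from dl-tunable-parse.h; they only illustrate a
size-checking comparison and a comma-separated suboption iterator that
never writes to the string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view type: a substring referenced in place.  */
struct demo_view
{
  const char *str;
  size_t len;
};

/* Compare a view with a NUL-terminated name, checking the size too,
   so that "SHSTKX" does not match "SHSTK".  */
bool
demo_view_equals (struct demo_view v, const char *name)
{
  return strlen (name) == v.len && memcmp (v.str, name, v.len) == 0;
}

/* Return the next comma-separated suboption in *OUT and advance
   *CURSOR past it.  Return false when nothing is left.  */
bool
demo_next_suboption (const char **cursor, struct demo_view *out)
{
  assert (cursor != NULL && out != NULL);
  const char *p = *cursor;
  if (p == NULL || *p == '\0')
    return false;
  size_t len = strcspn (p, ",");
  out->str = p;
  out->len = len;
  p += len;
  if (*p == ',')
    p++;
  *cursor = p;
  return true;
}

/* Count how many suboptions of VALUE are equal to NAME.  */
int
demo_count_matches (const char *value, const char *name)
{
  int n = 0;
  struct demo_view v;
  while (demo_next_suboption (&value, &v))
    if (demo_view_equals (v, name))
      n++;
  return n;
}
```

Since the views point into the original string, they stay valid for as
long as the environment block does, with no allocation involved.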
Checked on x86_64-linux-gnu, powerpc64le-linux-gnu, and
aarch64-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Since GCC commit f31a019d1161ec78846473da743aedf49cca8c27 "Emit
funcall external declarations only if actually used.", the glibc
testsuite has failed to build for 32-bit SPARC with GCC mainline.
/scratch/jmyers/glibc-bot/install/compilers/sparc64-linux-gnu/lib/gcc/sparc64-glibc-linux-gnu/14.0.0/../../../../sparc64-glibc-linux-gnu/bin/ld: /scratch/jmyers/glibc-bot/install/compilers/sparc64-linux-gnu/lib/gcc/sparc64-glibc-linux-gnu/14.0.0/32/libgcc.a(_divsi3.o): in function `.div':
/scratch/jmyers/glibc-bot/src/gcc/libgcc/config/sparc/lb1spc.S:138: multiple definition of `.div'; /scratch/jmyers/glibc-bot/build/glibcs/sparcv9-linux-gnu/glibc/libc.a(sdiv.o):/scratch/jmyers/glibc-bot/src/glibc/gnulib/../sysdeps/sparc/sparc32/sparcv9/sdiv.S:13: first defined here
/scratch/jmyers/glibc-bot/install/compilers/sparc64-linux-gnu/lib/gcc/sparc64-glibc-linux-gnu/14.0.0/../../../../sparc64-glibc-linux-gnu/bin/ld: disabling relaxation; it will not work with multiple definitions
collect2: error: ld returned 1 exit status
make[3]: *** [../Rules:298: /scratch/jmyers/glibc-bot/build/glibcs/sparcv9-linux-gnu/glibc/nptl/tst-cancel24-static] Error 1
https://sourceware.org/pipermail/libc-testresults/2023q4/012154.html
I'm not sure of the exact sequence of undefined references that cause
first the glibc object file defining .div and then the libgcc object
file defining both .div and .udiv to be pulled in (which must have
been perturbed by that GCC change in a way that introduced the build
failure), but I think the failure illustrates that it's inherently
fragile for glibc to define symbols in separate object files that
libgcc defines in the same object file - and indeed for glibc to
redefine libgcc symbols at all, since the division into object files
shouldn't really be part of the interface between libgcc and libc.
These symbols appear to be in libc only for compatibility, maybe one
of the cases where they were accidentally exported from shared libc in
glibc 2.0 before the introduction of symbol versioning and so programs
started expecting shared libc to provide them. Thus, there is no need
to have them in static libc. Add this set of libgcc functions to
shared-only-routines so they are no longer provided in static libc.
(No change is made regarding .mul - dotmul source file - since unlike
the other symbols in this grouping, it doesn't actually appear to be a
libgcc symbol, at least in current GCC.)
Tested with build-many-glibcs.py for sparcv9-linux-gnu with GCC
mainline.
Verify that legacy shadow stack code in .init_array section in application
and shared library, which are marked as shadow stack enabled, will trigger
segfault.
So far if the ucontext structure was obtained by getcontext and co,
the return address was stored in general purpose register 14 as
it is defined as return address in the ABI.
In contrast, the context passed to a signal handler contains the address
in psw.addr field.
If somebody e.g. wants to dump the address of the context, the origin
needs to be known.
Now this patch adjusts getcontext and friends and stores the return address
also in psw.addr field.
Note that setcontext isn't adjusted and it is not supported to pass a
ucontext structure from a signal handler to setcontext. We are not able
to restore all registers and branch to psw.addr without clobbering one
register.
This commit uses a common implementation 'strlen-evex-base.S' for both
'strlen-evex' and 'strlen-evex512'.
The motivation is to reduce the number of implementations to maintain.
This incidentally gives a small performance improvement.
All tests pass on x86.
Benchmarks were taken on SKX.
https://www.intel.com/content/www/us/en/products/sku/123613/intel-core-i97900x-xseries-processor-13-75m-cache-up-to-4-30-ghz/specifications.html
Geometric mean for strlen-evex512 over all benchmarks (N=10) was (new/old) 0.939
Geometric mean for wcslen-evex512 over all benchmarks (N=10) was (new/old) 0.965
Code Size Changes:
strlen-evex512.S : +24 bytes
wcslen-evex512.S : +54 bytes
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Since shadow stack (SHSTK) is enabled in the Linux kernel without
enabling indirect branch tracking (IBT), don't assume that SHSTK
implies IBT. Use "CPU_FEATURE_ACTIVE (IBT)" to check if IBT is active
and "CPU_FEATURE_ACTIVE (SHSTK)" to check if SHSTK is active.
This patch reserves space for HWCAP3/HWCAP4 in the TCB of powerpc.
These hardware capabilities bits will be used by future Power
architectures.
Versioned symbol '__parse_hwcap_3_4_and_convert_at_platform' advertises
the availability of the new HWCAP3/HWCAP4 data in the TCB.
This is an ABI change for GLIBC 2.39.
Suggested-by: Peter Bergner <bergner@linux.ibm.com>
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
The current implementation of strcmp for power10 has
performance regressions for multiple combinations of
small sizes and alignments.
Most of these performance issues are fixed by this
patch. The compare loop is unrolled and page crossings
of the unrolled loop are handled.
Thanks to Paul E. Murphy for helping in fixing the
performance issues.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
Co-Authored-By: Paul E. Murphy <murphyp@linux.ibm.com>
Reviewed-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
Optimized memchr for POWER10 based on existing rawmemchr and strlen.
Reordering instructions and loop unrolling helped in getting better performance.
Reviewed-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
The PT_GNU_PROPERTY segment is scanned before PT_NOTE. For binaries
with the PT_GNU_PROPERTY segment, we can check it to avoid scan of
the PT_NOTE segment.
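The preference order can be sketched as a single pass over the program
headers. This is an illustration only: the struct is a hypothetical
one-field stand-in for a program header, and the function is not the
loader's code. The two type values are the standard ELF ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PT_NOTE         4U
#define DEMO_PT_GNU_PROPERTY 0x6474e553U

/* Hypothetical, reduced program header: only the type matters here.  */
struct demo_phdr
{
  uint32_t p_type;
};

/* Return the index of the segment to read property notes from:
   PT_GNU_PROPERTY if present, otherwise the first PT_NOTE, or -1.  */
ptrdiff_t
demo_pick_property_segment (const struct demo_phdr *phdr, size_t phnum)
{
  assert (phdr != NULL || phnum == 0);
  ptrdiff_t first_note = -1;
  for (size_t i = 0; i < phnum; i++)
    {
      if (phdr[i].p_type == DEMO_PT_GNU_PROPERTY)
        return (ptrdiff_t) i;
      if (phdr[i].p_type == DEMO_PT_NOTE && first_note < 0)
        first_note = (ptrdiff_t) i;
    }
  return first_note;
}
```

When a PT_GNU_PROPERTY segment exists, the PT_NOTE candidates are never
scanned for properties.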
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This patch is based on __strcmp_power9 and __strlen_power10.
Improvements from __strcmp_power9:
1. Uses new POWER10 instructions
- This code uses lxvp to decrease contention on load
by loading 32 bytes per instruction.
2. Performance implication
- This version has around 30% better performance on average.
   - Performance regressions are seen for specific combinations
     of sizes and alignments. Some of them are observed without
     the changes as well, while the rest may be induced by the patch.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com>
This avoids any environment variable changing the semantics of
setuid binaries.
Checked on x86_64-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Loader already ignores LD_DEBUG, LD_DEBUG_OUTPUT, and
LD_TRACE_LOADED_OBJECTS. Both LD_WARN and LD_VERBOSE are similar to
LD_DEBUG, in the sense they enable additional checks and debug
information, so it makes sense to disable them.
Also add both LD_VERBOSE and LD_WARN on filtered environment variables
for setuid binaries.
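As an illustration of the name check involved, the sketch below tests
whether an environment entry names one of the variables mentioned in
this message. The table and function are hypothetical, not the loader's
own list or code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table holding only the names mentioned above.  */
static const char *const demo_names[] =
{
  "LD_DEBUG",
  "LD_DEBUG_OUTPUT",
  "LD_TRACE_LOADED_OBJECTS",
  "LD_VERBOSE",
  "LD_WARN",
};

/* ENTRY has the form "NAME=value".  Return true if NAME is one of
   the variables that should have no effect on a setuid binary.  */
bool
demo_ignored_when_secure (const char *entry)
{
  assert (entry != NULL);
  size_t name_len = strcspn (entry, "=");
  for (size_t i = 0; i < sizeof demo_names / sizeof demo_names[0]; i++)
    if (strlen (demo_names[i]) == name_len
        && memcmp (entry, demo_names[i], name_len) == 0)
      return true;
  return false;
}
```

The length comparison keeps a longer name such as LD_WARNING from
matching LD_WARN by prefix.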
Checked on x86_64-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
The .cfi_return_column directive changes the return column for the whole
FDE range. But the actual intent is to tell the unwinder that the value
in x30 (lr) now resides in x15 after the move, and that is expressed by
the .cfi_register directive.
New implementation is based on the existing exp/exp2, with different
reduction constants and polynomial. Worst-case error in round-to-
nearest is 0.513 ULP.
The exp/exp2 shared table is reused for exp10 - .rodata size of
e_exp_data increases by 64 bytes.
As for exp/exp2, targets with single-instruction rounding/conversion
intrinsics can use them by toggling TOINT_INTRINSICS=1 and adding the
necessary code to their math_private.h.
Improvements on Neoverse V1 compared to current GLIBC master:
exp10 throughput: 3.3x in [-0x1.439b746e36b52p+8 0x1.34413509f79ffp+8]
exp10 latency: 1.8x in [-0x1.439b746e36b52p+8 0x1.34413509f79ffp+8]
Tested on:
aarch64-linux-gnu (TOINT_INTRINSICS, fma contraction) and
x86_64-linux-gnu (!TOINT_INTRINSICS, no fma contraction)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
It is not strictly required by POSIX, since O_PATH is a Linux
extension, but it is QoI to fail early instead of at readdir. Also
the check is free, since fdopendir already checks if the file
descriptor is opened for read.
Checked on x86_64-linux-gnu.
The linker just concatenates the .init and .fini sections which
results in the complete _init and _fini functions. If needed the
linker adds padding bytes due to an alignment. GNU ld is adding
NOPs, which is fine. But e.g. mold is adding traps which results
in broken _init and _fini functions.
Thus this patch removes the alignment in .init and .fini sections
in crtn.S files.
We keep the 4 byte function alignment in crti.S files. As the
assembler now also emits the start of the _init and _fini functions
as multiples of 4 bytes, it may have to add fill bytes. Although GNU as
is using NOPs here, to be sure, we just keep the alignment with
0x07 (=NOPs) at the end of crti.S.
In order to avoid an obvious NOP slide in _fini, this patch also
uses an lg instead of lgr instruction. Then the emitted instructions
occupy a multiple of 4 bytes.