Systemd execution environment configuration may prohibit changing a memory
mapping to become executable:
MemoryDenyWriteExecute=
Takes a boolean argument. If set, attempts to create memory mappings
that are writable and executable at the same time, or to change existing
memory mappings to become executable, or mapping shared memory segments
as executable, are prohibited.
When it is set, systemd service stops working if PLT rewrite is enabled.
Check if mprotect works before rewriting PLT. This fixes BZ #31230.
This also works with SELinux when deny_execmem is on.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Ffsll function randomly regress by ~20%, depending on how code gets
aligned in memory. Ffsll function code size is 17 bytes. Since default
function alignment is 16 bytes, it can load on 16, 32, 48 or 64 bytes
aligned memory. When ffsll function load at 16, 32 or 64 bytes aligned
memory, entire code fits in single 64 bytes cache line. When ffsll
function load at 48 bytes aligned memory, it splits in two cache line,
hence random regression.
Ffsll function size reduction from 17 bytes to 12 bytes ensures that it
will always fit in single 64 bytes cache line.
This patch fixes ffsll function random performance regression.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
CET feature bits in TCB, which are Linux specific, are used to check if
CET features are active. Move CET feature check to Linux/x86 directory.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
1. Remove _dl_runtime_resolve_shstk and _dl_runtime_profile_shstk.
2. Move CET offsets from x86 cpu-features-offsets.sym to x86-64
features-offsets.sym.
3. Rename x86 cet-control.h to x86-64 feature-control.h since it is only
for x86-64 and also used for PLT rewrite.
4. Add x86-64 ldsodefs.h to include feature-control.h.
5. Change TUNABLE_CALLBACK (set_plt_rewrite) to x86-64 only.
6. Move x86 dl-procruntime.c to x86-64.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Move sysdeps/x86/libc-start.h to sysdeps/x86_64/libc-start.h and use
sysdeps/generic/libc-start.h for i386.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
CET is only support for x86_64, this patch reverts:
- faaee1f07e x86: Support shadow stack pointer in setjmp/longjmp.
- be9ccd27c0 i386: Add _CET_ENDBR to indirect jump targets in
add_n.S/sub_n.S
- c02695d776 x86/CET: Update vfork to prevent child return
- 5d844e1b72 i386: Enable CET support in ucontext functions
- 124bcde683 x86: Add _CET_ENDBR to functions in crti.S
- 562837c002 x86: Add _CET_ENDBR to functions in dl-tlsdesc.S
- f753fa7dea x86: Support IBT and SHSTK in Intel CET [BZ #21598]
- 825b58f3fb i386-mcount.S: Add _CET_ENDBR to _mcount and __fentry__
- 7e119cd582 i386: Use _CET_NOTRACK in i686/memcmp.S
- 177824e232 i386: Use _CET_NOTRACK in memcmp-sse4.S
- 0a899af097 i386: Use _CET_NOTRACK in memcpy-ssse3-rep.S
- 7fb613361c i386: Use _CET_NOTRACK in memcpy-ssse3.S
- 77a8ae0948 i386: Use _CET_NOTRACK in memset-sse2-rep.S
- 00e7b76a8f i386: Use _CET_NOTRACK in memset-sse2.S
- 90d15dc577 i386: Use _CET_NOTRACK in strcat-sse2.S
- f1574581c7 i386: Use _CET_NOTRACK in strcpy-sse2.S
- 4031d7484a i386/sub_n.S: Add a missing _CET_ENDBR to indirect jump
- target
-
Checked on i686-linux-gnu.
The CET is only supported for x86_64 and there is no plan to add
kernel support for i386. Move the Makefile rules and files from the
generic x86 folder to x86_64 one.
Checked on x86_64-linux-gnu and i686-linux-gnu.
PLT rewrite calculated displacement with
ElfW(Addr) disp = value - branch_start - JMP32_INSN_SIZE;
On x32, displacement from 0xf7fbe060 to 0x401030 was calculated as
unsigned int disp = 0x401030 - 0xf7fbe060 - 5;
with disp == 0x8442fcb and caused displacement overflow. The PLT entry
was changed to:
0xf7fbe060 <+0>: e9 cb 2f 44 08 jmp 0x401030
0xf7fbe065 <+5>: cc int3
0xf7fbe066 <+6>: cc int3
0xf7fbe067 <+7>: cc int3
0xf7fbe068 <+8>: cc int3
0xf7fbe069 <+9>: cc int3
0xf7fbe06a <+10>: cc int3
0xf7fbe06b <+11>: cc int3
0xf7fbe06c <+12>: cc int3
0xf7fbe06d <+13>: cc int3
0xf7fbe06e <+14>: cc int3
0xf7fbe06f <+15>: cc int3
x32 has 32-bit address range, but it doesn't wrap address around at 4GB,
JMP target was changed to 0x100401030 (0xf7fbe060LL + 0x8442fcbLL + 5),
which is above 4GB.
Always use uint64_t to calculate displacement. This fixes BZ #31218.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Add ELF_DYNAMIC_AFTER_RELOC to allow target specific processing after
relocation.
For x86-64, add
#define DT_X86_64_PLT (DT_LOPROC + 0)
#define DT_X86_64_PLTSZ (DT_LOPROC + 1)
#define DT_X86_64_PLTENT (DT_LOPROC + 3)
1. DT_X86_64_PLT: The address of the procedure linkage table.
2. DT_X86_64_PLTSZ: The total size, in bytes, of the procedure linkage
table.
3. DT_X86_64_PLTENT: The size, in bytes, of a procedure linkage table
entry.
With the r_addend field of the R_X86_64_JUMP_SLOT relocation set to the
memory offset of the indirect branch instruction.
Define ELF_DYNAMIC_AFTER_RELOC for x86-64 to rewrite the PLT section
with direct branch after relocation when the lazy binding is disabled.
PLT rewrite is disabled by default since SELinux may disallow modifying
code pages and ld.so can't detect it in all cases. Use
$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=1
to enable PLT rewrite with 32-bit direct jump at run-time or
$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=2
to enable PLT rewrite with 32-bit direct jump and on APX processors with
64-bit absolute jump at run-time.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
setcontext and swapcontext put a restore token on the old shadow stack
which is used to restore the target shadow stack when switching user
contexts. When longjmp from a user context, the target shadow stack
can be different from the current shadow stack and INCSSP can't be
used to restore the shadow stack pointer to the target shadow stack.
Update longjmp to search for a restore token. If found, use the token
to restore the shadow stack pointer before using INCSSP to pop the
shadow stack. Stop the token search and use INCSSP if the shadow stack
entry value is the same as the current shadow stack pointer.
It is a user error if there is a shadow stack switch without leaving a
restore token on the old shadow stack.
The only difference between __longjmp.S and __longjmp_chk.S is that
__longjmp_chk.S has a check for invalid longjmp usages. Merge
__longjmp.S and __longjmp_chk.S by adding the CHECK_INVALID_LONGJMP
macro.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Since shadow stack is only supported for x86-64, ignore --enable-cet for
i386. Always setting $(enable-cet) for i386 to "no" to support
ifneq ($(enable-cet),no)
in x86 Makefiles. We can't use
ifeq ($(enable-cet),yes)
since $(enable-cet) can be "yes", "no" or "permissive".
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Previously, CET was enabled by kernel before passing control to user
space and the startup code must disable CET if applications or shared
libraries aren't CET enabled. Since the current kernel only supports
shadow stack and won't enable shadow stack before passing control to
user space, we need to enable shadow stack during startup if the
application and all shared library are shadow stack enabled. There
is no need to disable shadow stack at startup. Shadow stack can only
be enabled in a function which will never return. Otherwise, shadow
stack will underflow at the function return.
1. GL(dl_x86_feature_1) is set to the CET features which are supported
by the processor and are not disabled by the tunable. Only non-zero
features in GL(dl_x86_feature_1) should be enabled. After enabling
shadow stack with ARCH_SHSTK_ENABLE, ARCH_SHSTK_STATUS is used to check
if shadow stack is really enabled.
2. Use ARCH_SHSTK_ENABLE in RTLD_START in dynamic executable. It is
safe since RTLD_START never returns.
3. Call arch_prctl (ARCH_SHSTK_ENABLE) from ARCH_SETUP_TLS in static
executable. Since the start function using ARCH_SETUP_TLS never returns,
it is safe to enable shadow stack in ARCH_SETUP_TLS.
Sync with Linux kernel 6.6 shadow stack interface. Since only x86-64 is
supported, i386 shadow stack codes are unchanged and CET shouldn't be
enabled for i386.
1. When the shadow stack base in TCB is unset, the default shadow stack
is in use. Use the current shadow stack pointer as the marker for the
default shadow stack. It is used to identify if the current shadow stack
is the same as the target shadow stack when switching ucontexts. If yes,
INCSSP will be used to unwind shadow stack. Otherwise, shadow stack
restore token will be used.
2. Allocate shadow stack with the map_shadow_stack syscall. Since there
is no function to explicitly release ucontext, there is no place to
release shadow stack allocated by map_shadow_stack in ucontext functions.
Such shadow stacks will be leaked.
3. Rename arch_prctl CET commands to ARCH_SHSTK_XXX.
4. Rewrite the CET control functions with the current kernel shadow stack
interface.
Since CET is no longer enabled by kernel, a separate patch will enable
shadow stack during startup.
_dl_tlsdesc_undefweak and _dl_tlsdesc_dynamic access the thread pointer
via the tcb field in TCB:
_dl_tlsdesc_undefweak:
_CET_ENDBR
movq 8(%rax), %rax
subq %fs:0, %rax
ret
_dl_tlsdesc_dynamic:
...
subq %fs:0, %rax
movq -8(%rsp), %rdi
ret
Since the tcb field in TCB is a pointer, %fs:0 is a 32-bit location,
not 64-bit. It should use "sub %fs:0, %RAX_LP" instead. Since
_dl_tlsdesc_undefweak returns ptrdiff_t and _dl_make_tlsdesc_dynamic
returns void *, RAX_LP is appropriate here for x32 and x86-64. This
fixes BZ #31185.
On x32, I got
FAIL: elf/tst-tlsgap
$ gdb elf/tst-tlsgap
...
open tst-tlsgap-mod1.so
Thread 2 "tst-tlsgap" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 2268754]
_dl_tlsdesc_dynamic () at ../sysdeps/x86_64/dl-tlsdesc.S:108
108 movq (%rsi), %rax
(gdb) p/x $rsi
$4 = 0xf7dbf9005655fb18
(gdb)
This is caused by
_dl_tlsdesc_dynamic:
_CET_ENDBR
/* Preserve call-clobbered registers that we modify.
We need two scratch regs anyway. */
movq %rsi, -16(%rsp)
movq %fs:DTV_OFFSET, %rsi
Since the dtv field in TCB is a pointer, %fs:DTV_OFFSET is a 32-bit
location, not 64-bit. Load the dtv field to RSI_LP instead of rsi.
This fixes BZ #31184.
According to ISO C23 (7.6.4.4), fesetexcept is supposed to set
floating-point exception flags without raising a trap (unlike
feraiseexcept, which is supposed to raise a trap if feenableexcept
was called with the appropriate argument).
The flags can be set in the 387 unit or in the SSE unit. When we need
to clear a flag, we need to do so in both units, due to the way
fetestexcept is implemented.
When we need to set a flag, it is sufficient to do it in the SSE unit,
because that is guaranteed to not trap. However, on i386 CPUs that have
only a 387 unit, set the flags in the 387, as long as this cannot trap.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This commit uses a common implementation 'strlen-evex-base.S' for both
'strlen-evex' and 'strlen-evex512'
The motivation is to reduce the number of implementations to maintain.
This incidentally gives a small performance improvement.
All tests pass on x86.
Benchmarks were taken on SKX.
https://www.intel.com/content/www/us/en/products/sku/123613/intel-core-i97900x-xseries-processor-13-75m-cache-up-to-4-30-ghz/specifications.html
Geometric mean for strlen-evex512 over all benchmarks (N=10) was (new/old) 0.939
Geometric mean for wcslen-evex512 over all benchmarks (N=10) was (new/old) 0.965
Code Size Changes:
strlen-evex512.S : +24 bytes
wcslen-evex512.S : +54 bytes
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Current code aligns to 2x VEC_SIZE. Aligning to 2x has no affect on
performance other than potentially resulting in an additional
iteration of the loop.
1x maintains aligned stores (the only reason to align in this case)
and doesn't incur any unnecessary loop iterations.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
The _dl_non_dynamic_init does not parse LD_PROFILE, which does not
enable profile for dlopen objects. Since dlopen is deprecated for
static objects, it is better to remove the support.
It also allows to trim down libc.a of profile support.
Checked on x86_64-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
The dl-symbol-redir-ifunc.h redirects compiler-generated libcalls to
arch-specific memory implementations to avoid ifunc calls where it is not
yet possible. The memcmp-isa-default-impl.h aims to fix the same issue
by calling the specific memset implementation directly.
Using the memcmp symbol directly allows the compiler to inline the memset
calls (especially because _dl_tunable_set_hwcaps uses constants values),
generating better code.
Checked on x86_64-linux-gnu.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
The tunable privilege levels were a retrofit to try and keep the malloc
tunable environment variables' behavior unchanged across security
boundaries. However, CVE-2023-4911 shows how tricky can be
tunable parsing in a security-sensitive environment.
Not only parsing, but the malloc tunable essentially changes some
semantics on setuid/setgid processes. Although it is not a direct
security issue, allowing users to change setuid/setgid semantics is not
a good security practice, and requires extra code and analysis to check
if each tunable is safe to use on all security boundaries.
It also means that security opt-in features, like aarch64 MTE, would
need to be explicit enabled by an administrator with a wrapper script
or with a possible future system-wide tunable setting.
Co-authored-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
strrchr-evex-base used `vpcompress{b|d}` in the page cross logic but
was missing the CPU_FEATURE checks for VBMI2 in the
ifunc/ifunc-impl-list.
The fix is either to add those checks or change the logic to not use
`vpcompress{b|d}`. Choosing the latter here so that the strrchr-evex
implementation is usable on SKX.
New implementation is a bit slower, but this is in a cold path so its
probably okay.
This commit refactors `strrchr-evex` and `strrchr-evex512` to use a
common implementation: `strrchr-evex-base.S`.
The motivation is `strrchr-evex` needed to be refactored to not use
64-bit masked registers in preperation for AVX10.
Once vec-width masked register combining was removed, the EVEX and
EVEX512 implementations can easily be implemented in the same file
without any major overhead.
The net result is performance improvements (measured on TGL) for both
`strrchr-evex` and `strrchr-evex512`. Although, note there are some
regressions in the test suite and it may be many of the cases that
make the total-geomean of improvement/regression across bench-strrchr
are cold. The point of the performance measurement is to show there
are no major regressions, but the primary motivation is preperation
for AVX10.
Benchmarks where taken on TGL:
https://www.intel.com/content/www/us/en/products/sku/213799/intel-core-i711850h-processor-24m-cache-up-to-4-80-ghz/specifications.html
EVEX geometric_mean(N=5) of all benchmarks New / Original : 0.74
EVEX512 geometric_mean(N=5) of all benchmarks New / Original: 0.87
Full check passes on x86.
In short: __tls_get_addr checks the global generation counter and if
the current dtv is older then _dl_update_slotinfo updates dtv up to the
generation of the accessed module. So if the global generation is newer
than generation of the module then __tls_get_addr keeps hitting the
slow dtv update path. The dtv update path includes a number of checks
to see if any update is needed and this already causes measurable tls
access slow down after dlopen.
It may be possible to detect up-to-date dtv faster. But if there are
many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at
least walking the slotinfo list.
This patch tries to update the dtv to the global generation instead, so
after a dlopen the tls access slow path is only hit once. The modules
with larger generation than the accessed one were not necessarily
synchronized before, so additional synchronization is needed.
This patch uses acquire/release synchronization when accessing the
generation counter.
Note: in the x86_64 version of dl-tls.c the generation is only loaded
once, since relaxed mo is not faster than acquire mo load.
I have not benchmarked this. Tested by Adhemerval Zanella on aarch64,
powerpc, sparc, x86 who reported that it fixes the performance issue
of bug 19924.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
On Skylake, it changes log1p bench performance by:
Before After Improvement
max 63.349 58.347 8%
min 4.448 5.651 -30%
mean 12.0674 10.336 14%
The minimum code path is
if (hx < 0x3FDA827A) /* x < 0.41422 */
{
if (__glibc_unlikely (ax >= 0x3ff00000)) /* x <= -1.0 */
{
...
}
if (__glibc_unlikely (ax < 0x3e200000)) /* |x| < 2**-29 */
{
math_force_eval (two54 + x); /* raise inexact */
if (ax < 0x3c900000) /* |x| < 2**-54 */
{
...
}
else
return x - x * x * 0.5;
FMA and non-FMA code sequences look similar. Non-FMA version is slightly
faster. Since log1p is called by asinh and atanh, it improves asinh
performance by:
Before After Improvement
max 75.645 63.135 16%
min 10.074 10.071 0%
mean 15.9483 14.9089 6%
and improves atanh performance by:
Before After Improvement
max 91.768 75.081 18%
min 15.548 13.883 10%
mean 18.3713 16.8011 8%
On Skylake, it improves expm1 bench performance by:
Before After Improvement
max 70.204 68.054 3%
min 20.709 16.2 22%
mean 22.1221 16.7367 24%
NB: Add
extern long double __expm1l (long double);
extern long double __expm1f128 (long double);
for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined since __expm1 may be expanded in their declarations which
causes the build failure.
With multiarch disabled, the default memmove implementation provides
the fortify routines for memcpy, mempcpy, and memmove. However, it
does not provide the internal hidden definitions used when building
with fortify enabled. The memset has a similar issue.
Checked on x86_64-linux-gnu building with different options:
default and --disable-multi-arch plus default, --disable-default-pie,
--enable-fortify-source={2,3}, and --enable-fortify-source={2,3}
with --disable-default-pie.
Tested-by: Andreas K. Huettel <dilfridge@gentoo.org>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Based on feedback by Mike Gilbert <floppym@gentoo.org>
Linux-6.1.38-dist x86_64 AMD Phenom-tm- II X6 1055T Processor
-march=amdfam10
failures occur for x32 ABI
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Bump autoconf requirement to 2.71 to allow regenerating configure on
more recent distributions. autoconf 2.71 has been in Fedora since F36
and is the current version in Debian stable (bookworm). It appears to
be current in Gentoo as well.
All sysdeps configure and preconfigure scripts have also been
regenerated; all changes are trivial transformations that do not affect
functionality.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The change is meant to avoid unwanted PLT entries for the wmemset and
wcrtomb routines when _FORTIFY_SOURCE is set.
On top of that, ensure that *_chk routines have their hidden builtin
definitions available.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
If libc_hidden_builtin_{def,proto} isn't properly set for *_chk routines,
there are unwanted PLT entries in libc.so.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
The first segment in a shared library may be read-only, not executable.
To support LD_PREFER_MAP_32BIT_EXEC on such shared libraries, we also
check MAP_DENYWRITE to decide if MAP_32BIT should be passed to mmap.
Normally the first segment is mapped with MAP_COPY, which is defined
as (MAP_PRIVATE | MAP_DENYWRITE). But if the segment alignment is
greater than the page size, MAP_COPY isn't used to allocate enough
space to ensure that the segment can be properly aligned. Map the
first segment with MAP_COPY in this case to fix BZ #30452.
These files could be useful to any port that wants to use ld.so.cache.
Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
On i386 and x86_64, for libc.a specifically, __mempcpy_chk calls
mempcpy which leads POSIX routines to call non-POSIX mempcpy indirectly.
This leads the linknamespace test to fail when glibc is built with
__FORTIFY_SOURCE=3.
Since calling mempcpy doesn't bring any benefit for libc.a, directly
call __mempcpy instead.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Since the assembly source file with -evex suffix should use YMM registers,
not ZMM registers, include x86-evex256-vecs.h by default to use YMM
registers in memcmpeq-evex.S
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Applying this commit results in a bit-identical rebuild of
mathvec/libmvec.so.1 (which is the only binary that gets rebuilt).
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This patch enables libmvec on AArch64. The proposed change is mainly
implementing build infrastructure to add the new routines to ABI,
tests and benchmarks. I have demonstrated how this all fits together
by adding implementations for vector cos, in both single and double
precision, targeting both Advanced SIMD and SVE.
The implementations of the routines themselves are just loops over the
scalar routine from libm for now, as we are more concerned with
getting the plumbing right at this point. We plan to contribute vector
routines from the Arm Optimized Routines repo that are compliant with
requirements described in the libmvec wiki.
Building libmvec requires minimum GCC 10 for SVE ACLE. To avoid raising
the minimum GCC by such a big jump, we allow users to disable libmvec
if their compiler is too old.
Note that at this point users have to manually call the vector math
functions. This seems to be acceptable to some downstream users.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
This makes the prefer_map_32bit_exec tunable no longer Linux-specific.
Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
Message-Id: <20230423215526.346009-4-bugaevc@gmail.com>
This is based on the Linux port's version, but laid out to match Mach's
struct i386_thread_state, much like the i386 version does.
Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
The divss instruction clobbers its first argument, and the constraints
need to reflect that. Fortunately, with GCC 12, generated code does
not actually change, so there is no externally visible bug.
Suggested-by: Jakub Jelinek <jakub@redhat.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Just like the other existing rtld-str* files, this provides rtld with
usable versions of stpncpy and strncpy.
Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
Message-Id: <20230319151017.531737-22-bugaevc@gmail.com>
The source code is the same as sysdeps/i386/htl/tcb-offsets.sym, but of
course the produced tcb-offsets.h will be different.
Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
Message-Id: <20230319151017.531737-21-bugaevc@gmail.com>
And make always supported. The configure option was added on glibc 2.25
and some features require it (such as hwcap mask, huge pages support, and
lock elisition tuning). It also simplifies the build permutations.
Changes from v1:
* Remove glibc.rtld.dynamic_sort changes, it is orthogonal and needs
more discussion.
* Cleanup more code.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
This allows other targets to use the same inputs for their own libmvec
microbenchmarks without having to duplicate them in their own
subdirectory.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Update libm test ulps for
commit 3efbf11fdf
Author: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Date: Tue Feb 14 11:24:59 2023 +0100
update auto-libm-test-out-hypot
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Although static linker can optimize it to local call, it follows the
internal scheme to provide hidden proto and definitions.
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
Although static linker can optimize it to local call, it follows the
internal scheme to provide hidden proto and definitions.
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
It moves the op_t definition out to an specific header, adds
the attribute 'may-alias', and cleanup its duplicated definitions.
Checked with a build and check with run-built-tests=no for all major
Linux ABIs.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Occurs when `src` has no null-term.
Two cases:
1) Zero-length check is doing:
```
test %rdx, %rdx
jl L(zero_len)
```
which doesn't actually check zero (was at some point `decq` and the
flag never got updated).
The fix is just make the flag `jle` i.e:
```
test %rdx, %rdx
jle L(zero_len)
```
2) Length check in page-cross case checking if we should continue is
doing:
```
cmpq %r8, %rdx
jb L(page_cross_small)
```
which means we will continue searching for null-term if length ends at
the end of a page and there was no null-term in `src`.
The fix is to make the flag:
```
cmpq %r8, %rdx
jbe L(page_cross_small)
```
vfprintf is entangled with vfwprintf (of course), __printf_fp,
__printf_fphex, __vstrfmon_l_internal, and the strfrom family of
functions. The latter use the internal snprintf functionality,
so vsnprintf is converted as well.
The simples conversion is __printf_fphex, followed by
__vstrfmon_l_internal and __printf_fp, and finally
__vfprintf_internal and __vfwprintf_internal. __vsnprintf_internal
and strfrom* are mostly consuming the new interfaces, so they
are comparatively simple.
__printf_fp is a public symbol, so the FILE *-based interface
had to preserved.
The __printf_fp rewrite does not change the actual binary-to-decimal
conversion algorithm, and digits are still not emitted directly to
the target buffer. However, the staging buffer now uses bytes
instead of wide characters, and one buffer copy is eliminated.
The changes are at least performance-neutral in my testing.
Floating point printing and snprintf improved measurably, so that
this Lua script
for i=1,5000000 do
print(i, i * math.pi)
end
runs about 5% faster for me. To preserve fprintf performance for
a simple "%d" format, this commit has some logic changes under
LABEL (unsigned_number) to avoid additional function calls. There
are certainly some very easy performance improvements here: binary,
octal and hexadecimal formatting can easily avoid the temporary work
buffer (the number of digits can be computed ahead-of-time using one
of the __builtin_clz* built-ins). Decimal formatting can use a
specialized version of _itoa_word for base 10.
The existing (inconsistent) width handling between strfmon and printf
is preserved here. __print_fp_buffer_1 would have to use
__translated_number_width to achieve ISO conformance for printf.
Test expectations in libio/tst-vtables-common.c are adjusted because
the internal staging buffer merges all virtual function calls into
one.
In general, stack buffer usage is greatly reduced, particularly for
unbuffered input streams. __printf_fp can still use a large buffer
in binary128 mode for %g, though.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
are concurrently modified as `memcmp` runs, there can be a SIGSEGV
in `L(ret_nonzero_vec_end_0)` because the sequential logic
assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
To be clear, this change does not mean the usage of `memcmp` is
supported. The program behaviour is undefined (UB) in the
presence of data races, and `memcmp` is incorrect when the values
of `a` and/or `b` are modified concurrently (data race). This UB
may manifest itself as a SIGSEGV. That being said, if we can
allow the idiomatic use cases, like those in yottadb with
opportunistic concurrency control (OCC), to execute without a
SIGSEGV, at no cost to regular use cases, then we can aim to
minimize harm to those existing users.
The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
`addq %rdx, %rax`. The 1-extra byte of code size from using the
64-bit instruction doesn't contribute to overall code size as the
next target is aligned and has multiple bytes of `nop` padding
before it. As well all the logic between the add and `ret` still
fits in the same fetch block, so the cost of this change is
basically zero.
The relevant sequential logic can be seen in the following
pseudo-code:
```
/*
* rsi = a
* rdi = b
* rdx = len - 32
*/
/* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
in this case, this check is also assumed to cover a[0:(31 - len)]
and b[0:(31 - len)]. */
movups (%rsi), %xmm0
movups (%rdi), %xmm1
PCMPEQ %xmm0, %xmm1
pmovmskb %xmm1, %eax
subl %ecx, %eax
jnz L(END_NEQ)
/* cmp a[len-16:len-1] and b[len-16:len-1]. */
movups 16(%rsi, %rdx), %xmm0
movups 16(%rdi, %rdx), %xmm1
PCMPEQ %xmm0, %xmm1
pmovmskb %xmm1, %eax
subl %ecx, %eax
jnz L(END_NEQ2)
ret
L(END2):
/* Position first mismatch. */
bsfl %eax, %eax
/* The sequential version is able to assume this value is a
positive 32-bit value because the first check included bytes in
range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
greater than `31 - len` so the minimum value of `edx` + `eax` is
`(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
`a` or `b` could have been changed so a mismatch in `eax` less or
equal than `(31 - len)` is possible (the new low bound is `(16 -
len)`. This can result in a negative 32-bit signed integer, which
when zero extended to 64-bits is a random large value this out
out of bounds. */
addl %edx, %eax
/* Crash here because 32-bit negative number in `eax` zero
extends to out of bounds 64-bit offset. */
movzbl 16(%rdi, %rax), %ecx
movzbl 16(%rsi, %rax), %eax
```
This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
`addq %rdx, %rax`). This prevents the 32-bit zero extension
and since `eax` is still a low bound of `16 - len` the `rdx + rax`
is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
fixed offset of `16` in the memory access this must be in bounds.
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes strncpy for x32. Tested on x86-64 and x32. On x86-64,
libc.so is the same with and without the fix.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits. The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.
This pach fixes strncat for x32. Tested on x86-64 and x32. On x86-64,
libc.so is the same with and without the fix.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Implemented:
wcscat-avx2 (+ 744 bytes
wcscpy-avx2 (+ 539 bytes)
wcpcpy-avx2 (+ 577 bytes)
wcsncpy-avx2 (+1108 bytes)
wcpncpy-avx2 (+1214 bytes)
wcsncat-avx2 (+1085 bytes)
Performance Changes:
Times are from N = 10 runs of the benchmark suite and are reported
as geometric mean of all ratios of New Implementation / Best Old
Implementation. Best Old Implementation was determined with the
highest ISA implementation.
wcscat-avx2 -> 0.975
wcscpy-avx2 -> 0.591
wcpcpy-avx2 -> 0.698
wcsncpy-avx2 -> 0.730
wcpncpy-avx2 -> 0.711
wcsncat-avx2 -> 0.954
Code Size Changes:
This change increase the size of libc.so by ~5.5kb bytes. For
reference the patch optimizing the normal strcpy family functions
decreases libc.so by ~5.2kb.
Full check passes on x86-64 and build succeeds for all ISA levels w/
and w/o multiarch.
Implemented:
wcscat-evex (+ 905 bytes)
wcscpy-evex (+ 674 bytes)
wcpcpy-evex (+ 709 bytes)
wcsncpy-evex (+1358 bytes)
wcpncpy-evex (+1467 bytes)
wcsncat-evex (+1213 bytes)
Performance Changes:
Times are from N = 10 runs of the benchmark suite and are reported
as geometric mean of all ratios of New Implementation / Best Old
Implementation. Best Old Implementation was determined with the
highest ISA implementation.
wcscat-evex -> 0.991
wcscpy-evex -> 0.587
wcpcpy-evex -> 0.695
wcsncpy-evex -> 0.719
wcpncpy-evex -> 0.694
wcsncat-evex -> 0.979
Code Size Changes:
This change increase the size of libc.so by ~6.3kb bytes. For
reference the patch optimizing the normal strcpy family functions
decreases libc.so by ~5.7kb.
Full check passes on x86-64 and build succeeds for all ISA levels w/
and w/o multiarch.
Optimizations are:
1. Use more overlapping stores to avoid branches.
2. Reduce how unrolled the aligning copies are (this is more of a
code-size save, its a negative for some sizes in terms of
perf).
3. For st{r|p}n{cat|cpy} re-order the branches to minimize the
number that are taken.
Performance Changes:
Times are from N = 10 runs of the benchmark suite and are
reported as geometric mean of all ratios of
New Implementation / Old Implementation.
strcat-avx2 -> 0.998
strcpy-avx2 -> 0.937
stpcpy-avx2 -> 0.971
strncpy-avx2 -> 0.793
stpncpy-avx2 -> 0.775
strncat-avx2 -> 0.962
Code Size Changes:
function -> Bytes New / Bytes Old -> Ratio
strcat-avx2 -> 685 / 1639 -> 0.418
strcpy-avx2 -> 560 / 903 -> 0.620
stpcpy-avx2 -> 592 / 939 -> 0.630
strncpy-avx2 -> 1176 / 2390 -> 0.492
stpncpy-avx2 -> 1268 / 2438 -> 0.520
strncat-avx2 -> 1042 / 2563 -> 0.407
Notes:
1. Because of the significant difference between the
implementations they are split into three files.
strcpy-avx2.S -> strcpy, stpcpy, strcat
strncpy-avx2.S -> strncpy
strncat-avx2.S > strncat
I couldn't find a way to merge them without making the
ifdefs incredibly difficult to follow.
Full check passes on x86-64 and build succeeds for all ISA levels w/
and w/o multiarch.
Optimizations are:
1. Use more overlapping stores to avoid branches.
2. Reduce how unrolled the aligning copies are (this is more of a
code-size save, its a negative for some sizes in terms of
perf).
3. Improve the loop a bit (similiar to what we do in strlen with
2x vpminu + kortest instead of 3x vpminu + kmov + test).
4. For st{r|p}n{cat|cpy} re-order the branches to minimize the
number that are taken.
Performance Changes:
Times are from N = 10 runs of the benchmark suite and are
reported as geometric mean of all ratios of
New Implementation / Old Implementation.
stpcpy-evex -> 0.922
strcat-evex -> 0.985
strcpy-evex -> 0.880
strncpy-evex -> 0.831
stpncpy-evex -> 0.780
strncat-evex -> 0.958
Code Size Changes:
function -> Bytes New / Bytes Old -> Ratio
strcat-evex -> 819 / 1874 -> 0.437
strcpy-evex -> 700 / 1074 -> 0.652
stpcpy-evex -> 735 / 1094 -> 0.672
strncpy-evex -> 1397 / 2611 -> 0.535
stpncpy-evex -> 1489 / 2691 -> 0.553
strncat-evex -> 1184 / 2832 -> 0.418
Notes:
1. Because of the significant difference between the
implementations they are split into three files.
strcpy-evex.S -> strcpy, stpcpy, strcat
strncpy-evex.S -> strncpy
strncat-evex.S > strncat
I couldn't find a way to merge them without making the
ifdefs incredibly difficult to follow.
2. All implementations can be made evex512 by including
"x86-evex512-vecs.h" at the top.
3. All implementations have an optional define:
`USE_EVEX_MASKED_STORE`
Setting to one uses evex-masked stores for handling short
strings. This saves code size and branches. It's disabled
for all implementations are the moment as there are some
serious drawbacks to masked stores in certain cases, but
that may be fixed on future architectures.
Full check passes on x86-64 and build succeeds for all ISA levels w/
and w/o multiarch.
Changes to generated code are:
1. In a few places use `vpcmpeqb` instead of `vpcmpneq` to save a
byte of code size.
2. Add a branch for length <= (VEC_SIZE * 6) as opposed to doing
the entire block of [VEC_SIZE * 4 + 1, VEC_SIZE * 8] in a
single basic-block (the space to add the extra branch without
changing code size is bought with the above change).
Change (2) has roughly a 20-25% speedup for sizes in [VEC_SIZE * 4 +
1, VEC_SIZE * 6] and negligible to no-cost for [VEC_SIZE * 6 + 1,
VEC_SIZE * 8]
From N=10 runs on Tigerlake:
align1,align2 ,length ,result ,New Time ,Cur Time ,New Time / Old Time
0 ,0 ,129 ,0 ,5.404 ,6.887 ,0.785
0 ,0 ,129 ,1 ,5.308 ,6.826 ,0.778
0 ,0 ,129 ,18446744073709551615 ,5.359 ,6.823 ,0.785
0 ,0 ,161 ,0 ,5.284 ,6.827 ,0.774
0 ,0 ,161 ,1 ,5.317 ,6.745 ,0.788
0 ,0 ,161 ,18446744073709551615 ,5.406 ,6.778 ,0.798
0 ,0 ,193 ,0 ,6.804 ,6.802 ,1.000
0 ,0 ,193 ,1 ,6.950 ,6.754 ,1.029
0 ,0 ,193 ,18446744073709551615 ,6.792 ,6.719 ,1.011
0 ,0 ,225 ,0 ,6.625 ,6.699 ,0.989
0 ,0 ,225 ,1 ,6.776 ,6.735 ,1.003
0 ,0 ,225 ,18446744073709551615 ,6.758 ,6.738 ,0.992
0 ,0 ,256 ,0 ,5.402 ,5.462 ,0.989
0 ,0 ,256 ,1 ,5.364 ,5.483 ,0.978
0 ,0 ,256 ,18446744073709551615 ,5.341 ,5.539 ,0.964
Rewriting with VMM API allows for memcmpeq-evex to be used with
evex512 by including "x86-evex512-vecs.h" at the top.
Complete check passes on x86-64.
The only change to the existing generated code is `tzcnt` -> `bsf` to
save a byte of code size here and there.
Rewriting with VMM API allows for memcmp-evex-movbe to be used with
evex512 by including "x86-evex512-vecs.h" at the top.
Complete check passes on x86-64.
Changes from v1:
Use vec api for register.
Replace VPCMP with VPCMPEQ
Restructure and remove 1 unconditional jump.
Change page cross logic to use sall.
This patch implements following evex512 version of string functions.
evex512 version takes up to 30% less cycle as compared to evex,
depending on length and alignment.
- strrchr function using 512 bit vectors.
- wcsrchr function using 512 bit vectors.
Code size data:
strrchr-evex.o 879 byte
strrchr-evex512.o 601 byte (-32%)
wcsrchr-evex.o 882 byte
wcsrchr-evex512.o 572 byte (-35%)
Placeholder function, not used by any processor at the moment.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This makes it more likely that the compiler can compute the strlen
argument in _startup_fatal at compile time, which is required to
avoid a dependency on strlen this early during process startup.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
The old exception handling implementation used function interposition
to replace the dynamic loader implementation (no TLS support) with the
libc implementation (TLS support). This results in problems if the
link order between the dynamic loader and libc is reversed (bug 25486).
The new implementation moves the entire implementation of the
exception handling functions back into the dynamic loader, using
THREAD_GETMEM and THREAD_SETMEM for thread-local data support.
These depends on Hurd support for these macros, added in commit
b65a82e4e7 ("hurd: Add THREAD_GET/SETMEM/_NC").
One small obstacle is that the exception handling facilities are used
before the TCB has been set up, so a check is needed if the TCB is
available. If not, a regular global variable is used to store the
exception handling information.
Also rename dl-error.c to dl-catch.c, to avoid confusion with the
dlerror function.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
GCC 13 has added more _FloatN and _FloatNx versions of existing
<math.h> and <complex.h> built-in functions, for use in libstdc++-v3.
This breaks the glibc build because of how those functions are defined
as aliases to functions with the same ABI but different types. Add
appropriate -fno-builtin-* options for compiling relevant files, as
already done for the case of long double functions aliasing double
ones and based on the list of files used there.
I fixed some mistakes in that list of double files that I noticed
while implementing this fix, but there may well be more such
(harmless) cases, in this list or the new one (files that don't
actually exist or don't define the named functions as aliases so don't
need the options). I did try to exclude cases where glibc doesn't
define certain functions for _FloatN or _FloatNx types at all from the
new uses of -fno-builtin-* options. As with the options for double
files (see the commit message for commit
49348beafe, "Fix build with GCC 10 when
long double = double."), it's deliberate that the options are used
even if GCC currently doesn't have a built-in version of a given
functions, so providing some level of future-proofing against more
such built-in functions being added in future.
Tested with build-many-glibcs.py for aarch64-linux-gnu
powerpc-linux-gnu powerpc64le-linux-gnu x86_64-linux-gnu (compilers
and glibcs builds) with GCC mainline.
This patch improves following functionality
- Replace VPCMP with VPCMPEQ.
- Replace page cross check logic with sall.
- Remove extra lea from align_more.
- Remove uncondition loop jump.
- Use bsf to check max length in first vector.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This patch implements following evex512 version of string functions.
evex512 version takes up to 30% less cycle as compared to evex,
depending on length and alignment.
- strchrnul function using 512 bit vectors.
- strchr function using 512 bit vectors.
- wcschr function using 512 bit vectors.
Code size data:
strchrnul-evex.o 599 byte
strchrnul-evex512.o 569 byte (-5%)
strchr-evex.o 639 byte
strchr-evex512.o 595 byte (-7%)
wcschr-evex.o 644 byte
wcschr-evex512.o 607 byte (-6%)
Placeholder function, not used by any processor at the moment.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Generic implementation on top of __bswap_32 always expands
inline to either bswap or movbe depending on -march=*.
Signed-off-by: Cristian Rodríguez <crrodriguez@opensuse.org>
Unused at the moment, but evex512 strcmp, strncmp, strcasecmp{l}, and
strncasecmp{l} functions can be added by including strcmp-evex.S with
"x86-evex512-vecs.h" defined.
In addition save code size a bit in a few places.
1. tzcnt ... -> bsf ...
2. vpcmp{b|d} $0 ... -> vpcmpeq{b|d}
This saves a touch of code size but has minimal net affect.
Full check passes on x86-64.
commit b412213eee
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Tue Oct 18 17:44:07 2022 -0700
x86: Optimize strrchr-evex.S and implement with VMM headers
Added `vpcompress{b|d}` to the page-cross logic with is an
AVX512-VBMI2 instruction. This is not supported on SKX. Since the
page-cross logic is relatively cold and the benefit is minimal
revert the page-cross case back to the old logic which is supported
on SKX.
Tested on x86-64.
Optimization is:
1. Cache latest result in "fast path" loop with `vmovdqu` instead of
`kunpckdq`. This helps if there are more than one matches.
Code Size Changes:
strrchr-evex.S : +30 bytes (Same number of cache lines)
Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time so < 1.0 is
improvement and 1.0 is regression.
strrchr-evex.S : 0.932 (From cases with higher match frequency)
Full results attached in email.
Full check passes on x86-64.
Optimizations are:
1. Use the fact that lzcnt(0) -> VEC_SIZE for memchr to save a branch
in short string case.
2. Save several instructions in len = [VEC_SIZE, 4 * VEC_SIZE] case.
3. Use more code-size efficient instructions.
- tzcnt ... -> bsf ...
- vpcmpb $0 ... -> vpcmpeq ...
Code Size Changes:
memrchr-evex.S : -29 bytes
Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time so < 1.0 is
improvement and 1.0 is regression.
memrchr-evex.S : 0.949 (Mostly from improvements in small strings)
Full results attached in email.
Full check passes on x86-64.
Optimizations are:
1. Use the fact that bsf(0) leaves the destination unchanged to save a
branch in short string case.
2. Restructure code so that small strings are given the hot path.
- This is a net-zero on the benchmark suite but in general makes
sense as smaller sizes are far more common.
3. Use more code-size efficient instructions.
- tzcnt ... -> bsf ...
- vpcmpb $0 ... -> vpcmpeq ...
4. Align labels less aggressively, especially if it doesn't save fetch
blocks / causes the basic-block to span extra cache-lines.
The optimizations (especially for point 2) make the strnlen and
strlen code essentially incompatible so split strnlen-evex
to a new file.
Code Size Changes:
strlen-evex.S : -23 bytes
strnlen-evex.S : -167 bytes
Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time so < 1.0 is
improvement and 1.0 is regression.
strlen-evex.S : 0.992 (No real change)
strnlen-evex.S : 0.947
Full results attached in email.
Full check passes on x86-64.
Size Optimizations:
1. Condence hot path for better cache-locality.
- This is most impact for strchrnul where the logic strings with
len <= VEC_SIZE or with a match in the first VEC no fits entirely
in the first cache line.
2. Reuse common targets in first 4x VEC and after the loop.
3. Don't align targets so aggressively if it doesn't change the number
of fetch blocks it will require and put more care in avoiding the
case where targets unnecessarily split cache lines.
4. Align the loop better for DSB/LSD
5. Use more code-size efficient instructions.
- tzcnt ... -> bsf ...
- vpcmpb $0 ... -> vpcmpeq ...
6. Align labels less aggressively, especially if it doesn't save fetch
blocks / causes the basic-block to span extra cache-lines.
Code Size Changes:
strchr-evex.S : -63 bytes
strchrnul-evex.S: -48 bytes
Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time so < 1.0 is
improvement and 1.0 is regression.
strchr-evex.S (Fixed) : 0.971
strchr-evex.S (Rand) : 0.932
strchrnul-evex.S : 0.965
Full results attached in email.
Full check passes on x86-64.
Optimizations are:
1. Use the fact that tzcnt(0) -> VEC_SIZE for memchr to save a branch
in short string case.
2. Restructure code so that small strings are given the hot path.
- This is a net-zero on the benchmark suite but in general makes
sense as smaller sizes are far more common.
3. Use more code-size efficient instructions.
- tzcnt ... -> bsf ...
- vpcmpb $0 ... -> vpcmpeq ...
4. Align labels less aggressively, especially if it doesn't save fetch
blocks / causes the basic-block to span extra cache-lines.
The optimizations (especially for point 2) make the memchr and
rawmemchr code essentially incompatible so split rawmemchr-evex
to a new file.
Code Size Changes:
memchr-evex.S : -107 bytes
rawmemchr-evex.S : -53 bytes
Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time so < 1.0 is
improvement and 1.0 is regression.
memchr-evex.S : 0.928
rawmemchr-evex.S : 0.986 (Less targets cross cache lines)
Full results attached in email.
Full check passes on x86-64.
This patch implements following evex512 version of string functions.
evex512 version takes up to 30% less cycle as compared to evex,
depending on length and alignment.
- memchr function using 512 bit vectors.
- rawmemchr function using 512 bit vectors.
- wmemchr function using 512 bit vectors.
Code size data:
memchr-evex.o 762 byte
memchr-evex512.o 576 byte (-24%)
rawmemchr-evex.o 461 byte
rawmemchr-evex512.o 412 byte (-11%)
wmemchr-evex.o 794 byte
wmemchr-evex512.o 552 byte (-30%)
Placeholder function, not used by any processor at the moment.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
In the future, this will result in a compilation failure if the
macros are unexpectedly undefined (due to header inclusion ordering
or header inclusion missing altogether).
Assembler sources are more difficult to convert. In many cases,
they are hand-optimized for the mangling and no-mangling variants,
which is why they are not converted.
sysdeps/s390/s390-32/__longjmp.c and sysdeps/s390/s390-64/__longjmp.c
are special: These are C sources, but most of the implementation is
in assembler, so the PTR_DEMANGLE macro has to be undefined in some
cases, to match the assembler style.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This allows us to define a generic no-op version of PTR_MANGLE and
PTR_DEMANGLE. In the future, we can use PTR_MANGLE and PTR_DEMANGLE
unconditionally in C sources, avoiding an unintended loss of hardening
due to missing include files or unlucky header inclusion ordering.
In i386 and x86_64, we can avoid a <tls.h> dependency in the C
code by using the computed constant from <tcb-offsets.h>. <sysdep.h>
no longer includes these definitions, so there is no cyclic dependency
anymore when computing the <tcb-offsets.h> constants.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This way, we can define the pointer guard macros without including
<sysdep.h> on x86-64. Other architectures will not have such an
inclusion dependency, and the implied header file inclusion would
create a porting hazard.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
To avoid duplicate the VMM / GPR / mask insn macros in all incoming
evex512 files use the macros defined in 'reg-macros.h' and
'{vec}-macros.h'
This commit does not change libc.so
Tested build on x86-64