Give fall-through path to `vzeroupper` and taken-path to `vzeroall`.
Generally even on machines with RTM the expectation is the
string-library functions will not be called in transactions.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.
The total code size saving is: 64 bytes
There are no non-negligible changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 1.000
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.
The total code size saving is: 59 bytes
There are no major changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 0.967
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
10-20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 15-45% speedup.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 263 bytes
Geometric Mean of all benchmarks New / Old: 0.755
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 35% speedup.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The new code:
1. prioritizes smaller lengths more.
2. optimizes target placement more carefully.
3. reuses logic more.
4. fixes up various inefficiencies in the logic.
The total code size saving is: 394 bytes
Geometric Mean of all benchmarks New / Old: 0.874
Regressions:
1. The page cross case is now colder, especially re-entry from the
page cross case if a match is not found in the first VEC
(roughly 50%). My general opinion with this patch is this is
acceptable given the "coldness" of this case (less than 4%) and
generally performance improvement in the other far more common
cases.
2. There are some regressions 5-15% for medium/large user-arg
lengths that have a match in the first VEC. This is because the
logic was rewritten to optimize finds in the first VEC if the
user-arg length is shorter (where we see roughly 20-50%
performance improvements). It is not always the case this is a
regression. My intuition is some frontend quirk is partially
explaining the data although I haven't been able to find the
root cause.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Add a second iteration for memrchr to set `pos` starting from the end
of the buffer.
Previously `pos` was only set relative to the beginning of the
buffer. This isn't really useful for memrchr because the beginning
of the search space is (buf + len).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The RTM vzeroupper mitigation has no way of replacing inline
vzeroupper not before a return.
This can be useful when hoisting a vzeroupper to save code size
for example:
```
L(foo):
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
VZEROUPPER_RETURN
L(bar):
xorl %eax, %eax
VZEROUPPER_RETURN
```
Can become:
```
L(foo):
COND_VZEROUPPER
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
ret
L(bar):
xorl %eax, %eax
ret
```
This code does not change any existing functionality.
There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This patch does not touch any existing code and is only meant to be a
tool for future patches so that simple source files can more easily be
maintained to target multiple VEC classes.
There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
__strncpy_power9 initializes VR 18 with zeroes to be used throughout the
code, including when zero-padding the destination string. However, the
v18 reference was mistakenly being used for stxv and stxvl, which take a
VSX vector as operand. The code ended up using the uninitialized VSR 18
register by mistake.
Both occurrences have been changed to use the proper VSX number for VR 18
(i.e. VSR 50).
Tested on powerpc, powerpc64 and powerpc64le.
Signed-off-by: Kewen Lin <linkw@gcc.gnu.org>
Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE
vectors which improves the random memcpy benchmark significantly.
Cleanup the memcpy and memmove ifunc selectors.
Adding a 512-bit EVEX version of strstr. The algorithm works as follows:
(1) We spend a few cycles at the begining to peek into the needle. We
locate an edge in the needle (first occurance of 2 consequent distinct
characters) and also store the first 64-bytes into a zmm register.
(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary which can cause a seg fault.
(3) If an edge is found in the haystack we first compare the first
64-bytes of the needle (already stored in a zmm register) before we
proceed with a full string compare performed byte by byte.
Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)
Geometric mean of all benchmarks: new / old = 0.66
Difficult skiptable(0) : new / old = 0.02
Difficult skiptable(1) : new / old = 0.01
Difficult 2-way : new / old = 0.25
Difficult testing first 2 : new / old = 1.26
Difficult skiptable(0) : new / old = 0.05
Difficult skiptable(1) : new / old = 0.06
Difficult 2-way : new / old = 0.26
Difficult testing first 2 : new / old = 1.05
Difficult skiptable(0) : new / old = 0.42
Difficult skiptable(1) : new / old = 0.24
Difficult 2-way : new / old = 0.21
Difficult testing first 2 : new / old = 1.04
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The next revision of the ISO C standard has added the timegm function
(that was already supported in glibc). Update the feature test
conditionals on its declaration in <time.h> accordingly.
Tested for x86_64.
Newer versions of GNU grep (after grep 3.7, not inclusive) will warn on
'egrep' and 'fgrep' invocations.
Convert usages within the tree to their expanded non-aliased counterparts
to avoid irritating warnings during ./configure and the test suite.
Signed-off-by: Sam James <sam@gentoo.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Added in Linux 5.15 (884a7e5964e06ed93c7771c0d7cf19c09a8946f1), the new
syscalls allows a caller to free the memory of a dying target process.
Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
It was added on Linux 5.10 (ecb8ac8b1f146915aa6b96449b66dd48984caacc)
with the same functionality as madvise but using a pidfd of the target
process.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This definition is only used as a fallback with old kernel headers.
The change follows kernel commit bfdf4e6208051ed7165b2e92035b4bf11
("rseq: Remove broken uapi field layout on 32-bit little endian").
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The issue is only when used within libc.so (iconvconfig already builds
with _TIME_SIZE=64).
This is a missing spot initially from 52a5fe70a2.
Checked on i686-linux-gnu.
LoongArch is going to be the first architecture supported by Linux that
has neither fstat* nor newfstatat [1], instead exclusively relying on
statx. So in fstatat64's implementation, we need to also enable statx
usage if neither fstatat64 nor newfstatat is present, to prepare for
this new case of kernel ABI.
[1]: https://lore.kernel.org/all/20220518092619.1269111-1-chenhuacai@loongson.cn/
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Linux 5.18 adds a constant MADV_DONTNEED_LOCKED (defined in multiple
header files, but with the same value on all architectures). Add this
constant to bits/mman-linux.h.
Tested for x86_64.
Linux 5.18 defines a new AArch64 HWCAP value HWCAP2_MTE3; add it to
glibc's sysdeps/unix/sysv/linux/aarch64/bits/hwcap.h.
Tested with build-many-glibcs.py for aarch64-linux-gnu.
The compiler may substitute calls to sin or cos with calls to sincos, thus
we should have the same optimized implementations for sincos. The
optimized implementations may produce results that differ, that also makes
sure that the sincos call aggrees with the sin and cos calls.
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on sparc64-linux-gnu and sparcv9-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked with qemu-user that arguments are correctly passed on both
constructors and main program.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on s390x-linux-gnu and s390-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>