The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.
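As a rough illustration of how the NULL check and the search-CHAR
check can be combined into a single comparison (a hypothetical SSE2
helper for brevity; the real code is hand-written assembly):

  #include <emmintrin.h>

  /* Bytes equal to the search CHAR become 0 after the XOR, so taking
     the unsigned min with the original chunk yields 0 exactly where
     there is a match or a NUL; one compare-with-zero finds both.  */
  static unsigned
  null_or_match_mask (__m128i chunk, __m128i c_broadcast)
  {
    __m128i x = _mm_xor_si128 (chunk, c_broadcast);
    __m128i m = _mm_min_epu8 (x, chunk);
    return (unsigned) _mm_movemask_epi8 (_mm_cmpeq_epi8 (m, _mm_setzero_si128 ()));
  }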
Geometric Mean of all benchmarks New / Old: 0.755
See email for all results.
Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c966099cdc)
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.
Geometric Mean of all benchmarks New / Old: 0.832
See email for all results.
Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit df7e295d18)
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.
Geometric Mean of all benchmarks New / Old: 0.741
See email for all results.
Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5307aa9c18)
Code didn't actually use any sse4 instructions since `ptest` was
removed in:
commit 2f9062d717
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Nov 10 16:18:56 2021 -0600
x86: Shrink memcmp-sse4.S code size
The new memcmp-sse2 implementation is also faster.
geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905
Note there are two regressions where the old SSE4 version remains
preferable: Size = 1 and Size = 65.
Size = 1:
size, align0, align1, ret, New Time/Old Time
1, 1, 1, 0, 1.2
1, 1, 1, 1, 1.197
1, 1, 1, -1, 1.2
This is intentional. Size == 1 is significantly less hot based on
profiles of GCC11 and Python3 than sizes [4, 8] (which are made
hotter).
Python3 Size = 1 -> 13.64%
Python3 Size = [4, 8] -> 60.92%
GCC11 Size = 1 -> 1.29%
GCC11 Size = [4, 8] -> 33.86%
size, align0, align1, ret, New Time/Old Time
4, 4, 4, 0, 0.622
4, 4, 4, 1, 0.797
4, 4, 4, -1, 0.805
5, 5, 5, 0, 0.623
5, 5, 5, 1, 0.777
5, 5, 5, -1, 0.802
6, 6, 6, 0, 0.625
6, 6, 6, 1, 0.813
6, 6, 6, -1, 0.788
7, 7, 7, 0, 0.625
7, 7, 7, 1, 0.799
7, 7, 7, -1, 0.795
8, 8, 8, 0, 0.625
8, 8, 8, 1, 0.848
8, 8, 8, -1, 0.914
9, 9, 9, 0, 0.625
Size = 65:
size, align0, align1, ret, New Time/Old Time
65, 0, 0, 0, 1.103
65, 0, 0, 1, 1.216
65, 0, 0, -1, 1.227
65, 65, 0, 0, 1.091
65, 0, 65, 1, 1.19
65, 65, 65, -1, 1.215
This is because A) the checks in range [65, 96] are now unrolled 2x,
and B) smaller values <= 16 are now given a hotter path. By contrast,
the SSE4 version has a branch for Size = 80. The unrolled version
gets better performance for returns which need both comparisons.
size, align0, align1, ret, New Time/Old Time
128, 4, 8, 0, 0.858
128, 4, 8, 1, 0.879
128, 4, 8, -1, 0.888
As well, outside of microbenchmark environments, where branches are
not fully predictable, the branch will have a real cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7cbc03d030)
Just a few QOL changes.
1. Prefer `add` over `lea` as it can run on more execution ports.
2. Don't break macro-fusion between `test` and `jcc`.
3. Reduce code size by removing gratuitous padding bytes (-90
bytes).
geometric_mean(N=20) of all benchmarks New / Original: 0.959
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 244b415d38)
The rationale is:
1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
regression on Tigerlake using SSE42 versus AVX across the
benchtest suite).
2. AVX2 version covers the majority of targets that previously
preferred it.
3. The targets where AVX would still be best (SnB and IVB) are
becoming outdated.
All in all, the code size savings are worth it.
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 305769b2a1)
geometric_mean(N=40) of all benchmarks EVEX / SSE42: 0.621
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 84e7c46df4)
geometric_mean(N=40) of all benchmarks AVX2 / SSE42: 0.702
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit bbf8122234)
Slightly faster method of doing TOLOWER that saves an
instruction.
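As a rough scalar illustration of the branchless TOLOWER idea (a
hypothetical C helper, not the actual assembly sequence):

  static inline unsigned char
  ascii_tolower (unsigned char c)
  {
    /* One unsigned range check replaces the pair c >= 'A' && c <= 'Z';
       adding 0x20 maps [A-Z] to [a-z] and leaves other bytes alone.  */
    return (unsigned char) (c + (((unsigned) (c - 'A') <= 'Z' - 'A') ? 0x20 : 0));
  }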
Also replace the hard coded 5-byte nop with .p2align 4. On builds with
CET enabled, this nop misaligned the entry to strcasecmp.
geometric_mean(N=40) of all benchmarks New / Original: 0.920
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit d154758e61)
Slightly faster method of doing TOLOWER that saves an
instruction.
Also replace the hard coded 5-byte nop with .p2align 4. On builds with
CET enabled, this nop misaligned the entry to strcasecmp.
geometric_mean(N=40) of all benchmarks New / Original: 0.894
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 670b54bc58)
The generic implementation is faster.
geometric_mean(N=20) of all benchmarks New / Original: 0.710
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 9c8a6ad620)
The generic implementation is faster (see strcspn commit).
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6533585352)
The generic implementation is faster.
geometric_mean(N=20) of all benchmarks New / Original: 0.678
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit fe28e7d9d9)
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.
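A minimal intrinsics sketch of the technique (hypothetical helper;
assumes `p` is 16-byte aligned so the aligned loads never cross a
page):

  #include <emmintrin.h>
  #include <stddef.h>

  static size_t
  chunk_strlen (const char *p)
  {
    size_t offset = 0;   /* unsigned, so no sign extension on use */
    const __m128i zero = _mm_setzero_si128 ();
    for (;;)
      {
        __m128i chunk = _mm_load_si128 ((const __m128i *) (p + offset));
        unsigned mask
          = (unsigned) _mm_movemask_epi8 (_mm_cmpeq_epi8 (chunk, zero));
        if (mask != 0)
          return offset + (unsigned) __builtin_ctz (mask);
        offset += 16;
      }
  }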
geometric_mean(N=20) of all benchmarks that don't fall back on
sse2; New / Original: 0.901
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 412d103431)
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.
geometric_mean(N=20) of all benchmarks that don't fall back on
sse2/strlen; New / Original: 0.928
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 30d627d477)
Small code cleanup for size: -81 bytes.
Add comment justifying using a branch to do NULL/non-null return.
All string/memory tests pass and no regressions in benchtests.
geometric_mean(N=20) of all benchmarks New / Original: 0.985
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit ec285ea904)
Small code cleanup for size: -53 bytes.
Add comment justifying using a branch to do NULL/non-null return.
All string/memory tests pass and no regressions in benchtests.
geometric_mean(N=20) of all benchmarks Original / New: 1.00
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit a6fbf4d51e)
commit 3d9f171bfb
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Feb 7 05:55:15 2022 -0800
x86-64: Optimize bzero
added the optimized bzero. Remove the bzero weak alias in SSE2 memset
to avoid an undefined __bzero in memset-sse2-unaligned-erms.
(cherry picked from commit 0fb8800029)
Overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm not
__wcscmp_avx2.
commit ddf0992cf5
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Sun Jan 9 16:02:21 2022 -0600
x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
That commit set the wrong fallback function for `__wcsncmp_avx2_rtm`:
it falls back on `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm`, which
can cause spurious aborts.
This change will need to be backported.
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 9fef7039a7)
Logic can read before the start of `s1` / `s2` if both `s1` and `s2`
are near the start of a page. To avoid having the result contaminated
by these comparisons, the `strcmp` variants would mask off such
comparisons. This masking was missing in the `strncmp` variants,
causing the bug. This commit adds the masking to `strncmp` so that out
of range comparisons don't affect the result.
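A simplified sketch of the masking (hypothetical SSE2 helper for
brevity; the real variants are AVX2/EVEX assembly):

  #include <emmintrin.h>

  /* `s1_aligned` / `s2_aligned` point `offset` bytes before the true
     string starts, after rounding down so the 16-byte load cannot
     cross into an unmapped page. The low `offset` bits of the
     mismatch mask refer to bytes outside the strings, so they are
     shifted out.  */
  static unsigned
  page_cross_neq_mask (const void *s1_aligned, const void *s2_aligned,
                       unsigned offset)
  {
    __m128i a = _mm_load_si128 ((const __m128i *) s1_aligned);
    __m128i b = _mm_load_si128 ((const __m128i *) s2_aligned);
    unsigned neq
      = 0xffffu ^ (unsigned) _mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b));
    return neq >> offset;
  }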
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass, as
does a full xcheck on x86_64 linux.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e108c02a5e)
commit 3d9f171bfb
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Feb 7 05:55:15 2022 -0800
x86-64: Optimize bzero
removed setting the .text section for the code. This commit
adds that back.
(cherry picked from commit 7912236f4a)
memset with zero as the value to set is by far the majority value (99%+
for Python3 and GCC).
bzero can be slightly more optimized for this case by using a zero-idiom
xor for broadcasting the set value to a register (vector or GPR).
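A small intrinsics illustration of the point (hypothetical helpers):

  #include <emmintrin.h>

  static __m128i
  memset_pattern (int c)
  {
    return _mm_set1_epi8 ((char) c);   /* real multi-instruction broadcast */
  }

  static __m128i
  bzero_pattern (void)
  {
    /* Compiles to pxor %xmm0, %xmm0: a dependency-breaking zero idiom
       that needs no actual broadcast work.  */
    return _mm_setzero_si128 ();
  }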
Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 3d9f171bfb)
commit b62ace2740
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Sun Feb 6 00:54:18 2022 -0600
x86: Improve vec generation in memset-vec-unaligned-erms.S
Revert usage of 'pshufb' in broadcast logic as it is an SSSE3
instruction and memset.S is restricted to only SSE2 instructions.
(cherry picked from commit 1b0c60f95b)
No bug.
Split vec generation into multiple steps. This allows the
broadcast in AVX2 to use 'xmm' registers for the L(less_vec)
case. This saves an expensive lane-cross instruction and removes
the need for 'vzeroupper'.
For SSE2 replace 2x 'punpck' instructions with zero-idiom 'pxor' for
byte broadcast.
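The broadcast can be illustrated with intrinsics (hypothetical helper;
this pshufb form is exactly what the revert above removes again from
the SSE2-only memset.S, since pshufb is SSSE3):

  #include <tmmintrin.h>

  /* pxor produces a zeroed shuffle control for free (zero idiom), and
     pshufb then replicates byte 0 of `v` into all 16 lanes.  */
  static __m128i
  broadcast_byte0 (__m128i v)
  {
    return _mm_shuffle_epi8 (v, _mm_setzero_si128 ());
  }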
Results for memset-avx2 small (geomean of N = 20 benchset runs).
size, New Time, Old Time, New / Old
0, 4.100, 3.831, 0.934
1, 5.074, 4.399, 0.867
2, 4.433, 4.411, 0.995
4, 4.487, 4.415, 0.984
8, 4.454, 4.396, 0.987
16, 4.502, 4.443, 0.987
All relevant string/wcsmbs tests are passing.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b62ace2740)
Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.
The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulting check and to improve the
logic for entering the loop afterwards. This only affects particular
cases, however, and is generally made up for by more than 10x
improvements on the transition from the page cross -> loop case.
The non-page cross cases are likewise nearly universally improved.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 8418eb3ff4)
Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.
The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulting check and to improve the
logic for entering the loop afterwards. This only affects particular
cases, however, and is generally made up for by more than 10x
improvements on the transition from the page cross -> loop case.
The non-page cross cases are improved most for smaller sizes [0, 128]
and are about even for (128, 4096]. The loop page cross logic is
improved, so some more significant speedup is seen there as well.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit b77b06e0e2)
No bug.
Optimizations are twofold.
1) Replace page cross and 0/1 checks with masked load instructions in
L(less_vec). In applications this reduces branch-misses in the
hot [0, 32] case.
2) Change control flow so that the L(less_vec) case gets the fall
through.
Change 2) helps copies in the [0, 32] size range but comes at the cost
of copies in the [33, 64] size range. From profiles of GCC and
Python3, 94%+ and 99%+ of calls are in the [0, 32] range, so this
appears to be the right tradeoff.
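A sketch of change 1) using AVX512VL/BW intrinsics (hypothetical
helper; masked-out bytes are fault-suppressed, so no page cross or
zero-length branches are needed):

  #include <immintrin.h>
  #include <stddef.h>

  static void
  copy_below_32 (char *dst, const char *src, size_t n)   /* assumes n < 32 */
  {
    __mmask32 m = (__mmask32) ((1u << n) - 1);   /* low n bits set */
    __m256i v = _mm256_maskz_loadu_epi8 (m, src);
    _mm256_mask_storeu_epi8 (dst, m, v);
  }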
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit abddd61de0)
Must use notl %edi here, as the lower bits are for CHAR comparisons
potentially out of range and thus can be 0 without indicating a
mismatch.
This fixes BZ #28646.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 4df1fa6ddc)
No bug.
This implementation refactors memcmp-sse4.S primarily with minimizing
code size in mind. It does this by removing the lookup table logic and
removing the unrolled check from (256, 512] bytes.
memcmp-sse4 code size reduction : -3487 bytes
wmemcmp-sse4 code size reduction: -1472 bytes
The current memcmp-sse4.S implementation has a large code size
cost. This has serious adverse effects on the ICache / ITLB. While
in micro-benchmarks the implementation appears fast, traces of
real-world code have shown that the speed in micro-benchmarks does not
translate when the ICache/ITLB are not primed, and that the cost
of the code size has measurable negative effects on overall
application performance.
See https://research.google/pubs/pub48320/ for more details.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2f9062d717)
No bug.
The optimizations are as follows:
1) Always align entry to 64 bytes. This makes behavior more
predictable and makes other frontend optimizations easier.
2. Make the L(more_8x_vec) cases 4k aliasing aware (see the sketch
after this list). This can have significant benefits in the case
that:
0 < (dst - src) < [256, 512]
3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%]
improvement and for FSRM [-10%, 25%].
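A hypothetical sketch of what "4k aliasing aware" means (the names and
threshold are illustrative, not the actual implementation):

  #include <stdint.h>

  /* A load whose address matches an older store's address modulo 4096
     can falsely conflict in the store buffer. When dst trails src by
     a small distance mod 4k, each iteration's loads alias the
     previous iteration's stores and stall a forward copy loop.  */
  static int
  may_4k_alias (const void *dst, const void *src)
  {
    uintptr_t delta = ((uintptr_t) dst - (uintptr_t) src) & 0xfff;
    return delta != 0 && delta < 512;   /* threshold is illustrative */
  }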
In addition to these primary changes there is general cleanup
throughout to optimize the aligning routines and control flow logic.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit a6b7502ec0)
Clang cannot assemble movzx in the AT&T dialect mode.
../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction
movzx (%rsi), %ecx
^~~~
Change movzx to movzbl, which follows the AT&T dialect and is used
elsewhere in the file.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6720d36b66)
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
(cherry picked from commit 14dbbf46a0)
In strcmp-evex.S, to compare 2 32-byte strings, replace
VMOVU (%rdi, %rdx), %YMM0
VMOVU (%rsi, %rdx), %YMM1
/* Each bit in K0 represents a mismatch in YMM0 and YMM1. */
VPCMP $4, %YMM0, %YMM1, %k0
VPCMP $0, %YMMZERO, %YMM0, %k1
VPCMP $0, %YMMZERO, %YMM1, %k2
/* Each bit in K1 represents a NULL in YMM0 or YMM1. */
kord %k1, %k2, %k1
/* Each bit in K1 represents a NULL or a mismatch. */
kord %k0, %k1, %k1
kmovd %k1, %ecx
testl %ecx, %ecx
jne L(last_vector)
with
VMOVU (%rdi, %rdx), %YMM0
VPTESTM %YMM0, %YMM0, %k2
/* Each bit cleared in K1 represents a mismatch or a null CHAR
in YMM0 and 32 bytes at (%rsi, %rdx). */
VPCMP $0, (%rsi, %rdx), %YMM0, %k1{%k2}
kmovd %k1, %ecx
incl %ecx
jne L(last_vector)
It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake
and Ice Lake.
Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit c46e9afb2d)
This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'.
It could potentially be dangerous to use SSE2 if this function is ever
called without using 'vzeroupper' beforehand. While compilers appear
to use 'vzeroupper' before function calls if AVX2 has been used, using
SSE2 here is more brittle. Since it is not absolutely necessary it
should be avoided.
It costs 2 extra bytes, but the extra bytes should only eat into
alignment padding.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit bad852b61b)
No bug.
Optimizations are:
1. Change control flow for L(more_2x_vec) to fall through to the loop
and jump for L(less_4x_vec) and L(less_8x_vec). This uses less code
size and saves jumps for lengths > 4x VEC_SIZE.
2. For EVEX/AVX512 move L(less_vec) closer to entry.
3. Avoid complex address mode for length > 2x VEC_SIZE
4. Slightly better aligning code for the loop from the perspective of
code size and uops.
5. Align targets so they make full use of their fetch block and if
possible cache line.
6. Try to reduce the total number of icache lines that will need to
be pulled in for a given length.
7. Include "local" version of stosb target. For AVX2/EVEX/AVX512
jumping to the stosb target in the sse2 code section will almost
certainly be to a new page. The new version does increase code size
marginally by duplicating the target but should get better iTLB
behavior as a result.
test-memset, test-wmemset, and test-bzero are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e59ced2384)
No bug.
The frontend optimizations are to:
1. Reorganize logically connected basic blocks so they are either in
the same cache line or adjacent cache lines.
2. Avoid cases when basic blocks unnecessarily cross cache lines.
3. Try to 32-byte align any basic blocks possible without sacrificing
code size. Smaller / less hot basic blocks are used for this.
Overall code size shrunk by 168 bytes. This should make up for any
extra costs due to aligning to 64 bytes.
In general, performance before deviated a great deal depending on
whether entry alignment % 64 was 0, 16, 32, or 48. These changes
essentially make it so that the current implementation is at least
equal to the best alignment of the original for any arguments.
The only additional optimization is in the page cross case. The
branch on the equals case was removed from the size == [4, 7] case. As
well, the [4, 7] and [2, 3] cases were swapped, as [4, 7] is likely a
hotter argument size.
test-memcmp and test-wmemcmp are both passing.
(cherry picked from commit 1bd8b8d58f)
Optimize loads of all bits set into ZMM register in AVX512 SVML codes
by replacing
vpbroadcastq .L_2il0floatpacket.16(%rip), %zmmX
and
vmovups .L_2il0floatpacket.13(%rip), %zmmX
with
vpternlogd $0xff, %zmmX, %zmmX, %zmmX
This fixes BZ #28252.
(cherry picked from commit 78c9ec9000)
It consolidates the code required to call the la_pltexit audit
callback.
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 8c0664e2b8)
Resolved conflicts:
sysdeps/hppa/dl-runtime.c
The 4af6982e4c fix does not fully handle RTLD_BOOTSTRAP usage on
rtld.c due to two issues:
1. RTLD_BOOTSTRAP is also used on dl-machine.h on various
architectures and it changes the semantics of various machine
relocation functions.
2. The elf_get_dynamic_info() change was done sideways, previously
to 490e6c62aa get-dynamic-info.h was included by the first
dynamic-link.h include *without* RTLD_BOOTSTRAP being defined.
It means that the code within elf_get_dynamic_info() that uses
RTLD_BOOTSTRAP is in fact unused.
To fix 1., this patch now includes dynamic-link.h only once with
RTLD_BOOTSTRAP defined. The ELF_DYNAMIC_RELOCATE call will now have
the relocation functions with the expected semantics for the loader.
And to fix 2., part of 4af6982e4c is reverted (the check argument of
elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP
pieces are removed.
To reorganize the includes, the static TLS definition is moved to
its own header to avoid a circular dependency (it is defined in
dynamic-link.h, and dl-machine.h requires it while other
dynamic-link.h definitions require dl-machine.h definitions).
Also, ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL
are moved to their own header. Only ancient ABIs need special values
(arm, i386, and mips), so a generic one is used as default.
The powerpc Elf64_FuncDesc is also moved to its own header, since
csu code requires its definition (which would otherwise require either
including the elf/ folder or adding a full path with elf/).
Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32,
and powerpc64le.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit d6d89608ac)
Resolved conflicts:
elf/rtld.c
Prior to 490e6c62aa ('elf: Avoid nested functions in the loader
[BZ #27220]'), elf_get_dynamic_info() was defined twice in rtld.c: on
the first dynamic-link.h include and later within _dl_start(). The
former definition did not define DONT_USE_BOOTSTRAP_MAP and is used
in setup_vdso() (since it is a global definition), while the latter
does define DONT_USE_BOOTSTRAP_MAP and is used for loader
self-relocation.
With the commit change, the function is now included and defined once
instead of being defined as a nested function. So rtld.c defines it
without defining RTLD_BOOTSTRAP, and this breaks at least powerpc32.
This patch fixes this by moving the get-dynamic-info.h include out of
dynamic-link.h, so that the caller can correctly set the expected
semantics by defining STATIC_PIE_BOOTSTRAP, RTLD_BOOTSTRAP, and/or
RESOLVE_MAP.
It also required enabling some asserts only for the loader bootstrap
to avoid issues when called from setup_vdso().
As a side note, this is another issue with nested functions: it is
not clear from preprocessed output (-E -dD) how the function will
be built and what its semantics are (since a nested function will be
local and extra C defines may change it).
I checked on x86_64-linux-gnu (w/o --enable-static-pie),
i686-linux-gnu, powerpc64-linux-gnu, powerpc-linux-gnu-power4,
aarch64-linux-gnu, arm-linux-gnu, sparc64-linux-gnu, and
s390x-linux-gnu.
Reviewed-by: Fangrui Song <maskray@google.com>
(cherry picked from commit 4af6982e4c)
Resolved conflicts:
elf/rtld.c
dynamic-link.h is included more than once in some elf/ files (rtld.c,
dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested
functions. This harms readability, and the nested function usage is
the biggest obstacle preventing a Clang build (Clang doesn't support
GCC nested functions).
The key idea for unnesting is to add extra parameters (struct link_map
* and struct r_scope_elem *[]) to RESOLVE_MAP,
ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a],
elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired
by Stan Shebs' ppc64/x86-64 implementation in the
google/grte/v5-2.27/master which uses mixed extra parameters and static
variables.)
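An illustrative reduction of the pattern (names hypothetical, not the
actual glibc signatures):

  struct link_map;
  struct r_scope_elem;

  /* Before (GCC-only): a nested helper captured `map` and `scope`
     from the enclosing function:

       unsigned long resolve (unsigned long sym) { ... uses map, scope ... }

     After: the captured state becomes explicit parameters, making the
     helper an ordinary top-level function that Clang can compile.  */
  static unsigned long
  resolve (struct link_map *map, struct r_scope_elem *scope[],
           unsigned long sym)
  {
    (void) map; (void) scope;   /* ... resolution logic ... */
    return sym;
  }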
Future simplification:
* If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM,
elf_machine_runtime_setup can drop the `scope` parameter.
* If TLSDESC no longer need to be in elf_machine_lazy_rel,
elf_machine_lazy_rel can drop the `scope` parameter.
Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32,
sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64.
In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu
and riscv64-linux-gnu-rv64imafdc-lp64d.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 490e6c62aa)
In the overflow fallback, strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. These have no
checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c627209832)
Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_evex. For x86_64 this covers the entire address range (2^56
wide characters span at least 2^58 bytes, more than any implemented
virtual address space), so any larger length could not possibly be a
valid bound on `s1` or `s2`.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 7e08db3359)
Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_avx2. For x86_64 this covers the entire address range, so any
larger length could not possibly be a valid bound on `s1` or `s2`.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit ddf0992cf5)
Use testl, instead of andl, to check __x86_string_control to avoid
updating __x86_string_control.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 3c8b9879ca)