Commit Graph

14945 Commits

Author SHA1 Message Date
Noah Goldstein
c72a1a062a x86: Fix missing __wmemcmp def for disable-multiarch build
commit 8804157ad9
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Fri Apr 15 12:27:59 2022 -0500

    x86: Optimize memcmp SSE2 in memcmp.S

Only defined wmemcmp and missed __wmemcmp. This commit fixes that by
defining __wmemcmp and setting wmemcmp as a weak alias to __wmemcmp.

Both multiarch and disable-multiarch builds succeed and full xchecks
pass.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-19 20:18:57 -05:00
Fangrui Song
3e9acce8c5 elf: Remove __libc_init_secure
After 73fc4e28b9,
__libc_enable_secure_decided is always 0 and a statically linked
executable may overwrite __libc_enable_secure without considering
AT_SECURE.

The __libc_enable_secure has been correctly initialized in _dl_aux_init,
so just remove __libc_enable_secure_decided and __libc_init_secure.
This allows us to remove some startup_get*id functions from
22b79ed7f4.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
2022-04-19 15:52:27 -07:00
=Joshua Kinard
78fb888273 mips: Fix mips64n32 64 bit time_t stat support (BZ#29069)
Add missing support initially added by 4e8521333b
(which missed n32 stat).
2022-04-18 10:02:25 -03:00
Noah Goldstein
23102686ec x86: Cleanup page cross code in memcmp-avx2-movbe.S
Old code was both inefficient and wasted code size. New code (-62
bytes) and comparable or better performance in the page cross case.

geometric_mean(N=20) of page cross cases New / Original: 0.960

size, align0, align1, ret, New Time/Old Time
   1,   4095,      0,   0,             1.001
   1,   4095,      0,   1,             0.999
   1,   4095,      0,  -1,               1.0
   2,   4094,      0,   0,               1.0
   2,   4094,      0,   1,               1.0
   2,   4094,      0,  -1,               1.0
   3,   4093,      0,   0,               1.0
   3,   4093,      0,   1,               1.0
   3,   4093,      0,  -1,               1.0
   4,   4092,      0,   0,             0.987
   4,   4092,      0,   1,               1.0
   4,   4092,      0,  -1,               1.0
   5,   4091,      0,   0,             0.984
   5,   4091,      0,   1,             1.002
   5,   4091,      0,  -1,             1.005
   6,   4090,      0,   0,             0.993
   6,   4090,      0,   1,             1.001
   6,   4090,      0,  -1,             1.003
   7,   4089,      0,   0,             0.991
   7,   4089,      0,   1,               1.0
   7,   4089,      0,  -1,             1.001
   8,   4088,      0,   0,             0.875
   8,   4088,      0,   1,             0.881
   8,   4088,      0,  -1,             0.888
   9,   4087,      0,   0,             0.872
   9,   4087,      0,   1,             0.879
   9,   4087,      0,  -1,             0.883
  10,   4086,      0,   0,             0.878
  10,   4086,      0,   1,             0.886
  10,   4086,      0,  -1,             0.873
  11,   4085,      0,   0,             0.878
  11,   4085,      0,   1,             0.881
  11,   4085,      0,  -1,             0.879
  12,   4084,      0,   0,             0.873
  12,   4084,      0,   1,             0.889
  12,   4084,      0,  -1,             0.875
  13,   4083,      0,   0,             0.873
  13,   4083,      0,   1,             0.863
  13,   4083,      0,  -1,             0.863
  14,   4082,      0,   0,             0.838
  14,   4082,      0,   1,             0.869
  14,   4082,      0,  -1,             0.877
  15,   4081,      0,   0,             0.841
  15,   4081,      0,   1,             0.869
  15,   4081,      0,  -1,             0.876
  16,   4080,      0,   0,             0.988
  16,   4080,      0,   1,              0.99
  16,   4080,      0,  -1,             0.989
  17,   4079,      0,   0,             0.978
  17,   4079,      0,   1,             0.981
  17,   4079,      0,  -1,              0.98
  18,   4078,      0,   0,             0.981
  18,   4078,      0,   1,              0.98
  18,   4078,      0,  -1,             0.985
  19,   4077,      0,   0,             0.977
  19,   4077,      0,   1,             0.979
  19,   4077,      0,  -1,             0.986
  20,   4076,      0,   0,             0.977
  20,   4076,      0,   1,             0.986
  20,   4076,      0,  -1,             0.984
  21,   4075,      0,   0,             0.977
  21,   4075,      0,   1,             0.983
  21,   4075,      0,  -1,             0.988
  22,   4074,      0,   0,             0.983
  22,   4074,      0,   1,             0.994
  22,   4074,      0,  -1,             0.993
  23,   4073,      0,   0,              0.98
  23,   4073,      0,   1,             0.992
  23,   4073,      0,  -1,             0.995
  24,   4072,      0,   0,             0.989
  24,   4072,      0,   1,             0.989
  24,   4072,      0,  -1,             0.991
  25,   4071,      0,   0,              0.99
  25,   4071,      0,   1,             0.999
  25,   4071,      0,  -1,             0.996
  26,   4070,      0,   0,             0.993
  26,   4070,      0,   1,             0.995
  26,   4070,      0,  -1,             0.998
  27,   4069,      0,   0,             0.993
  27,   4069,      0,   1,             0.999
  27,   4069,      0,  -1,               1.0
  28,   4068,      0,   0,             0.997
  28,   4068,      0,   1,               1.0
  28,   4068,      0,  -1,             0.999
  29,   4067,      0,   0,             0.996
  29,   4067,      0,   1,             0.999
  29,   4067,      0,  -1,             0.999
  30,   4066,      0,   0,             0.991
  30,   4066,      0,   1,             1.001
  30,   4066,      0,  -1,             0.999
  31,   4065,      0,   0,             0.988
  31,   4065,      0,   1,             0.998
  31,   4065,      0,  -1,             0.998
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-15 13:08:52 -05:00
Noah Goldstein
7cbc03d030 x86: Remove memcmp-sse4.S
Code didn't actually use any sse4 instructions since `ptest` was
removed in:

commit 2f9062d717
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Nov 10 16:18:56 2021 -0600

    x86: Shrink memcmp-sse4.S code size

The new memcmp-sse2 implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions preferring SSE2 for Size = 1 and Size =
65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional. Size == 1 is significantly less hot based on
profiles of GCC11 and Python3 than sizes [4, 8] (which is made
hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in range [65, 96] are now unrolled 2x
and B) because smaller values <= 16 are now given a hotter path. By
contrast the SSE4 version has a branch for Size = 80. The unrolled
version has get better performance for returns which need both
comparisons.

size, align0, align1, ret, New Time/Old Time
 128,      4,      8,   0,             0.858
 128,      4,      8,   1,             0.879
 128,      4,      8,  -1,             0.888

As well, out of microbenchmark environments that are not full
predictable the branch will have a real-cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-15 13:08:42 -05:00
Noah Goldstein
8804157ad9 x86: Optimize memcmp SSE2 in memcmp.S
New code save size (-303 bytes) and has significantly better
performance.

geometric_mean(N=20) of page cross cases New / Original: 0.634
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-15 13:08:35 -05:00
Adhemerval Zanella
592b6d00aa stdio: Split __get_errname definition from errlist.c
The loader does not need to pull all __get_errlist definitions
and its size is decreased:

Before:
$ size elf/ld.so
   text    data     bss     dec     hex filename
 197774   11024     456  209254   33166 elf/ld.so

After:
$ size elf/ld.so
   text    data     bss     dec     hex filename
 191510    9936     456  201902   314ae elf/ld.so

Checked on x86_64-linux-gnu.
2022-04-15 09:37:57 -03:00
Noah Goldstein
26b2478322 x86: Reduce code size of mem{move|pcpy|cpy}-ssse3
The goal is to remove most SSSE3 function as SSE4, AVX2, and EVEX are
generally preferable. memcpy/memmove is one exception where avoiding
unaligned loads with `palignr` is important for some targets.

This commit replaces memmove-ssse3 with a better optimized are lower
code footprint verion. As well it aliases memcpy to memmove.

Aside from this function all other SSSE3 functions should be safe to
remove.

The performance is not changed drastically although shows overall
improvements without any major regressions or gains.

bench-memcpy geometric_mean(N=50) New / Original: 0.957

bench-memcpy-random geometric_mean(N=50) New / Original: 0.912

bench-memcpy-large geometric_mean(N=50) New / Original: 0.892

Benchmarks where run on Zhaoxin KX-6840@2000MHz See attached numbers
for all results.

More important this saves 7246 bytes of code size in memmove an
additional 10741 bytes by reusing memmove code for memcpy (total 17987
bytes saves). As well an additional 896 bytes of rodata for the jump
table entries.
2022-04-14 23:21:42 -05:00
Noah Goldstein
d85916e30a x86: Remove mem{move|cpy}-ssse3-back
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:42 -05:00
Noah Goldstein
41bfe224e5 x86: Remove str{p}{n}cpy-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:42 -05:00
Noah Goldstein
e084ccd37e x86: Remove str{n}cat-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:41 -05:00
Noah Goldstein
c024a0b9cf x86: Remove str{n}{case}cmp-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:41 -05:00
Noah Goldstein
c725f408db x86: Remove {w}memcmp-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:41 -05:00
Adhemerval Zanella
404656009b nptl: Handle spurious EINTR when thread cancellation is disabled (BZ#29029)
Some Linux interfaces never restart after being interrupted by a signal
handler, regardless of the use of SA_RESTART [1].  It means that for
pthread cancellation, if the target thread disables cancellation with
pthread_setcancelstate and calls such interfaces (like poll or select),
it should not see spurious EINTR failures due the internal SIGCANCEL.

However recent changes made pthread_cancel to always sent the internal
signal, regardless of the target thread cancellation status or type.
To fix it, the previous semantic is restored, where the cancel signal
is only sent if the target thread has cancelation enabled in
asynchronous mode.

The cancel state and cancel type is moved back to cancelhandling
and atomic operation are used to synchronize between threads.  The
patch essentially revert the following commits:

  8c1c0aae20 nptl: Move cancel type out of cancelhandling
  2b51742531 nptl: Move cancel state out of cancelhandling
  26cfbb7162 nptl: Remove CANCELING_BITMASK

However I changed the atomic operation to follow the internal C11
semantic and removed the MACRO usage, it simplifies a bit the
resulting code (and removes another usage of the old atomic macros).

Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
and powerpc64-linux-gnu.

[1] https://man7.org/linux/man-pages/man7/signal.7.html

Reviewed-by: Florian Weimer <fweimer@redhat.com>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
2022-04-14 12:48:31 -03:00
Stefan Liebler
2376944b9e S390: Add new s390 platform z16.
The new IBM z16 is added to platform string array.
The macro _DL_PLATFORMS_COUNT is incremented.

_dl_hwcaps_subdir is extended by "z16" if HWCAP_S390_VXRS_PDE2
is set. HWCAP_S390_NNPA is not tested in _dl_hwcaps_subdirs_active
as those instructions may be replaced or removed in future.

tst-glibc-hwcaps.c is extended in order to test z16 via new marker5.

A fatal glibc error is dumped if glibc was build with architecture
level set for z16, but run on an older machine. (See dl-hwcap-check.h)
2022-04-14 10:37:45 +02:00
Noah Goldstein
535e935a28 Replace {u}int_fast{16|32} with {u}int32_t
On 32-bit machines this has no affect. On 64-bit machines
{u}int_fast{16|32} are set as {u}int64_t which is often not
ideal. Particularly x86_64 this change both saves code size and
may save instruction cost.

Full xcheck passes on x86_64.
2022-04-13 21:23:04 -05:00
Samuel Thibault
45a8e05785 hurd: Define ELIBEXEC
So we can implement it in the exec server.
2022-04-12 22:16:40 +02:00
Szabolcs Nagy
707efc2955 Remove _dl_skip_args_internal declaration
It does not seem to be used.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
2022-04-12 14:42:26 +01:00
Tom Coldrick
d2265570a7 manual: Avoid name collision in libm ULP table [BZ #28956]
The 32-bit and 64-bit variants of RISC-V share the same name - "RISC-V"
- when generating the libm error table for the info pages. This
collision, and the way how the table is generated, mean that the values
in the final table for "RISC-V" may be either for the 32- or 64-bit
variant, with no indication as to which.

As an additional side-effect, this makes the build non-reproducible, as
the error table generated is dependent upon the host filesystem
implementation.

To solve this issue, the libm-test-ulps-name files for both variants
have been modified to include their word size, so as to remove the
collision and provide more accurate information in the table.

An alternative proposed was to merge the two variants' ULP values into a
single file, but this would mean that information about error values is
lost, as the two variants are not identical. Some differences are
considerable, notably the values for the exp() function are large.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
2022-04-11 11:46:10 -04:00
Alan Modra
1a85970f41 powerpc: Relocate stinfo->main
start_addresses in sysdeps/powerpc/powerpc64/start.S is historical
baggage that should disappear.  Until someone does that, relocating
stinfo->main by hand is one solution to the fact that the field may be
unrelocated at the time it is accessed.  This is similar to what is
done for dynamic tags via the D_PTR macro.  stinfo->init and
stinfo->fini are zero in both powerpc64/start.S and powerpc32/start.S,
so make it a little more obvious they are unused by passing NULLs to
LIBC_START_MAIN.  The makefile change is needed to pick up
elf/dl-static-tls.h from dl-machine.h.

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2022-04-10 08:37:38 +09:30
Alan Modra
31a9bc8c55 powerpc64: Set up thread register for _dl_relocate_static_pie
libgcc ifunc resolvers that access hwcap via a field in the tcb can't
be called until the thread pointer is set up.  Other ifunc resolvers
might need access to at_platform.  This patch sets up a fake thread
pointer early to a copy of tcbhead_t.  hwcapinfo.c already had local
variables for hwcap and at_platform, replace them with an entire
tcbhead_t.  It's not that large and this way we easily ensure hwcap
and at_platform are at the same relative offsets as they are in the
real thread block.

The patch also conditionally disables part of tst-tlsifunc-static,
"bar address read from IFUNC resolver is incorrect".  We can't get a
proper address for a thread variable before glibc initialises tls.

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2022-04-10 08:33:40 +09:30
Alan Modra
d6efcc118e powerpc64: Use medium model toc accesses throughout
The PowerPC64 linker edits medium model toc-indirect code to toc-pointer
relative:
	addis r9,r2,tc_entry_for_var@toc@ha
	ld r9,tc_entry_for_var@toc@l(r9)
becomes
	addis r9,r2,(var-.TOC.)@ha
	addi r9,r9,(var-.TOC.)@l
when "var" is known to be local to the binary.  This isn't done for
small-model toc-indirect code, because "var" is almost guaranteed to
be too far away from .TOC. for a 16-bit signed offset.  And, because
the analysis of which .toc entry can be removed becomes much more
complicated in objects that mix code models, they aren't removed if
any small-model toc sequence appears in an object file.

Unfortunately, glibc's build of ld.so smashes the needed objects
together in a ld -r linking stage.  This means the GOT/TOC is left
with a whole lot of relative relocations which is untidy, but in
itself is not a serious problem.  However, static-pie on powerpc64
bombs due to a segfault caused by one of the small-model accesses
before _dl_relocate_static_pie.  (The very first one in rcrt1.o
passing start_addresses in r8 to __libc_start_main.)

So this patch makes all the toc/got accesses in assembly medium code
model, and a couple of functions hidden.  By itself this is not
enough to give us working static-pie, but it is useful in isolation to
enable better linker optimisation.

There's a serious problem in libgcc too.  libgcc ifuncs access the
AT_HWCAP words stored in the tcb with an offset from the thread
pointer (r13), but r13 isn't set at the time _dl_relocate_static_pie.
A followup patch will fix that.

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2022-04-10 08:33:06 +09:30
Alan Modra
30afd8c44d linux: Constify rfv variable in dl_vdso_vsym
Compilers may decide to put the rfv variable in .data rather than on
the stack.  It's slightly better to put it in .data.rel.ro.local
instead.  Regardles of that, making it const may enable further
optimisations.  Found when examining relative relocations (GOT ones
in particular) as part of enabling static-pie for PowerPC64.
2022-04-10 08:33:02 +09:30
Adhemerval Zanella
4f2146c4f4 sparc64: Remove fcopysign{f} implementation
The builtin from generic code generates similar compliant sequence.

Checked on sparc64-linux-gnu.
2022-04-07 15:11:56 -03:00
Adhemerval Zanella
0753be0c8a alpha: Remove fcopysign{f} implementation
The generic code already uses builtins.
2022-04-07 14:56:26 -03:00
Adhemerval Zanella
0a4ae090e0 math: Use builtin for ldbl-96 copysign
All architectures that uses it (x86, ia64, m68k) implement the
builtin.

Checked on x86_64-linux-gnu and ia64-linux-gnu.
2022-04-07 14:54:14 -03:00
Adhemerval Zanella
a085346267 ia64: Remove fcopysign{f} implementation
The builtin used by generic code generates similar code.

Checked on ia64-linux-gnu.
2022-04-07 12:27:00 -03:00
Adhemerval Zanella
13d45cf9a7 x86: Remove fcopysign{f} implementation
The builtin used by generic code generates similar code.

Checked on x86_64-linux-gnu and i686-linux-gnu.
2022-04-07 12:17:15 -03:00
Adhemerval Zanella
2a45807e73 powerpc: Remove fcopysign{f} implementation
The builtin and generic implementation from generic files are suffice.

Checked on powerpc64-linux-gnu and powerpc-linux-gnu.
2022-04-07 12:00:16 -03:00
Adhemerval Zanella
cbc2c56bab benchtests: Only build libmvec benchmarks iff $(build-mathvec) is set
Checked on x86_64-linux-gnu.
2022-04-05 12:01:10 -03:00
Adhemerval Zanella
053fe27343 linux: Fix __closefrom_fallback iterates until max int (BZ#28993)
The __closefrom_fallback tries to get a available file descriptor
if the initial open ("/proc/self/fd/", ...) fails.  It assumes the
failure would be only if procfs is not mount (ENOENT), however if
the the proc file is not accessible (due some other kernel filtering
such apparmor) it will iterate over a potentially large file set
issuing close calls.

It should only try the close fallback if open returns EMFILE,
ENFILE, or ENOMEM.

Checked on x86_64-linux-gnu.
2022-04-05 08:08:19 -03:00
Fangrui Song
3ee318c923 Remove -z combreloc and HAVE_Z_COMBRELOC
-z combreloc has been the default regadless of the architecture since
binutils commit f4d733664aabd7bd78c82895e030ec9779a92809 (2002). The
configure check added in commit fdde83499a (2001) has long been
unneeded.

We can therefore treat HAVE_Z_COMBRELOC as always 1 and delete dead code
paths in dl-machine.h files (many were copied from commit a711b01d34
and ee0cb67ec2).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-04-04 17:19:07 -07:00
Adhemerval Zanella
1c225a2dd1 sparc: Remove s_abs implementations
For sparc64 is the same as the generic implementation, while for
sparc32 the builtin generates the same code.

Checked on sparc64-linux-gnu and sparcv9-linux-gnu.
2022-04-04 16:53:24 -03:00
Adhemerval Zanella
caee5be74b ia64: Remove fabs implementations
The generic implementation fixes 5 fabs tests on ia64-linux-gnu:

  math/test-double-fabs
  math/test-float-fabs
  math/test-float32-fabs
  math/test-float32x-fabs
  math/test-float64-fabs

Checked on ia64-linux-gnu.
2022-04-04 16:30:38 -03:00
Adhemerval Zanella
7eed708edf x86: Remove fabs{f} implementation
For x86_64 is the same as the generic implementation, while for i686
the builtin generates the same code.
2022-04-04 16:23:11 -03:00
Adhemerval Zanella
dc2cfd6a87 alpha: Remove s_abs implementations
The generic implementation already uses builtins.
2022-04-04 16:22:15 -03:00
Noah Goldstein
244b415d38 x86: Small improvements for wcslen
Just a few QOL changes.
    1. Prefer `add` > `lea` as it has high execution units it can run
       on.
    2. Don't break macro-fusion between `test` and `jcc`
    3. Reduce code size by removing gratuitous padding bytes (-90
       bytes).

geometric_mean(N=20) of all benchmarks New / Original: 0.959

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-28 15:00:03 -05:00
Noah Goldstein
f5bff979d0 x86: Small improvements for wcscpy-ssse3
Just a few small QOL changes.
    1. Prefer `add` > `lea` as it has high execution units it can run
       on.
    2. Don't break macro-fusion between `test` and `jcc`

geometric_mean(N=20) of all benchmarks New / Original: 0.973

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-28 15:00:03 -05:00
Joseph Myers
866c599182 Add HWCAP2_AFP, HWCAP2_RPRES from Linux 5.17 to AArch64 bits/hwcap.h
Add the new HWCAP2_AFP and HWCAP2_RPRES constants from Linux 5.17.
Tested with build-many-glibcs.py for aarch64-linux-gnu.
2022-03-28 13:16:48 +00:00
Noah Goldstein
305769b2a1 x86: Remove AVX str{n}casecmp
The rational is:

1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
   regression on Tigerlake using SSE42 versus AVX across the
   benchtest suite).
2. AVX2 version covers the majority of targets that previously
   prefered it.
3. The targets where AVX would still be best (SnB and IVB) are
   becoming outdated.

All in all the saving the code size is worth it.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 13:16:51 -05:00
Noah Goldstein
84e7c46df4 x86: Add EVEX optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 13:16:50 -05:00
Noah Goldstein
bbf8122234 x86: Add AVX2 optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 13:16:43 -05:00
Noah Goldstein
d154758e61 x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S
Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .920

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 11:46:13 -05:00
Noah Goldstein
670b54bc58 x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S
Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .894

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 11:46:13 -05:00
Noah Goldstein
9fef7039a7 x86: Fix fallback for wcsncmp_avx2 in strcmp-avx2.S [BZ #28896]
Overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm not
__wcscmp_avx2.

commit ddf0992cf5
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Jan 9 16:02:21 2022 -0600

    x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]

Set the wrong fallback function for `__wcsncmp_avx2_rtm`. It was set
to fallback on to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm` which
can cause spurious aborts.

This change will need to be backported.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 11:46:13 -05:00
Noah Goldstein
9c8a6ad620 x86: Remove strspn-sse2.S and use the generic implementation
The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .710

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 11:46:13 -05:00
Noah Goldstein
6533585352 x86: Remove strpbrk-sse2.S and use the generic implementation
The generic implementation is faster (see strcspn commit).

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 11:46:13 -05:00
Noah Goldstein
fe28e7d9d9 x86: Remove strcspn-sse2.S and use the generic implementation
The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .678

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 11:46:13 -05:00
Noah Goldstein
412d103431 x86: Optimize strspn in strspn-c.c
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.

geometric_mean(N=20) of all benchmarks that dont fallback on
sse2; New / Original: .901

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 11:46:13 -05:00
Noah Goldstein
30d627d477 x86: Optimize strcspn and strpbrk in strcspn-c.c
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.

geometric_mean(N=20) of all benchmarks that dont fallback on
sse2/strlen; New / Original: .928

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 11:46:13 -05:00