Commit Graph

1673 Commits

Author SHA1 Message Date
Fangrui Song
de38b2a343 elf: Remove ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA
If an executable has copy relocations for extern protected data, that
can only work if the library containing the definition is built with
assumptions (a) the compiler emits GOT-generating relocations (b) the
linker produces R_*_GLOB_DAT instead of R_*_RELATIVE.  Otherwise the
library uses its own definition directly and the executable accesses a
stale copy.  Note: the GOT relocations defeat the purpose of protected
visibility as an optimization, but allow rtld to make the executable and
library use the same copy when copy relocations are present, but it
turns out this never worked perfectly.

ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange semantics when both
a.so and b.so define protected var and the executable copy relocates
var: b.so accesses its own copy even with GLOB_DAT.  The behavior change
is from commit 62da1e3b00 (x86) and then
copied to nios2 (ae5eae7cfc) and arc
(0e7d930c4c).

Without ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA, b.so accesses the copy
relocated data like a.so.

There is now a warning for copy relocation on protected symbol since
commit 7374c02b68.  It's extremely
unlikely anyone relies on the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA
behavior, so let's remove it: this removes a check in the symbol lookup
code.
2022-06-15 11:29:55 -07:00
Noah Goldstein
ff439c4717 x86: Add sse42 implementation to strcmp's ifunc
This has been missing since the the ifuncs where added.

The performance of SSE4.2 is preferable to to SSE2.

Measured on Tigerlake with N = 20 runs.
Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
2022-06-14 20:58:09 -07:00
Wilco Dijkstra
fdaf78656f Add bounds check to __libc_ifunc_impl_list
Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC
redundant and fixes several targets that will write outside the array.
To avoid unnecessary large diffs, pass the maximum in the argument 'i' to
IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing
ones can be updated if desired.

Passes buildmanyglibc.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-06-10 17:13:29 +01:00
Noah Goldstein
cffb9414c5 x86: Optimize svml_s_tanhf4_core_sse4.S
Optimizations are:
    1. Reduce code size (-112 bytes).
    2. Remove redundant move instructions.
    3. Slightly improve instruction selection/scheduling where
       possible.
    4. Prefer registers which get short instruction encoding.
    5. Reduce rodata size (-4k+ rodata is shared with avx2).

Result is roughly a 15-16% speedup:

       Function, New Time, Old Time, New / Old
 _ZGVbN4v_tanhf,    3.158,    3.749,     0.842
2022-06-09 12:51:25 -07:00
Noah Goldstein
bcc41f66a4 x86: Optimize svml_s_tanhf8_core_avx2.S
Optimizations are:
    1. Reduce code size (-81 bytes).
    2. Remove redundant move instructions.
    3. Slightly improve instruction selection/scheduling where
       possible.
    4. Prefer registers which get short instruction encoding.
    5. Reduce rodata size (-32 bytes).

Result is roughly a 17-18% speedup:

       Function, New Time, Old Time, New / Old
_ZGVdN8v_tanhf,     1.977,    2.402,     0.823
2022-06-09 12:51:22 -07:00
Noah Goldstein
3a49ce8799 x86: Add data file that can be shared by tanhf-avx2 and tanhf-sse4
tanhf-avx2 and tanhf-sse4 use the same data tables so we can save
over 4kb using a shared datatable. This does increase the memory
footprint of the sse4 version (as now all the targets are 32 bytes
instead of 16), generally it seems worth the code size save.

NB: This patch doesn't do anything itself, it is setup for future
patches.
2022-06-09 12:51:15 -07:00
Noah Goldstein
e560b3c2d2 x86: Optimize svml_s_tanhf16_core_avx512.S
Optimizations are:
    1. Reduce code size (-67 bytes).
    2. Remove redundant move instructions.
    3. Slightly improve instruction selection/scheduling where
       possible.
    4. Reduce rodata usage (-448 bytes).

Result is roughly a 14% speedup:

       Function, New Time, Old Time, New / Old
_ZGVeN16v_tanhf,    0.649,    0.752,     0.863
2022-06-09 12:51:12 -07:00
Noah Goldstein
fe1915d4f6 x86: Improve svml_s_atanhf4_core_sse4.S
Improvements are:
    1. Reduce code size (-62 bytes).
    2. Remove redundant move instructions.
    3. Slightly improve instruction selection/scheduling where
       possible.
    4. Prefer registers which get short instruction encoding.
    5. Reduce rodata usage (-16 bytes).

The throughput improvement is not significant as the port 0 bottleneck
is unavoidable.

       Function, New Time, Old Time, New / Old
_ZGVbN4v_atanhf,    8.821,    8.903,     0.991
2022-06-09 12:51:09 -07:00
Noah Goldstein
65897e9916 x86: Improve svml_s_atanhf8_core_avx2.S
Improvements are:
    1. Reduce code size (-60 bytes).
    2. Remove redundant move instructions.
    3. Slightly improve instruction selection/scheduling where
       possible.
    4. Prefer registers which get short instruction encoding.
    5. Shrink rodata usage (-32 bytes).

The throughput improvement is not that significant (3-5%) as the
port 0 bottleneck is unavoidable.

       Function, New Time, Old Time, New / Old
_ZGVdN8v_atanhf,    2.799,    2.923,     0.958
2022-06-09 12:51:04 -07:00
Noah Goldstein
73bae395cf x86: Improve svml_s_atanhf16_core_avx512.S
Improvements are:
    1. Reduce code size (-64 bytes).
    2. Remove redundant move instructions.
    3. Slightly improve instruction selection/scheduling where
       possible.
    4. Reduce rodata size ([-128, -188] bytes).

The throughput improvement is not significant as the port 0 bottleneck
is unavoidable.

        Function, New Time, Old Time, New / Old
_ZGVeN16v_atanhf,     1.39,    1.408,     0.987
2022-06-09 12:50:58 -07:00
Noah Goldstein
0f91811333 x86: Align varshift table to 32-bytes
This ensures the load will never split a cache line.
2022-06-09 12:50:26 -07:00
Noah Goldstein
4654e7fd5a x86: Add copyright to strpbrk-c.c 2022-06-09 12:50:00 -07:00
Noah Goldstein
2c9af8421d x86: Fix page cross case in rawmemchr-avx2 [BZ #29234]
commit 6dcbb7d95d
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jun 6 21:11:33 2022 -0700

    x86: Shrink code size of memchr-avx2.S

Changed how the page cross case aligned string (rdi) in
rawmemchr. This was incompatible with how
`L(cross_page_continue)` expected the pointer to be aligned and
would cause rawmemchr to read data start started before the
beginning of the string. What it would read was in valid memory
but could count CHAR matches resulting in an incorrect return
value.

This commit fixes that issue by essentially reverting the changes to
the L(page_cross) case as they didn't really matter.

Test cases added and all pass with the new code (and where confirmed
to fail with the old code).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-08 17:07:34 -07:00
Noah Goldstein
c28db9cb29 x86: ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST expect no transactions
Give fall-through path to `vzeroupper` and taken-path to `vzeroall`.

Generally even on machines with RTM the expectation is the
string-library functions will not be called in transactions.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-07 13:10:32 -07:00
Noah Goldstein
56da3fe1dd x86: Shrink code size of memchr-evex.S
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.

The total code size saving is: 64 bytes

There are no non-negligible changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 1.000

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-07 13:10:32 -07:00
Noah Goldstein
6dcbb7d95d x86: Shrink code size of memchr-avx2.S
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.

The total code size saving is: 59 bytes

There are no major changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 0.967

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-07 13:10:31 -07:00
Noah Goldstein
af5306a735 x86: Optimize memrchr-avx2.S
The new code:
    1. prioritizes smaller user-arg lengths more.
    2. optimizes target placement more carefully
    3. reuses logic more
    4. fixes up various inefficiencies in the logic. The biggest
       case here is the `lzcnt` logic for checking returns which
       saves either a branch or multiple instructions.

The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760

Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
10-20% regression.

This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 15-45% speedup.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-07 13:10:27 -07:00
Noah Goldstein
b4209615a0 x86: Optimize memrchr-evex.S
The new code:
    1. prioritizes smaller user-arg lengths more.
    2. optimizes target placement more carefully
    3. reuses logic more
    4. fixes up various inefficiencies in the logic. The biggest
       case here is the `lzcnt` logic for checking returns which
       saves either a branch or multiple instructions.

The total code size saving is: 263 bytes
Geometric Mean of all benchmarks New / Old: 0.755

Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
20% regression.

This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 35% speedup.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-07 13:10:24 -07:00
Noah Goldstein
731feee386 x86: Optimize memrchr-sse2.S
The new code:
    1. prioritizes smaller lengths more.
    2. optimizes target placement more carefully.
    3. reuses logic more.
    4. fixes up various inefficiencies in the logic.

The total code size saving is: 394 bytes
Geometric Mean of all benchmarks New / Old: 0.874

Regressions:
    1. The page cross case is now colder, especially re-entry from the
       page cross case if a match is not found in the first VEC
       (roughly 50%). My general opinion with this patch is this is
       acceptable given the "coldness" of this case (less than 4%) and
       generally performance improvement in the other far more common
       cases.

    2. There are some regressions 5-15% for medium/large user-arg
       lengths that have a match in the first VEC. This is because the
       logic was rewritten to optimize finds in the first VEC if the
       user-arg length is shorter (where we see roughly 20-50%
       performance improvements). It is not always the case this is a
       regression. My intuition is some frontend quirk is partially
       explaining the data although I haven't been able to find the
       root cause.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-07 13:09:36 -07:00
Noah Goldstein
dd5c483b25 x86: Add COND_VZEROUPPER that can replace vzeroupper if no ret
The RTM vzeroupper mitigation has no way of replacing inline
vzeroupper not before a return.

This can be useful when hoisting a vzeroupper to save code size
for example:

```
L(foo):
	cmpl	%eax, %edx
	jz	L(bar)
	tzcntl	%eax, %eax
	addq	%rdi, %rax
	VZEROUPPER_RETURN

L(bar):
	xorl	%eax, %eax
	VZEROUPPER_RETURN
```

Can become:

```
L(foo):
	COND_VZEROUPPER
	cmpl	%eax, %edx
	jz	L(bar)
	tzcntl	%eax, %eax
	addq	%rdi, %rax
	ret

L(bar):
	xorl	%eax, %eax
	ret
```

This code does not change any existing functionality.

There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-07 13:08:28 -07:00
Noah Goldstein
8a780a6b91 x86: Create header for VEC classes in x86 strings library
This patch does not touch any existing code and is only meant to be a
tool for future patches so that simple source files can more easily be
maintained to target multiple VEC classes.

There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-07 13:08:28 -07:00
Raghuveer Devulapalli
5082a287d5 x86_64: Add strstr function with 512-bit EVEX
Adding a 512-bit EVEX version of strstr. The algorithm works as follows:

(1) We spend a few cycles at the begining to peek into the needle. We
locate an edge in the needle (first occurance of 2 consequent distinct
characters) and also store the first 64-bytes into a zmm register.

(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary which can cause a seg fault.

(3) If an edge is found in the haystack we first compare the first
64-bytes of the needle (already stored in a zmm register) before we
proceed with a full string compare performed byte by byte.

Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)

Geometric mean of all benchmarks: new / old =  0.66

Difficult skiptable(0) : new / old =  0.02
Difficult skiptable(1) : new / old =  0.01
Difficult 2-way : new / old =  0.25
Difficult testing first 2 : new / old =  1.26
Difficult skiptable(0) : new / old =  0.05
Difficult skiptable(1) : new / old =  0.06
Difficult 2-way : new / old =  0.26
Difficult testing first 2 : new / old =  1.05
Difficult skiptable(0) : new / old =  0.42
Difficult skiptable(1) : new / old =  0.24
Difficult 2-way : new / old =  0.21
Difficult testing first 2 : new / old =  1.04
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-06 19:46:55 -07:00
Andreas Schwab
dc1e5eeb25 x86_64: Optimize sincos where sin/cos is optimized (bug 29193)
The compiler may substitute calls to sin or cos with calls to sincos, thus
we should have the same optimized implementations for sincos.  The
optimized implementations may produce results that differ, that also makes
sure that the sincos call aggrees with the sin and cos calls.
2022-06-01 10:29:52 +02:00
Adhemerval Zanella
ec7bc492b6 x86_64: Remove _dl_skip_args usage
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0.   So there is no need to adjust the argc or argv.

Checked on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-05-30 16:33:34 -03:00
H.J. Lu
f8587a6189 x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT
According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
and R_X86_64_JUMP_SLOT.  Since linkers always set their r_addends to 0, we
can ignore their r_addends.

Reviewed-by: Fangrui Song <maskray@google.com>
2022-05-26 14:00:25 -07:00
Sunil K Pandey
9c66efb86f x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen
This patch implements following evex512 version of string functions.
Perf gain for evex512 version is up to 50% as compared to evex,
depending on length and alignment.

Placeholder function, not used by any processor at the moment.

- String length function using 512 bit vectors.
- String N length using 512 bit vectors.
- Wide string length using 512 bit vectors.
- Wide string N length using 512 bit vectors.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-05-26 13:11:36 -07:00
Adhemerval Zanella
efeb2bd1ab math: Add math-use-builtins-fabs (BZ#29027)
Both float, double, and _Float128 are assumed to be supported
(float and double already only uses builtins).  Only long double
is parametrized due GCC bug 29253 which prevents its usage on
powerpc.

It allows to remove i686, ia64, x86_64, powerpc, and sparc arch
specific implementation.

On ia64 it also fixes the sNAN handling:

  math/test-float64x-fabs
  math/test-ldouble-fabs

Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
powerpc64-linux-gnu, sparc64-linux-gnu, and ia64-linux-gnu.
2022-05-23 17:49:18 -03:00
Adhemerval Zanella
9403b71ae9 x86_64: Remove bzero optimization
Both symbols are marked as legacy in POSIX.1-2001 and removed on
POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE
or _DEFAULT_SOURCE.

GCC also replaces bcopy with a memmove and bzero with memset on default
configuration (to actually get a bzero libc call the code requires
to omit string.h inclusion and built with -fno-builtin), so it is
highly unlikely programs are actually calling libc bzero symbol.

On a recent Linux distro (Ubuntu 22.04), there is no bzero calls
by the installed binaries.

  $ cat count_bstring.sh
  #!/bin/bash

  files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
  total=0
  for file in $files; do
    symbols=`objdump -R $file 2>&1`
    if [ $? -eq 0 ]; then
      ncalls=`echo $symbols | grep -w $1 | wc -l`
      ((total=total+ncalls))
      if [ $ncalls -gt 0 ]; then
        echo "$file: $ncalls"
      fi
    fi
  done
  echo "TOTAL=$total"
  $ ./count_bstring.sh bzero
  TOTAL=0

Checked on x86_64-linux-gnu.
2022-05-16 09:36:06 -03:00
Wangyang Guo
8162147872 nptl: Add backoff mechanism to spinlock loop
When mutiple threads waiting for lock at the same time, once lock owner
releases the lock, waiters will see lock available and all try to lock,
which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced. As try-lock
attempt increases, there is more likely that a larger number threads
compete for adaptive mutex lock, so increase wait time in exponential.
A random jitter is also added to avoid synchronous try-lock from other
threads.

v2: Remove read-check before try-lock for performance.

v3:
1. Restore read-check since it works well in some platform.
2. Make backoff arch dependent, and enable it for x86_64.
3. Limit max backoff to reduce latency in large critical section.

v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h

v5: Commit log updated for regression in large critical section.

Result of pthread-mutex-locks bench

Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
First Row: thread number
First Col: critical section length
Values: backoff vs upstream, time based, low is better

non-critical-length: 1
	1	2	4	8	16	32	64	112	140
0	0.99	0.58	0.52	0.49	0.43	0.44	0.46	0.52	0.54
1	0.98	0.43	0.56	0.50	0.44	0.45	0.50	0.56	0.57
2	0.99	0.41	0.57	0.51	0.45	0.47	0.48	0.60	0.61
4	0.99	0.45	0.59	0.53	0.48	0.49	0.52	0.64	0.65
8	1.00	0.66	0.71	0.63	0.56	0.59	0.66	0.72	0.71
16	0.97	0.78	0.91	0.73	0.67	0.70	0.79	0.80	0.80
32	0.95	1.17	0.98	0.87	0.82	0.86	0.89	0.90	0.90
64	0.96	0.95	1.01	1.01	0.98	1.00	1.03	0.99	0.99
128	0.99	1.01	1.01	1.17	1.08	1.12	1.02	0.97	1.02

non-critical-length: 32
	1	2	4	8	16	32	64	112	140
0	1.03	0.97	0.75	0.65	0.58	0.58	0.56	0.70	0.70
1	0.94	0.95	0.76	0.65	0.58	0.58	0.61	0.71	0.72
2	0.97	0.96	0.77	0.66	0.58	0.59	0.62	0.74	0.74
4	0.99	0.96	0.78	0.66	0.60	0.61	0.66	0.76	0.77
8	0.99	0.99	0.84	0.70	0.64	0.66	0.71	0.80	0.80
16	0.98	0.97	0.95	0.76	0.70	0.73	0.81	0.85	0.84
32	1.04	1.12	1.04	0.89	0.82	0.86	0.93	0.91	0.91
64	0.99	1.15	1.07	1.00	0.99	1.01	1.05	0.99	0.99
128	1.00	1.21	1.20	1.22	1.25	1.31	1.12	1.10	0.99

non-critical-length: 128
	1	2	4	8	16	32	64	112	140
0	1.02	1.00	0.99	0.67	0.61	0.61	0.61	0.74	0.73
1	0.95	0.99	1.00	0.68	0.61	0.60	0.60	0.74	0.74
2	1.00	1.04	1.00	0.68	0.59	0.61	0.65	0.76	0.76
4	1.00	0.96	0.98	0.70	0.63	0.63	0.67	0.78	0.77
8	1.01	1.02	0.89	0.73	0.65	0.67	0.71	0.81	0.80
16	0.99	0.96	0.96	0.79	0.71	0.73	0.80	0.84	0.84
32	0.99	0.95	1.05	0.89	0.84	0.85	0.94	0.92	0.91
64	1.00	0.99	1.16	1.04	1.00	1.02	1.06	0.99	0.99
128	1.00	1.06	0.98	1.14	1.39	1.26	1.08	1.02	0.98

There is regression in large critical section. But adaptive mutex is
aimed for "quick" locks. Small critical section is more common when
users choose to use adaptive pthread_mutex.

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-05-09 14:38:40 -07:00
Siddhesh Poyarekar
5b5b1012d5 benchtests: Better libmvec integration
Improve libmvec benchmark integration so that in future other
architectures may be able to run their libmvec benchmarks as well.  This
now allows libmvec benchmarks to be run with `make BENCHSET=bench-math`.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2022-04-29 11:48:18 +05:30
Siddhesh Poyarekar
944afe6d95 benchtests: Add UNSUPPORTED benchmark status
The libmvec benchmarks print a message indicating that a certain CPU
feature is unsupported and exit prematurelyi, which breaks the JSON in
bench.out.

Handle this more elegantly in the bench makefile target by adding
support for an UNSUPPORTED exit status (77) so that bench.out continues
to have output for valid tests.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2022-04-29 11:48:16 +05:30
Noah Goldstein
c966099cdc x86: Optimize {str|wcs}rchr-evex
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.755
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-22 23:08:43 -05:00
Noah Goldstein
df7e295d18 x86: Optimize {str|wcs}rchr-avx2
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.832
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-22 23:08:40 -05:00
Noah Goldstein
5307aa9c18 x86: Optimize {str|wcs}rchr-sse2
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.741
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-22 23:08:36 -05:00
H.J. Lu
8ea20ee5f6 x86-64: Fix SSE2 memcmp and SSSE3 memmove for x32
Clear the upper 32 bits in RDX (memory size) for x32 to fix

FAIL: string/tst-size_t-memcmp
FAIL: string/tst-size_t-memcmp-2
FAIL: string/tst-size_t-memcpy
FAIL: wcsmbs/tst-size_t-wmemcmp

on x32 introduced by

8804157ad9 x86: Optimize memcmp SSE2 in memcmp.S
26b2478322 x86: Reduce code size of mem{move|pcpy|cpy}-ssse3

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-04-22 11:23:15 -07:00
Noah Goldstein
c72a1a062a x86: Fix missing __wmemcmp def for disable-multiarch build
commit 8804157ad9
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Fri Apr 15 12:27:59 2022 -0500

    x86: Optimize memcmp SSE2 in memcmp.S

Only defined wmemcmp and missed __wmemcmp. This commit fixes that by
defining __wmemcmp and setting wmemcmp as a weak alias to __wmemcmp.

Both multiarch and disable-multiarch builds succeed and full xchecks
pass.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-19 20:18:57 -05:00
Noah Goldstein
23102686ec x86: Cleanup page cross code in memcmp-avx2-movbe.S
Old code was both inefficient and wasted code size. New code (-62
bytes) and comparable or better performance in the page cross case.

geometric_mean(N=20) of page cross cases New / Original: 0.960

size, align0, align1, ret, New Time/Old Time
   1,   4095,      0,   0,             1.001
   1,   4095,      0,   1,             0.999
   1,   4095,      0,  -1,               1.0
   2,   4094,      0,   0,               1.0
   2,   4094,      0,   1,               1.0
   2,   4094,      0,  -1,               1.0
   3,   4093,      0,   0,               1.0
   3,   4093,      0,   1,               1.0
   3,   4093,      0,  -1,               1.0
   4,   4092,      0,   0,             0.987
   4,   4092,      0,   1,               1.0
   4,   4092,      0,  -1,               1.0
   5,   4091,      0,   0,             0.984
   5,   4091,      0,   1,             1.002
   5,   4091,      0,  -1,             1.005
   6,   4090,      0,   0,             0.993
   6,   4090,      0,   1,             1.001
   6,   4090,      0,  -1,             1.003
   7,   4089,      0,   0,             0.991
   7,   4089,      0,   1,               1.0
   7,   4089,      0,  -1,             1.001
   8,   4088,      0,   0,             0.875
   8,   4088,      0,   1,             0.881
   8,   4088,      0,  -1,             0.888
   9,   4087,      0,   0,             0.872
   9,   4087,      0,   1,             0.879
   9,   4087,      0,  -1,             0.883
  10,   4086,      0,   0,             0.878
  10,   4086,      0,   1,             0.886
  10,   4086,      0,  -1,             0.873
  11,   4085,      0,   0,             0.878
  11,   4085,      0,   1,             0.881
  11,   4085,      0,  -1,             0.879
  12,   4084,      0,   0,             0.873
  12,   4084,      0,   1,             0.889
  12,   4084,      0,  -1,             0.875
  13,   4083,      0,   0,             0.873
  13,   4083,      0,   1,             0.863
  13,   4083,      0,  -1,             0.863
  14,   4082,      0,   0,             0.838
  14,   4082,      0,   1,             0.869
  14,   4082,      0,  -1,             0.877
  15,   4081,      0,   0,             0.841
  15,   4081,      0,   1,             0.869
  15,   4081,      0,  -1,             0.876
  16,   4080,      0,   0,             0.988
  16,   4080,      0,   1,              0.99
  16,   4080,      0,  -1,             0.989
  17,   4079,      0,   0,             0.978
  17,   4079,      0,   1,             0.981
  17,   4079,      0,  -1,              0.98
  18,   4078,      0,   0,             0.981
  18,   4078,      0,   1,              0.98
  18,   4078,      0,  -1,             0.985
  19,   4077,      0,   0,             0.977
  19,   4077,      0,   1,             0.979
  19,   4077,      0,  -1,             0.986
  20,   4076,      0,   0,             0.977
  20,   4076,      0,   1,             0.986
  20,   4076,      0,  -1,             0.984
  21,   4075,      0,   0,             0.977
  21,   4075,      0,   1,             0.983
  21,   4075,      0,  -1,             0.988
  22,   4074,      0,   0,             0.983
  22,   4074,      0,   1,             0.994
  22,   4074,      0,  -1,             0.993
  23,   4073,      0,   0,              0.98
  23,   4073,      0,   1,             0.992
  23,   4073,      0,  -1,             0.995
  24,   4072,      0,   0,             0.989
  24,   4072,      0,   1,             0.989
  24,   4072,      0,  -1,             0.991
  25,   4071,      0,   0,              0.99
  25,   4071,      0,   1,             0.999
  25,   4071,      0,  -1,             0.996
  26,   4070,      0,   0,             0.993
  26,   4070,      0,   1,             0.995
  26,   4070,      0,  -1,             0.998
  27,   4069,      0,   0,             0.993
  27,   4069,      0,   1,             0.999
  27,   4069,      0,  -1,               1.0
  28,   4068,      0,   0,             0.997
  28,   4068,      0,   1,               1.0
  28,   4068,      0,  -1,             0.999
  29,   4067,      0,   0,             0.996
  29,   4067,      0,   1,             0.999
  29,   4067,      0,  -1,             0.999
  30,   4066,      0,   0,             0.991
  30,   4066,      0,   1,             1.001
  30,   4066,      0,  -1,             0.999
  31,   4065,      0,   0,             0.988
  31,   4065,      0,   1,             0.998
  31,   4065,      0,  -1,             0.998
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-15 13:08:52 -05:00
Noah Goldstein
7cbc03d030 x86: Remove memcmp-sse4.S
Code didn't actually use any sse4 instructions since `ptest` was
removed in:

commit 2f9062d717
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Nov 10 16:18:56 2021 -0600

    x86: Shrink memcmp-sse4.S code size

The new memcmp-sse2 implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions preferring SSE2 for Size = 1 and Size =
65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional. Size == 1 is significantly less hot based on
profiles of GCC11 and Python3 than sizes [4, 8] (which is made
hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in range [65, 96] are now unrolled 2x
and B) because smaller values <= 16 are now given a hotter path. By
contrast the SSE4 version has a branch for Size = 80. The unrolled
version has get better performance for returns which need both
comparisons.

size, align0, align1, ret, New Time/Old Time
 128,      4,      8,   0,             0.858
 128,      4,      8,   1,             0.879
 128,      4,      8,  -1,             0.888

As well, out of microbenchmark environments that are not full
predictable the branch will have a real-cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-15 13:08:42 -05:00
Noah Goldstein
8804157ad9 x86: Optimize memcmp SSE2 in memcmp.S
New code save size (-303 bytes) and has significantly better
performance.

geometric_mean(N=20) of page cross cases New / Original: 0.634
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-15 13:08:35 -05:00
Noah Goldstein
26b2478322 x86: Reduce code size of mem{move|pcpy|cpy}-ssse3
The goal is to remove most SSSE3 function as SSE4, AVX2, and EVEX are
generally preferable. memcpy/memmove is one exception where avoiding
unaligned loads with `palignr` is important for some targets.

This commit replaces memmove-ssse3 with a better optimized are lower
code footprint verion. As well it aliases memcpy to memmove.

Aside from this function all other SSSE3 functions should be safe to
remove.

The performance is not changed drastically although shows overall
improvements without any major regressions or gains.

bench-memcpy geometric_mean(N=50) New / Original: 0.957

bench-memcpy-random geometric_mean(N=50) New / Original: 0.912

bench-memcpy-large geometric_mean(N=50) New / Original: 0.892

Benchmarks where run on Zhaoxin KX-6840@2000MHz See attached numbers
for all results.

More important this saves 7246 bytes of code size in memmove an
additional 10741 bytes by reusing memmove code for memcpy (total 17987
bytes saves). As well an additional 896 bytes of rodata for the jump
table entries.
2022-04-14 23:21:42 -05:00
Noah Goldstein
d85916e30a x86: Remove mem{move|cpy}-ssse3-back
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:42 -05:00
Noah Goldstein
41bfe224e5 x86: Remove str{p}{n}cpy-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:42 -05:00
Noah Goldstein
e084ccd37e x86: Remove str{n}cat-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:41 -05:00
Noah Goldstein
c024a0b9cf x86: Remove str{n}{case}cmp-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:41 -05:00
Noah Goldstein
c725f408db x86: Remove {w}memcmp-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:41 -05:00
Adhemerval Zanella
13d45cf9a7 x86: Remove fcopysign{f} implementation
The builtin used by generic code generates similar code.

Checked on x86_64-linux-gnu and i686-linux-gnu.
2022-04-07 12:17:15 -03:00
Adhemerval Zanella
cbc2c56bab benchtests: Only build libmvec benchmarks iff $(build-mathvec) is set
Checked on x86_64-linux-gnu.
2022-04-05 12:01:10 -03:00
Fangrui Song
3ee318c923 Remove -z combreloc and HAVE_Z_COMBRELOC
-z combreloc has been the default regadless of the architecture since
binutils commit f4d733664aabd7bd78c82895e030ec9779a92809 (2002). The
configure check added in commit fdde83499a (2001) has long been
unneeded.

We can therefore treat HAVE_Z_COMBRELOC as always 1 and delete dead code
paths in dl-machine.h files (many were copied from commit a711b01d34
and ee0cb67ec2).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-04-04 17:19:07 -07:00
Adhemerval Zanella
7eed708edf x86: Remove fabs{f} implementation
For x86_64 is the same as the generic implementation, while for i686
the builtin generates the same code.
2022-04-04 16:23:11 -03:00
Noah Goldstein
244b415d38 x86: Small improvements for wcslen
Just a few QOL changes.
    1. Prefer `add` > `lea` as it has high execution units it can run
       on.
    2. Don't break macro-fusion between `test` and `jcc`
    3. Reduce code size by removing gratuitous padding bytes (-90
       bytes).

geometric_mean(N=20) of all benchmarks New / Original: 0.959

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-28 15:00:03 -05:00