Commit Graph

14556 Commits

Author SHA1 Message Date
H.J. Lu
c5c666f349 s_sincosf.h: Change pio4 type to float [BZ #28713]
s_cosf.c and s_sinf.c have

  if (abstop12 (y) < abstop12 (pio4))

where abstop12 takes a float argument, but pio4 is static const double.
pio4 is used only in calls to abstop12 and never in arithmetic.  Apply

-static const double pio4 = 0x1.921FB54442D18p-1;
+static const float pio4 = 0x1.921FB6p-1f;

to fix:

FAIL: math/test-float-cos
FAIL: math/test-float-sin
FAIL: math/test-float-sincos
FAIL: math/test-float32-cos
FAIL: math/test-float32-sin
FAIL: math/test-float32-sincos

when compiling with GCC 12.

Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
(cherry picked from commit d3e4f5a101)
2023-01-11 15:29:37 +01:00
H.J. Lu
dd0c72fb46 Regenerate ulps on x86_64 with GCC 12
Fix

FAIL: math/test-float-clog10
FAIL: math/test-float32-clog10

on Intel Core i7-1165G7 with GCC 12.

(cherry picked from commit de8a0897e3)
2023-01-11 14:35:10 +01:00
Fangrui Song
bbe4bbb6e8 elf: Drop elf/tls-macros.h in favor of __thread and tls_model attributes [BZ #28152] [BZ #28205]
elf/tls-macros.h was added for TLS testing when GCC did not support
__thread. __thread and tls_model attributes are mature now and have been
used by many newer tests.

Also delete tst-tls2.c which tests .tls_common (unused by modern GCC and
unsupported by Clang/LLD). .tls_common and .tbss definition are almost
identical after linking, so the runtime test doesn't add additional
coverage.  Assembler and linker tests should be on the binutils side.

When LLD 13.0.0 is allowed in configure.ac
(https://sourceware.org/pipermail/libc-alpha/2021-August/129866.html),
`make check` result is on par with glibc built with GNU ld on aarch64
and x86_64.

As a future clean-up, TLS_GD/TLS_LD/TLS_IE/TLS_IE macros can be removed from
sysdeps/*/tls-macros.h. We can add optional -mtls-dialect={gnu2,trad}
tests to ensure coverage.

Tested on aarch64-linux-gnu, powerpc64le-linux-gnu, and x86_64-linux-gnu.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 33c50ef428)
2023-01-10 19:09:21 +01:00
Alan Modra
d0e2ac0c59 elf/tst-tlsopt-powerpc fails when compiled with -mcpu=power10 (BZ# 29776)
Supports pcrel addressing of TLS GOT entry.  Also tweak the non-pcrel
asm constraint to better reflect how the reg is used.

(cherry picked from commit 94628de778)
2023-01-10 17:33:30 +01:00
Noah Goldstein
e3255e7d21 x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591]
Previous implementation was adjusting length (rsi) to match
bytes (eax), but since there is no bound to length this can cause
overflow.

Fix is to just convert the byte-count (eax) to length by dividing by
sizeof (wchar_t) before the comparison.

Full check passes on x86-64 and build succeeds w/ and w/o multiarch.

(cherry picked from commit b0969fa53a)
2022-11-24 14:42:41 -08:00
Vladislav Khmelevsky
691f70b84a elf: Fix rtld-audit trampoline for aarch64
This patch fixes two problems with audit:

  1. The DL_OFFSET_RV_VPCS offset was mixed up with DL_OFFSET_RG_VPCS,
     resulting in x2 register value nulling in RG structure.

  2. We need to preserve the x8 register before function call, but
     don't have to save it's new value and restore it before return.

Anyway the final restore was using OFFSET_RV instead of OFFSET_RG value
which is wrong (althoug doesn't affect anything).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit eb4181e9f4)
2022-11-22 10:34:14 -03:00
Florian Weimer
d57cdc1b5a Linux: Support __IPC_64 in sysvctl *ctl command arguments (bug 29771)
Old applications pass __IPC_64 as part of the command argument because
old glibc did not check for unknown commands, and passed through the
arguments directly to the kernel, without adding __IPC_64.
Applications need to continue doing that for old glibc compatibility,
so this commit enables this approach in current glibc.

For msgctl and shmctl, if no translation is required, make
direct system calls, as we did before the time64 changes.  If
translation is required, mask __IPC_64 from the command argument.

For semctl, the union-in-vararg argument handling means that
translation is needed on all architectures.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 22a46dee24)
2022-11-11 18:06:25 +01:00
Adhemerval Zanella
ca5df79545 linux: Fix generic struct_stat for 64 bit time (BZ# 29657)
The generic Linux struct_stat misses the conditionals to use
bits/struct_stat_time64_helper.h in the __USE_TIME_BITS64 for
architecture that uses __TIMESIZE == 32 (currently csky and nios2).

Since newer ports should not support 32 bit time_t, the generic
implementation should be used as default.

For arm, hppa, and sh a copy of default struct_stat is added,
while for csky and nios a new one based on generic is used, along
with conditionals to use bits/struct_stat_time64_helper.h.

The default struct_stat is also replaced with the generic one.

Checked on aarch64-linux-gnu and arm-linux-gnueabihf.

(cherry picked from commit 7a6ca82f80)
2022-10-25 16:12:20 -03:00
Aurelien Jarno
e570b865b5 x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementations
The AVX2 strrchr and wcsrchr implementation uses the 'blsmsk'
instruction which belongs to the BMI1 CPU feature and the 'shrx'
instruction, which belongs to the BMI2 CPU feature.

Fixes: df7e295d18 ("x86: Optimize {str|wcs}rchr-avx2")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit 7e8283170c)
2022-10-04 00:00:59 +02:00
Aurelien Jarno
36d6b9be3d x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementation
The AVX2 memrchr implementation uses the 'shlxl' instruction, which
belongs to the BMI2 CPU feature and uses the 'lzcnt' instruction, which
belongs to the LZCNT CPU feature.

Fixes: af5306a735 ("x86: Optimize memrchr-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit 3c0c78afab)
2022-10-04 00:00:59 +02:00
Aurelien Jarno
94b9c1b640 x86-64: Require BMI2 for AVX2 (raw|w)memchr implementations
The AVX2 memchr, rawmemchr and wmemchr implementations use the 'bzhi'
and 'sarx' instructions, which belongs to the BMI2 CPU feature.

Fixes: acfd088a19 ("x86: Optimize memchr-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit e3e7fab7fe)
2022-10-04 00:00:59 +02:00
Aurelien Jarno
67e863742d x86-64: Require BMI2 for AVX2 wcs(n)cmp implementations
The AVX2 wcs(n)cmp implementations use the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
as BSF if the CPU doesn't support TZCNT, and produces the same result
for non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit f31a5a884e)
2022-10-04 00:00:59 +02:00
Aurelien Jarno
b9cbb8dd48 x86-64: Require BMI2 for AVX2 strncmp implementation
The AVX2 strncmp implementations uses the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
as BSF if the CPU doesn't support TZCNT, and produces the same result
for non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit fc7de1d9b9)
2022-10-04 00:00:59 +02:00
Aurelien Jarno
e1561d8cf0 x86-64: Require BMI2 for AVX2 strcmp implementation
The AVX2 strcmp implementation uses the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
as BSF if the CPU doesn't support TZCNT, and produces the same result
for non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit 4d64c64457)
2022-10-04 00:00:59 +02:00
Aurelien Jarno
414fc856ff x86-64: Require BMI2 for AVX2 str(n)casecmp implementations
The AVX2 str(n)casecmp implementations use the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
as BSF if the CPU doesn't support TZCNT, and produces the same result
for non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit 10f79d3670)
2022-10-04 00:00:58 +02:00
Aurelien Jarno
95f5089d4a x86: include BMI1 and BMI2 in x86-64-v3 level
The "System V Application Binary Interface AMD64 Architecture Processor
Supplement" mandates the BMI1 and BMI2 CPU features for the x86-64-v3
level.

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit b80f16adbd)
2022-10-04 00:00:58 +02:00
Wangyang Guo
ea69248445 nptl: Add backoff mechanism to spinlock loop
When mutiple threads waiting for lock at the same time, once lock owner
releases the lock, waiters will see lock available and all try to lock,
which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced. As try-lock
attempt increases, there is more likely that a larger number threads
compete for adaptive mutex lock, so increase wait time in exponential.
A random jitter is also added to avoid synchronous try-lock from other
threads.

v2: Remove read-check before try-lock for performance.

v3:
1. Restore read-check since it works well in some platform.
2. Make backoff arch dependent, and enable it for x86_64.
3. Limit max backoff to reduce latency in large critical section.

v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h

v5: Commit log updated for regression in large critical section.

Result of pthread-mutex-locks bench

Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
First Row: thread number
First Col: critical section length
Values: backoff vs upstream, time based, low is better

non-critical-length: 1
	1	2	4	8	16	32	64	112	140
0	0.99	0.58	0.52	0.49	0.43	0.44	0.46	0.52	0.54
1	0.98	0.43	0.56	0.50	0.44	0.45	0.50	0.56	0.57
2	0.99	0.41	0.57	0.51	0.45	0.47	0.48	0.60	0.61
4	0.99	0.45	0.59	0.53	0.48	0.49	0.52	0.64	0.65
8	1.00	0.66	0.71	0.63	0.56	0.59	0.66	0.72	0.71
16	0.97	0.78	0.91	0.73	0.67	0.70	0.79	0.80	0.80
32	0.95	1.17	0.98	0.87	0.82	0.86	0.89	0.90	0.90
64	0.96	0.95	1.01	1.01	0.98	1.00	1.03	0.99	0.99
128	0.99	1.01	1.01	1.17	1.08	1.12	1.02	0.97	1.02

non-critical-length: 32
	1	2	4	8	16	32	64	112	140
0	1.03	0.97	0.75	0.65	0.58	0.58	0.56	0.70	0.70
1	0.94	0.95	0.76	0.65	0.58	0.58	0.61	0.71	0.72
2	0.97	0.96	0.77	0.66	0.58	0.59	0.62	0.74	0.74
4	0.99	0.96	0.78	0.66	0.60	0.61	0.66	0.76	0.77
8	0.99	0.99	0.84	0.70	0.64	0.66	0.71	0.80	0.80
16	0.98	0.97	0.95	0.76	0.70	0.73	0.81	0.85	0.84
32	1.04	1.12	1.04	0.89	0.82	0.86	0.93	0.91	0.91
64	0.99	1.15	1.07	1.00	0.99	1.01	1.05	0.99	0.99
128	1.00	1.21	1.20	1.22	1.25	1.31	1.12	1.10	0.99

non-critical-length: 128
	1	2	4	8	16	32	64	112	140
0	1.02	1.00	0.99	0.67	0.61	0.61	0.61	0.74	0.73
1	0.95	0.99	1.00	0.68	0.61	0.60	0.60	0.74	0.74
2	1.00	1.04	1.00	0.68	0.59	0.61	0.65	0.76	0.76
4	1.00	0.96	0.98	0.70	0.63	0.63	0.67	0.78	0.77
8	1.01	1.02	0.89	0.73	0.65	0.67	0.71	0.81	0.80
16	0.99	0.96	0.96	0.79	0.71	0.73	0.80	0.84	0.84
32	0.99	0.95	1.05	0.89	0.84	0.85	0.94	0.92	0.91
64	1.00	0.99	1.16	1.04	1.00	1.02	1.06	0.99	0.99
128	1.00	1.06	0.98	1.14	1.39	1.26	1.08	1.02	0.98

There is regression in large critical section. But adaptive mutex is
aimed for "quick" locks. Small critical section is more common when
users choose to use adaptive pthread_mutex.

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 8162147872)
2022-09-28 07:34:53 -07:00
Noah Goldstein
04efdcfac4 sysdeps: Add 'get_fast_jitter' interace in fast-jitter.h
'get_fast_jitter' is meant to be used purely for performance
purposes. In all cases it's used it should be acceptable to get no
randomness (see default case). An example use case is in setting
jitter for retries between threads at a lock. There is a
performance benefit to having jitter, but only if the jitter can
be generated very quickly and ultimately there is no serious issue
if no jitter is generated.

The implementation generally uses 'HP_TIMING_NOW' iff it is
inlined (avoid any potential syscall paths).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 911c63a51c)
2022-09-28 07:34:31 -07:00
Arjun Shankar
68507377f2 socket: Check lengths before advancing pointer in CMSG_NXTHDR
The inline and library functions that the CMSG_NXTHDR macro may expand
to increment the pointer to the header before checking the stride of
the increment against available space.  Since C only allows incrementing
pointers to one past the end of an array, the increment must be done
after a length check.  This commit fixes that and includes a regression
test for CMSG_FIRSTHDR and CMSG_NXTHDR.

The Linux, Hurd, and generic headers are all changed.

Tested on Linux on armv7hl, i686, x86_64, aarch64, ppc64le, and s390x.

[BZ #28846]

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 9c443ac455)
2022-08-22 18:59:26 +02:00
Florian Weimer
1fcc7bfee2 alpha: Fix generic brk system call emulation in __brk_call (bug 29490)
The kernel special-cases the zero argument for alpha brk, and we can
use that to restore the generic Linux error handling behavior.

Fixes commit b57ab258c1 ("Linux:
Introduce __brk_call for invoking the brk system call").

(cherry picked from commit e7ad26ee3c)
2022-08-22 11:12:40 +02:00
Joseph Myers
4ab59ce4e5 Update syscall lists for Linux 5.19
Linux 5.19 has no new syscalls, but enables memfd_secret in the uapi
headers for RISC-V.  Update the version number in syscall-names.list
to reflect that it is still current for 5.19 and regenerate the
arch-syscall.h headers with build-many-glibcs.py update-syscalls.

Tested with build-many-glibcs.py.

(cherry picked from commit fccadcdf5b)
2022-08-08 07:54:02 +02:00
Joseph Myers
b991af5063 Update syscall-names.list for Linux 5.18
Linux 5.18 has no new syscalls.  Update the version number in
syscall-names.list to reflect that it is still current for 5.18.

Tested with build-many-glibcs.py.

(cherry picked from commit 3d9926663c)
2022-07-21 17:06:27 +02:00
Noah Goldstein
ccc54bd61c x86: Add missing IS_IN (libc) check to strncmp-sse4_2.S
Was missing to for the multiarch build rtld-strncmp-sse4_2.os was
being built and exporting symbols:

build/glibc/string/rtld-strncmp-sse4_2.os:
0000000000000000 T __strncmp_sse42

Introduced in:

commit 11ffcacb64
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Jun 21 12:10:50 2017 -0700

    x86-64: Implement strcmp family IFUNC selectors in C

(cherry picked from commit 96ac447d91)
2022-07-18 20:45:21 -07:00
Noah Goldstein
35f9c72c8b x86: Move mem{p}{mov|cpy}_{chk_}erms to its own file
The primary memmove_{impl}_unaligned_erms implementations don't
interact with this function. Putting them in same file both
wastes space and unnecessarily bloats a hot code section.

(cherry picked from commit 21925f6473)
2022-07-18 20:45:21 -07:00
Noah Goldstein
7079931c51 x86: Move and slightly improve memset_erms
Implementation wise:
    1. Remove the VZEROUPPER as memset_{impl}_unaligned_erms does not
       use the L(stosb) label that was previously defined.

    2. Don't give the hotpath (fallthrough) to zero size.

Code positioning wise:

Move memset_{chk}_erms to its own file.  Leaving it in between the
memset_{impl}_unaligned both adds unnecessary complexity to the
file and wastes space in a relatively hot cache section.

(cherry picked from commit 4a3f29e7e4)
2022-07-18 20:45:21 -07:00
Noah Goldstein
f4598f0351 x86: Add definition for __wmemset_chk AVX2 RTM in ifunc impl list
This was simply missing and meant we weren't testing it properly.

(cherry picked from commit 2a1099020c)
2022-07-18 20:45:21 -07:00
Noah Goldstein
aadd0a1c7c x86: Put wcs{n}len-sse4.1 in the sse4.1 text section
Previously was missing but the two implementations shouldn't get in
the sse2 (generic) text section.

(cherry picked from commit afc6e4328f)
2022-07-18 20:45:21 -07:00
Noah Goldstein
d201c59177 x86: Align entry for memrchr to 64-bytes.
The function was tuned around 64-byte entry alignment and performs
better for all sizes with it.

As well different code boths where explicitly written to touch the
minimum number of cache line i.e sizes <= 32 touch only the entry
cache line.

(cherry picked from commit 227afaa672)
2022-07-18 20:45:21 -07:00
Noah Goldstein
c51d8d383c x86: Add BMI1/BMI2 checks for ISA_V3 check
BMI1/BMI2 are part of the ISA V3 requirements:
https://en.wikipedia.org/wiki/X86-64

And defined by GCC when building with `-march=x86-64-v3`

(cherry picked from commit 8da9f346cb)
2022-07-18 20:45:21 -07:00
Noah Goldstein
ba1c3f23d9 x86: Cleanup bounds checking in large memcpy case
1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
   Previously was using `__x86_rep_movsb_threshold` and should
   have been using `__x86_shared_non_temporal_threshold`.

2. Avoid reloading __x86_shared_non_temporal_threshold before
   the L(large_memcpy_4x) bounds check.

3. Document the second bounds check for L(large_memcpy_4x)
   more clearly.

(cherry picked from commit 89a25c6f64)
2022-07-18 20:45:21 -07:00
Noah Goldstein
94b0dc9419 x86: Add bounds x86_non_temporal_threshold
The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.

The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memset_4x) case.

The upper-bound is needed because memmove-vec-unaligned-erms
right-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.

The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.

(cherry picked from commit b446822b6a)
2022-07-18 20:45:21 -07:00
Noah Goldstein
9d50e162ee x86: Add sse42 implementation to strcmp's ifunc
This has been missing since the the ifuncs where added.

The performance of SSE4.2 is preferable to to SSE2.

Measured on Tigerlake with N = 20 runs.
Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906

(cherry picked from commit ff439c4717)
2022-07-18 20:45:21 -07:00
Noah Goldstein
6e008c884d x86: Fix misordered logic for setting rep_movsb_stop_threshold
Move the setting of `rep_movsb_stop_threshold` to after the tunables
have been collected so that the `rep_movsb_stop_threshold` (which
is used to redirect control flow to the non_temporal case) will
use any user value for `non_temporal_threshold` (set using
glibc.cpu.x86_non_temporal_threshold)

(cherry picked from commit 0355915514)
2022-07-18 20:45:21 -07:00
Noah Goldstein
fc54e1fae8 x86: Align varshift table to 32-bytes
This ensures the load will never split a cache line.

(cherry picked from commit 0f91811333)
2022-07-18 20:45:21 -07:00
Noah Goldstein
820504e3ed x86: ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST expect no transactions
Give fall-through path to `vzeroupper` and taken-path to `vzeroall`.

Generally even on machines with RTM the expectation is the
string-library functions will not be called in transactions.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit c28db9cb29)
2022-07-18 20:45:21 -07:00
Noah Goldstein
3c87383a20 x86: Shrink code size of memchr-evex.S
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.

The total code size saving is: 64 bytes

There are no non-negligible changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 1.000

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 56da3fe1dd)
2022-07-18 20:45:21 -07:00
Noah Goldstein
a910d7e164 x86: Shrink code size of memchr-avx2.S
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.

The total code size saving is: 59 bytes

There are no major changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 0.967

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 6dcbb7d95d)

x86: Fix page cross case in rawmemchr-avx2 [BZ #29234]

commit 6dcbb7d95d
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jun 6 21:11:33 2022 -0700

    x86: Shrink code size of memchr-avx2.S

Changed how the page cross case aligned string (rdi) in
rawmemchr. This was incompatible with how
`L(cross_page_continue)` expected the pointer to be aligned and
would cause rawmemchr to read data start started before the
beginning of the string. What it would read was in valid memory
but could count CHAR matches resulting in an incorrect return
value.

This commit fixes that issue by essentially reverting the changes to
the L(page_cross) case as they didn't really matter.

Test cases added and all pass with the new code (and where confirmed
to fail with the old code).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 2c9af8421d)
2022-07-18 20:45:21 -07:00
Noah Goldstein
b05bd59823 x86: Optimize memrchr-avx2.S
The new code:
    1. prioritizes smaller user-arg lengths more.
    2. optimizes target placement more carefully
    3. reuses logic more
    4. fixes up various inefficiencies in the logic. The biggest
       case here is the `lzcnt` logic for checking returns which
       saves either a branch or multiple instructions.

The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760

Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
10-20% regression.

This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 15-45% speedup.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit af5306a735)
2022-07-18 20:45:21 -07:00
Noah Goldstein
83a986e9fb x86: Optimize memrchr-evex.S
The new code:
    1. prioritizes smaller user-arg lengths more.
    2. optimizes target placement more carefully
    3. reuses logic more
    4. fixes up various inefficiencies in the logic. The biggest
       case here is the `lzcnt` logic for checking returns which
       saves either a branch or multiple instructions.

The total code size saving is: 263 bytes
Geometric Mean of all benchmarks New / Old: 0.755

Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
20% regression.

This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 35% speedup.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit b4209615a0)
2022-07-18 20:45:21 -07:00
Noah Goldstein
4901009dad x86: Optimize memrchr-sse2.S
The new code:
    1. prioritizes smaller lengths more.
    2. optimizes target placement more carefully.
    3. reuses logic more.
    4. fixes up various inefficiencies in the logic.

The total code size saving is: 394 bytes
Geometric Mean of all benchmarks New / Old: 0.874

Regressions:
    1. The page cross case is now colder, especially re-entry from the
       page cross case if a match is not found in the first VEC
       (roughly 50%). My general opinion with this patch is this is
       acceptable given the "coldness" of this case (less than 4%) and
       generally performance improvement in the other far more common
       cases.

    2. There are some regressions 5-15% for medium/large user-arg
       lengths that have a match in the first VEC. This is because the
       logic was rewritten to optimize finds in the first VEC if the
       user-arg length is shorter (where we see roughly 20-50%
       performance improvements). It is not always the case this is a
       regression. My intuition is some frontend quirk is partially
       explaining the data although I haven't been able to find the
       root cause.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 731feee386)
2022-07-18 20:45:21 -07:00
Noah Goldstein
e805606193 x86: Add COND_VZEROUPPER that can replace vzeroupper if no ret
The RTM vzeroupper mitigation has no way of replacing inline
vzeroupper not before a return.

This can be useful when hoisting a vzeroupper to save code size
for example:

```
L(foo):
	cmpl	%eax, %edx
	jz	L(bar)
	tzcntl	%eax, %eax
	addq	%rdi, %rax
	VZEROUPPER_RETURN

L(bar):
	xorl	%eax, %eax
	VZEROUPPER_RETURN
```

Can become:

```
L(foo):
	COND_VZEROUPPER
	cmpl	%eax, %edx
	jz	L(bar)
	tzcntl	%eax, %eax
	addq	%rdi, %rax
	ret

L(bar):
	xorl	%eax, %eax
	ret
```

This code does not change any existing functionality.

There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit dd5c483b25)
2022-07-18 20:45:20 -07:00
Noah Goldstein
70be93d1c5 x86: Create header for VEC classes in x86 strings library
This patch does not touch any existing code and is only meant to be a
tool for future patches so that simple source files can more easily be
maintained to target multiple VEC classes.

There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 8a780a6b91)
2022-07-18 20:45:20 -07:00
Raghuveer Devulapalli
82a707aeb7 x86_64: Add strstr function with 512-bit EVEX
Adding a 512-bit EVEX version of strstr. The algorithm works as follows:

(1) We spend a few cycles at the begining to peek into the needle. We
locate an edge in the needle (first occurance of 2 consequent distinct
characters) and also store the first 64-bytes into a zmm register.

(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary which can cause a seg fault.

(3) If an edge is found in the haystack we first compare the first
64-bytes of the needle (already stored in a zmm register) before we
proceed with a full string compare performed byte by byte.

Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)

Geometric mean of all benchmarks: new / old =  0.66

Difficult skiptable(0) : new / old =  0.02
Difficult skiptable(1) : new / old =  0.01
Difficult 2-way : new / old =  0.25
Difficult testing first 2 : new / old =  1.26
Difficult skiptable(0) : new / old =  0.05
Difficult skiptable(1) : new / old =  0.06
Difficult 2-way : new / old =  0.26
Difficult testing first 2 : new / old =  1.05
Difficult skiptable(0) : new / old =  0.42
Difficult skiptable(1) : new / old =  0.24
Difficult 2-way : new / old =  0.21
Difficult testing first 2 : new / old =  1.04
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 5082a287d5)

x86: Remove __mmask intrinsics in strstr-avx512.c

The intrinsics are not available before GCC7 and using standard
operators generates code of equivalent or better quality.

Removed:
    _cvtmask64_u64
    _kshiftri_mask64
    _kand_mask64

Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958

(cherry picked from commit f2698954ff)
2022-07-18 20:45:20 -07:00
H.J. Lu
f6bc52f080 x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT
According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
and R_X86_64_JUMP_SLOT.  Since linkers always set their r_addends to 0, we
can ignore their r_addends.

Reviewed-by: Fangrui Song <maskray@google.com>
(cherry picked from commit f8587a6189)
2022-07-18 20:45:20 -07:00
Sunil K Pandey
8ab861d295 x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen
This patch implements following evex512 version of string functions.
Perf gain for evex512 version is up to 50% as compared to evex,
depending on length and alignment.

Placeholder function, not used by any processor at the moment.

- String length function using 512 bit vectors.
- String N length using 512 bit vectors.
- Wide string length using 512 bit vectors.
- Wide string N length using 512 bit vectors.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 9c66efb86f)
2022-07-18 20:45:20 -07:00
Adhemerval Zanella
eb9aa96fac x86_64: Remove bzero optimization
Both symbols are marked as legacy in POSIX.1-2001 and removed on
POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE
or _DEFAULT_SOURCE.

GCC also replaces bcopy with a memmove and bzero with memset on default
configuration (to actually get a bzero libc call the code requires
to omit string.h inclusion and built with -fno-builtin), so it is
highly unlikely programs are actually calling libc bzero symbol.

On a recent Linux distro (Ubuntu 22.04), there is no bzero calls
by the installed binaries.

  $ cat count_bstring.sh
  #!/bin/bash

  files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
  total=0
  for file in $files; do
    symbols=`objdump -R $file 2>&1`
    if [ $? -eq 0 ]; then
      ncalls=`echo $symbols | grep -w $1 | wc -l`
      ((total=total+ncalls))
      if [ $ncalls -gt 0 ]; then
        echo "$file: $ncalls"
      fi
    fi
  done
  echo "TOTAL=$total"
  $ ./count_bstring.sh bzero
  TOTAL=0

Checked on x86_64-linux-gnu.

(cherry picked from commit 9403b71ae9)
2022-07-18 20:45:20 -07:00
Sunil K Pandey
8d324019e6 x86_64: Remove end of line trailing spaces
This commit remove trailing space introduced by following commit.

commit a775a7a3eb
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Jun 23 01:56:29 2021 -0400

    x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]
2022-07-18 20:45:09 -07:00
Adhemerval Zanella
4b246b2bbd linux: Fix mq_timereceive check for 32 bit fallback code (BZ 29304)
On  success,  mq_receive() and mq_timedreceive() return the number of
bytes in the received message, so it requires to check if the value
is larger than 0.

Checked on i686-linux-gnu.

(cherry picked from commit 71d87d85bf)
2022-06-30 10:46:39 -03:00
Adhemerval Zanella
bb4148283f nios2: Remove _dl_skip_args usage (BZ# 29187)
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0.   So there is no need to adjust the argc or argv.

Checked with qemu-user that arguments are correctly passed on both
constructors and main program.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 4868ba5d25)
2022-06-10 09:15:00 -03:00
Adhemerval Zanella
96944f0f81 hppa: Remove _dl_skip_args usage (BZ# 29165)
Different than other architectures, hppa creates an unrelated stack
frame where ld.so argc/argv adjustments done by ad43cac44a
is not done on the argc/argv saved/restore by _dl_start_user.

Instead load _dl_argc and _dl_argv directlty instead of adjust them
using _dl_skip_args value.

Checked on hppa-linux-gnu.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 6242602273)
2022-06-10 09:13:54 -03:00