Commit Graph

14538 Commits

Author SHA1 Message Date
Florian Weimer
ff900fad89 Linux: Consolidate auxiliary vector parsing (redo)
And optimize it slightly.

This is commit 8c8510ab27 revised.

In _dl_aux_init in elf/dl-support.c, use an explicit loop
and -fno-tree-loop-distribute-patterns to avoid memset.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 73fc4e28b9)
2022-05-19 12:13:51 +02:00
Florian Weimer
28bdb03b1b Linux: Include <dl-auxv.h> in dl-sysdep.c only for SHARED
Otherwise, <dl-auxv.h> on POWER ends up being included twice,
once in dl-sysdep.c, once in dl-support.c.  That leads to a linker
failure due to multiple definitions of _dl_cache_line_size.

Fixes commit d96d2995c1
("Revert "Linux: Consolidate auxiliary vector parsing").

(cherry picked from commit 098c795e85)
2022-05-19 11:49:06 +02:00
Florian Weimer
1cc4ddfeeb Revert "Linux: Consolidate auxiliary vector parsing"
This reverts commit 8c8510ab27.  The
revert is not perfect because the commit included a bug fix for
_dl_sysdep_start with an empty argv, introduced in commit
2d47fa6862 ("Linux: Remove
DL_FIND_ARG_COMPONENTS"), and this bug fix is kept.

The revert is necessary because the reverted commit introduced an
early memset call on aarch64, which leads to crash due to lack of TCB
initialization.

(cherry picked from commit d96d2995c1)
2022-05-19 11:49:06 +02:00
Florian Weimer
4b9cd5465d Linux: Consolidate auxiliary vector parsing
And optimize it slightly.

The large switch statement in _dl_sysdep_start can be replaced with
a large array.  This reduces source code and binary size.  On
i686-linux-gnu:

Before:

   text	   data	    bss	    dec	    hex	filename
   7791	     12	      0	   7803	   1e7b	elf/dl-sysdep.os

After:

   text	   data	    bss	    dec	    hex	filename
   7135	     12	      0	   7147	   1beb	elf/dl-sysdep.os

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 8c8510ab27)
2022-05-19 11:49:06 +02:00
Florian Weimer
08728256fa Linux: Assume that NEED_DL_SYSINFO_DSO is always defined
The definition itself is still needed for generic code.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit f19fc997a5)
2022-05-19 11:49:06 +02:00
Florian Weimer
458733fffe Linux: Remove DL_FIND_ARG_COMPONENTS
The generic definition is always used since the Native Client
port has been removed.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 2d47fa6862)
2022-05-19 11:49:06 +02:00
Florian Weimer
2139b1848e Linux: Remove HAVE_AUX_SECURE, HAVE_AUX_XID, HAVE_AUX_PAGESIZE
They are always defined.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit b9c3d3382f)
2022-05-19 11:49:06 +02:00
Florian Weimer
b0bd6a1323 elf: Merge dl-sysdep.c into the Linux version
The generic version is the de-facto Linux implementation.  It
requires an auxiliary vector, so Hurd does not use it.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 91c0a47ffb)
2022-05-19 11:43:53 +02:00
Noah Goldstein
596c9a32cc x86: Optimize {str|wcs}rchr-evex
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.755
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit c966099cdc)
2022-05-16 18:55:54 -07:00
Noah Goldstein
00f09a14d2 x86: Optimize {str|wcs}rchr-avx2
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.832
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit df7e295d18)
2022-05-16 18:55:45 -07:00
Noah Goldstein
0a11305416 x86: Optimize {str|wcs}rchr-sse2
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.741
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 5307aa9c18)
2022-05-16 18:55:37 -07:00
Noah Goldstein
df5de87260 x86: Cleanup page cross code in memcmp-avx2-movbe.S
Old code was both inefficient and wasted code size. New code (-62
bytes) and comparable or better performance in the page cross case.

geometric_mean(N=20) of page cross cases New / Original: 0.960

size, align0, align1, ret, New Time/Old Time
   1,   4095,      0,   0,             1.001
   1,   4095,      0,   1,             0.999
   1,   4095,      0,  -1,               1.0
   2,   4094,      0,   0,               1.0
   2,   4094,      0,   1,               1.0
   2,   4094,      0,  -1,               1.0
   3,   4093,      0,   0,               1.0
   3,   4093,      0,   1,               1.0
   3,   4093,      0,  -1,               1.0
   4,   4092,      0,   0,             0.987
   4,   4092,      0,   1,               1.0
   4,   4092,      0,  -1,               1.0
   5,   4091,      0,   0,             0.984
   5,   4091,      0,   1,             1.002
   5,   4091,      0,  -1,             1.005
   6,   4090,      0,   0,             0.993
   6,   4090,      0,   1,             1.001
   6,   4090,      0,  -1,             1.003
   7,   4089,      0,   0,             0.991
   7,   4089,      0,   1,               1.0
   7,   4089,      0,  -1,             1.001
   8,   4088,      0,   0,             0.875
   8,   4088,      0,   1,             0.881
   8,   4088,      0,  -1,             0.888
   9,   4087,      0,   0,             0.872
   9,   4087,      0,   1,             0.879
   9,   4087,      0,  -1,             0.883
  10,   4086,      0,   0,             0.878
  10,   4086,      0,   1,             0.886
  10,   4086,      0,  -1,             0.873
  11,   4085,      0,   0,             0.878
  11,   4085,      0,   1,             0.881
  11,   4085,      0,  -1,             0.879
  12,   4084,      0,   0,             0.873
  12,   4084,      0,   1,             0.889
  12,   4084,      0,  -1,             0.875
  13,   4083,      0,   0,             0.873
  13,   4083,      0,   1,             0.863
  13,   4083,      0,  -1,             0.863
  14,   4082,      0,   0,             0.838
  14,   4082,      0,   1,             0.869
  14,   4082,      0,  -1,             0.877
  15,   4081,      0,   0,             0.841
  15,   4081,      0,   1,             0.869
  15,   4081,      0,  -1,             0.876
  16,   4080,      0,   0,             0.988
  16,   4080,      0,   1,              0.99
  16,   4080,      0,  -1,             0.989
  17,   4079,      0,   0,             0.978
  17,   4079,      0,   1,             0.981
  17,   4079,      0,  -1,              0.98
  18,   4078,      0,   0,             0.981
  18,   4078,      0,   1,              0.98
  18,   4078,      0,  -1,             0.985
  19,   4077,      0,   0,             0.977
  19,   4077,      0,   1,             0.979
  19,   4077,      0,  -1,             0.986
  20,   4076,      0,   0,             0.977
  20,   4076,      0,   1,             0.986
  20,   4076,      0,  -1,             0.984
  21,   4075,      0,   0,             0.977
  21,   4075,      0,   1,             0.983
  21,   4075,      0,  -1,             0.988
  22,   4074,      0,   0,             0.983
  22,   4074,      0,   1,             0.994
  22,   4074,      0,  -1,             0.993
  23,   4073,      0,   0,              0.98
  23,   4073,      0,   1,             0.992
  23,   4073,      0,  -1,             0.995
  24,   4072,      0,   0,             0.989
  24,   4072,      0,   1,             0.989
  24,   4072,      0,  -1,             0.991
  25,   4071,      0,   0,              0.99
  25,   4071,      0,   1,             0.999
  25,   4071,      0,  -1,             0.996
  26,   4070,      0,   0,             0.993
  26,   4070,      0,   1,             0.995
  26,   4070,      0,  -1,             0.998
  27,   4069,      0,   0,             0.993
  27,   4069,      0,   1,             0.999
  27,   4069,      0,  -1,               1.0
  28,   4068,      0,   0,             0.997
  28,   4068,      0,   1,               1.0
  28,   4068,      0,  -1,             0.999
  29,   4067,      0,   0,             0.996
  29,   4067,      0,   1,             0.999
  29,   4067,      0,  -1,             0.999
  30,   4066,      0,   0,             0.991
  30,   4066,      0,   1,             1.001
  30,   4066,      0,  -1,             0.999
  31,   4065,      0,   0,             0.988
  31,   4065,      0,   1,             0.998
  31,   4065,      0,  -1,             0.998
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 23102686ec)
2022-05-16 18:55:24 -07:00
Noah Goldstein
ffe75982cc x86: Remove memcmp-sse4.S
Code didn't actually use any sse4 instructions since `ptest` was
removed in:

commit 2f9062d717
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Nov 10 16:18:56 2021 -0600

    x86: Shrink memcmp-sse4.S code size

The new memcmp-sse2 implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions preferring SSE2 for Size = 1 and Size =
65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional. Size == 1 is significantly less hot based on
profiles of GCC11 and Python3 than sizes [4, 8] (which is made
hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in range [65, 96] are now unrolled 2x
and B) because smaller values <= 16 are now given a hotter path. By
contrast the SSE4 version has a branch for Size = 80. The unrolled
version has get better performance for returns which need both
comparisons.

size, align0, align1, ret, New Time/Old Time
 128,      4,      8,   0,             0.858
 128,      4,      8,   1,             0.879
 128,      4,      8,  -1,             0.888

As well, out of microbenchmark environments that are not full
predictable the branch will have a real-cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 7cbc03d030)
2022-05-16 18:55:16 -07:00
Noah Goldstein
4ff6ae069b x86: Small improvements for wcslen
Just a few QOL changes.
    1. Prefer `add` > `lea` as it has high execution units it can run
       on.
    2. Don't break macro-fusion between `test` and `jcc`
    3. Reduce code size by removing gratuitous padding bytes (-90
       bytes).

geometric_mean(N=20) of all benchmarks New / Original: 0.959

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 244b415d38)
2022-05-16 18:55:09 -07:00
Noah Goldstein
80883f4354 x86: Remove AVX str{n}casecmp
The rational is:

1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
   regression on Tigerlake using SSE42 versus AVX across the
   benchtest suite).
2. AVX2 version covers the majority of targets that previously
   prefered it.
3. The targets where AVX would still be best (SnB and IVB) are
   becoming outdated.

All in all the saving the code size is worth it.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 305769b2a1)
2022-05-16 18:55:02 -07:00
Noah Goldstein
b13a2e68eb x86: Add EVEX optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 84e7c46df4)
2022-05-16 18:54:52 -07:00
Noah Goldstein
3051cf3e74 x86: Add AVX2 optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit bbf8122234)
2022-05-16 18:54:41 -07:00
Noah Goldstein
3605c74407 x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S
Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .920

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit d154758e61)
2022-05-16 18:54:27 -07:00
Noah Goldstein
5997011826 x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S
Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .894

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 670b54bc58)
2022-05-16 18:54:17 -07:00
Noah Goldstein
a4b1cae068 x86: Remove strspn-sse2.S and use the generic implementation
The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .710

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 9c8a6ad620)
2022-05-16 18:54:09 -07:00
Noah Goldstein
3811544655 x86: Remove strpbrk-sse2.S and use the generic implementation
The generic implementation is faster (see strcspn commit).

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 6533585352)
2022-05-16 18:53:59 -07:00
Noah Goldstein
0dafa75e3c x86: Remove strcspn-sse2.S and use the generic implementation
The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .678

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit fe28e7d9d9)
2022-05-16 18:53:48 -07:00
Noah Goldstein
0a2da01110 x86: Optimize strspn in strspn-c.c
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.

geometric_mean(N=20) of all benchmarks that dont fallback on
sse2; New / Original: .901

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 412d103431)
2022-05-16 18:53:39 -07:00
Noah Goldstein
0ae1006967 x86: Optimize strcspn and strpbrk in strcspn-c.c
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.

geometric_mean(N=20) of all benchmarks that dont fallback on
sse2/strlen; New / Original: .928

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 30d627d477)
2022-05-16 18:53:28 -07:00
Noah Goldstein
dd6d3a0bbc x86: Code cleanup in strchr-evex and comment justifying branch
Small code cleanup for size: -81 bytes.

Add comment justifying using a branch to do NULL/non-null return.

All string/memory tests pass and no regressions in benchtests.

geometric_mean(N=20) of all benchmarks New / Original: .985
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit ec285ea904)
2022-05-16 18:53:19 -07:00
Noah Goldstein
3c55c20756 x86: Code cleanup in strchr-avx2 and comment justifying branch
Small code cleanup for size: -53 bytes.

Add comment justifying using a branch to do NULL/non-null return.

All string/memory tests pass and no regressions in benchtests.

geometric_mean(N=20) of all benchmarks Original / New: 1.00
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit a6fbf4d51e)
2022-05-16 18:53:07 -07:00
Adhemerval Zanella
dd457606ca x86_64: Remove bcopy optimizations
The symbols is not present in current POSIX specification and compiler
already generates memmove call.

(cherry picked from commit bf92893a14)
2022-05-16 18:52:57 -07:00
H.J. Lu
37f373e334 x86-64: Remove bzero weak alias in SS2 memset
commit 3d9f171bfb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Feb 7 05:55:15 2022 -0800

    x86-64: Optimize bzero

added the optimized bzero.  Remove bzero weak alias in SS2 memset to
avoid undefined __bzero in memset-sse2-unaligned-erms.

(cherry picked from commit 0fb8800029)
2022-05-16 18:52:47 -07:00
H.J. Lu
6cba46c858 x86_64/multiarch: Sort sysdep_routines and put one entry per line
(cherry picked from commit c328d0152d)
2022-05-16 18:52:35 -07:00
H.J. Lu
8de6e4a199 x86: Improve L to support L(XXX_SYMBOL (YYY, ZZZ))
(cherry picked from commit 1283948f23)
2022-05-16 18:52:19 -07:00
Noah Goldstein
e123f08ad5 x86: Fix fallback for wcsncmp_avx2 in strcmp-avx2.S [BZ #28896]
Overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm not
__wcscmp_avx2.

commit ddf0992cf5
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Jan 9 16:02:21 2022 -0600

    x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]

Set the wrong fallback function for `__wcsncmp_avx2_rtm`. It was set
to fallback on to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm` which
can cause spurious aborts.

This change will need to be backported.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 9fef7039a7)
2022-05-05 09:13:13 -07:00
Noah Goldstein
5373c90f2e x86: Fix bug in strncmp-evex and strncmp-avx2 [BZ #28895]
Logic can read before the start of `s1` / `s2` if both `s1` and `s2`
are near the start of a page. To avoid having the result contimated by
these comparisons the `strcmp` variants would mask off these
comparisons. This was missing in the `strncmp` variants causing
the bug. This commit adds the masking to `strncmp` so that out of
range comparisons don't affect the result.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass as
well a full xcheck on x86_64 linux.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit e108c02a5e)
2022-05-05 09:11:49 -07:00
Noah Goldstein
70509f9b48 x86: Set .text section in memset-vec-unaligned-erms
commit 3d9f171bfb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Feb 7 05:55:15 2022 -0800

    x86-64: Optimize bzero

Remove setting the .text section for the code. This commit
adds that back.

(cherry picked from commit 7912236f4a)
2022-05-05 09:11:13 -07:00
H.J. Lu
5cb6329652 x86-64: Optimize bzero
memset with zero as the value to set is by far the majority value (99%+
for Python3 and GCC).

bzero can be slightly more optimized for this case by using a zero-idiom
xor for broadcasting the set value to a register (vector or GPR).

Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 3d9f171bfb)
2022-05-05 09:10:53 -07:00
Noah Goldstein
190ea5f7e4 x86: Remove SSSE3 instruction for broadcast in memset.S (SSE2 Only)
commit b62ace2740
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Feb 6 00:54:18 2022 -0600

    x86: Improve vec generation in memset-vec-unaligned-erms.S

Revert usage of 'pshufb' in broadcast logic as it is an SSSE3
instruction and memset.S is restricted to only SSE2 instructions.

(cherry picked from commit 1b0c60f95b)
2022-05-05 08:54:23 -07:00
Noah Goldstein
ea19c490a3 x86: Improve vec generation in memset-vec-unaligned-erms.S
No bug.

Split vec generation into multiple steps. This allows the
broadcast in AVX2 to use 'xmm' registers for the L(less_vec)
case. This saves an expensive lane-cross instruction and removes
the need for 'vzeroupper'.

For SSE2 replace 2x 'punpck' instructions with zero-idiom 'pxor' for
byte broadcast.

Results for memset-avx2 small (geomean of N = 20 benchset runs).

size, New Time, Old Time, New / Old
   0,    4.100,    3.831,     0.934
   1,    5.074,    4.399,     0.867
   2,    4.433,    4.411,     0.995
   4,    4.487,    4.415,     0.984
   8,    4.454,    4.396,     0.987
  16,    4.502,    4.443,     0.987

All relevant string/wcsmbs tests are passing.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit b62ace2740)
2022-05-05 08:54:11 -07:00
H.J. Lu
53ddafe917 x86-64: Fix strcmp-evex.S
Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit 8418eb3ff4
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jan 10 15:35:39 2022 -0600

    x86: Optimize strcmp-evex.S

(cherry picked from commit 0e0199a9e0)
2022-05-05 08:54:03 -07:00
H.J. Lu
d299032743 x86-64: Fix strcmp-avx2.S
Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit b77b06e0e2
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jan 10 15:35:38 2022 -0600

    x86: Optimize strcmp-avx2.S

(cherry picked from commit c15efd011c)
2022-05-05 08:53:50 -07:00
Noah Goldstein
c41a66767d x86: Optimize strcmp-evex.S
Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.

The non-page cross cases as well are nearly universally improved.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 8418eb3ff4)
2022-05-05 08:53:42 -07:00
Noah Goldstein
0d5b36c8cc x86: Optimize strcmp-avx2.S
Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.

The non-page cross cases are improved most for smaller sizes [0, 128]
and go about even for (128, 4096]. The loop page cross logic is
improved so some more significant speedup is seen there as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit b77b06e0e2)
2022-05-05 08:53:34 -07:00
Joseph Myers
97cb8227b8 Add HWCAP2_AFP, HWCAP2_RPRES from Linux 5.17 to AArch64 bits/hwcap.h
Add the new HWCAP2_AFP and HWCAP2_RPRES constants from Linux 5.17.
Tested with build-many-glibcs.py for aarch64-linux-gnu.

(cherry picked from commit 866c599182)
2022-05-03 11:08:52 +02:00
Szabolcs Nagy
c108e87026 aarch64: Add HWCAP2_ECV from Linux 5.16
Indicates the availability of enhanced counter virtualization extension
of armv8.6-a with self-synchronized virtual counter CNTVCTSS_EL0 usable
in userspace.

(cherry picked from commit 5a1be8ebdf)
2022-05-03 11:08:52 +02:00
Joseph Myers
f858bc3093 Add SOL_MPTCP, SOL_MCTP from Linux 5.16 to bits/socket.h
Linux 5.16 adds constants SOL_MPTCP and SOL_MCTP to the getsockopt /
setsockopt levels; add these constants to bits/socket.h.

Tested for x86_64.

(cherry picked from commit fdc1ae67fe)
2022-05-03 11:08:52 +02:00
Joseph Myers
0499c3a95f Update kernel version to 5.17 in tst-mman-consts.py
This patch updates the kernel version in the test tst-mman-consts.py
to 5.17.  (There are no new MAP_* constants covered by this test in
5.17 that need any other header changes.)

Tested with build-many-glibcs.py.

(cherry picked from commit 23808a422e)
2022-05-03 11:08:52 +02:00
Joseph Myers
81181ba5d9 Update kernel version to 5.16 in tst-mman-consts.py
This patch updates the kernel version in the test tst-mman-consts.py
to 5.16.  (There are no new MAP_* constants covered by this test in
5.16 that need any other header changes.)

Tested with build-many-glibcs.py.

(cherry picked from commit 790a607e23)
2022-05-03 11:08:52 +02:00
Joseph Myers
6af165658d Update syscall lists for Linux 5.17
Linux 5.17 has one new syscall, set_mempolicy_home_node.  Update
syscall-names.list and regenerate the arch-syscall.h headers with
build-many-glibcs.py update-syscalls.

Tested with build-many-glibcs.py.

(cherry picked from commit 8ef9196b26)
2022-05-03 11:08:52 +02:00
Joseph Myers
5146b73d72 Add ARPHRD_CAN, ARPHRD_MCTP to net/if_arp.h
Add the constant ARPHRD_MCTP, from Linux 5.15, to net/if_arp.h, along
with ARPHRD_CAN which was added to Linux in version 2.6.25 (commit
cd05acfe65ed2cf2db683fa9a6adb8d35635263b, "[CAN]: Allocate protocol
numbers for PF_CAN") but apparently missed for glibc at the time.

Tested for x86_64.

(cherry picked from commit a94d9659cd)
2022-05-03 11:07:10 +02:00
Joseph Myers
fd5dbfd1cd Update kernel version to 5.15 in tst-mman-consts.py
This patch updates the kernel version in the test tst-mman-consts.py
to 5.15.  (There are no new MAP_* constants covered by this test in
5.15 that need any other header changes.)

Tested with build-many-glibcs.py.

(cherry picked from commit 5c3ece451d)
2022-05-03 11:07:07 +02:00
Joseph Myers
bc6fba3c80 Add PF_MCTP, AF_MCTP from Linux 5.15 to bits/socket.h
Linux 5.15 adds a new address / protocol family PF_MCTP / AF_MCTP; add
these constants to bits/socket.h.

Tested for x86_64.

(cherry picked from commit bdeb7a8fa9)
2022-05-03 11:07:03 +02:00
DJ Delorie
c66c92181d posix/glob.c: update from gnulib
Copied from gnulib/lib/glob.c in order to fix rhbz 1982608
Also fixes swbz 25659

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 7c477b57a3)
2022-04-28 11:57:23 -04:00
Adhemerval Zanella
88a8637cb4 linux: Fix fchmodat with AT_SYMLINK_NOFOLLOW for 64 bit time_t (BZ#29097)
The AT_SYMLINK_NOFOLLOW emulation ues the default 32 bit stat internal
calls, which fails with EOVERFLOW if the file constains timestamps
beyond 2038.

Checked on i686-linux-gnu.

(cherry picked from commit 118a2aee07)
2022-04-28 10:10:30 -03:00
Carlos O'Donell
55640ed3fd i386: Regenerate ulps
These failures were caught while building glibc master for Fedora
Rawhide which is built with '-mtune=generic -msse2 -mfpmath=sse'
using gcc 11.3 (gcc-11.3.1-2.fc35) on a Cascadelake Intel Xeon
processor.

(cherry picked from commit e465d97653)
2022-04-27 21:20:43 -04:00
Adhemerval Zanella
9681691402 linux: Fix missing internal 64 bit time_t stat usage
These are two missing spots initially done by 52a5fe70a2.

Checked on i686-linux-gnu.

(cherry picked from commit 834ddd0432)
2022-04-27 14:52:26 -03:00
Noah Goldstein
c796418d00 x86: Optimize L(less_vec) case in memcmp-evex-movbe.S
No bug.
Optimizations are twofold.

1) Replace page cross and 0/1 checks with masked load instructions in
   L(less_vec). In applications this reduces branch-misses in the
   hot [0, 32] case.
2) Change controlflow so that L(less_vec) case gets the fall through.

Change 2) helps copies in the [0, 32] size range but comes at the cost
of copies in the [33, 64] size range.  From profiles of GCC and
Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this
appears to the the right tradeoff.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit abddd61de0)
2022-04-26 18:18:16 -07:00
H.J. Lu
f3a99b2216 x86: Don't set Prefer_No_AVX512 for processors with AVX512 and AVX-VNNI
Don't set Prefer_No_AVX512 on processors with AVX512 and AVX-VNNI since
they won't lower CPU frequency when ZMM load and store instructions are
used.

(cherry picked from commit ceeffe968c)
2022-04-26 18:18:16 -07:00
Noah Goldstein
4bbd0f866a x86-64: Use notl in EVEX strcmp [BZ #28646]
Must use notl %edi here as lower bits are for CHAR comparisons
potentially out of range thus can be 0 without indicating mismatch.
This fixes BZ #28646.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 4df1fa6ddc)
2022-04-26 18:18:16 -07:00
Noah Goldstein
7cb126e7e7 x86: Shrink memcmp-sse4.S code size
No bug.

This implementation refactors memcmp-sse4.S primarily with minimizing
code size in mind. It does this by removing the lookup table logic and
removing the unrolled check from (256, 512] bytes.

memcmp-sse4 code size reduction : -3487 bytes
wmemcmp-sse4 code size reduction: -1472 bytes

The current memcmp-sse4.S implementation has a large code size
cost. This has serious adverse affects on the ICache / ITLB. While
in micro-benchmarks the implementations appears fast, traces of
real-world code have shown that the speed in micro benchmarks does not
translate when the ICache/ITLB are not primed, and that the cost
of the code size has measurable negative affects on overall
application performance.

See https://research.google/pubs/pub48320/ for more details.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2f9062d717)
2022-04-26 18:18:16 -07:00
Noah Goldstein
cecbac5212 x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h
No bug.

This patch doubles the rep_movsb_threshold when using ERMS. Based on
benchmarks the vector copy loop, especially now that it handles 4k
aliasing, is better for these medium ranged.

On Skylake with ERMS:

Size,   Align1, Align2, dst>src,(rep movsb) / (vec copy)
4096,   0,      0,      0,      0.975
4096,   0,      0,      1,      0.953
4096,   12,     0,      0,      0.969
4096,   12,     0,      1,      0.872
4096,   44,     0,      0,      0.979
4096,   44,     0,      1,      0.83
4096,   0,      12,     0,      1.006
4096,   0,      12,     1,      0.989
4096,   0,      44,     0,      0.739
4096,   0,      44,     1,      0.942
4096,   12,     12,     0,      1.009
4096,   12,     12,     1,      0.973
4096,   44,     44,     0,      0.791
4096,   44,     44,     1,      0.961
4096,   2048,   0,      0,      0.978
4096,   2048,   0,      1,      0.951
4096,   2060,   0,      0,      0.986
4096,   2060,   0,      1,      0.963
4096,   2048,   12,     0,      0.971
4096,   2048,   12,     1,      0.941
4096,   2060,   12,     0,      0.977
4096,   2060,   12,     1,      0.949
8192,   0,      0,      0,      0.85
8192,   0,      0,      1,      0.845
8192,   13,     0,      0,      0.937
8192,   13,     0,      1,      0.939
8192,   45,     0,      0,      0.932
8192,   45,     0,      1,      0.927
8192,   0,      13,     0,      0.621
8192,   0,      13,     1,      0.62
8192,   0,      45,     0,      0.53
8192,   0,      45,     1,      0.516
8192,   13,     13,     0,      0.664
8192,   13,     13,     1,      0.659
8192,   45,     45,     0,      0.593
8192,   45,     45,     1,      0.575
8192,   2048,   0,      0,      0.854
8192,   2048,   0,      1,      0.834
8192,   2061,   0,      0,      0.863
8192,   2061,   0,      1,      0.857
8192,   2048,   13,     0,      0.63
8192,   2048,   13,     1,      0.629
8192,   2061,   13,     0,      0.627
8192,   2061,   13,     1,      0.62

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 475b63702e)
2022-04-26 18:18:16 -07:00
Noah Goldstein
a7392db2ff x86: Optimize memmove-vec-unaligned-erms.S
No bug.

The optimizations are as follows:

1) Always align entry to 64 bytes. This makes behavior more
   predictable and makes other frontend optimizations easier.

2) Make the L(more_8x_vec) cases 4k aliasing aware. This can have
   significant benefits in the case that:
        0 < (dst - src) < [256, 512]

3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%]
   improvement and for FSRM [-10%, 25%].

In addition to these primary changes there is general cleanup
throughout to optimize the aligning routines and control flow logic.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit a6b7502ec0)
2022-04-26 18:18:16 -07:00
Fangrui Song
2e64237a87 x86-64: Replace movzx with movzbl
Clang cannot assemble movzx in the AT&T dialect mode.

../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction
 movzx (%rsi), %ecx
               ^~~~

Change movzx to movzbl, which follows the AT&T dialect and is used
elsewhere in the file.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6720d36b66)
2022-04-26 18:18:16 -07:00
H.J. Lu
a182bb7a39 x86-64: Remove Prefer_AVX2_STRCMP
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp.  When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL.  EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.

(cherry picked from commit 14dbbf46a0)
2022-04-26 18:18:16 -07:00
H.J. Lu
f35ad30da4 x86-64: Improve EVEX strcmp with masked load
In strcmp-evex.S, to compare 2 32-byte strings, replace

        VMOVU   (%rdi, %rdx), %YMM0
        VMOVU   (%rsi, %rdx), %YMM1
        /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
        VPCMP   $4, %YMM0, %YMM1, %k0
        VPCMP   $0, %YMMZERO, %YMM0, %k1
        VPCMP   $0, %YMMZERO, %YMM1, %k2
        /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
        kord    %k1, %k2, %k1
        /* Each bit in K1 represents a NULL or a mismatch.  */
        kord    %k0, %k1, %k1
        kmovd   %k1, %ecx
        testl   %ecx, %ecx
        jne     L(last_vector)

with

        VMOVU   (%rdi, %rdx), %YMM0
        VPTESTM %YMM0, %YMM0, %k2
        /* Each bit cleared in K1 represents a mismatch or a null CHAR
           in YMM0 and 32 bytes at (%rsi, %rdx).  */
        VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
        kmovd   %k1, %ecx
        incl    %ecx
        jne     L(last_vector)

It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake
and Ice Lake.

Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit c46e9afb2d)
2022-04-26 18:18:16 -07:00
Noah Goldstein
baf3ece634 x86: Replace sse2 instructions with avx in memcmp-evex-movbe.S
This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'.

it could potentially be dangerous to use SSE2 if this function is ever
called without using 'vzeroupper' beforehand. While compilers appear
to use 'vzeroupper' before function calls if AVX2 has been used, using
SSE2 here is more brittle. Since it is not absolutely necessary it
should be avoided.

It costs 2-extra bytes but the extra bytes should only eat into
alignment padding.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit bad852b61b)
2022-04-26 18:18:16 -07:00
Noah Goldstein
6d18a93dbb x86: Optimize memset-vec-unaligned-erms.S
No bug.

Optimization are

1. change control flow for L(more_2x_vec) to fall through to loop and
   jump for L(less_4x_vec) and L(less_8x_vec). This uses less code
   size and saves jumps for length > 4x VEC_SIZE.

2. For EVEX/AVX512 move L(less_vec) closer to entry.

3. Avoid complex address mode for length > 2x VEC_SIZE

4. Slightly better aligning code for the loop from the perspective of
   code size and uops.

5. Align targets so they make full use of their fetch block and if
   possible cache line.

6. Try and reduce total number of icache lines that will need to be
   pulled in for a given length.

7. Include "local" version of stosb target. For AVX2/EVEX/AVX512
   jumping to the stosb target in the sse2 code section will almost
   certainly be to a new page. The new version does increase code size
   marginally by duplicating the target but should get better iTLB
   behavior as a result.

test-memset, test-wmemset, and test-bzero are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e59ced2384)
2022-04-26 18:18:16 -07:00
Noah Goldstein
5ec3416853 x86: Optimize memcmp-evex-movbe.S for frontend behavior and size
No bug.

The frontend optimizations are to:
1. Reorganize logically connected basic blocks so they are either in
   the same cache line or adjacent cache lines.
2. Avoid cases when basic blocks unnecissarily cross cache lines.
3. Try and 32 byte align any basic blocks possible without sacrificing
   code size. Smaller / Less hot basic blocks are used for this.

Overall code size shrunk by 168 bytes. This should make up for any
extra costs due to aligning to 64 bytes.

In general performance before deviated a great deal dependending on
whether entry alignment % 64 was 0, 16, 32, or 48. These changes
essentially make it so that the current implementation is at least
equal to the best alignment of the original for any arguments.

The only additional optimization is in the page cross case. Branch on
equals case was removed from the size == [4, 7] case. As well the [4,
7] and [2, 3] case where swapped as [4, 7] is likely a more hot
argument size.

test-memcmp and test-wmemcmp are both passing.

(cherry picked from commit 1bd8b8d58f)
2022-04-26 18:18:16 -07:00
Noah Goldstein
b5a44a6a47 x86: Modify ENTRY in sysdep.h so that p2align can be specified
No bug.

This change adds a new macro ENTRY_P2ALIGN which takes a second
argument, log2 of the desired function alignment.

The old ENTRY(name) macro is just ENTRY_P2ALIGN(name, 4) so this
doesn't affect any existing functionality.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit fc5bd179ef)
2022-04-26 18:18:16 -07:00
H.J. Lu
16245986fb x86-64: Optimize load of all bits set into ZMM register [BZ #28252]
Optimize loads of all bits set into ZMM register in AVX512 SVML codes
by replacing

	vpbroadcastq .L_2il0floatpacket.16(%rip), %zmmX

and

	vmovups   .L_2il0floatpacket.13(%rip), %zmmX

with
	vpternlogd $0xff, %zmmX, %zmmX, %zmmX

This fixes BZ #28252.

(cherry picked from commit 78c9ec9000)
2022-04-26 18:18:15 -07:00
Florian Weimer
f0c71b34f9 Default to --with-default-link=no (bug 25812)
This is necessary to place the libio vtables into the RELRO segment.
New tests elf/tst-relro-ldso and elf/tst-relro-libc are added to
verify that this is what actually happens.

The new tests fail on ia64 due to lack of (default) RELRO support
inbutils, so they are XFAILed there.

(cherry picked from commit 198abcbb94)
2022-04-22 11:31:14 +02:00
=Joshua Kinard
b87b697f15 mips: Fix mips64n32 64 bit time_t stat support (BZ#29069)
Add missing support initially added by 4e8521333b
(which missed n32 stat).

(cherry picked from commit 78fb888273)
2022-04-18 13:16:21 -03:00
Adhemerval Zanella
290db09546 nptl: Handle spurious EINTR when thread cancellation is disabled (BZ#29029)
Some Linux interfaces never restart after being interrupted by a signal
handler, regardless of the use of SA_RESTART [1].  It means that for
pthread cancellation, if the target thread disables cancellation with
pthread_setcancelstate and calls such interfaces (like poll or select),
it should not see spurious EINTR failures due the internal SIGCANCEL.

However recent changes made pthread_cancel to always sent the internal
signal, regardless of the target thread cancellation status or type.
To fix it, the previous semantic is restored, where the cancel signal
is only sent if the target thread has cancelation enabled in
asynchronous mode.

The cancel state and cancel type is moved back to cancelhandling
and atomic operation are used to synchronize between threads.  The
patch essentially revert the following commits:

  8c1c0aae20 nptl: Move cancel type out of cancelhandling
  2b51742531 nptl: Move cancel state out of cancelhandling
  26cfbb7162 nptl: Remove CANCELING_BITMASK

However I changed the atomic operation to follow the internal C11
semantic and removed the MACRO usage, it simplifies a bit the
resulting code (and removes another usage of the old atomic macros).

Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
and powerpc64-linux-gnu.

[1] https://man7.org/linux/man-pages/man7/signal.7.html

Reviewed-by: Florian Weimer <fweimer@redhat.com>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>

(cherry-picked from commit 404656009b)
2022-04-15 09:52:54 -03:00
Stefan Liebler
0c03cb54c8 S390: Add new s390 platform z16.
The new IBM z16 is added to platform string array.
The macro _DL_PLATFORMS_COUNT is incremented.

_dl_hwcaps_subdir is extended by "z16" if HWCAP_S390_VXRS_PDE2
is set. HWCAP_S390_NNPA is not tested in _dl_hwcaps_subdirs_active
as those instructions may be replaced or removed in future.

tst-glibc-hwcaps.c is extended in order to test z16 via new marker5.

A fatal glibc error is dumped if glibc was build with architecture
level set for z16, but run on an older machine. (See dl-hwcap-check.h)

(cherry picked from commit 2376944b9e)
2022-04-14 14:21:57 +02:00
Adhemerval Zanella
4dca2d3a7b hppa: Fix bind-now audit (BZ #28857)
On hppa, a function pointer returned by la_symbind is actually a function
descriptor has the plabel bit set (bit 30).  This must be cleared to get
the actual address of the descriptor.  If the descriptor has been bound,
the first word of the descriptor is the physical address of theA function,
otherwise, the first word of the descriptor points to a trampoline in the
PLT.

This patch also adds a workaround on tests because on hppa (and it seems
to be the only ABI I have see it), some shared library adds a dynamic PLT
relocation to am empty symbol name:

$ readelf -r elf/tst-audit25mod1.so
[...]
Relocation section '.rela.plt' at offset 0x464 contains 6 entries:
 Offset     Info    Type            Sym.Value  Sym. Name + Addend
00002008  00000081 R_PARISC_IPLT                508
[...]

It breaks some assumptions on the test, where a symbol with an empty
name ("") is passed on la_symbind.

Checked on x86_64-linux-gnu and hppa-linux-gnu.

(cherry picked from commit 9e94f57484)
2022-04-12 13:33:17 -04:00
Ben Woodard
b118bce87a elf: Fix runtime linker auditing on aarch64 (BZ #26643)
The rtld audit support show two problems on aarch64:

  1. _dl_runtime_resolve does not preserve x8, the indirect result
      location register, which might generate wrong result calls
      depending of the function signature.

  2. The NEON Q registers pushed onto the stack by _dl_runtime_resolve
     were twice the size of D registers extracted from the stack frame by
     _dl_runtime_profile.

While 2. might result in wrong information passed on the PLT tracing,
1. generates wrong runtime behaviour.

The aarch64 rtld audit support is changed to:

  * Both La_aarch64_regs and La_aarch64_retval are expanded to include
    both x8 and the full sized NEON V registers, as defined by the
    ABI.

  * dl_runtime_profile needed to extract registers saved by
    _dl_runtime_resolve and put them into the new correctly sized
    La_aarch64_regs structure.

  * The LAV_CURRENT check is change to only accept new audit modules
    to avoid the undefined behavior of not save/restore x8.

  * Different than other architectures, audit modules older than
    LAV_CURRENT are rejected (both La_aarch64_regs and La_aarch64_retval
    changed their layout and there are no requirements to support multiple
    audit interface with the inherent aarch64 issues).

  * A new field is also reserved on both La_aarch64_regs and
    La_aarch64_retval to support variant pcs symbols.

Similar to x86, a new La_aarch64_vector type to represent the NEON
register is added on the La_aarch64_regs (so each type can be accessed
directly).

Since LAV_CURRENT was already bumped to support bind-now, there is
no need to increase it again.

Checked on aarch64-linux-gnu.

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit ce9a68c57c)

Resolved conflicts:
	NEWS
	elf/rtld.c
2022-04-12 13:33:10 -04:00
Adhemerval Zanella
056fc1c0e3 elf: Issue la_symbind for bind-now (BZ #23734)
The audit symbind callback is not called for binaries built with
-Wl,-z,now or when LD_BIND_NOW=1 is used, nor the PLT tracking callbacks
(plt_enter and plt_exit) since this would change the expected
program semantics (where no PLT is expected) and would have performance
implications (such as for BZ#15533).

LAV_CURRENT is also bumped to indicate the audit ABI change (where
la_symbind flags are set by the loader to indicate no possible PLT
trace).

To handle powerpc64 ELFv1 function descriptor, _dl_audit_symbind
requires to know whether bind-now is used so the symbol value is
updated to function text segment instead of the OPD (for lazy binding
this is done by PPC64_LOAD_FUNCPTR on _dl_runtime_resolve).

Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
powerpc64-linux-gnu.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 32612615c5)

Resolved conflicts:
	NEWS - Manual merge.
2022-04-12 13:32:59 -04:00
Adhemerval Zanella
efb21b5fb2 elf: Fix initial-exec TLS access on audit modules (BZ #28096)
For audit modules and dependencies with initial-exec TLS, we can not
set the initial TLS image on default loader initialization because it
would already be set by the audit setup.  However, subsequent thread
creation would need to follow the default behaviour.

This patch fixes it by setting l_auditing link_map field not only
for the audit modules, but also for all its dependencies.  This is
used on _dl_allocate_tls_init to avoid the static TLS initialization
at load time.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 254d3d5aef)
2022-04-08 14:18:12 -04:00
Adhemerval Zanella
a8e211daea elf: Add _dl_audit_pltexit
It consolidates the code required to call la_pltexit audit
callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 8c0664e2b8)

Resolved conflicts:
	sysdeps/hppa/dl-runtime.c
2022-04-08 14:18:12 -04:00
Adhemerval Zanella
fd9c4e8a1b elf: Add _dl_audit_pltenter
It consolidates the code required to call la_pltenter audit
callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit eff687e846)
2022-04-08 14:18:12 -04:00
Adhemerval Zanella
31473c273b elf: Add _dl_audit_preinit
It consolidates the code required to call la_preinit audit
callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 0b98a87487)
2022-04-08 14:18:12 -04:00
Adhemerval Zanella
b2d99731b6 elf: Add _dl_audit_symbind_alt and _dl_audit_symbind
It consolidates the code required to call la_symbind{32,64} audit
callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit cda4f265c6)
2022-04-08 14:18:12 -04:00
Adhemerval Zanella
198660741b elf: Add _dl_audit_objclose
It consolidates the code required to call la_objclose audit
callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 311c9ee54e)
2022-04-08 14:18:11 -04:00
Adhemerval Zanella
ec0fc2a153 elf: Add _dl_audit_objsearch
It consolidates the code required to call la_objsearch audit
callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit c91008d349)
2022-04-08 14:18:11 -04:00
Adhemerval Zanella
66e9d27a09 elf: Add _dl_audit_activity_map and _dl_audit_activity_nsid
It consolidates the code required to call la_activity audit
callback.

Also for a new Lmid_t the namespace link_map list are empty, so it
requires to check if before using it.  This can happen for when audit
module is used along with dlmopen.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 3dac3959a5)
2022-04-08 14:18:11 -04:00
Adhemerval Zanella
ce0cb6d1d2 elf: Add _dl_audit_objopen
It consolidates the code required to call la_objopen audit callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit aee6e90f93)

Resolved conflicts:
	elf/Makefile
2022-04-08 14:18:11 -04:00
Adhemerval Zanella
b868b45f67 elf: Fix dynamic-link.h usage on rtld.c
The 4af6982e4c fix does not fully handle RTLD_BOOTSTRAP usage on
rtld.c due two issues:

  1. RTLD_BOOTSTRAP is also used on dl-machine.h on various
     architectures and it changes the semantics of various machine
     relocation functions.

  2. The elf_get_dynamic_info() change was done sideways, previously
     to 490e6c62aa get-dynamic-info.h was included by the first
     dynamic-link.h include *without* RTLD_BOOTSTRAP being defined.
     It means that the code within elf_get_dynamic_info() that uses
     RTLD_BOOTSTRAP is in fact unused.

To fix 1. this patch now includes dynamic-link.h only once with
RTLD_BOOTSTRAP defined.  The ELF_DYNAMIC_RELOCATE call will now have
the relocation fnctions with the expected semantics for the loader.

And to fix 2. part of 4af6982e4c is reverted (the check argument
elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP
pieces are removed.

To reorganize the includes the static TLS definition is moved to
its own header to avoid a circular dependency (it is defined on
dynamic-link.h and dl-machine.h requires it at same time other
dynamic-link.h definition requires dl-machine.h defitions).

Also ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL
are moved to its own header.  Only ancient ABIs need special values
(arm, i386, and mips), so a generic one is used as default.

The powerpc Elf64_FuncDesc is also moved to its own header, since
csu code required its definition (which would require either include
elf/ folder or add a full path with elf/).

Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32,
and powerpc64le.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit d6d89608ac)

Resolved conflicts:
	elf/rtld.c
2022-04-08 14:18:11 -04:00
Adhemerval Zanella
c6df39a0bd elf: Fix elf_get_dynamic_info definition
Before to 490e6c62aa ('elf: Avoid nested functions in the loader
[BZ #27220]'), elf_get_dynamic_info() was defined twice on rtld.c: on
the first dynamic-link.h include and later within _dl_start().  The
former definition did not define DONT_USE_BOOTSTRAP_MAP and it is used
on setup_vdso() (since it is a global definition), while the former does
define DONT_USE_BOOTSTRAP_MAP and it is used on loader self-relocation.

With the commit change, the function is now included and defined once
instead of defined as a nested function.  So rtld.c defines without
defining RTLD_BOOTSTRAP and it brokes at least powerpc32.

This patch fixes by moving the get-dynamic-info.h include out of
dynamic-link.h, which then the caller can corirectly set the expected
semantic by defining STATIC_PIE_BOOTSTRAP, RTLD_BOOTSTRAP, and/or
RESOLVE_MAP.

It also required to enable some asserts only for the loader bootstrap
to avoid issues when called from setup_vdso().

As a side note, this is another issues with nested functions: it is
not clear from pre-processed output (-E -dD) how the function will
be build and its semantic (since nested function will be local and
extra C defines may change it).

I checked on x86_64-linux-gnu (w/o --enable-static-pie),
i686-linux-gnu, powerpc64-linux-gnu, powerpc-linux-gnu-power4,
aarch64-linux-gnu, arm-linux-gnu, sparc64-linux-gnu, and
s390x-linux-gnu.

Reviewed-by: Fangrui Song <maskray@google.com>
(cherry picked from commit 4af6982e4c)

Resolved conflicts:
	elf/rtld.c
2022-04-08 14:18:11 -04:00
Fangrui Song
b19de59d62 elf: Avoid nested functions in the loader [BZ #27220]
dynamic-link.h is included more than once in some elf/ files (rtld.c,
dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested
functions. This harms readability and the nested functions usage
is the biggest obstacle prevents Clang build (Clang doesn't support GCC
nested functions).

The key idea for unnesting is to add extra parameters (struct link_map
*and struct r_scope_elm *[]) to RESOLVE_MAP,
ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a],
elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired
by Stan Shebs' ppc64/x86-64 implementation in the
google/grte/v5-2.27/master which uses mixed extra parameters and static
variables.)

Future simplification:
* If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM,
  elf_machine_runtime_setup can drop the `scope` parameter.
* If TLSDESC no longer need to be in elf_machine_lazy_rel,
  elf_machine_lazy_rel can drop the `scope` parameter.

Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32,
sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64.
In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu
and riscv64-linux-gnu-rv64imafdc-lp64d.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 490e6c62aa)
2022-04-08 14:18:11 -04:00
Fangrui Song
4b5b8a1cdf powerpc: Delete unneeded ELF_MACHINE_BEFORE_RTLD_RELOC
Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
(cherry picked from commit 8e2557a2b8)
2022-04-08 14:18:11 -04:00
John David Anglin
d53b9cc391 hppa: Use END instead of PSEUDO_END in swapcontext.S
(cherry picked from commit 7a5c440102)
2022-03-15 23:14:37 +00:00
John David Anglin
738ee53f0c hppa: Implement swapcontext in assembler (bug 28960)
When swapcontext.c is compiled without -g, the following error occurs:
Error: CFI instruction used without previous .cfi_startproc

Fix by converting swapcontext routine to assembler.
2022-03-15 23:12:37 +00:00
John David Anglin
6c9c230765 hppa: Fix warnings from _dl_lookup_address
This change fixes two warnings from _dl_lookup_address.

The first warning comes from dropping the volatile keyword from
desc in the call to _dl_read_access_allowed.  We now have a full
atomic barrier between loading desc[0] and the access check, so
desc no longer needs to be declared as volatile.

The second warning comes from the implicit declaration of
_dl_fix_reloc_arg.  This is fixed by including dl-runtime.h and
declaring _dl_fix_reloc_arg in dl-runtime.h.
2022-03-06 16:04:32 +00:00
John David Anglin
f610d2935f hppa: Revise gettext trampoline design
The current getcontext return trampoline is overly complex and it
unnecessarily clobbers several registers.  By saving the context
pointer (r26) in the context, __getcontext_ret can restore any
registers not restored by setcontext.  This allows getcontext to
save and restore the entire register context present when getcontext
is entered.  We use the unused oR0 context slot for the return
from __getcontext_ret.

While this is not directly useful in C, it can be exploited in
assembly code.  Registers r20, r23, r24 and r25 are not clobbered
in the call path to getcontext.  This allows a small simplification
of swapcontext.

It also allows saving and restoring the 6-bit SAR register in the
LSB of the oSAR context slot.  The getcontext flag value can be
stored in the MSB of the oSAR slot.

(cherry picked from commit 9e7e5fda38)
2022-03-06 16:00:52 +00:00
John David Anglin
c6f9085ee4 hppa: Fix swapcontext
This change fixes the failure of stdlib/tst-setcontext2 and
stdlib/tst-setcontext7 on hppa.  The implementation of swapcontext
in C is broken.  C saves the return pointer (rp) and any non
call-clobbered registers (in this case r3, r4 and r5) on the
stack.  However, the setcontext call in swapcontext pops the
stack and subsequent calls clobber the saved registers.  When
the context in oucp is restored, both tests fault.

Here we rewrite swapcontext in assembly code to avoid using
the stack for register values that need to be used after
restoration.  The getcontext and setcontext routines are
revised to save and restore register ret1 for normal returns.
We copy the oucp pointer to ret1.  This allows access to
the old context after calling getcontext and setcontext.

(cherry picked from commit 71b108d7eb)
2022-03-06 16:00:52 +00:00
John David Anglin
3be79b72d5 Fix elf/tst-audit2 on hppa
The test elf/tst-audit2 fails on hppa with a segmentation fault in the
long branch stub used to call malloc from calloc.  This occurs because
the test is not a PIC executable and calloc is called from the dynamic
linker before the dp register is initialized in _dl_start_user.

The fix is to move the dp register initialization into
elf_machine_runtime_setup.  Since the address of $global$ can't be
loaded directly, we continue to use the DT_PLTGOT value from the
the main_map to initialize dp.  Since l_main_map is not available
in v2.34 and earlier, we use a new function, elf_machine_main_map,
to find the main map.
2022-03-06 15:56:57 +00:00
Noah Goldstein
15b00d2af0 x86: Fix TEST_NAME to make it a string in tst-strncmp-rtm.c
Previously TEST_NAME was passing a function pointer. This didn't fail
because of the -Wno-error flag (to allow for overflow sizes passed
to strncmp/wcsncmp)

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b98d0bbf74)
2022-02-18 15:34:40 -08:00
Noah Goldstein
d093b677c3 x86: Test wcscmp RTM in the wcsncmp overflow case [BZ #28896]
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 7835d611af)
2022-02-18 14:59:54 -08:00
Noah Goldstein
38e0d24794 x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #28896]
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.

Co-authored-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit c627209832)
2022-02-18 14:59:47 -08:00
Dmitry V. Levin
007e054d78 linux: fix accuracy of get_nprocs and get_nprocs_conf [BZ #28865]
get_nprocs() and get_nprocs_conf() use various methods to obtain an
accurate number of processors.  Re-introduce __get_nprocs_sched() as
a source of information, and fix the order in which these methods are
used to return the most accurate information.  The primary source of
information used in both functions remains unchanged.

This also changes __get_nprocs_sched() error return value from 2 to 0,
but all its users are already prepared to handle that.

Old fallback order:
  get_nprocs:
    /sys/devices/system/cpu/online -> /proc/stat -> 2
  get_nprocs_conf:
    /sys/devices/system/cpu/ -> /proc/stat -> 2

New fallback order:
  get_nprocs:
    /sys/devices/system/cpu/online -> /proc/stat -> sched_getaffinity -> 2
  get_nprocs_conf:
    /sys/devices/system/cpu/ -> /proc/stat -> sched_getaffinity -> 2

Fixes: 342298278e ("linux: Revert the use of sched_getaffinity on get_nproc")
Closes: BZ #28865
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

(cherry picked from commit e1d32b8364)
2022-02-07 20:18:29 +00:00
Adhemerval Zanella
0351c75c5f linux: Fix missing __convert_scm_timestamps (BZ #28860)
Commit 948ce73b31 made recvmsg/recvmmsg to always call
__convert_scm_timestamps for 64 bit time_t symbol, so adjust it to
always build it for __TIMESIZE != 64.

It fixes build for architecture with 32 bit time_t support when
configured with minimum kernel of 5.1.

(cherry-picked from 798d716df7)
2022-02-03 17:20:36 -03:00
Samuel Thibault
d8302ba2da hurd if_index: Explicitly use AF_INET for if index discovery
5bf07e1b3a ("Linux: Simplify __opensock and fix race condition [BZ #28353]")
made __opensock try NETLINK then UNIX then INET. On the Hurd, only INET
knows about network interfaces, so better actually specify that in
if_index.

(cherry picked from commit 1d3decee99)
2022-02-03 16:22:04 +01:00
Florian Weimer
ad615b59c7 Linux: Simplify __opensock and fix race condition [BZ #28353]
AF_NETLINK support is not quite optional on modern Linux systems
anymore, so it is likely that the first attempt will always succeed.
Consequently, there is no need to cache the result.  Keep AF_UNIX
and the Internet address families as a fallback, for the rare case
that AF_NETLINK is missing.  The other address families previously
probed are totally obsolete be now, so remove them.

Use this simplified version as the generic implementation, disabling
Netlink support as needed.

(cherry picked from commit 5bf07e1b3a)
2022-02-03 16:21:50 +01:00