Commit Graph

16259 Commits

Author SHA1 Message Date
Palmer Dabbelt
07fe71f59b
arm: Avoid UB in elf_machine_rel()
This recently came up during a cleanup to remove misaligned accesses
from the RISC-V port.

Link: https://sourceware.org/pipermail/libc-alpha/2022-June/139961.html
Suggested-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Fangrui Song <maskray@google.com>
2024-06-26 12:45:43 +02:00
mengqinggang
a10b6ad471 LoongArch: Fix tst-gnu2-tls2 test case
asm volatile ("movfcsr2gr $t0, $fcsr0" ::: "$t0");
asm volatile ("st.d $t0, %0" :"=m"(restore_fcsr));

generate to the following instructions with -Og flag:

movfcsr2gr      $t0, $zero
addi.d          $t0, $sp, 2047(0x7ff)
addi.d          $t0, $t0, 77(0x4d)
st.w            $t0, $t0, 0

fcsr0 register and restore_fcsr variable are both stored in t0 register.

Change to:

asm volatile ("movfcsr2gr %0, $fcsr0" :"=r"(restore_fcsr));

to avoid restore_fcsr address in t0.

Comparing float value using memcmp because float value cannot be
directly compared for equality.

Put LOAD_REGISTER_FCSR and SAVE_REGISTER_FCC after LOAD_REGISTER_FLOAT.
Some float instructions may change fcsr register.
2024-06-26 12:02:07 +08:00
Adhemerval Zanella
c90cfce849 posix: Fix pidfd_spawn/pidfd_spawnp leak if execve fails (BZ 31695)
If the pidfd_spawn/pidfd_spawnp helper process succeeds, but evecve
fails for some reason (either with an invalid/non-existent, memory
allocation, etc.) the resulting pidfd is never closed, nor returned
to caller (so it can call close).

Since the process creation failed, it should be up to posix_spawn to
also, close the file descriptor in this case (similar to what it
does to reap the process).

This patch also changes the waitpid with waitid (P_PIDFD) for pidfd
case, to avoid a possible pid re-use.

Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2024-06-25 12:11:48 -03:00
Andreas K. Hüttel
d32c342425
Revert "MIPSr6/math: Use builtin fma and fmaf"
Apologies, I mistakenly interpreted this to be already accepted.
Reverting until v6 or later is reviewed and approved.

This reverts commit 9e06e4a43b.
2024-06-25 01:02:58 +02:00
Christoph Müllner
81c7f6193c
RISC-V: Execute a PAUSE hint in spin loops
The atomic_spin_nop() macro can be used to run arch-specific
code in the body of a spin loop to potentially improve efficiency.
RISC-V's Zihintpause extension includes a PAUSE instruction for
this use-case, which is encoded as a HINT, which means that it
behaves like a NOP on systems that don't implement Zihintpause.

Binutils supports Zihintpause since 2.36, so this patch uses
the ".insn" directive to keep the code compatible with older
toolchains.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-24 21:36:49 +02:00
YunQiang Su
9e06e4a43b
MIPSr6/math: Use builtin fma and fmaf
MIPSr6 has MADDF.s/MADDF.d instructions, which are fused.

In MIPS ISA, double support can be subsetted.  Only FMAF is enabled
for this case.

	* sysdeps/mips/fpu/math-use-builtins-fma.h

Signed-off-by: YunQiang Su <syq@gcc.gnu.org>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
2024-06-24 19:43:57 +02:00
John David Anglin
aecde502e9 hppa/vdso: Add wrappers for vDSO functions
The upcoming parisc (hppa) v6.11 Linux kernel will include vDSO
support for gettimeofday(), clock_gettime() and clock_gettime64()
syscalls for 32- and 64-bit userspace.
The patch below adds the necessary glue code for glibc.

Signed-off-by: Helge Deller <deller@gmx.de>

Changes in v2:
- add vsyscalls for 64-bit too
2024-06-23 19:39:28 -04:00
John David Anglin
9dddb26954 Update hppa libm-test-ulps 2024-06-23 13:51:25 -04:00
John David Anglin
da61ba3f89 Update hppa libm-test-ulps 2024-06-20 19:44:04 -04:00
Julian Zhu
9f2bf0e23a
RISC-V: Update ulps
For the exp10m1, exp2m1, log10p1 and log2p1 implementations.

Signed-off-by: Julian Zhu <jz531210@gmail.com>
2024-06-20 23:46:32 +02:00
Julian Zhu
cb20e7c7cc
MIPS: Update ulps
Update mips32/mips64 ulps for the exp10m1, exp2m1, and log10p1 implementations.

Signed-off-by: Julian Zhu <jz531210@gmail.com>
2024-06-20 23:45:24 +02:00
Florian Weimer
b375e597da i386: Update ulps
This is from a -march=i686 -mtune=generic build with
--disable-multi-arch, running on a Cascade Lake CPU.
2024-06-20 19:00:48 +02:00
Florian Weimer
362588f7cc s390x: Capture grep output in static PIE check
The test is not a run-time check, so update the description.
Also use readelf -W for a more stable output format and fix
an LC_ALL typo.

This avoids garbled configure messages:

checking for s390-specific static PIE requirements (runtime check)...  0x0000000000000017 (JMPREL)             0x280
yes
2024-06-20 14:34:06 +02:00
Florian Weimer
71dafdf5f1 powerpc: Update ulps
Results based on POWER8 and POWER9 machines running
powerpc64-linux-gnu, with and without --disable-multi-arch.
2024-06-20 12:15:31 +02:00
Florian Weimer
3cb77b7d1e i386: Update ulps
Based on a -march=x86-64-v4 -mfpmath=sse build, with and without
--disable-multi-arch, running on a Zen 4 CPU.  Also used different
-march=x8i6-64-v… settings.
2024-06-20 12:15:09 +02:00
Xi Ruoyao
9405d54c62
LoongArch: Update ulps
Add ulps for recently added C23 exp10m1, exp2m1, and log10p1 functions.

Signed-off-by: Xi Ruoyao <xry111@xry111.site>
2024-06-19 21:17:19 +02:00
Andreas K. Hüttel
4f1cf0c0e1
sparc: Regenerate ULPs
Linux catbus 5.15.110-gentoo-r1 #1 SMP Fri Jun 9 17:53:23 PDT 2023 sparc64 sun4v UltraSparc T5 (Niagara5) GNU/Linux

Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
2024-06-19 14:58:32 +02:00
Stefan Liebler
19f6d6a480 s390x: Regenerate ULPs.
Needed due to:
- "Implement C23 log10p1"
  commit ID 55eb99e9a9
- "Implement C23 exp2m1, exp10m1"
  commit ID 7ec903e028
2024-06-19 08:42:30 +02:00
mengqinggang
9a675d998e LoongArch: Fix _dl_tlsdesc_dynamic in LSX case
HWCAP value is overwritten at the first comparison of the LASX case.
The second comparison at LSX get incorrect result.
Change to use t0 to save HWCAP value, and use t1 to save comparison
result.
2024-06-19 10:06:41 +08:00
Adhemerval Zanella
92341e3150 arm: Update ulps
For the exp10m1, exp2m1, and log10p1 implementations.
2024-06-18 17:31:10 -03:00
Adhemerval Zanella
45f5f51b85 aarch64: Update ulps
For the exp10m1, exp2m1, and log10p1 implementations.
2024-06-18 17:31:10 -03:00
Adhemerval Zanella
52b397bafa powerpc: Update ulps
For the exp10m1, exp2m1, and log10p1 implementations.
2024-06-18 17:31:10 -03:00
Florian Weimer
f6ea5d1291 Linux: Include <dl-symbol-redir-ifunc.h> in dl-sysdep.c
The _dl_sysdep_parse_arguments function contains initalization
of a large on-stack variable:

  dl_parse_auxv_t auxv_values = { 0, };

This uses a non-inline version of memset on powerpc64le-linux-gnu,
so it must use the baseline memset.
2024-06-18 10:56:34 +02:00
Carlos Llamas
176671f604 linux: add definitions for hugetlb page size encodings
A desired hugetlb page size can be encoded in the flags parameter of
system calls such as mmap() and shmget(). The Linux UAPI headers have
included explicit definitions for these encodings since v4.14.

This patch adds these definitions that are used along with MAP_HUGETLB
and SHM_HUGETLB flags as specified in the corresponding man pages. This
relieves programs from having to duplicate and/or compute the encodings
manually.

Additionally, the filter on these definitions in tst-mman-consts.py is
removed, as suggested by Florian. I then ran this tests successfully,
confirming the alignment with the kernel headers.

  PASS: misc/tst-mman-consts
  original exit status 0

Signed-off-by: Carlos Llamas <cmllamas@google.com>
Tested-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-06-18 10:56:34 +02:00
Stefan Liebler
e260ceb4aa elf: Remove HWCAP_IMPORTANT
Remove the definitions of HWCAP_IMPORTANT after removal of
LD_HWCAP_MASK / tunable glibc.cpu.hwcap_mask.  There HWCAP_IMPORTANT
was used as default value.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Stefan Liebler
ad0aa1f549 elf: Remove LD_HWCAP_MASK / tunable glibc.cpu.hwcap_mask
Remove the environment variable LD_HWCAP_MASK and the tunable
glibc.cpu.hwcap_mask as those are not used anymore in common-code
after removal in elf/dl-cache.c:search_cache().

The only remaining user is sparc32 where it is used in
elf_machine_matches_host().  If sparc32 does not need it anymore,
we can get rid of it at all.  Otherwise we could also move
LD_HWCAP_MASK / tunable glibc.cpu.hwcap_mask to be sparc32 specific.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Stefan Liebler
343439a31e elf: Remove _DL_PLATFORMS_COUNT
Remove the definitions of _DL_PLATFORMS_COUNT as those are not used
anymore after removal in elf/dl-cache.c:search_cache().

Note: On x86, we can also get rid of the definitions
HWCAP_PLATFORMS_START and HWCAP_PLATFORMS_COUNT.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Stefan Liebler
43c7c5e62d elf: Remove _DL_FIRST_PLATFORM
Remove the definitions of _DL_FIRST_PLATFORM as those were only used
in the _DL_HWCAP_PLATFORM definitions and in _dl_string_platform().
Both were removed.

Note: Removed on every architecture despite of powerpc, where
_dl_string_platform() is still used.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Stefan Liebler
ed23449dac elf: Remove _DL_HWCAP_PLATFORM
Remove the definitions of _DL_HWCAP_PLATFORM as those are not used
anymore after removal in elf/dl-cache.c:search_cache().
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Stefan Liebler
374c8b4483 elf: Remove platform strings in dl-procinfo.c
Remove the platform strings in dl-procinfo.c where also
the implementation of _dl_string_platform() was removed.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Stefan Liebler
8faada8302 elf: Remove _dl_string_platform
Despite of powerpc where the returned integer is stored in tcb,
and the diagnostics output, there is no user anymore.

Thus this patch removes the diagnostics output and
_dl_string_platform for all other platforms.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Stefan Liebler
f14b6dfc87 x86: Remove HWCAP_START and HWCAP_COUNT
Both defines are not used anymore.  Those were only used for
_dl_string_hwcap(), which itself was removed with commit
ab40f20364
"elf: Remove _dl_string_hwcap"

Just clean up.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
YunQiang Su
eaf4fc516a
math: Update mips32/mips64 ulps for log2p1 2024-06-17 21:45:53 +02:00
Andreas K. Hüttel
98ffc1bfeb
Convert to autoconf 2.72 (vanilla release, no distribution patches)
As discussed at the patch review meeting

Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Reviewed-by: Simon Chopin <simon.chopin@canonical.com>
2024-06-17 21:15:28 +02:00
Joseph Myers
7ec903e028 Implement C23 exp2m1, exp10m1
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the exp2m1 and exp10m1 functions (exp2(x)-1 and
exp10(x)-1, like expm1).

As with other such functions, these use type-generic templates that
could be replaced with faster and more accurate type-specific
implementations in future.  Test inputs are copied from those for
expm1, plus some additions close to the overflow threshold (copied
from exp2 and exp10) and also some near the underflow threshold.

exp2m1 has the unusual property of having an input (M_MAX_EXP) where
whether the function overflows (under IEEE semantics) depends on the
rounding mode.  Although these could reasonably be XFAILed in the
testsuite (as we do in some cases for arguments very close to a
function's overflow threshold when an error of a few ulps in the
implementation can result in the implementation not agreeing with an
ideal one on whether overflow takes place - the testsuite isn't smart
enough to handle this automatically), since these functions aren't
required to be correctly rounding, I made the implementation check for
and handle this case specially.

The Makefile ordering expected by lint-makefiles for the new functions
is a bit peculiar, but I implemented it in this patch so that the test
passes; I don't know why log2 also needed moving in one Makefile
variable setting when it didn't in my previous patches, but the
failure showed a different place was expected for that function as
well.

The powerpc64le IFUNC setup seems not to be as self-contained as one
might hope; it shouldn't be necessary to add IFUNCs for new functions
such as these simply to get them building, but without setting up
IFUNCs for the new functions, there were undefined references to
__GI___expm1f128 (that IFUNC machinery results in no such function
being defined, but doesn't stop include/math.h from doing the
redirection resulting in the exp2m1f128 and exp10m1f128
implementations expecting to call it).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-06-17 16:31:49 +00:00
Joseph Myers
55eb99e9a9 Implement C23 log10p1
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the log10p1 functions (log10(1+x): like log1p, but for
base-10 logarithms).

This is directly analogous to the log2p1 implementation (except that
whereas log2p1 has a smaller underflow range than log1p, log10p1 has a
larger underflow range).  The test inputs are copied from those for
log1p and log2p1, plus a few more inputs in that wider underflow
range.

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-06-17 13:48:13 +00:00
Joseph Myers
bb014f50c4 Implement C23 logp1
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the logp1 functions (aliases for log1p functions - the
name is intended to be more consistent with the new log2p1 and
log10p1, where clearly it would have been very confusing to name those
functions log21p and log101p).  As aliases rather than new functions,
the content of this patch is somewhat different from those actually
adding new functions.

Tests are shared with log1p, so this patch *does* mechanically update
all affected libm-test-ulps files to expect the same errors for both
functions.

The vector versions of log1p on aarch64 and x86_64 are *not* updated
to have logp1 aliases (and thus there are no corresponding header,
tests, abilist or ulps changes for vector functions either).  It would
be reasonable for such vector aliases and corresponding changes to
other files to be made separately.  For now, the log1p tests instead
avoid testing logp1 in the vector case (a Makefile change is needed to
avoid problems with grep, used in generating the .c files for vector
function tests, matching more than one ALL_RM_TEST line in a file
testing multiple functions with the same inputs, when it assumes that
the .inc file only has a single such line).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-06-17 13:47:09 +00:00
Noah Goldstein
5b54a33435 x86: Fix value for x86_memset_non_temporal_threshold when it is undesirable
When we don't want to use non-temporal stores for memset, we set
`x86_memset_non_temporal_threshold` to SIZE_MAX.

The current code, however, we using `maximum_non_temporal_threshold`
as the upper bound which is `SIZE_MAX >> 4` so we ended up with a
value of `0`.

Fix is to just use `SIZE_MAX` as the upper bound for when setting the
tunable.
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-06-14 17:25:05 -05:00
Andreas K. Hüttel
3953b5b88f
i686: Regenerate ulps
Linux pinacolada 6.6.32-gentoo #1 SMP PREEMPT Sun Jun  9 14:18:17 CEST 2024 x86_64 Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz GenuineIntel GNU/Linux
32bit build for multilib environment

Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
2024-06-14 21:24:24 +02:00
Xi Ruoyao
97aa7b7346 LoongArch: Ensure sp 16-byte aligned for tlsdesc
"ADDI sp, sp, 24" and "ADDI sp, sp, SZFCSREG" (SZFCSREG = 4) are
misaligning the stack: the ABI mandates a 16-byte alignment.  Fix it
by changing the first one to "ADDI sp, sp, 32", and reuse the spare 4th
slot for saving fcsr.

Reported-by: Jinyang He <hejinyang@loongson.cn>
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
2024-06-14 10:14:54 +08:00
H.J. Lu
29807a271e x86: Properly set x86 minimum ISA level [BZ #31883]
Properly set libc_cv_have_x86_isa_level in shell for MINIMUM_X86_ISA_LEVEL
defined as

(__X86_ISA_V1 + __X86_ISA_V2 + __X86_ISA_V3 + __X86_ISA_V4)

Also set __X86_ISA_V2 to 1 for i386 if __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8
is defined.  There are no changes in config.h nor in config.make on x86-64.
On i386, -march=x86-64-v2 with GCC generates

 #define MINIMUM_X86_ISA_LEVEL 2

in config.h and

have-x86-isa-level = 2

in config.make.  This fixes BZ #31883.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2024-06-12 14:27:54 -07:00
Adhemerval Zanella
7edd3814b0 linux: Remove __stack_prot
The __stack_prot is used by Linux to make the stack executable if
a modules requires it.  It is also marked as RELRO, which requires
to change the segment permission to RW to update it.

Also, there is no need to keep track of the flags: either the stack
will have the default permission of the ABI or should be change to
PROT_READ | PROT_WRITE | PROT_EXEC.  The only additional flag,
PROT_GROWSDOWN or PROT_GROWSUP, is Linux only and can be deducted
from _STACK_GROWS_DOWN/_STACK_GROWS_UP.

Also, the check_consistency function was already removed some time
ago.

Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-06-12 15:25:54 -03:00
H.J. Lu
09bc68b0ac x86: Properly set MINIMUM_X86_ISA_LEVEL for i386 [BZ #31867]
On i386, set the default minimum ISA level to 0, not 1 (baseline which
includes SSE2).  There are no changes in config.h nor in config.make on
x86-64.  This fixes BZ #31867.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Tested-by: Ian Jordan <immoloism@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-06-11 00:10:08 -07:00
Joe Damato
bef2a827a5 x86: Enable non-temporal memset tunable for AMD
In commit 46b5e98ef6 ("x86: Add seperate non-temporal tunable for
memset") a tunable threshold for enabling non-temporal memset was added,
but only for Intel hardware.

Since that commit, new benchmark results suggest that non-temporal
memset is beneficial on AMD, as well, so allow this tunable to be set
for AMD.

See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
which has been updated to include data using different stategies for
large memset on AMD Zen2, Zen3, and Zen4.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2024-06-10 16:18:18 -05:00
Samuel Thibault
74f9ee3b91 hurd: Fix lsetxattr return value
The manpage says that lsetxattr returns 0 on success, like setxattr.
2024-06-10 21:56:13 +02:00
Joe Damato
92c270d32c Linux: Add epoll ioctls
As of Linux kernel 6.9, some ioctls and a parameters structure have been
introduced which allow user programs to control whether a particular
epoll context will busy poll.

Update the headers to include these for the convenience of user apps.

The ioctls were added in Linux kernel 6.9 commit 18e2bf0edf4dd
("eventpoll: Add epoll ioctl for epoll_params") [1] to
include/uapi/linux/eventpoll.h.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?h=v6.9&id=18e2bf0edf4dd

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-04 12:09:15 -05:00
Szabolcs Nagy
2a9943b4a0 math: Fix exp10 undefined left shift
Left shift of ki is undefined when ki<0, copy the logic from exp,
which uses unsigned arithmetics, to fix it.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-04 15:33:26 +01:00
Joseph Myers
1d441791cb Add new AArch64 HWCAP2 definitions from Linux 6.9 to bits/hwcap.h
Linux 6.9 adds 15 new HWCAP2_* values for AArch64; add them to
bits/hwcap.h in glibc.

Tested with build-many-glibcs.py for aarch64-linux-gnu.
2024-06-04 12:25:05 +00:00
Noah Goldstein
46b5e98ef6 x86: Add seperate non-temporal tunable for memset
The tuning for non-temporal stores for memset vs memcpy is not always
the same. This includes both the exact value and whether non-temporal
stores are profitable at all for a given arch.

This patch add `x86_memset_non_temporal_threshold`. Currently we
disable non-temporal stores for non Intel vendors as the only
benchmarks showing its benefit have been on Intel hardware.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-05-30 12:36:09 -05:00
Noah Goldstein
5bf0ab8057 x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Previously we use `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
For data using different stategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
size, non-temporal writes can avoid data in their RFO requests saving
memory bandwidth.

All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged, this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-05-30 12:36:09 -05:00