glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-12-11 05:40:06 +00:00

Author	SHA1	Message	Date
Andrew Pinski	e6590f0c86	aarch64: Remove non-temporal load/stores from oryon-1's memset The hardware architects have a new recommendation not to use non-temporal load/stores for memset. This patch removes this path. I found there was no difference in the memset speed with/without non-temporal load/stores either. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2024-11-21 11:32:23 -03:00
Andrew Pinski	eb5eeb4740	aarch64: Remove non-temporal load/stores from oryon-1's memcpy The hardware architects have a new recommendation not to use non-temporal load/stores for memcpy. This patch removes this path. I found there was no difference in the memcpy speed with/without non-temporal load/stores either. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2024-11-21 11:32:17 -03:00
Andrew Pinski	e162ab2bf1	AArch64: Remove thunderx{,2} memcpy ThunderX1 and ThunderX2 have been retired for a few years now. So let's remove the thunderx{,2} specific versions of memcpy. The performance gain or them was for medium and large sizes while the generic (aarch64) memcpy will handle just slightly worse. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>	2024-11-20 11:23:53 +00:00
Wilco Dijkstra	8ecb477ea1	AArch64: Remove memset-reg.h Remove memset-reg.h by moving register definitions into the memset implementations. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2024-09-10 14:18:03 +01:00
Andrew Pinski	2f1f7a5f8a	Aarch64: Add new memset for Qualcomm's oryon-1 core Qualcom's new core, oryon-1, has a different characteristics for memset than the current versions of memset. For non-zero, larger sizes, using GPRs rather than the SIMD stores is ~30% faster. For even larger sizes, using the nontemporal stores is needed not to polute the L1/L2 caches. For zero values, using `dc zva` should be used. Since we know the size will always be 64 bytes, we don't need to figure out the size there. I started with the emag memset and added back the `dc zva` code. Changes since v1: * v3: Fix comment formating Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2024-06-30 13:47:17 +02:00
Andrew Pinski	4dc83cac78	Aarch64: Add memcpy for qualcomm's oryon-1 core Qualcomm's new core (oryon-1) has a different performance characteristic than other cores. For memcpy, it is faster to use the GPRs to do the copy for large sizes (2x faster). For even larger sizes, it is better to use the nontemporal load/store instructions so we don't pollute the L1/L2 caches. For smaller sizes, the characteristic are very similar to other cores. I used the thunderx memcpy as a starting point and expanded from there. Changes since v1: * v2: Fix ordering in Makefile. * v3: Fix comment grammar about the ldnp/stnp instructions. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2024-06-30 13:46:33 +02:00
Adhemerval Zanella	ef9596352b	aarch64: Remove duplicate memchr/strlen in libc.a (BZ 31777) The generic version provides weak definitions of memchr/strlen, which are already provided by the ifunc resolvers. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-05-23 09:36:08 -03:00
Adhemerval Zanella	bcae44ea85	elf: Only process multiple tunable once (BZ 31686) The `680c597e9c` commit made loader reject ill-formatted strings by first tracking all set tunables and then applying them. However, it does not take into consideration if the same tunable is set multiple times, where parse_tunables_string appends the found tunable without checking if it was already in the list. It leads to a stack-based buffer overflow if the tunable is specified more than the total number of tunables. For instance: GLIBC_TUNABLES=glibc.malloc.check=2:... (repeat over the number of total support for different tunable). Instead, use the index of the tunable list to get the expected tunable entry. Since now the initial list is zero-initialized, the compiler might emit an extra memset and this requires some minor adjustment on some ports. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reported-by: Yuto Maeda <maeda@cyberdefense.jp> Reported-by: Yutaro Shimizu <shimizu@cyberdefense.jp> Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>	2024-05-07 12:16:36 -03:00
Wilco Dijkstra	2e94e2f5d2	AArch64: Check kernel version for SVE ifuncs Old Linux kernels disable SVE after every system call. Calling the SVE-optimized memcpy afterwards will then cause a trap to reenable SVE. As a result, applications with a high use of syscalls may run slower with the SVE memcpy. This is true for kernels between 4.15.0 and before 6.2.0, except for 5.14.0 which was patched. Avoid this by checking the kernel version and selecting the SVE ifunc on modern kernels. Parse the kernel version reported by uname() into a 24-bit kernel.major.minor value without calling any library functions. If uname() is not supported or if the version format is not recognized, assume the kernel is modern. Tested-by: Florian Weimer <fweimer@redhat.com> Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2024-03-21 16:50:51 +00:00
Paul Eggert	dff8da6b3e	Update copyright dates with scripts/update-copyrights	2024-01-01 10:53:40 -08:00
Szabolcs Nagy	8e755f5bc8	aarch64: fix tested ifunc variants Don't test a64fx string functions when BTI is enabled since they are not BTI compatible.	2023-12-04 14:41:26 +00:00
Wilco Dijkstra	2f5524cc53	AArch64: Remove Falkor memcpy The latest implementations of memcpy are actually faster than the Falkor implementations [1], so remove the falkor/phecda ifuncs for memcpy and the now unused IS_FALKOR/IS_PHECDA defines. [1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2023-11-13 16:52:50 +00:00
Wilco Dijkstra	3d7090f14b	AArch64: Add memset_zva64 Add a specialized memset for the common ZVA size of 64 to avoid the overhead of reading the ZVA size. Since the code is identical to __memset_falkor, remove the latter. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2023-11-13 16:50:44 +00:00
Wilco Dijkstra	9627ab99b5	AArch64: Cleanup emag memset Cleanup emag memset - merge the memset_base64.S file, remove the unused ZVA code (since it is disabled on emag). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2023-11-13 16:45:47 +00:00
Wilco Dijkstra	9fd3409842	AArch64: Cleanup ifuncs Cleanup ifuncs. Remove uses of libc_hidden_builtin_def, use ENTRY rather than ENTRY_ALIGN, remove unnecessary defines and conditional compilation. Rename strlen_mte to strlen_generic. Remove rtld-memset. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-11-01 13:41:59 +00:00
Wilco Dijkstra	2bd0017988	AArch64: Add support for MOPS memcpy/memmove/memset Add support for MOPS in cpu_features and INIT_ARCH. Add ifuncs using MOPS for memcpy, memmove and memset (use .inst for now so it works with all binutils versions without needing complex configure and conditional compilation). Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-10-24 13:37:48 +01:00
Paul Pluzhnikov	65cc53fe7c	Fix misspellings in sysdeps/ -- BZ 25337	2023-05-30 23:02:29 +00:00
Wilco Dijkstra	d2d3f3720c	AArch64: Improve SVE memcpy and memmove Improve SVE memcpy by copying 2 vectors if the size is small enough. This improves performance of random memcpy by ~9% on Neoverse V1, and 33-64 byte copies are ~16% faster. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-02-06 16:15:34 +00:00
Wilco Dijkstra	1bbb1a2022	AArch64: Improve strlen_asimd Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment. Performance improves slightly as a result. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2023-01-17 15:09:18 +00:00
Joseph Myers	6d7e8eda9b	Update copyright dates with scripts/update-copyrights	2023-01-06 21:14:39 +00:00
Wilco Dijkstra	e6f3fe362f	aarch64: Use memcpy_simd as the default memcpy Since __memcpy_simd is the fastest memcpy on almost all cores, replace the generic memcpy with it. If SVE is available, a SVE memcpy will be used by default (including for Neoverse N2).	2022-10-26 14:16:50 +01:00
Wilco Dijkstra	a8e72913fe	aarch64: Cleanup memset ifunc Cleanup memset ifunc selectors. The A64FX memset relies on a ZVA size of 256, so add an explicit check.	2022-10-26 14:12:55 +01:00
Adhemerval Zanella	5355f9ca7b	elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support Besides the option being gcc specific, this approach is still fragile and not future proof since we do not know if this will be the only optimization option gcc will add that transforms loops to memset (or any libcall). This patch adds a new header, dl-symbol-redir-ifunc.h, that can b used to redirect the compiler generated libcalls to port the generic memset implementation if required. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2022-10-10 10:32:28 -03:00
Wilco Dijkstra	fdaf78656f	Add bounds check to __libc_ifunc_impl_list Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC redundant and fixes several targets that will write outside the array. To avoid unnecessary large diffs, pass the maximum in the argument 'i' to IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing ones can be updated if desired. Passes buildmanyglibc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-06-10 17:13:29 +01:00
Wilco Dijkstra	eea282d9c6	AArch64: Sort makefile entries Sort makefile entries to reduce conflicts.	2022-06-07 16:58:15 +01:00
Wilco Dijkstra	9f298bfe1f	AArch64: Add SVE memcpy Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE vectors which improves the random memcpy benchmark significantly. Cleanup the memcpy and memmove ifunc selectors.	2022-06-07 16:58:03 +01:00
Wilco Dijkstra	e5fa62b8db	AArch64: Check for SVE in ifuncs [BZ #28744 ] Add a check for SVE in the A64FX ifuncs for memcpy, memset and memmove. This fixes BZ #28744.	2022-01-06 14:36:28 +00:00
Paul Eggert	581c785bf3	Update copyright dates with scripts/update-copyrights I used these shell commands: ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright (cd ../glibc && git commit -am"[this commit message]") and then ignored the output, which consisted lines saying "FOO: warning: copyright statement not found" for each of 7061 files FOO. I then removed trailing white space from math/tgmath.h, support/tst-support-open-dev-null-range.c, and sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following obscure pre-commit check failure diagnostics from Savannah. I don't know why I run into these diagnostics whereas others evidently do not. remote: * 912-#endif remote: * 913: remote: * 914- remote: * error: lines with trailing whitespace found ... remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines	2022-01-01 11:40:24 -08:00
Wilco Dijkstra	b31bd11454	AArch64: Improve A64FX memcpy v2 is a complete rewrite of the A64FX memcpy. Performance is improved by streamlining the code, aligning all large copies and using a single unrolled loop for all sizes. The code size for memcpy and memmove goes down from 1796 bytes to 868 bytes. Performance is better in all cases: bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33% faster for large sizes, bench-memcpy-walk is 25% faster for small sizes and 20% for the largest sizes. The geomean of all tests in bench-memcpy is 5.1% faster, and total time is reduced by 4%. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-12-02 18:36:03 +00:00
Naohiro Tamura	381b29616a	aarch64: Disable A64FX memcpy/memmove BTI unconditionally This patch disables A64FX memcpy/memmove BTI instruction insertion unconditionally such as A64FX memset patch [1] for performance. [1] commit `07b427296b` Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-09-24 13:26:59 +01:00
Naohiro Tamura	1d9f99ce1b	AArch64: Update A64FX memset not to degrade at 16KB This patch updates unroll8 code so as not to degrade at the peak performance 16KB for both FX1000 and FX700. Inserted 2 instructions at the beginning of the unroll8 loop, cmp and branch, are a workaround that is found heuristically. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>	2021-09-06 10:23:24 +01:00
Szabolcs Nagy	f873adf3df	Revert "AArch64: Update A64FX memset not to degrade at 16KB" Because of wrong commit author. Will recommit it with right author. This reverts commit `23777232c2`.	2021-09-06 10:23:25 +01:00
Naohiro Tamura via Libc-alpha	23777232c2	AArch64: Update A64FX memset not to degrade at 16KB This patch updates unroll8 code so as not to degrade at the peak performance 16KB for both FX1000 and FX700. Inserted 2 instructions at the beginning of the unroll8 loop, cmp and branch, are a workaround that is found heuristically. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>	2021-09-03 15:59:46 +01:00
Wilco Dijkstra	a5db6a5cae	[5/5] AArch64: Improve A64FX memset medium loops Simplify the code for memsets smaller than L1. Improve the unroll8 and L1_prefetch loops. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>	2021-08-10 13:46:20 +01:00
Wilco Dijkstra	e69d9981f8	[4/5] AArch64: Improve A64FX memset by removing unroll32 Remove unroll32 code since it doesn't improve performance. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>	2021-08-10 13:44:27 +01:00
Wilco Dijkstra	186092c6ba	[3/5] AArch64: Improve A64FX memset for remaining bytes Simplify handling of remaining bytes. Avoid lots of taken branches and complex whilelo computations, instead unconditionally write vectors from the end. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>	2021-08-10 13:42:07 +01:00
Wilco Dijkstra	9bc2ed8f46	[2/5] AArch64: Improve A64FX memset for large sizes Improve performance of large memsets. Simplify alignment code. For zero memset use DC ZVA, which almost doubles performance. For non-zero memsets use the unroll8 loop which is about 10% faster. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>	2021-08-10 13:39:37 +01:00
Wilco Dijkstra	07b427296b	[1/5] AArch64: Improve A64FX memset for small sizes Improve performance of small memsets by reducing instruction counts and improving code alignment. Bench-memset shows 35-45% performance gain for small sizes. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>	2021-08-10 13:30:27 +01:00
Naohiro Tamura	4f26956d5b	aarch64: Added optimized memset for A64FX This patch optimizes the performance of memset for A64FX [1] which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache per NUMA node. The performance optimization makes use of Scalable Vector Register with several techniques such as loop unrolling, memory access alignment, cache zero fill and prefetch. SVE assembler code for memset is implemented as Vector Length Agnostic code so theoretically it can be run on any SOC which supports ARMv8-A SVE standard. We confirmed that all testcases have been passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. And also we confirmed that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit register and 8 times better than scalar 64 bit register by running 'make bench'. [1] https://github.com/fujitsu/A64FX Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com> Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>	2021-05-27 09:47:53 +01:00
Naohiro Tamura	fa527f345c	aarch64: Added optimized memcpy and memmove for A64FX This patch optimizes the performance of memcpy/memmove for A64FX [1] which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache per NUMA node. The performance optimization makes use of Scalable Vector Register with several techniques such as loop unrolling, memory access alignment, cache zero fill, and software pipelining. SVE assembler code for memcpy/memmove is implemented as Vector Length Agnostic code so theoretically it can be run on any SOC which supports ARMv8-A SVE standard. We confirmed that all testcases have been passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. And also we confirmed that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit register and 8 times better than scalar 64 bit register by running 'make bench'. [1] https://github.com/fujitsu/A64FX Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com> Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>	2021-05-27 09:47:53 +01:00
Szabolcs Nagy	04c6a8073d	aarch64: Fix the list of tested IFUNC variants [BZ #26818 ] Some IFUNC variants are not compatible with BTI and MTE so don't set them as usable for testing and benchmarking on a BTI or MTE enabled system. As far as IFUNC selectors are concerned a system is BTI enabled if the cpu supports it and glibc was built with BTI branch protection. Most IFUNC variants are BTI compatible, but thunderx2 memcpy and memmove use a jump table with indirect jump, without a BTI j. Fixes bug 26818.	2021-01-25 16:15:54 +00:00
Szabolcs Nagy	c3c4a25e65	aarch64: Move and update the definition of MTE_ENABLED The hwcap value is now in linux 5.10 and in glibc bits/hwcap.h, so use that definition. Move the definition to init-arch.h so all ifunc selectors can use it and expose an "mte" shorthand for mte enabled runtime. For now we allow user code to enable tag checks and use PROT_MTE mappings without libc involvment, this is not guaranteed ABI, but can be useful for testing and debugging with MTE.	2021-01-25 15:35:43 +00:00
Shuo Wang	28f2ce2772	aarch64: revert memcpy optimze for kunpeng to avoid performance degradation In commit `863d775c48`, kunpeng920 is added to default memcpy version, however, there is performance degradation when the copy size is some large bytes, eg: 100k. This is the result, tested in glibc-2.28: before backport after backport Performance improvement memcpy_1k 0.005 0.005 0.00% memcpy_10k 0.032 0.029 10.34% memcpy_100k 0.356 0.429 -17.02% memcpy_1m 7.470 11.153 -33.02% This is the demo #include "stdio.h" #include "string.h" #include "stdlib.h" char a[10241024] = {12}; char b[10241024] = {13}; int main(int argc, char argv[]) { int i = atoi(argv[1]); int j; int size = atoi(argv[2]); for (j = 0; j < i; j++) memcpy(b, a, size1024); return 0; } # gcc -g -O0 memcpy.c -o memcpy # time taskset -c 10 ./memcpy 100000 1024 Co-authored-by: liqingqing <liqingqing3@huawei.com>	2021-01-21 16:44:15 +00:00
Paul Eggert	2b778ceb40	Update copyright dates with scripts/update-copyrights I used these shell commands: ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright (cd ../glibc && git commit -am"[this commit message]") and then ignored the output, which consisted lines saying "FOO: warning: copyright statement not found" for each of 6694 files FOO. I then removed trailing white space from benchtests/bench-pthread-locks.c and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this diagnostic from Savannah: remote: * pre-commit check failed ... remote: * error: lines with trailing whitespace found remote: error: hook declined to update refs/heads/master	2021-01-02 12:17:34 -08:00
Szabolcs Nagy	45b1e17e91	aarch64: use PTR_ARG and SIZE_ARG instead of DELOUSE DELOUSE was added to asm code to make them compatible with non-LP64 ABIs, but it is an unfortunate name and the code was not compatible with ABIs where pointer and size_t are different. Glibc currently only supports the LP64 ABI so these macros are not really needed or tested, but for now the name is changed to be more meaningful instead of removing them completely. Some DELOUSE macros were dropped: clone, strlen and strnlen used it unnecessarily. The out of tree ILP32 patches are currently not maintained and will likely need a rework to rebase them on top of the time64 changes.	2020-12-31 16:50:58 +00:00
Szabolcs Nagy	4033f21eb2	aarch64: remove the strlen_asimd symbol This symbol is not in the implementation reserved namespace for static linking and it was never used: it seems it was mistakenly added in the orignal strlen_asimd commit `436e4d5b96`	2020-12-15 14:42:45 +00:00
Wilco Dijkstra	e11ed9d2b4	AArch64: Use __memcpy_simd on Neoverse N2/V1 Add CPU detection of Neoverse N2 and Neoverse V1, and select __memcpy_simd as the memcpy/memmove ifunc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2020-10-14 14:27:50 +01:00
Wilco Dijkstra	bd394d131c	AArch64: Improve backwards memmove performance On some microarchitectures performance of the backwards memmove improves if the stores use STR with decreasing addresses. So change the memmove loop in memcpy_advsimd.S to use 2x STR rather than STP. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2020-08-28 17:51:40 +01:00
Szabolcs Nagy	2dc33b928b	aarch64: Use future HWCAP2_MTE in ifunc resolver Make glibc MTE-safe on systems where MTE is available. This allows using heap tagging with an LD_PRELOADed malloc implementation that enables MTE. We don't document this as guaranteed contract yet, so glibc may not be MTE safe when HWCAP2_MTE is set (older glibcs certainly aren't). This is mainly for testing and debugging. The HWCAP flag is not exposed in public headers until Linux adds it to its uapi. The HWCAP value reservation will be in Linux 5.9.	2020-07-27 12:54:22 +01:00
Wilco Dijkstra	f46ef33ad1	AArch64: Improve strlen_asimd performance (bug 25824) Optimize strlen using a mix of scalar and SIMD code. On modern micro architectures large strings are 2.6 times faster than existing strlen_asimd and 35% faster than the new MTE version of strlen. On a random strlen benchmark using small sizes the speedup is 7% vs strlen_asimd and 40% vs the MTE strlen. This fixes the main strlen regressions on Cortex-A53 and other cores with a simple Neon unit. Rename __strlen_generic to __strlen_mte, and select strlen_asimd when MTE is not enabled (this is waiting on support for a HWCAP_MTE bit). This fixes big-endian bug 25824. Passes GLIBC regression tests. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2020-07-17 15:07:23 +01:00

1 2

84 Commits