This patch updates the unroll8 code so that performance no longer
degrades at the 16KB peak on either FX1000 or FX700.
The two instructions inserted at the beginning of the unroll8 loop,
a cmp and a branch, are a workaround that was found heuristically.
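As an illustration only (the actual change is in SVE assembler, and
the exact registers and labels differ), here is a C sketch of the
resulting loop shape, with the guard check sitting at the top of each
iteration; the function and variable names are hypothetical:

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    static void
    unroll8_sketch (uint8_t *dst, uint8_t c, size_t count)
    {
      svbool_t pg = svptrue_b8 ();
      svuint8_t v = svdup_n_u8 (c);
      size_t vl = svcntb ();    /* SVE vector length in bytes */
      size_t step = vl * 8;     /* bytes written per unrolled iteration */

      /* The inserted cmp/branch corresponds to this loop-top check:
         leave for the tail path before starting an iteration that
         would overrun the remaining bytes.  */
      while (count >= step)
        {
          for (int i = 0; i < 8; i++)   /* 8 full-vector stores */
            svst1_u8 (pg, dst + i * vl, v);
          dst += step;
          count -= step;
        }
      /* Fewer than step bytes remain; the tail code (not shown)
         finishes the job.  */
    }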
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Simplify handling of the remaining bytes. Avoid many taken branches and
complex whilelo computations; instead, unconditionally write vectors from
the end.
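A C sketch of the idea, assuming at least four vectors' worth of
buffer so the earliest store cannot underrun dst (the constant 4 and
the names are illustrative, not the exact code):

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    static void
    tail_from_end_sketch (uint8_t *dst, svuint8_t v, size_t count)
    {
      svbool_t pg = svptrue_b8 ();
      size_t vl = svcntb ();
      uint8_t *dstend = dst + count;

      /* Unconditional full-vector stores anchored at the end of the
         buffer.  They may overlap each other and earlier stores,
         which is harmless because every byte is written with the same
         value, and it avoids per-remainder whilelo predicates and
         taken branches.  */
      svst1_u8 (pg, dstend - 4 * vl, v);
      svst1_u8 (pg, dstend - 3 * vl, v);
      svst1_u8 (pg, dstend - 2 * vl, v);
      svst1_u8 (pg, dstend - vl, v);
    }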
Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
Improve performance of large memsets. Simplify the alignment code. For
zero memset use DC ZVA, which almost doubles performance. For non-zero
memsets use the unroll8 loop, which is about 10% faster.
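For reference, a minimal C sketch of the DC ZVA loop, assuming a
64-byte zero block (real code should derive the block size from
DCZID_EL0 and check that DC ZVA is enabled):

    #include <stddef.h>
    #include <stdint.h>

    #define ZVA_SIZE 64   /* assumed block size for this sketch */

    static void
    dc_zva_sketch (uint8_t *dst, size_t count)
    {
      /* Caller must have aligned dst to ZVA_SIZE and must clear the
         unaligned head and tail with ordinary stores.  Each DC ZVA
         zeroes one block without reading it first, which is why it is
         roughly twice as fast as a store loop for zero memset.  */
      while (count >= ZVA_SIZE)
        {
          __asm__ volatile ("dc zva, %0" : : "r" (dst) : "memory");
          dst += ZVA_SIZE;
          count -= ZVA_SIZE;
        }
    }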
Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
Improve performance of small memsets by reducing instruction counts and
improving code alignment. Bench-memset shows a 35-45% performance gain
for small sizes.
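One way instruction counts drop for small sizes is to replace loops
and branch ladders with at most two stores; a hedged C sketch of such
a small-size path (the size classes and names are illustrative, not
the actual code):

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    static void
    small_memset_sketch (uint8_t *dst, uint8_t c, size_t count)
    {
      svuint8_t v = svdup_n_u8 (c);
      size_t vl = svcntb ();

      if (count <= vl)
        {
          /* One predicated store covers 0..vl bytes.  */
          svst1_u8 (svwhilelt_b8_u64 (0, count), dst, v);
          return;
        }
      if (count <= 2 * vl)
        {
          /* Two overlapping full-vector stores cover up to 2*vl.  */
          svbool_t pg = svptrue_b8 ();
          svst1_u8 (pg, dst, v);
          svst1_u8 (pg, dst + count - vl, v);
          return;
        }
      /* Larger sizes continue to the medium/large paths (not shown).  */
    }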
Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
This patch optimizes the performance of memset for the A64FX [1], which
implements ARMv8-A SVE and has a 64KB L1 cache per core and an 8MB L2
cache per NUMA node.
The optimization makes use of the Scalable Vector Registers together
with several techniques such as loop unrolling, memory access alignment,
cache zero fill, and prefetch.
The SVE assembler code for memset is implemented as Vector Length
Agnostic (VLA) code, so in theory it can run on any SoC that supports
the ARMv8-A SVE standard.
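To show what Vector Length Agnostic means in practice, here is a
minimal C sketch using the SVE ACLE intrinsics (not the actual
assembler): svcntb() and the whilelt predicate adapt at run time, so
the same code is correct for any vector length.

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    static void
    vla_memset_sketch (uint8_t *dst, uint8_t c, size_t count)
    {
      svuint8_t v = svdup_n_u8 (c);
      for (size_t i = 0; i < count; i += svcntb ())
        svst1_u8 (svwhilelt_b8_u64 (i, count), dst + i, v);
    }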
We confirmed that all test cases pass by running 'make check' and
'make xcheck', not only on A64FX but also on ThunderX2.
By running 'make bench' we also confirmed that performance with the
512-bit SVE vector registers is roughly 4 times better than with
128-bit Advanced SIMD registers and 8 times better than with scalar
64-bit registers.
[1] https://github.com/fujitsu/A64FX
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>