Commit Graph

8 Commits

Author SHA1 Message Date
Szabolcs Nagy
f873adf3df Revert "AArch64: Update A64FX memset not to degrade at 16KB"
Because of wrong commit author. Will recommit it with right author.

This reverts commit 23777232c2.
2021-09-06 10:23:25 +01:00
Naohiro Tamura via Libc-alpha
23777232c2 AArch64: Update A64FX memset not to degrade at 16KB
This patch updates unroll8 code so as not to degrade at the peak
performance 16KB for both FX1000 and FX700.

Inserted 2 instructions at the beginning of the unroll8 loop,
cmp and branch, are a workaround that is found heuristically.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2021-09-03 15:59:46 +01:00
Wilco Dijkstra
a5db6a5cae [5/5] AArch64: Improve A64FX memset medium loops
Simplify the code for memsets smaller than L1. Improve the unroll8 and
L1_prefetch loops.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:46:20 +01:00
Wilco Dijkstra
e69d9981f8 [4/5] AArch64: Improve A64FX memset by removing unroll32
Remove unroll32 code since it doesn't improve performance.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:44:27 +01:00
Wilco Dijkstra
186092c6ba [3/5] AArch64: Improve A64FX memset for remaining bytes
Simplify handling of remaining bytes. Avoid lots of taken branches and complex
whilelo computations, instead unconditionally write vectors from the end.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:42:07 +01:00
Wilco Dijkstra
9bc2ed8f46 [2/5] AArch64: Improve A64FX memset for large sizes
Improve performance of large memsets. Simplify alignment code. For zero memset
use DC ZVA, which almost doubles performance. For non-zero memsets use the
unroll8 loop which is about 10% faster.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:39:37 +01:00
Wilco Dijkstra
07b427296b [1/5] AArch64: Improve A64FX memset for small sizes
Improve performance of small memsets by reducing instruction counts and
improving code alignment. Bench-memset shows 35-45% performance gain for
small sizes.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:30:27 +01:00
Naohiro Tamura
4f26956d5b aarch64: Added optimized memset for A64FX
This patch optimizes the performance of memset for A64FX [1] which
implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache
per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill and prefetch.

SVE assembler code for memset is implemented as Vector Length Agnostic
code so theoretically it can be run on any SOC which supports ARMv8-A
SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-05-27 09:47:53 +01:00