glibc

AuroraMiddleware/glibc

Fork 0

mirror of https://sourceware.org/git/glibc.git synced 2024-11-30 00:31:08 +00:00

Commit Graph

Author SHA1 Message Date

Author	SHA1	Message	Date
Naohiro Tamura	381b29616a	aarch64: Disable A64FX memcpy/memmove BTI unconditionally This patch disables A64FX memcpy/memmove BTI instruction insertion unconditionally such as A64FX memset patch [1] for performance. [1] commit `07b427296b` Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-09-24 13:26:59 +01:00
Naohiro Tamura	fa527f345c	aarch64: Added optimized memcpy and memmove for A64FX This patch optimizes the performance of memcpy/memmove for A64FX [1] which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache per NUMA node. The performance optimization makes use of Scalable Vector Register with several techniques such as loop unrolling, memory access alignment, cache zero fill, and software pipelining. SVE assembler code for memcpy/memmove is implemented as Vector Length Agnostic code so theoretically it can be run on any SOC which supports ARMv8-A SVE standard. We confirmed that all testcases have been passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. And also we confirmed that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit register and 8 times better than scalar 64 bit register by running 'make bench'. [1] https://github.com/fujitsu/A64FX Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com> Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>	2021-05-27 09:47:53 +01:00

Naohiro Tamura

381b29616a

aarch64: Disable A64FX memcpy/memmove BTI unconditionally

This patch disables A64FX memcpy/memmove BTI instruction insertion
unconditionally such as A64FX memset patch [1] for performance.

[1] commit 07b427296b

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

2021-09-24 13:26:59 +01:00

Naohiro Tamura

fa527f345c

aarch64: Added optimized memcpy and memmove for A64FX

This patch optimizes the performance of memcpy/memmove for A64FX [1]
which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
cache per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and software pipelining.

SVE assembler code for memcpy/memmove is implemented as Vector Length
Agnostic code so theoretically it can be run on any SOC which supports
ARMv8-A SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>

2021-05-27 09:47:53 +01:00

2 Commits