Introduce an Arm MTE compatible strlen implementation.
The existing implementation assumes that any access to the pages in
which the string resides is safe. This assumption is not true when
MTE is enabled. This patch updates the algorithm to ensure that
accesses remain within the bounds of an MTE tag (16-byte chunks) and
improves overall performance on modern cores. On cores with less
efficient Advanced SIMD implementation such as Cortex-A53 it can
be slower.
Benchmarked on Cortex-A72, Cortex-A53, Neoverse N1.
Co-authored-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
This variant of strlen uses vector loads and operations to reduce the
size of the code and also eliminate the non-ascii fallback. This
works very well for falkor because of its two vector units and
efficient vector ops. In the best case it reduces latency of cases in
bench-strlen by 48%, with gains throughout the benchmark.
strlen-walk also sees uniform gains in the 5%-15% range.
Overall the routine appears to work better than the stock one for falkor
regardless of the benchmark, length of string or cache state.
The same cannot be said of a53 and a72 though. a53 performance was
greatly reduced and for a72 it was a bit of a mixed bag, slightly on the
negative side but I reckon it might be fast in some situations.
* sysdeps/aarch64/strlen.S (__strlen): Rename to STRLEN.
[!STRLEN](STRLEN): Set to __strlen.
* sysdeps/aarch64/multiarch/strlen.c: New file.
* sysdeps/aarch64/multiarch/strlen_generic.S: Likewise.
* sysdeps/aarch64/multiarch/strlen_asimd.S: Likewise.
* sysdeps/aarch64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add strlen.
* sysdeps/aarch64/multiarch/Makefile (sysdep_routines): Add
strlen_generic and strlen_asimd.
Reviewed-By: szabolcs.nagy@arm.com
CC: pinskia@gmail.com
MIN_PAGE_SIZE is normally set to 4096 but for testing it can be set to
16 so that it exercises the page crossing code for every misaligned
access. The value was set to 15, which is obviously wrong, so fixed
as obvious and tested.
* sysdeps/aarch64/strlen.S [TEST_PAGE_CROSS](MIN_PAGE_SIZE):
Fix value.
is the fastest way to search for '\0'. Otherwise use memchr with an infinite
size. This is 3x faster on benchtests for large sizes. Passes GLIBC tests.
* sysdeps/aarch64/rawmemchr.S (__rawmemchr): New file.
* sysdeps/aarch64/strlen.S (__strlen): Change to __strlen to avoid PLT.
for nul bytes which reverts to separate loop when a non-ASCII char is encountered.
Speedup on test-strlen is ~10%, long ASCII strings are processed ~60% faster,
and on random tests it is ~80% better.
This patch moves the AArch64 port to the main sysdeps hierarchy. The
move is essentially:
git mv ports/sysdeps/aarch64 sysdeps/aarch64
git mv ports/sysdeps/unix/sysv/linux/aarch64 sysdeps/unix/sysv/linux/aarch64
The README is updated and I've updated ChangeLog.aarch64 along the
lines of the ARM move. The AArch64 build has been tested to confirm
that there were no changes in objdump -dr output or the shared
objects.