Add CPU detection of Neoverse N2 and Neoverse V1, and select __memcpy_simd as
the memcpy/memmove ifunc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add a new memcpy using 128-bit Q registers - this is faster on modern
cores and reduces codesize. Similar to the generic memcpy, small cases
include copies up to 32 bytes. 64-128 byte copies are split into two
cases to improve performance of 64-96 byte copies. Large copies align
the source rather than the destination.
bench-memcpy-random is ~9% faster than memcpy_falkor on Neoverse N1,
so make this memcpy the default on N1 (on Centriq it is 15% faster than
memcpy_falkor).
Passes GLIBC regression tests.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
The performance improvement is about 20%-30% for
larger cases and about 1%-5% for smaller cases.
Used SIMD load/store instead of GPR for large
overlapping forward moves.
Reused existing memcpy implementation for smaller
or overlapping backward moves.
Fixed the existing memcpy implementation to allow it
to deal with the overlapping case.
Simplified loop tails in the memcpy implementation -
use branchless overlapping sequence of fixed length
load/stores instead of branching depending on the
size.
A cleanup/optimization converting str's to stp's.
Added __memmove_thunderx2 to the list of the
available implementations.
This is an optimized memmove implementation for the Qualcomm Falkor
processor core. Due to the way the falkor memcpy needs to be written,
code cannot be easily shared between memmove and memcpy like in case
of other aarch64 memcpy implementations due to which this routine is
separate. The underlying principle is the same as that of memcpy
where it tries to use registers with the same lower 4 bits for
fetching the same stream, thus optimizing hardware prefetcher
performance.
The memcpy copy loop copies 64 bytes at a time using the same register
pair since that's the way to train the hardware prefetcher on the
falkor core. memmove cannot quite do that since it needs to avoid
overlaps, so it does the next best thing, i.e. has a 32 byte loop with
a 32 byte end (prefetch a loop ahead to account for overlapping
locations) with register pairs that alias so that they hit the same
prefetcher. Due to this difference in loop size, they have to
currently be separate implementations but efforts are on to try and
get memmove to fall back into memcpy whenever it can without simply
duplicating all of the code.
Performance:
The routine fares around 20-25% better than the generic memmove for
most medium to large sizes (i.e. > 128 bytes) for the new walking
memmove benchmark (memmove-walk) with an unexplained regression
between 1K and 2K. The minor regression is something worth looking
into for us, but the remaining gains are significant enough that we
would like this included upstream as we looking into the cause for the
regression. Here is a snippet of the numbers as generated from the
microbenchmark by the compare_strings script. Comparisons are against
__memmove_generic:
Function: memmove
Variant: walk
__memmove_thunderx __memmove_falkor __memmove_generic
========================================================================================================================
<snip>
length=16384: 12508800.00 ( 6.09%) 11486800.00 ( 13.76%) 13319600.00
length=16400: 13614200.00 ( -0.67%) 11585000.00 ( 14.33%) 13523600.00
length=16385: 13448400.00 ( 0.10%) 11732700.00 ( 12.84%) 13461200.00
length=16399: 13594100.00 ( -0.22%) 11859600.00 ( 12.57%) 13564400.00
length=16386: 13211600.00 ( 1.13%) 11503800.00 ( 13.91%) 13362400.00
length=16398: 13218600.00 ( 2.12%) 11573200.00 ( 14.30%) 13504700.00
length=16387: 13510900.00 ( -0.37%) 11744200.00 ( 12.76%) 13461300.00
length=16397: 13603700.00 ( -0.15%) 11878200.00 ( 12.55%) 13583200.00
length=16388: 13461700.00 ( -0.13%) 11558000.00 ( 14.03%) 13444100.00
length=16396: 13517500.00 ( -0.03%) 11561300.00 ( 14.45%) 13513900.00
length=16389: 13534100.00 ( 0.17%) 11756800.00 ( 13.28%) 13556900.00
length=16395: 13585600.00 ( 0.11%) 11791800.00 ( 13.30%) 13601200.00
length=16390: 13480100.00 ( -0.13%) 11685500.00 ( 13.20%) 13462100.00
length=16394: 13529900.00 ( -0.23%) 11549800.00 ( 14.43%) 13498200.00
length=16391: 13595400.00 ( -0.26%) 11768200.00 ( 13.22%) 13560600.00
length=16393: 13567000.00 ( 0.20%) 11779700.00 ( 13.35%) 13594700.00
length=32768: 71308800.00 ( -6.53%) 50220800.00 ( 24.98%) 66939200.00
length=32784: 72100800.00 (-11.55%) 50114100.00 ( 22.47%) 64636300.00
length=32769: 71767000.00 ( -7.10%) 51238400.00 ( 23.54%) 67010000.00
length=32783: 70113700.00 (-40.95%) 51129000.00 ( -2.78%) 49744400.00
length=32770: 71367600.00 ( -6.52%) 50244700.00 ( 25.01%) 67000900.00
length=32782: 64366700.00 ( 4.71%) 50101400.00 ( 25.83%) 67545600.00
length=32771: 71440100.00 ( -6.51%) 51263900.00 ( 23.57%) 67074900.00
length=32781: 66993000.00 ( 0.34%) 51108300.00 ( 23.97%) 67220300.00
length=32772: 71443900.00 (-60.50%) 50062100.00 (-12.47%) 44512600.00
length=32780: 71759100.00 ( -6.58%) 50263200.00 ( 25.35%) 67328600.00
length=32773: 71714900.00 (-33.21%) 51076600.00 ( 5.12%) 53835400.00
length=32779: 71756900.00 ( -6.56%) 51290800.00 ( 23.83%) 67337800.00
length=32774: 59689300.00 (-34.55%) 50068400.00 (-12.86%) 44363300.00
length=32778: 71847500.00 (-18.20%) 50084100.00 ( 17.61%) 60786500.00
length=32775: 71599300.00 ( -6.54%) 51278200.00 ( 23.70%) 67204800.00
length=32777: 71862900.00 (-60.85%) 51094000.00 (-14.36%) 44677900.00
length=65536: 282848000.00 ( -6.60%) 199187000.00 ( 24.93%) 265325000.00
length=65552: 243285000.00 (-41.61%) 198512000.00 (-15.54%) 171805000.00
length=65537: 255415000.00 (-23.47%) 202499000.00 ( 2.11%) 206858000.00
length=65551: 280122000.00 (-62.95%) 203349000.00 (-18.29%) 171911000.00
length=65538: 283676000.00 (-14.46%) 198368000.00 ( 19.96%) 247848000.00
length=65550: 275566000.00 (-51.76%) 198494000.00 ( -9.31%) 181581000.00
length=65539: 283699000.00 ( -6.58%) 203453000.00 ( 23.57%) 266195000.00
length=65549: 286572000.00 ( -6.65%) 202607000.00 ( 24.60%) 268712000.00
length=65540: 283710000.00 ( -6.59%) 199161000.00 ( 25.17%) 266160000.00
length=65548: 237573000.00 ( 11.48%) 198462000.00 ( 26.06%) 268395000.00
length=65541: 284150000.00 ( -6.58%) 203273000.00 ( 23.75%) 266600000.00
length=65547: 286250000.00 ( -6.70%) 202594000.00 ( 24.48%) 268263000.00
length=65542: 284167000.00 ( -6.60%) 199122000.00 ( 25.31%) 266584000.00
length=65546: 285656000.00 ( -6.59%) 198443000.00 ( 25.95%) 268002000.00
length=65543: 284600000.00 ( -6.58%) 203247000.00 ( 23.89%) 267030000.00
length=65545: 285665000.00 ( -6.40%) 202575000.00 ( 24.55%) 268472000.00
<snip>
* sysdeps/aarch64/multiarch/Makefile (sysdep_routines): Add
memmove_falkor.
* sysdeps/aarch64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Likewise.
* sysdeps/aarch64/multiarch/memmove.c: Likewise.
* sysdeps/aarch64/multiarch/memmove_falkor.S: New file.