Vector registers perform much better for moves compared to pairs of
registers on falkor, so use them instead. This results in a time
reduction of up to 50% (i.e. 2x improvement) for a lot of the smaller
sizes, i.e. up to 1K in memmove-walk. Improvements for larger sizes are
smaller, at about 1%-2%.
* sysdeps/aarch64/multiarch/memmove_falkor.S
(__memcpy_falkor): Use vector registers.
The tail of the copy loops are unable to train the falkor hardware
prefetcher because they load from a different base compared to the hot
loop. In this case avoid serializing the instructions by loading them
into different registers. Also peel the last iteration of the loop
into the tail (and have them use different registers) since it gives
better performance for medium sizes.
This results in performance improvements of between 3% and 20% over
the current falkor implementation for sizes between 128 bytes and 1K
on the memmove-walk benchmark, thus mostly covering the regressions
seen against the generic memmove.
* sysdeps/aarch64/multiarch/memmove_falkor.S
(__memmove_falkor): Use multiple registers to move data in
loop tail.
This is an optimized memmove implementation for the Qualcomm Falkor
processor core. Due to the way the falkor memcpy needs to be written,
code cannot be easily shared between memmove and memcpy like in case
of other aarch64 memcpy implementations due to which this routine is
separate. The underlying principle is the same as that of memcpy
where it tries to use registers with the same lower 4 bits for
fetching the same stream, thus optimizing hardware prefetcher
performance.
The memcpy copy loop copies 64 bytes at a time using the same register
pair since that's the way to train the hardware prefetcher on the
falkor core. memmove cannot quite do that since it needs to avoid
overlaps, so it does the next best thing, i.e. has a 32 byte loop with
a 32 byte end (prefetch a loop ahead to account for overlapping
locations) with register pairs that alias so that they hit the same
prefetcher. Due to this difference in loop size, they have to
currently be separate implementations but efforts are on to try and
get memmove to fall back into memcpy whenever it can without simply
duplicating all of the code.
Performance:
The routine fares around 20-25% better than the generic memmove for
most medium to large sizes (i.e. > 128 bytes) for the new walking
memmove benchmark (memmove-walk) with an unexplained regression
between 1K and 2K. The minor regression is something worth looking
into for us, but the remaining gains are significant enough that we
would like this included upstream as we looking into the cause for the
regression. Here is a snippet of the numbers as generated from the
microbenchmark by the compare_strings script. Comparisons are against
__memmove_generic:
Function: memmove
Variant: walk
__memmove_thunderx __memmove_falkor __memmove_generic
========================================================================================================================
<snip>
length=16384: 12508800.00 ( 6.09%) 11486800.00 ( 13.76%) 13319600.00
length=16400: 13614200.00 ( -0.67%) 11585000.00 ( 14.33%) 13523600.00
length=16385: 13448400.00 ( 0.10%) 11732700.00 ( 12.84%) 13461200.00
length=16399: 13594100.00 ( -0.22%) 11859600.00 ( 12.57%) 13564400.00
length=16386: 13211600.00 ( 1.13%) 11503800.00 ( 13.91%) 13362400.00
length=16398: 13218600.00 ( 2.12%) 11573200.00 ( 14.30%) 13504700.00
length=16387: 13510900.00 ( -0.37%) 11744200.00 ( 12.76%) 13461300.00
length=16397: 13603700.00 ( -0.15%) 11878200.00 ( 12.55%) 13583200.00
length=16388: 13461700.00 ( -0.13%) 11558000.00 ( 14.03%) 13444100.00
length=16396: 13517500.00 ( -0.03%) 11561300.00 ( 14.45%) 13513900.00
length=16389: 13534100.00 ( 0.17%) 11756800.00 ( 13.28%) 13556900.00
length=16395: 13585600.00 ( 0.11%) 11791800.00 ( 13.30%) 13601200.00
length=16390: 13480100.00 ( -0.13%) 11685500.00 ( 13.20%) 13462100.00
length=16394: 13529900.00 ( -0.23%) 11549800.00 ( 14.43%) 13498200.00
length=16391: 13595400.00 ( -0.26%) 11768200.00 ( 13.22%) 13560600.00
length=16393: 13567000.00 ( 0.20%) 11779700.00 ( 13.35%) 13594700.00
length=32768: 71308800.00 ( -6.53%) 50220800.00 ( 24.98%) 66939200.00
length=32784: 72100800.00 (-11.55%) 50114100.00 ( 22.47%) 64636300.00
length=32769: 71767000.00 ( -7.10%) 51238400.00 ( 23.54%) 67010000.00
length=32783: 70113700.00 (-40.95%) 51129000.00 ( -2.78%) 49744400.00
length=32770: 71367600.00 ( -6.52%) 50244700.00 ( 25.01%) 67000900.00
length=32782: 64366700.00 ( 4.71%) 50101400.00 ( 25.83%) 67545600.00
length=32771: 71440100.00 ( -6.51%) 51263900.00 ( 23.57%) 67074900.00
length=32781: 66993000.00 ( 0.34%) 51108300.00 ( 23.97%) 67220300.00
length=32772: 71443900.00 (-60.50%) 50062100.00 (-12.47%) 44512600.00
length=32780: 71759100.00 ( -6.58%) 50263200.00 ( 25.35%) 67328600.00
length=32773: 71714900.00 (-33.21%) 51076600.00 ( 5.12%) 53835400.00
length=32779: 71756900.00 ( -6.56%) 51290800.00 ( 23.83%) 67337800.00
length=32774: 59689300.00 (-34.55%) 50068400.00 (-12.86%) 44363300.00
length=32778: 71847500.00 (-18.20%) 50084100.00 ( 17.61%) 60786500.00
length=32775: 71599300.00 ( -6.54%) 51278200.00 ( 23.70%) 67204800.00
length=32777: 71862900.00 (-60.85%) 51094000.00 (-14.36%) 44677900.00
length=65536: 282848000.00 ( -6.60%) 199187000.00 ( 24.93%) 265325000.00
length=65552: 243285000.00 (-41.61%) 198512000.00 (-15.54%) 171805000.00
length=65537: 255415000.00 (-23.47%) 202499000.00 ( 2.11%) 206858000.00
length=65551: 280122000.00 (-62.95%) 203349000.00 (-18.29%) 171911000.00
length=65538: 283676000.00 (-14.46%) 198368000.00 ( 19.96%) 247848000.00
length=65550: 275566000.00 (-51.76%) 198494000.00 ( -9.31%) 181581000.00
length=65539: 283699000.00 ( -6.58%) 203453000.00 ( 23.57%) 266195000.00
length=65549: 286572000.00 ( -6.65%) 202607000.00 ( 24.60%) 268712000.00
length=65540: 283710000.00 ( -6.59%) 199161000.00 ( 25.17%) 266160000.00
length=65548: 237573000.00 ( 11.48%) 198462000.00 ( 26.06%) 268395000.00
length=65541: 284150000.00 ( -6.58%) 203273000.00 ( 23.75%) 266600000.00
length=65547: 286250000.00 ( -6.70%) 202594000.00 ( 24.48%) 268263000.00
length=65542: 284167000.00 ( -6.60%) 199122000.00 ( 25.31%) 266584000.00
length=65546: 285656000.00 ( -6.59%) 198443000.00 ( 25.95%) 268002000.00
length=65543: 284600000.00 ( -6.58%) 203247000.00 ( 23.89%) 267030000.00
length=65545: 285665000.00 ( -6.40%) 202575000.00 ( 24.55%) 268472000.00
<snip>
* sysdeps/aarch64/multiarch/Makefile (sysdep_routines): Add
memmove_falkor.
* sysdeps/aarch64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Likewise.
* sysdeps/aarch64/multiarch/memmove.c: Likewise.
* sysdeps/aarch64/multiarch/memmove_falkor.S: New file.