glibc/sysdeps/aarch64
Siddhesh Poyarekar dd5bc7f1b3 aarch64: Optimized implementation of memmove for Qualcomm Falkor
This is an optimized memmove implementation for the Qualcomm Falkor
processor core.  Because of the way the falkor memcpy needs to be
written, code cannot easily be shared between memmove and memcpy as
it is in the other aarch64 memcpy implementations, so this routine
is separate.  The underlying principle is the same as that of memcpy:
it tries to use registers with the same lower 4 bits for fetching the
same stream, thus optimizing hardware prefetcher performance.
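
As a rough illustration of the "same lower 4 bits" rule, the check
below compares the low 4 bits of two register numbers.  The helper
name is hypothetical and is not part of glibc, and reading the rule as
applying to the register number is an assumption taken from the
description above:

```c
#include <assert.h>

/* Hypothetical helper, not part of glibc.  Returns 1 when two
   general-purpose register numbers (0..30) share their lower 4 bits,
   which is the aliasing property described above.  */
static int
regs_alias (unsigned int reg_a, unsigned int reg_b)
{
  assert (reg_a <= 30 && reg_b <= 30);
  return (reg_a & 0xf) == (reg_b & 0xf);
}
```

Under this reading, x6 and x22 alias (6 and 22 both have low bits
0110), while x6 and x7 do not.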

The memcpy copy loop copies 64 bytes at a time using the same register
pair since that's the way to train the hardware prefetcher on the
falkor core.  memmove cannot quite do that since it needs to avoid
overlaps, so it does the next best thing, i.e. has a 32 byte loop with
a 32 byte end (prefetch a loop ahead to account for overlapping
locations) with register pairs that alias so that they hit the same
prefetcher.  Because of this difference in loop size, the two
currently have to be separate implementations, but work is under way
to get memmove to fall back to memcpy whenever it can, without simply
duplicating all of the code.
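
The overlap constraint can be sketched in portable C.  This is only a
minimal model, not the Falkor assembly: the function name and the
32-byte staging buffer (standing in for the register pairs) are
assumptions made for illustration, and it says nothing about
prefetcher behaviour.  It shows why each 32-byte block is loaded in
full before it is stored, and why the copy direction depends on how
the buffers overlap:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not the glibc routine.  Copies N bytes from
   SRCV to DSTV, tolerating overlap, in 32-byte blocks.  */
static void *
sketch_memmove (void *dstv, const void *srcv, size_t n)
{
  unsigned char *dst = dstv;
  const unsigned char *src = srcv;
  unsigned char block[32];	/* Stands in for the register pairs.  */

  if (dst == src || n == 0)
    return dstv;

  /* Unsigned wrap-around makes this true when DST is below SRC or
     when DST starts at or beyond the end of the source.  */
  if ((uintptr_t) dst - (uintptr_t) src >= n)
    {
      /* Forward copy is safe.  */
      size_t i = 0;
      for (; i + 32 <= n; i += 32)
	{
	  memcpy (block, src + i, 32);	/* Load the whole block...  */
	  memcpy (dst + i, block, 32);	/* ...then store it.  */
	}
      for (; i < n; i++)
	dst[i] = src[i];
    }
  else
    {
      /* DST lies inside the source range, so copy from the end.  */
      size_t i = n;
      for (; i >= 32; i -= 32)
	{
	  memcpy (block, src + i - 32, 32);
	  memcpy (dst + i - 32, block, 32);
	}
      while (i > 0)
	{
	  i--;
	  dst[i] = src[i];
	}
    }
  return dstv;
}
```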

Performance:

The routine fares around 20-25% better than the generic memmove for
most medium to large sizes (i.e. > 128 bytes) for the new walking
memmove benchmark (memmove-walk) with an unexplained regression
between 1K and 2K.  The minor regression is worth looking into, but
the remaining gains are significant enough that we would like this
included upstream while we look into the cause of the regression.
Here is a snippet of the numbers as generated from the
microbenchmark by the compare_strings script.  Comparisons are against
__memmove_generic:

Function: memmove
Variant: walk
                                    __memmove_thunderx	__memmove_falkor	__memmove_generic
========================================================================================================================
<snip>
                        length=16384:  12508800.00 (  6.09%)	 11486800.00 ( 13.76%)	 13319600.00
                        length=16400:  13614200.00 ( -0.67%)	 11585000.00 ( 14.33%)	 13523600.00
                        length=16385:  13448400.00 (  0.10%)	 11732700.00 ( 12.84%)	 13461200.00
                        length=16399:  13594100.00 ( -0.22%)	 11859600.00 ( 12.57%)	 13564400.00
                        length=16386:  13211600.00 (  1.13%)	 11503800.00 ( 13.91%)	 13362400.00
                        length=16398:  13218600.00 (  2.12%)	 11573200.00 ( 14.30%)	 13504700.00
                        length=16387:  13510900.00 ( -0.37%)	 11744200.00 ( 12.76%)	 13461300.00
                        length=16397:  13603700.00 ( -0.15%)	 11878200.00 ( 12.55%)	 13583200.00
                        length=16388:  13461700.00 ( -0.13%)	 11558000.00 ( 14.03%)	 13444100.00
                        length=16396:  13517500.00 ( -0.03%)	 11561300.00 ( 14.45%)	 13513900.00
                        length=16389:  13534100.00 (  0.17%)	 11756800.00 ( 13.28%)	 13556900.00
                        length=16395:  13585600.00 (  0.11%)	 11791800.00 ( 13.30%)	 13601200.00
                        length=16390:  13480100.00 ( -0.13%)	 11685500.00 ( 13.20%)	 13462100.00
                        length=16394:  13529900.00 ( -0.23%)	 11549800.00 ( 14.43%)	 13498200.00
                        length=16391:  13595400.00 ( -0.26%)	 11768200.00 ( 13.22%)	 13560600.00
                        length=16393:  13567000.00 (  0.20%)	 11779700.00 ( 13.35%)	 13594700.00
                        length=32768:  71308800.00 ( -6.53%)	 50220800.00 ( 24.98%)	 66939200.00
                        length=32784:  72100800.00 (-11.55%)	 50114100.00 ( 22.47%)	 64636300.00
                        length=32769:  71767000.00 ( -7.10%)	 51238400.00 ( 23.54%)	 67010000.00
                        length=32783:  70113700.00 (-40.95%)	 51129000.00 ( -2.78%)	 49744400.00
                        length=32770:  71367600.00 ( -6.52%)	 50244700.00 ( 25.01%)	 67000900.00
                        length=32782:  64366700.00 (  4.71%)	 50101400.00 ( 25.83%)	 67545600.00
                        length=32771:  71440100.00 ( -6.51%)	 51263900.00 ( 23.57%)	 67074900.00
                        length=32781:  66993000.00 (  0.34%)	 51108300.00 ( 23.97%)	 67220300.00
                        length=32772:  71443900.00 (-60.50%)	 50062100.00 (-12.47%)	 44512600.00
                        length=32780:  71759100.00 ( -6.58%)	 50263200.00 ( 25.35%)	 67328600.00
                        length=32773:  71714900.00 (-33.21%)	 51076600.00 (  5.12%)	 53835400.00
                        length=32779:  71756900.00 ( -6.56%)	 51290800.00 ( 23.83%)	 67337800.00
                        length=32774:  59689300.00 (-34.55%)	 50068400.00 (-12.86%)	 44363300.00
                        length=32778:  71847500.00 (-18.20%)	 50084100.00 ( 17.61%)	 60786500.00
                        length=32775:  71599300.00 ( -6.54%)	 51278200.00 ( 23.70%)	 67204800.00
                        length=32777:  71862900.00 (-60.85%)	 51094000.00 (-14.36%)	 44677900.00
                        length=65536: 282848000.00 ( -6.60%)	199187000.00 ( 24.93%)	265325000.00
                        length=65552: 243285000.00 (-41.61%)	198512000.00 (-15.54%)	171805000.00
                        length=65537: 255415000.00 (-23.47%)	202499000.00 (  2.11%)	206858000.00
                        length=65551: 280122000.00 (-62.95%)	203349000.00 (-18.29%)	171911000.00
                        length=65538: 283676000.00 (-14.46%)	198368000.00 ( 19.96%)	247848000.00
                        length=65550: 275566000.00 (-51.76%)	198494000.00 ( -9.31%)	181581000.00
                        length=65539: 283699000.00 ( -6.58%)	203453000.00 ( 23.57%)	266195000.00
                        length=65549: 286572000.00 ( -6.65%)	202607000.00 ( 24.60%)	268712000.00
                        length=65540: 283710000.00 ( -6.59%)	199161000.00 ( 25.17%)	266160000.00
                        length=65548: 237573000.00 ( 11.48%)	198462000.00 ( 26.06%)	268395000.00
                        length=65541: 284150000.00 ( -6.58%)	203273000.00 ( 23.75%)	266600000.00
                        length=65547: 286250000.00 ( -6.70%)	202594000.00 ( 24.48%)	268263000.00
                        length=65542: 284167000.00 ( -6.60%)	199122000.00 ( 25.31%)	266584000.00
                        length=65546: 285656000.00 ( -6.59%)	198443000.00 ( 25.95%)	268002000.00
                        length=65543: 284600000.00 ( -6.58%)	203247000.00 ( 23.89%)	267030000.00
                        length=65545: 285665000.00 ( -6.40%)	202575000.00 ( 24.55%)	268472000.00
<snip>
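
The percentages in parentheses are consistent with the relative
improvement over the __memmove_generic timing on the same row, where
a positive value means faster than generic.  The formula below is
inferred from the numbers in the table rather than taken from the
compare_strings script itself, and the function name is hypothetical:

```c
#include <assert.h>

/* Hypothetical helper.  Relative improvement of TIMING over BASELINE
   in percent; positive means TIMING is lower (faster).  */
static double
improvement_pct (double baseline, double timing)
{
  assert (baseline > 0.0);
  return (baseline - timing) / baseline * 100.0;
}
```

For the length=16384 row, improvement_pct (13319600.0, 11486800.0)
gives about 13.76, matching the __memmove_falkor column.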

	* sysdeps/aarch64/multiarch/Makefile (sysdep_routines): Add
	memmove_falkor.
	* sysdeps/aarch64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Likewise.
	* sysdeps/aarch64/multiarch/memmove.c: Likewise.
	* sysdeps/aarch64/multiarch/memmove_falkor.S: New file.
2017-10-05 22:20:23 +05:30
bits Remove bits/string.h. 2017-06-20 08:21:24 -04:00
fpu Optimized generic expf and exp2f with wrappers 2017-09-25 10:44:39 +01:00
multiarch aarch64: Optimized implementation of memmove for Qualcomm Falkor 2017-10-05 22:20:23 +05:30
nptl Optimize generic spinlock code and use C11 like atomic macros. 2017-06-06 09:41:56 +02:00
soft-fp Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
__longjmp.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
abort-instr.h
atomic-machine.h Optimize generic spinlock code and use C11 like atomic macros. 2017-06-06 09:41:56 +02:00
backtrace.c
bsd-_setjmp.S
bsd-setjmp.S
configure Require autoconf 2.69 2014-09-29 07:53:36 -07:00
configure.ac
crti.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
crtn.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
dl-irel.h Change argument type passed to ifunc resolvers 2017-08-31 10:34:55 -07:00
dl-link.sym
dl-machine.h aarch64: don't use MIN in dl-machine.h 2017-10-04 17:49:38 +01:00
dl-sysdep.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
dl-tls.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
dl-tlsdesc.h elf: Remove internal_function attribute 2017-08-31 16:59:37 +02:00
dl-tlsdesc.S aarch64: Use PTR_REG macro to fix ILP32 bug and make code consistent 2017-08-22 16:22:05 -07:00
dl-trampoline.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
dl-tunables.list tunables, aarch64: New tunable to override cpu 2017-06-30 22:58:39 +05:30
Implies
jmpbuf-offsets.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
jmpbuf-unwind.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
ldsodefs.h Add ifunc support for aarch64. 2017-03-15 16:46:26 -07:00
libc-tls.c Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
libm-test-ulps AArch64: update libm-test-ulps 2017-09-28 15:28:46 +01:00
libm-test-ulps-name Do not hardcode platform names in manual/libm-err-tab.pl (bug 14139). 2016-11-04 16:49:06 +00:00
linkmap.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
machine-gmon.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
Makefile Enable unwind info in libc-start.c and backtrace.c 2017-09-19 15:07:58 +01:00
math-tests.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
mcount.c Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
memchr.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
memcmp.S [AArch64] Optimized memcmp. 2017-08-10 17:00:38 +01:00
memcpy.S aarch64: Thunderx specific memcpy and memmove 2017-05-24 16:46:48 -07:00
memmove.S This is an optimized memcpy/memmove for AArch64. Copies are split into 3 main 2016-06-20 17:41:33 +01:00
memset.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
memusage.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
preconfigure Remove relro configure test. 2014-06-27 16:51:22 +00:00
rawmemchr.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
setjmp.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
sotruss-lib.c Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
stackinfo.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
start.S AArch64: use movz/movk instead of literal pools in start.S 2017-09-18 18:15:47 +01:00
stpcpy.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
strchr.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
strchrnul.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
strcmp.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
strcpy.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
string_private.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
strlen.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
strncmp.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
strnlen.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
strrchr.S Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
sysdep.h AArch64: use movz/movk instead of literal pools in start.S 2017-09-18 18:15:47 +01:00
tls-macros.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
tlsdesc.c elf: Remove internal_function attribute 2017-08-31 16:59:37 +02:00
tlsdesc.sym aarch64: Use explicit offsets in _dl_tlsdesc_dynamic 2016-12-02 16:52:57 +01:00
tst-audit.h Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
Versions