glibc/sysdeps/x86_64/multiarch
H.J. Lu a057f5f8cd X86-64: Use non-temporal store in memcpy on large data
The large memcpy micro benchmark in glibc shows a regression with
large data on Haswell machines.  Using non-temporal stores in memcpy
on large data can improve performance significantly.  This patch adds
a threshold for using non-temporal stores, set to 6 times the shared
cache size.  When the size is above the threshold, non-temporal stores
are used, except when the destination and source overlap, since the
destination may be in cache when the source is loaded.
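
As a rough illustration of the rule described above, here is a minimal
C sketch of the decision.  The function names and the simple
address-range overlap test are assumptions made for this example; the
actual logic is assembly in memmove-vec-unaligned-erms.S and may test
for overlap differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for __x86_shared_non_temporal_threshold:
   6 times the shared cache size, as stated in the commit message.  */
static size_t non_temporal_threshold(size_t shared_cache_size)
{
    assert(shared_cache_size <= SIZE_MAX / 6);  /* guard against overflow */
    return shared_cache_size * 6;
}

/* Assumed overlap test: the two n-byte regions share at least one byte.  */
static bool regions_overlap(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    return d < s ? (s - d) < n : (d - s) < n;
}

/* Non-temporal stores only for copies above the threshold whose
   destination and source do not overlap.  */
static bool use_non_temporal_store(const void *dst, const void *src,
                                   size_t n, size_t shared_cache_size)
{
    return n > non_temporal_threshold(shared_cache_size)
           && !regions_overlap(dst, src, n);
}
```

With an 8 MiB shared cache, for example, this sketch would only
consider non-temporal stores for copies larger than 48 MiB.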

For sizes below 8 vector register widths, we load all data into
registers and store them together.  Only forward and backward loops,
which move 4 vector registers at a time, are used to support
overlapping addresses.  For the forward loop, we load the last 4 vector
register widths of data and the first vector register width of data
into vector registers before the loop and store them after the loop.
For the backward loop, we load the first 4 vector register widths of
data and the last vector register width of data into vector registers
before the loop and store them after the loop.
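
The forward loop described above can be sketched in portable C.  This
is an illustration under stated assumptions, not the glibc code: a
16-byte struct (vec_t) stands in for a vector register, VEC, load,
store and forward_copy are hypothetical names, and the step that
advances to a VEC-aligned destination is an assumed reason for saving
the first vector.  The sketch requires n > 8 * VEC and a destination
that lies below the source or does not overlap it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VEC = 16 };                      /* assumed vector width in bytes */

typedef struct { unsigned char b[VEC]; } vec_t;  /* stand-in for a register */

static vec_t load(const unsigned char *p)
{
    vec_t v;
    memcpy(v.b, p, VEC);
    return v;
}

static void store(unsigned char *p, vec_t v)
{
    memcpy(p, v.b, VEC);
}

static void forward_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n > 8 * VEC);

    /* Load the first vector and the last 4 vectors before the loop, so
       later stores cannot clobber source bytes that are still needed.  */
    vec_t head = load(src);
    vec_t t0 = load(src + n - VEC);
    vec_t t1 = load(src + n - 2 * VEC);
    vec_t t2 = load(src + n - 3 * VEC);
    vec_t t3 = load(src + n - 4 * VEC);
    unsigned char *dst_start = dst;
    unsigned char *dst_end = dst + n;

    /* Skip 1..VEC bytes so the loop stores to a VEC-aligned address;
       the skipped bytes are covered by the saved head vector.  */
    size_t skip = VEC - ((uintptr_t)dst & (VEC - 1));
    src += skip;
    dst += skip;
    n -= skip;

    /* Move 4 vectors per iteration: all loads happen before any store.  */
    while (n > 4 * VEC) {
        vec_t a = load(src);
        vec_t b = load(src + VEC);
        vec_t c = load(src + 2 * VEC);
        vec_t d = load(src + 3 * VEC);
        store(dst, a);
        store(dst + VEC, b);
        store(dst + 2 * VEC, c);
        store(dst + 3 * VEC, d);
        src += 4 * VEC;
        dst += 4 * VEC;
        n -= 4 * VEC;
    }

    /* Store the saved tail and head after the loop.  */
    store(dst_end - VEC, t0);
    store(dst_end - 2 * VEC, t1);
    store(dst_end - 3 * VEC, t2);
    store(dst_end - 4 * VEC, t3);
    store(dst_start, head);
}
```

The backward loop mirrors this: it saves the first 4 vectors and the
last vector, then walks from the end of the buffers toward the start.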

	[BZ #19928]
	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
	New.
	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6
	times of shared cache size.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
	(VMOVNT): New.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
	(VMOVNT): Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
	(VMOVNT): Likewise.
	(VMOVU): Changed to movups for smaller code sizes.
	(VMOVA): Changed to movaps for smaller code sizes.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
	comments.
	(PREFETCH): New.
	(PREFETCH_SIZE): Likewise.
	(PREFETCHED_LOAD_SIZE): Likewise.
	(PREFETCH_ONE_SET): Likewise.
	Rewrite to use forward and backward loops, which move 4 vector
	registers at a time, to support overlapping addresses and use
	non temporal store if size is above the threshold and there is
	no overlap between destination and source.
2016-04-12 08:10:47 -07:00
bcopy.S Use IFUNC memmove/memset in x86-64 bcopy/bzero 2012-10-11 13:58:16 -07:00
ifunc-defines.sym Add _dl_x86_cpu_features to rtld_global 2015-08-13 03:41:22 -07:00
ifunc-impl-list.c Add x86-64 memset with unaligned store and rep stosb 2016-03-31 10:06:07 -07:00
Makefile Add x86-64 memset with unaligned store and rep stosb 2016-03-31 10:06:07 -07:00
memcmp-sse4.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
memcmp-ssse3.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
memcmp.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
memcpy_chk.S Fixed typos in __memcpy_chk. 2016-01-16 14:42:26 +03:00
memcpy-avx-unaligned.S Implement x86-64 multiarch mempcpy in memcpy 2016-03-28 13:13:51 -07:00
memcpy-sse2-unaligned.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
memcpy-ssse3-back.S Implement x86-64 multiarch mempcpy in memcpy 2016-03-28 13:13:51 -07:00
memcpy-ssse3.S Implement x86-64 multiarch mempcpy in memcpy 2016-03-28 13:13:51 -07:00
memcpy.S [x86] Add a feature bit: Fast_Unaligned_Copy 2016-03-28 04:40:03 -07:00
memmove_chk.c Added memcpy/memmove family optimized with AVX512 for KNL hardware. 2016-01-16 00:49:45 +03:00
memmove-avx512-no-vzeroupper.S Make __memcpy_avx512_no_vzeroupper an alias 2016-03-28 13:16:22 -07:00
memmove-avx512-unaligned-erms.S X86-64: Use non-temporal store in memcpy on large data 2016-04-12 08:10:47 -07:00
memmove-avx-unaligned-erms.S X86-64: Use non-temporal store in memcpy on large data 2016-04-12 08:10:47 -07:00
memmove-avx-unaligned.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
memmove-sse2-unaligned-erms.S X86-64: Use non-temporal store in memcpy on large data 2016-04-12 08:10:47 -07:00
memmove-ssse3-back.S Improve 64bit memcpy/memmove for Atom, Core 2 and Core i7 2010-06-30 08:26:11 -07:00
memmove-ssse3.S Improve 64bit memcpy/memmove for Atom, Core 2 and Core i7 2010-06-30 08:26:11 -07:00
memmove-vec-unaligned-erms.S X86-64: Use non-temporal store in memcpy on large data 2016-04-12 08:10:47 -07:00
memmove.c Added memcpy/memmove family optimized with AVX512 for KNL hardware. 2016-01-16 00:49:45 +03:00
mempcpy_chk.S Added memcpy/memmove family optimized with AVX512 for KNL hardware. 2016-01-16 00:49:45 +03:00
mempcpy.S Added memcpy/memmove family optimized with AVX512 for KNL hardware. 2016-01-16 00:49:45 +03:00
memset_chk.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
memset-avx2-unaligned-erms.S Don't put SSE2/AVX/AVX512 memmove/memset in ld.so 2016-04-03 14:35:38 -07:00
memset-avx2.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
memset-avx512-no-vzeroupper.S Group AVX512 functions in .text.avx512 section 2016-03-06 16:48:11 -08:00
memset-avx512-unaligned-erms.S Don't put SSE2/AVX/AVX512 memmove/memset in ld.so 2016-04-03 14:35:38 -07:00
memset-sse2-unaligned-erms.S Add a comment in memset-sse2-unaligned-erms.S 2016-04-05 05:19:18 -07:00
memset-vec-unaligned-erms.S X86-64: Prepare memset-vec-unaligned-erms.S 2016-04-06 09:10:35 -07:00
memset.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
sched_cpucount.c Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
stpcpy-sse2-unaligned.S Improved st{r,p}{,n}cpy for SSE2 and SSSE3 on x86-64 2011-06-24 15:14:22 -04:00
stpcpy-ssse3.S Improved st{r,p}{,n}cpy for SSE2 and SSSE3 on x86-64 2011-06-24 15:14:22 -04:00
stpcpy.S Add x86-64 __libc_ifunc_impl_list 2012-10-11 16:41:12 -07:00
stpncpy-c.c SSSE3 strcpy/stpcpy for x86-64 2009-07-02 03:39:03 -07:00
stpncpy-sse2-unaligned.S Improved st{r,p}{,n}cpy for SSE2 and SSSE3 on x86-64 2011-06-24 15:14:22 -04:00
stpncpy-ssse3.S Improved st{r,p}{,n}cpy for SSE2 and SSSE3 on x86-64 2011-06-24 15:14:22 -04:00
stpncpy.S Add x86-64 __libc_ifunc_impl_list 2012-10-11 16:41:12 -07:00
strcasecmp_l-ssse3.S Fix x86-64 build without multiarch. 2010-08-14 14:56:32 -07:00
strcasecmp_l.S Add x86-64 __libc_ifunc_impl_list 2012-10-11 16:41:12 -07:00
strcat-sse2-unaligned.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcat-ssse3.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcat.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strchr-sse2-no-bsf.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strchr.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcmp-sse2-unaligned.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcmp-sse42.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcmp-ssse3.S Remove NOT_IN_libc 2014-11-24 15:03:45 +05:30
strcmp.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcpy-sse2-unaligned.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcpy-ssse3.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcpy.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcspn-c.c Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strcspn.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strncase_l-ssse3.S Add optimized strncasecmp versions for x86-64. 2010-08-14 22:04:01 -07:00
strncase_l.S Add x86-64 __libc_ifunc_impl_list 2012-10-11 16:41:12 -07:00
strncat-c.c Improve 64 bit strcat functions with SSE2/SSSE3 2011-07-19 17:11:54 -04:00
strncat-sse2-unaligned.S Improve 64 bit strcat functions with SSE2/SSSE3 2011-07-19 17:11:54 -04:00
strncat-ssse3.S Improve 64 bit strcat functions with SSE2/SSSE3 2011-07-19 17:11:54 -04:00
strncat.S Add x86-64 __libc_ifunc_impl_list 2012-10-11 16:41:12 -07:00
strncmp-ssse3.S Don't define x86-64 __strncmp_ssse3 in libc.a 2012-09-27 07:43:03 -07:00
strncmp.S Add x86-64 __libc_ifunc_impl_list 2012-10-11 16:41:12 -07:00
strncpy-c.c SSSE3 strcpy/stpcpy for x86-64 2009-07-02 03:39:03 -07:00
strncpy-sse2-unaligned.S Improved st{r,p}{,n}cpy for SSE2 and SSSE3 on x86-64 2011-06-24 15:14:22 -04:00
strncpy-ssse3.S Improved st{r,p}{,n}cpy for SSE2 and SSSE3 on x86-64 2011-06-24 15:14:22 -04:00
strncpy.S Add x86-64 __libc_ifunc_impl_list 2012-10-11 16:41:12 -07:00
strpbrk-c.c Don't define __strpbrk_sse42 in static library 2010-03-24 12:16:24 -07:00
strpbrk.S Add x86-64 __libc_ifunc_impl_list 2012-10-11 16:41:12 -07:00
strspn-c.c Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strspn.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strstr-sse2-unaligned.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
strstr.c Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
test-multiarch.c Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
varshift.c Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
varshift.h Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
wcscpy-c.c Remove NOT_IN_libc 2014-11-24 15:03:45 +05:30
wcscpy-ssse3.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
wcscpy.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
wmemcmp-c.c Remove NOT_IN_libc 2014-11-24 15:03:45 +05:30
wmemcmp-sse4.S Optimized memcmp and wmemcmp for x86-64 and x86-32 2011-10-15 11:10:08 -04:00
wmemcmp-ssse3.S Optimized memcmp and wmemcmp for x86-64 and x86-32 2011-10-15 11:10:08 -04:00
wmemcmp.S Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00