Commit Graph

488 Commits

Author SHA1 Message Date
H.J. Lu
7a499756ab x86-64: Test memmove_chk and memset_chk only in libc.so [BZ #21741]
Since there are no multiarch versions of memmove_chk and memset_chk,
test multiarch versions of memmove_chk and memset_chk only in libc.so.

	[BZ #21741]
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test memmove_chk and memset_chk only
	in libc.so.
2017-07-10 04:44:38 -07:00
H.J. Lu
58d021c836 x86-64: Update comments in IFUNC selectors
* sysdeps/x86_64/multiarch/memcmp.c: Update comments.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memrchr.c: Likewise.
	* sysdeps/x86_64/multiarch/memset.c: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul.c: Likewise.
	* sysdeps/x86_64/multiarch/strlen.c: Likewise.
	* sysdeps/x86_64/multiarch/strnlen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcschr.c: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy.c: Likewise.
	* sysdeps/x86_64/multiarch/wcslen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemset.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise.
2017-07-09 11:43:20 -07:00
H.J. Lu
4df54c89bb x86-64: Update comments in ifunc-impl-list.c
All x86-64 IFUNC selectors are written in C now.  Update comments to
reflect it.

	* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Update comments.
2017-07-09 11:38:37 -07:00
H.J. Lu
e94c310357 x86-64: Optimize memcmp-avx2-movbe.S for short difference
Check the first 32 bytes before checking size when size >= 32 bytes
to avoid unnecessary branch if the difference is in the first 32 bytes.
Replace vpmovmskb/subl/jnz with vptest/jnc.

On Haswell, the new version is as fast as the previous one.  On Skylake,
the new version is a little bit faster.

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check
	the first 32 bytes before checking size when size >= 32 bytes.
	Replace vpmovmskb/subl/jnz with vptest/jnc.
2017-06-27 07:55:00 -07:00
H.J. Lu
049816c3be x86-64: Optimize L(between_2_3) in memcmp-avx2-movbe.S
Turn

	movzbl	-1(%rdi, %rdx), %edi
	movzbl	-1(%rsi, %rdx), %esi
	orl	%edi, %eax
	orl	%esi, %ecx

into

	movb	-1(%rdi, %rdx), %al
	movb	-1(%rsi, %rdx), %cl

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (between_2_3):
	Replace movzbl and orl with movb.
2017-06-23 12:46:12 -07:00
Florian Weimer
bc0382ae90 x86-64: Fix comment typo in memcmp-avx2-movbe.S 2017-06-23 19:00:58 +02:00
Florian Weimer
3ec7c02cc3 x86-64: memcmp-avx2-movbe.S needs saturating subtraction [BZ #21662]
This code:

L(between_2_3):
	/* Load as big endian with overlapping loads and bswap to avoid
	   branches.  */
	movzwl	-2(%rdi, %rdx), %eax
	movzwl	-2(%rsi, %rdx), %ecx
	shll	$16, %eax
	shll	$16, %ecx
	movzwl	(%rdi), %edi
	movzwl	(%rsi), %esi
	orl	%edi, %eax
	orl	%esi, %ecx
	bswap	%eax
	bswap	%ecx
	subl	%ecx, %eax
	ret

needs a saturating subtract because the full register is used.
With this commit, only the lower 24 bits of the register are used,
so a regular subtraction suffices.

The test case change adds coverage for these kinds of bugs.
2017-06-23 17:24:40 +02:00
H.J. Lu
11ffcacb64 x86-64: Implement strcmp family IFUNC selectors in C
Implement strcmp family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcmp family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcmp-sse2, strcmp-sse4_2, strncmp-sse2, strncmp-sse4_2,
	strcasecmp_l-sse2, strcasecmp_l-sse4_2, strcasecmp_l-avx,
	strncase_l-sse2, strncase_l-sse4_2 and strncase_l-avx.
	* sysdeps/x86_64/multiarch/ifunc-strcasecmp.h: New file.
	* sysdeps/x86_64/multiarch/strcasecmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-avx.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l.c: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strncase.c: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-avx.S : Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l.c: Likewise.
	* sysdeps/x86_64/multiarch/strncmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l.S: Removed.
	* sysdeps/x86_64/multiarch/strcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse42.S: Include <sysdep.h>.
	(STRCMP_SSE42): New.  Defined to __strcmp_sse42 if not defined.
	[USE_AS_STRCASECMP_L || USE_AS_STRNCASECMP_L]: Include
	"locale-defines.h".
	(UPDATE_STRNCMP_COUNTER): New.
	(SECTION): Likewise.
	(GLABEL): Likewise.
	(LABEL): Likewise.
	* sysdeps/x86_64/multiarch/strncmp-ssse3.S: Rewrite and enable
	for libc.a.
2017-06-21 12:11:06 -07:00
Zack Weinberg
c0b23001a8 Fix fallout from bits/string.h removal.
Remove one more string inline that was defined directly in string.h;
in the absence of the rest of the inlines, it broke the build.

Like other ifunc shims for these functions,
x86_64/multiarch/{mem,st}pcpy.c need to define __NO_STRING_INLINES and
NO_MEMPCPY_STPCPY_REDIRECT.

	* string/string.h (__mempcpy_inline): Delete.
	* sysdeps/x86_64/multiarch/mempcpy.c
	* sysdeps/x86_64/multiarch/stpcpy.c:
	Define NO_MEMPCPY_STPCPY_REDIRECT and __NO_STRING_INLINES
	before including string.h.
2017-06-20 09:39:08 -04:00
Zack Weinberg
09a596cc2c Remove bits/string.h.
These machine-dependent inline string functions have never been on by
default, and even if they were a good idea at the time they were
introduced, they haven't really been touched in ten to fifteen years
and probably aren't a good idea on current-gen processors.  Current
thinking is that this class of optimization is best left to the
compiler.

	* bits/string.h, string/bits/string.h
	* sysdeps/aarch64/bits/string.h
	* sysdeps/m68k/m680x0/m68020/bits/string.h
	* sysdeps/s390/bits/string.h, sysdeps/sparc/bits/string.h
	* sysdeps/x86/bits/string.h: Delete file.

	* string/string.h: Don't include bits/string.h.
	* string/bits/string3.h: Rename to bits/string_fortified.h.
	No need to undef various symbols that the removed headers
	might have defined as macros.
	* string/Makefile (headers): Remove bits/string.h, change
	bits/string3.h to bits/string_fortified.h.
	* string/string-inlines.c: Update commentary.  Remove definitions
	of various macros that nothing looks at anymore.  Don't directly
	include bits/string.h. Set _STRING_INLINE_unaligned here, based on
	compiler-predefined macros.
	* string/strncat.c: If STRNCAT is not defined, or STRNCAT_PRIMARY
	_is_ defined, provide internal hidden alias __strncat.
	* include/string.h: Declare internal hidden alias __strncat.
	Only forward __stpcpy to __builtin_stpcpy if __NO_STRING_INLINES is
	not defined.
	* include/bits/string3.h: Rename to bits/string_fortified.h,
	update to match above.

	* sysdeps/i386/string-inlines.c: Define compat symbols for
	everything formerly defined by sysdeps/x86/bits/string.h.
	Make existing definitions into compat symbols as well.
	Remove some no-longer-necessary messing around with macros.

	* sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c
	* sysdeps/powerpc/powerpc64/multiarch/mempcpy.c
	* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
	* sysdeps/s390/multiarch/mempcpy.c
	No need to define _HAVE_STRING_ARCH_mempcpy.
	Do define __NO_STRING_INLINES and NO_MEMPCPY_STPCPY_REDIRECT.

	* sysdeps/i386/i686/multiarch/strncat-c.c
	* sysdeps/s390/multiarch/strncat-c.c
	* sysdeps/x86_64/multiarch/strncat-c.c
	Define STRNCAT_PRIMARY.  Don't change definition of libc_hidden_def.
2017-06-20 08:21:24 -04:00
Siddhesh Poyarekar
629ebc873a Fix typo when undefining weak_alias
The macro directive #undef was miswritten as #undefine.

	* sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Fix typo.
2017-06-19 14:56:40 +05:30
H.J. Lu
70fe2eb794 x86-64: Implement strcspn/strpbrk/strspn IFUNC selectors in C
Implement strcspn/strpbrk/strspn IFUNC selectors in C

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcspn/strpbrk/strspn functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcspn-sse2, strpbrk-sse2 and strspn-sse2.
	* sysdeps/x86_64/strcspn.S (STRPBRK_P): Removed.
	Check USE_AS_STRPBRK instead of STRPBRK_P.
	* sysdeps/x86_64/strpbrk.S (USE_AS_STRPBRK): New.
	* sysdeps/x86_64/multiarch/ifunc-sse4_2.h: New file.
	* sysdeps/x86_64/multiarch/strcspn-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.c: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk.c: Likewise.
	* sysdeps/x86_64/multiarch/strspn-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.c: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.S: Removed.
	* sysdeps/x86_64/multiarch/strpbrk.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.S: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk-c.c: Remove "#ifdef SHARED"
	and "#endif".
2017-06-15 08:59:05 -07:00
H.J. Lu
9f4254b8bd x86-64: Implement wcscpy IFUNC selector in C
* sysdeps/x86_64/multiarch/wcscpy.S: Removed.
	* sysdeps/x86_64/multiarch/wcscpy.c: New file.
2017-06-15 08:57:52 -07:00
H.J. Lu
9ed0aa15d3 x86-64: Implement strcat family IFUNC selectors in C
Implement strcat family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcat family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcat-sse2.
	* sysdeps/x86_64/multiarch/strcat-sse2.S: New file.
	* sysdeps/x86_64/multiarch/strcat.c: Likewise.
	* sysdeps/x86_64/multiarch/strncat.c: Likewise.
	* sysdeps/x86_64/multiarch/strcat.S: Removed.
	* sysdeps/x86_64/multiarch/strncat.S: Likewise.
2017-06-15 08:56:59 -07:00
H.J. Lu
b91a52d0d7 x86-64: Implement memcmp family IFUNC selectors in C
Implement memcmp family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memcmp family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memcmp-sse2.
	* sysdeps/x86_64/multiarch/ifunc-memcmp.h: New file.
	* sysdeps/x86_64/multiarch/memcmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Removed.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
2017-06-15 08:49:57 -07:00
H.J. Lu
93e46f8773 x86-64: Implement memset family IFUNC selectors in C
Implement memset family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memset functions within libc.

2017-06-07  H.J. Lu  <hongjiu.lu@intel.com>
	    Erich Elsen  <eriche@google.com>

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-sse2-unaligned-erms, and memset_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add test for __memset_chk_erms.
	Update comments.
	* sysdeps/x86_64/multiarch/ifunc-memset.h: New file.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset.c: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Removed.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(__memset_chk_erms): New function.
2017-06-15 08:33:35 -07:00
H.J. Lu
5c3e322d3b x86-64: Implement memmove family IFUNC selectors in C
Implement memmove family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memmove family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memmove-sse2-unaligned-erms, memcpy_chk-nonshared,
	mempcpy_chk-nonshared and memmove_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __memmove_chk_erms,
	__memcpy_chk_erms and __mempcpy_chk_erms.  Update comments.
	* sysdeps/x86_64/multiarch/ifunc-memmove.h: New file.
	* sysdeps/x86_64/multiarch/memcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S: Removed.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(__mempcpy_chk_erms): New function.
	(__memmove_chk_erms): Likewise.
	(__memcpy_chk_erms): New alias.
2017-06-14 12:11:10 -07:00
H.J. Lu
5a103908c0 x86-64: Implement strcpy family IFUNC selectors in C
Implement strcpy family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcpy family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcpy-sse2 and stpcpy-sse2.
	* sysdeps/x86_64/multiarch/ifunc-unaligned-ssse3.h: New file.
	* sysdeps/x86_64/multiarch/stpcpy-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/stpcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/stpncpy.c: Likewise.
	* sysdeps/x86_64/multiarch/strcpy-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/strncpy.c: Likewise.
	* sysdeps/x86_64/multiarch/stpcpy.S: Removed.
	* sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strncpy.S: Likewise.
	* sysdeps/x86_64/multiarch/stpncpy-c.c (weak_alias): New.
	(libc_hidden_def): Always defined as empty.
	* sysdeps/x86_64/multiarch/strncpy-c.c (libc_hidden_builtin_def):
	Always Defined as empty.
2017-06-12 09:06:09 -07:00
H.J. Lu
6b6710e55b x86-64: Correct comments in ifunc-impl-list.c
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Correct comments.
2017-06-09 05:53:45 -07:00
H.J. Lu
d2538b9156 x86-64: Optimize strrchr/wcsrchr with AVX2
Optimize strrchr/wcsrchr with AVX2 to check 32 bytes with vector
instructions.  It is as fast as SSE2 version for small data sizes
and up to 1X faster for large data sizes on Haswell.  Select AVX2
version on AVX2 machines where vzeroupper is preferred and AVX
unaligned load is fast.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strrchr-sse2, strrchr-avx2, wcsrchr-sse2 and wcsrchr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strrchr_avx2,
	__strrchr_sse2, __wcsrchr_avx2 and __wcsrchr_sse2.
	* sysdeps/x86_64/multiarch/strrchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strrchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr.c: Likewise.
2017-06-09 05:45:52 -07:00
H.J. Lu
5ac7aa1d7c x86-64: Optimize memrchr with AVX2
Optimize memrchr with AVX2 to search 32 bytes with a single vector
compare instruction.  It is as fast as SSE2 memrchr for small data
sizes and up to 1X faster for large data sizes on Haswell.  Select
AVX2 memrchr on AVX2 machines where vzeroupper is preferred and AVX
unaligned load is fast.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memrchr-sse2 and memrchr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __memrchr_avx2 and
	__memrchr_sse2.
	* sysdeps/x86_64/multiarch/memrchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/memrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memrchr.c: Likewise.
2017-06-09 05:44:41 -07:00
H.J. Lu
8fe57365bf x86-64: Optimize strchr/strchrnul/wcschr with AVX2
Optimize strchr/strchrnul/wcschr with AVX2 to search 32 bytes with vector
instructions.  It is as fast as SSE2 versions for size <= 16 bytes and up
to 1X faster for or size > 16 bytes on Haswell.  Select AVX2 version on
AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strchr-sse2, strchrnul-sse2, strchr-avx2, strchrnul-avx2,
	wcschr-sse2 and wcschr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strchr_avx2,
	__strchrnul_avx2, __strchrnul_sse2, __wcschr_avx2 and
	__wcschr_sse2.
	* sysdeps/x86_64/multiarch/strchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strchr.c: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul.c: Likewise.
	* sysdeps/x86_64/multiarch/wcschr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcschr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcschr.c: Likewise.
	* sysdeps/x86_64/multiarch/strchr.S: Removed.
2017-06-09 05:42:29 -07:00
H.J. Lu
dc485ceb2a x86-64: Optimize strlen/strnlen/wcslen/wcsnlen with AVX2
Optimize strlen/strnlen/wcslen/wcsnlen with AVX2 to check 32 bytes with
a single vector compare instruction.  It is as fast as SSE2 versions for
size <= 16 bytes and up to 1X faster for or size > 16 bytes on Haswell.
Select AVX2 version on AVX2 machines where vzeroupper is preferred and
AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strlen-sse2, strnlen-sse2, strlen-avx2, strnlen-avx2,
	wcslen-sse2, wcslen-avx2 and wcsnlen-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strlen_avx2,
	__strlen_sse2, __strnlen_avx2, __strnlen_sse2, __wcslen_avx2,
	__wcslen_sse2 and __wcsnlen_avx2.
	* sysdeps/x86_64/multiarch/strlen-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strlen-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strlen.c: Likewise.
	* sysdeps/x86_64/multiarch/strnlen-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/strnlen-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strnlen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcslen-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcslen-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcslen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen.c (OPTIMIZE (avx2)): New.
	(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX2 machines where
	vzeroupper is preferred and AVX unaligned load is fast.
2017-06-09 05:18:18 -07:00
H.J. Lu
2f5d20ac99 x86-64: Optimize memchr/rawmemchr/wmemchr with SSE2/AVX2
SSE2 memchr is extended to support wmemchr.  AVX2 memchr/rawmemchr/wmemchr
are added to search 32 bytes with a single vector compare instruction.
AVX2 memchr/rawmemchr/wmemchr are as fast as SSE2 memchr/rawmemchr/wmemchr
for small sizes and up to 1.5X faster for larger sizes on Haswell and
Skylake.  Select AVX2 memchr/rawmemchr/wmemchr on AVX2 machines where
vzeroupper is preferred and AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/memchr.S (MEMCHR): New.  Depending on if
	USE_AS_WMEMCHR is defined.
	(PCMPEQ): Likewise.
	(memchr): Renamed to ...
	(MEMCHR): This.  Support wmemchr if USE_AS_WMEMCHR is defined.
	Replace pcmpeqb with PCMPEQ.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memchr-sse2, rawmemchr-sse2, memchr-avx2, rawmemchr-avx2,
	wmemchr-sse4_1, wmemchr-avx2 and wmemchr-c.
	* sysdeps/x86_64/multiarch/ifunc-avx2.h: New file.
	* sysdeps/x86_64/multiarch/memchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memchr.c: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memchr_avx2, __memchr_sse2,
	__rawmemchr_avx2, __rawmemchr_sse2, __wmemchr_avx2 and
	__wmemchr_sse2.
2017-06-09 05:13:31 -07:00
H.J. Lu
5e1122827a x86-64: Rename wmemset.h to ifunc-wmemset.h
No code changes.

	* sysdeps/x86_64/multiarch/wmemset.c: Include ifunc-wmemset.h
	instead of wmemset.h.
	* sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemset.h: Renamed to ...
	* sysdeps/x86_64/multiarch/ifunc-wmemset.h: This.
2017-06-07 14:48:34 -07:00
H.J. Lu
2e87c7d158 x86-64: Fold ifunc-sse4_1.h into wcsnlen.c
Since ifunc-sse4_1.h is included only by wcsnlen.c, we can fold it
into wcsnlen.c.  No code changes in wcsnlen.o.

2017-06-07  H.J. Lu  <hongjiu.lu@intel.com>

	* sysdeps/x86_64/multiarch/ifunc-sse4_1.h: Removed and folded
	into ...
	* sysdeps/x86_64/multiarch/wcsnlen.c: Here.  Don't include
	ifunc-sse4_1.h.
2017-06-07 09:04:40 -07:00
H.J. Lu
d4cc385c6e x86-64: Move wcsnlen.S to multiarch/wcsnlen-sse4_1.S
Since wcsnlen.S uses pminud which is the part of SSE4.1, move wcsnlen.S
to multiarch/wcsnlen-sse4_1.S.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	wcsnlen-sse4_1 and wcsnlen-c.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __wcsnlen_sse4_1 and
	__wcsnlen_sse2.
	* sysdeps/x86_64/multiarch/ifunc-sse4_1.h: New file.
	* sysdeps/x86_64/multiarch/wcsnlen-c.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen.c: Likewise.
	* sysdeps/x86_64/wcsnlen.S: Removed.
2017-06-06 06:12:32 -07:00
H.J. Lu
935971ba6b x86-64: Optimize memcmp/wmemcmp with AVX2 and MOVBE
Optimize x86-64 memcmp/wmemcmp with AVX2.  It uses vector compare as
much as possible.  It is as fast as SSE4 memcmp for size <= 16 bytes
and up to 2X faster for size > 16 bytes on Haswell and Skylake.  Select
AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

Key features:

1. For size from 2 to 7 bytes, load as big endian with movbe and bswap
   to avoid branches.
2. Use overlapping compare to avoid branch.
3. Use vector compare when size >= 4 bytes for memcmp or size >= 8
   bytes for wmemcmp.
4. If size is 8 * VEC_SIZE or less, unroll the loop.
5. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
8. Use 8 vector compares when size is 8 * VEC_SIZE or less.

	* sysdeps/x86/cpu-features.h (index_cpu_MOVBE): New.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memcmp-avx2 and wmemcmp-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memcmp_avx2 and __wmemcmp_avx2.
	* sysdeps/x86_64/multiarch/memcmp-avx2.S: New file.
	* sysdeps/x86_64/multiarch/wmemcmp-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Use __memcmp_avx2 on AVX
	2 machines if AVX unaligned load is fast and vzeroupper is
	preferred.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Use __wmemcmp_avx2 on AVX
	2 machines if AVX unaligned load is fast and vzeroupper is
	preferred.
2017-06-05 12:52:55 -07:00
H.J. Lu
ef9c4cb6c7 x86-64: Optimize wmemset with SSE2/AVX2/AVX512
The difference between memset and wmemset is byte vs int.  Add stubs
to SSE2/AVX2/AVX512 memset for wmemset with updated constant and size:

SSE2 wmemset:
	shl    $0x2,%rdx
	movd   %esi,%xmm0
	mov    %rdi,%rax
	pshufd $0x0,%xmm0,%xmm0
	jmp	entry_from_wmemset

SSE2 memset:
	movd   %esi,%xmm0
	mov    %rdi,%rax
	punpcklbw %xmm0,%xmm0
	punpcklwd %xmm0,%xmm0
	pshufd $0x0,%xmm0,%xmm0
entry_from_wmemset:

Since the ERMS versions of wmemset requires "rep stosl" instead of
"rep stosb", only the vector store stubs of SSE2/AVX2/AVX512 wmemset
are added.  The SSE2 wmemset is about 3X faster and the AVX2 wmemset
is about 6X faster on Haswell.

	* include/wchar.h (__wmemset_chk): New.
	* sysdeps/x86_64/memset.S (VDUP_TO_VEC0_AND_SET_RETURN): Renamed
	to MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.
	(WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New.
	(WMEMSET_CHK_SYMBOL): Likewise.
	(WMEMSET_SYMBOL): Likewise.
	(__wmemset): Add hidden definition.
	(wmemset): Add weak hidden definition.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	wmemset_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add __wmemset_sse2_unaligned,
	__wmemset_avx2_unaligned, __wmemset_avx512_unaligned,
	__wmemset_chk_sse2_unaligned, __wmemset_chk_avx2_unaligned
	and __wmemset_chk_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
	(VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ...
	(MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This.
	(WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New.
	(WMEMSET_SYMBOL): Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
	(VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ...
	(MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This.
	(WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New.
	(WMEMSET_SYMBOL): Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Updated.
	(WMEMSET_CHK_SYMBOL): New.
	(WMEMSET_CHK_SYMBOL (__wmemset_chk, unaligned)): Likewise.
	(WMEMSET_SYMBOL (__wmemset, unaligned)): Likewise.
	* sysdeps/x86_64/multiarch/memset.S (WMEMSET_SYMBOL): New.
	(libc_hidden_builtin_def): Also define __GI_wmemset and
	__GI___wmemset.
	(weak_alias): New.
	* sysdeps/x86_64/multiarch/wmemset.c: New file.
	* sysdeps/x86_64/multiarch/wmemset.h: Likewise.
	* sysdeps/x86_64/multiarch/wmemset_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise.
	* sysdeps/x86_64/wmemset.c: Likewise.
	* sysdeps/x86_64/wmemset_chk.c: Likewise.
2017-06-05 11:09:59 -07:00
H.J. Lu
a7fbedff76 Correct comments in x86_64/multiarch/memcmp.S
* sysdeps/x86_64/multiarch/memcmp.S (__GI_memcmp): Correct
	comments.
2017-05-18 14:02:02 -07:00
Zack Weinberg
7c3018f9e4 Suppress internal declarations for most of the testsuite.
This patch adds a new build module called 'testsuite'.
IS_IN (testsuite) implies _ISOMAC, as do IS_IN_build and __cplusplus
(which means several ad-hoc tests for __cplusplus can go away).
libc-symbols.h now suppresses almost all of *itself* when _ISOMAC is
defined; in particular, _ISOMAC mode does not get config.h
automatically anymore.

There are still quite a few tests that need to see internal gunk of
one variety or another.  For them, we now have 'tests-internal' and
'test-internal-extras'; files in this category will still be compiled
with MODULE_NAME=nonlib, and everything proceeds as it always has.
The bulk of this patch is moving tests from 'tests' to
'tests-internal'.  There is also 'tests-static-internal', which has
the same effect on files in 'tests-static', and 'modules-names-tests',
which has the *inverse* effect on files in 'modules-names' (it's
inverted because most of the things in modules-names are *not* tests).
For both of these, the file must appear in *both* the new variable and
the old one.

There is also now a special case for when libc-symbols.h is included
without MODULE_NAME being defined at all.  (This happens during the
creation of libc-modules.h, and also when preprocessing Versions
files.)  When this happens, IS_IN is set to be always false and
_ISOMAC is *not* defined, which was the status quo, but now it's
explicit.

The remaining changes to C source files in this patch seemed likely to
cause problems in the absence of the main change.  They should be
relatively self-explanatory.  In a few cases I duplicated a definition
from an internal header rather than move the test to tests-internal;
this was a judgement call each time and I'm happy to change those
however reviewers feel is more appropriate.

	* Makerules: New subdir configuration variables 'tests-internal'
	and 'test-internal-extras'.  Test files in these categories will
	still be compiled with MODULE_NAME=nonlib.  Test files in the
	existing categories (tests, xtests, test-srcs, test-extras) are
	now compiled with MODULE_NAME=testsuite.
	New subdir configuration variable 'modules-names-tests'.  Files
	which are in both 'modules-names' and 'modules-names-tests' will
	be compiled with MODULE_NAME=testsuite instead of
	MODULE_NAME=extramodules.
	(gen-as-const-headers): Move to tests-internal.
	(do-tests-clean, common-mostlyclean): Support tests-internal.
	* Makeconfig (built-modules): Add testsuite.
	* Makefile: Change libof-check-installed-headers-c and
	libof-check-installed-headers-cxx to 'testsuite'.
	* Rules: Likewise.  Support tests-internal.
	* benchtests/strcoll-inputs/filelist#en_US.UTF-8:
	Remove extra-modules.mk.

	* config.h.in: Don't check for __OPTIMIZE__ or __FAST_MATH__ here.
	* include/libc-symbols.h: Move definitions of _GNU_SOURCE,
	PASTE_NAME, PASTE_NAME1, IN_MODULE, IS_IN, and IS_IN_LIB to the
	very top of the file and rationalize their order.
	If MODULE_NAME is not defined at all, define IS_IN to always be
	false, and don't define _ISOMAC.
	If any of IS_IN (testsuite), IS_IN_build, or __cplusplus are
	true, define _ISOMAC and suppress everything else in this file,
	starting with the inclusion of config.h.
	Do check for inappropriate definitions of __OPTIMIZE__ and
	__FAST_MATH__ here, but only if _ISOMAC is not defined.
        Correct some out-of-date commentary.

	* include/math.h: If _ISOMAC is defined, undefine NO_LONG_DOUBLE
	and _Mlong_double_ before including math.h.
	* include/string.h: If _ISOMAC is defined, don't expose
	_STRING_ARCH_unaligned. Move a comment to a more appropriate
	location.

	* include/errno.h, include/stdio.h, include/stdlib.h, include/string.h
	* include/time.h, include/unistd.h, include/wchar.h: No need to
	check __cplusplus nor use __BEGIN_DECLS/__END_DECLS.

	* misc/sys/cdefs.h (__NTHNL): New macro.
	* sysdeps/m68k/m680x0/fpu/bits/mathinline.h
	(__m81_defun): Use __NTHNL to avoid errors with GCC 6.

	* elf/tst-env-setuid-tunables.c: Include config.h with _LIBC
	defined, for HAVE_TUNABLES.
	* inet/tst-checks-posix.c: No need to define _ISOMAC.
	* intl/tst-gettext2.c: Provide own definition of N_.
	* math/test-signgam-finite-c99.c: No need to define _ISOMAC.
	* math/test-signgam-main.c: No need to define _ISOMAC.
	* stdlib/tst-strtod.c: Convert to test-driver. Split locale_test to...
	* stdlib/tst-strtod1i.c: ...this new file.
	* stdlib/tst-strtod5.c: Convert to test-driver and add copyright notice.
        Split tests of __strtod_internal to...
	* stdlib/tst-strtod5i.c: ...this new file.
	* string/test-string.h: Include stdint.h. Duplicate definition of
	inhibit_loop_to_libcall here (from libc-symbols.h).
	* string/test-strstr.c: Provide dummy definition of
	libc_hidden_builtin_def when including strstr.c.
	* sysdeps/ia64/fpu/libm-symbols.h: Suppress entire file in _ISOMAC
	mode; no need to test __STRICT_ANSI__ nor __cplusplus as well.
	* sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h.
	Don't include init-arch.h.
	* sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h.
	Don't include init-arch.h.

	* elf/Makefile: Move tst-ptrguard1-static, tst-stackguard1-static,
	tst-tls1-static, tst-tls2-static, tst-tls3-static, loadtest,
	unload, unload2, circleload1, neededtest, neededtest2,
	neededtest3, neededtest4, tst-tls1, tst-tls2, tst-tls3,
	tst-tls6, tst-tls7, tst-tls8, tst-dlmopen2, tst-ptrguard1,
	tst-stackguard1, tst-_dl_addr_inside_object, and all of the
	ifunc tests to tests-internal.
	Don't add $(modules-names) to test-extras.
	* inet/Makefile: Move tst-inet6_scopeid_pton to tests-internal.
	Add tst-deadline to tests-static-internal.
	* malloc/Makefile: Move tst-mallocstate and tst-scratch_buffer to
	tests-internal.
	* misc/Makefile: Move tst-atomic and tst-atomic-long to tests-internal.
	* nptl/Makefile: Move tst-typesizes, tst-rwlock19, tst-sem11,
	tst-sem12, tst-sem13, tst-barrier5, tst-signal7, tst-tls3,
	tst-tls3-malloc, tst-tls5, tst-stackguard1, tst-sem11-static,
	tst-sem12-static, and tst-stackguard1-static to tests-internal.
        Link tests-internal with libpthread also.
	Don't add $(modules-names) to test-extras.
	* nss/Makefile: Move tst-field to tests-internal.
	* posix/Makefile: Move bug-regex5, bug-regex20, bug-regex33,
	tst-rfc3484, tst-rfc3484-2, and tst-rfc3484-3 to tests-internal.
	* stdlib/Makefile: Move tst-strtod1i, tst-strtod3, tst-strtod4,
	tst-strtod5i, tst-tls-atexit, and tst-tls-atexit-nodelete to
	tests-internal.
        * sunrpc/Makefile: Move tst-svc_register to tests-internal.
	* sysdeps/powerpc/Makefile: Move test-get_hwcap and
	test-get_hwcap-static to tests-internal.
	* sysdeps/unix/sysv/linux/Makefile: Move tst-setgetname to
	tests-internal.
	* sysdeps/x86_64/fpu/Makefile: Add all libmvec test modules to
	modules-names-tests.
2017-05-11 19:27:59 -04:00
H.J. Lu
4cb334c4d6 x86: Use AVX2 memcpy/memset on Skylake server [BZ #21396]
On Skylake server, AVX512 load/store instructions in memcpy/memset may
lead to lower CPU turbo frequency in certain situations.  Use of AVX2
in memcpy/memset has been observed to have improved overall performance
in many workloads due to the higher frequency.

Since AVX512ER is unique to Xeon Phi, this patch sets Prefer_No_AVX512
if AVX512ER isn't available so that AVX2 versions of memcpy/memset are
used on Skylake server.

	[BZ #21396]
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
	Prefer_No_AVX512 if AVX512ER isn't available.
	* sysdeps/x86/cpu-features.h (bit_arch_Prefer_No_AVX512): New.
	(index_arch_Prefer_No_AVX512): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Don't use
	AVX512 version if Prefer_No_AVX512 is set.
	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk):
	Likewise.
	* sysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S (__memmove_chk):
	Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk):
	Likewise.
	* sysdeps/x86_64/multiarch/memset.S (memset): Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk):
	Likewise.
2017-04-18 14:01:45 -07:00
Joseph Myers
f7a51347a4 Revert header inclusion changes that break math/ testing on x86_64.
Revert:
	2017-02-16  Zack Weinberg  <zackw@panix.com>

	* sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h.
	Don't include init-arch.h.
	* sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h.
	Don't include init-arch.h.
2017-02-17 17:08:17 +00:00
Zack Weinberg
ceaa98897c Add missing header files throughout the testsuite.
* crypt/md5.h: Test _LIBC with #if defined, not #if.
	* dirent/opendir-tst1.c: Include sys/stat.h.
	* dirent/tst-fdopendir.c: Include sys/stat.h.
	* dirent/tst-fdopendir2.c: Include stdlib.h.
	* dirent/tst-scandir.c: Include stdbool.h.
	* elf/tst-auditmod1.c: Include link.h and stddef.h.
	* elf/tst-tls15.c: Include stdlib.h.
	* elf/tst-tls16.c: Include stdlib.h.
	* elf/tst-tls17.c: Include stdlib.h.
	* elf/tst-tls18.c: Include stdlib.h.
	* iconv/tst-iconv6.c: Include endian.h.
	* iconvdata/bug-iconv11.c: Include limits.h.
	* io/test-utime.c: Include stdint.h.
	* io/tst-faccessat.c: Include sys/stat.h.
	* io/tst-fchmodat.c: Include sys/stat.h.
	* io/tst-fchownat.c: Include sys/stat.h.
	* io/tst-fstatat.c: Include sys/stat.h.
	* io/tst-futimesat.c: Include sys/stat.h.
	* io/tst-linkat.c: Include sys/stat.h.
	* io/tst-mkdirat.c: Include sys/stat.h and stdbool.h.
	* io/tst-mkfifoat.c: Include sys/stat.h and stdbool.h.
	* io/tst-mknodat.c: Include sys/stat.h and stdbool.h.
	* io/tst-openat.c: Include stdbool.h.
	* io/tst-readlinkat.c: Include sys/stat.h.
	* io/tst-renameat.c: Include sys/stat.h.
	* io/tst-symlinkat.c: Include sys/stat.h.
	* io/tst-unlinkat.c: Include stdbool.h.
	* libio/bug-memstream1.c: Include stdlib.h.
	* libio/bug-wmemstream1.c: Include stdlib.h.
	* libio/tst-fwrite-error.c: Include stdlib.h.
	* libio/tst-memstream1.c: Include stdlib.h.
	* libio/tst-memstream2.c: Include stdlib.h.
	* libio/tst-memstream3.c: Include stdlib.h.
	* malloc/tst-interpose-aux.c: Include stdint.h.
	* misc/tst-preadvwritev-common.c: Include sys/stat.h.
	* nptl/tst-basic7.c: Include limits.h.
	* nptl/tst-cancel25.c: Include pthread.h, not pthreadP.h.
	* nptl/tst-cancel4.c: Include stddef.h, limits.h, and sys/stat.h.
	* nptl/tst-cancel4_1.c: Include stddef.h.
	* nptl/tst-cancel4_2.c: Include stddef.h.
	* nptl/tst-cond16.c: Include limits.h.
	Use sysconf(_SC_PAGESIZE) instead of __getpagesize.
	* nptl/tst-cond18.c: Include limits.h.
	Use sysconf(_SC_PAGESIZE) instead of __getpagesize.
	* nptl/tst-cond4.c: Include stdint.h.
	* nptl/tst-cond6.c: Include stdint.h.
	* nptl/tst-stack2.c: Include limits.h.
	* nptl/tst-stackguard1.c: Include stddef.h.
	* nptl/tst-tls4.c: Include stdint.h. Don't include tls.h.
	* nptl/tst-tls4moda.c: Include stddef.h.
	Don't include stdio.h, unistd.h, or tls.h.
	* nptl/tst-tls4modb.c: Include stddef.h.
	Don't include stdio.h, unistd.h, or tls.h.
	* nptl/tst-tls5.h: Include stddef.h. Don't include stdlib.h or tls.h.
	* posix/tst-getaddrinfo2.c: Include stdio.h.
	* posix/tst-getaddrinfo5.c: Include stdio.h.
	* posix/tst-pathconf.c: Include sys/stat.h.
	* posix/tst-posix_fadvise-common.c: Include stdint.h.
	* posix/tst-preadwrite-common.c: Include sys/stat.h.
	* posix/tst-regex.c: Include stdint.h.
	Don't include spawn.h or spawn_int.h.
	* posix/tst-regexloc.c: Don't include spawn.h or spawn_int.h.
	* posix/tst-vfork3.c: Include sys/stat.h.
	* resolv/tst-bug18665-tcp.c: Include stdlib.h.
	* resolv/tst-res_hconf_reorder.c: Include stdlib.h.
	* resolv/tst-resolv-search.c: Include stdlib.h.
	* stdio-common/tst-fmemopen2.c: Include stdint.h.
	* stdio-common/tst-vfprintf-width-prec.c: Include stdlib.h.
	* stdlib/test-canon.c: Include sys/stat.h.
	* stdlib/tst-tls-atexit.c: Include stdbool.h.
	* string/test-memchr.c: Include stdint.h.
	* string/tst-cmp.c: Include stdint.h.
	* sysdeps/pthread/tst-timer.c: Include stdint.h.
	* sysdeps/unix/sysv/linux/tst-sync_file_range.c: Include stdint.h.
	* sysdeps/wordsize-64/tst-writev.c: Include limits.h and stdint.h.
	* sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h.
	Don't include init-arch.h.
	* sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h.
	Don't include init-arch.h.
	* sysdeps/x86_64/tst-auditmod10b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod3b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod4b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod5b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod6b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod6c.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod7b.c: Include link.h and stddef.h.
	* time/clocktest.c: Include stdint.h.
	* time/tst-posixtz.c: Include stdint.h.
	* timezone/tst-timezone.c: Include stdint.h.
2017-02-16 17:33:18 -05:00
H.J. Lu
02b78ff749 Add VZEROUPPER to memset-vec-unaligned-erms.S [BZ #21081]
Since memset-vec-unaligned-erms.S has VDUP_TO_VEC0_AND_SET_RETURN at
function entry, memset optimized for AVX2 and AVX512 will always use
ymm/zmm register. VZEROUPPER should be placed before ret in

L(stosb):
        movq    %rdx, %rcx
        movzbl  %sil, %eax
        movq    %rdi, %rdx
        rep stosb
        movq    %rdx, %rax
        ret

since it can be reached from

L(stosb_more_2x_vec):
        cmpq    $REP_STOSB_THRESHOLD, %rdx
        ja      L(stosb)

	[BZ #21081]
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(L(stosb)): Add VZEROUPPER before ret.
2017-01-30 10:59:31 -08:00
Adhemerval Zanella
8dad72997a Fix x86 strncat optimized implementation for large sizes
Similar to BZ#19387, BZ#21014, and BZ#20971, both x86 sse2 strncat
optimized assembly implementations do not handle the size overflow
correctly.

The x86_64 one is in fact an issue with strcpy-sse2-unaligned, but
that is triggered also with strncat optimized implementation.

This patch uses a similar strategy used on 3daef2c8ee, where
saturared math is used for overflow case.

Checked on x86_64-linux-gnu and i686-linux-gnu.  It fixes BZ #19390.

	[BZ #19390]
	* string/test-strncat.c (test_main): Add tests with SIZE_MAX as
	maximum string size.
	* sysdeps/i386/i686/multiarch/strcat-sse2.S (STRCAT): Avoid overflow
	in pointer addition.
	* sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S (STRCPY):
	Likewise.
2017-01-03 14:24:53 -02:00
Joseph Myers
bfff8b1bec Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
H.J. Lu
f43cb35c9b Require binutils 2.24 to build x86-64 glibc [BZ #20139]
If assembler doesn't support AVX512DQ, _dl_runtime_resolve_avx is used
to save the first 8 vector registers, which only saves the lower 256
bits of vector register, for lazy binding.  When it is called on AVX512
platform, the upper 256 bits of ZMM registers are clobbered.  Parameters
passed in ZMM registers will be wrong when the function is called the
first time.  This patch requires binutils 2.24, whose assembler can store
and load ZMM registers, to build x86-64 glibc.  Since mathvec library
needs assembler support for AVX512DQ,  we disable mathvec if assembler
doesn't support AVX512DQ.

	[BZ #20139]
	* config.h.in (HAVE_AVX512_ASM_SUPPORT): Renamed to ...
	(HAVE_AVX512DQ_ASM_SUPPORT): This.
	* sysdeps/x86_64/configure.ac: Require assembler from binutils
	2.24 or above.
	(HAVE_AVX512_ASM_SUPPORT): Removed.
	(HAVE_AVX512DQ_ASM_SUPPORT): New.
	* sysdeps/x86_64/configure: Regenerated.
	* sysdeps/x86_64/dl-trampoline.S: Make HAVE_AVX512_ASM_SUPPORT
	check unconditional.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Check
	HAVE_AVX512DQ_ASM_SUPPORT instead of HAVE_AVX512_ASM_SUPPORT.
	* sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_pow8_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx51:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S:
	Likewise.
2016-07-01 06:03:05 -07:00
H.J. Lu
13efa86ece Check Prefer_ERMS in memmove/memcpy/mempcpy/memset
Although the Enhanced REP MOVSB/STOSB (ERMS) implementations of memmove,
memcpy, mempcpy and memset aren't used by the current processors, this
patch adds Prefer_ERMS check in memmove, memcpy, mempcpy and memset so
that they can be used in the future.

	* sysdeps/x86/cpu-features.h (bit_arch_Prefer_ERMS): New.
	(index_arch_Prefer_ERMS): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Return
	__memcpy_erms for Prefer_ERMS.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(__memmove_erms): Enabled for libc.a.
	* ysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Return
	__memmove_erms or Prefer_ERMS.
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Return
	__mempcpy_erms for Prefer_ERMS.
	* sysdeps/x86_64/multiarch/memset.S (memset): Return
	__memset_erms for Prefer_ERMS.
2016-06-30 07:58:11 -07:00
H.J. Lu
c867597bff X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
the new ones.

No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
memcpy/memmove optimized with Enhanced REP MOVSB will be used for
processors with ERMS.  The new AVX512 memcpy/memmove will be used for
processors with AVX512 which prefer vzeroupper.

Since the new SSE2 memcpy/memmove are faster than the previous default
memcpy/memmove used in libc.a and ld.so, we also remove the previous
default memcpy/memmove and make them the default memcpy/memmove, except
that non-temporal store isn't used in ld.so.

Together, it reduces the size of libc.so by about 6 KB and the size of
ld.so by about 2 KB.

	[BZ #19776]
	* sysdeps/x86_64/memcpy.S: Make it dummy.
	* sysdeps/x86_64/mempcpy.S: Likewise.
	* sysdeps/x86_64/memmove.S: New file.
	* sysdeps/x86_64/memmove_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
	* sysdeps/x86_64/memmove.c: Removed.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memcpy-sse2-unaligned, memmove-avx-unaligned,
	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Replace
	__memmove_chk_avx512_unaligned_2 with
	__memmove_chk_avx512_unaligned.  Remove
	__memmove_chk_avx_unaligned_2.  Replace
	__memmove_chk_sse2_unaligned_2 with
	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
	with __memmove_avx512_unaligned.  Replace
	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
	with __memcpy_chk_avx512_unaligned.  Remove
	__memcpy_chk_avx_unaligned_2.  Replace
	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
	Replace __memcpy_avx512_unaligned_2 with
	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
	with __mempcpy_chk_avx512_unaligned.  Remove
	__mempcpy_chk_avx_unaligned_2.  Replace
	__mempcpy_chk_sse2_unaligned_2 with
	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
	Replace __mempcpy_avx512_unaligned_2 with
	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
	Replace __mempcpy_sse2_unaligned_2 with
	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
	(ENTRY): Removed.
	(END): Likewise.
	(ENTRY_CHK): Likewise.
	(libc_hidden_builtin_def): Likewise.
	Don't include ../memcpy.S.
	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
	__memcpy_chk_avx512_unaligned_erms and
	__memcpy_chk_avx512_unaligned.  Use
	__memcpy_chk_avx_unaligned_erms and
	__memcpy_chk_sse2_unaligned_erms if if processor has ERMS.
	Default to __memcpy_chk_sse2_unaligned.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	Change function suffix from unaligned_2 to unaligned.
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
	(ENTRY): Removed.
	(END): Likewise.
	(ENTRY_CHK): Likewise.
	(libc_hidden_builtin_def): Likewise.
	Don't include ../mempcpy.S.
	(mempcpy): New.  Add a weak alias.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
	__mempcpy_chk_avx512_unaligned_erms and
	__mempcpy_chk_avx512_unaligned.  Use
	__mempcpy_chk_avx_unaligned_erms and
	__mempcpy_chk_sse2_unaligned_erms if if processor has ERMS.
	Default to __mempcpy_chk_sse2_unaligned.
2016-06-08 13:58:08 -07:00
H.J. Lu
5e8c5bb1ac X86-64: Remove the previous SSE2/AVX2 memsets
Since the new SSE2/AVX2 memsets are faster than the previous ones, we
can remove the previous SSE2/AVX2 memsets and replace them with the
new ones.  This reduces the size of libc.so by about 900 bytes.

No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
optimized with Enhanced REP STOSB will be used for processors with
ERMS.  The new AVX512 memset will be used for processors with AVX512
which prefer vzeroupper.

	[BZ #19881]
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
	into ...
	* sysdeps/x86_64/memset.S: This.
	(__bzero): Removed.
	(__memset_tail): Likewise.
	(__memset_chk): Likewise.
	(memset): Likewise.
	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
	defined.
	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
	(__memset_zero_constant_len_parameter): Check SHARED instead of
	PIC.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memset-avx2 and memset-sse2-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(__bzero): Enabled.
	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	(memset): Removed.
	(__memset_chk): Likewise.
	(MEMSET_SYMBOL): New.
	(libc_hidden_builtin_def): Replace __memset_sse2 with
	__memset_sse2_unaligned.
	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
	__memset_chk_sse2 and __memset_chk_avx2 with
	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
	Use __memset_chk_sse2_unaligned_erms or
	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
	__memset_chk_avx512_unaligned_erms and
	__memset_chk_avx512_unaligned.
2016-06-08 13:56:14 -07:00
H.J. Lu
3f61232ab3 Fix a typo in comments in memmove-vec-unaligned-erms.S
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Fix
	a typo in comments.
2016-06-06 16:03:21 -07:00
H.J. Lu
eb2c88c7c8 Remove alignments on jump targets in memset
X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which
increases code sizes, but not necessarily improve performance.  As
memset benchtest data of align vs no align on various Intel and AMD
processors

https://sourceware.org/bugzilla/attachment.cgi?id=9277

shows that aligning jump targets isn't necessary.

	[BZ #20115]
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset):
	Remove alignments on jump targets.
2016-05-19 08:49:55 -07:00
H.J. Lu
2a1f15b1a9 Remove x86 ifunc-defines.sym and rtld-global-offsets.sym
Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym.  Remove
x86 ifunc-defines.sym and rtld-global-offsets.sym.  No code changes on
i686 and x86-64.

	* sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers):
	Remove ifunc-defines.sym.
	* sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers):
	Likewise.
	* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed.
	* sysdeps/x86/rtld-global-offsets.sym: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise.
	* sysdeps/x86/Makefile (gen-as-const-headers): Remove
	rtld-global-offsets.sym.
	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ...
	* sysdeps/x86/cpu-features-offsets.sym: This.
	* sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h>
	instead of <ifunc-defines.h> and <rtld-global-offsets.h>.
2016-05-11 05:51:39 -07:00
H.J. Lu
a057f5f8cd X86-64: Use non-temporal store in memcpy on large data
The large memcpy micro benchmark in glibc shows that there is a
regression with large data on Haswell machine.  non-temporal store in
memcpy on large data can improve performance significantly.  This
patch adds a threshold to use non temporal store which is 6 times of
shared cache size.  When size is above the threshold, non temporal
store will be used, but avoid non-temporal store if there is overlap
between destination and source since destination may be in cache when
source is loaded.

For size below 8 vector register width, we load all data into registers
and store them together.  Only forward and backward loops, which move 4
vector registers at a time, are used to support overlapping addresses.
For forward loop, we load the last 4 vector register width of data and
the first vector register width of data into vector registers before the
loop and store them after the loop.  For backward loop, we load the first
4 vector register width of data and the last vector register width of
data into vector registers before the loop and store them after the loop.

	[BZ #19928]
	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
	New.
	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6
	times of shared cache size.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
	(VMOVNT): New.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
	(VMOVNT): Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
	(VMOVNT): Likewise.
	(VMOVU): Changed to movups for smaller code sizes.
	(VMOVA): Changed to movaps for smaller code sizes.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
	comments.
	(PREFETCH): New.
	(PREFETCH_SIZE): Likewise.
	(PREFETCHED_LOAD_SIZE): Likewise.
	(PREFETCH_ONE_SET): Likewise.
	Rewrite to use forward and backward loops, which move 4 vector
	registers at a time, to support overlapping addresses and use
	non temporal store if size is above the threshold and there is
	no overlap between destination and source.
2016-04-12 08:10:47 -07:00
H.J. Lu
a7d1c51482 X86-64: Prepare memmove-vec-unaligned-erms.S
Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
default memcpy, mempcpy and memmove.

	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(MEMCPY_SYMBOL): New.
	(MEMPCPY_SYMBOL): Likewise.
	(MEMMOVE_CHK_SYMBOL): Likewise.
	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
	Provide alias for memcpy in libc.a and ld.so.
2016-04-06 10:19:16 -07:00
H.J. Lu
4af1bb06c5 X86-64: Prepare memset-vec-unaligned-erms.S
Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
default memset.

	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
	Disabled fro now.
	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
	symbols.  Properly check USE_MULTIARCH on __memset symbols.
2016-04-06 09:10:35 -07:00
H.J. Lu
ec0cac9a1f Force 32-bit displacement in memset-vec-unaligned-erms.S
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
	32-bit displacement to avoid long nop between instructions.
2016-04-05 05:21:19 -07:00
H.J. Lu
696ac77484 Add a comment in memset-sse2-unaligned-erms.S
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
	a comment on VMOVU and VMOVA.
2016-04-05 05:19:18 -07:00
H.J. Lu
5cd7af016d Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
and AVX512 memmove and memset in ld.so.

	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
	if not in libc.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
2016-04-03 14:35:38 -07:00
H.J. Lu
ea2785e96f Fix memmove-vec-unaligned-erms.S
__mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
and __memmove it breaks __memmove_chk.

Don't check source == destination first since it is less common.

	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
	with unaligned_erms.
	(__memmove_erms): Skip if source == destination.
	(__memmove_unaligned_erms): Don't check source == destination
	first.
2016-04-03 12:38:25 -07:00
H.J. Lu
830566307f Add x86-64 memset with unaligned store and rep stosb
Implement x86-64 memset with unaligned store and rep movsb.  Support
16-byte, 32-byte and 64-byte vector register sizes.  A single file
provides 2 implementations of memset, one with rep stosb and the other
without rep stosb.  They share the same codes when size is between 2
times of vector register size and REP_STOSB_THRESHOLD which defaults
to 2KB.

Key features:

1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector
register size at a time.

	[BZ #19881]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
	memset-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
	__memset_sse2_unaligned_erms, __memset_erms,
	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
	file.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
	Likewise.
2016-03-31 10:06:07 -07:00
H.J. Lu
88b57b8ed4 Add x86-64 memmove with unaligned load/store and rep movsb
Implement x86-64 memmove with unaligned load/store and rep movsb.
Support 16-byte, 32-byte and 64-byte vector register sizes.  When
size <= 8 times of vector register size, there is no check for
address overlap bewteen source and destination.  Since overhead for
overlap check is small when size > 8 times of vector register size,
memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb
and the other without rep movsb.  They share the same codes when size is
between 2 times of vector register size and REP_MOVSB_THRESHOLD which
is 2KB for 16-byte vector register size and scaled up by large vector
register size.

Key features:

1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load  all sources into
registers and store them together.
3. If there is no address overlap bewteen source and destination, copy
from both ends with 4 times of vector register size at a time.
4. If address of destination > address of source, backward copy 8 times
of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
by fallbacking to backward copy 8 times of vector register size at a
time.
7. Skip when address of destination == address of source.

	[BZ #19776]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
	memmove-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test
	__memmove_chk_avx512_unaligned_2,
	__memmove_chk_avx512_unaligned_erms,
	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
	__memmove_chk_sse2_unaligned_2,
	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
	__memmove_avx512_unaligned_erms, __memmove_erms,
	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
	__memcpy_chk_avx512_unaligned_2,
	__memcpy_chk_avx512_unaligned_erms,
	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
	__mempcpy_chk_avx512_unaligned_erms,
	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
	__mempcpy_erms.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
	file.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likwise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
	Likwise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
	Likwise.
2016-03-31 10:04:40 -07:00
H.J. Lu
064f01b10b Make __memcpy_avx512_no_vzeroupper an alias
Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
__memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
to reduce code size of libc.so.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memcpy-avx512-no-vzeroupper.
	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
	to ...
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
	(MEMCPY): Don't define.
	(MEMCPY_CHK): Likewise.
	(MEMPCPY): Likewise.
	(MEMPCPY_CHK): Likewise.
	(MEMPCPY_CHK): Renamed to ...
	(__mempcpy_chk_avx512_no_vzeroupper): This.
	(MEMPCPY_CHK): Renamed to ...
	(__mempcpy_chk_avx512_no_vzeroupper): This.
	(MEMCPY_CHK): Renamed to ...
	(__memmove_chk_avx512_no_vzeroupper): This.
	(MEMCPY): Renamed to ...
	(__memmove_avx512_no_vzeroupper): This.
	(__memcpy_avx512_no_vzeroupper): New alias.
	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
2016-03-28 13:16:22 -07:00
H.J. Lu
c365e615f7 Implement x86-64 multiarch mempcpy in memcpy
Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
reduces code size of libc.so.

	[BZ #18858]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
	and mempcpy-avx512-no-vzeroupper.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
	New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
	(MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
2016-03-28 13:13:51 -07:00
H.J. Lu
e41b395523 [x86] Add a feature bit: Fast_Unaligned_Copy
On AMD processors, memcpy optimized with unaligned SSE load is
slower than emcpy optimized with aligned SSSE3 while other string
functions are faster with unaligned SSE load.  A feature bit,
Fast_Unaligned_Copy, is added to select memcpy optimized with
unaligned SSE load.

	[BZ #19583]
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
	processors.  Set Fast_Copy_Backward for AMD Excavator
	processors.
	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
	New.
	(index_arch_Fast_Unaligned_Copy): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
2016-03-28 04:40:03 -07:00
H.J. Lu
3c9a4cd16c Don't set %rcx twice before "rep movsb"
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
	Don't set %rcx twice before "rep movsb".
2016-03-22 08:36:16 -07:00
H.J. Lu
fee9eb6200 Group AVX512 functions in .text.avx512 section
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
	Replace .text with .text.avx512.
	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
	Likewise.
2016-03-06 16:48:11 -08:00
H.J. Lu
14a1d7cc4c x86-64: Fix memcpy IFUNC selection
Chek Fast_Unaligned_Load, instead of Slow_BSF, and also check for
Fast_Copy_Backward to enable __memcpy_ssse3_back.  Existing selection
order is updated with following selection order:

1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __memcpy_sse2 if SSSE3 isn't available.
4. __memcpy_ssse3_back if Fast_Copy_Backward bit it set.
5. __memcpy_ssse3

	[BZ #18880]
	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
	instead of Slow_BSF, and also check for Fast_Copy_Backward to
	enable __memcpy_ssse3_back.
2016-03-04 08:39:07 -08:00
Andrew Senkevich
df782dc690 Fixed build with assembler w/o AVX-512 support.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Fixed build with
    assembler not supporting AVX-512.
2016-01-19 14:34:53 +03:00
Andrew Senkevich
214a44f394 Fixed typos in __memcpy_chk.
* sysdeps/x86_64/multiarch/memcpy_chk.S: Fixed typos.
2016-01-16 14:42:26 +03:00
Andrew Senkevich
72276d6e88 Added memcpy/memmove family optimized with AVX512 for KNL hardware.
Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
mempcpy_chk, memmove_chk.
It shows average improvement more than 30% over AVX versions on KNL
hardware (performance results in the thread
<https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).

    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
    * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
    * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
    * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
    * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove.c: Likewise.
    * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
2016-01-16 00:49:45 +03:00
Joseph Myers
f7a9f785e5 Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
Andrew Senkevich
83d776f979 Added memset optimized with AVX512 for KNL hardware.
It shows improvement up to 28% over AVX2 memset (performance results
attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).

    * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
    * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
    * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
    * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
    index_Prefer_No_VZEROUPPER): New.
    * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
    Prefer_No_VZEROUPPER for Knights Landing.
2015-12-19 02:47:28 +03:00
Joseph Myers
c871b9b096 Remove -mavx2 configure tests.
There are configure tests for the -mavx2 compiler option.  AVX2
support was added in GCC 4.7, so these tests are now obsolete; this
patch removes them.

Tested for x86_64 and x86 (testsuite, and that installed stripped
shared libraries are unchanged by the patch).

	* sysdeps/i386/configure.ac (libc_cv_cc_avx2): Remove configure
	test.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/x86_64/configure.ac (libc_cv_cc_avx2): Remove configure
	test.
	* sysdeps/x86_64/configure: Regenerated.
	* config.h.in (HAVE_AVX2_SUPPORT): Remove #undef.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-avx2 unconditionally instead of conditionally on
	[$(config-cflags-avx2) = yes].
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list) [HAVE_AVX2_SUPPORT]: Make code
	unconditional.
	* sysdeps/x86_64/multiarch/memset.S [HAVE_AVX2_SUPPORT]: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S
	[IS_IN (libc) && SHARED && HAVE_AVX2_SUPPORT]: Change conditional
	to [IS_IN (libc) && SHARED].
2015-10-28 13:29:03 +00:00
Joseph Myers
1b12cd7f4d Remove configure tests for AVX support.
GCC added support for -mavx and -msse2avx in version 4.4.  Thus the
configure tests for this support are obsolete, and this patch removes
them.

Tested for x86_64 and x86 (testsuite, and that installed stripped
shared libraries are unchanged by this patch).

	* sysdeps/i386/configure.ac (libc_cv_cc_avx): Remove configure
	test.
	(libc_cv_cc_sse2avx): Likewise.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/i386/i686/multiarch/Makefile
	[$(subdir)$(config-cflags-avx) = mathyes]: Change conditional to
	[$(subdir) = math].
	* sysdeps/i386/i686/multiarch/s_fma-fma.c [HAVE_AVX_SUPPORT]: Make
	code unconditional.
	* sysdeps/i386/i686/multiarch/s_fma.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/i386/i686/multiarch/s_fmaf-fma.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/i386/i686/multiarch/s_fmaf.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/configure.ac (libc_cv_cc_avx): Remove configure
	test.
	(libc_cv_cc_sse2avx): Likewise.
	* sysdeps/x86_64/configure: Regenerated.
	* sysdeps/x86_64/Makefile [$(config-cflags-avx) = yes]: Make code
	unconditional.
	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_profile)
	[HAVE_AVX_SUPPORT || HAVE_AVX512_ASM_SUPPORT]: Make code
	unconditional.
	(_dl_runtime_profile)
	[!(HAVE_AVX_SUPPORT || HAVE_AVX512_ASM_SUPPORT)]: Remove
	conditional code.
	* sysdeps/x86_64/fpu/multiarch/Makefile
	[$(config-cflags-sse2avx) = yes]: Make code unconditional.
	* sysdeps/x86_64/fpu/multiarch/e_atan2.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_exp.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_log.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_atan.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_fma.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_fmaf.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_sin.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_tan.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.S [HAVE_AVX_SUPPORT]: Likewise.
	* config.h.in (HAVE_AVX_SUPPORT): Remove #undef.
	(HAVE_SSE2AVX_SUPPORT): Likewise.
2015-10-08 15:59:32 +00:00
Joseph Myers
3b7aa5bf59 Remove configure tests for SSE4 support.
GCC added support for -msse4 in version 4.3.  Thus the configure tests
for it are obsolete, and this patch removes them.

Tested for x86_64 and x86 (testsuite, and that installed stripped
shared libraries are unchanged by this patch).

	* sysdeps/i386/configure.ac (libc_cv_cc_sse4): Remove configure
	test.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/i386/i686/multiarch/Makefile
	[$(config-cflags-sse4) = yes]: Make code unconditional.
	* sysdeps/i386/i686/multiarch/strcspn.S [HAVE_SSE4_SUPPORT]:
	Likewise.
	* sysdeps/i386/i686/multiarch/strspn.S [HAVE_SSE4_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/configure.ac (libc_cv_cc_sse4): Remove configure
	test.
	* sysdeps/x86_64/configure: Regenerated.
	* sysdeps/x86_64/multiarch/Makefile [$(config-cflags-sse4) = yes]:
	Make code unconditional.
	* sysdeps/x86_64/multiarch/strcspn.S [HAVE_SSE4_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/multiarch/strspn.S [HAVE_SSE4_SUPPORT]: Likewise.
	* config.h.in (HAVE_SSE4_SUPPORT): Remove #undef.
2015-10-06 20:47:40 +00:00
H.J. Lu
2339c6f4bd Remove x86-64 rtld-xxx.c and rtld-xxx.S
Since ld.so preserves vector registers now, we can use the regular,
non-ifunc string and memory functions in ld.so.

	* sysdeps/x86_64/rtld-memcmp.c: Removed.
	* sysdeps/x86_64/rtld-memset.S: Likewise.
	* sysdeps/x86_64/rtld-strchr.S: Likewise.
	* sysdeps/x86_64/rtld-strlen.S: Likewise.
	* sysdeps/x86_64/multiarch/rtld-memcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/rtld-memset.S: Likewise.
2015-08-25 08:50:06 -07:00
H.J. Lu
daa4db69fc Remove the unused IFUNC files
sysdeps/i386/i686/multiarch/strcasestr-c.c became unused after

commit 1818483b15
Author: Andreas Schwab <schwab@suse.de>
Date:   Wed Dec 18 11:53:27 2013 +1000

    Remove use of SSE4.2 functions for strstr on i686

which contains

-sysdep_routines += strcspn-c strpbrk-c strspn-c strstr-c strcasestr-c
+sysdep_routines += strcspn-c strpbrk-c strspn-c

sysdeps/x86_64/multiarch/strcasestr.c became useless after

t 584b18eb4d
Author: Ondřej Bílka <neleai@seznam.cz>
Date:   Sat Dec 14 19:33:56 2013 +0100

    Add strstr with unaligned loads. Fixes bug 12100.

which changes sysdeps/x86_64/multiarch/strcasestr.c to

libc_ifunc (__strcasestr, __strcasestr_sse2);

This patch removes these file.

	* i386/i686/multiarch/strcasestr-c.c: Removed.
	* x86_64/multiarch/strcasestr.c: Likewise.
	* x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
	Remove strcasestr.
2015-08-20 12:47:20 -07:00
H.J. Lu
1ae6c72dc1 Move x86_64 init-arch.h to sysdeps/x86/init-arch.h
Move sysdeps/x86_64/multiarch/init-arch.h to sysdeps/x86/init-arch.h
which can be used for both i386 and x86_64.

	* sysdeps/i386/i686/multiarch/init-arch.h: Removed.
	* sysdeps/unix/sysv/linux/x86/init-arch.h: Likewise.
	* sysdeps/x86_64/cacheinfo.c: Include <init-arch.h> instead
	of "multiarch/init-arch.h".
	* sysdeps/x86_64/multiarch/init-arch.h: Renamed to ...
	* sysdeps/x86/init-arch.h: This.
2015-08-20 04:29:23 -07:00
H.J. Lu
0b5395f052 Update x86_64 multiarch functions for <cpu-features.h>
This patch updates x86_64 multiarch functions to use the newly defined
HAS_CPU_FEATURE, HAS_ARCH_FEATURE and LOAD_RTLD_GLOBAL_RO_RDX from
<cpu-features.h>.

	* sysdeps/x86_64/fpu/multiarch/e_asin.c: Replace HAS_XXX with
	HAS_CPU_FEATURE/HAS_ARCH_FEATURE (XXX).
	* sysdeps/x86_64/fpu/multiarch/e_atan2.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_exp.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_log.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_pow.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_atan.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_fma.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_fmaf.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_sin.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_tan.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_ceil.S: Use
	LOAD_RTLD_GLOBAL_RO_RDX and HAS_CPU_FEATURE (SSE4_1).
	* sysdeps/x86_64/fpu/multiarch/s_ceilf.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_floor.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_floorf.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_nearbyint.S : Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_nearbyintf.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_rintf.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_rintf.S : Likewise.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise.
	* sysdeps/x86_64/multiarch/sched_cpucount.c: Likewise.
	* sysdeps/x86_64/multiarch/strstr.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/test-multiarch.c: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Remove __init_cpu_features
	call.  Add LOAD_RTLD_GLOBAL_RO_RDX.  Replace HAS_XXX with
	HAS_CPU_FEATURE/HAS_ARCH_FEATURE (XXX).
	* sysdeps/x86_64/multiarch/memcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/strcat.S: Likewise.
	* sysdeps/x86_64/multiarch/strchr.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
2015-08-13 03:41:30 -07:00
H.J. Lu
e2e4f56056 Add _dl_x86_cpu_features to rtld_global
This patch adds _dl_x86_cpu_features to rtld_global in x86 ld.so
and initializes it early before __libc_start_main is called so that
cpu_features is always available when it is used and we can avoid
calling __init_cpu_features in IFUNC selectors.

	* sysdeps/i386/dl-machine.h: Include <cpu-features.c>.
	(dl_platform_init): Call init_cpu_features.
	* sysdeps/i386/dl-procinfo.c (_dl_x86_cpu_features): New.
	* sysdeps/i386/i686/cacheinfo.c
	(DISABLE_PREFERRED_MEMORY_INSTRUCTION): Removed.
	* sysdeps/i386/i686/multiarch/Makefile (aux): Remove init-arch.
	* sysdeps/i386/i686/multiarch/Versions: Removed.
	* sysdeps/i386/i686/multiarch/ifunc-defines.sym (KIND_OFFSET):
	Removed.
	* sysdeps/i386/ldsodefs.h: Include <cpu-features.h>.
	* sysdeps/unix/sysv/linux/x86/Makefile
	(libpthread-sysdep_routines): Remove init-arch.
	* sysdeps/unix/sysv/linux/x86_64/dl-procinfo.c: Include
	<sysdeps/x86_64/dl-procinfo.c> instead of
	sysdeps/generic/dl-procinfo.c>.
	* sysdeps/x86/Makefile [$(subdir) == csu] (gen-as-const-headers):
	Add cpu-features-offsets.sym and rtld-global-offsets.sym.
	[$(subdir) == elf] (sysdep-dl-routines): Add dl-get-cpu-features.
	[$(subdir) == elf] (tests): Add tst-get-cpu-features.
	[$(subdir) == elf] (tests-static): Add
	tst-get-cpu-features-static.
	* sysdeps/x86/Versions: New file.
	* sysdeps/x86/cpu-features-offsets.sym: Likewise.
	* sysdeps/x86/cpu-features.c: Likewise.
	* sysdeps/x86/cpu-features.h: Likewise.
	* sysdeps/x86/dl-get-cpu-features.c: Likewise.
	* sysdeps/x86/libc-start.c: Likewise.
	* sysdeps/x86/rtld-global-offsets.sym: Likewise.
	* sysdeps/x86/tst-get-cpu-features-static.c: Likewise.
	* sysdeps/x86/tst-get-cpu-features.c: Likewise.
	* sysdeps/x86_64/dl-procinfo.c: Likewise.
	* sysdeps/x86_64/cacheinfo.c (__cpuid_count): Removed.
	Assume USE_MULTIARCH is defined and don't check it.
	(is_intel): Replace __cpu_features with GLRO(dl_x86_cpu_features).
	(is_amd): Likewise.
	(max_cpuid): Likewise.
	(intel_check_word): Likewise.
	(__cache_sysconf): Don't call __init_cpu_features.
	(__x86_preferred_memory_instruction): Removed.
	(init_cacheinfo): Don't call __init_cpu_features. Replace
	__cpu_features with GLRO(dl_x86_cpu_features).
	* sysdeps/x86_64/dl-machine.h: <cpu-features.c>.
	(dl_platform_init): Call init_cpu_features.
	* sysdeps/x86_64/ldsodefs.h: Include <cpu-features.h>.
	* sysdeps/x86_64/multiarch/Makefile (aux): Remove init-arch.
	* sysdeps/x86_64/multiarch/Versions: Removed.
	* sysdeps/x86_64/multiarch/cacheinfo.c: Likewise.
	* sysdeps/x86_64/multiarch/init-arch.c: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-defines.sym (KIND_OFFSET):
	Removed.
	* sysdeps/x86_64/multiarch/init-arch.h: Rewrite.
2015-08-13 03:41:22 -07:00
H.J. Lu
3b8d2eb7f8 Compile {memcpy,strcmp}-sse2-unaligned.S only for libc
{memcpy,strcmp}-sse2-unaligned.S aren't needed in ld.so.

	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Compile
	only for libc.
	* sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S: Likewise.
2015-08-05 08:28:37 -07:00
Andrew Senkevich
5fe2a126d1 This patch adds detection of availability for AVX512F and AVX512DQ ISAs.
* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX512F_Usable,
    bit_AVX512DQ_Usable, bit_Opmask_state, bit_ZMM0_15_state,
    bit_ZMM16_31_state): New macro.
    * sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
    Check and set bit_AVX512F_Usable, bit_AVX512DQ_Usable.
2015-06-08 14:07:59 +03:00
Ondřej Bílka
0f4840be25 Use strspn/strcspn/strpbrk ifunc in internal calls.
To make a strtok faster and improve performance in general we need to do one
additional change.

A comment:

/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
   The speedup we get from using SSE4.2 instruction is likely eaten away
   by the indirect call in the PLT.  */

Does not make sense at all because nobody bothered to check it. Gap
between these implementations is quite big, when haystack is empty a
sse2 is around 40 cycles slower because it needs to populate a lookup
table and difference only increases with size. That is much bigger than
plt slowdown which is few cycles.

Even benchtest show a gap which also may be reverse by branch
misprediction but my internal benchmark shown.

 simple_strspn stupid_strspn __strspn_sse42  __strspn_sse2
Length    0, alignment  0, acc len  6:  18.6562 35.2344 17.0469 61.6719
Length    6, alignment  0, acc len  6:  59.5469 72.5781 16.4219 73.625

This patch also handles strpbrk which is implemented by including a
x86_64/multiarch/strcspn.S file.

	* sysdeps/x86_64/multiarch/strspn.S: Remove plt indirection.
	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
2015-05-12 20:18:51 +02:00
H.J. Lu
5f3d0b78e0 Use AVX unaligned memcpy only if AVX2 is available
memcpy with unaligned 256-bit AVX register loads/stores are slow on older
processorsl like Sandy Bridge.  This patch adds bit_AVX_Fast_Unaligned_Load
and sets it only when AVX2 is available.

	[BZ #17801]
	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
	Set the bit_AVX_Fast_Unaligned_Load bit for AVX2.
	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX_Fast_Unaligned_Load):
	New.
	(index_AVX_Fast_Unaligned_Load): Likewise.
	(HAS_AVX_FAST_UNALIGNED_LOAD): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check the
	bit_AVX_Fast_Unaligned_Load bit instead of the bit_AVX_Usable bit.
	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
	* sysdeps/x86_64/multiarch/memmove.c (__libc_memmove): Replace
	HAS_AVX with HAS_AVX_FAST_UNALIGNED_LOAD.
	* sysdeps/x86_64/multiarch/memmove_chk.c (__memmove_chk): Likewise.
2015-01-30 15:37:58 -08:00
H.J. Lu
972af9e8dd Also treat model numbers 0x5a/0x5d as Silvermont 2015-01-23 18:52:45 -08:00
H.J. Lu
ede0236c86 Treat model numbers 0x4a/0x4d as Silvermont
* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
	Treat model numbers 0x4a/0x4d as Intel Silvermont architecture.
2015-01-23 18:08:10 -08:00
Joseph Myers
b168057aaa Update copyright dates with scripts/update-copyrights. 2015-01-02 16:29:47 +00:00
Siddhesh Poyarekar
4f41c682f3 Remove NOT_IN_libc
Replace with !IS_IN (libc).  This completes the transition from
the IS_IN/NOT_IN macros to the IN_MODULE macro set.

The generated code is unchanged on x86_64.

	* stdlib/isomac.c (fmt): Replace NOT_IN_libc with IN_MODULE.
	(get_null_defines): Adjust.
	* sunrpc/Makefile: Adjust comment.
	* Makerules (CPPFLAGS-nonlib): Remove NOT_IN_libc.
	* elf/Makefile (CPPFLAGS-sotruss-lib): Likewise.
	(CFLAGS-interp.c): Likewise.
	(CFLAGS-ldconfig.c): Likewise.
	(CPPFLAGS-.os): Likewise.
	* elf/rtld-Rules (rtld-CPPFLAGS): Likewise.
	* extra-lib.mk (CPPFLAGS-$(lib)): Likewise.
	* extra-modules.mk (extra-modules.mk): Likewise.
	* iconv/Makefile (CPPFLAGS-iconvprogs): Likewise.
	* locale/Makefile (CPPFLAGS-locale_programs): Likewise.
	* malloc/Makefile (CPPFLAGS-memusagestat): Likewise.
	* nscd/Makefile (CPPFLAGS-nscd): Likewise.
	* nss/Makefile (CPPFLAGS-nss_test1): Likewise.
	* stdlib/Makefile (CFLAGS-tst-putenvmod.c): Likewise.
	* sysdeps/gnu/Makefile ($(objpfx)errlist-compat.c): Likewise.
	* sysdeps/unix/sysv/linux/Makefile (CPPFLAGS-lddlibc4): Likewise.
	* iconvdata/Makefile (CPPFLAGS): Likewise.
	(cpp-srcs-left): Add libof for all iconvdata routines.
	* bits/stdio-lock.h: Replace NOT_IN_libc with IS_IN.
	* include/assert.h: Likewise.
	* include/ctype.h: Likewise.
	* include/errno.h: Likewise.
	* include/libc-symbols.h: Likewise.
	* include/math.h: Likewise.
	* include/netdb.h: Likewise.
	* include/resolv.h: Likewise.
	* include/stdio.h: Likewise.
	* include/stdlib.h: Likewise.
	* include/string.h: Likewise.
	* include/sys/stat.h: Likewise.
	* include/wctype.h: Likewise.
	* intl/l10nflist.c: Likewise.
	* libidn/idn-stub.c: Likewise.
	* libio/libioP.h: Likewise.
	* nptl/libc_multiple_threads.c: Likewise.
	* nptl/pthreadP.h: Likewise.
	* posix/regex_internal.h: Likewise.
	* resolv/res_hconf.c: Likewise.
	* sysdeps/arm/armv7/multiarch/memcpy.S: Likewise.
	* sysdeps/arm/memmove.S: Likewise.
	* sysdeps/arm/sysdep.h: Likewise.
	* sysdeps/generic/_itoa.h: Likewise.
	* sysdeps/generic/symbol-hacks.h: Likewise.
	* sysdeps/gnu/errlist.awk: Likewise.
	* sysdeps/gnu/errlist.c: Likewise.
	* sysdeps/i386/i586/memcpy.S: Likewise.
	* sysdeps/i386/i586/memset.S: Likewise.
	* sysdeps/i386/i686/memcpy.S: Likewise.
	* sysdeps/i386/i686/memmove.S: Likewise.
	* sysdeps/i386/i686/mempcpy.S: Likewise.
	* sysdeps/i386/i686/memset.S: Likewise.
	* sysdeps/i386/i686/multiarch/bcopy.S: Likewise.
	* sysdeps/i386/i686/multiarch/bzero.S: Likewise.
	* sysdeps/i386/i686/multiarch/memchr-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/memchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/memchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcmp-sse4.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcmp-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcmp.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcpy-ssse3-rep.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcpy-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcpy.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/i386/i686/multiarch/memmove.S: Likewise.
	* sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise.
	* sysdeps/i386/i686/multiarch/mempcpy.S: Likewise.
	* sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/i386/i686/multiarch/memrchr-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/memrchr-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/memrchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/memrchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/memset-sse2-rep.S: Likewise.
	* sysdeps/i386/i686/multiarch/memset-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/memset.S: Likewise.
	* sysdeps/i386/i686/multiarch/memset_chk.S: Likewise.
	* sysdeps/i386/i686/multiarch/rawmemchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcat-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcat-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcat.S: Likewise.
	* sysdeps/i386/i686/multiarch/strchr-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/strchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcmp-sse4.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcmp-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcmp.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcpy-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcpy-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcpy.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcspn.S: Likewise.
	* sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/strlen-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strlen.S: Likewise.
	* sysdeps/i386/i686/multiarch/strnlen.S: Likewise.
	* sysdeps/i386/i686/multiarch/strrchr-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/strrchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strrchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/strspn.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcschr-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wcschr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcschr.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcscmp-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcscmp.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcscpy-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wcscpy-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcscpy.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcslen-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wcslen-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcslen.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcsrchr-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wcsrchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcsrchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/wmemcmp-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wmemcmp.S: Likewise.
	* sysdeps/ia64/fpu/libm-symbols.h: Likewise.
	* sysdeps/nptl/bits/libc-lock.h: Likewise.
	* sysdeps/nptl/bits/libc-lockP.h: Likewise.
	* sysdeps/nptl/bits/stdio-lock.h: Likewise.
	* sysdeps/posix/closedir.c: Likewise.
	* sysdeps/posix/opendir.c: Likewise.
	* sysdeps/posix/readdir.c: Likewise.
	* sysdeps/posix/rewinddir.c: Likewise.
	* sysdeps/powerpc/novmx-sigjmp.c: Likewise.
	* sysdeps/powerpc/powerpc32/__longjmp.S: Likewise.
	* sysdeps/powerpc/powerpc32/bsd-_setjmp.S: Likewise.
	* sysdeps/powerpc/powerpc32/fpu/__longjmp.S: Likewise.
	* sysdeps/powerpc/powerpc32/fpu/setjmp.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/bzero.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memcmp-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memcmp.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memcpy-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memcpy.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memmove.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memrchr-ppc32.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memrchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memset-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memset.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/rawmemchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strcasecmp.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strcasecmp_l.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strlen-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strlen.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strncase.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strncase_l.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strncmp-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strncmp.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strnlen.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-ppc32.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-ppc32.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr-ppc32.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wordcopy.c: Likewise.
	* sysdeps/powerpc/powerpc32/power6/memset.S: Likewise.
	* sysdeps/powerpc/powerpc32/setjmp.S: Likewise.
	* sysdeps/powerpc/powerpc64/__longjmp.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/bzero.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memcmp-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memcmp.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memcpy-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memcpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memmove-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memmove.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/mempcpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memrchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memset-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memset.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/rawmemchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/stpcpy-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/stpncpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcasecmp.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcasecmp_l.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcat.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strchrnul.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcmp-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcmp.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcpy-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcspn.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strlen-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strlen.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncase.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncase_l.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncat.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncmp-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncmp.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncpy-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strnlen.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strpbrk.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strrchr-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strrchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strspn-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strspn.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcschr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcscpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcsrchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wordcopy.c: Likewise.
	* sysdeps/powerpc/powerpc64/setjmp.S: Likewise.
	* sysdeps/s390/s390-32/multiarch/ifunc-resolve.c: Likewise.
	* sysdeps/s390/s390-32/multiarch/memcmp.S: Likewise.
	* sysdeps/s390/s390-32/multiarch/memcpy.S: Likewise.
	* sysdeps/s390/s390-32/multiarch/memset.S: Likewise.
	* sysdeps/s390/s390-64/multiarch/ifunc-resolve.c: Likewise.
	* sysdeps/s390/s390-64/multiarch/memcmp.S: Likewise.
	* sysdeps/s390/s390-64/multiarch/memcpy.S: Likewise.
	* sysdeps/s390/s390-64/multiarch/memset.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy-niagara1.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy-niagara2.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy-niagara4.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy-ultra3.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memset-niagara1.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memset-niagara4.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memset.S: Likewise.
	* sysdeps/unix/alpha/sysdep.S: Likewise.
	* sysdeps/unix/alpha/sysdep.h: Likewise.
	* sysdeps/unix/make-syscalls.sh: Likewise.
	* sysdeps/unix/sysv/linux/aarch64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/aarch64/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/alpha/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/alpha/vfork.S: Likewise.
	* sysdeps/unix/sysv/linux/arm/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/arm/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/getpid.c: Likewise.
	* sysdeps/unix/sysv/linux/hppa/nptl/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/hppa/nptl/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/i386/i486/lowlevellock.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/i386/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/i386/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/ia64/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/ia64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/ia64/sysdep.S: Likewise.
	* sysdeps/unix/sysv/linux/ia64/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/lowlevellock-futex.h: Likewise.
	* sysdeps/unix/sysv/linux/m68k/bits/m68k-vdso.h: Likewise.
	* sysdeps/unix/sysv/linux/m68k/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/m68k/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/microblaze/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/microblaze/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/mips/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/not-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/longjmp_chk.c: Likewise.
	* sysdeps/unix/sysv/linux/s390/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/sysdep.S: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/vfork.S: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/sysdep.S: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/vfork.S: Likewise.
	* sysdeps/unix/sysv/linux/sh/lowlevellock.S: Likewise.
	* sysdeps/unix/sysv/linux/sh/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/sh/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/sh/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/sh/vfork.S: Likewise.
	* sysdeps/unix/sysv/linux/sparc/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc32/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc32/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/brk.S: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/tile/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/tile/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/tile/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/tile/waitpid.S: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/sysdep.h: Likewise.
	* sysdeps/wordsize-32/symbol-hacks.h: Likewise.
	* sysdeps/x86_64/memcpy.S: Likewise.
	* sysdeps/x86_64/memmove.c: Likewise.
	* sysdeps/x86_64/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/init-arch.h: Likewise.
	* sysdeps/x86_64/multiarch/memcmp-sse4.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/strcat-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/strcat.S: Likewise.
	* sysdeps/x86_64/multiarch/strchr-sse2-no-bsf.S: Likewise.
	* sysdeps/x86_64/multiarch/strchr.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy-c.c: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp-c.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
	* sysdeps/x86_64/strcmp.S: Likewise.
2014-11-24 15:03:45 +05:30
Andreas Schwab
3574f2fdf3 Fix misdetected Slow_SSE4_2 cpu feature bit (bug 17501) 2014-10-27 10:44:28 +01:00
Ling Ma
05f3633da4 Improve 64bit memcpy performance for Haswell CPU with AVX instruction
In this patch we take advantage of HSW memory bandwidth, manage to
reduce miss branch prediction by avoiding using branch instructions and
force destination to be aligned with avx instruction.

The CPU2006 403.gcc benchmark indicates this patch improves performance
from 2% to 10%.
2014-07-30 08:02:35 -07:00
H.J. Lu
f2fef657d8 Enable AVX2 optimized memset only if -mavx2 works
* config.h.in (HAVE_AVX2_SUPPORT): New #undef.
	* sysdeps/i386/configure.ac: Set HAVE_AVX2_SUPPORT and
	config-cflags-avx2.
	* sysdeps/x86_64/configure.ac: Likewise.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/x86_64/configure: Likewise.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-avx2 only if config-cflags-avx2 is yes.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
	Tests for memset_chk and memset only if HAVE_AVX2_SUPPORT is
	defined.
	* sysdeps/x86_64/multiarch/memset.S: Define multiple versions
	only if HAVE_AVX2_SUPPORT is defined.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
2014-07-14 07:58:27 -07:00
H.J. Lu
d92d8f8a42 Add ifunc tests for x86_64 memset_chk and memset
This patch adds ifunc tests for x86_64 memset_chk and memset.  It also
defines HAS_AVX2 with AVX2_Usable since AVX2 may not be usable even if
processor has AVX2.

	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
	Add tests for memset_chk and memset.

	* sysdeps/x86_64/multiarch/init-arch.h (HAS_AVX2): Defined
	with AVX2_Usable.
2014-06-20 14:52:29 -07:00
H.J. Lu
224c6c51c5 Remove sysdeps/x86_64/multiarch/rtld-strlen.S
Since there is no sysdeps/x86_64/multiarch/strlen.S,
sysdeps/x86_64/rtld-strlen.S will be used.

	* sysdeps/x86_64/multiarch/rtld-strlen.S: Removed.
2014-06-20 08:10:07 -07:00
Ling Ma
5c74e47cd6 Add x86_64 memset optimized for AVX2
In this patch we take advantage of HSW memory bandwidth, manage to
reduce miss branch prediction by avoiding using branch instructions and
force destination to be aligned with avx & avx2 instruction.

The CPU2006 403.gcc benchmark indicates this patch improves performance
from 26% to 59%.

	* sysdeps/x86_64/multiarch/Makefile: Add memset-avx2.
	* sysdeps/x86_64/multiarch/memset-avx2.S: New file.
	* sysdeps/x86_64/multiarch/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/rtld-memset.S: Likewise.
2014-06-19 15:14:08 -07:00
Carlos O'Donell
8f1df5cf9d Fix -Wundef warning for FEATURE_INDEX_1.
Define FEATURE_INDEX_1 and FEATURE_INDEX_MAX as macros
for use by both assembly and C code. This fixes the
-Wundef error for cases where FEATURE_INDEX_1 was not
defined but used the correct value of 0 for an undefined
macro.
2014-05-03 00:25:21 -04:00
Sihai Yao
f9281df995 Detect if AVX2 is usable
This patch checks and sets bit_AVX2_Usable in __cpu_features.feature.

	* sysdeps/x86_64/multiarch/ifunc-defines.sym (COMMON_CPUID_INDEX_7):
	New.
	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
	Check and set bit_AVX2_Usable.
	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX2_Usable): New
	macro.
	(bit_AVX2): Likewise.
	(index_AVX2_Usable): Likewise.
	(CPUID_AVX2): Likewise.
	(HAS_AVX2): Likewise.
2014-04-17 08:00:21 -07:00
Allan McRae
d4697bc93d Update copyright notices with scripts/update-copyrights 2014-01-01 22:00:23 +10:00
Allan McRae
6f8e37ebf8 Update file name in x86_64 ifunc list
File name update missed in commit 584b18eb.
2013-12-16 13:00:39 +10:00
Ondřej Bílka
584b18eb4d Add strstr with unaligned loads. Fixes bug 12100.
A sse42 version of strstr used pcmpistr instruction which is quite
ineffective. A faster way is look for pairs of characters which is uses
sse2, is faster than pcmpistr and for real strings a pairs we look for
are relatively rare.

For linear time complexity we use buy or rent technique which switches
to two-way algorithm when superlinear behaviour is detected.
2013-12-14 20:08:13 +01:00
Ondřej Bílka
e7044ea76b Use p2align instead ALIGN 2013-10-08 15:46:48 +02:00
Ondřej Bílka
dc1a95c730 Faster strrchr. 2013-09-26 19:23:01 +02:00
Ondřej Bílka
5905e7b3e2 Faster strchr implementation. 2013-09-11 17:07:38 +02:00
Ondřej Bílka
8f02859f17 Add unaligned strcmp. 2013-09-03 16:27:10 +02:00
Ondřej Bílka
382466e04e Fix typos. 2013-08-30 18:08:59 +02:00
Ondřej Bílka
0186c6e97e Fix rawmemchr regression on bulldozer. 2013-08-30 10:14:37 +02:00
Ondřej Bílka
c0c3f78afb Fix typos. 2013-08-21 19:48:48 +02:00
Liubov Dmitrieva
6308fd9a46 Skip SSE4.2 versions on Intel Silvermont
SSE2/SSSE3 versions are faster than SSE4.2 versions on Intel Silvermont.
2013-06-28 15:31:40 -07:00
Liubov Dmitrieva
11b8a0e1d7 Fix buffers overrun in x86_64 memcmp-ssse3.S 2013-06-26 12:31:51 -07:00
Liubov Dmitrieva
d086fc7ba0 Set fast unaligned load flag for new Intel microarchitecture
I have small patch for new Intel Silvermont machines.

http://newsroom.intel.com/community/intel_newsroom/blog/2013/05/06/intel-launches-low-power-high-performance-silvermont-microarchitecture

I checked this on my machine and see that strcpy, ... unaligned
versions are faster than ssse3 versions.
2013-06-14 20:46:15 +02:00
Ondrej Bilka
2d48b41c8f Faster memcpy on x64.
We add new memcpy version that uses unaligned loads which are fast
on modern processors. This allows second improvement which is avoiding
computed jump which is relatively expensive operation.

Tests available here:
http://kam.mff.cuni.cz/~ondra/memcpy_profile_result27_04_13.tar.bz2
2013-05-20 08:24:41 +02:00
Ondrej Bilka
37bb363f03 Faster strlen on x64. 2013-03-18 07:39:12 +01:00
Ondrej Bilka
80f844c9d8 Remove Prefer_SSE_for_memop on x64 2013-03-11 15:39:08 +01:00
Ondrej Bilka
87bd9bc4bd Revert " * sysdeps/x86_64/strlen.S: Replace with new SSE2 based implementation"
This reverts commit b79188d717.
2013-03-06 22:27:18 +01:00
Ondrej Bilka
b79188d717 * sysdeps/x86_64/strlen.S: Replace with new SSE2 based implementation
which is faster on all x86_64 architectures.
	Tested on AMD, Intel Nehalem, SNB, IVB.
2013-03-06 21:54:01 +01:00
Roland McGrath
f1d70dad53 Remove lots of inline keywords. 2013-02-07 14:44:18 -08:00
H.J. Lu
afec409af9 Change __x86_64 prefix in cache size to __x86 2013-01-05 16:00:38 -08:00
H.J. Lu
5d7dd1ca84 Add HAS_RTM 2013-01-03 09:38:20 -08:00
Joseph Myers
568035b787 Update copyright notices with scripts/update-copyrights. 2013-01-02 19:05:09 +00:00
Pino Toscano
94558d30b1 test-multiarch: terminate printf output with newline 2012-11-22 11:34:03 +01:00
H.J. Lu
f62c8abcfb Compile x86 rtld with -mno-sse -mno-mmx 2012-11-02 18:43:27 -07:00
H.J. Lu
ac49ecaf9d Add x86-64 __libc_ifunc_impl_list 2012-10-11 16:41:12 -07:00
H.J. Lu
9a387d1f78 Use IFUNC memmove/memset in x86-64 bcopy/bzero
Also add separate tests for bcopy and bzero.
2012-10-11 13:58:16 -07:00
H.J. Lu
0569936773 Define HAS_FMA with bit_FMA_Usable 2012-10-02 05:05:17 -07:00
H.J. Lu
31ed415328 Don't define x86-64 __strncmp_ssse3 in libc.a 2012-09-27 07:43:03 -07:00
Roland McGrath
7312ca90dc Clean up x86_64/multiarch/strstr-c.c include order. 2012-08-15 11:38:57 -07:00
Roland McGrath
9a0a54864b Clean up x86_64/multiarch/memmove.c include order. 2012-08-15 11:26:02 -07:00
H.J. Lu
f85fa27058 Avoid DWARF definition DIE on ifunc symbols 2012-08-09 16:04:37 -07:00
Carlos O'Donell
1a0994f535 BZ#14059: Fix AVX and FMA4 detection.
Fix AVX and FMA4 detection by following the guidelines
set out by Intel and AMD for detecting these features.
2012-05-17 06:59:28 -07:00
H.J. Lu
70bc83b910 Load pointers into RAX_LP in strcmp-sse42.S 2012-05-15 09:59:31 -07:00
H.J. Lu
9bc0b730a6 Load cache sizes into R*_LP in memcpy-ssse3.S 2012-05-15 09:58:28 -07:00
H.J. Lu
6d2850e7f5 Load cache sizes into R*_LP in memcpy-ssse3-back.S 2012-05-15 09:56:17 -07:00
H.J. Lu
8a17f34979 Load cache size into R8_LP 2012-05-15 09:35:43 -07:00
Paul Eggert
59ba27a63a Replace FSF snail mail address with URLs. 2012-02-09 23:18:22 +00:00
Ulrich Drepper
08cf777f9e Really fix AVX tests
There is no problem with strcmp, it doesn't use the YMM registers.
The math routines might since gcc perhaps generates such code.
Introduce bit_YMM_USBALE and use it in the math routines.
2012-01-26 09:45:54 -05:00
Ulrich Drepper
afc5ed09cb Reset bit_AVX in __cpu_features is OS support is missing 2012-01-26 07:45:14 -05:00
Liubov Dmitrieva
15db4de19d Fix overrun in destination buffer 2011-12-23 12:02:15 -05:00
Ulrich Drepper
370a7d88f7 WP fixes 2011-12-17 14:41:05 -05:00
Ulrich Drepper
1d3e4b618a Optimized wcschr and wcscpy for x86-64 and x86-32 2011-12-17 14:39:23 -05:00
Ulrich Drepper
aff2453df7 Fix more warnings 2011-12-03 21:49:35 -05:00
Ulrich Drepper
34372fc6d3 Fix test of non-ASCII locales in x86-64 strcasecmp et.al. 2011-11-01 16:46:23 -04:00
Ulrich Drepper
52e4b9eb62 More cleanups of x86-64 strstr 2011-10-28 19:01:48 -04:00
Ulrich Drepper
fd52bc6dc4 Clean up x86-64 strcasestr
Actually describe in the C code what is going on.
2011-10-28 18:18:04 -04:00
Ulrich Drepper
e0016b11d6 Add AVX optimized versions for some x86-64 math functions 2011-10-25 21:34:55 -04:00
Ulrich Drepper
618280a192 Optimize x86-64 SSE4.2+ strcmp a bit more 2011-10-25 14:50:31 -04:00
Ulrich Drepper
09229f3e1b Fix WS 2011-10-23 14:57:28 -04:00
Liubov Dmitrieva
ce7dd29f28 Optimized strnlen and wcscmp for x86-64 2011-10-23 14:56:04 -04:00
Ulrich Drepper
c196fed8f0 Fix compilation problems in x86-64 init-arch 2011-10-21 20:47:20 -04:00
Ulrich Drepper
ed72b6545f Check for FMA4 support and generate appropriate fma functions 2011-10-20 22:43:15 -04:00
Ulrich Drepper
8d4f46c613 Move fma routines to right place 2011-10-20 21:55:41 -04:00
Ulrich Drepper
855d156018 Optimize x86-64 rawmemchr and add test 2011-10-19 22:22:29 -04:00
Ulrich Drepper
d9a4d2ab27 Add optimized str{,n}casecmp for AVX on x86-64 2011-10-19 12:42:38 -04:00
Ulrich Drepper
2d1f3a4db6 Fix WS 2011-10-15 11:11:12 -04:00
Liubov Dmitrieva
be13f7bff6 Optimized memcmp and wmemcmp for x86-64 and x86-32 2011-10-15 11:10:08 -04:00
Liubov Dmitrieva
093ecf9299 Improve 64 bit memchr, memrchr, rawmemchr with SSE2 2011-10-07 11:49:10 -04:00
Ulrich Drepper
ceaa0c5dc3 Move Atom-optimized code out of the way and together 2011-09-06 21:53:03 -04:00
Ulrich Drepper
6d18b67f4d Fix whitespaces 2011-09-05 21:42:12 -04:00
Liubov Dmitrieva
a5f524e479 Add Atom-optimized strchr and strrchr for x86-64 2011-09-05 21:34:03 -04:00
Andreas Schwab
8c1a459f9a Fix inline strncat/strncmp on x86 2011-08-04 14:59:25 -04:00
Ulrich Drepper
21137f89c5 Fix overflow bug is optimized strncat for x86-64 2011-07-21 12:32:36 -04:00
Ulrich Drepper
8002999481 Fix whitespaces 2011-07-19 17:27:09 -04:00
Liubov Dmitrieva
99710781cc Improve 64 bit strcat functions with SSE2/SSSE3 2011-07-19 17:11:54 -04:00
H.J. Lu
8912479f9e Improved st{r,p}{,n}cpy for SSE2 and SSSE3 on x86-64 2011-06-24 15:14:22 -04:00
H.J. Lu
0b1cbaaef5 Optimized st{r,p}{,n}cpy for SSE2/SSSE3 on x86-32 2011-06-24 14:15:32 -04:00
H.J. Lu
3d29045b5e Assume Intel Core i3/i5/i7 processor if AVX is available 2011-06-03 07:01:25 -04:00
Mike Frysinger
4c559bcdf3 Fix static linking with checking x86/x86-64 memcpy. 2011-04-17 22:20:47 -04:00
H.J. Lu
0354e35501 Work around old buggy program which cannot cope with memcpy semantics. 2011-04-01 19:38:21 -04:00
H.J. Lu
c97a1282a4 Handle page boundaries in x86 SSE4.2 strncmp. 2011-03-21 05:35:38 -04:00
Harsha Jagasia
7e4ba49cd3 Enable SSE2 memset for AMD'supcoming Orochi processor.
This patch enables SSE2 memset for AMD's upcoming Orochi processor.
This patch also fixes the following bug:
For misaligned blocks larger than > 144 Bytes, memset branches into
the integer code path depending on the value of misalignment even if
the startup code chooses the SSE2 code path upfront, when multiarch
is enabled.
2011-03-04 23:30:08 -05:00
Roland McGrath
a0bf67cca2 Fix some warning nits. 2011-02-04 10:53:51 -08:00
H.J. Lu
13b695749a Support Intel processor model 6 and model 0x2. 2010-11-12 03:48:52 -05:00
H.J. Lu
8ca52c6e3b Fix one exit path in x86-64 SSE4.2 str{,n}casecmp. 2010-11-10 03:05:37 -05:00
H.J. Lu
ff02d5280b Use IFUNC on x86-64 memset 2010-11-08 03:41:34 -05:00
Richard Li
dbf3a06904 Fix x86-64 strchr propagation of search byte into all bytes of SSE register 2010-10-25 14:13:17 -04:00
Jakub Jelinek
5e908464b9 Implement accurate fma. 2010-10-13 22:27:03 -04:00
Jakub Jelinek
9ff8d36f27 Correct implementation of fmaf. 2010-10-11 09:27:05 -04:00
Ulrich Drepper
015a4c6193 Re-enable all strncasecmp versions. 2010-09-20 20:18:00 -07:00
Ulrich Drepper
8ffcee4a04 Fix limit detection in x86-64 SSE2 strncasecmp. 2010-09-20 14:02:23 -07:00
Ulrich Drepper
9ea3de11f1 Move slow Atom code to separate section. 2010-08-26 22:17:03 -07:00
H.J. Lu
623aac7f84 Unroll x86-64 strlen 2010-08-26 22:09:34 -07:00
H.J. Lu
b416a90085 Missing comma in last commit. 2010-08-26 13:18:46 -07:00
Roland McGrath
8b2b771538 Clean up warnings in new x86_64/multiarch code. 2010-08-25 12:13:08 -07:00
H.J. Lu
e73015f2d6 Unroll 32bit SSE strlen and handle slow bsf 2010-08-25 10:07:37 -07:00
Ulrich Drepper
1cdfe7242f Add missing copyright year updated and pretty printing. 2010-08-24 11:42:19 -07:00
Richard Henderson
73f27d5e72 Clean up SSE variable shifts 2010-08-24 11:35:01 -07:00
Ulrich Drepper
9da4bb316f Fix two typos in x86-64 SSE4.2 strncasecmp implementation. 2010-08-19 09:20:44 -07:00
Ulrich Drepper
1feccb6caf Fix fourth parameter of SSE4.2 strcmp for x86-64. 2010-08-15 20:46:09 -07:00
Ulrich Drepper
e9f82e0d1d Add optimized strncasecmp versions for x86-64. 2010-08-14 22:04:01 -07:00
Ulrich Drepper
ca6bb004eb Fix x86-64 build without multiarch. 2010-08-14 14:56:32 -07:00
Ulrich Drepper
73507d3ae0 Add support for SSSE3 and SSE4.2 versions of strcasecmp on x86-64. 2010-07-31 21:41:09 -07:00
Ulrich Drepper
66f6765a47 Pretty printing x86-64 SSE4.3 strcmp. 2010-07-30 12:54:37 -07:00
Ulrich Drepper
fe36dd025e Fix tolower operation in strcasestr. 2010-07-30 00:09:07 -07:00
Ulrich Drepper
880113d91e Avoid compiling unneeded file in ld.so. 2010-07-27 21:12:59 -07:00
Ulrich Drepper
8e96b93aa7 Speed up x86-64 strcasestr a bit moew.
Using the new SSE4.2 instructions is cool but not really the fastest.
Some older SSE instructions can do the trick faster.
2010-07-24 08:34:44 -07:00
Andreas Schwab
f6a31e0eb6 Add strcasestr-nonascii to i386 build 2010-07-21 07:26:18 -07:00
Ulrich Drepper
d02dc4ba08 Fix non-ASCII case of SSE4.2 strcasstr. 2010-07-16 16:00:22 -07:00
Ulrich Drepper
cc9f2e47a0 Speed up SSE4.2 strcasestr by avoiding indirect function call. 2010-07-16 15:37:38 -07:00
H.J. Lu
6fb8cbcb58 Improve 64bit memcpy/memmove for Atom, Core 2 and Core i7
This patch includes optimized 64bit memcpy/memmove for Atom, Core 2 and
Core i7.  It improves memcpy by up to 3X on Atom, up to 4X on Core 2 and
up to 1X on Core i7.  It also improves memmove by up to 3X on Atom, up to
4X on Core 2 and up to 2X on Core i7.
2010-06-30 08:26:11 -07:00
H.J. Lu
3c88fe1e3a Incorrect x86 CPU family and model check. 2010-05-27 11:14:18 -07:00
H.J. Lu
df87f54923 Check DATA_CACHE_SIZE_HALF 2010-04-14 22:18:27 -07:00
H.J. Lu
dd37cd1a12 Optimie x86-64 SSE4 memcmp for unaligned data. 2010-04-14 17:53:44 -07:00
H.J. Lu
404a6e3201 x86-64 SSE4 optimized memcmp
This is 64bit SSE4 optimized memcmp. It improves memcmp by upto 3X
on Intel Core i7.
2010-04-14 00:12:53 -07:00
Ulrich Drepper
bbbdd77809 Update x86-64 cpu multiarch selection header. 2010-04-13 19:17:10 -07:00
Ulrich Drepper
22f4f44b67 Fix concurrent handling of __cpu_features. 2010-04-04 00:25:46 -07:00
H.J. Lu
7d9335ecd7 Don't define __strpbrk_sse42 in static library 2010-03-24 12:16:24 -07:00
H.J. Lu
5a7af22fbb Unroll the loop x86-64 SSE4.2 strlen. 2010-01-13 07:51:48 -08:00
H.J. Lu
3af48cbdfa Optimize 32bit memset/memcpy with SSE2/SSSE3. 2010-01-12 11:22:03 -08:00
H.J. Lu
2510d01ddb Define bit_SSE2 and index_SSE2. 2009-12-13 15:23:02 -08:00
H.J. Lu
51ddd2c01e Define bit_XXX and index_XXX.
This patch defines bit_XXX and index_XXX and use them to check processor
feature in assembly code.  It can prevent typos in processor feature
check.
2009-12-13 09:47:02 -08:00
Ulrich Drepper
823bc6da65 Fix whitespaces. 2009-10-22 22:50:00 -07:00
H.J. Lu
001659f4d5 Implement SSE4.2 optimized strchr and strrchr. 2009-10-22 22:47:12 -07:00
Roland McGrath
b0f3a2e43f Clean up unnecessary libc_hidden_builtin_def fiddling in x86 multiarch definitions. 2009-10-06 20:01:23 -07:00
Roland McGrath
9d6982d5d2 Clean up x86 multiarch HAS_FOO macros. 2009-10-06 19:59:03 -07:00
Jakub Jelinek
22bb992d51 Fix strstr/strcasestr/fma/fmaf on x86_64. 2009-09-02 19:43:04 -07:00
H.J. Lu
5a4eb7282e Remove ENABLE_SSSE3_ON_ATOM.
It turns that SSSE3 isn't slow on Atom. The problem is bsf. This patch
removes ENABLE_SSSE3_ON_ATOM.
2009-08-28 14:54:46 -07:00
Ulrich Drepper
8e436522e1 Move SSE4.2 functions together. 2009-08-08 09:38:32 -07:00
Ulrich Drepper
0fda545d5f Add SSSE3-optimized implementation of str{,n}cmp for x86-64. 2009-08-07 22:51:02 -07:00
Ulrich Drepper
57b378ac89 Avoid warning through fake initialization. 2009-08-07 16:19:54 -07:00
H.J. Lu
02cea47161 Add x86 32-bit SSE4.2 string functions.
This patch adds 32bit SSE4.2 string functions.  It uses -16L instead of
0xfffffffffffffff0L, which works for both 32bit and 64bit long.  Tested
on 32bit Core i7 and Core 2.
2009-08-04 12:13:43 -07:00
H.J. Lu
6f6f1215f6 Support multiarch for i686.
This patch adds multiarch support when configured for i686.  I modified
some x86-64 functions to support 32bit. I will contribute 32bit SSE string
and memory functions later.
2009-07-31 11:53:35 -07:00
Ulrich Drepper
78c4ef475d Add support for x86-64 fma instruction.
Use it to implement fma and fmaf, if possible.
2009-07-29 15:26:06 -07:00
Ulrich Drepper
9a1d2d4555 Prepare use if IFUNC functions outside libc.so.
We use a callback function into libc.so to get access to the data
structure with the information and have special versions of the test
macros which automatically use this function.
2009-07-29 15:22:28 -07:00
Ulrich Drepper
e83c1a8a72 Refine testing for xmm/ymm register use in x86-64 ld.so.
The test now takes the callgraph into account.  Only code called
during runtime relocation is affected by the limitation.  We now
determine the affected object files as closely as possible from
the outside.  This allowed to remove some the specializations
for some of the string functions as they are only used in other
code paths.
2009-07-27 13:40:27 -07:00
Ulrich Drepper
16d2ea4c82 Make sure no code in ld.so uses xmm/ymm registers on x86-64.
This patch introduces a test to make sure no function modifies the
xmm/ymm registers.  With the exception of the auditing functions.

The test is probably too pessimistic.  All code linked into ld.so
is checked.  Perhaps at some point the callgraph starting from
_dl_fixup and _dl_profile_fixup is checked and we can start using
faster SSE-using functions in parts of ld.so.
2009-07-26 16:10:00 -07:00
H.J. Lu
7956a3d27c Add SSE2 support to str{,n}cmp for x86-64. 2009-07-26 13:32:28 -07:00
H.J. Lu
4e5b5821bf Some some optimizations for x86-64 strcmp. 2009-07-25 19:15:14 -07:00
Ulrich Drepper
29e92fa5cd Optimize x86-64 SSE4.2 strcmp.
The file contained some code which was never used.  Don't compile it
in.
2009-07-25 12:02:47 -07:00
Ulrich Drepper
d28797e426 Perform test for Arom x86-64 in central place and handle it.
There will be more than one function which, in multiarch mode, wants
to use SSSE3.  We should not test in each of them for Atoms with
slow SSSE3.  Instead, disable the SSSE3 bit in the startup code for
such machines.
2009-07-23 13:15:17 -07:00
Ulrich Drepper
ae612b04cc Minor cleanups in x86-64 strstr. 2009-07-21 07:52:12 -07:00
Ulrich Drepper
a8f895ebe1 Better check for optimization in new x86-64 strstr/strcasestr. 2009-07-20 21:18:28 -07:00
H.J. Lu
2b7a8664fa SSE4.2 strstr/strcasestr for x86-64.
This patch implements SSE4.2 strstr/strcasestr, using Knuth-Morris-Pratt
string searching algorithm.
2009-07-20 21:06:50 -07:00
Ulrich Drepper
cea4329592 Minor cleanups in recently added files. 2009-07-03 03:23:01 -07:00
Ulrich Drepper
d6485c981b Align functions to 16-byte boundary.
Some of the new multi-arch string functions for x86-64 were
not aligned to 16 byte boundarie,s possibly creating unnecessary
cache line misses and delays.
2009-07-03 03:01:57 -07:00
H.J. Lu
06e51c8f3d Add SSE4.2 support for strcspn, strpbrk, and strspn on x86-64. 2009-07-03 02:48:56 -07:00
Ulrich Drepper
af263b8154 Whitespace fixes in last patch. 2009-07-02 03:43:05 -07:00
H.J. Lu
ab6a873fe0 SSSE3 strcpy/stpcpy for x86-64
This patch adds SSSE3 strcpy/stpcpy. I got up to 4X speed up on Core 2
and Core i7.  I disabled it on Atom since SSSE3 version is slower for
shorter (<64byte) data.
2009-07-02 03:39:03 -07:00
Ulrich Drepper
b38a2e2e64 Fix little checkin problem in last patch. 2009-06-30 04:41:38 -07:00
H.J. Lu
0181291385 Determine and store processor family and model on x86-64. 2009-06-30 04:39:09 -07:00
Ulrich Drepper
059215ae21 Clean up whitespaces in last patch. 2009-06-22 20:39:37 -07:00
H.J. Lu
772f4e6a1b Add SSE4.2 support for strcmp and strncmp on x86-64. 2009-06-22 20:38:41 -07:00
Ulrich Drepper
b77c932329 Add SSE4.2 optimized rawmemchr implementation for x86-64. 2009-06-05 16:54:50 -07:00
Ulrich Drepper
6f9eea15bf Forgot some more cleanups for the SSE4.2 strlen on x86-64. 2009-06-05 11:51:59 -07:00
Ulrich Drepper
f85a9e72e2 Add missing cleanups from SSE4.2 x86-64 strlen. 2009-06-05 11:39:45 -07:00
Ulrich Drepper
3ab2d57a4d Optimize x86-64 strlen for SSE4.2.
The SSE4.2 implementation is used in the DSO only.  The patch also adds
some infrastructure to be used in similar code later one.
2009-06-05 11:32:00 -07:00
Ulrich Drepper
8ea2372936 Fix up sched_cpucount in x86-64.
Now that static executables can handle IFUNC functions don't exclude
optimization for sched_cpucount for !SHARED.
2009-05-31 23:46:42 -07:00
Ulrich Drepper
963cb6fcb4 Simplify CPUID value handling.
SO far Intel and AMD use exactly the same bits meaning the same
things in CPUID index 1.  Simplify the code.  Should an architecture
come along which doesn't use the same semantics then it must use a
different index value than COMMON_CPUID_INDEX_1.
2009-05-31 17:52:05 -07:00
Ulrich Drepper
425ce2edb9 * config.h.in (USE_MULTIARCH): Define.
* configure.in: Handle --enable-multi-arch.
	* elf/dl-runtime.c (_dl_fixup): Handle STT_GNU_IFUNC.
	(_dl_fixup_profile): Likewise.
	* elf/do-lookup.c (dl_lookup_x): Likewise.
	* sysdeps/x86_64/dl-machine.h: Handle STT_GNU_IFUNC.
	* elf/elf.h (STT_GNU_IFUNC): Define.
	* include/libc-symbols.h (libc_ifunc): Define.
	* sysdeps/x86_64/cacheinfo.c: If USE_MULTIARCH is defined, use the
	framework in init-arch.h to get CPUID values.
	* sysdeps/x86_64/multiarch/Makefile: New file.
	* sysdeps/x86_64/multiarch/init-arch.c: New file.
	* sysdeps/x86_64/multiarch/init-arch.h: New file.
	* sysdeps/x86_64/multiarch/sched_cpucount.c: New file.

	* config.make.in (experimental-malloc): Define.
	* configure.in: Handle --enable-experimental-malloc.
	* malloc/Makefile: Handle experimental-malloc flag.
	* malloc/malloc.c: Implement PER_THREAD and ATOMIC_FASTBINS features.
	* malloc/arena.c: Likewise.
	* malloc/hooks.c: Likewise.
	* malloc/malloc.h: Define M_ARENA_TEST and M_ARENA_MAX.
2009-03-13 23:53:18 +00:00