Commit Graph

1116 Commits

Author SHA1 Message Date
H.J. Lu
6a094c0ff1 x86-64: Regenerate libm-test-ulps for AVX512 mathvec tests
Update libm-test-ulps for AVX512 mathvec tests by running
“make regen-ulps” on Intel Xeon processor with AVX512.

	* sysdeps/x86_64/fpu/libm-test-ulps: Regenerated.

(cherry picked from commit fcaaca412f)
2017-11-13 13:31:18 +01:00
Adhemerval Zanella
5712f8db26 nptl: Add tests for internal pthread_mutex_t offsets
This patch adds a new build test to check for internal fields
offsets for user visible internal field.  Although currently
the only field which is statically initialized to a non zero value
is pthread_mutex_t.__data.__kind value, the tests also check the
offset of __kind, __spins, __elision (if supported), and __list
internal member.  A internal header (pthread-offset.h) is added
to each major ABI with the reference value.

Checked on x86_64-linux-gnu and with a build check for all affected
ABIs (aarch64-linux-gnu, alpha-linux-gnu, arm-linux-gnueabihf,
hppa-linux-gnu, i686-linux-gnu, ia64-linux-gnu, m68k-linux-gnu,
microblaze-linux-gnu, mips64-linux-gnu, mips64-n32-linux-gnu,
mips-linux-gnu, powerpc64le-linux-gnu, powerpc-linux-gnu,
s390-linux-gnu, s390x-linux-gnu, sh4-linux-gnu, sparc64-linux-gnu,
sparcv9-linux-gnu, tilegx-linux-gnu, tilegx-linux-gnu-x32,
tilepro-linux-gnu, x86_64-linux-gnu, and x86_64-linux-x32).

	* nptl/pthreadP.h (ASSERT_PTHREAD_STRING,
	ASSERT_PTHREAD_INTERNAL_OFFSET): New macro.
	* nptl/pthread_mutex_init.c (__pthread_mutex_init): Add build time
	checks for internal pthread_mutex_t offsets.
	* sysdeps/aarch64/nptl/pthread-offsets.h
	(__PTHREAD_MUTEX_NUSERS_OFFSET, __PTHREAD_MUTEX_KIND_OFFSET,
	__PTHREAD_MUTEX_SPINS_OFFSET, __PTHREAD_MUTEX_ELISION_OFFSET,
	__PTHREAD_MUTEX_LIST_OFFSET): New macro.
	* sysdeps/alpha/nptl/pthread-offsets.h: Likewise.
	* sysdeps/arm/nptl/pthread-offsets.h: Likewise.
	* sysdeps/hppa/nptl/pthread-offsets.h: Likewise.
	* sysdeps/i386/nptl/pthread-offsets.h: Likewise.
	* sysdeps/ia64/nptl/pthread-offsets.h: Likewise.
	* sysdeps/m68k/nptl/pthread-offsets.h: Likewise.
	* sysdeps/microblaze/nptl/pthread-offsets.h: Likewise.
	* sysdeps/mips/nptl/pthread-offsets.h: Likewise.
	* sysdeps/nios2/nptl/pthread-offsets.h: Likewise.
	* sysdeps/powerpc/nptl/pthread-offsets.h: Likewise.
	* sysdeps/s390/nptl/pthread-offsets.h: Likewise.
	* sysdeps/sh/nptl/pthread-offsets.h: Likewise.
	* sysdeps/sparc/nptl/pthread-offsets.h: Likewise.
	* sysdeps/tile/nptl/pthread-offsets.h: Likewise.
	* sysdeps/x86_64/nptl/pthread-offsets.h: Likewise.

Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

(cherry picked from commit dff91cd45e)
2017-11-07 10:57:49 -02:00
H.J. Lu
4b692dffb9 x86-64: Don't set GLRO(dl_platform) to NULL [BZ #22299]
Since ld.so expands $PLATFORM with GLRO(dl_platform), don't set
GLRO(dl_platform) to NULL.

	[BZ #22299]
	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
	GLRO(dl_platform) to NULL.
	* sysdeps/x86_64/Makefile (tests): Add tst-platform-1.
	(modules-names): Add tst-platformmod-1 and
	x86_64/tst-platformmod-2.
	(CFLAGS-tst-platform-1.c): New.
	(CFLAGS-tst-platformmod-1.c): Likewise.
	(CFLAGS-tst-platformmod-2.c): Likewise.
	(LDFLAGS-tst-platformmod-2.so): Likewise.
	($(objpfx)tst-platform-1): Likewise.
	($(objpfx)tst-platform-1.out): Likewise.
	(tst-platform-1-ENV): Likewise.
	($(objpfx)x86_64/tst-platformmod-2.os): Likewise.
	* sysdeps/x86_64/tst-platform-1.c: New file.
	* sysdeps/x86_64/tst-platformmod-1.c: Likewise.
	* sysdeps/x86_64/tst-platformmod-2.c: Likewise.

(cherry picked from commit 4d916f0f12)
2017-10-26 07:36:47 -07:00
H.J. Lu
f82a6fc223 x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]
In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
mask and bound registers.  It simplifies _dl_runtime_resolve and supports
different calling conventions.  ld.so code size is reduced by more than
1 KB.  However, use fxsave/xsave/xsavec takes a little bit more cycles
than saving and restoring vector and bound registers individually.

Latency for _dl_runtime_resolve to lookup the function, foo, from one
shared library plus libc.so:

                             Before    After     Change

Westmere (SSE)/fxsave         345      866       151%
IvyBridge (AVX)/xsave         420      643       53%
Haswell (AVX)/xsave           713      1252      75%
Skylake (AVX+MPX)/xsavec      559      719       28%
Skylake (AVX512+MPX)/xsavec   145      272       87%
Ryzen (AVX)/xsavec            280      553       97%

This is the worst case where portion of time spent for saving and
restoring registers is bigger than majority of cases.  With smaller
_dl_runtime_resolve code size, overall performance impact is negligible.

On IvyBridge, differences in build and test time of binutils with lazy
binding GCC and binutils are noises.  On Westmere, differences in
bootstrap and "makc check" time of GCC 7 with lazy binding GCC and
binutils are also noises.

	[BZ #21265]
	* sysdeps/x86/cpu-features-offsets.sym (XSAVE_STATE_SIZE_OFFSET):
	New.
	* sysdeps/x86/cpu-features.c: Include <libc-pointer-arith.h>.
	(get_common_indeces): Set xsave_state_size, xsave_state_full_size
	and bit_arch_XSAVEC_Usable if needed.
	(init_cpu_features): Remove bit_arch_Use_dl_runtime_resolve_slow
	and bit_arch_Use_dl_runtime_resolve_opt.
	* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
	Removed.
	(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
	(bit_arch_Prefer_No_AVX512): Updated.
	(bit_arch_MathVec_Prefer_No_AVX512): Likewise.
	(bit_arch_XSAVEC_Usable): New.
	(STATE_SAVE_OFFSET): Likewise.
	(STATE_SAVE_MASK): Likewise.
	[__ASSEMBLER__]: Include <cpu-features-offsets.h>.
	(cpu_features): Add xsave_state_size and xsave_state_full_size.
	(index_arch_Use_dl_runtime_resolve_opt): Removed.
	(index_arch_Use_dl_runtime_resolve_slow): Likewise.
	(index_arch_XSAVEC_Usable): New.
	* sysdeps/x86/cpu-tunables.c (TUNABLE_CALLBACK (set_hwcaps)):
	Support XSAVEC_Usable.  Remove Use_dl_runtime_resolve_slow.
	* sysdeps/x86_64/Makefile (tst-x86_64-1-ENV): New if tunables
	is enabled.
	* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup):
	Replace _dl_runtime_resolve_sse, _dl_runtime_resolve_avx,
	_dl_runtime_resolve_avx_slow, _dl_runtime_resolve_avx_opt,
	_dl_runtime_resolve_avx512 and _dl_runtime_resolve_avx512_opt
	with _dl_runtime_resolve_fxsave, _dl_runtime_resolve_xsave and
	_dl_runtime_resolve_xsavec.
	* sysdeps/x86_64/dl-trampoline.S (DL_RUNTIME_UNALIGNED_VEC_SIZE):
	Removed.
	(DL_RUNTIME_RESOLVE_REALIGN_STACK): Check STATE_SAVE_ALIGNMENT
	instead of VEC_SIZE.
	(REGISTER_SAVE_BND0): Removed.
	(REGISTER_SAVE_BND1): Likewise.
	(REGISTER_SAVE_BND3): Likewise.
	(REGISTER_SAVE_RAX): Always defined to 0.
	(VMOV): Removed.
	(_dl_runtime_resolve_avx): Likewise.
	(_dl_runtime_resolve_avx_slow): Likewise.
	(_dl_runtime_resolve_avx_opt): Likewise.
	(_dl_runtime_resolve_avx512): Likewise.
	(_dl_runtime_resolve_avx512_opt): Likewise.
	(_dl_runtime_resolve_sse): Likewise.
	(_dl_runtime_resolve_sse_vex): Likewise.
	(USE_FXSAVE): New.
	(_dl_runtime_resolve_fxsave): Likewise.
	(USE_XSAVE): Likewise.
	(_dl_runtime_resolve_xsave): Likewise.
	(USE_XSAVEC): Likewise.
	(_dl_runtime_resolve_xsavec): Likewise.
	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx512):
	Removed.
	(_dl_runtime_resolve_avx512_opt): Likewise.
	(_dl_runtime_resolve_avx): Likewise.
	(_dl_runtime_resolve_avx_opt): Likewise.
	(_dl_runtime_resolve_sse): Likewise.
	(_dl_runtime_resolve_sse_vex): Likewise.
	(_dl_runtime_resolve_fxsave): New.
	(_dl_runtime_resolve_xsave): Likewise.
	(_dl_runtime_resolve_xsavec): Likewise.

(cherry picked from commit b52b0d793d)
2017-10-22 07:41:00 -07:00
H.J. Lu
b2c78ae69e x86: Add x86_64 to x86-64 HWCAP [BZ #22093]
Before glibc 2.26, ld.so set dl_platform to "x86_64" and searched the
"x86_64" subdirectory when loading a shared library.  ld.so in glibc
2.26 was changed to set dl_platform to "haswell" or "xeon_phi", based
on supported ISAs.  This led to shared library loading failure for
shared libraries placed under the "x86_64" subdirectory.

This patch adds "x86_64" to x86-64 dl_hwcap so that ld.so will always
search the "x86_64" subdirectory when loading a shared library.

NB: We can't set x86-64 dl_platform to "x86-64" since ld.so will skip
the "haswell" and "xeon_phi" subdirectories on "haswell" and "xeon_phi"
machines.

Tested on i686 and x86-64.

	[BZ #22093]
	* sysdeps/x86/cpu-features.c (init_cpu_features): Initialize
	GLRO(dl_hwcap) to HWCAP_X86_64 for x86-64.
	* sysdeps/x86/dl-hwcap.h (HWCAP_COUNT): Updated.
	(HWCAP_IMPORTANT): Likewise.
	(HWCAP_X86_64): New enum.
	(HWCAP_X86_AVX512_1): Updated.
	* sysdeps/x86/dl-procinfo.c (_dl_x86_hwcap_flags): Add "x86_64".
	* sysdeps/x86_64/Makefile (tests): Add tst-x86_64-1.
	(modules-names): Add x86_64/tst-x86_64mod-1.
	(LDFLAGS-tst-x86_64mod-1.so): New.
	($(objpfx)tst-x86_64-1): Likewise.
	($(objpfx)x86_64/tst-x86_64mod-1.os): Likewise.
	(tst-x86_64-1-clean): Likewise.
	* sysdeps/x86_64/tst-x86_64-1.c: New file.
	* sysdeps/x86_64/tst-x86_64mod-1.c: Likewise.

(cherry picked from commit 45ff34638f)
2017-10-22 04:16:39 -07:00
Markus Trippelsdorf
5f5532caf8 Update x86_64 ulps for AMD Ryzen.
* sysdeps/x86_64/fpu/libm-test-ulps: Update for AMD Ryzen.

(cherry picked from commit 4c03a69680)
2017-09-08 20:00:10 +00:00
H.J. Lu
7a499756ab x86-64: Test memmove_chk and memset_chk only in libc.so [BZ #21741]
Since there are no multiarch versions of memmove_chk and memset_chk,
test multiarch versions of memmove_chk and memset_chk only in libc.so.

	[BZ #21741]
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test memmove_chk and memset_chk only
	in libc.so.
2017-07-10 04:44:38 -07:00
H.J. Lu
58d021c836 x86-64: Update comments in IFUNC selectors
* sysdeps/x86_64/multiarch/memcmp.c: Update comments.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memrchr.c: Likewise.
	* sysdeps/x86_64/multiarch/memset.c: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul.c: Likewise.
	* sysdeps/x86_64/multiarch/strlen.c: Likewise.
	* sysdeps/x86_64/multiarch/strnlen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcschr.c: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy.c: Likewise.
	* sysdeps/x86_64/multiarch/wcslen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemset.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise.
2017-07-09 11:43:20 -07:00
H.J. Lu
4df54c89bb x86-64: Update comments in ifunc-impl-list.c
All x86-64 IFUNC selectors are written in C now.  Update comments to
reflect it.

	* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Update comments.
2017-07-09 11:38:37 -07:00
H.J. Lu
031e519c95 x86-64: Align the stack in __tls_get_addr [BZ #21609]
This change forces realignment of the stack pointer in __tls_get_addr, so
that binaries compiled by GCCs older than GCC 4.9:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066

continue to work even if vector instructions are used in glibc which
require the ABI stack realignment.

__tls_get_addr_slow is added to handle the slow paths in the default
implementation of__tls_get_addr in elf/dl-tls.c.  The new __tls_get_addr
calls __tls_get_addr_slow after realigning the stack.  Internal calls
within ld.so go directly to the default implementation of __tls_get_addr
because they do not need stack realignment.

	[BZ #21609]
	* sysdeps/x86_64/Makefile (sysdep-dl-routines): Add tls_get_addr.
	(gen-as-const-headers): Add rtld-offsets.sym.
	* sysdeps/x86_64/dl-tls.c: New file.
	* sysdeps/x86_64/rtld-offsets.sym: Likwise.
	* sysdeps/x86_64/tls_get_addr.S: Likewise.
	* sysdeps/x86_64/dl-tls.h: Add multiple inclusion guards.
	* sysdeps/x86_64/tlsdesc.sym (TI_MODULE_OFFSET): New.
	(TI_OFFSET_OFFSET): Likwise.
2017-07-06 04:43:20 -07:00
Joseph Myers
073e8fa773 Require binutils 2.25 or later to build glibc.
This patch implements a requirement of binutils >= 2.25 (up from 2.22)
to build glibc.  Tests for 2.24 or later on x86_64 and s390 are
removed.  It was already the case, as indicated by buildbot results,
that 2.24 was too old for building tests for 32-bit x86 (produced
internal linker errors linking elf/tst-gnu2-tls1mod.so).  I don't know
if any configure tests for binutils features are obsolete given the
increased version requirement.

Tested for x86_64.

	* configure.ac (AS): Require binutils 2.25 or later.
	(LD): Likewise.
	* configure: Regenerated.
	* sysdeps/s390/configure.ac (AS): Remove version check.
	* sysdeps/s390/configure: Regenerated.
	* sysdeps/x86_64/configure.ac (AS): Remove version check.
	* sysdeps/x86_64/configure: Regenerated.
	* manual/install.texi (Tools for Compilation): Document
	requirement for binutils 2.25 or later.
	* INSTALL: Regenerated.
2017-06-28 11:31:50 +00:00
H.J. Lu
e94c310357 x86-64: Optimize memcmp-avx2-movbe.S for short difference
Check the first 32 bytes before checking size when size >= 32 bytes
to avoid unnecessary branch if the difference is in the first 32 bytes.
Replace vpmovmskb/subl/jnz with vptest/jnc.

On Haswell, the new version is as fast as the previous one.  On Skylake,
the new version is a little bit faster.

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check
	the first 32 bytes before checking size when size >= 32 bytes.
	Replace vpmovmskb/subl/jnz with vptest/jnc.
2017-06-27 07:55:00 -07:00
Joseph Myers
c86ed71d63 Add float128 support for x86_64, x86.
This patch enables float128 support for x86_64 and x86.  All GCC
versions that can build glibc provide the required support, but since
GCC 6 and before don't provide __builtin_nanq / __builtin_nansq, sNaN
tests and some tests of NaN payloads need to be disabled with such
compilers (this does not affect the generated glibc binaries at all,
just the tests).  bits/floatn.h declares float128 support to be
available for GCC versions that provide the required libgcc support
(4.3 for x86_64, 4.4 for i386 GNU/Linux, 4.5 for i386 GNU/Hurd);
compilation-only support was present some time before then, but not
really useful without the libgcc functions.

fenv_private.h needed updating to avoid trying to put _Float128 values
in registers.  I make no assertion of optimality of the
math_opt_barrier / math_force_eval definitions for this case; they are
simply intended to be sufficient to work correctly.

Tested for x86_64 and x86, with GCC 7 and GCC 6.  (Testing for x32 was
compilation tests only with build-many-glibcs.py to verify the ABI
baseline updates.  I have not done any testing for Hurd, although the
float128 support is enabled there as for GNU/Linux.)

	* sysdeps/i386/Implies: Add ieee754/float128.
	* sysdeps/x86_64/Implies: Likewise.
	* sysdeps/x86/bits/floatn.h: New file.
	* sysdeps/x86/float128-abi.h: Likewise.
	* manual/math.texi (Mathematics): Document support for _Float128
	on x86_64 and x86.
	* sysdeps/i386/fpu/fenv_private.h: Include <bits/floatn.h>.
	(math_opt_barrier): Do not put _Float128 values in floating-point
	registers.
	(math_force_eval): Likewise.
	[__x86_64__] (SET_RESTORE_ROUNDF128): New macro.
	* sysdeps/x86/fpu/Makefile [$(subdir) = math] (CPPFLAGS): Append
	to Makefile variable.
	* sysdeps/x86/fpu/e_sqrtf128.c: New file.
	* sysdeps/x86/fpu/sfp-machine.h: Likewise.  Based on libgcc.
	* sysdeps/x86/math-tests.h: New file.
	* math/libm-test-support.h (XFAIL_FLOAT128_PAYLOAD): New macro.
	* math/libm-test-getpayload.inc (getpayload_test_data): Use
	XFAIL_FLOAT128_PAYLOAD.
	* math/libm-test-setpayload.inc (setpayload_test_data): Likewise.
	* math/libm-test-totalorder.inc (totalorder_test_data): Likewise.
	* math/libm-test-totalordermag.inc (totalordermag_test_data):
	Likewise.
	* sysdeps/unix/sysv/linux/i386/libc.abilist: Update.
	* sysdeps/unix/sysv/linux/i386/libm.abilist: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/64/libc.abilist: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Likewise.
	* sysdeps/i386/fpu/libm-test-ulps: Likewise.
	* sysdeps/i386/i686/fpu/multiarch/libm-test-ulps: Likewise.
	* sysdeps/x86_64/fpu/libm-test-ulps: Likewise.
2017-06-26 22:02:24 +00:00
H.J. Lu
049816c3be x86-64: Optimize L(between_2_3) in memcmp-avx2-movbe.S
Turn

	movzbl	-1(%rdi, %rdx), %edi
	movzbl	-1(%rsi, %rdx), %esi
	orl	%edi, %eax
	orl	%esi, %ecx

into

	movb	-1(%rdi, %rdx), %al
	movb	-1(%rsi, %rdx), %cl

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (between_2_3):
	Replace movzbl and orl with movb.
2017-06-23 12:46:12 -07:00
Florian Weimer
bc0382ae90 x86-64: Fix comment typo in memcmp-avx2-movbe.S 2017-06-23 19:00:58 +02:00
Florian Weimer
3ec7c02cc3 x86-64: memcmp-avx2-movbe.S needs saturating subtraction [BZ #21662]
This code:

L(between_2_3):
	/* Load as big endian with overlapping loads and bswap to avoid
	   branches.  */
	movzwl	-2(%rdi, %rdx), %eax
	movzwl	-2(%rsi, %rdx), %ecx
	shll	$16, %eax
	shll	$16, %ecx
	movzwl	(%rdi), %edi
	movzwl	(%rsi), %esi
	orl	%edi, %eax
	orl	%esi, %ecx
	bswap	%eax
	bswap	%ecx
	subl	%ecx, %eax
	ret

needs a saturating subtract because the full register is used.
With this commit, only the lower 24 bits of the register are used,
so a regular subtraction suffices.

The test case change adds coverage for these kinds of bugs.
2017-06-23 17:24:40 +02:00
H.J. Lu
11ffcacb64 x86-64: Implement strcmp family IFUNC selectors in C
Implement strcmp family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcmp family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcmp-sse2, strcmp-sse4_2, strncmp-sse2, strncmp-sse4_2,
	strcasecmp_l-sse2, strcasecmp_l-sse4_2, strcasecmp_l-avx,
	strncase_l-sse2, strncase_l-sse4_2 and strncase_l-avx.
	* sysdeps/x86_64/multiarch/ifunc-strcasecmp.h: New file.
	* sysdeps/x86_64/multiarch/strcasecmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-avx.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l.c: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strncase.c: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-avx.S : Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l.c: Likewise.
	* sysdeps/x86_64/multiarch/strncmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l.S: Removed.
	* sysdeps/x86_64/multiarch/strcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse42.S: Include <sysdep.h>.
	(STRCMP_SSE42): New.  Defined to __strcmp_sse42 if not defined.
	[USE_AS_STRCASECMP_L || USE_AS_STRNCASECMP_L]: Include
	"locale-defines.h".
	(UPDATE_STRNCMP_COUNTER): New.
	(SECTION): Likewise.
	(GLABEL): Likewise.
	(LABEL): Likewise.
	* sysdeps/x86_64/multiarch/strncmp-ssse3.S: Rewrite and enable
	for libc.a.
2017-06-21 12:11:06 -07:00
Zack Weinberg
af85385f31 Use locale_t, not __locale_t, throughout glibc
<locale.h> is specified to define locale_t in POSIX.1-2008, and so are
all of the headers that define functions that take locale_t arguments.
Under _GNU_SOURCE, the additional headers that define such functions
have also always defined locale_t.  Therefore, there is no need to use
__locale_t in public function prototypes, nor in any internal code.

	* ctype/ctype-c99_l.c, ctype/ctype.h, ctype/ctype_l.c
	* include/monetary.h, include/stdlib.h, include/time.h
	* include/wchar.h, locale/duplocale.c, locale/freelocale.c
	* locale/global-locale.c, locale/langinfo.h, locale/locale.h
	* locale/localeinfo.h, locale/newlocale.c
	* locale/nl_langinfo_l.c, locale/uselocale.c
	* localedata/bug-usesetlocale.c, localedata/tst-xlocale2.c
	* stdio-common/vfscanf.c, stdlib/monetary.h, stdlib/stdlib.h
	* stdlib/strfmon_l.c, stdlib/strtod_l.c, stdlib/strtof_l.c
	* stdlib/strtol.c, stdlib/strtol_l.c, stdlib/strtold_l.c
	* stdlib/strtoll_l.c, stdlib/strtoul_l.c, stdlib/strtoull_l.c
	* string/strcasecmp.c, string/strcoll_l.c, string/string.h
	* string/strings.h, string/strncase.c, string/strxfrm_l.c
	* sysdeps/ieee754/float128/strtof128_l.c
	* sysdeps/ieee754/float128/wcstof128.c
	* sysdeps/ieee754/float128/wcstof128_l.c
	* sysdeps/ieee754/ldbl-128ibm/strtold_l.c
	* sysdeps/ieee754/ldbl-64-128/strtold_l.c
	* sysdeps/ieee754/ldbl-opt/nldbl-compat.c
	* sysdeps/ieee754/ldbl-opt/nldbl-strfmon_l.c
	* sysdeps/ieee754/ldbl-opt/nldbl-strtold_l.c
	* sysdeps/ieee754/ldbl-opt/nldbl-wcstold_l.c
	* sysdeps/powerpc/powerpc32/power7/strcasecmp.S
	* sysdeps/powerpc/powerpc64/power7/strcasecmp.S
	* sysdeps/x86_64/strcasecmp_l-nonascii.c
	* sysdeps/x86_64/strncase_l-nonascii.c, time/strftime_l.c
	* time/strptime_l.c, time/time.h, wcsmbs/mbsrtowcs_l.c
	* wcsmbs/wchar.h, wcsmbs/wcscasecmp.c, wcsmbs/wcsncase.c
	* wcsmbs/wcstod.c, wcsmbs/wcstod_l.c, wcsmbs/wcstof.c
	* wcsmbs/wcstof_l.c, wcsmbs/wcstol_l.c, wcsmbs/wcstold.c
	* wcsmbs/wcstold_l.c, wcsmbs/wcstoll_l.c, wcsmbs/wcstoul_l.c
	* wcsmbs/wcstoull_l.c, wctype/iswctype_l.c
	* wctype/towctrans_l.c, wctype/wcfuncs_l.c
	* wctype/wctrans_l.c, wctype/wctype.h, wctype/wctype_l.c:
	Change all uses of __locale_t to locale_t.
2017-06-20 20:30:06 -04:00
Zack Weinberg
c0b23001a8 Fix fallout from bits/string.h removal.
Remove one more string inline that was defined directly in string.h;
in the absence of the rest of the inlines, it broke the build.

Like other ifunc shims for these functions,
x86_64/multiarch/{mem,st}pcpy.c need to define __NO_STRING_INLINES and
NO_MEMPCPY_STPCPY_REDIRECT.

	* string/string.h (__mempcpy_inline): Delete.
	* sysdeps/x86_64/multiarch/mempcpy.c
	* sysdeps/x86_64/multiarch/stpcpy.c:
	Define NO_MEMPCPY_STPCPY_REDIRECT and __NO_STRING_INLINES
	before including string.h.
2017-06-20 09:39:08 -04:00
Zack Weinberg
09a596cc2c Remove bits/string.h.
These machine-dependent inline string functions have never been on by
default, and even if they were a good idea at the time they were
introduced, they haven't really been touched in ten to fifteen years
and probably aren't a good idea on current-gen processors.  Current
thinking is that this class of optimization is best left to the
compiler.

	* bits/string.h, string/bits/string.h
	* sysdeps/aarch64/bits/string.h
	* sysdeps/m68k/m680x0/m68020/bits/string.h
	* sysdeps/s390/bits/string.h, sysdeps/sparc/bits/string.h
	* sysdeps/x86/bits/string.h: Delete file.

	* string/string.h: Don't include bits/string.h.
	* string/bits/string3.h: Rename to bits/string_fortified.h.
	No need to undef various symbols that the removed headers
	might have defined as macros.
	* string/Makefile (headers): Remove bits/string.h, change
	bits/string3.h to bits/string_fortified.h.
	* string/string-inlines.c: Update commentary.  Remove definitions
	of various macros that nothing looks at anymore.  Don't directly
	include bits/string.h. Set _STRING_INLINE_unaligned here, based on
	compiler-predefined macros.
	* string/strncat.c: If STRNCAT is not defined, or STRNCAT_PRIMARY
	_is_ defined, provide internal hidden alias __strncat.
	* include/string.h: Declare internal hidden alias __strncat.
	Only forward __stpcpy to __builtin_stpcpy if __NO_STRING_INLINES is
	not defined.
	* include/bits/string3.h: Rename to bits/string_fortified.h,
	update to match above.

	* sysdeps/i386/string-inlines.c: Define compat symbols for
	everything formerly defined by sysdeps/x86/bits/string.h.
	Make existing definitions into compat symbols as well.
	Remove some no-longer-necessary messing around with macros.

	* sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c
	* sysdeps/powerpc/powerpc64/multiarch/mempcpy.c
	* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
	* sysdeps/s390/multiarch/mempcpy.c
	No need to define _HAVE_STRING_ARCH_mempcpy.
	Do define __NO_STRING_INLINES and NO_MEMPCPY_STPCPY_REDIRECT.

	* sysdeps/i386/i686/multiarch/strncat-c.c
	* sysdeps/s390/multiarch/strncat-c.c
	* sysdeps/x86_64/multiarch/strncat-c.c
	Define STRNCAT_PRIMARY.  Don't change definition of libc_hidden_def.
2017-06-20 08:21:24 -04:00
Siddhesh Poyarekar
629ebc873a Fix typo when undefining weak_alias
The macro directive #undef was miswritten as #undefine.

	* sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Fix typo.
2017-06-19 14:56:40 +05:30
H.J. Lu
70fe2eb794 x86-64: Implement strcspn/strpbrk/strspn IFUNC selectors in C
Implement strcspn/strpbrk/strspn IFUNC selectors in C

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcspn/strpbrk/strspn functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcspn-sse2, strpbrk-sse2 and strspn-sse2.
	* sysdeps/x86_64/strcspn.S (STRPBRK_P): Removed.
	Check USE_AS_STRPBRK instead of STRPBRK_P.
	* sysdeps/x86_64/strpbrk.S (USE_AS_STRPBRK): New.
	* sysdeps/x86_64/multiarch/ifunc-sse4_2.h: New file.
	* sysdeps/x86_64/multiarch/strcspn-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.c: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk.c: Likewise.
	* sysdeps/x86_64/multiarch/strspn-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.c: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.S: Removed.
	* sysdeps/x86_64/multiarch/strpbrk.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.S: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk-c.c: Remove "#ifdef SHARED"
	and "#endif".
2017-06-15 08:59:05 -07:00
H.J. Lu
9f4254b8bd x86-64: Implement wcscpy IFUNC selector in C
* sysdeps/x86_64/multiarch/wcscpy.S: Removed.
	* sysdeps/x86_64/multiarch/wcscpy.c: New file.
2017-06-15 08:57:52 -07:00
H.J. Lu
9ed0aa15d3 x86-64: Implement strcat family IFUNC selectors in C
Implement strcat family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcat family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcat-sse2.
	* sysdeps/x86_64/multiarch/strcat-sse2.S: New file.
	* sysdeps/x86_64/multiarch/strcat.c: Likewise.
	* sysdeps/x86_64/multiarch/strncat.c: Likewise.
	* sysdeps/x86_64/multiarch/strcat.S: Removed.
	* sysdeps/x86_64/multiarch/strncat.S: Likewise.
2017-06-15 08:56:59 -07:00
H.J. Lu
b91a52d0d7 x86-64: Implement memcmp family IFUNC selectors in C
Implement memcmp family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memcmp family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memcmp-sse2.
	* sysdeps/x86_64/multiarch/ifunc-memcmp.h: New file.
	* sysdeps/x86_64/multiarch/memcmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Removed.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
2017-06-15 08:49:57 -07:00
H.J. Lu
93e46f8773 x86-64: Implement memset family IFUNC selectors in C
Implement memset family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memset functions within libc.

2017-06-07  H.J. Lu  <hongjiu.lu@intel.com>
	    Erich Elsen  <eriche@google.com>

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-sse2-unaligned-erms, and memset_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add test for __memset_chk_erms.
	Update comments.
	* sysdeps/x86_64/multiarch/ifunc-memset.h: New file.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset.c: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Removed.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(__memset_chk_erms): New function.
2017-06-15 08:33:35 -07:00
H.J. Lu
5c3e322d3b x86-64: Implement memmove family IFUNC selectors in C
Implement memmove family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memmove family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memmove-sse2-unaligned-erms, memcpy_chk-nonshared,
	mempcpy_chk-nonshared and memmove_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __memmove_chk_erms,
	__memcpy_chk_erms and __mempcpy_chk_erms.  Update comments.
	* sysdeps/x86_64/multiarch/ifunc-memmove.h: New file.
	* sysdeps/x86_64/multiarch/memcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S: Removed.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(__mempcpy_chk_erms): New function.
	(__memmove_chk_erms): Likewise.
	(__memcpy_chk_erms): New alias.
2017-06-14 12:11:10 -07:00
Zack Weinberg
fd860eaaa8 Remove __need macros from errno.h (__need_Emath, __need_error_t).
This is fairly complicated, not because the users of __need_Emath and
__need_error_t have complicated requirements, but because the core
changes had a lot of fallout.

__need_error_t exists for gnulib compatibility in argz.h and argp.h.
error_t itself is a Hurdism, an enum containing all the E-constants,
so you can do 'p (error_t) errno' in gdb and get a symbolic value.
argz.h and argp.h use it for function return values, and they want to
fall back to 'int' when that's not available.  There is no reason why
these nonstandard headers cannot just go ahead and include all of
errno.h; so we do that.

__need_Emath is defined only by .S files; what they _really_ need is
for errno.h to avoid declaring anything other than the E-constants
(e.g. 'extern int __errno_location(void);' is a syntax error in
assembly language). This is replaced with a check for __ASSEMBLER__ in
errno.h, plus a carefully documented requirement for bits/errno.h not
to define anything other than macros.  That in turn has the
consequence that bits/errno.h must not define errno - fortunately, all
live ports use the same definition of errno, so I've moved it to
errno.h.  The Hurd bits/errno.h must also take care not to define
error_t when __ASSEMBLER__ is defined, which involves repeating all of
the definitions twice, but it's a generated file so that's okay.

	* stdlib/errno.h: Remove __need_Emath and __need_error_t logic.
	Reorganize file.  Declare errno here.  When __ASSEMBLER__ is
	defined, don't declare anything other than the E-constants.

	* include/errno.h: Change conditional for exposing internal
	declarations to (not _ISOMAC and not __ASSEMBLER__).
	* bits/errno.h: Remove logic for __need_Emath.  Document
	requirements for a port-specific bits/errno.h.

	* sysdeps/unix/sysv/linux/bits/errno.h
	* sysdeps/unix/sysv/linux/alpha/bits/errno.h
	* sysdeps/unix/sysv/linux/hppa/bits/errno.h
	* sysdeps/unix/sysv/linux/mips/bits/errno.h
	* sysdeps/unix/sysv/linux/sparc/bits/errno.h:
	Add multiple-include guard and check against improper inclusion.
	Remove __need_Emath logic.  Don't declare errno here.  Ensure all
	constants are defined as simple integer literals.  Consistent
	formatting.
	* sysdeps/mach/hurd/errnos.awk: Likewise.  Only define error_t and
	enum __error_t_codes if __ASSEMBLER__ is not defined.
	* sysdeps/mach/hurd/bits/errno.h: Regenerate.

	* argp/argp.h, string/argz.h: Don't define __need_error_t before
	including errno.h.
	* sysdeps/i386/i686/fpu/multiarch/s_cosf-sse2.S
	* sysdeps/i386/i686/fpu/multiarch/s_sincosf-sse2.S
	* sysdeps/i386/i686/fpu/multiarch/s_sinf-sse2.S
	* sysdeps/x86_64/fpu/s_cosf.S
	* sysdeps/x86_64/fpu/s_sincosf.S
	* sysdeps/x86_64/fpu/s_sinf.S:
	Just include errno.h; don't define __need_Emath or include
	bits/errno.h directly.
2017-06-14 08:14:34 -04:00
Alan Modra
0572433b5b PowerPC64 ELFv2 PPC64_OPT_LOCALENTRY
ELFv2 functions with localentry:0 are those with a single entry point,
ie. global entry == local entry, that have no requirement on r2 or
r12 and guarantee r2 is unchanged on return.  Such an external
function can be called via the PLT without saving r2 or restoring it
on return, avoiding a common load-hit-store for small functions.

This patch implements the ld.so changes necessary for this
optimization.  ld.so needs to check that an optimized plt call
sequence is in fact calling a function implemented with localentry:0,
end emit a fatal error otherwise.

The elf/testobj6.c change is to stop "error while loading shared
libraries: expected localentry:0 `preload'" when running
elf/preloadtest, which we'd get otherwise.

	* elf/elf.h (PPC64_OPT_LOCALENTRY): Define.
	* sysdeps/alpha/dl-machine.h (elf_machine_fixup_plt): Add
	refsym and sym parameters.  Adjust callers.
	* sysdeps/aarch64/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/arm/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/generic/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/hppa/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/i386/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/ia64/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/m68k/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/microblaze/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/mips/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/nios2/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/powerpc/powerpc32/dl-machine.h (elf_machine_fixup_plt):
	Likewise.
	* sysdeps/s390/s390-32/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/s390/s390-64/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/sh/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/sparc/sparc32/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/sparc/sparc64/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/tile/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/x86_64/dl-machine.h (elf_machine_fixup_plt): Likewise.
	* sysdeps/powerpc/powerpc64/dl-machine.c (_dl_error_localentry): New.
	(_dl_reloc_overflow): Increase buffser size.  Formatting.
	* sysdeps/powerpc/powerpc64/dl-machine.h (ppc64_local_entry_offset):
	Delete reloc param, add refsym and sym.  Check optimized plt
	call stubs for localentry:0 functions.  Adjust callers.
	(elf_machine_fixup_plt, elf_machine_plt_conflict): Add refsym
	and sym parameters.  Adjust callers.
	(_dl_reloc_overflow): Move attribute.
	(_dl_error_localentry): Declare.
	* elf/dl-runtime.c (_dl_fixup): Save original sym.  Pass
	refsym and sym to elf_machine_fixup_plt.
	* elf/testobj6.c (preload): Call printf.
2017-06-14 10:47:25 +09:30
H.J. Lu
5a103908c0 x86-64: Implement strcpy family IFUNC selectors in C
Implement strcpy family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcpy family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcpy-sse2 and stpcpy-sse2.
	* sysdeps/x86_64/multiarch/ifunc-unaligned-ssse3.h: New file.
	* sysdeps/x86_64/multiarch/stpcpy-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/stpcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/stpncpy.c: Likewise.
	* sysdeps/x86_64/multiarch/strcpy-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/strncpy.c: Likewise.
	* sysdeps/x86_64/multiarch/stpcpy.S: Removed.
	* sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strncpy.S: Likewise.
	* sysdeps/x86_64/multiarch/stpncpy-c.c (weak_alias): New.
	(libc_hidden_def): Always defined as empty.
	* sysdeps/x86_64/multiarch/strncpy-c.c (libc_hidden_builtin_def):
	Always Defined as empty.
2017-06-12 09:06:09 -07:00
H.J. Lu
6b6710e55b x86-64: Correct comments in ifunc-impl-list.c
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Correct comments.
2017-06-09 05:53:45 -07:00
H.J. Lu
d2538b9156 x86-64: Optimize strrchr/wcsrchr with AVX2
Optimize strrchr/wcsrchr with AVX2 to check 32 bytes with vector
instructions.  It is as fast as SSE2 version for small data sizes
and up to 1X faster for large data sizes on Haswell.  Select AVX2
version on AVX2 machines where vzeroupper is preferred and AVX
unaligned load is fast.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strrchr-sse2, strrchr-avx2, wcsrchr-sse2 and wcsrchr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strrchr_avx2,
	__strrchr_sse2, __wcsrchr_avx2 and __wcsrchr_sse2.
	* sysdeps/x86_64/multiarch/strrchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strrchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr.c: Likewise.
2017-06-09 05:45:52 -07:00
H.J. Lu
5ac7aa1d7c x86-64: Optimize memrchr with AVX2
Optimize memrchr with AVX2 to search 32 bytes with a single vector
compare instruction.  It is as fast as SSE2 memrchr for small data
sizes and up to 1X faster for large data sizes on Haswell.  Select
AVX2 memrchr on AVX2 machines where vzeroupper is preferred and AVX
unaligned load is fast.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memrchr-sse2 and memrchr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __memrchr_avx2 and
	__memrchr_sse2.
	* sysdeps/x86_64/multiarch/memrchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/memrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memrchr.c: Likewise.
2017-06-09 05:44:41 -07:00
H.J. Lu
8fe57365bf x86-64: Optimize strchr/strchrnul/wcschr with AVX2
Optimize strchr/strchrnul/wcschr with AVX2 to search 32 bytes with vector
instructions.  It is as fast as SSE2 versions for size <= 16 bytes and up
to 1X faster for or size > 16 bytes on Haswell.  Select AVX2 version on
AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strchr-sse2, strchrnul-sse2, strchr-avx2, strchrnul-avx2,
	wcschr-sse2 and wcschr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strchr_avx2,
	__strchrnul_avx2, __strchrnul_sse2, __wcschr_avx2 and
	__wcschr_sse2.
	* sysdeps/x86_64/multiarch/strchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strchr.c: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul.c: Likewise.
	* sysdeps/x86_64/multiarch/wcschr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcschr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcschr.c: Likewise.
	* sysdeps/x86_64/multiarch/strchr.S: Removed.
2017-06-09 05:42:29 -07:00
H.J. Lu
dc485ceb2a x86-64: Optimize strlen/strnlen/wcslen/wcsnlen with AVX2
Optimize strlen/strnlen/wcslen/wcsnlen with AVX2 to check 32 bytes with
a single vector compare instruction.  It is as fast as SSE2 versions for
size <= 16 bytes and up to 1X faster for or size > 16 bytes on Haswell.
Select AVX2 version on AVX2 machines where vzeroupper is preferred and
AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strlen-sse2, strnlen-sse2, strlen-avx2, strnlen-avx2,
	wcslen-sse2, wcslen-avx2 and wcsnlen-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strlen_avx2,
	__strlen_sse2, __strnlen_avx2, __strnlen_sse2, __wcslen_avx2,
	__wcslen_sse2 and __wcsnlen_avx2.
	* sysdeps/x86_64/multiarch/strlen-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strlen-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strlen.c: Likewise.
	* sysdeps/x86_64/multiarch/strnlen-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/strnlen-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strnlen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcslen-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcslen-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcslen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen.c (OPTIMIZE (avx2)): New.
	(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX2 machines where
	vzeroupper is preferred and AVX unaligned load is fast.
2017-06-09 05:18:18 -07:00
H.J. Lu
2f5d20ac99 x86-64: Optimize memchr/rawmemchr/wmemchr with SSE2/AVX2
SSE2 memchr is extended to support wmemchr.  AVX2 memchr/rawmemchr/wmemchr
are added to search 32 bytes with a single vector compare instruction.
AVX2 memchr/rawmemchr/wmemchr are as fast as SSE2 memchr/rawmemchr/wmemchr
for small sizes and up to 1.5X faster for larger sizes on Haswell and
Skylake.  Select AVX2 memchr/rawmemchr/wmemchr on AVX2 machines where
vzeroupper is preferred and AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/memchr.S (MEMCHR): New.  Depending on if
	USE_AS_WMEMCHR is defined.
	(PCMPEQ): Likewise.
	(memchr): Renamed to ...
	(MEMCHR): This.  Support wmemchr if USE_AS_WMEMCHR is defined.
	Replace pcmpeqb with PCMPEQ.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memchr-sse2, rawmemchr-sse2, memchr-avx2, rawmemchr-avx2,
	wmemchr-sse4_1, wmemchr-avx2 and wmemchr-c.
	* sysdeps/x86_64/multiarch/ifunc-avx2.h: New file.
	* sysdeps/x86_64/multiarch/memchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memchr.c: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memchr_avx2, __memchr_sse2,
	__rawmemchr_avx2, __rawmemchr_sse2, __wmemchr_avx2 and
	__wmemchr_sse2.
2017-06-09 05:13:31 -07:00
H.J. Lu
5e1122827a x86-64: Rename wmemset.h to ifunc-wmemset.h
No code changes.

	* sysdeps/x86_64/multiarch/wmemset.c: Include ifunc-wmemset.h
	instead of wmemset.h.
	* sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemset.h: Renamed to ...
	* sysdeps/x86_64/multiarch/ifunc-wmemset.h: This.
2017-06-07 14:48:34 -07:00
H.J. Lu
2e87c7d158 x86-64: Fold ifunc-sse4_1.h into wcsnlen.c
Since ifunc-sse4_1.h is included only by wcsnlen.c, we can fold it
into wcsnlen.c.  No code changes in wcsnlen.o.

2017-06-07  H.J. Lu  <hongjiu.lu@intel.com>

	* sysdeps/x86_64/multiarch/ifunc-sse4_1.h: Removed and folded
	into ...
	* sysdeps/x86_64/multiarch/wcsnlen.c: Here.  Don't include
	ifunc-sse4_1.h.
2017-06-07 09:04:40 -07:00
H.J. Lu
d4cc385c6e x86-64: Move wcsnlen.S to multiarch/wcsnlen-sse4_1.S
Since wcsnlen.S uses pminud which is the part of SSE4.1, move wcsnlen.S
to multiarch/wcsnlen-sse4_1.S.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	wcsnlen-sse4_1 and wcsnlen-c.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __wcsnlen_sse4_1 and
	__wcsnlen_sse2.
	* sysdeps/x86_64/multiarch/ifunc-sse4_1.h: New file.
	* sysdeps/x86_64/multiarch/wcsnlen-c.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen.c: Likewise.
	* sysdeps/x86_64/wcsnlen.S: Removed.
2017-06-06 06:12:32 -07:00
Stefan Liebler
12d2dd7060 Optimize generic spinlock code and use C11 like atomic macros.
This patch optimizes the generic spinlock code.

The type pthread_spinlock_t is a typedef to volatile int on all archs.
Passing a volatile pointer to the atomic macros which are not mapped to the
C11 atomic builtins can lead to extra stores and loads to stack if such
a macro creates a temporary variable by using "__typeof (*(mem)) tmp;".
Thus, those macros which are used by spinlock code - atomic_exchange_acquire,
atomic_load_relaxed, atomic_compare_exchange_weak - have to be adjusted.
According to the comment from  Szabolcs Nagy, the type of a cast expression is
unqualified (see http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_423.htm):
__typeof ((__typeof (*(mem)) *(mem)) tmp;
Thus from spinlock perspective the variable tmp is of type int instead of
type volatile int.  This patch adjusts those macros in include/atomic.h.
With this construct GCC >= 5 omits the extra stores and loads.

The atomic macros are replaced by the C11 like atomic macros and thus
the code is aligned to it.  The pthread_spin_unlock implementation is now
using release memory order instead of sequentially consistent memory order.
The issue with passed volatile int pointers applies to the C11 like atomic
macros as well as the ones used before.

I've added a glibc_likely hint to the first atomic exchange in
pthread_spin_lock in order to return immediately to the caller if the lock is
free.  Without the hint, there is an additional jump if the lock is free.

I've added the atomic_spin_nop macro within the loop of plain reads.
The plain reads are also realized by C11 like atomic_load_relaxed macro.

The new define ATOMIC_EXCHANGE_USES_CAS determines if the first try to acquire
the spinlock in pthread_spin_lock or pthread_spin_trylock is an exchange
or a CAS.  This is defined in atomic-machine.h for all architectures.

The define SPIN_LOCK_READS_BETWEEN_CMPXCHG is now removed.
There is no technical reason for throwing in a CAS every now and then,
and so far we have no evidence that it can improve performance.
If that would be the case, we have to adjust other spin-waiting loops
elsewhere, too!  Using a CAS loop without plain reads is not a good idea
on many targets and wasn't used by one.  Thus there is now no option to
do so.

Architectures are now using the generic spinlock automatically if they
do not provide an own implementation.  Thus the pthread_spin_lock.c files
in sysdeps folder are deleted.

ChangeLog:

	* NEWS: Mention new spinlock implementation.
	* include/atomic.h:
	(__atomic_val_bysize): Cast type to omit volatile qualifier.
	(atomic_exchange_acq): Likewise.
	(atomic_load_relaxed): Likewise.
	(ATOMIC_EXCHANGE_USES_CAS): Check definition.
	* nptl/pthread_spin_init.c (pthread_spin_init):
	Use atomic_store_relaxed.
	* nptl/pthread_spin_lock.c (pthread_spin_lock):
	Use C11-like atomic macros.
	* nptl/pthread_spin_trylock.c (pthread_spin_trylock):
	Likewise.
	* nptl/pthread_spin_unlock.c (pthread_spin_unlock):
	Use atomic_store_release.
	* sysdeps/aarch64/nptl/pthread_spin_lock.c: Delete File.
	* sysdeps/arm/nptl/pthread_spin_lock.c: Likewise.
	* sysdeps/hppa/nptl/pthread_spin_lock.c: Likewise.
	* sysdeps/m68k/nptl/pthread_spin_lock.c: Likewise.
	* sysdeps/microblaze/nptl/pthread_spin_lock.c: Likewise.
	* sysdeps/mips/nptl/pthread_spin_lock.c: Likewise.
	* sysdeps/nios2/nptl/pthread_spin_lock.c: Likewise.
	* sysdeps/aarch64/atomic-machine.h (ATOMIC_EXCHANGE_USES_CAS): Define.
	* sysdeps/alpha/atomic-machine.h: Likewise.
	* sysdeps/arm/atomic-machine.h: Likewise.
	* sysdeps/i386/atomic-machine.h: Likewise.
	* sysdeps/ia64/atomic-machine.h: Likewise.
	* sysdeps/m68k/coldfire/atomic-machine.h: Likewise.
	* sysdeps/m68k/m680x0/m68020/atomic-machine.h: Likewise.
	* sysdeps/microblaze/atomic-machine.h: Likewise.
	* sysdeps/mips/atomic-machine.h: Likewise.
	* sysdeps/powerpc/powerpc32/atomic-machine.h: Likewise.
	* sysdeps/powerpc/powerpc64/atomic-machine.h: Likewise.
	* sysdeps/s390/atomic-machine.h: Likewise.
	* sysdeps/sparc/sparc32/atomic-machine.h: Likewise.
	* sysdeps/sparc/sparc32/sparcv9/atomic-machine.h: Likewise.
	* sysdeps/sparc/sparc64/atomic-machine.h: Likewise.
	* sysdeps/tile/tilegx/atomic-machine.h: Likewise.
	* sysdeps/tile/tilepro/atomic-machine.h: Likewise.
	* sysdeps/unix/sysv/linux/hppa/atomic-machine.h: Likewise.
	* sysdeps/unix/sysv/linux/m68k/coldfire/atomic-machine.h: Likewise.
	* sysdeps/unix/sysv/linux/nios2/atomic-machine.h: Likewise.
	* sysdeps/unix/sysv/linux/sh/atomic-machine.h: Likewise.
	* sysdeps/x86_64/atomic-machine.h: Likewise.
2017-06-06 09:41:56 +02:00
H.J. Lu
935971ba6b x86-64: Optimize memcmp/wmemcmp with AVX2 and MOVBE
Optimize x86-64 memcmp/wmemcmp with AVX2.  It uses vector compare as
much as possible.  It is as fast as SSE4 memcmp for size <= 16 bytes
and up to 2X faster for size > 16 bytes on Haswell and Skylake.  Select
AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

Key features:

1. For size from 2 to 7 bytes, load as big endian with movbe and bswap
   to avoid branches.
2. Use overlapping compare to avoid branch.
3. Use vector compare when size >= 4 bytes for memcmp or size >= 8
   bytes for wmemcmp.
4. If size is 8 * VEC_SIZE or less, unroll the loop.
5. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
8. Use 8 vector compares when size is 8 * VEC_SIZE or less.

	* sysdeps/x86/cpu-features.h (index_cpu_MOVBE): New.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memcmp-avx2 and wmemcmp-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memcmp_avx2 and __wmemcmp_avx2.
	* sysdeps/x86_64/multiarch/memcmp-avx2.S: New file.
	* sysdeps/x86_64/multiarch/wmemcmp-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Use __memcmp_avx2 on AVX
	2 machines if AVX unaligned load is fast and vzeroupper is
	preferred.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Use __wmemcmp_avx2 on AVX
	2 machines if AVX unaligned load is fast and vzeroupper is
	preferred.
2017-06-05 12:52:55 -07:00
H.J. Lu
ef9c4cb6c7 x86-64: Optimize wmemset with SSE2/AVX2/AVX512
The difference between memset and wmemset is byte vs int.  Add stubs
to SSE2/AVX2/AVX512 memset for wmemset with updated constant and size:

SSE2 wmemset:
	shl    $0x2,%rdx
	movd   %esi,%xmm0
	mov    %rdi,%rax
	pshufd $0x0,%xmm0,%xmm0
	jmp	entry_from_wmemset

SSE2 memset:
	movd   %esi,%xmm0
	mov    %rdi,%rax
	punpcklbw %xmm0,%xmm0
	punpcklwd %xmm0,%xmm0
	pshufd $0x0,%xmm0,%xmm0
entry_from_wmemset:

Since the ERMS versions of wmemset requires "rep stosl" instead of
"rep stosb", only the vector store stubs of SSE2/AVX2/AVX512 wmemset
are added.  The SSE2 wmemset is about 3X faster and the AVX2 wmemset
is about 6X faster on Haswell.

	* include/wchar.h (__wmemset_chk): New.
	* sysdeps/x86_64/memset.S (VDUP_TO_VEC0_AND_SET_RETURN): Renamed
	to MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.
	(WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New.
	(WMEMSET_CHK_SYMBOL): Likewise.
	(WMEMSET_SYMBOL): Likewise.
	(__wmemset): Add hidden definition.
	(wmemset): Add weak hidden definition.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	wmemset_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add __wmemset_sse2_unaligned,
	__wmemset_avx2_unaligned, __wmemset_avx512_unaligned,
	__wmemset_chk_sse2_unaligned, __wmemset_chk_avx2_unaligned
	and __wmemset_chk_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
	(VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ...
	(MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This.
	(WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New.
	(WMEMSET_SYMBOL): Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
	(VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ...
	(MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This.
	(WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New.
	(WMEMSET_SYMBOL): Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Updated.
	(WMEMSET_CHK_SYMBOL): New.
	(WMEMSET_CHK_SYMBOL (__wmemset_chk, unaligned)): Likewise.
	(WMEMSET_SYMBOL (__wmemset, unaligned)): Likewise.
	* sysdeps/x86_64/multiarch/memset.S (WMEMSET_SYMBOL): New.
	(libc_hidden_builtin_def): Also define __GI_wmemset and
	__GI___wmemset.
	(weak_alias): New.
	* sysdeps/x86_64/multiarch/wmemset.c: New file.
	* sysdeps/x86_64/multiarch/wmemset.h: Likewise.
	* sysdeps/x86_64/multiarch/wmemset_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise.
	* sysdeps/x86_64/wmemset.c: Likewise.
	* sysdeps/x86_64/wmemset_chk.c: Likewise.
2017-06-05 11:09:59 -07:00
H.J. Lu
30cb625a21 x86-64: Update strlen.S to support wcslen/wcsnlen
The difference between strlen and wcslen is byte vs int.  We can
replace pminub and pcmpeqb with pminud and pcmpeqd to turn strlen
into wcslen.

	* sysdeps/x86_64/strlen.S (PMINU): New.
	(PCMPEQ): Likewise.
	(SHIFT_RETURN): Likewise.
	(FIND_ZERO): Replace pcmpeqb with PCMPEQ.
	(strlen): Add SHIFT_RETURN before ret.  Replace pcmpeqb and
	pminub with PCMPEQ and PMINU.
	* sysdeps/x86_64/wcsnlen.S: New file.
2017-06-05 07:58:23 -07:00
H.J. Lu
7395928b95 x86_64: Remove redundant REX bytes from memrchr.S
By x86-64 specification, 32-bit destination registers are zero-extended
to 64 bits.  There is no need to use 64-bit registers when only the lower
32 bits are non-zero.  Also 2 instructions in:

	mov	%rdi, %rcx
	and	$15, %rcx
	jz	L(length_less16_offset0)

	mov	%rdi, %rcx		<<< redundant
	and	$15, %rcx		<<< redundant

are redundant.

	* sysdeps/x86_64/memrchr.S (__memrchr): Use 32-bit registers for
	the lower 32 bits.  Remove redundant instructions.
2017-06-05 07:41:26 -07:00
H.J. Lu
4f26ef1b67 x86_64: Remove redundant REX bytes from memchr.S
By x86-64 specification, 32-bit destination registers are zero-extended
to 64 bits.  There is no need to use 64-bit registers when only the lower
32 bits are non-zero.

	* sysdeps/x86_64/memchr.S (MEMCHR): Use 32-bit registers for
	the lower 32 bits.
2017-05-30 12:39:14 -07:00
H.J. Lu
1f655beb08 x86_64: Remove L(return_null) from rawmemchr.S
L(return_null) is unused.

	* sysdeps/x86_64/rawmemchr.S (L(return_null)): Removed.
2017-05-20 06:13:38 -07:00
H.J. Lu
402bf06952 x86: Optimize SSE2 memchr overflow calculation
SSE2 memchr computes "edx + ecx - 16" where ecx is less than 16.  Use
"edx - (16 - ecx)", instead of satured math, to avoid possible addition
overflow.  This replaces

	add	%ecx, %edx
	sbb	%eax, %eax
	or	%eax, %edx
	sub	$16, %edx

with

	neg	%ecx
	add	$16, %ecx
	sub	%ecx, %edx

It is the same for x86_64, except for rcx/rdx, instead of ecx/edx.

	* sysdeps/i386/i686/multiarch/memchr-sse2.S (MEMCHR): Use
	"edx + ecx - 16" to avoid possible addition overflow.
	* sysdeps/x86_64/memchr.S (memchr): Likewise.
2017-05-19 10:48:45 -07:00
H.J. Lu
a7fbedff76 Correct comments in x86_64/multiarch/memcmp.S
* sysdeps/x86_64/multiarch/memcmp.S (__GI_memcmp): Correct
	comments.
2017-05-18 14:02:02 -07:00
Zack Weinberg
7c3018f9e4 Suppress internal declarations for most of the testsuite.
This patch adds a new build module called 'testsuite'.
IS_IN (testsuite) implies _ISOMAC, as do IS_IN_build and __cplusplus
(which means several ad-hoc tests for __cplusplus can go away).
libc-symbols.h now suppresses almost all of *itself* when _ISOMAC is
defined; in particular, _ISOMAC mode does not get config.h
automatically anymore.

There are still quite a few tests that need to see internal gunk of
one variety or another.  For them, we now have 'tests-internal' and
'test-internal-extras'; files in this category will still be compiled
with MODULE_NAME=nonlib, and everything proceeds as it always has.
The bulk of this patch is moving tests from 'tests' to
'tests-internal'.  There is also 'tests-static-internal', which has
the same effect on files in 'tests-static', and 'modules-names-tests',
which has the *inverse* effect on files in 'modules-names' (it's
inverted because most of the things in modules-names are *not* tests).
For both of these, the file must appear in *both* the new variable and
the old one.

There is also now a special case for when libc-symbols.h is included
without MODULE_NAME being defined at all.  (This happens during the
creation of libc-modules.h, and also when preprocessing Versions
files.)  When this happens, IS_IN is set to be always false and
_ISOMAC is *not* defined, which was the status quo, but now it's
explicit.

The remaining changes to C source files in this patch seemed likely to
cause problems in the absence of the main change.  They should be
relatively self-explanatory.  In a few cases I duplicated a definition
from an internal header rather than move the test to tests-internal;
this was a judgement call each time and I'm happy to change those
however reviewers feel is more appropriate.

	* Makerules: New subdir configuration variables 'tests-internal'
	and 'test-internal-extras'.  Test files in these categories will
	still be compiled with MODULE_NAME=nonlib.  Test files in the
	existing categories (tests, xtests, test-srcs, test-extras) are
	now compiled with MODULE_NAME=testsuite.
	New subdir configuration variable 'modules-names-tests'.  Files
	which are in both 'modules-names' and 'modules-names-tests' will
	be compiled with MODULE_NAME=testsuite instead of
	MODULE_NAME=extramodules.
	(gen-as-const-headers): Move to tests-internal.
	(do-tests-clean, common-mostlyclean): Support tests-internal.
	* Makeconfig (built-modules): Add testsuite.
	* Makefile: Change libof-check-installed-headers-c and
	libof-check-installed-headers-cxx to 'testsuite'.
	* Rules: Likewise.  Support tests-internal.
	* benchtests/strcoll-inputs/filelist#en_US.UTF-8:
	Remove extra-modules.mk.

	* config.h.in: Don't check for __OPTIMIZE__ or __FAST_MATH__ here.
	* include/libc-symbols.h: Move definitions of _GNU_SOURCE,
	PASTE_NAME, PASTE_NAME1, IN_MODULE, IS_IN, and IS_IN_LIB to the
	very top of the file and rationalize their order.
	If MODULE_NAME is not defined at all, define IS_IN to always be
	false, and don't define _ISOMAC.
	If any of IS_IN (testsuite), IS_IN_build, or __cplusplus are
	true, define _ISOMAC and suppress everything else in this file,
	starting with the inclusion of config.h.
	Do check for inappropriate definitions of __OPTIMIZE__ and
	__FAST_MATH__ here, but only if _ISOMAC is not defined.
        Correct some out-of-date commentary.

	* include/math.h: If _ISOMAC is defined, undefine NO_LONG_DOUBLE
	and _Mlong_double_ before including math.h.
	* include/string.h: If _ISOMAC is defined, don't expose
	_STRING_ARCH_unaligned. Move a comment to a more appropriate
	location.

	* include/errno.h, include/stdio.h, include/stdlib.h, include/string.h
	* include/time.h, include/unistd.h, include/wchar.h: No need to
	check __cplusplus nor use __BEGIN_DECLS/__END_DECLS.

	* misc/sys/cdefs.h (__NTHNL): New macro.
	* sysdeps/m68k/m680x0/fpu/bits/mathinline.h
	(__m81_defun): Use __NTHNL to avoid errors with GCC 6.

	* elf/tst-env-setuid-tunables.c: Include config.h with _LIBC
	defined, for HAVE_TUNABLES.
	* inet/tst-checks-posix.c: No need to define _ISOMAC.
	* intl/tst-gettext2.c: Provide own definition of N_.
	* math/test-signgam-finite-c99.c: No need to define _ISOMAC.
	* math/test-signgam-main.c: No need to define _ISOMAC.
	* stdlib/tst-strtod.c: Convert to test-driver. Split locale_test to...
	* stdlib/tst-strtod1i.c: ...this new file.
	* stdlib/tst-strtod5.c: Convert to test-driver and add copyright notice.
        Split tests of __strtod_internal to...
	* stdlib/tst-strtod5i.c: ...this new file.
	* string/test-string.h: Include stdint.h. Duplicate definition of
	inhibit_loop_to_libcall here (from libc-symbols.h).
	* string/test-strstr.c: Provide dummy definition of
	libc_hidden_builtin_def when including strstr.c.
	* sysdeps/ia64/fpu/libm-symbols.h: Suppress entire file in _ISOMAC
	mode; no need to test __STRICT_ANSI__ nor __cplusplus as well.
	* sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h.
	Don't include init-arch.h.
	* sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h.
	Don't include init-arch.h.

	* elf/Makefile: Move tst-ptrguard1-static, tst-stackguard1-static,
	tst-tls1-static, tst-tls2-static, tst-tls3-static, loadtest,
	unload, unload2, circleload1, neededtest, neededtest2,
	neededtest3, neededtest4, tst-tls1, tst-tls2, tst-tls3,
	tst-tls6, tst-tls7, tst-tls8, tst-dlmopen2, tst-ptrguard1,
	tst-stackguard1, tst-_dl_addr_inside_object, and all of the
	ifunc tests to tests-internal.
	Don't add $(modules-names) to test-extras.
	* inet/Makefile: Move tst-inet6_scopeid_pton to tests-internal.
	Add tst-deadline to tests-static-internal.
	* malloc/Makefile: Move tst-mallocstate and tst-scratch_buffer to
	tests-internal.
	* misc/Makefile: Move tst-atomic and tst-atomic-long to tests-internal.
	* nptl/Makefile: Move tst-typesizes, tst-rwlock19, tst-sem11,
	tst-sem12, tst-sem13, tst-barrier5, tst-signal7, tst-tls3,
	tst-tls3-malloc, tst-tls5, tst-stackguard1, tst-sem11-static,
	tst-sem12-static, and tst-stackguard1-static to tests-internal.
        Link tests-internal with libpthread also.
	Don't add $(modules-names) to test-extras.
	* nss/Makefile: Move tst-field to tests-internal.
	* posix/Makefile: Move bug-regex5, bug-regex20, bug-regex33,
	tst-rfc3484, tst-rfc3484-2, and tst-rfc3484-3 to tests-internal.
	* stdlib/Makefile: Move tst-strtod1i, tst-strtod3, tst-strtod4,
	tst-strtod5i, tst-tls-atexit, and tst-tls-atexit-nodelete to
	tests-internal.
        * sunrpc/Makefile: Move tst-svc_register to tests-internal.
	* sysdeps/powerpc/Makefile: Move test-get_hwcap and
	test-get_hwcap-static to tests-internal.
	* sysdeps/unix/sysv/linux/Makefile: Move tst-setgetname to
	tests-internal.
	* sysdeps/x86_64/fpu/Makefile: Add all libmvec test modules to
	modules-names-tests.
2017-05-11 19:27:59 -04:00
H.J. Lu
1432d38ea0 x86: Set dl_platform and dl_hwcap from CPU features [BZ #21391]
dl_platform and dl_hwcap are set from AT_PLATFORM and AT_HWCAP very
early during startup.  They are used by dynamic linker to determine
platform and build an array of hardware capability names, which are
added to search path when loading shared object.  dl_platform and
dl_hwcap are unused on x86-64.  On i386, i386, i486, i586 and i686
platforms were supported and only SSE2 capability was used.

On x86, usage of AT_PLATFORM and AT_HWCAP to determine platform and
processor capabilities is obsolete since all information is available
in dl_x86_cpu_features.  This patch sets dl_platform and dl_hwcap from
dl_x86_cpu_features in dynamic linker.  On i386, the available plaforms
are changed to i586 and i686 since i386 has been deprecated.  On x86-64,
the available plaforms are haswell, which is for Haswell class processors
with BMI1, BMI2, LZCNT, MOVBE, POPCNT, AVX2 and FMA, and xeon_phi, which
is for Xeon Phi class processors with AVX512F, AVX512CD, AVX512ER and
AVX512PF.  A capability, avx512_1, is also added to x86-64 for AVX512
ISAs: AVX512F, AVX512CD, AVX512BW, AVX512DQ and AVX512VL.

	[BZ #21391]
	* sysdeps/i386/dl-machine.h (dl_platform_init) [IS_IN (rtld)]:
	Only call init_cpu_features.
	[!IS_IN (rtld)]: Only set GLRO(dl_platform) to NULL if needed.
	* sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise.
	* sysdeps/i386/dl-procinfo.h: Removed.
	* sysdeps/unix/sysv/linux/i386/dl-procinfo.h: Don't include
	<sysdeps/i386/dl-procinfo.h> nor <ldsodefs.h>.  Include
	<sysdeps/x86/dl-procinfo.h>.
	(_dl_procinfo): Replace _DL_HWCAP_COUNT with 32.
	* sysdeps/unix/sysv/linux/x86_64/dl-procinfo.h [!IS_IN (ldconfig)]:
	Include <sysdeps/x86/dl-procinfo.h> instead of
	 <sysdeps/generic/dl-procinfo.h>.
	* sysdeps/x86/cpu-features.c: Include <dl-hwcap.h>.
	(init_cpu_features): Set dl_platform, dl_hwcap and dl_hwcap_mask.
	* sysdeps/x86/cpu-features.h (bit_cpu_LZCNT): New.
	(bit_cpu_MOVBE): Likewise.
	(bit_cpu_BMI1): Likewise.
	(bit_cpu_BMI2): Likewise.
	(index_cpu_BMI1): Likewise.
	(index_cpu_BMI2): Likewise.
	(index_cpu_LZCNT): Likewise.
	(index_cpu_MOVBE): Likewise.
	(index_cpu_POPCNT): Likewise.
	(reg_BMI1): Likewise.
	(reg_BMI2): Likewise.
	(reg_LZCNT): Likewise.
	(reg_MOVBE): Likewise.
	(reg_POPCNT): Likewise.
	* sysdeps/x86/dl-hwcap.h: New file.
	* sysdeps/x86/dl-procinfo.h: Likewise.
	* sysdeps/x86/dl-procinfo.c (_dl_x86_hwcap_flags): New.
	(_dl_x86_platforms): Likewise.
2017-05-03 13:44:35 -07:00