Commit Graph

10141 Commits

Author SHA1 Message Date
Samuel Thibault
6f8222a1c5 Fix gprof timing
* sysdeps/mach/hurd/profil.c (__profile_frequency): Return tick
	frequency instead of tick length in us.
2016-04-19 23:27:27 +02:00
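Note on the commit above: the fix is a unit conversion between a tick length and a tick frequency. A minimal sketch of that conversion, assuming a tick length given in microseconds; the helper name `tick_length_us_to_hz` is hypothetical and does not appear in profil.c.

```c
#include <assert.h>

/* Hypothetical helper: convert a profiling tick length in
   microseconds into a tick frequency in ticks per second.  */
static int
tick_length_us_to_hz (int tick_us)
{
  assert (tick_us > 0);
  return 1000000 / tick_us;
}
```

With a 10000 us tick this gives 100 ticks per second, which is the kind of value a profiler expects from a frequency query.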
Samuel Thibault
593285ac15 hurd: fix profiling short-living processes
* sysdeps/mach/hurd/profil.c (update_waiter): Initialize
	profil_reply_port.
	(profile_waiter): Do not initialize profil_reply_port.
2016-04-19 00:54:24 +02:00
Carlos Eduardo Seo
1b045ee53e powerpc: Optimization for strlen for POWER8.
This implementation takes advantage of vectorization to improve performance of
the loop over the current strlen implementation for POWER7.
2016-04-15 17:19:19 -03:00
H.J. Lu
2e2d9796da Detect Intel Goldmont and Airmont processors
Updated from the model numbers of Goldmont and Airmont processors in
Intel64 And IA-32 Processor Architectures Software Developer's Manual
Volume 3 Revision 058.

	* sysdeps/x86/cpu-features.c (init_cpu_features): Detect Intel
	Goldmont and Airmont processors.
2016-04-15 05:23:06 -07:00
Adhemerval Zanella
41e77f36d4 Fix pread consolidation on ports that require argument alignment
This patch fixes the __ALIGNMENT_{ARG,COUNT} definitions for ports that
define __ASSUME_ALIGNED_REGISTER_PAIRS by including kernel-features.h
(where that macro is defined when applicable).

This was shown on arm with failing cases:

FAIL: debug/tst-chk1
FAIL: debug/tst-chk2
FAIL: debug/tst-chk3
FAIL: debug/tst-chk4
FAIL: debug/tst-chk5
FAIL: debug/tst-chk6
FAIL: debug/tst-lfschk1
FAIL: debug/tst-lfschk2
FAIL: debug/tst-lfschk3
FAIL: debug/tst-lfschk4
FAIL: debug/tst-lfschk5
FAIL: debug/tst-lfschk6
FAIL: posix/tst-preadwrite
FAIL: posix/tst-preadwrite64

This patch fixes them.  Tested on armhf.

	* sysdeps/unix/sysv/linux/sysdep.h: Include kernel-features.h.
2016-04-14 16:49:40 -03:00
Florian Weimer
ae9e94e744 malloc: Remove unused definitions of thread_atfork, thread_atfork_static
2016-04-14 09:17:36 +02:00
Florian Weimer
29d794863c malloc: Run fork handler as late as possible [BZ #19431]
Previously, a thread M invoking fork would acquire locks in this order:

  (M1) malloc arena locks (in the registered fork handler)
  (M2) libio list lock

A thread F invoking fflush (NULL) would acquire locks in this order:

  (F1) libio list lock
  (F2) individual _IO_FILE locks

A thread G running getdelim would use this order:

  (G1) _IO_FILE lock
  (G2) malloc arena lock

After executing (M1), (F1), (G1), none of the threads can make progress.

This commit changes the fork lock order to:

  (M'1) libio list lock
  (M'2) malloc arena locks

It explicitly encodes the lock order in the implementations of fork,
and does not rely on the registration order, thus avoiding the deadlock.
2016-04-14 09:17:02 +02:00
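Note on the commit above: the deadlock goes away once every thread takes the locks in one global order (list lock, then file lock, then arena lock). The following is a minimal sketch of that idea with plain pthread mutexes. The three mutexes and the `fork_like`, `flush_like` and `getdelim_like` workers are hypothetical stand-ins, not glibc internals.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { ITERATIONS = 1000 };

/* Stand-ins for the libio list lock, an _IO_FILE lock and an arena lock.
   The global order is: list_lock, then file_lock, then arena_lock.  */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t file_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t arena_lock = PTHREAD_MUTEX_INITIALIZER;

/* Repeatedly take OUTER, then INNER, and count the completed rounds.  */
static void
locked_pair (pthread_mutex_t *outer, pthread_mutex_t *inner, int *count)
{
  for (int i = 0; i < ITERATIONS; i++)
    {
      pthread_mutex_lock (outer);
      pthread_mutex_lock (inner);
      ++*count;
      pthread_mutex_unlock (inner);
      pthread_mutex_unlock (outer);
    }
}

/* Each worker respects the global order, so no cycle can form.  */
static void *
fork_like (void *count)
{
  locked_pair (&list_lock, &arena_lock, count);
  return NULL;
}

static void *
flush_like (void *count)
{
  locked_pair (&list_lock, &file_lock, count);
  return NULL;
}

static void *
getdelim_like (void *count)
{
  locked_pair (&file_lock, &arena_lock, count);
  return NULL;
}

/* Run the three workers concurrently and return the total rounds done.  */
static int
run_lock_order_demo (void)
{
  void *(*workers[3]) (void *) = { fork_like, flush_like, getdelim_like };
  pthread_t threads[3];
  int counts[3] = { 0, 0, 0 };

  for (int i = 0; i < 3; i++)
    {
      int rc = pthread_create (&threads[i], NULL, workers[i], &counts[i]);
      assert (rc == 0);
      (void) rc;
    }
  for (int i = 0; i < 3; i++)
    pthread_join (threads[i], NULL);
  return counts[0] + counts[1] + counts[2];
}
```

If `fork_like` instead took the arena lock before the list lock, the three workers could block each other exactly as in the (M1), (F1), (G1) scenario.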
Florian Weimer
b49ab5f450 Remove union wait [BZ #19613]
The overloading approach in the W* macros was incompatible with
integer expressions of a type different from int.  Applications
using union wait and these macros will have to migrate to the
POSIX-specified int status type.
2016-04-14 08:54:57 +02:00
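Note on the commit above: code that used union wait can migrate to the POSIX int status type. A minimal sketch of the POSIX style follows; the helper name `child_exit_status` is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with CODE and return the exit status seen by
   the parent, or -1 on failure.  */
static int
child_exit_status (int code)
{
  assert (code >= 0 && code <= 255);

  pid_t pid = fork ();
  if (pid < 0)
    return -1;
  if (pid == 0)
    _exit (code);

  int status;			/* POSIX int status, not union wait.  */
  if (waitpid (pid, &status, 0) != pid)
    return -1;
  return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}
```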
Andreas Schwab
b4bcb3aec6 Register extra test objects
This makes sure that the extra test objects are compiled with the correct
MODULE_NAME and dependencies are tracked.
2016-04-13 17:07:13 +02:00
H.J. Lu
a057f5f8cd X86-64: Use non-temporal store in memcpy on large data
The large memcpy micro benchmark in glibc shows a regression with
large data on Haswell machines.  Using non-temporal stores in memcpy
on large data can improve performance significantly.  This patch adds
a threshold for non-temporal stores, set to 6 times the shared cache
size.  When the size is above the threshold, non-temporal stores are
used, except when destination and source overlap, since the
destination may be in cache when the source is loaded.

For size below 8 vector register width, we load all data into registers
and store them together.  Only forward and backward loops, which move 4
vector registers at a time, are used to support overlapping addresses.
For forward loop, we load the last 4 vector register width of data and
the first vector register width of data into vector registers before the
loop and store them after the loop.  For backward loop, we load the first
4 vector register width of data and the last vector register width of
data into vector registers before the loop and store them after the loop.

	[BZ #19928]
	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
	New.
	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6
	times of shared cache size.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
	(VMOVNT): New.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
	(VMOVNT): Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
	(VMOVNT): Likewise.
	(VMOVU): Changed to movups for smaller code sizes.
	(VMOVA): Changed to movaps for smaller code sizes.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
	comments.
	(PREFETCH): New.
	(PREFETCH_SIZE): Likewise.
	(PREFETCHED_LOAD_SIZE): Likewise.
	(PREFETCH_ONE_SET): Likewise.
	Rewrite to use forward and backward loops, which move 4 vector
	registers at a time, to support overlapping addresses and use
	non temporal store if size is above the threshold and there is
	no overlap between destination and source.
2016-04-12 08:10:47 -07:00
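Note on the commit above: the decision it describes has two parts, a size threshold of 6 times the shared cache size and an overlap check. A minimal C sketch of that decision, assuming the cache size is passed in by the caller; all function names here are hypothetical, and the real logic lives in assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold above which non-temporal stores are considered.  */
static size_t
non_temporal_threshold (size_t shared_cache_size)
{
  assert (shared_cache_size <= SIZE_MAX / 6);
  return shared_cache_size * 6;
}

/* Return true if the N-byte regions at DST and SRC share any byte.  */
static bool
regions_overlap (const void *dst, const void *src, size_t n)
{
  uintptr_t d = (uintptr_t) dst;
  uintptr_t s = (uintptr_t) src;
  return d < s ? s - d < n : d - s < n;
}

/* Use non-temporal stores only for large, non-overlapping copies.  */
static bool
use_non_temporal_store (const void *dst, const void *src, size_t n,
			size_t shared_cache_size)
{
  return n > non_temporal_threshold (shared_cache_size)
	 && !regions_overlap (dst, src, n);
}
```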
Matthew Fortune
b39d84adff VDSO support for MIPS
This patch adds support for using the implementations of gettimeofday()
and clock_gettime() provided by the kernel in the VDSO. The VDSO will
always provide clock_gettime() as CLOCK_{REALTIME,MONOTONIC}_COARSE can
be implemented regardless of platform. CLOCK_{REALTIME,MONOTONIC}, along
with gettimeofday(), are only implemented on platforms which make use of
either the CP0 count or GIC as their clocksource. On other platforms,
the VDSO does not provide the __vdso_gettimeofday symbol, as it is
never useful.

The VDSO functions return ENOSYS when they encounter an unsupported
request, in which case glibc should fall back to the standard syscall.

Tested with upstream kernel 4.5 and QEMU emulating Malta.

./vdsotest gettimeofday bench
gettimeofday: syscall: 1021 nsec/call
gettimeofday:    libc: 262 nsec/call
gettimeofday:    vdso: 174 nsec/call

	* sysdeps/unix/sysv/linux/mips/Makefile (sysdep_routines):
	Include dl-vdso.
	* sysdeps/unix/sysv/linux/mips/Versions: Add
	__vdso_clock_gettime.
	* sysdeps/unix/sysv/linux/mips/init-first.c: New file.
	* sysdeps/unix/sysv/linux/mips/libc-vdso.h: New file.
	* sysdeps/unix/sysv/linux/mips/mips32/sysdep.h:
	(INTERNAL_VSYSCALL_CALL): Define to be compatible with MIPS
	definitions of INTERNAL_SYSCALL_{ERROR_P,ERRNO}.
	(HAVE_CLOCK_GETTIME_VSYSCALL): Define.
	(HAVE_GETTIMEOFDAY_VSYSCALL): Define.
	* sysdeps/unix/sysv/linux/mips/mips64/n32/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/n64/sysdep.h: Likewise.
2016-04-12 11:05:13 +01:00
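Note on the commit above: the fallback it describes (try the VDSO function, use the syscall on ENOSYS) can be sketched in C as below. This is an illustration for Linux only. `unsupported_vdso_clock_gettime` is a hypothetical stand-in for a VDSO entry point, and the sketch assumes the kernel convention of returning a negative errno value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

typedef int (*clock_gettime_fn) (clockid_t, struct timespec *);

/* Hypothetical stand-in for a VDSO function that cannot serve the
   request and reports that with -ENOSYS.  */
static int
unsupported_vdso_clock_gettime (clockid_t id, struct timespec *ts)
{
  (void) id;
  (void) ts;
  return -ENOSYS;
}

/* Try VDSO first; on -ENOSYS (or if VDSO is NULL) use the syscall.  */
static int
clock_gettime_with_fallback (clock_gettime_fn vdso, clockid_t id,
			     struct timespec *ts)
{
  assert (ts != NULL);
  if (vdso != NULL)
    {
      int r = vdso (id, ts);
      if (r == 0)
	return 0;
      if (r != -ENOSYS)
	{
	  errno = -r;
	  return -1;
	}
    }
  return (int) syscall (SYS_clock_gettime, id, ts);
}
```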
Adhemerval Zanella
071af4769f Consolidate pwrite/pwrite64 implementations
This patch consolidates all the pwrite/pwrite64 implementation for Linux
in only one (sysdeps/unix/sysv/linux/pwrite{64}.c).  It also removes the
syscall from the auto-generation using assembly macros.

For pwrite{64} offset argument placement the new SYSCALL_LL{64} macro
is used.  Ports that do not define __NR_pwrite use __NR_pwrite64 for
pwrite, and ports that do not define __NR_pwrite64 use __NR_pwrite for
pwrite64.

Checked on x86_64, x32, i386, aarch64, and ppc64le.

	* sysdeps/unix/sysv/linux/arm/pwrite.c: Remove file.
	* sysdeps/unix/sysv/linux/arm/pwrite64.c: Likewise.
	* sysdeps/unix/sysv/linux/generic/wordsize-32/pwrite.c: Likewise.
	* sysdeps/unix/sysv/linux/generic/wordsize-32/pwrite64.c: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/pwrite.c: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/pwrite64.c: Likewise.
	* sysdeps/unix/sysv/linux/wordsize-64/pwrite64.c: Likewise.
	* sysdeps/unix/sysv/linux/wordsize-64/syscalls.list (pwrite): Remove
	syscall generation.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/sysdep.h
	[__NR_pwrite64] (__NR_pwrite): Remove define.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/sysdep.h
	[__NR_pwrite64] (__NR_pwrite): Remove define.
	* sysdeps/unix/sysv/linux/pwrite.c [__NR_pwrite64] (__NR_pwrite):
	Remove define.
	(__libc_pwrite): Use SYSCALL_LL macro on offset argument.
	* sysdeps/unix/sysv/linux/pwrite64.c [__NR_pwrite64] (__NR_pwrite):
	Remove define.
	(__libc_pwrite64): Use SYSCALL_LL64 macro on offset argument.
	* sysdeps/unix/sysv/linux/sh/pwrite.c: Rewrite using default
	Linux implementation as base.
	* sysdeps/unix/sysv/linux/sh/pwrite64.c: Likewise.
	* sysdeps/unix/sysv/linux/mips/pwrite.c: Likewise.
	* sysdeps/unix/sysv/linux/mips/pwrite64.c: Likewise.
2016-04-11 10:08:01 -03:00
Adhemerval Zanella
77a4fbd536 Consolidate pread/pread64 implementations
This patch consolidates all the pread/pread64 implementation for Linux
in only one (sysdeps/unix/sysv/linux/pread.c).  It also removes the
syscall from the auto-generation using assembly macros.

For pread{64} offset argument placement the new SYSCALL_LL{64} macro
is used.  Ports that do not define __NR_pread use __NR_pread64 for
pread, and ports that do not define __NR_pread64 use __NR_pread for
pread64.

Checked on x86_64, x32, i386, aarch64, and ppc64le.

	* sysdeps/unix/sysv/linux/arm/pread.c: Remove file.
	* sysdeps/unix/sysv/linux/arm/pread64.c: Likewise.
	* sysdeps/unix/sysv/linux/generic/wordsize-32/pread.c: Likewise.
	* sysdeps/unix/sysv/linux/generic/wordsize-32/pread64.c: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/pread.c: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/pread64.c: Likewise.
	* sysdeps/unix/sysv/linux/wordsize-64/pread64.c: Likewise.
	* sysdeps/unix/sysv/linux/wordsize-64/syscalls.list (pread): Remove
	syscall generation.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/sysdep.h
	[__NR_pread64] (__NR_pread): Remove define.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/sysdep.h:
	[__NR_pread64] (__NR_pread): Likewise.
	* sysdeps/unix/sysv/linux/pread.c [__NR_pread64] (__NR_pread): Remove
	define.
	(__libc_pread): Use SYSCALL_LL macro on offset argument.
	* sysdeps/unix/sysv/linux/pread64.c [__NR_pread64] (__NR_pread):
	Remove define.
	(__libc_pread64): Use SYSCALL_LL64 macro on offset argument.
	* sysdeps/unix/sysv/linux/sh/pread.c: Rewrite using default
	Linux implementation as base.
	* sysdeps/unix/sysv/linux/sh/pread64.c: Likewise.
	* sysdeps/unix/sysv/linux/mips/pread.c: Likewise.
	* sysdeps/unix/sysv/linux/mips/pread64.c: Likewise.
2016-04-11 10:08:01 -03:00
Adhemerval Zanella
eeddfa91cb Consolidate off_t/off64_t syscall argument passing
This patch adds three new macros (SYSCALL_LL, SYSCALL_LL64, and
__ASSUME_WORDSIZE64_ILP32) to use along with off_t and off64_t argument
syscalls.  The rationale for this change is:

1. Remove multiple implementations for the same syscall for different
   architectures (for instance, pread has 6 different implementations).

2. Also remove the requirement to use syscall wrappers for cancellable
   entrypoints.

The macros should be used along with __ALIGNMENT_ARG to follow ABI constraints
for architectures where it applies.  For instance, pread can be rewritten as:

  return SYSCALL_CANCEL (pread, fd, buf, count,
                         __ALIGNMENT_ARG SYSCALL_LL (offset));

Another macro, SYSCALL_LL64, is provided for off64_t.  The macro
__ASSUME_WORDSIZE64_ILP32 is defined by an ABI to indicate that it uses 64-bit
registers even though the ABI is ILP32 (for instance, x32 and mips64-n32).

The changes themselves are not currently used in any implementation, so no
code change is expected.

	* sysdeps/unix/sysv/linux/generic/sysdep.h (__ALIGNMENT_ARG): Move
	definition.
	(__ALIGNMENT_COUNT): Likewise.
	* sysdeps/unix/sysv/linux/sysdep.h (__ALIGNMENT_ARG): To here.
	(__ALIGNMENT_COUNT): Likewise.
	(SYSCALL_LL): New define.
	(SYSCALL_LL64): Likewise.
	* sysdeps/unix/sysv/linux/mips/kernel-features.h:
	[_MIPS_SIM == _ABIO32] (__ASSUME_WORDSIZE64_ILP32): Define.
	* sysdeps/unix/sysv/linux/x86_64/kernel-features.h:
	[ILP32] (__ASSUME_WORDSIZE64_ILP32): Likewise.
2016-04-11 10:07:53 -03:00
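Note on the commit above: on 32-bit ABIs a 64-bit file offset does not fit in one register and is passed as two 32-bit words. The sketch below illustrates that splitting in plain C. It is an assumption about the general technique, not the actual definition of SYSCALL_LL or SYSCALL_LL64; the type `reg_pair` and the function `split_offset64` are hypothetical. It also assumes the low word goes first on little-endian targets and the high word first on big-endian ones.

```c
#include <assert.h>
#include <stdint.h>

/* Two 32-bit argument words in the order they would be passed.  */
struct reg_pair
{
  uint32_t first;
  uint32_t second;
};

/* Split OFFSET into two 32-bit words.  BIG_ENDIAN_ORDER selects whether
   the high word (nonzero) or the low word (zero) comes first.  */
static struct reg_pair
split_offset64 (int64_t offset, int big_endian_order)
{
  uint32_t high = (uint32_t) ((uint64_t) offset >> 32);
  uint32_t low = (uint32_t) ((uint64_t) offset & 0xffffffffu);
  struct reg_pair pair;

  pair.first = big_endian_order ? high : low;
  pair.second = big_endian_order ? low : high;
  return pair;
}
```

On ABIs that need an even register pair, a padding argument (the role of __ALIGNMENT_ARG in the message above) would precede these two words.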
Adhemerval Zanella
482b2f87a8 Define __ASSUME_ALIGNED_REGISTER_PAIRS for missing ports
This patch defines __ASSUME_ALIGNED_REGISTER_PAIRS for the missing
ports that require 64-bit value (e.g., long long) to be aligned to
an even register pair in argument passing.

No code change is expected, tested with builds for powerpc32,
mips-o32, and armhf.

	* sysdeps/unix/sysv/linux/arm/kernel-features.h
	(__ASSUME_ALIGNED_REGISTER_PAIRS): Define.
	* sysdeps/unix/sysv/linux/mips/kernel-features.h
	[_MIPS_SIM == _ABIO32] (__ASSUME_ALIGNED_REGISTER_PAIRS): Likewise.
	* sysdeps/unix/sysv/linux/powerpc/kernel-features.h
	[!__powerpc64__] (__ASSUME_ALIGNED_REGISTER_PAIRS): Likewise.
2016-04-11 09:15:11 -03:00
Samuel Thibault
e1ef505659 Fix build with HAVE_AUX_VECTOR
* sysdeps/unix/sysv/linux/ldsodefs.h (HAVE_AUX_VECTOR): Define before
	including <ldsodefs.h>.
	* sysdeps/nacl/ldsodefs.h (HAVE_AUX_VECTOR): Likewise.
2016-04-11 10:27:25 +02:00
Samuel Thibault
0cdc5e930a Fix crash on getauxval call without HAVE_AUX_VECTOR
* sysdeps/generic/ldsodefs.h (struct rtld_global_ro)
	[!HAVE_AUX_VECTOR]: Do not define _dl_auxv field.
	* misc/getauxval.c (__getauxval) [!HAVE_AUX_VECTOR]: Do not go through
	GLRO(dl_auxv) list.
2016-04-10 23:58:43 +02:00
Khem Raj
1a5d01e79e When disabling SSE, make sure -fpmath is not set to use SSE either
This fixes errors when SSE options are injected through CFLAGS.  Now
that -Werror is turned on by default, the following warning turns into
an error on x86:

$ gcc -m32 -march=core2 -mtune=core2 -msse3 -mfpmath=sse -x c /dev/null -S -mno-sse -mno-mmx
/dev/null:1:0: warning: SSE instruction set disabled, using 387 arithmetics

Whereas:

$ gcc -m32 -march=core2 -mtune=core2 -msse3 -mfpmath=sse -x c /dev/null -S -mno-sse -mno-mmx -mfpmath=387

Generates no warnings.
2016-04-09 22:14:24 -04:00
Mike Frysinger
b2d4456b33 configure: fix test == usage
POSIX defines the = operator, but not ==.  Fix the few places where we
incorrectly used ==.
2016-04-09 20:05:13 -04:00
Joseph Myers
fb3227b95c Fix termios.h XCASE namespace (bug 19925).
bits/termios.h (various versions under sysdeps/unix/sysv/linux)
defines XCASE if defined __USE_MISC || defined __USE_XOPEN.  This
macro was removed in the 2001 edition of POSIX, and is not otherwise
reserved, so should not be defined for 2001 and later versions of
POSIX.  This patch fixes the conditions accordingly (leaving the macro
defined for __USE_MISC, so still in the default namespace).

Tested for x86_64 and x86 (testsuite, and that installed shared
libraries are unchanged by the patch).

	[BZ #19925]
	* sysdeps/unix/sysv/linux/alpha/bits/termios.h (XCASE): Do not
	define if [!__USE_MISC && __USE_XOPEN2K].
	* sysdeps/unix/sysv/linux/bits/termios.h (XCASE): Likewise.
	* sysdeps/unix/sysv/linux/mips/bits/termios.h (XCASE): Likewise.
	* sysdeps/unix/sysv/linux/powerpc/bits/termios.h (XCASE):
	Likewise.
	* sysdeps/unix/sysv/linux/sparc/bits/termios.h (XCASE): Likewise.
	* conform/Makefile (test-xfail-XOPEN2K/termios.h/conform): Remove
	variable.
	(test-xfail-XOPEN2K8/termios.h/conform): Likewise.
2016-04-08 18:16:09 +00:00
Paul E. Murphy
25dba0ad05 powerpc: Add optimized P8 strspn
This utilizes vectors and bitmasks.  For a small needle and a large
haystack, the performance improvement is up to 8x.  For short
strings (0-4B), the cost of computing the bitmask dominates,
and the new code is slightly slower.
2016-04-07 15:51:28 -05:00
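Note on the commit above: the bitmask idea can be shown in scalar C. This sketch builds a 256-bit membership mask from the accept set and then scans the string against it. It is not the POWER8 vector code; the name `strspn_bitmask` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the initial segment of S made only of bytes in ACCEPT.  */
static size_t
strspn_bitmask (const char *s, const char *accept)
{
  /* One bit per possible byte value.  Bit 0 (NUL) is never set, so the
     scan below always stops at the terminator of S.  */
  uint64_t mask[4] = { 0, 0, 0, 0 };
  const unsigned char *a = (const unsigned char *) accept;
  const unsigned char *p = (const unsigned char *) s;

  for (; *a != '\0'; a++)
    mask[*a >> 6] |= UINT64_C (1) << (*a & 63);

  while ((mask[*p >> 6] >> (*p & 63)) & 1)
    p++;
  return (size_t) (p - (const unsigned char *) s);
}
```

Building the mask costs a pass over the accept set before any byte of the string is examined, which is consistent with the remark that very short strings do not benefit.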
H.J. Lu
a7d1c51482 X86-64: Prepare memmove-vec-unaligned-erms.S
Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
default memcpy, mempcpy and memmove.

	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(MEMCPY_SYMBOL): New.
	(MEMPCPY_SYMBOL): Likewise.
	(MEMMOVE_CHK_SYMBOL): Likewise.
	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
	Provide alias for memcpy in libc.a and ld.so.
2016-04-06 10:19:16 -07:00
H.J. Lu
4af1bb06c5 X86-64: Prepare memset-vec-unaligned-erms.S
Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
default memset.

	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
	Disabled for now.
	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
	symbols.  Properly check USE_MULTIARCH on __memset symbols.
2016-04-06 09:10:35 -07:00
H.J. Lu
ec0cac9a1f Force 32-bit displacement in memset-vec-unaligned-erms.S
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
	32-bit displacement to avoid long nop between instructions.
2016-04-05 05:21:19 -07:00
H.J. Lu
696ac77484 Add a comment in memset-sse2-unaligned-erms.S
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
	a comment on VMOVU and VMOVA.
2016-04-05 05:19:18 -07:00
H.J. Lu
5cd7af016d Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
and AVX512 memmove and memset in ld.so.

	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
	if not in libc.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
2016-04-03 14:35:38 -07:00
H.J. Lu
ea2785e96f Fix memmove-vec-unaligned-erms.S
__mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
and __memmove, since that breaks __memmove_chk.

Don't check source == destination first since it is less common.

	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
	with unaligned_erms.
	(__memmove_erms): Skip if source == destination.
	(__memmove_unaligned_erms): Don't check source == destination
	first.
2016-04-03 12:38:25 -07:00
H.J. Lu
27d3ce1467 Remove Fast_Copy_Backward from Intel Core processors
Intel Core i3, i5 and i7 processors have fast unaligned copy and
copy backward is ignored.  Remove Fast_Copy_Backward from Intel Core
processors to avoid confusion.

	* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
	bit_arch_Fast_Copy_Backward for Intel Core processors.
2016-04-01 15:09:14 -07:00
Adhemerval Zanella
528ffb3a04 Remove powerpc64 strspn, strcspn, and strpbrk implementation
This patch removes the powerpc64 optimized strspn, strcspn, and
strpbrk assembly implementation now that the default C one
implements the same strategy.  On internal glibc benchtests, the
current implementations show similar performance with -O2.

Tested on powerpc64le (POWER8).

	* sysdeps/powerpc/powerpc64/strcspn.S: Remove file.
	* sysdeps/powerpc/powerpc64/strpbrk.S: Remove file.
	* sysdeps/powerpc/powerpc64/strspn.S: Remove file.
2016-04-01 10:44:45 -03:00
Wilco Dijkstra
d3496c9f4f Improve generic strcspn performance
Improve strcspn performance using a much faster algorithm.  It is kept simple
so it works well on most targets.  It is generally at least 10 times faster
than the existing implementation on bench-strcspn on a few AArch64
implementations, and for some tests 100 times as fast (repeatedly calling
strchr on a small string is extremely slow...).

In fact the string/bits/string2.h inlines no longer make sense, as GCC
already uses strlen if reject is an empty string, strchrnul is 5 times as
fast as __strcspn_c1, while __strcspn_c2 and __strcspn_c3 are slower than
the strcspn main loop for large strings (though reject length 2-4 could be
special cased in the future to gain even more performance).

Tested on x86_64, i686, and aarch64.

	* string/Version (libc): Add GLIBC_2.24.
	* string/strcspn.c (strcspn): Rewrite function.
	* string/bits/string2.h (strcspn): Use __builtin_strcspn.
	(__strcspn_c1): Remove inline function.
	(__strcspn_c2): Likewise.
	(__strcspn_c3): Likewise.
	* string/string-inline.c
	[SHLIB_COMPAT(libc, GLIBC_2_1_1, GLIBC_2_24)] (__strcspn_c1): Add
	compatibility symbol.
	[SHLIB_COMPAT(libc, GLIBC_2_1_1, GLIBC_2_24)] (__strcspn_c2):
	Likewise.
	[SHLIB_COMPAT(libc, GLIBC_2_1_1, GLIBC_2_24)] (__strcspn_c3):
	Likewise.
	* sysdeps/i386/string-inlines.c: Include generic string-inlines.c.
2016-04-01 10:44:40 -03:00
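Note on the commit above: the message does not spell out the new algorithm. One common way to avoid calling strchr for every byte is a 256-entry lookup table built once from the reject set. The sketch below shows that approach as an assumption about the general technique; the name `strcspn_table` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the initial segment of S with no byte from REJECT.  */
static size_t
strcspn_table (const char *s, const char *reject)
{
  unsigned char table[256];
  const unsigned char *r = (const unsigned char *) reject;
  const unsigned char *p = (const unsigned char *) s;

  memset (table, 0, sizeof table);
  table[0] = 1;			/* NUL always ends the scan.  */
  for (; *r != '\0'; r++)
    table[*r] = 1;

  /* One table lookup per byte instead of one strchr call per byte.  */
  while (!table[*p])
    p++;
  return (size_t) (p - (const unsigned char *) s);
}
```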
Stefan Liebler
d8a012c5c9 S390: Use ahi instead of aghi in 32bit _dl_runtime_resolve.
This patch uses ahi instead of aghi in 32bit _dl_runtime_resolve
to adjust the stack pointer. This is no functional change,
but a cosmetic one.

ChangeLog:

	* sysdeps/s390/s390-32/dl-trampoline.h (_dl_runtime_resolve):
	Use ahi instead of aghi to adjust stack pointer.
2016-04-01 10:42:54 +02:00
Paul E. Murphy
37a4c70bd4 Increase internal precision of ldbl-128ibm decimal printf [BZ #19853]
When the signs differ, the precision of the conversion sometimes
drops below 106 bits.  This strategy is identical to the
hexadecimal variant.

I've refactored tst-sprintf3 to enable testing a value with more
than 30 significant digits in order to demonstrate this failure
and its solution.

Additionally, this implicitly fixes a typo in the shift
quantities when subtracting from the high mantissa to compute
the difference.
2016-03-31 12:14:33 -05:00
H.J. Lu
830566307f Add x86-64 memset with unaligned store and rep stosb
Implement x86-64 memset with unaligned store and rep stosb.  Support
16-byte, 32-byte and 64-byte vector register sizes.  A single file
provides 2 implementations of memset, one with rep stosb and the other
without rep stosb.  They share the same code when size is between 2
times the vector register size and REP_STOSB_THRESHOLD, which defaults
to 2KB.

Key features:

1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector
register size at a time.

	[BZ #19881]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
	memset-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
	__memset_sse2_unaligned_erms, __memset_erms,
	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
	file.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
	Likewise.
2016-03-31 10:06:07 -07:00
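Note on the commit above: "overlapping store to avoid branch" means that any size between one and two vector widths is covered by two fixed-width stores, one at the start and one ending at the last byte. A minimal scalar sketch, assuming a 16-byte vector width; `set_vec_to_2vec` is a hypothetical name and the memcpy calls stand in for vector stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { VEC = 16 };

/* Fill N bytes at DST with C, for VEC <= N <= 2 * VEC, using two
   VEC-byte stores that overlap whenever N < 2 * VEC.  */
static void
set_vec_to_2vec (unsigned char *dst, int c, size_t n)
{
  unsigned char v[VEC];

  assert (n >= (size_t) VEC && n <= (size_t) (2 * VEC));
  memset (v, c, VEC);		/* Stand-in for broadcasting C into a register.  */
  memcpy (dst, v, VEC);		/* Store the first VEC bytes.  */
  memcpy (dst + n - VEC, v, VEC);	/* Store the last VEC bytes.  */
}
```

No branch on the exact value of N is needed inside the range, because writing the overlapping bytes twice is harmless.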
H.J. Lu
88b57b8ed4 Add x86-64 memmove with unaligned load/store and rep movsb
Implement x86-64 memmove with unaligned load/store and rep movsb.
Support 16-byte, 32-byte and 64-byte vector register sizes.  When
size <= 8 times of vector register size, there is no check for
address overlap between source and destination.  Since overhead for
overlap check is small when size > 8 times of vector register size,
memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb
and the other without rep movsb.  They share the same code when size is
between 2 times the vector register size and REP_MOVSB_THRESHOLD, which
is 2KB for 16-byte vector register size and scaled up for larger vector
register sizes.

Key features:

1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load all sources into
registers and store them together.
3. If there is no address overlap between source and destination, copy
from both ends with 4 times of vector register size at a time.
4. If address of destination > address of source, backward copy 8 times
of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
by falling back to backward copy 8 times of vector register size at a
time.
7. Skip when address of destination == address of source.

	[BZ #19776]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
	memmove-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test
	__memmove_chk_avx512_unaligned_2,
	__memmove_chk_avx512_unaligned_erms,
	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
	__memmove_chk_sse2_unaligned_2,
	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
	__memmove_avx512_unaligned_erms, __memmove_erms,
	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
	__memcpy_chk_avx512_unaligned_2,
	__memcpy_chk_avx512_unaligned_erms,
	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
	__mempcpy_chk_avx512_unaligned_erms,
	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
	__mempcpy_erms.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
	file.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
	Likewise.
2016-03-31 10:04:40 -07:00
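Note on the commit above: loading all source bytes into registers before storing any of them makes a small copy safe for overlapping buffers without an overlap check. A minimal scalar sketch with 8-byte words in place of vector registers; `move_8_to_16` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Move N bytes from SRC to DST, for 8 <= N <= 16.  The buffers may
   overlap: both loads complete before the first store.  */
static void
move_8_to_16 (void *dst, const void *src, size_t n)
{
  uint64_t head, tail;

  assert (n >= 8 && n <= 16);
  /* Load the first and the last 8 bytes; they overlap when N < 16.  */
  memcpy (&head, src, 8);
  memcpy (&tail, (const unsigned char *) src + n - 8, 8);
  /* Store both words only after everything has been loaded.  */
  memcpy (dst, &head, 8);
  memcpy ((unsigned char *) dst + n - 8, &tail, 8);
}
```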
Stefan Liebler
5cdd1989d1 S390: Extend structs La_s390_regs / La_s390_retval with vector-registers.
Starting with z13, vector registers can also occur as argument registers.
Thus the passed input/output register structs for
la_s390_[32|64]_gnu_plt[enter|exit] functions should reflect those new
registers. This patch extends these structs La_s390_regs and La_s390_retval
and adjusts _dl_runtime_profile() to handle those fields in case of
running on a z13 machine.

ChangeLog:

	* sysdeps/s390/bits/link.h: (La_s390_vr) New typedef.
	(La_s390_32_regs): Append vector register lr_v24-lr_v31.
	(La_s390_64_regs): Likewise.
	(La_s390_32_retval): Append vector register lrv_v24.
	(La_s390_64_retval): Likewise.
	* sysdeps/s390/s390-32/dl-trampoline.h (_dl_runtime_profile):
	Handle extended structs La_s390_32_regs and La_s390_32_retval.
	* sysdeps/s390/s390-64/dl-trampoline.h (_dl_runtime_profile):
	Handle extended structs La_s390_64_regs and La_s390_64_retval.
2016-03-31 17:37:16 +02:00
Stefan Liebler
4603c51ef7 S390: Save and restore fprs/vrs while resolving symbols.
On s390, no fpr/vrs were saved while resolving a symbol
via _dl_runtime_resolve/_dl_runtime_profile.

According to the abi, the fpr-arguments are defined as call clobbered.
In leaf-functions, gcc 4.9 and newer can use fprs for saving/restoring gprs
instead of saving them to the stack.
If gcc does this in one of the resolver-functions, then the floating point
arguments of a library-function are invalid for the first library-function-call.
Thus, this patch saves/restores the fprs around the resolving code.

The same could occur for vector registers. Furthermore an ifunc-resolver
could also clobber the vector/floating point argument registers.
Thus this patch provides the further variants _dl_runtime_resolve_vx/
_dl_runtime_profile_vx, which are used if the kernel claims, that
we run on a machine with vector registers.

Furthermore, if _dl_runtime_profile calls _dl_call_pltexit,
the pointers to inregs-/outregs-structs were set up incorrectly.
Now they point to the correct location in the stack-frame.
Before branching back to the caller, the return values are now
restored instead of containing the return values of the
_dl_call_pltexit() call.
On s390-32, an endless loop occurs if _dl_call_pltexit() should be called.
Now, this code-path branches to this function instead of just after the
preceding basr-instruction.

ChangeLog:

	* sysdeps/s390/s390-32/dl-trampoline.S: Include dl-trampoline.h twice
	to create a non-vector/vector version for _dl_runtime_resolve and
	_dl_runtime_profile. Move implementation to ...
	* sysdeps/s390/s390-32/dl-trampoline.h: ... here.
	(_dl_runtime_resolve) Save and restore fpr/vrs.
	(_dl_runtime_profile) Save and restore vrs and fix some issues
	if _dl_call_pltexit is called.
	* sysdeps/s390/s390-32/dl-machine.h (elf_machine_runtime_setup):
	Choose the correct resolver function if running on a machine with vx.
	* sysdeps/s390/s390-64/dl-trampoline.S: Include dl-trampoline.h twice
	to create a non-vector/vector version for _dl_runtime_resolve and
	_dl_runtime_profile. Move implementation to ...
	* sysdeps/s390/s390-64/dl-trampoline.h: ... here.
	(_dl_runtime_resolve) Save and restore fpr/vrs.
	(_dl_runtime_profile) Save and restore vrs and fix some issues
	if _dl_call_pltexit is called.
	* sysdeps/s390/s390-64/dl-machine.h: (elf_machine_runtime_setup):
	Choose the correct resolver function if running on a machine with vx.
2016-03-31 17:37:16 +02:00
Joseph Myers
258ec8abc1 [microblaze] Remove __ASSUME_FUTIMESAT.
MicroBlaze has a special version of futimesat.c because it gained the
futimesat syscall later than other non-asm-generic architectures.  Now
the minimum kernel is recent enough that this syscall can always be
assumed to be present for MicroBlaze, so this patch removes the
special version and the __ASSUME_FUTIMESAT macro, resulting in the
sysdeps/unix/sysv/linux/futimesat.c version being used.

Untested.

	* sysdeps/unix/sysv/linux/microblaze/kernel-features.h
	(__ASSUME_FUTIMESAT): Remove macro.
	* sysdeps/unix/sysv/linux/microblaze/futimesat.c: Remove file.
2016-03-29 22:13:36 +00:00
H.J. Lu
0791f91dff Initial Enhanced REP MOVSB/STOSB (ERMS) support
The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
has a feature bit in CPUID.  This patch adds the Enhanced REP MOVSB/STOSB
(ERMS) bit to x86 cpu-features.

	* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
	(index_cpu_ERMS): Likewise.
	(reg_ERMS): Likewise.
2016-03-28 19:23:31 -07:00
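Note on the commit above: the ERMS feature bit can be queried from user code as well. The sketch below assumes GCC or Clang on x86 and uses `__get_cpuid_count` from `<cpuid.h>`; ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9. The function name `cpu_has_erms` is hypothetical and unrelated to the glibc macros listed above.

```c
#include <assert.h>
#if defined (__x86_64__) || defined (__i386__)
# include <cpuid.h>
#endif

/* Return 1 if the CPU reports ERMS, 0 if it does not, and -1 if this
   cannot be determined (non-x86 build or CPUID leaf 7 unavailable).  */
static int
cpu_has_erms (void)
{
#if defined (__x86_64__) || defined (__i386__)
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return -1;
  return (int) ((ebx >> 9) & 1);
#else
  return -1;
#endif
}
```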
Aurelien Jarno
9ff9351d02 Synchronize <sys/personality.h> with kernel headers
<sys/personality.h> is out of sync with kernel headers, missing the
UNAME26, FDPIC_FUNCPTRS and PER_LINUX_FDPIC entries. Fix that.

Changelog:
	* sysdeps/unix/sysv/linux/sys/personality.h (UNAME26, FDPIC_FUNCPTRS,
	PER_LINUX_FDPIC): Add.
2016-03-28 22:42:52 +02:00
H.J. Lu
064f01b10b Make __memcpy_avx512_no_vzeroupper an alias
Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
__memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
to reduce code size of libc.so.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memcpy-avx512-no-vzeroupper.
	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
	to ...
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
	(MEMCPY): Don't define.
	(MEMCPY_CHK): Likewise.
	(MEMPCPY): Likewise.
	(MEMPCPY_CHK): Likewise.
	(MEMPCPY_CHK): Renamed to ...
	(__mempcpy_chk_avx512_no_vzeroupper): This.
	(MEMCPY_CHK): Renamed to ...
	(__memmove_chk_avx512_no_vzeroupper): This.
	(MEMCPY): Renamed to ...
	(__memmove_avx512_no_vzeroupper): This.
	(__memcpy_avx512_no_vzeroupper): New alias.
	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
2016-03-28 13:16:22 -07:00
H.J. Lu
c365e615f7 Implement x86-64 multiarch mempcpy in memcpy
Implement x86-64 multiarch mempcpy in memcpy to share most of the code.
This reduces the code size of libc.so.

	[BZ #18858]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
	and mempcpy-avx512-no-vzeroupper.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
	New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
	(MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
2016-03-28 13:13:51 -07:00
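The idea of sharing one copy body between memcpy and mempcpy can be sketched in C as follows. This is a simplified illustration, not the assembly from the commit: the two entry points differ only in the value they return (memcpy returns the destination, mempcpy returns the byte just past the last one written). All function names here are hypothetical.

```c
#include <stddef.h>

/* Shared copy body.  RETURN_END selects which pointer is handed
   back: the start of DST (memcpy) or DST + N (mempcpy).  */
static void *
sketch_copy_core (void *dst, const void *src, size_t n, int return_end)
{
  unsigned char *d = dst;
  const unsigned char *s = src;

  for (size_t i = 0; i < n; i++)
    d[i] = s[i];

  return return_end ? (void *) (d + n) : dst;
}

/* memcpy-style entry point: returns DST.  */
static void *
sketch_memcpy (void *dst, const void *src, size_t n)
{
  return sketch_copy_core (dst, src, n, 0);
}

/* mempcpy-style entry point: returns DST + N.  */
static void *
sketch_mempcpy (void *dst, const void *src, size_t n)
{
  return sketch_copy_core (dst, src, n, 1);
}
```

Because only the return value differs, a separate mempcpy source file per variant duplicates the whole copy loop for no benefit, which is the code-size saving the commit message refers to.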
H.J. Lu
e41b395523 [x86] Add a feature bit: Fast_Unaligned_Copy
On AMD processors, memcpy optimized with unaligned SSE load is
slower than memcpy optimized with aligned SSSE3 while other string
functions are faster with unaligned SSE load.  A feature bit,
Fast_Unaligned_Copy, is added to select memcpy optimized with
unaligned SSE load.

	[BZ #19583]
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
	processors.  Set Fast_Copy_Backward for AMD Excavator
	processors.
	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
	New.
	(index_arch_Fast_Unaligned_Copy): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
2016-03-28 04:40:03 -07:00
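A rough sketch of the selection logic this change describes, written in C rather than the assembly selector in memcpy.S. It is deliberately simplified (the AVX variants are omitted), and the flag values, enum names and function name are hypothetical; only the feature names Fast_Unaligned_Load, Fast_Unaligned_Copy and Fast_Copy_Backward come from the commit message.

```c
/* Hypothetical flag values standing in for the arch feature bits.  */
enum
{
  SKETCH_FAST_UNALIGNED_LOAD = 1u << 0,
  SKETCH_FAST_UNALIGNED_COPY = 1u << 1,
  SKETCH_FAST_COPY_BACKWARD  = 1u << 2
};

/* Hypothetical labels for the memcpy variants being chosen between.  */
enum sketch_memcpy_variant
{
  SKETCH_SSE2_UNALIGNED,
  SKETCH_SSE2,
  SKETCH_SSSE3_BACK,
  SKETCH_SSSE3
};

/* Pick a memcpy variant.  The key point of the commit: the unaligned
   SSE copy is gated on Fast_Unaligned_Copy, not Fast_Unaligned_Load,
   so a CPU can use unaligned loads in other string functions without
   also getting the unaligned memcpy.  */
static enum sketch_memcpy_variant
sketch_select_memcpy (unsigned int arch_bits, int has_ssse3)
{
  if (arch_bits & SKETCH_FAST_UNALIGNED_COPY)
    return SKETCH_SSE2_UNALIGNED;
  if (!has_ssse3)
    return SKETCH_SSE2;
  if (arch_bits & SKETCH_FAST_COPY_BACKWARD)
    return SKETCH_SSSE3_BACK;
  return SKETCH_SSSE3;
}
```

Under this sketch, a processor with Fast_Unaligned_Load but without Fast_Unaligned_Copy falls through to an SSSE3 variant, which matches the behaviour the commit message wants for AMD processors.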
Florian Weimer
f327f5b47b tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860]
	[BZ #19860]
	* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return
	zero if the compiler does not provide the AVX512F bit.
2016-03-25 11:11:42 +01:00
Joseph Myers
c898991d8b Fix x86_64 / x86 powl inaccuracy for integer exponents (bug 19848).
Bug 19848 reports cases where powl on x86 / x86_64 has error
accumulation, for small integer exponents, larger than permitted by
glibc's accuracy goals, at least in some rounding modes.  This patch
further restricts the exponent range for which the
small-integer-exponent logic is used to limit the possible error
accumulation.

Tested for x86_64 and x86 and ulps updated accordingly.

	[BZ #19848]
	* sysdeps/i386/fpu/e_powl.S (p3): Rename to p2 and change value
	from 8 to 4.
	(__ieee754_powl): Compare integer exponent against 4 not 8.
	* sysdeps/x86_64/fpu/e_powl.S (p3): Rename to p2 and change value
	from 8 to 4.
	(__ieee754_powl): Compare integer exponent against 4 not 8.
	* math/auto-libm-test-in: Add more tests of pow.
	* math/auto-libm-test-out: Regenerated.
	* sysdeps/i386/i686/fpu/multiarch/libm-test-ulps: Update.
	* sysdeps/x86_64/fpu/libm-test-ulps: Likewise.
2016-03-24 01:32:52 +00:00
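To illustrate what a restricted small-integer-exponent path looks like, here is a C sketch, not the x87 assembly from the commit. It handles an exponent by repeated multiplication only when it is an integer of small magnitude and otherwise tells the caller to use the general algorithm. The bound of 4 comes from the ChangeLog; treating it as exclusive, and the names SKETCH_INT_EXP_LIMIT and sketch_small_int_powl, are assumptions made for this example.

```c
/* Assumed bound on |y| for the small-integer path (4 per the
   ChangeLog; whether the real comparison is strict is an assumption).  */
#define SKETCH_INT_EXP_LIMIT 4.0L

/* If Y is an integer with |Y| < SKETCH_INT_EXP_LIMIT, store X**Y in
   *RESULT and return 1.  Otherwise return 0 so the caller can fall
   back to the general pow algorithm.  NaN exponents fail the range
   check and also return 0.  */
static int
sketch_small_int_powl (long double x, long double y, long double *result)
{
  if (!(y > -SKETCH_INT_EXP_LIMIT && y < SKETCH_INT_EXP_LIMIT))
    return 0;

  int i = (int) y;              /* Safe: |y| < 4.  */
  if ((long double) i != y)
    return 0;                   /* Not an integer.  */

  unsigned int n = i < 0 ? (unsigned int) -i : (unsigned int) i;
  long double r = 1.0L;
  long double b = x;

  /* Square-and-multiply.  Each multiplication can add rounding
     error, so a smaller limit means fewer chances to accumulate it.  */
  while (n != 0)
    {
      if (n & 1)
        r *= b;
      n >>= 1;
      if (n != 0)
        b *= b;
    }

  *result = i < 0 ? 1.0L / r : r;
  return 1;
}
```

With the limit at 4, at most two multiplications (plus one division for negative exponents) contribute rounding error on this path; larger integer exponents go through the general, more carefully rounded computation.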
Aurelien Jarno
7e1ff08c26 Assume __NR_utimensat is always defined
With the 2.6.32 minimum kernel on x86 and 3.2 on other architectures,
__NR_utimensat is always defined.

Changelog:
	* sysdeps/unix/sysv/linux/futimens.c (futimens) [__NR_utimensat]:
	Make code unconditional.
	[!__NR_utimensat]: Remove conditional code.
	* sysdeps/unix/sysv/linux/lutimes.c (lutimes) [__NR_utimensat]:
	Make code unconditional.
	[!__NR_utimensat]: Remove conditional code.
	* sysdeps/unix/sysv/linux/utimensat.c (utimensat) [__NR_utimensat]:
	Make code unconditional.
	[!__NR_utimensat]: Remove conditional code.
2016-03-23 23:35:08 +01:00
Aurelien Jarno
16d94f67e5 Assume __NR_openat is always defined
With the 2.6.32 minimum kernel on x86 and 3.2 on other architectures,
__NR_openat is always defined.

Changelog:
	* sysdeps/unix/sysv/linux/dl-openat64.c (openat64) [__NR_openat]:
	Make code unconditional.
2016-03-23 23:35:08 +01:00
Nick Alcock
7a25d6a84d x86, pthread_cond_*wait: Do not depend on %eax not being clobbered
The x86-specific versions of both pthread_cond_wait and
pthread_cond_timedwait have (in their fall-back-to-futex-wait slow
paths) calls to __pthread_mutex_cond_lock_adjust followed by
__pthread_mutex_unlock_usercnt, which load the parameters before the
first call but then assume that the first parameter, in %eax, will
survive unaffected.  This happens to have been true before now, but %eax
is a call-clobbered register, and this assumption is not safe: it could
change at any time, at GCC's whim, and indeed the stack-protector canary
checking code clobbers %eax while checking that the canary is
uncorrupted.

So reload %eax before calling __pthread_mutex_unlock_usercnt.  (Do this
unconditionally, even when stack-protection is not in use, because it's
the right thing to do, it's a slow path, and anything else is dicing
with death.)

	* sysdeps/unix/sysv/linux/i386/pthread_cond_timedwait.S: Reload
	call-clobbered %eax on retry path.
	* sysdeps/unix/sysv/linux/i386/pthread_cond_wait.S: Likewise.
2016-03-23 13:40:14 +01:00
H.J. Lu
3c9a4cd16c Don't set %rcx twice before "rep movsb"
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
	Don't set %rcx twice before "rep movsb".
2016-03-22 08:36:16 -07:00
H.J. Lu
f781a9e961 Set index_arch_AVX_Fast_Unaligned_Load only for Intel processors
Since only Intel processors with AVX2 have fast unaligned load, we
should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.

Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
and call get_common_indeces for other processors.

Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading
GLRO(dl_x86_cpu_features) in cpu-features.c.

	[BZ #19583]
	* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
	inline.  Check family before setting family, model and
	extended_model.  Set AVX, AVX2, AVX512, FMA and FMA4 usable
	bits here.
	(init_cpu_features): Replace HAS_CPU_FEATURE and
	HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
	CPU_FEATURES_ARCH_P.  Set index_arch_AVX_Fast_Unaligned_Load
	for Intel processors with usable AVX2.  Call get_common_indeces
	for other processors with family == NULL.
	* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
	(CPU_FEATURES_ARCH_P): Likewise.
	(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
	(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
2016-03-22 07:47:20 -07:00
Joseph Myers
37ad347359 Remove __ASSUME_GETDENTS64_SYSCALL.
This patch removes the __ASSUME_GETDENTS64_SYSCALL macro, as its
definition is constant given the new kernel version requirements (and
was constant anyway before those requirements except for MIPS n32).

Note that the "#ifdef __NR_getdents64" conditional *is* still needed,
because MIPS n64 only has the getdents syscall (being a 64-bit ABI,
that syscall is 64-bit; the difference between the two on 64-bit
architectures is where d_type goes).  If MIPS n64 were to gain the
getdents64 syscall and we wanted to use it conditionally on the kernel
version at runtime we'd have to revert this patch, but I think that's
unlikely (and in any case, we could follow the simpler approach of
undefining __NR_getdents64 if the syscall can't be assumed, just like
we do for accept4 / recvmmsg / sendmmsg syscalls on architectures
where socketcall support came first).

Most of the getdents.c changes are reindentation.

Tested for x86_64 and x86 that installed stripped shared libraries are
unchanged by the patch.

	* sysdeps/unix/sysv/linux/kernel-features.h
	(__ASSUME_GETDENTS64_SYSCALL): Remove macro.
	* sysdeps/unix/sysv/linux/getdents.c
	[!__ASSUME_GETDENTS64_SYSCALL]: Remove conditional code.
	[!have_no_getdents64_defined]: Likewise.
	(__GETDENTS): Remove __have_no_getdents64 conditional.
2016-03-22 00:32:20 +00:00
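The point about keeping the "#ifdef __NR_getdents64" conditional can be sketched as below. This is a Linux-only illustration using the generic syscall () wrapper, not the code in getdents.c; the function names are hypothetical. The syscall number macro decides at compile time which call is made, with no runtime fallback flag.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Read raw directory entries from FD into BUF.  The choice between
   the two syscalls is made purely at compile time, depending on
   whether the target ABI defines __NR_getdents64.  */
static long
sketch_read_dirents (int fd, void *buf, size_t len)
{
#ifdef __NR_getdents64
  return syscall (__NR_getdents64, fd, buf, len);
#else
  /* ABIs that only provide getdents (already 64-bit there).  */
  return syscall (__NR_getdents, fd, buf, len);
#endif
}

/* Convenience helper: return the number of bytes of directory
   entries produced by the first read of PATH, or -1 on error.  */
static long
sketch_first_dirents_bytes (const char *path)
{
  char buf[4096];
  int fd = open (path, O_RDONLY | O_DIRECTORY);
  if (fd < 0)
    return -1;

  long n = sketch_read_dirents (fd, buf, sizeof buf);
  close (fd);
  return n;
}
```

Calling sketch_first_dirents_bytes (".") should return a positive byte count on a typical Linux system, since a directory normally contains at least the "." and ".." entries.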