Commit Graph

1016 Commits

Andrew Senkevich
df2258c6cb Added tests to ensure linkage through libmvec *_finite aliases which are
defined in libmvec_nonshared.a (bug 19654).

    [BZ #19654]
    * sysdeps/x86_64/fpu/Makefile: Added new tests.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx-main.c: New.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx-mod.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx2-main.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx2-mod.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx2.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx512-main.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx512-mod.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx512.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-main.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias-mod.c: Likewise.
    * sysdeps/x86_64/fpu/test-double-libmvec-alias.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx-main.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx-mod.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx2-main.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx2-mod.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx2.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx512-main.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx512-mod.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx512.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-main.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias-mod.c: Likewise.
    * sysdeps/x86_64/fpu/test-float-libmvec-alias.c: Likewise.
    * sysdeps/x86_64/fpu/test-libmvec-alias-mod.c: Likewise.
2016-06-20 21:15:50 +03:00
Florian Weimer
aca1daef29 elf: Consolidate machine-agnostic DTV definitions in <dl-dtv.h>
Identical definitions of dtv_t and TLS_DTV_UNALLOCATED were
repeated for all architectures using DTVs.
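
For reference, the consolidated machine-agnostic definitions look roughly like
the following (a hedged sketch of <dl-dtv.h> based on the long-standing glibc
layout; treat the field names as an approximation):

    #include <stdbool.h>
    #include <stddef.h>

    /* Per-module entry of the dynamic thread vector (DTV).  */
    typedef union dtv
    {
      size_t counter;            /* Generation counter, used in dtv[0].  */
      struct
      {
        void *val;               /* Address of the module's TLS block.  */
        bool is_static;          /* True if allocated in static TLS.  */
      } pointer;
    } dtv_t;

    /* Marker stored in pointer.val before the TLS block is allocated.  */
    #define TLS_DTV_UNALLOCATED ((void *) -1l)
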
2016-06-20 14:31:40 +02:00
Joseph Myers
f4015c8a86 Use generic fdim on more architectures (bug 6796, bug 20255, bug 20256).
Some architectures have their own versions of fdim functions, which
are missing errno setting (bug 6796) and may also return sNaN instead
of qNaN for sNaN input, in the case of the x86 / x86_64 long double
versions (bug 20256).

These versions are not actually doing anything that a compiler
couldn't generate, just straightforward comparisons / arithmetic (and,
in the x86 / x86_64 case, testing for NaNs with fxam, which isn't
actually needed once you use an unordered comparison and let the NaNs
pass through the same subtraction as non-NaN inputs).  This patch
removes the x86 / x86_64 / powerpc versions, so that those
architectures use the generic C versions, which correctly handle
setting errno and deal properly with sNaN inputs.  This seems better
than dealing with setting errno in lots of .S versions.

The i386 versions also return results with excess range and precision,
which is not appropriate for a function exactly defined by reference
to IEEE operations.  For errno setting to work correctly on overflow,
it's necessary to remove excess range with math_narrow_eval, which
this patch duly does in the float and double versions so that the
tests can reliably pass on x86.  For float, this avoids any double
rounding issues as the long double precision is more than twice that
of float.  For double, double rounding issues will need to be
addressed separately, so this patch does not fully fix bug 20255.

Tested for x86_64, x86 and powerpc.
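
A hedged sketch of what the generic C version does, per the description above
(math_narrow_eval is approximated here with a volatile store):

    #include <errno.h>
    #include <math.h>

    double
    fdim_sketch (double x, double y)
    {
      /* Unordered comparison: NaN inputs fall through to the subtraction,
         which quiets sNaNs and raises "invalid".  */
      if (islessequal (x, y))
        return 0.0;

      /* Stand-in for math_narrow_eval: force the result out of any
         excess-precision register so overflow is observable.  */
      volatile double r = x - y;

      /* Overflow on finite inputs must set errno.  */
      if (isinf (r) && !isinf (x) && !isinf (y))
        errno = ERANGE;
      return r;
    }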

	[BZ #6796]
	[BZ #20255]
	[BZ #20256]
	* math/s_fdim.c: Include <math_private.h>.
	(__fdim): Use math_narrow_eval on result.
	* math/s_fdimf.c: Include <math_private.h>.
	(__fdimf): Use math_narrow_eval on result.
	* sysdeps/i386/fpu/s_fdim.S: Remove file.
	* sysdeps/i386/fpu/s_fdimf.S: Likewise.
	* sysdeps/i386/fpu/s_fdiml.S: Likewise.
	* sysdeps/i386/i686/fpu/s_fdim.S: Likewise.
	* sysdeps/i386/i686/fpu/s_fdimf.S: Likewise.
	* sysdeps/i386/i686/fpu/s_fdiml.S: Likewise.
	* sysdeps/powerpc/fpu/s_fdim.c: Likewise.
	* sysdeps/powerpc/fpu/s_fdimf.c: Likewise.
	* sysdeps/powerpc/powerpc32/fpu/s_fdim.c: Likewise.
	* sysdeps/powerpc/powerpc64/fpu/s_fdim.c: Likewise.
	* sysdeps/x86_64/fpu/s_fdiml.S: Likewise.
	* math/libm-test.inc (fdim_test_data): Expect errno setting on
	overflow.  Add sNaN tests.
2016-06-14 16:04:19 +00:00
Joseph Myers
f00faa4a43 Fix i386/x86_64 log2l (sNaN) (bug 20235).
The i386/x86_64 versions of log2l return sNaN for sNaN input.  This
patch fixes them to add NaN inputs to themselves so that qNaN is
returned in this case.

Tested for x86_64 and x86.
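
The fix is the usual quieting idiom; a minimal C illustration of the effect
(not the assembly change itself):

    #include <math.h>

    /* Adding a NaN to itself turns a signaling NaN into a quiet NaN and
       raises the "invalid" exception, which is the behavior required of
       functions like log2l for sNaN arguments.  */
    long double
    quiet_if_nan (long double x)
    {
      if (isnan (x))
        return x + x;   /* sNaN -> qNaN (+ FE_INVALID); qNaN unchanged.  */
      return x;
    }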

	[BZ #20235]
	* sysdeps/i386/fpu/e_log2l.S (__ieee754_log2l): Add NaN input to
	itself.
	* sysdeps/x86_64/fpu/e_log2l.S (__ieee754_log2l): Likewise.
	* math/libm-test.inc (log2_test_data): Add sNaN tests.
2016-06-09 18:04:30 +00:00
H.J. Lu
ac187dc4ab Always indirect branch to __libc_start_main via GOT
Since __libc_start_main in libc.so is called very early, lazy binding
isn't relevant.  Always call __libc_start_main with an indirect branch via
the GOT to avoid an extra branch to the PLT slot.  In the case of a static
executable, ld in binutils 2.26 or above can convert the indirect branch
into a direct branch:

0000000000400a80 <_start>:
  400a80:       31 ed                   xor    %ebp,%ebp
  400a82:       49 89 d1                mov    %rdx,%r9
  400a85:       5e                      pop    %rsi
  400a86:       48 89 e2                mov    %rsp,%rdx
  400a89:       48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
  400a8d:       50                      push   %rax
  400a8e:       54                      push   %rsp
  400a8f:       49 c7 c0 20 1b 40 00    mov    $0x401b20,%r8
  400a96:       48 c7 c1 90 1a 40 00    mov    $0x401a90,%rcx
  400a9d:       48 c7 c7 c0 03 40 00    mov    $0x4003c0,%rdi
  400aa4:       67 e8 96 09 00 00       addr32 callq 401440 <__libc_start_main>
  400aaa:       f4                      hlt

	* sysdeps/x86_64/start.S (_start): Always indirect branch to
	__libc_start_main via GOT.
2016-06-09 04:43:31 -07:00
H.J. Lu
75437079e4 X86-64: Add dummy memcopy.h and wordcopy.c
Since x86-64 no longer uses the generic word-copy functions, add dummy
memcopy.h and wordcopy.c to reduce code size.  This reduces the size of
libc.so by about 1 KB.

	* sysdeps/x86_64/memcopy.h: New file.
	* sysdeps/x86_64/wordcopy.c: Likewise.
2016-06-09 04:38:34 -07:00
Joseph Myers
8c010e2f71 Fix i386/x86_64 log1pl (sNaN) (bug 20229).
The i386/x86_64 versions of log1pl return sNaN for sNaN input.  This
patch fixes them to add a NaN input to itself so that qNaN is returned
in this case.

Tested for x86_64 and x86.

	[BZ #20229]
	* sysdeps/i386/fpu/s_log1pl.S (__log1pl): Add NaN input to itself.
	* sysdeps/x86_64/fpu/s_log1pl.S (__log1pl): Likewise.
	* math/libm-test.inc (log1p_test_data): Add sNaN tests.
2016-06-08 23:11:42 +00:00
Joseph Myers
09096b3615 Fix i386/x86_64 log10l (sNaN) (bug 20228).
The i386/x86_64 versions of log10l return sNaN for sNaN input.  This
patch fixes them to add a NaN input to itself so that qNaN is returned
in this case.

Tested for x86_64 and x86.

	[BZ #20228]
	* sysdeps/i386/fpu/e_log10l.S (__ieee754_log10l): Add NaN input to
	itself.
	* sysdeps/x86_64/fpu/e_log10l.S (__ieee754_log10l): Likewise.
	* math/libm-test.inc (log10_test_data): Add sNaN tests.
2016-06-08 22:59:18 +00:00
Joseph Myers
df179d8808 Fix i386/x86_64 logl (sNaN) (bug 20227).
The i386/x86_64 versions of logl return sNaN for sNaN input.  This
patch fixes them to add a NaN input to itself so that qNaN is returned
in this case.

Tested for x86_64 and x86 (including a build for i586 to cover the
non-i686 logl version).

	[BZ #20227]
	* sysdeps/i386/fpu/e_logl.S (__ieee754_logl): Add NaN input to
	itself.
	* sysdeps/i386/i686/fpu/e_logl.S (__ieee754_logl): Likewise.
	* sysdeps/x86_64/fpu/e_logl.S (__ieee754_logl): Likewise.
	* math/libm-test.inc (log_test_data): Add sNaN tests.
2016-06-08 22:24:06 +00:00
Joseph Myers
9bd3ef8e19 Fix i386/x86_64 expl, exp10l, expm1l for sNaN input (bug 20226).
The i386 and x86_64 implementations of expl, exp10l and expm1l (code
shared between the functions) return sNaN for sNaN input.  This patch
fixes them to add NaN inputs to themselves so that qNaN is returned in
this case.

Tested for x86_64 and x86.

	[BZ #20226]
	* sysdeps/i386/fpu/e_expl.S (IEEE754_EXPL): Add NaN argument to
	itself.
	* sysdeps/x86_64/fpu/e_expl.S (IEEE754_EXPL): Likewise.
	* math/libm-test.inc (exp_test_data): Add sNaN tests.
	(exp10_test_data): Likewise.
	(expm1_test_data): Likewise.
2016-06-08 21:55:06 +00:00
H.J. Lu
c867597bff X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
the new ones.

No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
memcpy/memmove optimized with Enhanced REP MOVSB will be used for
processors with ERMS.  The new AVX512 memcpy/memmove will be used for
processors with AVX512 which prefer vzeroupper.

Since the new SSE2 memcpy/memmove are faster than the previous default
memcpy/memmove used in libc.a and ld.so, we also remove the previous
default memcpy/memmove and make the new SSE2 versions the default, except
that non-temporal stores aren't used in ld.so.

Together, these changes reduce the size of libc.so by about 6 KB and the
size of ld.so by about 2 KB.
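
The ERMS-based selection described in the ChangeLog below can be pictured as a
decision function like this (a hypothetical, simplified sketch; the AVX512
variants are omitted and the real resolver reads CPU feature bits):

    #include <stdio.h>

    /* Return the name of the memcpy variant an IFUNC resolver would pick.  */
    static const char *
    select_memcpy_variant (int avx_fast_unaligned_load, int has_erms)
    {
      if (avx_fast_unaligned_load)
        return has_erms ? "__memcpy_avx_unaligned_erms"
                        : "__memcpy_avx_unaligned";
      return has_erms ? "__memcpy_sse2_unaligned_erms"
                      : "__memcpy_sse2_unaligned";
    }

    int
    main (void)
    {
      /* On an ERMS-capable CPU without AVX fast unaligned load: */
      puts (select_memcpy_variant (0, 1));  /* __memcpy_sse2_unaligned_erms */
      return 0;
    }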

	[BZ #19776]
	* sysdeps/x86_64/memcpy.S: Make it dummy.
	* sysdeps/x86_64/mempcpy.S: Likewise.
	* sysdeps/x86_64/memmove.S: New file.
	* sysdeps/x86_64/memmove_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
	* sysdeps/x86_64/memmove.c: Removed.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memcpy-sse2-unaligned, memmove-avx-unaligned,
	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Replace
	__memmove_chk_avx512_unaligned_2 with
	__memmove_chk_avx512_unaligned.  Remove
	__memmove_chk_avx_unaligned_2.  Replace
	__memmove_chk_sse2_unaligned_2 with
	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
	with __memmove_avx512_unaligned.  Replace
	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
	with __memcpy_chk_avx512_unaligned.  Remove
	__memcpy_chk_avx_unaligned_2.  Replace
	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
	Replace __memcpy_avx512_unaligned_2 with
	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
	with __mempcpy_chk_avx512_unaligned.  Remove
	__mempcpy_chk_avx_unaligned_2.  Replace
	__mempcpy_chk_sse2_unaligned_2 with
	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
	Replace __mempcpy_avx512_unaligned_2 with
	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
	Replace __mempcpy_sse2_unaligned_2 with
	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
	(ENTRY): Removed.
	(END): Likewise.
	(ENTRY_CHK): Likewise.
	(libc_hidden_builtin_def): Likewise.
	Don't include ../memcpy.S.
	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
	__memcpy_chk_avx512_unaligned_erms and
	__memcpy_chk_avx512_unaligned.  Use
	__memcpy_chk_avx_unaligned_erms and
	__memcpy_chk_sse2_unaligned_erms if the processor has ERMS.
	Default to __memcpy_chk_sse2_unaligned.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	Change function suffix from unaligned_2 to unaligned.
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
	(ENTRY): Removed.
	(END): Likewise.
	(ENTRY_CHK): Likewise.
	(libc_hidden_builtin_def): Likewise.
	Don't include ../mempcpy.S.
	(mempcpy): New.  Add a weak alias.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
	__mempcpy_chk_avx512_unaligned_erms and
	__mempcpy_chk_avx512_unaligned.  Use
	__mempcpy_chk_avx_unaligned_erms and
	__mempcpy_chk_sse2_unaligned_erms if the processor has ERMS.
	Default to __mempcpy_chk_sse2_unaligned.
2016-06-08 13:58:08 -07:00
H.J. Lu
5e8c5bb1ac X86-64: Remove the previous SSE2/AVX2 memsets
Since the new SSE2/AVX2 memsets are faster than the previous ones, we
can remove the previous SSE2/AVX2 memsets and replace them with the
new ones.  This reduces the size of libc.so by about 900 bytes.

No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
optimized with Enhanced REP STOSB will be used for processors with
ERMS.  The new AVX512 memset will be used for processors with AVX512
which prefer vzeroupper.

	[BZ #19881]
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
	into ...
	* sysdeps/x86_64/memset.S: This.
	(__bzero): Removed.
	(__memset_tail): Likewise.
	(__memset_chk): Likewise.
	(memset): Likewise.
	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
	defined.
	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
	(__memset_zero_constant_len_parameter): Check SHARED instead of
	PIC.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memset-avx2 and memset-sse2-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(__bzero): Enabled.
	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	(memset): Removed.
	(__memset_chk): Likewise.
	(MEMSET_SYMBOL): New.
	(libc_hidden_builtin_def): Replace __memset_sse2 with
	__memset_sse2_unaligned.
	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
	__memset_chk_sse2 and __memset_chk_avx2 with
	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
	Use __memset_chk_sse2_unaligned_erms or
	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
	__memset_chk_avx512_unaligned_erms and
	__memset_chk_avx512_unaligned.
2016-06-08 13:56:14 -07:00
H.J. Lu
3f61232ab3 Fix a typo in comments in memmove-vec-unaligned-erms.S
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Fix
	a typo in comments.
2016-06-06 16:03:21 -07:00
Joseph Myers
5ff81530dd Do not raise "inexact" from x86_64 SSE4.1 ceil, floor (bug 15479).
Continuing fixes for ceil and floor functions not to raise the
"inexact" exception, this patch fixes the x86_64 SSE4.1 versions.  The
roundss / roundsd instructions take an immediate operand that
determines the rounding mode and whether to raise "inexact"; setting
bit 3 disables "inexact", which this patch does.

Remark: we don't have an SSE4.1 version of trunc / truncf (using this
instruction with operand 11); I'd expect one to make sense, but of
course it should be benchmarked against the existing C code.  I'll
file a bug in Bugzilla for the lack of such a version.

Tested for x86_64.
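
Bit 3 of the roundsd/roundss immediate is exposed by the SSE4.1 intrinsics as
_MM_FROUND_NO_EXC; a small C illustration of the same encoding (an illustrative
sketch, not the glibc assembly, and it needs -msse4.1 to compile):

    #include <smmintrin.h>

    /* ceil() via roundsd with the "suppress exceptions" bit (bit 3) set in
       the immediate, so rounding an inexact value does not raise FE_INEXACT.  */
    static double
    ceil_sse41 (double x)
    {
      __m128d v = _mm_set_sd (x);
      v = _mm_round_sd (v, v, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
      return _mm_cvtsd_f64 (v);
    }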

	[BZ #15479]
	* sysdeps/x86_64/fpu/multiarch/s_ceil.S (__ceil_sse41): Set bit 3
	of immediate operand to rounding instruction.
	* sysdeps/x86_64/fpu/multiarch/s_ceilf.S (__ceilf_sse41):
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_floor.S (__floor_sse41):
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_floorf.S (__floorf_sse41):
	Likewise.
2016-05-24 21:11:18 +00:00
H.J. Lu
6901def689 Avoid an extra branch to PLT for -z now
When --enable-bind-now is used to configure the glibc build, we can avoid
an extra branch to the PLT entry by using an indirect branch via the GOT
slot instead, which is similar to the first instruction in the PLT
entry.  Changes in the shared library text section sizes:

Shared library    Before (bytes)   After (bytes)
libm.so             1060813          1060797
libmvec.so           160881           160805
libpthread.so         94992            94984
librt.so              25064            25048
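
The macro change is presumably along these lines (a hedged sketch of the
JUMPTARGET definition in sysdeps/x86_64/sysdep.h; the exact conditionals in
glibc may differ):

    /* With BIND_NOW the GOT entry is resolved at load time, so callers can
       branch through it directly instead of going via the PLT stub.  */
    #ifdef BIND_NOW
    # define JUMPTARGET(name) *name##@GOTPCREL(%rip)
    #else
    # define JUMPTARGET(name) name##@PLT
    #endif

    /* Used in assembly sources as:  call JUMPTARGET(cos)  */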

	* config.h.in (BIND_NOW): New.
	* configure.ac (BIND_NOW): New.  Defined for --enable-bind-now.
	* configure: Regenerated.
	* sysdeps/x86_64/sysdep.h (JUMPTARGET)[BIND_NOW]: Defined to
	indirect branch via the GOT slot.
2016-05-24 08:44:23 -07:00
H.J. Lu
eb2c88c7c8 Remove alignments on jump targets in memset
X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which
increases code size but does not necessarily improve performance.  The
memset benchtest data comparing aligned and unaligned jump targets on
various Intel and AMD processors

https://sourceware.org/bugzilla/attachment.cgi?id=9277

shows that aligning jump targets isn't necessary.

	[BZ #20115]
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset):
	Remove alignments on jump targets.
2016-05-19 08:49:55 -07:00
H.J. Lu
4facca0b0e Call init_cpu_features only if SHARED is defined
In a static executable, init_cpu_features is called early from
__libc_start_main, so there is no need to call it again in dl_platform_init.

	[BZ #20072]
	* sysdeps/i386/dl-machine.h (dl_platform_init): Call
	init_cpu_features only if SHARED is defined.
	* sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise.
2016-05-13 08:29:33 -07:00
H.J. Lu
2a1f15b1a9 Remove x86 ifunc-defines.sym and rtld-global-offsets.sym
Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym.  Remove
x86 ifunc-defines.sym and rtld-global-offsets.sym.  No code changes on
i686 and x86-64.

	* sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers):
	Remove ifunc-defines.sym.
	* sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers):
	Likewise.
	* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed.
	* sysdeps/x86/rtld-global-offsets.sym: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise.
	* sysdeps/x86/Makefile (gen-as-const-headers): Remove
	rtld-global-offsets.sym.
	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ...
	* sysdeps/x86/cpu-features-offsets.sym: This.
	* sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h>
	instead of <ifunc-defines.h> and <rtld-global-offsets.h>.
2016-05-11 05:51:39 -07:00
H.J. Lu
a9558b49b3 Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86
Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86.  No code changes on x86
and x86_64.

	* sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c>
	instead of <sysdeps/x86_64/cacheinfo.c>.
	* sysdeps/x86_64/cacheinfo.c: Moved to ...
	* sysdeps/x86/cacheinfo.c: Here.
2016-05-08 08:49:18 -07:00
Andreas Schwab
b4bcb3aec6 Register extra test objects
This makes sure that the extra test objects are compiled with the correct
MODULE_NAME and dependencies are tracked.
2016-04-13 17:07:13 +02:00
H.J. Lu
a057f5f8cd X86-64: Use non-temporal store in memcpy on large data
The large-memcpy micro benchmark in glibc shows a regression with large
data on Haswell machines.  Using non-temporal stores in memcpy for large
data can improve performance significantly.  This patch adds a threshold
for using non-temporal stores, set to 6 times the shared cache size.
When the size is above the threshold, non-temporal stores are used,
except when the destination and source overlap, since the destination
may be in cache when the source is loaded.

For sizes below 8 times the vector register width, we load all the data
into registers and store it together.  Only forward and backward loops,
which move 4 vector registers at a time, are used to support overlapping
addresses.  For the forward loop, we load the last 4 vector widths of data
and the first vector width of data into registers before the loop and
store them after the loop.  For the backward loop, we load the first 4
vector widths of data and the last vector width of data into registers
before the loop and store them after the loop.
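
A condensed C sketch of the threshold and overlap decision (the threshold
constant, the 16-byte granularity and the alignment handling are
simplifications; the real code moves 4 or 8 vector widths per iteration):

    #include <emmintrin.h>
    #include <stddef.h>

    /* Placeholder for 6 x the shared cache size, per the text above.  */
    #define NON_TEMPORAL_THRESHOLD (6UL * 4 * 1024 * 1024)

    static void
    copy_large (char *dst, const char *src, size_t n)
    {
      int overlap = dst < src + n && src < dst + n;

      if (n >= NON_TEMPORAL_THRESHOLD && !overlap)
        {
          /* Non-temporal stores bypass the cache; assumes dst is 16-byte
             aligned (the real code aligns the destination first).  */
          size_t i;
          for (i = 0; i + 16 <= n; i += 16)
            _mm_stream_si128 ((__m128i *) (dst + i),
                              _mm_loadu_si128 ((const __m128i *) (src + i)));
          _mm_sfence ();
          for (; i < n; i++)                  /* remaining tail bytes */
            dst[i] = src[i];
        }
      else if (dst <= src || dst >= src + n)
        for (size_t i = 0; i < n; i++)        /* forward cached copy */
          dst[i] = src[i];
      else
        for (size_t i = n; i-- > 0; )         /* dst overlaps src tail: backward */
          dst[i] = src[i];
    }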

	[BZ #19928]
	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
	New.
	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6
	times of shared cache size.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
	(VMOVNT): New.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
	(VMOVNT): Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
	(VMOVNT): Likewise.
	(VMOVU): Changed to movups for smaller code sizes.
	(VMOVA): Changed to movaps for smaller code sizes.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
	comments.
	(PREFETCH): New.
	(PREFETCH_SIZE): Likewise.
	(PREFETCHED_LOAD_SIZE): Likewise.
	(PREFETCH_ONE_SET): Likewise.
	Rewrite to use forward and backward loops, which move 4 vector
	registers at a time, to support overlapping addresses and use
	non temporal store if size is above the threshold and there is
	no overlap between destination and source.
2016-04-12 08:10:47 -07:00
Mike Frysinger
b2d4456b33 configure: fix test == usage
POSIX defines the = operator, but not ==.  Fix the few places where we
incorrectly used ==.
2016-04-09 20:05:13 -04:00
H.J. Lu
a7d1c51482 X86-64: Prepare memmove-vec-unaligned-erms.S
Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
default memcpy, mempcpy and memmove.

	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(MEMCPY_SYMBOL): New.
	(MEMPCPY_SYMBOL): Likewise.
	(MEMMOVE_CHK_SYMBOL): Likewise.
	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
	Provide alias for memcpy in libc.a and ld.so.
2016-04-06 10:19:16 -07:00
H.J. Lu
4af1bb06c5 X86-64: Prepare memset-vec-unaligned-erms.S
Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
default memset.

	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
	Disabled for now.
	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
	symbols.  Properly check USE_MULTIARCH on __memset symbols.
2016-04-06 09:10:35 -07:00
H.J. Lu
ec0cac9a1f Force 32-bit displacement in memset-vec-unaligned-erms.S
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
	32-bit displacement to avoid long nop between instructions.
2016-04-05 05:21:19 -07:00
H.J. Lu
696ac77484 Add a comment in memset-sse2-unaligned-erms.S
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
	a comment on VMOVU and VMOVA.
2016-04-05 05:19:18 -07:00
H.J. Lu
5cd7af016d Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
and AVX512 memmove and memset in ld.so.

	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
	if not in libc.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
2016-04-03 14:35:38 -07:00
H.J. Lu
ea2785e96f Fix memmove-vec-unaligned-erms.S
__mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
and __memmove; doing so breaks __memmove_chk.

Don't check source == destination first since it is less common.

	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
	with unaligned_erms.
	(__memmove_erms): Skip if source == destination.
	(__memmove_unaligned_erms): Don't check source == destination
	first.
2016-04-03 12:38:25 -07:00
H.J. Lu
830566307f Add x86-64 memset with unaligned store and rep stosb
Implement x86-64 memset with unaligned store and rep stosb.  Support
16-byte, 32-byte and 64-byte vector register sizes.  A single file
provides 2 implementations of memset, one with rep stosb and the other
without rep stosb.  They share the same code when the size is between 2
times the vector register size and REP_STOSB_THRESHOLD, which defaults
to 2KB.

Key features:

1. Use overlapping stores to avoid branches (see the sketch below).
2. For sizes <= 4 times the vector register size, fully unroll the loop.
3. For sizes > 4 times the vector register size, store 4 vector registers
at a time.
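
The overlapping-store trick in item 1 above, scaled down to 8-byte stores for
brevity (a minimal sketch with assumed bounds, not the vectorized code):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One store from each end covers any length in [8, 16] without
       branching on the exact size; the stores may overlap in the middle.  */
    static void
    memset_small (void *dst, int c, size_t n)   /* assumes 8 <= n <= 16 */
    {
      uint64_t v = 0x0101010101010101ULL * (unsigned char) c;
      memcpy (dst, &v, 8);                       /* first 8 bytes */
      memcpy ((char *) dst + n - 8, &v, 8);      /* last 8 bytes */
    }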

	[BZ #19881]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
	memset-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
	__memset_sse2_unaligned_erms, __memset_erms,
	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
	file.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
	Likewise.
2016-03-31 10:06:07 -07:00
H.J. Lu
88b57b8ed4 Add x86-64 memmove with unaligned load/store and rep movsb
Implement x86-64 memmove with unaligned load/store and rep movsb.
Support 16-byte, 32-byte and 64-byte vector register sizes.  When
the size is <= 8 times the vector register size, there is no check for
address overlap between source and destination.  Since the overhead of
the overlap check is small when the size is > 8 times the vector register
size, memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb
and the other without rep movsb.  They share the same code when the size
is between 2 times the vector register size and REP_MOVSB_THRESHOLD,
which is 2KB for the 16-byte vector register size and scaled up for
larger vector register sizes.

Key features:

1. Use overlapping loads and stores to avoid branches.
2. For sizes <= 8 times the vector register size, load all source data into
registers and store it together (see the sketch after this list).
3. If there is no address overlap between source and destination, copy
from both ends, 4 vector registers at a time.
4. If the address of the destination > the address of the source, copy
backward 8 vector registers at a time.
5. Otherwise, copy forward 8 vector registers at a time.
6. Use rep movsb only for the forward copy.  Avoid slow backward rep movsb
by falling back to copying backward 8 vector registers at a time.
7. Skip the copy when the destination address == the source address.
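
The load-everything-then-store idea in item 2, scaled down to 8-byte "vectors"
(a minimal sketch with assumed bounds): because both halves are loaded before
anything is stored, the move is correct even when the regions overlap, with no
direction check.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static void
    move_small (void *dst, const void *src, size_t n)   /* assumes 8 <= n <= 16 */
    {
      uint64_t head, tail;
      memcpy (&head, src, 8);                          /* load first 8 bytes */
      memcpy (&tail, (const char *) src + n - 8, 8);   /* load last 8 bytes */
      memcpy (dst, &head, 8);                          /* store only after both loads */
      memcpy ((char *) dst + n - 8, &tail, 8);
    }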

	[BZ #19776]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
	memmove-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test
	__memmove_chk_avx512_unaligned_2,
	__memmove_chk_avx512_unaligned_erms,
	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
	__memmove_chk_sse2_unaligned_2,
	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
	__memmove_avx512_unaligned_erms, __memmove_erms,
	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
	__memcpy_chk_avx512_unaligned_2,
	__memcpy_chk_avx512_unaligned_erms,
	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
	__mempcpy_chk_avx512_unaligned_erms,
	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
	__mempcpy_erms.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
	file.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
	Likewise.
2016-03-31 10:04:40 -07:00
H.J. Lu
064f01b10b Make __memcpy_avx512_no_vzeroupper an alias
Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
__memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
to reduce code size of libc.so.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memcpy-avx512-no-vzeroupper.
	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
	to ...
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
	(MEMCPY): Don't define.
	(MEMCPY_CHK): Likewise.
	(MEMPCPY): Likewise.
	(MEMPCPY_CHK): Likewise.
	(MEMPCPY_CHK): Renamed to ...
	(__mempcpy_chk_avx512_no_vzeroupper): This.
	(MEMPCPY_CHK): Renamed to ...
	(__mempcpy_chk_avx512_no_vzeroupper): This.
	(MEMCPY_CHK): Renamed to ...
	(__memmove_chk_avx512_no_vzeroupper): This.
	(MEMCPY): Renamed to ...
	(__memmove_avx512_no_vzeroupper): This.
	(__memcpy_avx512_no_vzeroupper): New alias.
	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
2016-03-28 13:16:22 -07:00
H.J. Lu
c365e615f7 Implement x86-64 multiarch mempcpy in memcpy
Implement x86-64 multiarch mempcpy in memcpy to share most of the code.
This reduces the code size of libc.so.

	[BZ #18858]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
	and mempcpy-avx512-no-vzeroupper.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
	New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
	(MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
2016-03-28 13:13:51 -07:00
H.J. Lu
e41b395523 [x86] Add a feature bit: Fast_Unaligned_Copy
On AMD processors, memcpy optimized with unaligned SSE loads is
slower than memcpy optimized with aligned SSSE3, while other string
functions are faster with unaligned SSE loads.  A feature bit,
Fast_Unaligned_Copy, is added to select the memcpy optimized with
unaligned SSE loads.

	[BZ #19583]
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
	processors.  Set Fast_Copy_Backward for AMD Excavator
	processors.
	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
	New.
	(index_arch_Fast_Unaligned_Copy): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
2016-03-28 04:40:03 -07:00
Florian Weimer
f327f5b47b tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860]
[BZ# 19860]
	* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return
	zero if the compiler does not provide the AVX512F bit.
2016-03-25 11:11:42 +01:00
Joseph Myers
c898991d8b Fix x86_64 / x86 powl inaccuracy for integer exponents (bug 19848).
Bug 19848 reports cases where powl on x86 / x86_64 has error
accumulation, for small integer exponents, larger than permitted by
glibc's accuracy goals, at least in some rounding modes.  This patch
further restricts the exponent range for which the
small-integer-exponent logic is used to limit the possible error
accumulation.

Tested for x86_64 and x86 and ulps updated accordingly.

	[BZ #19848]
	* sysdeps/i386/fpu/e_powl.S (p3): Rename to p2 and change value
	from 8 to 4.
	(__ieee754_powl): Compare integer exponent against 4 not 8.
	* sysdeps/x86_64/fpu/e_powl.S (p3): Rename to p2 and change value
	from 8 to 4.
	(__ieee754_powl): Compare integer exponent against 4 not 8.
	* math/auto-libm-test-in: Add more tests of pow.
	* math/auto-libm-test-out: Regenerated.
	* sysdeps/i386/i686/fpu/multiarch/libm-test-ulps: Update.
	* sysdeps/x86_64/fpu/libm-test-ulps: Likewise.
2016-03-24 01:32:52 +00:00
H.J. Lu
3c9a4cd16c Don't set %rcx twice before "rep movsb"
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
	Don't set %rcx twice before "rep movsb".
2016-03-22 08:36:16 -07:00
H.J. Lu
86ed888255 Use JUMPTARGET in x86-64 mathvec
When the PLT may be used, JUMPTARGET should be used instead of calling
the function directly.

	* sysdeps/x86_64/fpu/multiarch/svml_d_cos2_core_sse4.S
	(_ZGVbN2v_cos_sse4): Use JUMPTARGET to call cos.
	* sysdeps/x86_64/fpu/multiarch/svml_d_cos4_core_avx2.S
	(_ZGVdN4v_cos_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S
	(_ZGVdN4v_cos): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_exp2_core_sse4.S
	(_ZGVbN2v_exp_sse4): Use JUMPTARGET to call exp.
	* sysdeps/x86_64/fpu/multiarch/svml_d_exp4_core_avx2.S
	(_ZGVdN4v_exp_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S
	(_ZGVdN4v_exp): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_log2_core_sse4.S
	(_ZGVbN2v_log_sse4): Use JUMPTARGET to call log.
	* sysdeps/x86_64/fpu/multiarch/svml_d_log4_core_avx2.S
	(_ZGVdN4v_log_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S
	(_ZGVdN4v_log): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_pow2_core_sse4.S
	(_ZGVbN2vv_pow_sse4): Use JUMPTARGET to call pow.
	* sysdeps/x86_64/fpu/multiarch/svml_d_pow4_core_avx2.S
	(_ZGVdN4vv_pow_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_pow8_core_avx512.S
	(_ZGVdN4vv_pow): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sin2_core_sse4.S
	(_ZGVbN2v_sin_sse4): Use JUMPTARGET to call sin.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sin4_core_avx2.S
	(_ZGVdN4v_sin_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S
	(_ZGVdN4v_sin): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sincos2_core_sse4.S
	(_ZGVbN2vvv_sincos_sse4): Use JUMPTARGET to call sin and cos.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sincos4_core_avx2.S
	(_ZGVdN4vvv_sincos_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S
	(_ZGVdN4vvv_sincos): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S
	(_ZGVdN8v_cosf): Use JUMPTARGET to call cosf.
	* sysdeps/x86_64/fpu/multiarch/svml_s_cosf4_core_sse4.S
	(_ZGVbN4v_cosf_sse4): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_cosf8_core_avx2.S
	(_ZGVdN8v_cosf_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S
	(_ZGVdN8v_expf): Use JUMPTARGET to call expf.
	* sysdeps/x86_64/fpu/multiarch/svml_s_expf4_core_sse4.S
	(_ZGVbN4v_expf_sse4): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_expf8_core_avx2.S
	(_ZGVdN8v_expf_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S
	(_ZGVdN8v_logf): Use JUMPTARGET to call logf.
	* sysdeps/x86_64/fpu/multiarch/svml_s_logf4_core_sse4.S
	(_ZGVbN4v_logf_sse4): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_logf8_core_avx2.S
	(_ZGVdN8v_logf_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S
	(_ZGVdN8vv_powf): Use JUMPTARGET to call powf.
	* sysdeps/x86_64/fpu/multiarch/svml_s_powf4_core_sse4.S
	(_ZGVbN4vv_powf_sse4): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_powf8_core_avx2.S
	(_ZGVdN8vv_powf_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S
	(_ZGVdN8vv_powf): Use JUMPTARGET to call sinf and cosf.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf4_core_sse4.S
	(_ZGVbN4vvv_sincosf_sse4): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf8_core_avx2.S
	(_ZGVdN8vvv_sincosf_avx2): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S
	(_ZGVdN8v_sinf): Use JUMPTARGET to call sinf.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sinf4_core_sse4.S
	(_ZGVbN4v_sinf_sse4): Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sinf8_core_avx2.S
	(_ZGVdN8v_sinf_avx2): Likewise.
	* sysdeps/x86_64/fpu/svml_d_wrapper_impl.h (WRAPPER_IMPL_SSE2):
	Use JUMPTARGET to call callee.
	(WRAPPER_IMPL_SSE2_ff): Likewise.
	(WRAPPER_IMPL_SSE2_fFF): Likewise.
	(WRAPPER_IMPL_AVX): Likewise.
	(WRAPPER_IMPL_AVX_ff): Likewise.
	(WRAPPER_IMPL_AVX_fFF): Likewise.
	(WRAPPER_IMPL_AVX512): Likewise.
	(WRAPPER_IMPL_AVX512_ff): Likewise.
	* sysdeps/x86_64/fpu/svml_s_wrapper_impl.h (WRAPPER_IMPL_SSE2):
	Likewise.
	(WRAPPER_IMPL_SSE2_ff): Likewise.
	(WRAPPER_IMPL_SSE2_fFF): Likewise.
	(WRAPPER_IMPL_AVX): Likewise.
	(WRAPPER_IMPL_AVX_ff): Likewise.
	(WRAPPER_IMPL_AVX_fFF): Likewise.
	(WRAPPER_IMPL_AVX512): Likewise.
	(WRAPPER_IMPL_AVX512_ff): Likewise.
	(WRAPPER_IMPL_AVX512_fFF): Likewise.
2016-03-16 14:24:19 -07:00
Roland McGrath
3bd80c0de2 Fix tst-audit10 build when -mavx512f is not supported. 2016-03-08 12:32:59 -08:00
Florian Weimer
3c0f7407ee tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269]
This ensures that GCC will not use unsupported instructions before
the run-time check to ensure support.
2016-03-07 16:00:25 +01:00
H.J. Lu
fee9eb6200 Group AVX512 functions in .text.avx512 section
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
	Replace .text with .text.avx512.
	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
	Likewise.
2016-03-06 16:48:11 -08:00
H.J. Lu
16b23e0363 Replace PREINIT_FUNCTION@PLT with *%rax in call
Since we have loaded address of PREINIT_FUNCTION into %rax, we can
avoid extra branch to PLT slot.

	[BZ #19745]
	* sysdeps/x86_64/crti.S (_init): Replace PREINIT_FUNCTION@PLT
	with *%rax in call.
2016-03-04 16:15:41 -08:00
H.J. Lu
21683b5a7d Replace @PLT with @GOTPCREL(%rip) in call
Since __libc_start_main is called very early, lazy binding isn't relevant
here.  Use an indirect branch via the GOT to avoid an extra branch to the
PLT slot.

	[BZ #19745]
	* sysdeps/x86_64/start.S (_start): Replace __libc_start_main@PLT
	with *__libc_start_main@GOTPCREL(%rip) in call.
2016-03-04 16:15:41 -08:00
H.J. Lu
97f7112728 Add a comment in sysdeps/x86_64/Makefile
Mention recursive calls when ENTRY is used in _mcount.S.

	* sysdeps/x86_64/Makefile (sysdep_noprof): Add a comment.
2016-03-04 08:44:58 -08:00
H.J. Lu
14a1d7cc4c x86-64: Fix memcpy IFUNC selection
Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for
Fast_Copy_Backward to enable __memcpy_ssse3_back.  The existing selection
order is updated to the following:

1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __memcpy_sse2 if SSSE3 isn't available.
4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
5. __memcpy_ssse3
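
Expressed as a plain C decision chain (a hypothetical sketch mirroring the
order above; the real resolver reads these flags from the cpu-features data):

    /* Return the name of the memcpy variant the resolver would pick.  */
    static const char *
    select_memcpy (int avx_fast_unaligned_load, int fast_unaligned_load,
                   int has_ssse3, int fast_copy_backward)
    {
      if (avx_fast_unaligned_load)
        return "__memcpy_avx_unaligned";
      if (fast_unaligned_load)
        return "__memcpy_sse2_unaligned";
      if (!has_ssse3)
        return "__memcpy_sse2";
      if (fast_copy_backward)
        return "__memcpy_ssse3_back";
      return "__memcpy_ssse3";
    }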

	[BZ #18880]
	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
	instead of Slow_BSF, and also check for Fast_Copy_Backward to
	enable __memcpy_ssse3_back.
2016-03-04 08:39:07 -08:00
Paul Pluzhnikov
5cdc3d9db0 2016-03-03 Paul Pluzhnikov <ppluzhnikov@google.com>
[BZ #19490]
	* sysdeps/x86_64/_mcount.S (_mcount): Add unwind descriptor.
	(__fentry__): Likewise
2016-03-03 09:53:49 -08:00
H.J. Lu
87a07a4376 Copy x86_64 _mcount.op from _mcount.o
No need to compile x86_64 _mcount.S with -pg.  We can just copy the
normal static object.

	* gmon/Makefile (noprof): Add $(sysdep_noprof).
	* sysdeps/x86_64/Makefile (sysdep_noprof): Add _mcount.
2016-03-03 06:56:22 -08:00
H.J. Lu
ec215346b9 Call x86-64 __mcount_internal/__sigjmp_save directly
Since __mcount_internal and __sigjmp_save are internal to x86-64 libc.so:

3532: 0000000000104530   289 FUNC    LOCAL  DEFAULT   13 __mcount_internal
3391: 0000000000034170    38 FUNC    LOCAL  DEFAULT   13 __sigjmp_save

they can be called directly without PLT.

	* sysdeps/x86_64/_mcount.S (C_LABEL(_mcount)): Call
	__mcount_internal directly.
	(C_LABEL(__fentry__)): Likewise.
	* sysdeps/x86_64/setjmp.S (__sigsetjmp): Call __sigjmp_save
	directly.
2016-03-01 16:58:07 -08:00
H.J. Lu
8d9c92017d [x86_64] Set DL_RUNTIME_UNALIGNED_VEC_SIZE to 8
Due to GCC bug:

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066

__tls_get_addr may be called with 8-byte stack alignment.  Although
this bug has been fixed in GCC 4.9.4, 5.3 and 6, we can't assume
that the stack will always be aligned to 16 bytes.  Since SSE-optimized
memory/string functions with aligned SSE register loads and stores are
used in the dynamic linker, we must set DL_RUNTIME_UNALIGNED_VEC_SIZE
to 8 so that _dl_runtime_resolve_sse will align the stack before
calling _dl_fixup:

Dump of assembler code for function _dl_runtime_resolve_sse:
   0x00007ffff7deea90 <+0>:	push   %rbx
   0x00007ffff7deea91 <+1>:	mov    %rsp,%rbx
   0x00007ffff7deea94 <+4>:	and    $0xfffffffffffffff0,%rsp
                                ^^^^^^^^^^^ Align stack to 16 bytes
   0x00007ffff7deea98 <+8>:	sub    $0x100,%rsp
   0x00007ffff7deea9f <+15>:	mov    %rax,0xc0(%rsp)
   0x00007ffff7deeaa7 <+23>:	mov    %rcx,0xc8(%rsp)
   0x00007ffff7deeaaf <+31>:	mov    %rdx,0xd0(%rsp)
   0x00007ffff7deeab7 <+39>:	mov    %rsi,0xd8(%rsp)
   0x00007ffff7deeabf <+47>:	mov    %rdi,0xe0(%rsp)
   0x00007ffff7deeac7 <+55>:	mov    %r8,0xe8(%rsp)
   0x00007ffff7deeacf <+63>:	mov    %r9,0xf0(%rsp)
   0x00007ffff7deead7 <+71>:	movaps %xmm0,(%rsp)
   0x00007ffff7deeadb <+75>:	movaps %xmm1,0x10(%rsp)
   0x00007ffff7deeae0 <+80>:	movaps %xmm2,0x20(%rsp)
   0x00007ffff7deeae5 <+85>:	movaps %xmm3,0x30(%rsp)
   0x00007ffff7deeaea <+90>:	movaps %xmm4,0x40(%rsp)
   0x00007ffff7deeaef <+95>:	movaps %xmm5,0x50(%rsp)
   0x00007ffff7deeaf4 <+100>:	movaps %xmm6,0x60(%rsp)
   0x00007ffff7deeaf9 <+105>:	movaps %xmm7,0x70(%rsp)

	[BZ #19679]
	* sysdeps/x86_64/dl-trampoline.S (DL_RUNIME_UNALIGNED_VEC_SIZE):
	Renamed to ...
	(DL_RUNTIME_UNALIGNED_VEC_SIZE): This.  Set to 8.
	(DL_RUNIME_RESOLVE_REALIGN_STACK): Renamed to ...
	(DL_RUNTIME_RESOLVE_REALIGN_STACK): This.  Updated.
	(DL_RUNIME_RESOLVE_REALIGN_STACK): Renamed to ...
	(DL_RUNTIME_RESOLVE_REALIGN_STACK): This.
	* sysdeps/x86_64/dl-trampoline.h
	(DL_RUNIME_RESOLVE_REALIGN_STACK): Renamed to ...
	(DL_RUNTIME_RESOLVE_REALIGN_STACK): This.
2016-02-19 15:45:09 -08:00
Andrew Senkevich
a5df3210a6 Use PIC relocation in ALIAS_IMPL
Since libmvec_nonshared.a may be linked into shared objects, ALIAS_IMPL
should use PIC relocation.

	[BZ #19590]
	* sysdeps/x86_64/fpu/svml_finite_alias.S (ALIAS_IMPL): Use PIC
	relocation.
2016-02-17 14:23:32 -08:00
Paul Pluzhnikov
b274130206 2016-01-20 Paul Pluzhnikov <ppluzhnikov@google.com>
[BZ #19490]
* sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S (pthread_cond_broadcast): Use ENTRY/END
* sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S (pthread_cond_signal): Likewise
* sysdeps/x86_64/nptl/pthread_spin_lock.S (pthread_spin_lock): Likewise
* sysdeps/x86_64/nptl/pthread_spin_trylock.S (pthread_spin_trylock): Likewise
* sysdeps/x86_64/nptl/pthread_spin_unlock.S (pthread_spin_unlock): Likewise
2016-01-20 13:39:20 -08:00