glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-12-04 02:40:06 +00:00

Author	SHA1	Message	Date
Adhemerval Zanella	c579f48edb	Remove cached PID/TID in clone This patch remove the PID cache and usage in current GLIBC code. Current usage is mainly used a performance optimization to avoid the syscall, however it adds some issues: - The exposed clone syscall will try to set pid/tid to make the new thread somewhat compatible with current GLIBC assumptions. This cause a set of issue with new workloads and usecases (such as BZ#17214 and [1]) as well for new internal usage of clone to optimize other algorithms (such as clone plus CLONE_VM for posix_spawn, BZ#19957). - The caching complexity also added some bugs in the past [2] [3] and requires more effort of each port to handle such requirements (for both clone and vfork implementation). - Caching performance gain in mainly on getpid and some specific code paths. The getpid performance leverage is questionable [4], either by the idea of getpid being a hotspot as for the getpid implementation itself (if it is indeed a justifiable hotspot a vDSO symbol could let to a much more simpler solution). Other usage is mainly for non usual code paths, such as pthread cancellation signal and handling. For thread creation (on stack allocation) the code simplification in fact adds some performance gain due the no need of transverse the stack cache and invalidate each element pid. Other thread usages will require a direct getpid syscall, such as cancellation/setxid signal, thread cancellation, thread fail path (at create_thread), and thread signal (pthread_kill and pthread_sigqueue). However these are hardly usual hotspots and I think adding a syscall is justifiable. It also simplifies both the clone and vfork arch-specific implementation. And by review each fork implementation there are some discrepancies that this patch also solves: - microblaze clone/vfork does not set/reset the pid/tid field - hppa uses the default vfork implementation that fallback to fork. Since vfork is deprecated I do not think we should bother with it. The patch also removes the TID caching in clone. My understanding for such semantic is try provide some pthread usage after a user program issue clone directly (as done by thread creation with CLONE_PARENT_SETTID and pthread tid member). However, as stated before in multiple discussions threads, GLIBC provides clone syscalls without further supporting all this semantics. I ran a full make check on x86_64, x32, i686, armhf, aarch64, and powerpc64le. For sparc32, sparc64, and mips I ran the basic fork and vfork tests from posix/ folder (on a qemu system). So it would require further testing on alpha, hppa, ia64, m68k, nios2, s390, sh, and tile (I excluded microblaze because it is already implementing the patch semantic regarding clone/vfork). [1] https://codereview.chromium.org/800183004/ [2] https://sourceware.org/ml/libc-alpha/2006-07/msg00123.html [3] https://sourceware.org/bugzilla/show_bug.cgi?id=15368 [4] http://yarchive.net/comp/linux/getpid_caching.html * sysdeps/nptl/fork.c (__libc_fork): Remove pid cache setting. * nptl/allocatestack.c (allocate_stack): Likewise. (__reclaim_stacks): Likewise. (setxid_signal_thread): Obtain pid through syscall. * nptl/nptl-init.c (sigcancel_handler): Likewise. (sighandle_setxid): Likewise. * nptl/pthread_cancel.c (pthread_cancel): Likewise. * sysdeps/unix/sysv/linux/pthread_kill.c (__pthread_kill): Likewise. * sysdeps/unix/sysv/linux/pthread_sigqueue.c (pthread_sigqueue): Likewise. * sysdeps/unix/sysv/linux/createthread.c (create_thread): Likewise. * sysdeps/unix/sysv/linux/getpid.c: Remove file. * nptl/descr.h (struct pthread): Change comment about pid value. * nptl/pthread_getattr_np.c (pthread_getattr_np): Remove thread pid assert. * sysdeps/unix/sysv/linux/pthread-pids.h (__pthread_initialize_pids): Do not set pid value. * nptl_db/td_ta_thr_iter.c (iterate_thread_list): Remove thread pid cache check. * nptl_db/td_thr_validate.c (td_thr_validate): Likewise. * sysdeps/aarch64/nptl/tcb-offsets.sym: Remove pid offset. * sysdeps/alpha/nptl/tcb-offsets.sym: Likewise. * sysdeps/arm/nptl/tcb-offsets.sym: Likewise. * sysdeps/hppa/nptl/tcb-offsets.sym: Likewise. * sysdeps/i386/nptl/tcb-offsets.sym: Likewise. * sysdeps/ia64/nptl/tcb-offsets.sym: Likewise. * sysdeps/m68k/nptl/tcb-offsets.sym: Likewise. * sysdeps/microblaze/nptl/tcb-offsets.sym: Likewise. * sysdeps/mips/nptl/tcb-offsets.sym: Likewise. * sysdeps/nios2/nptl/tcb-offsets.sym: Likewise. * sysdeps/powerpc/nptl/tcb-offsets.sym: Likewise. * sysdeps/s390/nptl/tcb-offsets.sym: Likewise. * sysdeps/sh/nptl/tcb-offsets.sym: Likewise. * sysdeps/sparc/nptl/tcb-offsets.sym: Likewise. * sysdeps/tile/nptl/tcb-offsets.sym: Likewise. * sysdeps/x86_64/nptl/tcb-offsets.sym: Likewise. * sysdeps/unix/sysv/linux/aarch64/clone.S: Remove pid and tid caching. * sysdeps/unix/sysv/linux/alpha/clone.S: Likewise. * sysdeps/unix/sysv/linux/arm/clone.S: Likewise. * sysdeps/unix/sysv/linux/hppa/clone.S: Likewise. * sysdeps/unix/sysv/linux/i386/clone.S: Likewise. * sysdeps/unix/sysv/linux/ia64/clone2.S: Likewise. * sysdeps/unix/sysv/linux/mips/clone.S: Likewise. * sysdeps/unix/sysv/linux/nios2/clone.S: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc32/clone.S: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S: Likewise. * sysdeps/unix/sysv/linux/s390/s390-32/clone.S: Likewise. * sysdeps/unix/sysv/linux/s390/s390-64/clone.S: Likewise. * sysdeps/unix/sysv/linux/sh/clone.S: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc32/clone.S: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc64/clone.S: Likewise. * sysdeps/unix/sysv/linux/tile/clone.S: Likewise. * sysdeps/unix/sysv/linux/x86_64/clone.S: Likewise. * sysdeps/unix/sysv/linux/aarch64/vfork.S: Remove pid set and reset. * sysdeps/unix/sysv/linux/alpha/vfork.S: Likewise. * sysdeps/unix/sysv/linux/arm/vfork.S: Likewise. * sysdeps/unix/sysv/linux/i386/vfork.S: Likewise. * sysdeps/unix/sysv/linux/ia64/vfork.S: Likewise. * sysdeps/unix/sysv/linux/m68k/clone.S: Likewise. * sysdeps/unix/sysv/linux/m68k/vfork.S: Likewise. * sysdeps/unix/sysv/linux/mips/vfork.S: Likewise. * sysdeps/unix/sysv/linux/nios2/vfork.S: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc32/vfork.S: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc64/vfork.S: Likewise. * sysdeps/unix/sysv/linux/s390/s390-32/vfork.S: Likewise. * sysdeps/unix/sysv/linux/s390/s390-64/vfork.S: Likewise. * sysdeps/unix/sysv/linux/sh/vfork.S: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc32/vfork.S: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc64/vfork.S: Likewise. * sysdeps/unix/sysv/linux/tile/vfork.S: Likewise. * sysdeps/unix/sysv/linux/x86_64/vfork.S: Likewise. * sysdeps/unix/sysv/linux/tst-clone2.c (f): Remove direct pthread struct access. (clone_test): Remove function. (do_test): Rewrite to take in consideration pid is not cached anymore.	2016-11-24 19:38:51 -02:00
Aurelien Jarno	380ec16d62	x86_64: fix static build of __memcpy_chk for compilers defaulting to PIC/PIE When glibc is compiled with gcc 6.2 that has been configured with to default to PIC/PIE, the static version of __memcpy_chk is not built, as the test is done on PIC instead of SHARED. Fix the test to check for SHARED, like it is done for similar functions like memmove_chk. Changelog: * sysdeps/x86_64/memcpy_chk.S (__memcpy_chk): Check for SHARED instead of PIC.	2016-11-24 16:56:26 +01:00
Joseph Myers	799131036e	Do not hardcode platform names in manual/libm-err-tab.pl (bug 14139). manual/libm-err-tab.pl hardcodes a list of names for particular platforms (mapping from sysdeps directory name to friendly name for the manual). This goes against the principle of keeping information about individual platforms in their corresponding sysdeps directory, and the list is also very out-of-date regarding supported platforms and their corresponding sysdeps directories. This patch fixes this by adding a libm-test-ulps-name file alongside each libm-test-ulps file. The script then gets the friendly name from that file, which is required to exist, so it no longer needs to allow for the mapping being missing. Tested for x86_64. [BZ #14139] * manual/libm-err-tab.pl (%pplatforms): Initialize to empty. (find_files): Obtain platform name from libm-test-ulps-name and store in %pplatforms. (canonicalize_platform): Remove. (print_platforms): Use $pplatforms directly. (by_platforms): Do not allow for platforms missing from %pplatforms. * sysdeps/aarch64/libm-test-ulps-name: New file. * sysdeps/alpha/fpu/libm-test-ulps-name: Likewise. * sysdeps/arm/libm-test-ulps-name: Likewise. * sysdeps/generic/libm-test-ulps-name: Likewise. * sysdeps/hppa/fpu/libm-test-ulps-name: Likewise. * sysdeps/i386/fpu/libm-test-ulps-name: Likewise. * sysdeps/i386/i686/fpu/multiarch/libm-test-ulps-name: Likewise. * sysdeps/ia64/fpu/libm-test-ulps-name: Likewise. * sysdeps/m68k/coldfire/fpu/libm-test-ulps-name: Likewise. * sysdeps/m68k/m680x0/fpu/libm-test-ulps-name: Likewise. * sysdeps/microblaze/libm-test-ulps-name: Likewise. * sysdeps/mips/mips32/libm-test-ulps-name: Likewise. * sysdeps/mips/mips64/libm-test-ulps-name: Likewise. * sysdeps/nios2/libm-test-ulps-name: Likewise. * sysdeps/powerpc/fpu/libm-test-ulps-name: Likewise. * sysdeps/powerpc/nofpu/libm-test-ulps-name: Likewise. * sysdeps/s390/fpu/libm-test-ulps-name: Likewise. * sysdeps/sh/libm-test-ulps-name: Likewise. * sysdeps/sparc/fpu/libm-test-ulps-name: Likewise. * sysdeps/tile/libm-test-ulps-name: Likewise. * sysdeps/x86_64/fpu/libm-test-ulps-name: Likewise.	2016-11-04 16:49:06 +00:00
H.J. Lu	0e6d3adc60	Check IFUNC definition in unrelocated shared library [BZ #20019 ] Calling an IFUNC function defined in unrelocated shared library may lead to segfault. This patch issues an error message to request relinking the shared library if it references IFUNC function defined in the unrelocated shared library. [BZ #20019] * sysdeps/i386/dl-machine.h (elf_machine_rel): Check IFUNC definition in unrelocated shared library. * sysdeps/x86_64/dl-machine.h (elf_machine_rela): Likewise.	2016-10-28 09:12:15 -07:00
Joseph Myers	f280fa6d17	Use __builtin_fma more in dbl-64 code. sysdeps/ieee754/dbl-64/dla.h can use a macro DLA_FMS for more efficient double-width operations when fused multiply-subtract is supported. However, this macro is only defined for x86_64, conditional on architecture-specific __FMA4__. This patch makes the code use __builtin_fma conditional on __FP_FAST_FMA, as used elsewhere in glibc. Tested for x86_64, x86 and powerpc. On powerpc (where this is causing fused operations to be used where they weren't previously) I see an increase from 1ulp to 2ulp in the imaginary part of clog10: testing double (without inline functions) Failure: Test: Imaginary part of: clog10 (0x1.7a858p+0 - 0x6.d940dp-4 i) Result: is: -1.2237865208199886e-01 -0x1.f5435146bb61ap-4 should be: -1.2237865208199888e-01 -0x1.f5435146bb61cp-4 difference: 2.7755575615628914e-17 0x1.0000000000000p-55 ulp : 2.0000 max.ulp : 1.0000 Maximal error of real part of: clog10 is : 3 ulp accepted: 3 ulp Maximal error of imaginary part of: clog10 is : 2 ulp accepted: 1 ulp This is actually resulting from atan2 becoming more accurate (atan2 (-0x6.d940dp-4, 0x1.7a858p+0) should ideally be -0x1.208cd6e841554p-2 but was -0x1.208cd6e841555p-2 from a powerpc libm built before this change, and is -0x1.208cd6e841554p-2 from a powerpc libm built after this change). Since these functions are not expected to be correctly rounding by glibc's accuracy goals, neither result is a problem, but this does imply that some of this code, although designed to be correctly rounding, is not in fact correctly rounding (possibly because of GCC creating fused operations where the code does not expect it, something we've only disabled for specific functions where it was found to cause large errors). (Of course as previously discussed I think we should remove the slow cases where an error analysis shows this wouldn't increase the errors much above 0.5ulp; it's only functions such as cratan2 that are expected to be correctly rounding, not atan2.) * sysdeps/ieee754/dbl-64/dla.h [__FP_FAST_FMA] (DLA_FMS): Define macro to use __builtin_fma. * sysdeps/x86_64/fpu/dla.h: Remove file.	2016-09-30 15:49:51 +00:00
Joseph Myers	ec94343f59	Add femode_t functions. TS 18661-1 defines a type femode_t to represent the set of dynamic floating-point control modes (such as the rounding mode and trap enablement modes), and functions fegetmode and fesetmode to manipulate those modes (without affecting other state such as the raised exception flags) and a corresponding macro FE_DFL_MODE. This patch series implements those interfaces for glibc. This first patch adds the architecture-independent pieces, the x86 and x86_64 implementations, and the <bits/fenv.h> and ABI baseline updates for all architectures so glibc keeps building and passing the ABI tests on all architectures. Subsequent patches add the fegetmode and fesetmode implementations for other architectures. femode_t is generally an integer type - the same type as fenv_t, or as the single element of fenv_t where fenv_t is a structure containing a single integer (or the single relevant element, where it has elements for both status and control registers) - except where architecture properties or consistency with the fenv_t implementation indicate otherwise. FE_DFL_MODE follows FE_DFL_ENV in whether it's a magic pointer value (-1 cast to const femode_t ), a value that can be distinguished from valid pointers by its high bits but otherwise contains a representation of the desired register contents, or a pointer to a constant variable (the powerpc case; __fe_dfl_mode is added as an exported constant object, an alias to __fe_dfl_env). Note that where architectures (that share a register between control and status bits) gain definitions of new floating-point control or status bits in future, the implementations of fesetmode for those architectures may need updating (depending on whether the new bits are control or status bits and what the implementation does with previously unknown bits), just like existing implementations of <fenv.h> functions that take care not to touch reserved bits may need updating when the set of reserved bits changes. (As any new bits are outside the scope of ISO C, that's just a quality-of-implementation issue for supporting them, not a conformance issue.) As with fenv_t, femode_t should properly include any software DFP rounding mode (and for both fenv_t and femode_t I'd consider that fragment of DFP support appropriate for inclusion in glibc even in the absence of the rest of libdfp; hardware DFP rounding modes should already be included if the definitions of which bits are status / control bits are correct). Tested for x86_64, x86, mips64 (hard float, and soft float to test the fallback version), arm (hard float) and powerpc (hard float, soft float and e500). Other architecture versions are untested. math/fegetmode.c: New file. * math/fesetmode.c: Likewise. * sysdeps/i386/fpu/fegetmode.c: Likewise. * sysdeps/i386/fpu/fesetmode.c: Likewise. * sysdeps/x86_64/fpu/fegetmode.c: Likewise. * sysdeps/x86_64/fpu/fesetmode.c: Likewise. * math/fenv.h: Update comment on inclusion of <bits/fenv.h>. [__GLIBC_USE (IEC_60559_BFP_EXT)] (fegetmode): New function declaration. [__GLIBC_USE (IEC_60559_BFP_EXT)] (fesetmode): Likewise. * bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/aarch64/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/alpha/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/arm/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/hppa/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/ia64/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/m68k/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/microblaze/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/mips/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/nios2/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/powerpc/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (__fe_dfl_mode): New variable declaration. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/s390/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/sh/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/sparc/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/tile/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * sysdeps/x86/fpu/bits/fenv.h [__GLIBC_USE (IEC_60559_BFP_EXT)] (femode_t): New typedef. [__GLIBC_USE (IEC_60559_BFP_EXT)] (FE_DFL_MODE): New macro. * manual/arith.texi (FE_DFL_MODE): Document macro. (fegetmode): Document function. (fesetmode): Likewise. * math/Versions (fegetmode): New libm symbol at version GLIBC_2.25. (fesetmode): Likewise. * math/Makefile (libm-support): Add fegetmode and fesetmode. (tests): Add test-femode and test-femode-traps. * math/test-femode-traps.c: New file. * math/test-femode.c: Likewise. * sysdeps/powerpc/fpu/fenv_const.c (__fe_dfl_mode): Declare as alias for __fe_dfl_env. * sysdeps/powerpc/nofpu/fenv_const.c (__fe_dfl_mode): Likewise. * sysdeps/powerpc/powerpc32/e500/nofpu/fenv_const.c (__fe_dfl_mode): Likewise. * sysdeps/powerpc/Versions (__fe_dfl_mode): New libm symbol at version GLIBC_2.25. * sysdeps/nacl/libm.abilist: Update. * sysdeps/unix/sysv/linux/aarch64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/alpha/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/arm/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/hppa/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/i386/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/ia64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/microblaze/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/nios2/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/sh/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/tile/tilegx/tilegx32/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/tile/tilegx/tilegx64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/tile/tilepro/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Likewise.	2016-09-07 16:40:09 +00:00
H.J. Lu	fb0f7a6755	X86-64: Add _dl_runtime_resolve_avx[512]_{opt\|slow} [BZ #20508 ] There is transition penalty when SSE instructions are mixed with 256-bit AVX or 512-bit AVX512 load instructions. Since _dl_runtime_resolve_avx and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM registers, there is transition penalty when SSE instructions are used with lazy binding on AVX and AVX512 processors. To avoid SSE transition penalty, if only the lower 128 bits of the first 8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers with the zero upper bits. For AVX and AVX512 processors which support XGETBV with ECX == 1, we can use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers or the upper 256 bits of ZMM registers are zero. We can restore only the non-zero portion of vector registers with AVX/AVX512 load instructions which will zero-extend upper bits of vector registers. This patch adds _dl_runtime_resolve_sse_vex which saves and restores XMM registers with 128-bit AVX store/load instructions. It is used to preserve YMM/ZMM registers when only the lower 128 bits are non-zero. _dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so that we store and load only the non-zero portion of vector registers. This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and _dl_runtime_profile_avx512 when only the lower 128 bits of vector registers are used. _dl_runtime_resolve_avx_slow is added and used for AVX processors which don't support XGETBV with ECX == 1. Since there is no SSE transition penalty on AVX512 processors which don't support XGETBV with ECX == 1, _dl_runtime_resolve_avx512_slow isn't provided. [BZ #20495] [BZ #20508] * sysdeps/x86/cpu-features.c (init_cpu_features): For Intel processors, set Use_dl_runtime_resolve_slow and set Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1. * sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt): New. (bit_arch_Use_dl_runtime_resolve_slow): Likewise. (index_arch_Use_dl_runtime_resolve_opt): Likewise. (index_arch_Use_dl_runtime_resolve_slow): Likewise. * sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use _dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt if Use_dl_runtime_resolve_opt is set. Use _dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set. * sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>. (_dl_runtime_resolve_opt): New. Defined for AVX and AVX512. (_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex. * sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow): New. (_dl_runtime_resolve_opt): Likewise. (_dl_runtime_profile): Define only if _dl_runtime_profile is defined.	2016-09-06 08:51:07 -07:00
Paul E. Murphy	2bad840e9d	Remove unneeded stubs for k_rem_pio2l. This is only used for the float and double variants. Instead, just add it to the type specific list of files, and remove all stubs, and remove the declaration from math_private.h. I verified x86_64, i486, ia64, m68k, and ppc64 build.	2016-09-01 09:31:06 -05:00
H.J. Lu	0ac8ee53e8	X86-64: Correct CFA in _dl_runtime_resolve When stack is re-aligned in _dl_runtime_resolve, there is no need to adjust CFA when allocating register save area on stack. * sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve): Don't adjust CFA when allocating register save area on re-aligned stack.	2016-08-26 08:57:54 -07:00
Joseph Myers	5146356f5a	Add fesetexcept. TS 18661-1 defines an fesetexcept function for setting floating-point exception flags without the side-effect of causing enabled traps to be taken. This patch series implements this function for glibc. The present patch adds the fallback stub implementation, x86 and x86_64 implementations, documentation, tests and ABI baseline updates. The remaining patches, some of them untested, add implementations for other architectures. The implementations generally follow those of the fesetexceptflag function. As for fesetexceptflag, the approach taken for architectures where setting flags causes enabled traps to be taken is to set the flags (and potentially cause traps) rather than refusing to set the flags and returning an error. Since ISO C and TS 18661 provide no way to enable traps, this is formally in accordance with the standards. The NEWS entry should be considered a placeholder, since this patch series is intended to be followed by further such series adding other TS 18661-1 features, so that the NEWS entry would end up looking more like * New <fenv.h> features from TS 18661-1:2014 are added to libm: the fesetexcept, fetestexceptflag, fegetmode and fesetmode functions, the femode_t type and the FE_DFL_MODE macro. with hopefully more such entries for other features, rather than having an entry for a single function in the end. I believe we have consensus for adding TS 18661-1 interfaces as per <https://sourceware.org/ml/libc-alpha/2016-06/msg00421.html>. Tested for x86_64, x86, mips64 (hard float, and soft float to test the fallback version), arm (hard float) and powerpc (hard float, soft float and e500). * math/fesetexcept.c: New file. * sysdeps/i386/fpu/fesetexcept.c: Likewise. * sysdeps/x86_64/fpu/fesetexcept.c: Likewise. * math/fenv.h: Define __GLIBC_INTERNAL_STARTING_HEADER_IMPLEMENTATION and include <bits/libc-header-start.h> instead of including <features.h>. [__GLIBC_USE (IEC_60559_BFP_EXT)] (fesetexcept): New function declaration. * manual/arith.texi (fesetexcept): Document function. * math/Versions (fesetexcept): New libm symbol at version GLIBC_2.25. * math/Makefile (libm-support): Add fesetexcept. (tests): Add test-fesetexcept and test-fesetexcept-traps. * math/test-fesetexcept.c: New file. * math/test-fesetexcept-traps.c: Likewise. * sysdeps/nacl/libm.abilist: Update. * sysdeps/unix/sysv/linux/aarch64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/alpha/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/arm/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/hppa/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/i386/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/ia64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/microblaze/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/nios2/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/sh/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/tile/tilegx/tilegx32/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/tile/tilegx/tilegx64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/tile/tilepro/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Likewise. * sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Likewise.	2016-08-16 16:16:10 +00:00
Andrew Senkevich	533f9bebf9	x86_64: Call finite scalar versions in vectorized log, pow, exp (bz #20033 ). Vector math functions require -ffast-math which sets -ffinite-math-only, so it is needed to call finite scalar versions (which are called from vector functions in some cases). Since finite version of pow() returns qNaN instead of 1.0 for several inputs, those inputs are excluded for tests of vector math functions. [BZ #20033] * sysdeps/x86_64/fpu/multiarch/svml_d_exp2_core_sse4.S: Call finite version. * sysdeps/x86_64/fpu/multiarch/svml_d_exp4_core_avx2.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_log2_core_sse4.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_log4_core_avx2.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_pow2_core_sse4.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_pow4_core_avx2.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_pow8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_expf4_core_sse4.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_expf8_core_avx2.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_logf4_core_sse4.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_logf8_core_avx2.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_powf4_core_sse4.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_powf8_core_avx2.S: Likewise. * sysdeps/x86_64/fpu/svml_d_exp2_core.S: Likewise. * sysdeps/x86_64/fpu/svml_d_log2_core.S: Likewise. * sysdeps/x86_64/fpu/svml_d_pow2_core.S: Likewise. * sysdeps/x86_64/fpu/svml_s_expf4_core.S: Likewise. * sysdeps/x86_64/fpu/svml_s_logf4_core.S: Likewise. * sysdeps/x86_64/fpu/svml_s_powf4_core.S: Likewise. * math/libm-test.inc (pow_test_data): Exclude tests for qNaN in power zero.	2016-08-02 16:35:25 +03:00
H.J. Lu	fe0cf86148	Don't compile do_test with -mavx/-mavx/-mavx512 Don't compile do_test with -mavx, -mavx nor -mavx512 since they won't run on non-AVX machines. [BZ #20384] * sysdeps/x86_64/fpu/Makefile (extra-test-objs): Add test-double-libmvec-sincos-avx-main.o, test-double-libmvec-sincos-avx2-main.o, test-double-libmvec-sincos-main.o, test-float-libmvec-sincosf-avx-main.o, test-float-libmvec-sincosf-avx2-main.o and test-float-libmvec-sincosf-main.o. test-float-libmvec-sincosf-avx512-main.o. ($(objpfx)test-double-libmvec-sincos): Also link with $(objpfx)test-double-libmvec-sincos-main.o. ($(objpfx)test-double-libmvec-sincos-avx): Also link with $(objpfx)test-double-libmvec-sincos-avx-main.o. ($(objpfx)test-double-libmvec-sincos-avx2): Also link with $(objpfx)test-double-libmvec-sincos-avx2-main.o. ($(objpfx)test-float-libmvec-sincosf): Also link with $(objpfx)test-float-libmvec-sincosf-main.o. ($(objpfx)test-float-libmvec-sincosf-avx): Also link with $(objpfx)test-float-libmvec-sincosf-avx2-main.o. [$(config-cflags-avx512) == yes] (extra-test-objs): Add test-double-libmvec-sincos-avx512-main.o and ($(objpfx)test-double-libmvec-sincos-avx512): Also link with $(objpfx)test-double-libmvec-sincos-avx512-main.o. ($(objpfx)test-float-libmvec-sincosf-avx512): Also link with $(objpfx)test-float-libmvec-sincosf-avx512-main.o. (CFLAGS-test-double-libmvec-sincos.c): Removed. (CFLAGS-test-float-libmvec-sincosf.c): Likewise. (CFLAGS-test-double-libmvec-sincos-main.c): New. (CFLAGS-test-double-libmvec-sincos-avx-main.c): Likewise. (CFLAGS-test-double-libmvec-sincos-avx2-main.c): Likewise. (CFLAGS-test-float-libmvec-sincosf-main.c): Likewise. (CFLAGS-test-float-libmvec-sincosf-avx-main.c): Likewise. (CFLAGS-test-float-libmvec-sincosf-avx2-main.c): Likewise. (CFLAGS-test-float-libmvec-sincosf-avx512-main.c): Likewise. (CFLAGS-test-double-libmvec-sincos-avx.c): Set to -DREQUIRE_AVX. (CFLAGS-test-float-libmvec-sincosf-avx.c ): Likewise. (CFLAGS-test-double-libmvec-sincos-avx2.c): Set to -DREQUIRE_AVX2. (CFLAGS-test-float-libmvec-sincosf-avx2.c ): Likewise. (CFLAGS-test-double-libmvec-sincos-avx512.c): Set to -DREQUIRE_AVX512F. (CFLAGS-test-float-libmvec-sincosf-avx512.c): Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-sincos.c: Rewritten. * sysdeps/x86_64/fpu/test-float-libmvec-sincosf.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx-main.c: New file. * sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx2-main.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx512-main.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-sincos-main.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx-main.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx2-main.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx512-main.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-sincosf-main.c: Likewise.	2016-07-27 11:53:15 -07:00
H.J. Lu	61655555aa	x86-64: Properly align stack in _dl_tlsdesc_dynamic [BZ #20309 ] Since _dl_tlsdesc_dynamic is called via PLT, we need to add 8 bytes for push in the PLT entry to align the stack. [BZ #20309] * configure.ac (have-mtls-dialect-gnu2): Set to yes if -mtls-dialect=gnu2 works. * configure: Regenerated. * elf/Makefile [have-mtls-dialect-gnu2 = yes] (tests): Add tst-gnu2-tls1. (modules-names): Add tst-gnu2-tls1mod. ($(objpfx)tst-gnu2-tls1): New. (tst-gnu2-tls1mod.so-no-z-defs): Likewise. (CFLAGS-tst-gnu2-tls1mod.c): Likewise. * elf/tst-gnu2-tls1.c: New file. * elf/tst-gnu2-tls1mod.c: Likewise. * sysdeps/x86_64/dl-tlsdesc.S (_dl_tlsdesc_dynamic): Add 8 bytes for push in the PLT entry to align the stack.	2016-07-12 06:30:08 -07:00
H.J. Lu	f43cb35c9b	Require binutils 2.24 to build x86-64 glibc [BZ #20139 ] If assembler doesn't support AVX512DQ, _dl_runtime_resolve_avx is used to save the first 8 vector registers, which only saves the lower 256 bits of vector register, for lazy binding. When it is called on AVX512 platform, the upper 256 bits of ZMM registers are clobbered. Parameters passed in ZMM registers will be wrong when the function is called the first time. This patch requires binutils 2.24, whose assembler can store and load ZMM registers, to build x86-64 glibc. Since mathvec library needs assembler support for AVX512DQ, we disable mathvec if assembler doesn't support AVX512DQ. [BZ #20139] * config.h.in (HAVE_AVX512_ASM_SUPPORT): Renamed to ... (HAVE_AVX512DQ_ASM_SUPPORT): This. * sysdeps/x86_64/configure.ac: Require assembler from binutils 2.24 or above. (HAVE_AVX512_ASM_SUPPORT): Removed. (HAVE_AVX512DQ_ASM_SUPPORT): New. * sysdeps/x86_64/configure: Regenerated. * sysdeps/x86_64/dl-trampoline.S: Make HAVE_AVX512_ASM_SUPPORT check unconditional. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Likewise. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset.S: Likewise. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Check HAVE_AVX512DQ_ASM_SUPPORT instead of HAVE_AVX512_ASM_SUPPORT. * sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_pow8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx51: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S: Likewise.	2016-07-01 06:03:05 -07:00
Andrew Senkevich	ee2196bb67	Fixed wrong vector sincos/sincosf ABI to have it compatible with current vector function declaration "#pragma omp declare simd notinbranch", according to which vector sincos should have vector of pointers for second and third parameters. It is fixed with implementation as wrapper to version having second and third parameters as pointers. [BZ #20024] * sysdeps/x86/fpu/test-math-vector-sincos.h: New. * sysdeps/x86_64/fpu/multiarch/svml_d_sincos2_core_sse4.S: Fixed ABI of this implementation of vector function. * sysdeps/x86_64/fpu/multiarch/svml_d_sincos4_core_avx2.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_sincosf4_core_sse4.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_sincosf8_core_avx2.S: Likewise. * sysdeps/x86_64/fpu/svml_d_sincos2_core.S: Likewise. * sysdeps/x86_64/fpu/svml_d_sincos4_core.S: Likewise. * sysdeps/x86_64/fpu/svml_d_sincos4_core_avx.S: Likewise. * sysdeps/x86_64/fpu/svml_d_sincos8_core.S: Likewise. * sysdeps/x86_64/fpu/svml_s_sincosf16_core.S: Likewise. * sysdeps/x86_64/fpu/svml_s_sincosf4_core.S: Likewise. * sysdeps/x86_64/fpu/svml_s_sincosf8_core.S: Likewise. * sysdeps/x86_64/fpu/svml_s_sincosf8_core_avx.S: Likewise. * sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c: Use another wrapper for testing vector sincos with fixed ABI. * sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c: Likewise. * sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c: Likewise. * sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c: Likewise. * sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c: Likewise. * sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c: Likewise. * sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c: Likewise. * sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx.c: New test. * sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx2.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx512.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-sincos.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx2.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx512.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-sincosf.c: Likewise. * sysdeps/x86_64/fpu/Makefile: Added new tests.	2016-07-01 14:15:38 +03:00
H.J. Lu	13efa86ece	Check Prefer_ERMS in memmove/memcpy/mempcpy/memset Although the Enhanced REP MOVSB/STOSB (ERMS) implementations of memmove, memcpy, mempcpy and memset aren't used by the current processors, this patch adds Prefer_ERMS check in memmove, memcpy, mempcpy and memset so that they can be used in the future. * sysdeps/x86/cpu-features.h (bit_arch_Prefer_ERMS): New. (index_arch_Prefer_ERMS): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Return __memcpy_erms for Prefer_ERMS. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (__memmove_erms): Enabled for libc.a. * ysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Return __memmove_erms or Prefer_ERMS. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Return __mempcpy_erms for Prefer_ERMS. * sysdeps/x86_64/multiarch/memset.S (memset): Return __memset_erms for Prefer_ERMS.	2016-06-30 07:58:11 -07:00
Joseph Myers	30dcf959d2	Avoid "inexact" exceptions in i386/x86_64 trunc functions (bug 15479). As discussed in <https://sourceware.org/ml/libc-alpha/2016-05/msg00577.html>, TS 18661-1 disallows ceil, floor, round and trunc functions from raising the "inexact" exception, in accordance with general IEEE 754 semantics for when that exception is raised. Fixing this for x87 floating point is more complicated than for the other versions of these functions, because they use the frndint instruction that raises "inexact" and this can only be avoided by saving and restoring the whole floating-point environment. As I noted in <https://sourceware.org/ml/libc-alpha/2016-06/msg00128.html>, I have now implemented a GCC option -fno-fp-int-builtin-inexact for GCC 7, such that GCC will inline these functions on x86, without caring about "inexact", when the default -ffp-int-builtin-inexact is in effect. This allows users to get optimized code depending on the options they pass to the compiler, while making the out-of-line functions follow TS 18661-1 semantics and avoid "inexact". This patch duly fixes the out-of-line trunc function implementations to avoid "inexact", in the same way as the nearbyint implementations. I do not know how the performance of implementations such as these based on saving the environment and changing the rounding mode temporarily compares to that of the C versions or SSE 4.1 versions (of course, for 32-bit x86 SSE implementations still need to get the return value in an x87 register); it's entirely possible other implementations could be faster in some cases. Tested for x86_64 and x86. [BZ #15479] * sysdeps/i386/fpu/s_trunc.S (__trunc): Save and restore floating-point environment rather than just control word. * sysdeps/i386/fpu/s_truncf.S (__truncf): Likewise. * sysdeps/i386/fpu/s_truncl.S (__truncl): Save and restore floating-point environment, with "invalid" exceptions merged in, rather than just control word. * sysdeps/x86_64/fpu/s_truncl.S (__truncl): Likewise. * math/libm-test.inc (trunc_test_data): Do not allow spurious "inexact" exceptions.	2016-06-27 17:26:52 +00:00
Joseph Myers	623629de06	Avoid "inexact" exceptions in i386/x86_64 floor functions (bug 15479). As discussed in <https://sourceware.org/ml/libc-alpha/2016-05/msg00577.html>, TS 18661-1 disallows ceil, floor, round and trunc functions from raising the "inexact" exception, in accordance with general IEEE 754 semantics for when that exception is raised. Fixing this for x87 floating point is more complicated than for the other versions of these functions, because they use the frndint instruction that raises "inexact" and this can only be avoided by saving and restoring the whole floating-point environment. As I noted in <https://sourceware.org/ml/libc-alpha/2016-06/msg00128.html>, I have now implemented a GCC option -fno-fp-int-builtin-inexact for GCC 7, such that GCC will inline these functions on x86, without caring about "inexact", when the default -ffp-int-builtin-inexact is in effect. This allows users to get optimized code depending on the options they pass to the compiler, while making the out-of-line functions follow TS 18661-1 semantics and avoid "inexact". This patch duly fixes the out-of-line floor function implementations to avoid "inexact", in the same way as the nearbyint implementations. I do not know how the performance of implementations such as these based on saving the environment and changing the rounding mode temporarily compares to that of the C versions or SSE 4.1 versions (of course, for 32-bit x86 SSE implementations still need to get the return value in an x87 register); it's entirely possible other implementations could be faster in some cases. Tested for x86_64 and x86. [BZ #15479] * sysdeps/i386/fpu/s_floor.S (__floor): Save and restore floating-point environment rather than just control word. * sysdeps/i386/fpu/s_floorf.S (__floorf): Likewise. * sysdeps/i386/fpu/s_floorl.S (__floorl): Save and restore floating-point environment, with "invalid" exceptions merged in, rather than just control word. * sysdeps/x86_64/fpu/s_floorl.S (__floorl): Likewise. * math/libm-test.inc (floor_test_data): Do not allow spurious "inexact" exceptions.	2016-06-27 17:25:47 +00:00
Joseph Myers	26b0bf9600	Avoid "inexact" exceptions in i386/x86_64 ceil functions (bug 15479). As discussed in <https://sourceware.org/ml/libc-alpha/2016-05/msg00577.html>, TS 18661-1 disallows ceil, floor, round and trunc functions from raising the "inexact" exception, in accordance with general IEEE 754 semantics for when that exception is raised. Fixing this for x87 floating point is more complicated than for the other versions of these functions, because they use the frndint instruction that raises "inexact" and this can only be avoided by saving and restoring the whole floating-point environment. As I noted in <https://sourceware.org/ml/libc-alpha/2016-06/msg00128.html>, I have now implemented a GCC option -fno-fp-int-builtin-inexact for GCC 7, such that GCC will inline these functions on x86, without caring about "inexact", when the default -ffp-int-builtin-inexact is in effect. This allows users to get optimized code depending on the options they pass to the compiler, while making the out-of-line functions follow TS 18661-1 semantics and avoid "inexact". This patch duly fixes the out-of-line ceil function implementations to avoid "inexact", in the same way as the nearbyint implementations. I do not know how the performance of implementations such as these based on saving the environment and changing the rounding mode temporarily compares to that of the C versions or SSE 4.1 versions (of course, for 32-bit x86 SSE implementations still need to get the return value in an x87 register); it's entirely possible other implementations could be faster in some cases. Tested for x86_64 and x86. [BZ #15479] * sysdeps/i386/fpu/s_ceil.S (__ceil): Save and restore floating-point environment rather than just control word. * sysdeps/i386/fpu/s_ceilf.S (__ceilf): Likewise. * sysdeps/i386/fpu/s_ceill.S (__ceill): Save and restore floating-point environment, with "invalid" exceptions merged in, rather than just control word. * sysdeps/x86_64/fpu/s_ceill.S (__ceill): Likewise. * math/libm-test.inc (ceil_test_data): Do not allow spurious "inexact" exceptions.	2016-06-27 17:24:30 +00:00
Joseph Myers	40244be372	Fix i386/x86_64 scalbl with sNaN input (bug 20296). The x86_64 and i386 versions of scalbl return sNaN for some cases of sNaN input and are missing "invalid" exceptions for other cases. This results from overly complicated code that either returns a NaN input, or discards both inputs when one is NaN and loads a NaN from memory. This patch fixes this by simplifying the code to add the arguments when either one is NaN. Tested for x86_64 and x86. [BZ #20296] * sysdeps/i386/fpu/e_scalbl.S (__ieee754_scalbl): Add arguments when either argument is a NaN. * sysdeps/x86_64/fpu/e_scalbl.S (__ieee754_scalbl): Likewise. * math/libm-test.inc (scalb_test_data): Add sNaN tests.	2016-06-23 22:17:41 +00:00
Joseph Myers	4e9bf327ad	Simplify x86 nearbyint functions. The i386 implementations of nearbyint functions, and x86_64 nearbyintl, contain code to mask the "inexact" exception. However, the fnstenv instruction has the effect of masking all exceptions, so this masking code has been redundant since fnstenv was added to those implementations (by commit 846d9a4a3acdb4939ca7bf6aed48f9f6f26911be; commit `71d1b0166b` added the test math/test-nearbyint-except-2.c that verifies these functions do work when called with "inexact" traps enabled); this patch removes the redundant code. Tested for x86_64 and x86. * sysdeps/i386/fpu/s_nearbyint.S (__nearbyint): Do not mask "inexact" exceptions after fnstenv. * sysdeps/i386/fpu/s_nearbyintf.S (__nearbyintf): Likewise. * sysdeps/i386/fpu/s_nearbyintl.S (__nearbyintl): Likewise. * sysdeps/x86_64/fpu/s_nearbyintl.S (__nearbyintl): Likewise.	2016-06-22 15:40:30 +00:00
Andrew Senkevich	df2258c6cb	Added tests to ensure linkage through libmvec _finite aliases which are defined in libmvec_nonshared.a (bug 19654). [BZ #19654] sysdeps/x86_64/fpu/Makefile: Added new tests. * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx-main.c: New. * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx-mod.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx2-main.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx2-mod.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx2.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx512-main.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx512-mod.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias-avx512.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias-main.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias-mod.c: Likewise. * sysdeps/x86_64/fpu/test-double-libmvec-alias.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx-main.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx-mod.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx2-main.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx2-mod.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx2.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx512-main.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx512-mod.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-avx512.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-main.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias-mod.c: Likewise. * sysdeps/x86_64/fpu/test-float-libmvec-alias.c: Likewise. * sysdeps/x86_64/fpu/test-libmvec-alias-mod.c: Likewise.	2016-06-20 21:15:50 +03:00
Florian Weimer	aca1daef29	elf: Consolidate machine-agnostic DTV definitions in <dl-dtv.h> Identical definitions of dtv_t and TLS_DTV_UNALLOCATED were repeated for all architectures using DTVs.	2016-06-20 14:31:40 +02:00
Joseph Myers	f4015c8a86	Use generic fdim on more architectures (bug 6796, bug 20255, bug 20256). Some architectures have their own versions of fdim functions, which are missing errno setting (bug 6796) and may also return sNaN instead of qNaN for sNaN input, in the case of the x86 / x86_64 long double versions (bug 20256). These versions are not actually doing anything that a compiler couldn't generate, just straightforward comparisons / arithmetic (and, in the x86 / x86_64 case, testing for NaNs with fxam, which isn't actually needed once you use an unordered comparison and let the NaNs pass through the same subtraction as non-NaN inputs). This patch removes the x86 / x86_64 / powerpc versions, so that those architectures use the generic C versions, which correctly handle setting errno and deal properly with sNaN inputs. This seems better than dealing with setting errno in lots of .S versions. The i386 versions also return results with excess range and precision, which is not appropriate for a function exactly defined by reference to IEEE operations. For errno setting to work correctly on overflow, it's necessary to remove excess range with math_narrow_eval, which this patch duly does in the float and double versions so that the tests can reliably pass on x86. For float, this avoids any double rounding issues as the long double precision is more than twice that of float. For double, double rounding issues will need to be addressed separately, so this patch does not fully fix bug 20255. Tested for x86_64, x86 and powerpc. [BZ #6796] [BZ #20255] [BZ #20256] * math/s_fdim.c: Include <math_private.h>. (__fdim): Use math_narrow_eval on result. * math/s_fdimf.c: Include <math_private.h>. (__fdimf): Use math_narrow_eval on result. * sysdeps/i386/fpu/s_fdim.S: Remove file. * sysdeps/i386/fpu/s_fdimf.S: Likewise. * sysdeps/i386/fpu/s_fdiml.S: Likewise. * sysdeps/i386/i686/fpu/s_fdim.S: Likewise. * sysdeps/i386/i686/fpu/s_fdimf.S: Likewise. * sysdeps/i386/i686/fpu/s_fdiml.S: Likewise. * sysdeps/powerpc/fpu/s_fdim.c: Likewise. * sysdeps/powerpc/fpu/s_fdimf.c: Likewise. * sysdeps/powerpc/powerpc32/fpu/s_fdim.c: Likewise. * sysdeps/powerpc/powerpc64/fpu/s_fdim.c: Likewise. * sysdeps/x86_64/fpu/s_fdiml.S: Likewise. * math/libm-test.inc (fdim_test_data): Expect errno setting on overflow. Add sNaN tests.	2016-06-14 16:04:19 +00:00
Joseph Myers	f00faa4a43	Fix i386/x86_64 log2l (sNaN) (bug 20235). The i386/x86_64 versions of log2l return sNaN for sNaN input. This patch fixes them to add NaN inputs to themselves so that qNaN is returned in this case. Tested for x86_64 and x86. [BZ #20235] * sysdeps/i386/fpu/e_log2l.S (__ieee754_log2l): Add NaN input to itself. * sysdeps/x86_64/fpu/e_log2l.S (__ieee754_log2l): Likewise. * math/libm-test.inc (log2_test_data): Add sNaN tests.	2016-06-09 18:04:30 +00:00
H.J. Lu	ac187dc4ab	Always indirect branch to __libc_start_main via GOT Since __libc_start_main in libc.so is called very early, lazy binding isn't relevant. Always call __libc_start_main with indirect branch via GOT to avoid extra branch to PLT slot. In case of static executable, ld in binutils 2.26 or above can convert indirect branch into direct branch: 0000000000400a80 <_start>: 400a80: 31 ed xor %ebp,%ebp 400a82: 49 89 d1 mov %rdx,%r9 400a85: 5e pop %rsi 400a86: 48 89 e2 mov %rsp,%rdx 400a89: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp 400a8d: 50 push %rax 400a8e: 54 push %rsp 400a8f: 49 c7 c0 20 1b 40 00 mov $0x401b20,%r8 400a96: 48 c7 c1 90 1a 40 00 mov $0x401a90,%rcx 400a9d: 48 c7 c7 c0 03 40 00 mov $0x4003c0,%rdi 400aa4: 67 e8 96 09 00 00 addr32 callq 401440 <__libc_start_main> 400aaa: f4 hlt * sysdeps/x86_64/start.S (_start): Always indirect branch to __libc_start_main via GOT.	2016-06-09 04:43:31 -07:00
H.J. Lu	75437079e4	X86-64: Add dummy memcopy.h and wordcopy.c Since x86-64 no longer uses memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise.	2016-06-09 04:38:34 -07:00
Joseph Myers	8c010e2f71	Fix i386/x86_64 log1pl (sNaN) (bug 20229). The i386/x86_64 versions of log1pl return sNaN for sNaN input. This patch fixes them to add a NaN input to itself so that qNaN is returned in this case. Tested for x86_64 and x86. [BZ #20229] * sysdeps/i386/fpu/s_log1pl.S (__log1pl): Add NaN input to itself. * sysdeps/x86_64/fpu/s_log1pl.S (__log1pl): Likewise. * math/libm-test.inc (log1p_test_data): Add sNaN tests.	2016-06-08 23:11:42 +00:00
Joseph Myers	09096b3615	Fix i386/x86_64 log10l (sNaN) (bug 20228). The i386/x86_64 versions of log10l return sNaN for sNaN input. This patch fixes them to add a NaN input to itself so that qNaN is returned in this case. Tested for x86_64 and x86. [BZ #20228] * sysdeps/i386/fpu/e_log10l.S (__ieee754_log10l): Add NaN input to itself. * sysdeps/x86_64/fpu/e_log10l.S (__ieee754_log10l): Likewise. * math/libm-test.inc (log10_test_data): Add sNaN tests.	2016-06-08 22:59:18 +00:00
Joseph Myers	df179d8808	Fix i386/x86_64 logl (sNaN) (bug 20227). The i386/x86_64 versions of logl return sNaN for sNaN input. This patch fixes them to add a NaN input to itself so that qNaN is returned in this case. Tested for x86_64 and x86 (including a build for i586 to cover the non-i686 logl version). [BZ #20227] * sysdeps/i386/fpu/e_logl.S (__ieee754_logl): Add NaN input to itself. * sysdeps/i386/i686/fpu/e_logl.S (__ieee754_logl): Likewise. * sysdeps/x86_64/fpu/e_logl.S (__ieee754_logl): Likewise. * math/libm-test.inc (log_test_data): Add sNaN tests.	2016-06-08 22:24:06 +00:00
Joseph Myers	9bd3ef8e19	Fix i386/x86_64 expl, exp10l, expm1l for sNaN input (bug 20226). The i386 and x86_64 implementations of expl, exp10l and expm1l (code shared between the functions) return sNaN for sNaN input. This patch fixes them to add NaN inputs to themselves so that qNaN is returned in this case. Tested for x86_64 and x86. [BZ #20226] * sysdeps/i386/fpu/e_expl.S (IEEE754_EXPL): Add NaN argument to itself. * sysdeps/x86_64/fpu/e_expl.S (IEEE754_EXPL): Likewise. * math/libm-test.inc (exp_test_data): Add sNaN tests. (exp10_test_data): Likewise. (expm1_test_data): Likewise.	2016-06-08 21:55:06 +00:00
H.J. Lu	c867597bff	X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.	2016-06-08 13:58:08 -07:00
H.J. Lu	5e8c5bb1ac	X86-64: Remove the previous SSE2/AVX2 memsets Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.	2016-06-08 13:56:14 -07:00
H.J. Lu	3f61232ab3	Fix a typo in comments in memmove-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Fix a typo in comments.	2016-06-06 16:03:21 -07:00
Joseph Myers	5ff81530dd	Do not raise "inexact" from x86_64 SSE4.1 ceil, floor (bug 15479). Continuing fixes for ceil and floor functions not to raise the "inexact" exception, this patch fixes the x86_64 SSE4.1 versions. The roundss / roundsd instructions take an immediate operand that determines the rounding mode and whether to raise "inexact"; this just needs bit 3 set to disable "inexact", which this patch does. Remark: we don't have an SSE4.1 version of trunc / truncf (using this instruction with operand 11); I'd expect one to make sense, but of course it should be benchmarked against the existing C code. I'll file a bug in Bugzilla for the lack of such a version. Tested for x86_64. [BZ #15479] * sysdeps/x86_64/fpu/multiarch/s_ceil.S (__ceil_sse41): Set bit 3 of immediate operand to rounding instruction. * sysdeps/x86_64/fpu/multiarch/s_ceilf.S (__ceilf_sse41): Likewise. * sysdeps/x86_64/fpu/multiarch/s_floor.S (__floor_sse41): Likewise. * sysdeps/x86_64/fpu/multiarch/s_floorf.S (__floorf_sse41): Likewise.	2016-05-24 21:11:18 +00:00
H.J. Lu	6901def689	Avoid an extra branch to PLT for -z now When --enable-bind-now is used to configure glibc build, we can avoid an extra branch to the PLT entry by using indirect branch via the GOT slot instead, which is similar to the first instructuon in the PLT entry. Changes in the shared library sizes in text sections: Shared library Before (bytes) After (bytes) libm.so 1060813 1060797 libmvec.so 160881 160805 libpthread.so 94992 94984 librt.so 25064 25048 * config.h.in (BIND_NOW): New. * configure.ac (BIND_NOW): New. Defined for --enable-bind-now. * configure: Regenerated. * sysdeps/x86_64/sysdep.h (JUMPTARGET)[BIND_NOW]: Defined to indirect branch via the GOT slot.	2016-05-24 08:44:23 -07:00
H.J. Lu	eb2c88c7c8	Remove alignments on jump targets in memset X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which increases code sizes, but not necessarily improve performance. As memset benchtest data of align vs no align on various Intel and AMD processors https://sourceware.org/bugzilla/attachment.cgi?id=9277 shows that aligning jump targets isn't necessary. [BZ #20115] * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset): Remove alignments on jump targets.	2016-05-19 08:49:55 -07:00
H.J. Lu	4facca0b0e	Call init_cpu_features only if SHARED is defined In static executable, since init_cpu_features is called early from __libc_start_main, there is no need to call it again in dl_platform_init. [BZ #20072] * sysdeps/i386/dl-machine.h (dl_platform_init): Call init_cpu_features only if SHARED is defined. * sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise.	2016-05-13 08:29:33 -07:00
H.J. Lu	2a1f15b1a9	Remove x86 ifunc-defines.sym and rtld-global-offsets.sym Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym. Remove x86 ifunc-defines.sym and rtld-global-offsets.sym. No code changes on i686 and x86-64. * sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers): Remove ifunc-defines.sym. * sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers): Likewise. * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed. * sysdeps/x86/rtld-global-offsets.sym: Likewise. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise. * sysdeps/x86/Makefile (gen-as-const-headers): Remove rtld-global-offsets.sym. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ... * sysdeps/x86/cpu-features-offsets.sym: This. * sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h> instead of <ifunc-defines.h> and <rtld-global-offsets.h>.	2016-05-11 05:51:39 -07:00
H.J. Lu	a9558b49b3	Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86 Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86. No code changes on x86 and x86_64. * sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c> instead of <sysdeps/x86_64/cacheinfo.c>. * sysdeps/x86_64/cacheinfo.c: Moved to ... * sysdeps/x86/cacheinfo.c: Here.	2016-05-08 08:49:18 -07:00
Andreas Schwab	b4bcb3aec6	Register extra test objects This makes sure that the extra test objects are compiled with the correct MODULE_NAME and dependencies are tracked.	2016-04-13 17:07:13 +02:00
H.J. Lu	a057f5f8cd	X86-64: Use non-temporal store in memcpy on large data The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machine. non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold to use non temporal store which is 6 times of shared cache size. When size is above the threshold, non temporal store will be used, but avoid non-temporal store if there is overlap between destination and source since destination may be in cache when source is loaded. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. [BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold and there is no overlap between destination and source.	2016-04-12 08:10:47 -07:00
Mike Frysinger	b2d4456b33	configure: fix `test ==` usage POSIX defines the = operator, but not ==. Fix the few places where we incorrectly used ==.	2016-04-09 20:05:13 -04:00
H.J. Lu	a7d1c51482	X86-64: Prepare memmove-vec-unaligned-erms.S Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.	2016-04-06 10:19:16 -07:00
H.J. Lu	4af1bb06c5	X86-64: Prepare memset-vec-unaligned-erms.S Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled fro now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.	2016-04-06 09:10:35 -07:00
H.J. Lu	ec0cac9a1f	Force 32-bit displacement in memset-vec-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions.	2016-04-05 05:21:19 -07:00
H.J. Lu	696ac77484	Add a comment in memset-sse2-unaligned-erms.S * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA.	2016-04-05 05:19:18 -07:00
H.J. Lu	5cd7af016d	Don't put SSE2/AVX/AVX512 memmove/memset in ld.so Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.	2016-04-03 14:35:38 -07:00
H.J. Lu	ea2785e96f	Fix memmove-vec-unaligned-erms.S __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove it breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first.	2016-04-03 12:38:25 -07:00
H.J. Lu	830566307f	Add x86-64 memset with unaligned store and rep stosb Implement x86-64 memset with unaligned store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same codes when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.	2016-03-31 10:06:07 -07:00

1 2 3 4 5 ...

1037 Commits