There is transition penalty when SSE instructions are mixed with 256-bit
AVX or 512-bit AVX512 load instructions. Since _dl_runtime_resolve_avx
and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
registers, there is transition penalty when SSE instructions are used
with lazy binding on AVX and AVX512 processors.
To avoid SSE transition penalty, if only the lower 128 bits of the first
8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
with the zero upper bits.
For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
or the upper 256 bits of ZMM registers are zero. We can restore only the
non-zero portion of vector registers with AVX/AVX512 load instructions
which will zero-extend upper bits of vector registers.
This patch adds _dl_runtime_resolve_sse_vex which saves and restores
XMM registers with 128-bit AVX store/load instructions. It is used to
preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
_dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
that we store and load only the non-zero portion of vector registers.
This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
_dl_runtime_profile_avx512 when only the lower 128 bits of vector
registers are used.
_dl_runtime_resolve_avx_slow is added and used for AVX processors which
don't support XGETBV with ECX == 1. Since there is no SSE transition
penalty on AVX512 processors which don't support XGETBV with ECX == 1,
_dl_runtime_resolve_avx512_slow isn't provided.
[BZ #20495]
[BZ #20508]
* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
processors, set Use_dl_runtime_resolve_slow and set
Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
New.
(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
(index_arch_Use_dl_runtime_resolve_opt): Likewise.
(index_arch_Use_dl_runtime_resolve_slow): Likewise.
* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
if Use_dl_runtime_resolve_opt is set. Use
_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
(_dl_runtime_resolve_opt): New. Defined for AVX and AVX512.
(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
New.
(_dl_runtime_resolve_opt): Likewise.
(_dl_runtime_profile): Define only if _dl_runtime_profile is
defined.
This is only used for the float and double variants.
Instead, just add it to the type specific list of files,
and remove all stubs, and remove the declaration from
math_private.h.
I verified x86_64, i486, ia64, m68k, and ppc64 build.
When stack is re-aligned in _dl_runtime_resolve, there is no need to
adjust CFA when allocating register save area on stack.
* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve): Don't
adjust CFA when allocating register save area on re-aligned
stack.
TS 18661-1 defines an fesetexcept function for setting floating-point
exception flags without the side-effect of causing enabled traps to be
taken.
This patch series implements this function for glibc. The present
patch adds the fallback stub implementation, x86 and x86_64
implementations, documentation, tests and ABI baseline updates. The
remaining patches, some of them untested, add implementations for
other architectures. The implementations generally follow those of
the fesetexceptflag function.
As for fesetexceptflag, the approach taken for architectures where
setting flags causes enabled traps to be taken is to set the flags
(and potentially cause traps) rather than refusing to set the flags
and returning an error. Since ISO C and TS 18661 provide no way to
enable traps, this is formally in accordance with the standards.
The NEWS entry should be considered a placeholder, since this patch
series is intended to be followed by further such series adding other
TS 18661-1 features, so that the NEWS entry would end up looking more
like
* New <fenv.h> features from TS 18661-1:2014 are added to libm: the
fesetexcept, fetestexceptflag, fegetmode and fesetmode functions,
the femode_t type and the FE_DFL_MODE macro.
with hopefully more such entries for other features, rather than
having an entry for a single function in the end.
I believe we have consensus for adding TS 18661-1 interfaces as per
<https://sourceware.org/ml/libc-alpha/2016-06/msg00421.html>.
Tested for x86_64, x86, mips64 (hard float, and soft float to test the
fallback version), arm (hard float) and powerpc (hard float, soft
float and e500).
* math/fesetexcept.c: New file.
* sysdeps/i386/fpu/fesetexcept.c: Likewise.
* sysdeps/x86_64/fpu/fesetexcept.c: Likewise.
* math/fenv.h: Define
__GLIBC_INTERNAL_STARTING_HEADER_IMPLEMENTATION and include
<bits/libc-header-start.h> instead of including <features.h>.
[__GLIBC_USE (IEC_60559_BFP_EXT)] (fesetexcept): New function
declaration.
* manual/arith.texi (fesetexcept): Document function.
* math/Versions (fesetexcept): New libm symbol at version
GLIBC_2.25.
* math/Makefile (libm-support): Add fesetexcept.
(tests): Add test-fesetexcept and test-fesetexcept-traps.
* math/test-fesetexcept.c: New file.
* math/test-fesetexcept-traps.c: Likewise.
* sysdeps/nacl/libm.abilist: Update.
* sysdeps/unix/sysv/linux/aarch64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/alpha/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/arm/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/hppa/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/i386/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/ia64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/m68k/coldfire/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/m68k/m680x0/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/microblaze/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/mips/mips32/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/mips/mips64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/nios2/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm-le.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/s390/s390-32/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/s390/s390-64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/sh/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/sparc/sparc32/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/sparc/sparc64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/tile/tilegx/tilegx32/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/tile/tilegx/tilegx64/libm.abilist:
Likewise.
* sysdeps/unix/sysv/linux/tile/tilepro/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/64/libm.abilist: Likewise.
* sysdeps/unix/sysv/linux/x86_64/x32/libm.abilist: Likewise.
Don't compile do_test with -mavx, -mavx nor -mavx512 since they won't run
on non-AVX machines.
[BZ #20384]
* sysdeps/x86_64/fpu/Makefile (extra-test-objs): Add
test-double-libmvec-sincos-avx-main.o,
test-double-libmvec-sincos-avx2-main.o,
test-double-libmvec-sincos-main.o,
test-float-libmvec-sincosf-avx-main.o,
test-float-libmvec-sincosf-avx2-main.o and
test-float-libmvec-sincosf-main.o.
test-float-libmvec-sincosf-avx512-main.o.
($(objpfx)test-double-libmvec-sincos): Also link with
$(objpfx)test-double-libmvec-sincos-main.o.
($(objpfx)test-double-libmvec-sincos-avx): Also link with
$(objpfx)test-double-libmvec-sincos-avx-main.o.
($(objpfx)test-double-libmvec-sincos-avx2): Also link with
$(objpfx)test-double-libmvec-sincos-avx2-main.o.
($(objpfx)test-float-libmvec-sincosf): Also link with
$(objpfx)test-float-libmvec-sincosf-main.o.
($(objpfx)test-float-libmvec-sincosf-avx): Also link with
$(objpfx)test-float-libmvec-sincosf-avx2-main.o.
[$(config-cflags-avx512) == yes] (extra-test-objs): Add
test-double-libmvec-sincos-avx512-main.o and
($(objpfx)test-double-libmvec-sincos-avx512): Also link with
$(objpfx)test-double-libmvec-sincos-avx512-main.o.
($(objpfx)test-float-libmvec-sincosf-avx512): Also link with
$(objpfx)test-float-libmvec-sincosf-avx512-main.o.
(CFLAGS-test-double-libmvec-sincos.c): Removed.
(CFLAGS-test-float-libmvec-sincosf.c): Likewise.
(CFLAGS-test-double-libmvec-sincos-main.c): New.
(CFLAGS-test-double-libmvec-sincos-avx-main.c): Likewise.
(CFLAGS-test-double-libmvec-sincos-avx2-main.c): Likewise.
(CFLAGS-test-float-libmvec-sincosf-main.c): Likewise.
(CFLAGS-test-float-libmvec-sincosf-avx-main.c): Likewise.
(CFLAGS-test-float-libmvec-sincosf-avx2-main.c): Likewise.
(CFLAGS-test-float-libmvec-sincosf-avx512-main.c): Likewise.
(CFLAGS-test-double-libmvec-sincos-avx.c): Set to -DREQUIRE_AVX.
(CFLAGS-test-float-libmvec-sincosf-avx.c ): Likewise.
(CFLAGS-test-double-libmvec-sincos-avx2.c): Set to
-DREQUIRE_AVX2.
(CFLAGS-test-float-libmvec-sincosf-avx2.c ): Likewise.
(CFLAGS-test-double-libmvec-sincos-avx512.c): Set to
-DREQUIRE_AVX512F.
(CFLAGS-test-float-libmvec-sincosf-avx512.c): Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos.c: Rewritten.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx-main.c: New
file.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx2-main.c:
Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx512-main.c:
Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-main.c:
Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx-main.c:
Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx2-main.c:
Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx512-main.c:
Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-main.c:
Likewise.
Since _dl_tlsdesc_dynamic is called via PLT, we need to add 8 bytes for
push in the PLT entry to align the stack.
[BZ #20309]
* configure.ac (have-mtls-dialect-gnu2): Set to yes if
-mtls-dialect=gnu2 works.
* configure: Regenerated.
* elf/Makefile [have-mtls-dialect-gnu2 = yes]
(tests): Add tst-gnu2-tls1.
(modules-names): Add tst-gnu2-tls1mod.
($(objpfx)tst-gnu2-tls1): New.
(tst-gnu2-tls1mod.so-no-z-defs): Likewise.
(CFLAGS-tst-gnu2-tls1mod.c): Likewise.
* elf/tst-gnu2-tls1.c: New file.
* elf/tst-gnu2-tls1mod.c: Likewise.
* sysdeps/x86_64/dl-tlsdesc.S (_dl_tlsdesc_dynamic): Add 8
bytes for push in the PLT entry to align the stack.
If assembler doesn't support AVX512DQ, _dl_runtime_resolve_avx is used
to save the first 8 vector registers, which only saves the lower 256
bits of vector register, for lazy binding. When it is called on AVX512
platform, the upper 256 bits of ZMM registers are clobbered. Parameters
passed in ZMM registers will be wrong when the function is called the
first time. This patch requires binutils 2.24, whose assembler can store
and load ZMM registers, to build x86-64 glibc. Since mathvec library
needs assembler support for AVX512DQ, we disable mathvec if assembler
doesn't support AVX512DQ.
[BZ #20139]
* config.h.in (HAVE_AVX512_ASM_SUPPORT): Renamed to ...
(HAVE_AVX512DQ_ASM_SUPPORT): This.
* sysdeps/x86_64/configure.ac: Require assembler from binutils
2.24 or above.
(HAVE_AVX512_ASM_SUPPORT): Removed.
(HAVE_AVX512DQ_ASM_SUPPORT): New.
* sysdeps/x86_64/configure: Regenerated.
* sysdeps/x86_64/dl-trampoline.S: Make HAVE_AVX512_ASM_SUPPORT
check unconditional.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise.
* sysdeps/x86_64/multiarch/memcpy.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S:
Likewise.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
Likewise.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memset.S: Likewise.
* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Check
HAVE_AVX512DQ_ASM_SUPPORT instead of HAVE_AVX512_ASM_SUPPORT.
* sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_pow8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx51:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S:
Likewise.
current vector function declaration "#pragma omp declare simd notinbranch",
according to which vector sincos should have vector of pointers for second and
third parameters. It is fixed with implementation as wrapper to version
having second and third parameters as pointers.
[BZ #20024]
* sysdeps/x86/fpu/test-math-vector-sincos.h: New.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos2_core_sse4.S: Fixed ABI
of this implementation of vector function.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos4_core_avx2.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx512.S:
Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf4_core_sse4.S: Likewise.
* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf8_core_avx2.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos2_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos4_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos4_core_avx.S: Likewise.
* sysdeps/x86_64/fpu/svml_d_sincos8_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf16_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf4_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf8_core.S: Likewise.
* sysdeps/x86_64/fpu/svml_s_sincosf8_core_avx.S: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c: Use another wrapper
for testing vector sincos with fixed ABI.
* sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx.c: New test.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx2.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos-avx512.c: Likewise.
* sysdeps/x86_64/fpu/test-double-libmvec-sincos.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx2.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf-avx512.c: Likewise.
* sysdeps/x86_64/fpu/test-float-libmvec-sincosf.c: Likewise.
* sysdeps/x86_64/fpu/Makefile: Added new tests.
Although the Enhanced REP MOVSB/STOSB (ERMS) implementations of memmove,
memcpy, mempcpy and memset aren't used by the current processors, this
patch adds Prefer_ERMS check in memmove, memcpy, mempcpy and memset so
that they can be used in the future.
* sysdeps/x86/cpu-features.h (bit_arch_Prefer_ERMS): New.
(index_arch_Prefer_ERMS): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Return
__memcpy_erms for Prefer_ERMS.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
(__memmove_erms): Enabled for libc.a.
* ysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Return
__memmove_erms or Prefer_ERMS.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Return
__mempcpy_erms for Prefer_ERMS.
* sysdeps/x86_64/multiarch/memset.S (memset): Return
__memset_erms for Prefer_ERMS.
As discussed in
<https://sourceware.org/ml/libc-alpha/2016-05/msg00577.html>, TS
18661-1 disallows ceil, floor, round and trunc functions from raising
the "inexact" exception, in accordance with general IEEE 754 semantics
for when that exception is raised. Fixing this for x87 floating point
is more complicated than for the other versions of these functions,
because they use the frndint instruction that raises "inexact" and
this can only be avoided by saving and restoring the whole
floating-point environment.
As I noted in
<https://sourceware.org/ml/libc-alpha/2016-06/msg00128.html>, I have
now implemented a GCC option -fno-fp-int-builtin-inexact for GCC 7,
such that GCC will inline these functions on x86, without caring about
"inexact", when the default -ffp-int-builtin-inexact is in effect.
This allows users to get optimized code depending on the options they
pass to the compiler, while making the out-of-line functions follow TS
18661-1 semantics and avoid "inexact".
This patch duly fixes the out-of-line trunc function implementations
to avoid "inexact", in the same way as the nearbyint implementations.
I do not know how the performance of implementations such as these
based on saving the environment and changing the rounding mode
temporarily compares to that of the C versions or SSE 4.1 versions (of
course, for 32-bit x86 SSE implementations still need to get the
return value in an x87 register); it's entirely possible other
implementations could be faster in some cases.
Tested for x86_64 and x86.
[BZ #15479]
* sysdeps/i386/fpu/s_trunc.S (__trunc): Save and restore
floating-point environment rather than just control word.
* sysdeps/i386/fpu/s_truncf.S (__truncf): Likewise.
* sysdeps/i386/fpu/s_truncl.S (__truncl): Save and restore
floating-point environment, with "invalid" exceptions merged in,
rather than just control word.
* sysdeps/x86_64/fpu/s_truncl.S (__truncl): Likewise.
* math/libm-test.inc (trunc_test_data): Do not allow spurious
"inexact" exceptions.
As discussed in
<https://sourceware.org/ml/libc-alpha/2016-05/msg00577.html>, TS
18661-1 disallows ceil, floor, round and trunc functions from raising
the "inexact" exception, in accordance with general IEEE 754 semantics
for when that exception is raised. Fixing this for x87 floating point
is more complicated than for the other versions of these functions,
because they use the frndint instruction that raises "inexact" and
this can only be avoided by saving and restoring the whole
floating-point environment.
As I noted in
<https://sourceware.org/ml/libc-alpha/2016-06/msg00128.html>, I have
now implemented a GCC option -fno-fp-int-builtin-inexact for GCC 7,
such that GCC will inline these functions on x86, without caring about
"inexact", when the default -ffp-int-builtin-inexact is in effect.
This allows users to get optimized code depending on the options they
pass to the compiler, while making the out-of-line functions follow TS
18661-1 semantics and avoid "inexact".
This patch duly fixes the out-of-line floor function implementations
to avoid "inexact", in the same way as the nearbyint implementations.
I do not know how the performance of implementations such as these
based on saving the environment and changing the rounding mode
temporarily compares to that of the C versions or SSE 4.1 versions (of
course, for 32-bit x86 SSE implementations still need to get the
return value in an x87 register); it's entirely possible other
implementations could be faster in some cases.
Tested for x86_64 and x86.
[BZ #15479]
* sysdeps/i386/fpu/s_floor.S (__floor): Save and restore
floating-point environment rather than just control word.
* sysdeps/i386/fpu/s_floorf.S (__floorf): Likewise.
* sysdeps/i386/fpu/s_floorl.S (__floorl): Save and restore
floating-point environment, with "invalid" exceptions merged in,
rather than just control word.
* sysdeps/x86_64/fpu/s_floorl.S (__floorl): Likewise.
* math/libm-test.inc (floor_test_data): Do not allow spurious
"inexact" exceptions.
As discussed in
<https://sourceware.org/ml/libc-alpha/2016-05/msg00577.html>, TS
18661-1 disallows ceil, floor, round and trunc functions from raising
the "inexact" exception, in accordance with general IEEE 754 semantics
for when that exception is raised. Fixing this for x87 floating point
is more complicated than for the other versions of these functions,
because they use the frndint instruction that raises "inexact" and
this can only be avoided by saving and restoring the whole
floating-point environment.
As I noted in
<https://sourceware.org/ml/libc-alpha/2016-06/msg00128.html>, I have
now implemented a GCC option -fno-fp-int-builtin-inexact for GCC 7,
such that GCC will inline these functions on x86, without caring about
"inexact", when the default -ffp-int-builtin-inexact is in effect.
This allows users to get optimized code depending on the options they
pass to the compiler, while making the out-of-line functions follow TS
18661-1 semantics and avoid "inexact".
This patch duly fixes the out-of-line ceil function implementations to
avoid "inexact", in the same way as the nearbyint implementations.
I do not know how the performance of implementations such as these
based on saving the environment and changing the rounding mode
temporarily compares to that of the C versions or SSE 4.1 versions (of
course, for 32-bit x86 SSE implementations still need to get the
return value in an x87 register); it's entirely possible other
implementations could be faster in some cases.
Tested for x86_64 and x86.
[BZ #15479]
* sysdeps/i386/fpu/s_ceil.S (__ceil): Save and restore
floating-point environment rather than just control word.
* sysdeps/i386/fpu/s_ceilf.S (__ceilf): Likewise.
* sysdeps/i386/fpu/s_ceill.S (__ceill): Save and restore
floating-point environment, with "invalid" exceptions merged in,
rather than just control word.
* sysdeps/x86_64/fpu/s_ceill.S (__ceill): Likewise.
* math/libm-test.inc (ceil_test_data): Do not allow spurious
"inexact" exceptions.
The x86_64 and i386 versions of scalbl return sNaN for some cases of
sNaN input and are missing "invalid" exceptions for other cases. This
results from overly complicated code that either returns a NaN input,
or discards both inputs when one is NaN and loads a NaN from memory.
This patch fixes this by simplifying the code to add the arguments
when either one is NaN.
Tested for x86_64 and x86.
[BZ #20296]
* sysdeps/i386/fpu/e_scalbl.S (__ieee754_scalbl): Add arguments
when either argument is a NaN.
* sysdeps/x86_64/fpu/e_scalbl.S (__ieee754_scalbl): Likewise.
* math/libm-test.inc (scalb_test_data): Add sNaN tests.
The i386 implementations of nearbyint functions, and x86_64
nearbyintl, contain code to mask the "inexact" exception. However,
the fnstenv instruction has the effect of masking all exceptions, so
this masking code has been redundant since fnstenv was added to those
implementations (by commit 846d9a4a3acdb4939ca7bf6aed48f9f6f26911be;
commit 71d1b0166b added the test
math/test-nearbyint-except-2.c that verifies these functions do work
when called with "inexact" traps enabled); this patch removes the
redundant code.
Tested for x86_64 and x86.
* sysdeps/i386/fpu/s_nearbyint.S (__nearbyint): Do not mask
"inexact" exceptions after fnstenv.
* sysdeps/i386/fpu/s_nearbyintf.S (__nearbyintf): Likewise.
* sysdeps/i386/fpu/s_nearbyintl.S (__nearbyintl): Likewise.
* sysdeps/x86_64/fpu/s_nearbyintl.S (__nearbyintl): Likewise.
Some architectures have their own versions of fdim functions, which
are missing errno setting (bug 6796) and may also return sNaN instead
of qNaN for sNaN input, in the case of the x86 / x86_64 long double
versions (bug 20256).
These versions are not actually doing anything that a compiler
couldn't generate, just straightforward comparisons / arithmetic (and,
in the x86 / x86_64 case, testing for NaNs with fxam, which isn't
actually needed once you use an unordered comparison and let the NaNs
pass through the same subtraction as non-NaN inputs). This patch
removes the x86 / x86_64 / powerpc versions, so that those
architectures use the generic C versions, which correctly handle
setting errno and deal properly with sNaN inputs. This seems better
than dealing with setting errno in lots of .S versions.
The i386 versions also return results with excess range and precision,
which is not appropriate for a function exactly defined by reference
to IEEE operations. For errno setting to work correctly on overflow,
it's necessary to remove excess range with math_narrow_eval, which
this patch duly does in the float and double versions so that the
tests can reliably pass on x86. For float, this avoids any double
rounding issues as the long double precision is more than twice that
of float. For double, double rounding issues will need to be
addressed separately, so this patch does not fully fix bug 20255.
Tested for x86_64, x86 and powerpc.
[BZ #6796]
[BZ #20255]
[BZ #20256]
* math/s_fdim.c: Include <math_private.h>.
(__fdim): Use math_narrow_eval on result.
* math/s_fdimf.c: Include <math_private.h>.
(__fdimf): Use math_narrow_eval on result.
* sysdeps/i386/fpu/s_fdim.S: Remove file.
* sysdeps/i386/fpu/s_fdimf.S: Likewise.
* sysdeps/i386/fpu/s_fdiml.S: Likewise.
* sysdeps/i386/i686/fpu/s_fdim.S: Likewise.
* sysdeps/i386/i686/fpu/s_fdimf.S: Likewise.
* sysdeps/i386/i686/fpu/s_fdiml.S: Likewise.
* sysdeps/powerpc/fpu/s_fdim.c: Likewise.
* sysdeps/powerpc/fpu/s_fdimf.c: Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_fdim.c: Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_fdim.c: Likewise.
* sysdeps/x86_64/fpu/s_fdiml.S: Likewise.
* math/libm-test.inc (fdim_test_data): Expect errno setting on
overflow. Add sNaN tests.
The i386/x86_64 versions of log2l return sNaN for sNaN input. This
patch fixes them to add NaN inputs to themselves so that qNaN is
returned in this case.
Tested for x86_64 and x86.
[BZ #20235]
* sysdeps/i386/fpu/e_log2l.S (__ieee754_log2l): Add NaN input to
itself.
* sysdeps/x86_64/fpu/e_log2l.S (__ieee754_log2l): Likewise.
* math/libm-test.inc (log2_test_data): Add sNaN tests.
Since x86-64 no longer uses memory copy functions, add dummy memcopy.h
and wordcopy.c to reduce code size. It reduces the size of libc.so by
about 1 KB.
* sysdeps/x86_64/memcopy.h: New file.
* sysdeps/x86_64/wordcopy.c: Likewise.
The i386/x86_64 versions of log1pl return sNaN for sNaN input. This
patch fixes them to add a NaN input to itself so that qNaN is returned
in this case.
Tested for x86_64 and x86.
[BZ #20229]
* sysdeps/i386/fpu/s_log1pl.S (__log1pl): Add NaN input to itself.
* sysdeps/x86_64/fpu/s_log1pl.S (__log1pl): Likewise.
* math/libm-test.inc (log1p_test_data): Add sNaN tests.
The i386/x86_64 versions of log10l return sNaN for sNaN input. This
patch fixes them to add a NaN input to itself so that qNaN is returned
in this case.
Tested for x86_64 and x86.
[BZ #20228]
* sysdeps/i386/fpu/e_log10l.S (__ieee754_log10l): Add NaN input to
itself.
* sysdeps/x86_64/fpu/e_log10l.S (__ieee754_log10l): Likewise.
* math/libm-test.inc (log10_test_data): Add sNaN tests.
The i386/x86_64 versions of logl return sNaN for sNaN input. This
patch fixes them to add a NaN input to itself so that qNaN is returned
in this case.
Tested for x86_64 and x86 (including a build for i586 to cover the
non-i686 logl version).
[BZ #20227]
* sysdeps/i386/fpu/e_logl.S (__ieee754_logl): Add NaN input to
itself.
* sysdeps/i386/i686/fpu/e_logl.S (__ieee754_logl): Likewise.
* sysdeps/x86_64/fpu/e_logl.S (__ieee754_logl): Likewise.
* math/libm-test.inc (log_test_data): Add sNaN tests.
The i386 and x86_64 implementations of expl, exp10l and expm1l (code
shared between the functions) return sNaN for sNaN input. This patch
fixes them to add NaN inputs to themselves so that qNaN is returned in
this case.
Tested for x86_64 and x86.
[BZ #20226]
* sysdeps/i386/fpu/e_expl.S (IEEE754_EXPL): Add NaN argument to
itself.
* sysdeps/x86_64/fpu/e_expl.S (IEEE754_EXPL): Likewise.
* math/libm-test.inc (exp_test_data): Add sNaN tests.
(exp10_test_data): Likewise.
(expm1_test_data): Likewise.
Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
the new ones.
No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
memcpy/memmove optimized with Enhanced REP MOVSB will be used for
processors with ERMS. The new AVX512 memcpy/memmove will be used for
processors with AVX512 which prefer vzeroupper.
Since the new SSE2 memcpy/memmove are faster than the previous default
memcpy/memmove used in libc.a and ld.so, we also remove the previous
default memcpy/memmove and make them the default memcpy/memmove, except
that non-temporal store isn't used in ld.so.
Together, it reduces the size of libc.so by about 6 KB and the size of
ld.so by about 2 KB.
[BZ #19776]
* sysdeps/x86_64/memcpy.S: Make it dummy.
* sysdeps/x86_64/mempcpy.S: Likewise.
* sysdeps/x86_64/memmove.S: New file.
* sysdeps/x86_64/memmove_chk.S: Likewise.
* sysdeps/x86_64/multiarch/memmove.S: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
* sysdeps/x86_64/memmove.c: Removed.
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memmove.c: Likewise.
* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
memcpy-sse2-unaligned, memmove-avx-unaligned,
memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Replace
__memmove_chk_avx512_unaligned_2 with
__memmove_chk_avx512_unaligned. Remove
__memmove_chk_avx_unaligned_2. Replace
__memmove_chk_sse2_unaligned_2 with
__memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and
__memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2
with __memmove_avx512_unaligned. Replace
__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2
with __memcpy_chk_avx512_unaligned. Remove
__memcpy_chk_avx_unaligned_2. Replace
__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2.
Replace __memcpy_avx512_unaligned_2 with
__memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2
and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2
with __mempcpy_chk_avx512_unaligned. Remove
__mempcpy_chk_avx_unaligned_2. Replace
__mempcpy_chk_sse2_unaligned_2 with
__mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2.
Replace __mempcpy_avx512_unaligned_2 with
__mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2.
Replace __mempcpy_sse2_unaligned_2 with
__mempcpy_sse2_unaligned. Remove __mempcpy_sse2.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
if processor has ERMS. Default to __memcpy_sse2_unaligned.
(ENTRY): Removed.
(END): Likewise.
(ENTRY_CHK): Likewise.
(libc_hidden_builtin_def): Likewise.
Don't include ../memcpy.S.
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
__memcpy_chk_avx512_unaligned_erms and
__memcpy_chk_avx512_unaligned. Use
__memcpy_chk_avx_unaligned_erms and
__memcpy_chk_sse2_unaligned_erms if if processor has ERMS.
Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
Change function suffix from unaligned_2 to unaligned.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
if processor has ERMS. Default to __mempcpy_sse2_unaligned.
(ENTRY): Removed.
(END): Likewise.
(ENTRY_CHK): Likewise.
(libc_hidden_builtin_def): Likewise.
Don't include ../mempcpy.S.
(mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
__mempcpy_chk_avx512_unaligned_erms and
__mempcpy_chk_avx512_unaligned. Use
__mempcpy_chk_avx_unaligned_erms and
__mempcpy_chk_sse2_unaligned_erms if if processor has ERMS.
Default to __mempcpy_chk_sse2_unaligned.
Since the new SSE2/AVX2 memsets are faster than the previous ones, we
can remove the previous SSE2/AVX2 memsets and replace them with the
new ones. This reduces the size of libc.so by about 900 bytes.
No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
optimized with Enhanced REP STOSB will be used for processors with
ERMS. The new AVX512 memset will be used for processors with AVX512
which prefer vzeroupper.
[BZ #19881]
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
into ...
* sysdeps/x86_64/memset.S: This.
(__bzero): Removed.
(__memset_tail): Likewise.
(__memset_chk): Likewise.
(memset): Likewise.
(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
defined.
(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
(__memset_zero_constant_len_parameter): Check SHARED instead of
PIC.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
memset-avx2 and memset-sse2-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
(__bzero): Enabled.
* sysdeps/x86_64/multiarch/memset.S (memset): Replace
__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms
or __memset_avx2_unaligned_erms if processor has ERMS. Support
__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
(memset): Removed.
(__memset_chk): Likewise.
(MEMSET_SYMBOL): New.
(libc_hidden_builtin_def): Replace __memset_sse2 with
__memset_sse2_unaligned.
* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
__memset_chk_sse2 and __memset_chk_avx2 with
__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
Use __memset_chk_sse2_unaligned_erms or
__memset_chk_avx2_unaligned_erms if processor has ERMS. Support
__memset_chk_avx512_unaligned_erms and
__memset_chk_avx512_unaligned.
Continuing fixes for ceil and floor functions not to raise the
"inexact" exception, this patch fixes the x86_64 SSE4.1 versions. The
roundss / roundsd instructions take an immediate operand that
determines the rounding mode and whether to raise "inexact"; this just
needs bit 3 set to disable "inexact", which this patch does.
Remark: we don't have an SSE4.1 version of trunc / truncf (using this
instruction with operand 11); I'd expect one to make sense, but of
course it should be benchmarked against the existing C code. I'll
file a bug in Bugzilla for the lack of such a version.
Tested for x86_64.
[BZ #15479]
* sysdeps/x86_64/fpu/multiarch/s_ceil.S (__ceil_sse41): Set bit 3
of immediate operand to rounding instruction.
* sysdeps/x86_64/fpu/multiarch/s_ceilf.S (__ceilf_sse41):
Likewise.
* sysdeps/x86_64/fpu/multiarch/s_floor.S (__floor_sse41):
Likewise.
* sysdeps/x86_64/fpu/multiarch/s_floorf.S (__floorf_sse41):
Likewise.
When --enable-bind-now is used to configure glibc build, we can avoid
an extra branch to the PLT entry by using indirect branch via the GOT
slot instead, which is similar to the first instructuon in the PLT
entry. Changes in the shared library sizes in text sections:
Shared library Before (bytes) After (bytes)
libm.so 1060813 1060797
libmvec.so 160881 160805
libpthread.so 94992 94984
librt.so 25064 25048
* config.h.in (BIND_NOW): New.
* configure.ac (BIND_NOW): New. Defined for --enable-bind-now.
* configure: Regenerated.
* sysdeps/x86_64/sysdep.h (JUMPTARGET)[BIND_NOW]: Defined to
indirect branch via the GOT slot.
X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which
increases code sizes, but not necessarily improve performance. As
memset benchtest data of align vs no align on various Intel and AMD
processors
https://sourceware.org/bugzilla/attachment.cgi?id=9277
shows that aligning jump targets isn't necessary.
[BZ #20115]
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset):
Remove alignments on jump targets.
In static executable, since init_cpu_features is called early from
__libc_start_main, there is no need to call it again in dl_platform_init.
[BZ #20072]
* sysdeps/i386/dl-machine.h (dl_platform_init): Call
init_cpu_features only if SHARED is defined.
* sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise.
Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym. Remove
x86 ifunc-defines.sym and rtld-global-offsets.sym. No code changes on
i686 and x86-64.
* sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers):
Remove ifunc-defines.sym.
* sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers):
Likewise.
* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed.
* sysdeps/x86/rtld-global-offsets.sym: Likewise.
* sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise.
* sysdeps/x86/Makefile (gen-as-const-headers): Remove
rtld-global-offsets.sym.
* sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ...
* sysdeps/x86/cpu-features-offsets.sym: This.
* sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h>
instead of <ifunc-defines.h> and <rtld-global-offsets.h>.
Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86. No code changes on x86
and x86_64.
* sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c>
instead of <sysdeps/x86_64/cacheinfo.c>.
* sysdeps/x86_64/cacheinfo.c: Moved to ...
* sysdeps/x86/cacheinfo.c: Here.
The large memcpy micro benchmark in glibc shows that there is a
regression with large data on Haswell machine. non-temporal store in
memcpy on large data can improve performance significantly. This
patch adds a threshold to use non temporal store which is 6 times of
shared cache size. When size is above the threshold, non temporal
store will be used, but avoid non-temporal store if there is overlap
between destination and source since destination may be in cache when
source is loaded.
For size below 8 vector register width, we load all data into registers
and store them together. Only forward and backward loops, which move 4
vector registers at a time, are used to support overlapping addresses.
For forward loop, we load the last 4 vector register width of data and
the first vector register width of data into vector registers before the
loop and store them after the loop. For backward loop, we load the first
4 vector register width of data and the last vector register width of
data into vector registers before the loop and store them after the loop.
[BZ #19928]
* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
New.
(init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6
times of shared cache size.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
(VMOVNT): New.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
(VMOVNT): Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
(VMOVNT): Likewise.
(VMOVU): Changed to movups for smaller code sizes.
(VMOVA): Changed to movaps for smaller code sizes.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
comments.
(PREFETCH): New.
(PREFETCH_SIZE): Likewise.
(PREFETCHED_LOAD_SIZE): Likewise.
(PREFETCH_ONE_SET): Likewise.
Rewrite to use forward and backward loops, which move 4 vector
registers at a time, to support overlapping addresses and use
non temporal store if size is above the threshold and there is
no overlap between destination and source.
Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
default memcpy, mempcpy and memmove.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
(MEMCPY_SYMBOL): New.
(MEMPCPY_SYMBOL): Likewise.
(MEMMOVE_CHK_SYMBOL): Likewise.
Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
__mempcpy symbols. Provide alias for __memcpy_chk in libc.a.
Provide alias for memcpy in libc.a and ld.so.
Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
default memset.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
(MEMSET_CHK_SYMBOL): New. Define if not defined.
(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
Disabled fro now.
Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
symbols. Properly check USE_MULTIARCH on __memset symbols.
Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
and AVX512 memmove and memset in ld.so.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
if not in libc.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
Likewise.
__mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
and __memmove it breaks __memmove_chk.
Don't check source == destination first since it is less common.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
with unaligned_erms.
(__memmove_erms): Skip if source == destination.
(__memmove_unaligned_erms): Don't check source == destination
first.
Implement x86-64 memset with unaligned store and rep movsb. Support
16-byte, 32-byte and 64-byte vector register sizes. A single file
provides 2 implementations of memset, one with rep stosb and the other
without rep stosb. They share the same codes when size is between 2
times of vector register size and REP_STOSB_THRESHOLD which defaults
to 2KB.
Key features:
1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector
register size at a time.
[BZ #19881]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
memset-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
__memset_sse2_unaligned_erms, __memset_erms,
__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
file.
* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
Likewise.
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
Likewise.
Implement x86-64 memmove with unaligned load/store and rep movsb.
Support 16-byte, 32-byte and 64-byte vector register sizes. When
size <= 8 times of vector register size, there is no check for
address overlap bewteen source and destination. Since overhead for
overlap check is small when size > 8 times of vector register size,
memcpy is an alias of memmove.
A single file provides 2 implementations of memmove, one with rep movsb
and the other without rep movsb. They share the same codes when size is
between 2 times of vector register size and REP_MOVSB_THRESHOLD which
is 2KB for 16-byte vector register size and scaled up by large vector
register size.
Key features:
1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load all sources into
registers and store them together.
3. If there is no address overlap bewteen source and destination, copy
from both ends with 4 times of vector register size at a time.
4. If address of destination > address of source, backward copy 8 times
of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. Use rep movsb only for forward copy. Avoid slow backward rep movsb
by fallbacking to backward copy 8 times of vector register size at a
time.
7. Skip when address of destination == address of source.
[BZ #19776]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Test
__memmove_chk_avx512_unaligned_2,
__memmove_chk_avx512_unaligned_erms,
__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
__memmove_chk_sse2_unaligned_2,
__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
__memmove_avx512_unaligned_erms, __memmove_erms,
__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
__memcpy_chk_avx512_unaligned_2,
__memcpy_chk_avx512_unaligned_erms,
__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
__mempcpy_chk_avx512_unaligned_erms,
__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
__mempcpy_erms.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
Likwise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
Likwise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
Likwise.
On AMD processors, memcpy optimized with unaligned SSE load is
slower than emcpy optimized with aligned SSSE3 while other string
functions are faster with unaligned SSE load. A feature bit,
Fast_Unaligned_Copy, is added to select memcpy optimized with
unaligned SSE load.
[BZ #19583]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set
Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
processors. Set Fast_Copy_Backward for AMD Excavator
processors.
* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
New.
(index_arch_Fast_Unaligned_Copy): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
Bug 19848 reports cases where powl on x86 / x86_64 has error
accumulation, for small integer exponents, larger than permitted by
glibc's accuracy goals, at least in some rounding modes. This patch
further restricts the exponent range for which the
small-integer-exponent logic is used to limit the possible error
accumulation.
Tested for x86_64 and x86 and ulps updated accordingly.
[BZ #19848]
* sysdeps/i386/fpu/e_powl.S (p3): Rename to p2 and change value
from 8 to 4.
(__ieee754_powl): Compare integer exponent against 4 not 8.
* sysdeps/x86_64/fpu/e_powl.S (p3): Rename to p2 and change value
from 8 to 4.
(__ieee754_powl): Compare integer exponent against 4 not 8.
* math/auto-libm-test-in: Add more tests of pow.
* math/auto-libm-test-out: Regenerated.
* sysdeps/i386/i686/fpu/multiarch/libm-test-ulps: Update.
* sysdeps/x86_64/fpu/libm-test-ulps: Likewise.