This patch provides optimized versions of memccpy with the z13 vector
instructions.
ChangeLog:
* sysdeps/s390/multiarch/memccpy-c.c: New File.
* sysdeps/s390/multiarch/memccpy-vx.S: Likewise.
* sysdeps/s390/multiarch/memccpy.c: Likewise.
* sysdeps/s390/multiarch/Makefile
(sysdep_routines): Add memccpy functions.
* sysdeps/s390/multiarch/ifunc-impl-list-common.c
(__libc_ifunc_impl_list_common): Add ifunc test for memccpy.
* string/memccpy.c: Use MEMCCPY if defined.
This patch provides optimized versions of strcmp and wcscmp with the z13
vector instructions.
The architecture specific string.h had a typo, which leads to ommiting the
inline version in this file if __USE_STRING_INLINES is defined.
Tested this inline version by tweaking test-strcmp.c.
ChangeLog:
* sysdeps/s390/multiarch/strcmp-vx.S: New File.
* sysdeps/s390/multiarch/strcmp.c: Likewise.
* sysdeps/s390/multiarch/wcscmp-c.c: Likewise.
* sysdeps/s390/multiarch/wcscmp-vx.S: Likewise.
* sysdeps/s390/multiarch/wcscmp.c: Likewise.
* sysdeps/s390/s390-32/multiarch/strcmp.c: Likewise.
* sysdeps/s390/s390-64/multiarch/strcmp.c: Likewise.
* sysdeps/s390/multiarch/Makefile (sysdep_routines): Add strcmp and
wcscmp functions.
* sysdeps/s390/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add ifunc test for strcmp, wcscmp.
* string/strcmp.c (STRCMP): Define and use macro.
* benchtests/bench-wcscmp.c: New File.
* benchtests/Makefile (wcsmbs-bench): Add wcscmp.
* sysdeps/s390/bits/string.h: Fix typo: _HAVE_STRING_ARCH_strcmp
instead of _HAVE_STRING_ARCH_memchr.
This patch provides optimized versions of strlen and wcslen with the z13 vector
instructions.
The helper macro IFUNC_VX_IMPL is introduced and is used to register all
__<func>_c() and __<func>_vx() functions within __libc_ifunc_impl_list()
to the ifunc test framework.
ChangeLog:
* sysdeps/s390/multiarch/Makefile: New File.
* sysdeps/s390/multiarch/strlen-c.c: Likewise.
* sysdeps/s390/multiarch/strlen-vx.S: Likewise.
* sysdeps/s390/multiarch/strlen.c: Likewise.
* sysdeps/s390/multiarch/wcslen-c.c: Likewise.
* sysdeps/s390/multiarch/wcslen-vx.S: Likewise.
* sysdeps/s390/multiarch/wcslen.c: Likewise.
* string/strlen.c (STRLEN): Define and use macro.
* sysdeps/s390/multiarch/ifunc-impl-list.c
(IFUNC_VX_IMPL): New macro function.
(__libc_ifunc_impl_list): Add ifunc test for strlen, wcslen.
* benchtests/Makefile (wcsmbs-bench): New variable.
(string-bench-all): Added wcsmbs-bench.
* benchtests/bench-wcslen.c: New File.
This patch introduces a s390 specific ifunc resolver macro for 32/64bit,
which chooses <func>_vx with vector instructions if HWCAP_S390_VX flag
in hwcaps is set or <func>_c if not.
ChangeLog:
* sysdeps/s390/multiarch/ifunc-resolve.h (s390_vx_libc_ifunc,
s390_vx_libc_ifunc2): New macro function.
The S390 specific test checks if the assembler has support for the new z13
vector instructions by compiling a vector instruction. The .machine and
.machinemode directives are needed to compile the vector instruction without
-march=z13 option on 31/64 bit.
On success the macro HAVE_S390_VX_ASM_SUPPORT is defined. This macro is used
to determine if the optimized functions can be build without compile errors.
If the used assembler lacks vector support, then a warning is dumped while
configuring and only the common code functions are build.
The z13 instruction support was introduced in
"[Committed] S/390: Add support for IBM z13."
(https://sourceware.org/ml/binutils/2015-01/msg00197.html)
ChangeLog:
* config.h.in (HAVE_S390_VX_ASM_SUPPORT): New macro undefine.
* sysdeps/s390/configure.ac: Add test for S390 vector instruction
assembler support.
* sysdeps/s390/configure: Regenerated.
The new IBM z13 is added to platform string array.
The macro _DL_PLATFORMS_COUNT is incremented to 8,
because it was not incremented by commit
"S/390: Sync AUXV capabilities and archs with kernel".
ChangeLog:
* sysdeps/s390/dl-procinfo.c (_dl_s390_cap_flags): Add z13.
* sysdeps/s390/dl-procinfo.h (_DL_PLATFORMS_COUNT): Increased.
The HWCAP_S390_VX flag in hwcap field of auxiliary vector indicates
if the vector facility is available and the kernel is aware of it.
This can be tested with LD_SHOW_AUXV=1 <prog>.
Currently it does not show te, because it was not incremented
by commit "S/390: Add hwcap value for transactional execution.".
Thus _DL_HWCAP_COUNT is incremented by two.
ChangeLog:
* sysdeps/s390/dl-procinfo.c (_dl_s390_platforms): Add vector flag.
* sysdeps/s390/dl-procinfo.h: Add vector capability.
* sysdeps/unix/sysv/linux/s390/bits/hwcap.h (HWCAP_S390_VX): Define.
On s390 all ifunc resolvers were implemented in multiarch/ifunc-resolve.c.
The resulting single object files has undefined references to all ifunc-functions.
This patch introduces one multiarch/<func>.c file for each of memcpy, memcmp
and memset with the function specific ifunc resolver. The different function
implementations are now implemented in multiarch/<func>-s390x.S
(moved from multiarch/<func>.S).
The new multiarch/ifunc-resolve.h file contains the ifunc-resolver macro
and other helper-macros. They are merged and are now used in common for
32/64bit. Therefore the __<func>_g5/__<func>_z900 functions were renamed to
__<func>_default.
This patch also enables testing the ifunc implementations by implementing
the function __libc_ifunc_impl_list. It uses the helper-macros of ifunc-resolve.h.
ChangeLog:
* sysdeps/s390/s390-32/multiarch/Makefile (sysdep_routines):
Remove ifunc-resolve, add memset-s390, memcpy-s390, memcmp-s390.
* sysdeps/s390/s390-32/multiarch/ifunc-resolve.c: Delete File.
* sysdeps/s390/s390-32/multiarch/memcmp.S: Move to ...
* sysdeps/s390/s390-32/multiarch/memcmp-s390.S: ... here.
(memcmp, bcmp): Use __memcmp_default as alias source.
* sysdeps/s390/s390-32/multiarch/memcmp.c: New File.
* sysdeps/s390/s390-32/memcmp.S (__memcmp_g5):
Rename to __memcmp_default.
* sysdeps/s390/s390-32/multiarch/memcpy.S: Move to ...
* sysdeps/s390/s390-32/multiarch/memcpy-s390.S: ... here.
(memcpy): Use __memcpy_default as alias source.
* sysdeps/s390/s390-32/multiarch/memcpy.c: New File.
* sysdeps/s390/s390-32/memcpy.S (__memcpy_g5):
Rename to __memcpy_default.
* sysdeps/s390/s390-32/multiarch/memset.S: Move to ...
* sysdeps/s390/s390-32/multiarch/memset-s390.S: ... here.
(memset): Use __memset_default as alias source.
* sysdeps/s390/s390-32/multiarch/memset.c: New File.
* sysdeps/s390/s390-32/memset.S (__memset_g5):
Rename to __memset_default.
* sysdeps/s390/s390-64/multiarch/Makefile (sysdep_routines):
Remove ifunc-resolve, add memset-s390x, memcpy-s390x, memcmp-s390x.
* sysdeps/s390/s390-64/multiarch/ifunc-resolve.c: Delete File.
* sysdeps/s390/s390-64/multiarch/memcmp.S: Move to ...
* sysdeps/s390/s390-64/multiarch/memcmp-s390x.S: ... here.
(memcmp, bcmp): Use __memcmp_default as alias source.
* sysdeps/s390/s390-64/multiarch/memcmp.c: New File.
* sysdeps/s390/s390-64/memcmp.S (__memcmp_z900):
Rename to __memcmp_default.
* sysdeps/s390/s390-64/multiarch/memcpy.S: Move to ...
* sysdeps/s390/s390-64/multiarch/memcpy-s390x.S: ... here.
(memcpy): Use __memcpy_default as alias source.
* sysdeps/s390/s390-64/multiarch/memcpy.c: New File.
* sysdeps/s390/s390-64/memcpy.S (__memcpy_z900):
Rename to __memcpy_default.
* sysdeps/s390/s390-64/multiarch/memset.S: Move to ...
* sysdeps/s390/s390-64/multiarch/memset-s390x.S: ... here.
(memset): Use __memset_default as alias source.
* sysdeps/s390/s390-64/multiarch/memset.c: New File.
* sysdeps/s390/s390-64/memset.S (__memset_z900):
Rename to __memset_default.
* sysdeps/s390/multiarch/ifunc-resolve.h: New File.
* sysdeps/s390/multiarch/ifunc-impl-list.c: New File.
On s390, the DXC(data-exception-code)-byte in FPC(floating-point-control)-
register contains a code of the last occured exception.
If bits 6 and 7 of DXC-byte are zero, the bits 0-5 correspond to the
ieee-exception flag bits.
The current implementation always uses these bits as ieee-exception flag bits.
fetestexcept() reports any exception after the first usage of a
vector-instruction in a process, because it raises an "vector instruction
exception" with DXC-code 0xFE.
This patch fixes the handling of the DXC-byte. The DXC-Byte is only handled
if bits 6 and 7 are zero.
The #define _FPU_RESERVED is extended by the DXC-Byte.
Otherwise the tests math/test-fpucw-static and math/test-fpucw-ieee-static
fails, because DXC-Byte contains the vector instruction exception when reaching
main(). This exception was triggered by strrchr() call in __init_misc().
__init_misc() is called after __setfpucw () in __libc_init_first().
The field __ieee_instruction_pointer in struct fenv_t is renamed to __unused
because it is a relict from commit "Remove PTRACE_PEEKUSER"
(87b9b50f0d) and isn´t used anymore.
ChangeLog:
[BZ #18610]
* sysdeps/s390/fpu/bits/fenv.h (fenv_t): Rename
__ieee_instruction_pointer to __unused.
* sysdeps/s390/fpu/fesetenv.c (__fesetenv): Remove usage of
__ieee_instruction_pointer.
* sysdeps/s390/fpu/fclrexcpt.c (feclearexcept): Fix dxc-field handling.
* sysdeps/s390/fpu/fgetexcptflg.c (fegetexceptflag): Likewise.
* sysdeps/s390/fpu/fsetexcptflg.c (fesetexceptflag): Likewise.
* sysdeps/s390/fpu/ftestexcept.c (fetestexcept): Likewise.
* sysdeps/s390/fpu/fpu_control.h (_FPU_RESERVED):
Mark dxc-field as reserved.
Since ld.so preserves vector registers now, we can use the same SSE2
optimized strcmp in x86-64 libc and ld.so.
* sysdeps/x86_64/strcmp.S: Remove "#if !IS_IN (libc)".
Since _dl_x86_64_save_sse and _dl_x86_64_restore_sse are removed now,
we don't need to run tst-getpid2 with LD_BIND_NOW=1.
[BZ #11214]
* sysdeps/unix/sysv/linux/Makefile (tst-getpid2-ENV): Removed.
Explicit system calls for the socket operations were added in Linux kernel
in commit 86250b9d12ca for powerpc. This patch make use of those instead of
calling socketcall to save number of cycles on networking syscalls.
2015-08-25 Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
* sysdeps/unix/sysv/linux/powerpc/kernel-features.h: Define new macros.
* sysdeps/unix/sysv/linux/accept.c: Call direct system call.
* sysdeps/unix/sysv/linux/bind.c: Call direct system call.
* sysdeps/unix/sysv/linux/connect.c: Call direct system call.
* sysdeps/unix/sysv/linux/getpeername.c: Call direct system call.
* sysdeps/unix/sysv/linux/getsockname.c: Call direct system call.
* sysdeps/unix/sysv/linux/getsockopt.c: Call direct system call.
* sysdeps/unix/sysv/linux/listen.c: Call direct system call.
* sysdeps/unix/sysv/linux/recv.c: Call direct system call.
* sysdeps/unix/sysv/linux/recvfrom.c: Call direct system call.
* sysdeps/unix/sysv/linux/recvmsg.c: Call direct system call.
* sysdeps/unix/sysv/linux/send.c: Call direct system call.
* sysdeps/unix/sysv/linux/sendmsg.c: Call direct system call.
* sysdeps/unix/sysv/linux/sendto.c: Call direct system call.
* sysdeps/unix/sysv/linux/setsockopt.c: Call direct system call.
* sysdeps/unix/sysv/linux/shutdown.c: Call direct system call.
* sysdeps/unix/sysv/linux/socket.c: Call direct system call.
* sysdeps/unix/sysv/linux/socketpair.c: Call direct system call.
Fix usage of tabort in generated syscalls. r0 has special meaning
when used with this instruction, thus it will not generate
persistent errors, nor return an error code. This mitigates poor
CPU usage when performing elided critical sections.
Additionally, transactions should be aborted when entering a user
invoked syscall. Otherwise the results of the transaction may be
undefined.
2015-08-25 Paul E. Murphy <murphyp@linux.vnet.ibm.com>
* sysdeps/powerpc/powerpc32/sysdep.h (ABORT_TRANSACTION): Use
register other than r0 for tabort, it has special meaning.
* sysdeps/powerpc/powerpc64/sysdep.h (ABORT_TRANSACTION): Likewise
* sysdeps/unix.sysv/linux/powerpc/syscall.S (syscall): Abort
transaction before starting syscall.
Instead of checking needle length, constant 'n' number of comparisons
is checked to fall back to default implementation. This patch is tested
on powerpc64 and powerpc64le.
2015-08-25 Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
* sysdeps/powerpc/powerpc64/power7/strstr.S: Handle worst case.
Since ld.so preserves vector registers now, we can use %xmm[0-4] to
avoid the REX prefix.
* sysdeps/x86_64/strlen.S: Replace %xmm[8-12] with %xmm[0-4].
Hi,
As in bugzilla entry there is overflow in hsearch when looking for prime
number as SIZE_MAX - 1 is divisible by 5. We fix that by rejecting large
inputs before looking for prime.
* misc/hsearch_r.c (__hcreate_r): Handle overflow.
Hi, as I wrote in previous patches a performance of checked strcpy and
stpcpy is terrible as these don't use sse2 and are around four times
slower that strcpy and stpcpy now.
As this bug shows that these functions are not performance sensitive I
decided just to improve generic implementation instead for easier
maintainance.
* debug/strcpy_chk.c: Improve performance.
* debug/stpcpy_chk.c: Likewise.
* sysdeps/x86_64/strcpy_chk.S: Remove.
* sysdeps/x86_64/stpcpy_chk.S: Remove.
This patch adds SSE, AVX and AVX512 versions of _dl_runtime_resolve
and _dl_runtime_profile, which save and restore the first 8 vector
registers used for parameter passing. elf_machine_runtime_setup
selects the proper _dl_runtime_resolve or _dl_runtime_profile based
on _dl_x86_cpu_features. It avoids race condition caused by
FOREIGN_CALL macros, which are only used for x86-64.
Performance impact of saving and restoring 8 vector registers are
negligible on Nehalem, Sandy Bridge, Ivy Bridge and Haswell when
ld.so is optimized with SSE2.
[BZ #15128]
* sysdeps/x86_64/Makefile [$(subdir) == elf] (tests): Add
ifuncmain8.
(modules-names): Add ifuncmod8.
($(objpfx)ifuncmain8): New rule.
* sysdeps/x86_64/dl-machine.h: Include <dl-procinfo.h> and
<cpuid.h>.
(elf_machine_runtime_setup): Use _dl_runtime_resolve_sse,
_dl_runtime_resolve_avx, or _dl_runtime_resolve_avx512,
_dl_runtime_profile_sse, _dl_runtime_profile_avx, or
_dl_runtime_profile_avx512, based on HAS_ARCH_FEATURE.
* sysdeps/x86_64/dl-trampoline.S: Rewrite.
* sysdeps/x86_64/dl-trampoline.h: Likewise.
* sysdeps/x86_64/ifuncmain8.c: New file.
* sysdeps/x86_64/ifuncmod8.c: Likewise.
* sysdeps/x86_64/nptl/tcb-offsets.sym (RTLD_SAVESPACE_SSE):
Removed.
* sysdeps/x86_64/nptl/tls.h (__128bits): Removed.
(tcbhead_t): Change rtld_must_xmm_save to __glibc_unused1.
Change rtld_savespace_sse to __glibc_unused2.
(RTLD_CHECK_FOREIGN_CALL): Removed.
(RTLD_ENABLE_FOREIGN_CALL): Likewise.
(RTLD_PREPARE_FOREIGN_CALL): Likewise.
(RTLD_FINALIZE_FOREIGN_CALL): Likewise.
If allocation on a non-main arena fails, the main arena is used
without checking to see if it is corrupt. Add a check that avoids the
main arena if it is corrupt.
* malloc/arena.c (arena_get_retry): Don't use main_arena if it is
corrupt.
The arena pointer in the first argument to arena_get2 was used in the
old days before per-thread arenas. They're unused now and hence can
be dropped.
ChangeLog:
* malloc/arena.c (arena_get2): Drop unused argument.
(arena_lock): Adjust.
(arena_get_retry): Likewise.
PowerPC has always used __IPC_64 like most other architectures, which
means that __ASSUME_IPC64 can be always true. Also, all other
architecture implementations that use the ipc syscall are effectively
identical to the generic version and can be removed.
Do not try to generate the manual when perl is unavailable. This
matches the behavior when makeinfo is unavailable. Otherwise the
install step fails when trying to generate the libm section since
it runs a perl script.
In powerpc64, memchr was always pointing to the internal __GI_memchr
implementation. This patch fixes that and makes it use the
optimized POWER7 version when adequate.
* sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c: Make
memchr not point to the internal __GI_memchr implementation.