This patch makes the POWER7 optimized strcspn generic by using
default doubleword stores to zero the hash, instead of VSX
instructions. Performance on POWER7/POWER8 does not change.
This patch makes the POWER7 optimized strspn generic by using
default doubleword stores to zero the hash, instead of VSX
instructions. Performance on POWER7/POWER8 machines does not changed.
This patch optimizes strtok and strtok_r for POWERPC64.
A table of 256 characters is created and marked based on
the 'accept' argument and used to check for any occurance on
the input string.Loop unrolling is also used to gain improvements.
On powerpc, atomic_exchange_and_add is implemented without any
barriers. This patchs adds the missing instruction and memory barrier
for acquire and release semanthics.
This sets __HAVE_64B_ATOMICS if provided. It also sets
USE_ATOMIC_COMPILER_BUILTINS to true if the existing atomic ops use the
__atomic* builtins (aarch64, mips partially) or if this has been
tested (x86_64); otherwise, this is set to false so that C11 atomics will
be based on the existing atomic operations.
This patch fixes the build of C mempcpy and stpcpy by disabling the
redirection to __mempcpy and __stpcpy asm names if
NO_MEMPCPY_STPCPY_REDIRECT is defined, and defining that macro in the
relevant source files.
Tested for powerpc32 that the build is fixed.
* include/string.h [NO_MEMPCPY_STPCPY_REDIRECT] (mempcpy): Do not
redeclare with asm name.
[NO_MEMPCPY_STPCPY_REDIRECT] (stpcpy): Likewise.
* string/mempcpy.c (NO_MEMPCPY_STPCPY_REDIRECT): Define before
including <string.h>.
* string/stpcpy.c (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
* sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c
[!NOT_IN_libc] (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
* sysdeps/powerpc/powerpc64/multiarch/mempcpy.c
[!NOT_IN_libc] (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
[SHARED && !NOT_IN_libc] (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
Completing the removal of the obsolete INTDEF / INTUSE mechanism, this
patch removes the final use - that for _dl_starting_up - replacing it
by rtld_hidden_def / rtld_hidden_proto. Having removed the last use,
the mechanism itself is also removed.
Tested for x86_64 that installed stripped shared libraries are
unchanged by the patch. (This is not much of a test since this
variable is only defined and used in the !HAVE_INLINED_SYSCALLS case.)
[BZ #14132]
* include/libc-symbols.h (INTUSE): Remove macro.
(INTDEF): Likewise.
(INTVARDEF): Likewise.
(_INTVARDEF): Likewise.
(INTDEF2): Likewise.
(INTVARDEF2): Likewise.
* elf/rtld.c [!HAVE_INLINED_SYSCALLS] (_dl_starting_up): Use
rtld_hidden_def instead of INTVARDEF.
* sysdeps/generic/ldsodefs.h [IS_IN_rtld]
(_dl_starting_up_internal): Remove declaration.
(_dl_starting_up): Use rtld_hidden_proto.
* elf/dl-init.c [!HAVE_INLINED_SYSCALLS] (_dl_starting_up): Remove
declaration.
[!HAVE_INLINED_SYSCALLS] (_dl_starting_up_internal): Likewise.
(_dl_init) [!HAVE_INLINED_SYSCALLS]: Don't use INTUSE with
_dl_starting_up.
* elf/dl-writev.h (_dl_writev): Likewise.
* sysdeps/powerpc/powerpc64/dl-machine.h [!HAVE_INLINED_SYSCALLS]
(DL_STARTING_UP_DEF): Use __GI__dl_starting_up instead of
_dl_starting_up_internal.
Continuing the removal of the obsolete INTDEF / INTUSE mechanism, this
patch replaces its use for _dl_argv with rtld_hidden_data_def and
rtld_hidden_proto. Some places in .S files that previously used
_dl_argv_internal or INTUSE(_dl_argv) now use __GI__dl_argv directly
(there are plenty of existing examples of such direct use of __GI_*).
A single place in rtld.c previously used _dl_argv without INTUSE,
apparently accidentally, while the rtld_hidden_proto mechanism avoids
such accidential omissions. As a consequence, this patch *does*
change the contents of stripped ld.so. However, the installed
stripped shared libraries are identical to those you get if instead of
this patch you change that single _dl_argv use to use INTUSE, without
any other changes.
Tested for x86_64 (testsuite as well as comparison of installed
stripped shared libraries as described above).
[BZ #14132]
* sysdeps/generic/ldsodefs.h (_dl_argv): Use rtld_hidden_proto.
[IS_IN_rtld] (_dl_argv_internal): Do not declare.
(rtld_progname): Make macro definition unconditional.
* elf/rtld.c (_dl_argv): Use rtld_hidden_data_def instead of
INTDEF.
(dlmopen_doit): Do not use INTUSE with _dl_argv.
(dl_main): Likewise.
* elf/dl-sysdep.c (_dl_sysdep_start): Likewise.
* sysdeps/alpha/dl-machine.h (RTLD_START): Use __GI__dl_argv
instead of _dl_argv_internal.
* sysdeps/powerpc/powerpc32/dl-start.S (_dl_start_user): Use
__GI__dl_argv instead of INTUSE(_dl_argv).
* sysdeps/powerpc/powerpc64/dl-machine.h (RTLD_START): Use
__GI__dl_argv instead of _dl_argv_internal.
This patch adds an optimized memset implementation for POWER8. For
sizes from 0 to 255 bytes, a word/doubleword algorithm similar to
POWER7 optimized one is used.
For size higher than 255 two strategies are used:
1. If the constant is different than 0, the memory is written with
altivec vector instruction;
2. If constant is 0, dbcz instructions are used. The loop is unrolled
to clear 512 byte at time.
Using vector instructions increases throughput considerable, with a
double performance for sizes larger than 1024. The dcbz loops unrolls
also shows performance improvement, by doubling throughput for sizes
larger than 8192 bytes.
This patch cleanups the multiarch bzero for powerpc64 by remove
the multiarch objects and use instead the the memset embedded
implementation presented in each multiarch optimization. The
code generate is essentially the same, but the TB_TOCLESS (which
is not essential).
Now that MEMCPY_OK_FOR_FWD_MEMMOVE should be define on memcopy.h there
is no need to specialized powerpc memmove implementation. This patch
moves the define set to powerpc memcopy and cleanup its definition on
powerpc code.
This patch changes power7 memcpy to use VSX instructions only when
memory is aligned to quardword. It is to avoid unaligned kernel traps
on non-cacheable memory (for instance, memory-mapped I/O).
This patch adds an optimized memmove optimization for POWER7/powerpc64.
Basically the idea is to use the memcpy for POWER7 on non-overlapped
memory regions and a optimized backward memcpy for memory regions
that overlap (similar to the idea of string/memmove.c).
The backward memcpy algorithm used is similar the one use for memcpy for
POWER7, with adjustments done for alignment. The difference is memory
is always aligned to 16 bytes before using VSX/altivec instructions.
This patch adds an ifunc power7 strcat symbol that uses the logic on
sysdeps/powerpc/strcat.c but call power7 strlen/strcpy symbols instead
of default ones.
This patch defines ELF_MACHINE_NO_RELA on all architectures. Tested
only on x86_64 to verify that the sources before and after are
identical except for two instructions that pass the current line
number in dl-machine.h to assert_fail.
Optimization is achieved on 8 byte aligned strings with double word
comparison using cmpb instruction. On unaligned strings loop unrolling
is applied for Power7 gain.
This patch fixes the optimized ppc64/power7 strncat strlen call for
static build without ifunc enabled. The strlen symbol to call in such
situation is just strlen, instead of __GI_strlen (since the __GI_
alias is just created for shared objects).
The optimization is achieved by following techniques:
> data alignment [gain from aligned memory access on read/write]
> POWER7 gains performance with loop unrolling/unwinding
[gain by reduction of branch penalty].
> zero padding done by calling optimized memset
This patch changes de default symbol redirection for internal call of
memcpy, memset, memchr, and strlen to the IFUNC resolved ones. The
performance improvement is noticeable in algorithms that uses these
symbols extensible, like the regex functions.
This patch fixes some powerpc32 and powerpc64 builds with
--disable-multi-arch option along with different --with-cpu=powerN.
It cleanups the Implies directories by removing the multiarch
folder for non multiarch config and also fixing two assembly
implementations: powerpc64/power7/strncat.S that is calling the
wrong strlen; and power8/fpu/s_isnan.S that misses the hidden_def and
weak_alias directives.
This patch add an optimized strpbrk for POWER7 by using a different
algorithm than default implementation: it constructs a table based on
the 'accept' argument and use this table to check for any occurance on
the input string. The idea is similar as x86_64 uses.
For PowerPC some tunings were added, such as unroll loops and memory
clear using VSX instructions.
This patch add a optimized strcspn for POWER7 by using a different
algorithm than default implementation: it constructs a table based on
the 'accept' argument and use this table to check for any occurance
on the input string. The idea is similar as x86_64 uses.
For PowerPC some tunings were added, such as unroll loops and align
stack memory to table to 16 bytes (so VSX clean can ran without
alignment issues).
The roundl assembly implementation
(sysdeps/powerpc/powerpc64/fpu/s_roundl.S)
returns wrong results for some inputs where first double is a exact
integer and the precision is determined by second long double.
Checking on implementation comments and history, I am very confident the
assembly implementation was based on a version before commit
5c68d40169 that fixes BZ#2423 (Errors in
long double (ldbl-128ibm) rounding functions in glibc-2.4).
By just removing the implementation and make the build select
sysdeps/ieee754/ldbl-128ibm/s_roundl.c instead fixes the failing math.
This fixes 16707.
The nearbyintl assembly implementation
(sysdeps/powerpc/powerpc64/fpu/s_nearbyintl.S)
returns wrong results for some inputs where first double is a exact
integer and the precision is determined by second long double.
Checking on implementation comments and history, I am very confident the
assembly implementation was based on a version before commit
5c68d40169 that fixes BZ#2423 (Errors in
long double (ldbl-128ibm) rounding functions in glibc-2.4).
By just removing the implementation and make the build select
sysdeps/ieee754/ldbl-128ibm/s_nearbyintl.c instead fixes the failing
math.
Fixes BZ#16706.
The ceill assembly implementation (sysdeps/powerpc/powerpc64/fpu/s_ceill.S)
returns wrong results for some inputs where first double is a exact
integer and the precision is determined by second long double.
Checking on implementation comments and history, I am very confident the
assembly implementation was based on a version before commit
5c68d40169 that fixes BZ#2423 (Errors in
long double (ldbl-128ibm) rounding functions in glibc-2.4).
By just removing the implementation and make the build select
sysdeps/ieee754/ldbl-128ibm/s_ceill.c instead fixes the failing math.
Fixes BZ#16701.
This patch fixes an issue for powerpc64[le] static build where __bzero
is definied in multiple places (memset-ppc64.o and bzero.o). It is now
defined only in bzero.o and memset-ppc64.o only defined __bzero_ppc for
both dynamic and static library.
Fixes BZ#16683.
The optimization is achieved by following techniques:
> hashing of needle.
> hashing avoids scanning of duplicate entries in needle across the string.
> initializing the hash table with Vector instructions (VSX) by quadword access.
> unrolling when scanning for character in string across hash table.
The optimization is achieved by following techniques:
1. Doubleword aligned memory access and compares using
cmpb instruction.
2. Loop unrolling for byte load/store.
3. CPU pre-fetch to avoid cache miss.
This patch optimizes strrchr() for ppc64. It uses aligned memory
access along with cmpb instruction and CPU prefetch to avoid
cache misses for speed improvement.
This patch add a optimized llround/llroundf implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
This patch add a optimized llrint/llrintf implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
This patch add a optimized finite/finitef implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
This patch add a optimized isinf/isinff implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
This patch add a optimized isnan/isnanf implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
The truncl assembly implementation (sysdeps/powerpc/powerpc64/fpu/s_truncl.S)
returns wrong results for some inputs where first double is a exact integer
and the precision is determined by second long double.
Checking on implementation comments and history, I am very confident the
assembly implementation was based on a version before commit
5c68d40169 that fixes BZ#2423 (Errors in
long double (ldbl-128ibm) rounding functions in glibc-2.4).
By just removing the implementation and make the build select
sysdeps/ieee754/ldbl-128ibm/s_truncl.c instead it fixes tgammal
issues regarding wrong result sign.
For PPC64, all the wrappers at sysdeps are superfluous: they are
basically the same implementation from math/w_sqrt.c with the
'#ifdef _IEEE_LIBM'. And the power4 version just force the 'fsqrt'
instruction utilization with an inline assembly, which is already
handled by math_private.h __ieee754_sqrt implementation.
This patch add optimized __mpn_addmul, __mpn_addsub, __mpn_lshift, and
__mpn_mul_1 implementations for PowerPC64. They are originally from GMP
with adjustments for GLIBC.