This utilizes vectors and bitmasks. For a small needle and a large
haystack, the performance improvement is up to 8x. For short
strings (0-4B), the cost of computing the bitmask dominates,
and the result is slightly slower.
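The message above does not name the routine involved. As a minimal
sketch of the bitmask idea only, assuming a strcspn-like scan, the
needle can be folded into a 256-bit membership mask that is then tested
once per haystack byte. The names sketch_build_mask and sketch_cspn are
hypothetical, and the code is scalar C rather than the vector code the
patch describes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: set one bit per byte value present in NEEDLE.
   The 256 possible byte values map onto four 64-bit words.  */
static void
sketch_build_mask (const char *needle, uint64_t mask[4])
{
  mask[0] = mask[1] = mask[2] = mask[3] = 0;
  for (const unsigned char *p = (const unsigned char *) needle;
       *p != '\0'; p++)
    mask[*p >> 6] |= (uint64_t) 1 << (*p & 63);
}

/* Length of the initial part of HAYSTACK that contains no byte from
   NEEDLE.  Building the mask is a fixed cost, which is why very short
   inputs gain nothing from it.  */
size_t
sketch_cspn (const char *haystack, const char *needle)
{
  uint64_t mask[4];
  const unsigned char *s = (const unsigned char *) haystack;
  size_t i = 0;

  sketch_build_mask (needle, mask);
  while (s[i] != '\0' && ((mask[s[i] >> 6] >> (s[i] & 63)) & 1) == 0)
    i++;
  return i;
}
```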
On powerpc64, memchr always pointed to the internal __GI_memchr
implementation. This patch fixes that and makes it use the
optimized POWER7 version when appropriate.
* sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c: Make
memchr not point to the internal __GI_memchr implementation.
This patch uses the default strcpy/stpcpy implementation for
POWER7/PPC64. This is faster for most inputs in the benchtests,
and for multiarch the implementation uses the POWER7 strlen and
memcpy.
* string/stpcpy.c (__stpcpy): Use STPCPY to redefine symbol name and
cleanup macro usage.
* string/strcpy.c (strcpy): Use STRCPY to redefine symbol name.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy-power7.S: Remove file.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy-ppc64.S: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/strcpy-power7.S: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/strcpy-ppc64.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/stpcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/strcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/strcpy.c: Likewise.
* sysdeps/powerpc/powerpc64/stpcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/strcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
[SHARED && IS_IN (libc)]: Include <string/strcpy.c>.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
[SHARED && IS_IN (libc)]: Include <string/stpcpy.c>.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy-power7.c: New file.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy-ppc64.c: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/strcpy-power7.c: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/strcpy-ppc64.c: Likewise.
* sysdeps/powerpc/powerpc64/power7/strcpy.c: Likewise.
This patch optimizes the strstr function for POWER7 and later systems.
The performance gain comes from aligned memory accesses and use of the
cmpb instruction for quicker comparison. The average improvement from
this optimization is ~40%. Tested on ppc64 and ppc64le.
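For readers unfamiliar with cmpb: it compares two doublewords byte by
byte and yields 0xff in each byte position where the inputs match and
0x00 elsewhere. The following is a portable C emulation for
illustration, not the assembly from this patch; sketch_cmpb and
sketch_has_null_byte are hypothetical names.

```c
#include <stdint.h>

/* Emulate the cmpb result: byte I of the return value is 0xff when
   byte I of A equals byte I of B, and 0x00 otherwise.  */
uint64_t
sketch_cmpb (uint64_t a, uint64_t b)
{
  uint64_t result = 0;

  for (int i = 0; i < 8; i++)
    {
      unsigned int shift = (unsigned int) i * 8;
      if (((a >> shift) & 0xff) == ((b >> shift) & 0xff))
        result |= (uint64_t) 0xff << shift;
    }
  return result;
}

/* Comparing a doubleword against zero finds a terminating null byte
   in eight bytes at once.  */
int
sketch_has_null_byte (uint64_t word)
{
  return sketch_cmpb (word, 0) != 0;
}
```

On hardware the whole comparison is a single instruction, which is
where the gain over a byte loop would come from.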
2015-07-16 Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
* sysdeps/powerpc/powerpc64/multiarch/Makefile: Add strstr().
* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.
* sysdeps/powerpc/powerpc64/power7/strstr.S: New File.
* sysdeps/powerpc/powerpc64/multiarch/strstr-power7.S: New File.
* sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c: New File.
* sysdeps/powerpc/powerpc64/multiarch/strstr.c: New File.
This patch fixes the missing "__memcpy_ppc" symbol for the memmove-ppc64
object in static builds. Since the memcpy ifunc is not enabled in static
mode, the specialized symbols are not provided. The patch changes it
to use just "__memcpy" instead.
This patch cleans up some multiarch code related to the memmove
optimization. The initial IFUNC support added specialized wordcopy
symbols, which turned into local IFUNC calls used by the default memmove
implementation. The patch removes the internal IFUNC for the wordcopy
symbols and uses local branches in the memmove optimization instead.
This patch cleans up some multiarch code related to the memmove
optimization. The initial IFUNC support added specialized wordcopy
symbols, which turned into local IFUNC calls used by the default memmove
implementation.
This change removes them and uses the optimized memmove instead
for supported chips.
This patch simplifies the default bcopy symbol for powerpc64 by just
using memmove instead of the default bcopy implementation. Since the
symbol is deprecated, it trades speed for code size.
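As a minimal sketch of that simplification, bcopy reduces to a memmove
call with the first two arguments swapped. The name sketch_bcopy is
hypothetical so that it does not clash with the libc symbol.

```c
#include <stddef.h>
#include <string.h>

/* bcopy takes (source, destination, length); memmove takes
   (destination, source, length).  Both must handle overlap.  */
void
sketch_bcopy (const void *src, void *dest, size_t n)
{
  memmove (dest, src, n);
}
```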
This patch adds an optimized POWER8 strncmp. The implementation focuses
on speeding up unaligned cases, following the ideas of the POWER8 strcmp.
The algorithm first checks the initial 16 bytes, then aligns the first
function source and uses unaligned loads on the second argument only.
Additional checks for page boundaries are done for unaligned cases
(where the source alignments differ).
This patch adds an optimized POWER8 strcmp using unaligned accesses.
The algorithm first checks the initial 16 bytes, then aligns the first
function source and uses unaligned loads on the second argument only.
Additional checks for page boundaries are done for unaligned cases.
This patch adds an optimized POWER8 st{r,p}ncpy using unaligned accesses.
It shows 10%-80% improvement over the optimized POWER7 one that uses
only aligned accesses, especially on unaligned inputs.
The algorithm first reads and checks 16 bytes (if the inputs do not cross
a 4K page boundary). Then it realigns the source to 16 bytes and issues a
16-byte read-and-compare loop to speed up null byte checks for large
strings. Also, unlike the POWER7 optimization, the null padding is done
inline in the implementation using possibly unaligned accesses, instead
of relying on a memset call. A special case is added for page-crossing
reads.
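The page-cross condition can be sketched in C as below, assuming the 4K
page size and 16-byte reads mentioned above. The macro and function
names are hypothetical, and the actual assembly may express the test
differently.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_READ_SIZE 16u

/* Return nonzero if a SKETCH_READ_SIZE-byte read starting at P would
   touch the next page, in which case an unaligned 16-byte load could
   fault on bytes past the end of the string.  */
int
sketch_read_crosses_page (const void *p)
{
  uintptr_t offset = (uintptr_t) p & (SKETCH_PAGE_SIZE - 1);

  return offset > SKETCH_PAGE_SIZE - SKETCH_READ_SIZE;
}
```

An offset of exactly 4080 is still safe, because the read then ends on
the last byte of the page.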
With 3eb38795db (Simplify strncat) the generic algorithm uses
strlen, strnlen, and memcpy. This is faster than the current POWER7
implementation, especially for unaligned strings (where the POWER7 code
uses byte-by-byte operations).
This patch removes the assembly implementation and uses a multiarch
specialization based on the default algorithm calling optimized POWER7
symbols.
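A minimal sketch of a strncat built only from strlen, strnlen, and
memcpy is shown below. It illustrates the shape of the generic
algorithm and is not a copy of string/strncat.c; sketch_strncat is a
hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L /* For strnlen.  */
#include <stddef.h>
#include <string.h>

/* Append at most N bytes of SRC to DEST and always null-terminate.
   DEST must have room for strlen (DEST) + N + 1 bytes.  */
char *
sketch_strncat (char *dest, const char *src, size_t n)
{
  char *end = dest + strlen (dest);
  size_t len = strnlen (src, n);

  end[len] = '\0';
  memcpy (end, src, len);
  return dest;
}
```

In a multiarch build each of the three calls can resolve to an
optimized variant, so the C wrapper inherits their speed.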
This patch adds an optimized POWER8 strcpy using unaligned accesses.
For strings up to 16 bytes the implementation first calculates the
string size, like strlen, and issues a memcpy. For larger strings, the
source is first aligned to 16 bytes and then tested in a loop that
reads 16 bytes and combines the cmpb results for speedup. A special case
is added for page-crossing reads.
It shows 30%-60% improvement over the optimized POWER7 one that uses
only aligned accesses.
Use of strftime, a C90 function, ends up bringing in wcschr, which is
not a C90 function. Although not a conformance bug (C90 reserves
wcs*), this is still contrary to glibc practice of avoiding relying on
those reservations; this patch arranges for the internal uses to use
__wcschr instead, with wcschr being a weak alias. This is more
complicated than some such patches because of the various IFUNC
definitions of wcschr (which include code redefining libc_hidden_def
in a way that involves creating __GI_wcschr manually and so also needs
to create __GI___wcschr after the change of internal uses to use
__wcschr).
Tested for x86_64 and 32-bit x86 (testsuite, and that disassembly of
installed shared libraries is unchanged by the patch).
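The internal-name-plus-weak-alias arrangement can be sketched as below.
It relies on the GCC/Clang alias and weak attributes on an ELF target,
and it uses hypothetical sketch_ names rather than __wcschr and wcschr
so that it does not collide with libc. It does not reproduce the
libc_hidden_* macros.

```c
#include <stddef.h>
#include <wchar.h>

/* Stands in for the internal __wcschr: library code calls this name,
   which is not in the user's namespace.  */
wchar_t *
sketch_internal_wcschr (const wchar_t *s, wchar_t c)
{
  for (;; s++)
    {
      if (*s == c)
        return (wchar_t *) s;
      if (*s == L'\0')
        return NULL;
    }
}

/* Stands in for the public wcschr: a weak alias to the same code, so
   a C90 program may define its own symbol of that name.  */
extern __typeof__ (sketch_internal_wcschr) sketch_public_wcschr
  __attribute__ ((weak, alias ("sketch_internal_wcschr")));
```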
2014-12-10 Joseph Myers <joseph@codesourcery.com>
Adhemerval Zanella <azanella@linux.vnet.ibm.com>
[BZ #17634]
* wcsmbs/wcschr.c [!WCSCHR] (wcschr): Define as __wcschr.
Undefine after defining function. Define as weak alias of
__wcschr. Use libc_hidden_weak.
* include/wchar.h (__wcschr): Declare. Use libc_hidden_proto.
* sysdeps/i386/i686/multiarch/wcschr-c.c [IS_IN (libc) && SHARED]
(libc_hidden_def): Also define __GI___wcschr alias.
* sysdeps/i386/i686/multiarch/wcschr.S (wcschr): Rename to
__wcschr and define as weak alias of __wcschr.
* sysdeps/powerpc/power6/wcschr.c [!WCSCHR] (WCSCHR): Define as
__wcschr.
[!WCSCHR] (DEFAULT_WCSCHR): Define.
[DEFAULT_WCSCHR] (__wcschr): Use libc_hidden_def.
[DEFAULT_WCSCHR] (wcschr): Define as weak alias of __wcschr. Use
libc_hidden_weak. Do not use libc_hidden_def.
* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-ppc32.c
[IS_IN (libc) && SHARED] (libc_hidden_def): Also define
__GI___wcschr alias.
* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr.c
[IS_IN (libc)] (wcschr): Define as macro expanding to
__redirect_wcschr.
[IS_IN (libc)] (__wcschr_ppc): Use __redirect_wcschr in typeof.
[IS_IN (libc)] (__wcschr_power6): Likewise.
[IS_IN (libc)] (__wcschr_power7): Likewise.
[IS_IN (libc)] (__libc_wcschr): New. Define with libc_ifunc
instead of wcschr.
[IS_IN (libc)] (wcschr): Undefine and define as weak alias of
__libc_wcschr.
[!IS_IN (libc)] (libc_hidden_def): Do not undefine and redefine.
* sysdeps/powerpc/powerpc64/multiarch/wcschr.c (wcschr): Rename to
__wcschr and define as weak alias of __wcschr. Use
libc_hidden_builtin_def.
* sysdeps/x86_64/wcschr.S (wcschr): Rename to __wcschr and define
as weak alias of __wcschr. Use libc_hidden_weak.
* time/alt_digit.c (_nl_get_walt_digit): Use __wcschr instead of
wcschr.
* time/era.c (_nl_init_era_entries): Likewise.
* conform/Makefile (test-xfail-ISO/time.h/linknamespace): Remove
variable.
(test-xfail-XPG3/time.h/linknamespace): Likewise.
(test-xfail-XPG4/time.h/linknamespace): Likewise.
This patch makes the POWER7 optimized strpbrk generic by using
default doubleword stores to zero the hash, instead of VSX
instructions. Performance on POWER7/POWER8 does not change.
This patch makes the POWER7 optimized strcspn generic by using
default doubleword stores to zero the hash, instead of VSX
instructions. Performance on POWER7/POWER8 does not change.
This patch makes the POWER7 optimized strspn generic by using
default doubleword stores to zero the hash, instead of VSX
instructions. Performance on POWER7/POWER8 machines does not change.
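A rough C equivalent of zeroing the table with doubleword stores is
given below, assuming a 256-byte table with one entry per byte value.
The size and the names are assumptions, since the messages above do not
give the table layout.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TABLE_SIZE 256

/* The union lets the same storage be cleared eight bytes at a time
   and looked up one byte at a time.  */
typedef union
{
  uint64_t dwords[SKETCH_TABLE_SIZE / sizeof (uint64_t)];
  unsigned char bytes[SKETCH_TABLE_SIZE];
} sketch_table_t;

/* Clear the table with plain 64-bit stores, which need no vector
   unit and so work on any 64-bit processor.  */
void
sketch_clear_table (sketch_table_t *table)
{
  for (size_t i = 0; i < SKETCH_TABLE_SIZE / sizeof (uint64_t); i++)
    table->dwords[i] = 0;
}
```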
This patch fixes the build of C mempcpy and stpcpy by disabling the
redirection to __mempcpy and __stpcpy asm names if
NO_MEMPCPY_STPCPY_REDIRECT is defined, and defining that macro in the
relevant source files.
Tested for powerpc32 that the build is fixed.
* include/string.h [NO_MEMPCPY_STPCPY_REDIRECT] (mempcpy): Do not
redeclare with asm name.
[NO_MEMPCPY_STPCPY_REDIRECT] (stpcpy): Likewise.
* string/mempcpy.c (NO_MEMPCPY_STPCPY_REDIRECT): Define before
including <string.h>.
* string/stpcpy.c (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
* sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c
[!NOT_IN_libc] (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
* sysdeps/powerpc/powerpc64/multiarch/mempcpy.c
[!NOT_IN_libc] (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
[SHARED && !NOT_IN_libc] (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
This patch adds an optimized memset implementation for POWER8. For
sizes from 0 to 255 bytes, a word/doubleword algorithm similar to
the POWER7 optimized one is used.
For sizes higher than 255, two strategies are used:
1. If the constant is different from 0, the memory is written with
Altivec vector instructions;
2. If the constant is 0, dcbz instructions are used. The loop is
unrolled to clear 512 bytes at a time.
Using vector instructions increases throughput considerably, doubling
performance for sizes larger than 1024. The dcbz loop unrolling
also shows a performance improvement, doubling throughput for sizes
larger than 8192 bytes.
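The size and value dispatch described above can be summarized as a
small decision function. This is only a sketch of the selection logic
with hypothetical names. The stores themselves are done in assembly and
are not shown.

```c
#include <stddef.h>

enum sketch_memset_strategy
{
  SKETCH_WORD_STORES,   /* 0 to 255 bytes: word/doubleword stores.  */
  SKETCH_VECTOR_STORES, /* Above 255 bytes, nonzero fill value.  */
  SKETCH_DCBZ_LOOP      /* Above 255 bytes, zero fill value.  */
};

/* Pick a strategy for filling N bytes with C.  As in memset, only
   the low eight bits of C are used.  */
enum sketch_memset_strategy
sketch_pick_memset_strategy (int c, size_t n)
{
  if (n <= 255)
    return SKETCH_WORD_STORES;
  if ((unsigned char) c != 0)
    return SKETCH_VECTOR_STORES;
  return SKETCH_DCBZ_LOOP;
}
```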
This patch cleans up the multiarch bzero for powerpc64 by removing
the multiarch objects and using instead the implementation embedded
in the memset of each multiarch optimization. The generated code
is essentially the same, except for TB_TOCLESS (which
is not essential).
Now that MEMCPY_OK_FOR_FWD_MEMMOVE should be defined in memcopy.h, there
is no need to specialize the powerpc memmove implementation. This patch
moves the define to the powerpc memcopy.h and cleans up its definition
in the powerpc code.
This patch adds an optimized memmove implementation for POWER7/powerpc64.
The basic idea is to use the POWER7 memcpy on non-overlapping
memory regions and an optimized backward memcpy for memory regions
that overlap (similar to the idea of string/memmove.c).
The backward memcpy algorithm is similar to the one used for the POWER7
memcpy, with adjustments for alignment. The difference is that memory
is always aligned to 16 bytes before using VSX/Altivec instructions.
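The forward/backward split can be sketched in portable C as follows.
Byte loops stand in for the optimized forward and backward copies, and
sketch_memmove is a hypothetical name. The single unsigned comparison
follows the idea used in string/memmove.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy N bytes from SRC to DEST, allowing the regions to overlap.  */
void *
sketch_memmove (void *dest, const void *src, size_t n)
{
  unsigned char *d = dest;
  const unsigned char *s = src;

  /* The unsigned difference is >= N both when DEST is at least N
     bytes past SRC and when DEST is below SRC (the subtraction
     wraps), so a forward copy is safe in either case.  */
  if ((uintptr_t) d - (uintptr_t) s >= n)
    {
      for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    }
  else
    {
      /* DEST starts inside the source region: copy from the end so
         that no source byte is overwritten before it is read.  */
      for (size_t i = n; i > 0; i--)
        d[i - 1] = s[i - 1];
    }
  return dest;
}
```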
This patch adds an ifunc POWER7 strcat symbol that uses the logic in
sysdeps/powerpc/strcat.c but calls the POWER7 strlen/strcpy symbols
instead of the default ones.
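The strcat-from-strlen-and-strcpy logic amounts to the sketch below.
The plain strlen and strcpy calls stand in for the POWER7-specific
symbols, and sketch_strcat is a hypothetical name.

```c
#include <string.h>

/* Find the end of DEST, then copy SRC (including its terminating
   null byte) there.  */
char *
sketch_strcat (char *dest, const char *src)
{
  strcpy (dest + strlen (dest), src);
  return dest;
}
```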
The optimization is achieved on 8-byte aligned strings with doubleword
comparison using the cmpb instruction. On unaligned strings, loop
unrolling is applied for a POWER7 gain.
The optimization is achieved by the following techniques:
> data alignment [gain from aligned memory access on read/write]
> POWER7 gains performance with loop unrolling/unwinding
[gain by reduction of branch penalty].
> zero padding done by calling optimized memset
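The list above does not name the routine. Zero padding suggests a
strncpy-style function, and that is an assumption here. Under that
assumption, delegating the padding to memset looks like the sketch
below, with memcpy standing in for the aligned, unrolled copy loop and
sketch_strncpy as a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L /* For strnlen.  */
#include <stddef.h>
#include <string.h>

/* Copy at most N bytes of SRC into DEST and fill the remainder of
   the N bytes with null bytes using a single memset call.  */
char *
sketch_strncpy (char *dest, const char *src, size_t n)
{
  size_t len = strnlen (src, n);

  memcpy (dest, src, len);
  memset (dest + len, 0, n - len);
  return dest;
}
```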
This patch changes the default symbol redirection for internal calls of
memcpy, memset, memchr, and strlen to the IFUNC resolved ones. The
performance improvement is noticeable in algorithms that use these
symbols extensively, like the regex functions.
This patch adds an optimized strpbrk for POWER7 by using a different
algorithm than the default implementation: it constructs a table based
on the 'accept' argument and uses this table to check for any occurrence
in the input string. The idea is similar to the one x86_64 uses.
For PowerPC some tunings were added, such as unrolled loops and memory
clearing using VSX instructions.
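The table-based approach can be sketched in plain C as follows. This
shows the lookup idea only, without the unrolling or VSX clearing, and
sketch_strpbrk is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first byte of S that also occurs in
   ACCEPT, or NULL if there is none.  */
char *
sketch_strpbrk (const char *s, const char *accept)
{
  unsigned char table[256];

  /* One entry per possible byte value, set for bytes in ACCEPT.  */
  memset (table, 0, sizeof table);
  for (const unsigned char *a = (const unsigned char *) accept;
       *a != '\0'; a++)
    table[*a] = 1;

  /* Each input byte now costs one table load instead of a scan
     over ACCEPT.  */
  for (const unsigned char *p = (const unsigned char *) s;
       *p != '\0'; p++)
    if (table[*p])
      return (char *) p;
  return NULL;
}
```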
This patch adds an optimized strcspn for POWER7 by using a different
algorithm than the default implementation: it constructs a table based
on the 'accept' argument and uses this table to check for any occurrence
in the input string. The idea is similar to the one x86_64 uses.
For PowerPC some tunings were added, such as unrolled loops and aligning
the stack memory for the table to 16 bytes (so the VSX clear can run
without alignment issues).
This patch fixes an issue for powerpc64[le] static builds where __bzero
is defined in multiple places (memset-ppc64.o and bzero.o). It is now
defined only in bzero.o, and memset-ppc64.o only defines __bzero_ppc for
both the dynamic and static libraries.
Fixes BZ#16683.
The optimization is achieved by the following techniques:
> hashing of needle.
> hashing avoids scanning of duplicate entries in needle across the string.
> initializing the hash table with Vector instructions (VSX) by quadword access.
> unrolling when scanning for character in string across hash table.
The optimization is achieved by the following techniques:
1. Doubleword aligned memory access and compares using
cmpb instruction.
2. Loop unrolling for byte load/store.
3. CPU pre-fetch to avoid cache miss.
This patch optimizes strrchr() for ppc64. It uses aligned memory
accesses along with the cmpb instruction and CPU prefetch to avoid
cache misses and improve speed.