Commit Graph

72 Commits

Author SHA1 Message Date
Raoni Fassina Firmino
23fdf8178c powerpc64le: Optimize memset for POWER10
This implementation is based on __memset_power8 and integrates a lot
of suggestions from Anton Blanchard.

The biggest difference is that it makes extensive use of stxvl in the
alignment and tail code to avoid branches and small stores.  It has
three main execution paths:

a) "Short lengths" for lengths up to 64 bytes, avoiding as many
   branches as possible.

b) "General case" for larger lengths, it has an alignment section
   using stxvl to avoid branches, a 128 bytes loop and then a tail
   code, again using stxvl with few branches.

c) "Zeroing cache blocks" for lengths from 256 bytes upwards and set
   value being zero.  It is mostly the __memset_power8 code but the
   alignment phase was simplified because, at this point, address is
   already 16-bytes aligned and also changed to use vector stores.
   The tail code was also simplified to reuse the general case tail.

All unaligned stores use stxvl instructions that do not generate
alignment interrupts on POWER10, making it safe to use on
caching-inhibited memory.

On average, this implementation provides around a 30% improvement
over __memset_power8.
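
For illustration, the stxvl-based tail handling can be sketched with
the GCC vector builtins (a hypothetical helper; vec_xst_len stands in
for stxvl, and the actual implementation is hand-written assembly):

    #include <altivec.h>
    #include <stddef.h>

    /* Store the final 0-15 bytes of a memset without branching on the
       exact length: a store with length writes exactly LEN bytes (and
       none when LEN == 0).  Requires -mcpu=power9 or later.  */
    static inline void
    set_tail (unsigned char *dst, unsigned char c, size_t len)
    {
      vector unsigned char v = vec_splats (c);
      vec_xst_len (v, dst, len);
    }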

Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2021-04-30 18:12:08 -03:00
Tulio Magno Quites Machado Filho
e941e0ae80 powerpc64le: Optimize memcpy for POWER10
This implementation is based on __memcpy_power8_cached and integrates
suggestions from Anton Blanchard.
It benefits from loads and stores with length, both for short lengths
and for the tail code, which simplifies the implementation.

All unaligned memory accesses use instructions that do not generate
alignment interrupts on POWER10, making it safe to use on
caching-inhibited memory.

The main loop has also been modified to increase instruction
throughput by reducing the dependency on updates from previous
iterations.
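
As an illustration, the dependency-reducing loop structure can be
sketched in scalar C (hypothetical; the real loop moves data with
vector loads and stores):

    #include <stddef.h>
    #include <string.h>

    /* One pointer update per 64-byte iteration; the four 16-byte
       copies use fixed offsets and do not depend on one another, so
       they can issue back to back.  Handles only the bulk of the
       copy; head and tail are dealt with elsewhere.  */
    static void
    copy_bulk (char *dst, const char *src, size_t n)
    {
      for (; n >= 64; src += 64, dst += 64, n -= 64)
        {
          memcpy (dst,      src,      16);
          memcpy (dst + 16, src + 16, 16);
          memcpy (dst + 32, src + 32, 16);
          memcpy (dst + 48, src + 48, 16);
        }
    }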

On average, this implementation provides around 30% improvement when
compared to __memcpy_power7 and 10% improvement in comparison to
__memcpy_power8_cached.
2021-04-30 18:12:08 -03:00
Lucas A. M. Magalhaes
dd59655e93 powerpc64le: Optimized memmove for POWER10
This patch was initially based on __memmove_power7, with some ideas
from the POWER9 strncpy implementation.

Improvements from __memmove_power7:

1. Use lxvl/stxvl for alignment code.

   The code for POWER7 uses branches when the input is not naturally
   aligned to the width of a vector.  The new implementation uses
   lxvl/stxvl instead, which reduces pressure on GPRs.  It also allows
   the removal of branch instructions, implicitly removing branch stalls
   and mispredictions (see the C sketch after this list).

2. Use of the lxv/stxv and lxvl/stxvl pairs is safe on Cache Inhibited
   memory.

   On POWER10, vector loads and stores are safe to use on CI memory
   for addresses not aligned to 16B.  This code takes advantage of
   that to do unaligned loads.

   The unaligned loads don't have a significant performance impact by
   themselves.  However, using them decreases register pressure on GPRs
   and interdependence stalls on load/store pairs.  It also improves
   readability, as there are now fewer code paths for different
   alignments.  Finally, it reduces the overall code size.

3. Improved performance.

   This version runs on average about 30% faster than memmove_power7
   for lengths larger than 8KB.  For input lengths shorter than 8KB
   the improvement is smaller, on average about 17%.

   This version has a degradation of about 50% for input lengths
   in the 0 to 31 bytes range when dest is unaligned.
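
A minimal C sketch of the lxvl/stxvl alignment idea from item 1, with
the GCC builtins vec_xl_len/vec_xst_len standing in for the
instructions (the helper and its name are illustrative, not the
actual assembly):

    #include <altivec.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Copy the 0-15 byte head needed to make DST 16B-aligned without
       branching on the exact count: the load/store-with-length pair
       moves exactly HEAD bytes.  Requires -mcpu=power9 or later.  */
    static inline size_t
    copy_head (unsigned char *dst, const unsigned char *src)
    {
      size_t head = (16 - ((uintptr_t) dst & 15)) & 15;
      vec_xst_len (vec_xl_len ((unsigned char *) src, head), dst, head);
      return head;
    }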

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2021-04-30 18:12:08 -03:00
Matheus Castanho
10624a97e8 powerpc: Add optimized strlen for POWER10
Improvements compared to POWER9 version:

1. Take into account first 16B comparison for aligned strings

   The previous version compares the first 16B and increments r4 by the number
   of bytes until the address is 16B-aligned, then starts doing aligned loads at
   that address. For aligned strings, this causes the first 16B to be compared
   twice, because the increment is 0. Here we calculate the next 16B-aligned
   address differently, which avoids that issue.

2. Use simple comparisons for the first ~192 bytes

   The main loop is good for big strings, but comparing 16B at a time is
   better for smaller strings.  So after aligning the address to 16 bytes,
   we check another 176B in 16B chunks.  There may be some overlap with
   the main loop for unaligned strings, but we avoid using the more
   aggressive strategy too soon, and also allow the loop to start at a
   64B-aligned address.  This greatly benefits smaller strings and avoids
   overlapping checks if the string is already aligned at a 64B boundary.

3. Reduce dependencies between load blocks caused by address calculation in the loop

   Doing a precise time tracing on the code showed many loads in the loop were
   stalled waiting for updates to r4 from previous code blocks.  This
   implementation avoids that as much as possible by using 2 registers (r4 and
   r5) to hold addresses to be used by different parts of the code.

   Also, the previous code aligned the address to 16B, then to 64B by doing a
   few 48B loops (if needed) until the address was aligned. The main loop could
   not start until that 48B loop had finished and r4 was updated with the
   current address. Here we calculate the address used by the loop very early,
   so it can start sooner.

   The main loop now uses 2 pointers 128B apart to make pointer updates
   less frequent, and also unrolls 1 iteration to guarantee there is
   enough time between iterations to update the pointers, reducing
   stalled cycles (sketched in C after this list).

4. Use new P10 instructions

   lxvp is used to load 32B with a single instruction, reducing contention in
   the load queue.

   vextractbm allows simplifying the tail code for the loop, replacing
   vbpermq and avoiding having to generate a permute control vector.
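
A scalar sketch of the pointer-pair loop from item 3 (hypothetical C;
the real loop checks 64B per pointer with vector compares and relies
on 64B alignment to make reading ahead safe):

    #include <stdint.h>
    #include <string.h>

    /* Classic branch-free test for a zero byte inside a word.  */
    static inline int
    has_zero (uint64_t v)
    {
      return ((v - 0x0101010101010101ULL) & ~v
              & 0x8080808080808080ULL) != 0;
    }

    /* P and Q stay 128B apart and are each updated once per
       iteration, so the loads inside use fixed offsets instead of
       waiting on a pointer update.  Returns a pointer to the 8-byte
       word holding the first zero byte; S must point into a buffer
       where reading ahead is safe.  */
    static const char *
    find_zero (const char *s)
    {
      const char *p = s, *q = s + 128;
      for (;; p += 256, q += 256)
        {
          for (int i = 0; i < 128; i += 8)
            {
              uint64_t v;
              memcpy (&v, p + i, 8);
              if (has_zero (v))
                return p + i;
            }
          for (int i = 0; i < 128; i += 8)
            {
              uint64_t v;
              memcpy (&v, q + i, 8);
              if (has_zero (v))
                return q + i;
            }
        }
    }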

Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com>
Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
2021-04-22 16:18:06 -03:00
Raphael M Zinsly
7beee7b39a powerpc: Add optimized stpncpy for POWER9
Add stpncpy support to the POWER9 strncpy.

Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2020-11-12 13:16:36 -03:00
Raphael M Zinsly
b9d83bf3eb powerpc: Add optimized strncpy for POWER9
Similar to the strcpy P9 optimization, this version uses VSX to improve
performance.

Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2020-11-12 13:12:24 -03:00
Paul E. Murphy
a23bd00f9d powerpc64le: add optimized strlen for P9
This started as a trivial change to Anton's rawmemchr.  I got
carried away.  This is a hybrid of P8's asymptotically
faster 64B checks and extremely efficient small string checks,
e.g. <64B (and sometimes a little more depending on alignment).

The second trick is to align to 64B by running a 48B checking loop
16B at a time until we naturally align to 64B (i.e. checking 48/96/144
bytes/iteration based on the alignment after the first 5 comparisons).
This alleviates the need to check page boundaries.

Finally, explicitly use the P7 strlen with the runtime loader when
building for P9.  We need to be cautious about vector/VSX extensions
here in P9-only builds.
2020-06-05 15:30:00 -05:00
Anton Blanchard
765de945ef powerpc: Optimized rawmemchr for POWER9
This version uses vector instructions and is up to 60% faster on medium
matches and up to 90% faster on long matches, compared to the POWER7
version. A few examples:

                            __rawmemchr_power9  __rawmemchr_power7
Length   32, alignment  0:   2.27566             3.77765
Length   64, alignment  2:   2.46231             3.51064
Length 1024, alignment  0:  17.3059             32.6678
2020-05-18 17:08:54 -05:00
Anton Blanchard via Libc-alpha
aa70d05632 powerpc: Optimized stpcpy for POWER9
Add stpcpy support to the POWER9 strcpy. This is up to 40% faster on
small strings and up to 90% faster on long relatively unaligned strings,
compared to the POWER8 version. A few examples:

                                        __stpcpy_power9  __stpcpy_power8
Length   20, alignments in bytes  4/ 4:  2.58246          4.8788
Length 1024, alignments in bytes  1/ 6: 24.8186          47.8528
2020-05-18 08:26:22 -05:00
Anton Blanchard via Libc-alpha
3903704850 powerpc: Optimized strcpy for POWER9
This version uses VSX store vector with length instructions and is
significantly faster on small strings and relatively unaligned large
strings, compared to the POWER8 version. A few examples:

                                        __strcpy_power9  __strcpy_power8
Length   16, alignments in bytes  0/ 0: 2.52454          4.62695
Length  412, alignments in bytes  4/ 0: 11.6             22.9185
2020-05-18 08:26:22 -05:00
Adhemerval Zanella
f82ed45d7f powerpc: Use generic wcsrchr optimization
This patch removes the power6 wcsrchr optimization and uses the generic
implementation instead.  Currently, the binaries resulting from the
power6 and power7 IFUNC variants are essentially the same, and the
generic implementation with the loop unrolling set to 8 also results in
similar performance.

Checked on powerpc64-linux-gnu.

	* sysdeps/powerpc/Makefile [$(subdir) == wcsmbs] (CFLAGS-wcsrchr.c):
	New rule.
	* sysdeps/powerpc/power6/wcsrchr.c: Remove file.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr-power6.c:
	Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr-power7.c:
	Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr-ppc32.c:
	Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcsrchr-power6.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcsrchr-power7.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcsrchr-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcsrchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/power6/wcsrchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/Makefile
	[$(subdir) == wcsmbs] (sysdeps_routines): Remove wcsrchr-power6 and
	wcsrchr-power7.
	(CFLAGS-wcsrchr-power7.c, CFLAGS-wcsrchr-power6.c): Remove rule.
	* sysdeps/powerpc/powerpc64/multiarch/Makefile: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c:
	Remove wcsrchr optimizations.
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.
2019-04-04 16:01:14 +07:00
Adhemerval Zanella
421e3005ca powerpc: Use generic wcschr optimization
This patch removes the power6 wcschr optimization and uses the generic
implementation instead.  Currently, the binaries resulting from the
power6 and power7 IFUNC variants are essentially the same, and the
generic implementation with the loop unrolling set to 8 also results in
similar performance.

Checked on powerpc64-linux-gnu.

	* sysdeps/powerpc/Makefile [$(subdir) == wcsmbs] (CFLAGS-wcschr.c):
	New rule.
	* sysdeps/powerpc/power6/wcschr.c: Remove file.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-power6.c:
	Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-power7.c:
	Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-ppc32.c:
	Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcschr-power6.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcschr-power7.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcschr-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcschr.c: Likewise.
	* sysdeps/powerpc/powerpc64/power6/wcschr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/Makefile
	[$(subdir) == wcsmbs] (sysdeps_routines): Remove wcschr-power6 and
	wcschr-power7.
	(CFLAGS-wcschr-power7.c, CFLAGS-wcschr-power6.c): Remove rule.
	* sysdeps/powerpc/powerpc64/multiarch/Makefile: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c:
	Remove wcschr optimizations.
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.
2019-04-04 16:01:14 +07:00
Adhemerval Zanella
447a1306c3 powerpc: Use generic wcscpy optimization
This patch removes the power6 wcscpy optimization and uses the generic
implementation instead.  Currently, the binaries resulting from the
power6 and power7 IFUNC variants are essentially the same, and the
generic implementation with the loop unrolling set to 8 also results in
similar performance.

Checked on powerpc64-linux-gnu.

	* sysdeps/powerpc/Makefile [$(subdir) == wcsmbs] (CFLAGS-wcscpy.c):
	New rule.
	* sysdeps/powerpc/power6/wcscpy.c: Remove file.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-power6.c:
	Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-power7.c:
	Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-ppc32.c:
	Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcscpy-power6.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcscpy-power7.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcscpy-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcscpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/power6/wcscpy.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/Makefile
	[$(subdir) == wcsmbs] (sysdeps_routines): Remove wcscpy-power6 and
	wcscpy-power7.
	(CFLAGS-wcscpy-power7.c, CFLAGS-wcscpy-power6.c): Remove rule.
	* sysdeps/powerpc/powerpc64/multiarch/Makefile: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c:
	Remove wcscpy optimizations.
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.
2019-04-04 16:01:14 +07:00
Rajalakshmi Srinivasaraghavan
7793ad7a2c powerpc: Rearrange little endian specific files
This patch moves the little-endian-specific POWER9 optimization files
to sysdeps/powerpc/powerpc64/le and creates POWER9 ifunc functions
only for little endian.
2018-08-16 12:12:02 +05:30
Adhemerval Zanella
c9cd7b0ce5 powerpc: POWER8 memcpy optimization for cached memory
On POWER8, unaligned memory accesses to cached memory have little
impact on performance, as opposed to its predecessors.

It is disabled by default and is only available when the tunable
glibc.tune.cached_memopt is set to 1.
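
The tunable is set through the standard tunables environment variable,
e.g.:

    GLIBC_TUNABLES=glibc.tune.cached_memopt=1 ./app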

                 __memcpy_power8_cached      __memcpy_power7
============================================================
    max-size=4096:     33325.70 ( 12.65%)        38153.00
    max-size=8192:     32878.20 ( 11.17%)        37012.30
   max-size=16384:     33782.20 ( 11.61%)        38219.20
   max-size=32768:     33296.20 ( 11.30%)        37538.30
   max-size=65536:     33765.60 ( 10.53%)        37738.40

	* manual/tunables.texi (Hardware Capability Tunables): Document
	glibc.tune.cached_memopt.
	* sysdeps/powerpc/cpu-features.c: New file.
	* sysdeps/powerpc/cpu-features.h: New file.
	* sysdeps/powerpc/dl-procinfo.c [!IS_IN(ldconfig)]: Add
	_dl_powerpc_cpu_features.
	* sysdeps/powerpc/dl-tunables.list: New file.
	* sysdeps/powerpc/ldsodefs.h: Include cpu-features.h.
	* sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h
	(INIT_ARCH): Initialize use_aligned_memopt.
	* sysdeps/powerpc/powerpc64/dl-machine.h [defined(SHARED &&
	IS_IN(rtld))]: Restrict dl_platform_init availability and
	initialize CPU features used by tunables.
	* sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines):
	Add memcpy-power8-cached.
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Add
	__memcpy_power8_cached.
	* sysdeps/powerpc/powerpc64/multiarch/memcpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memcpy-power8-cached.S:
	New file.

Reviewed-by: Rajalakshmi Srinivasaraghavan  <raji@linux.vnet.ibm.com>
2017-12-11 17:39:42 -02:00
Rajalakshmi Srinivasaraghavan
59ba2d2b54 powerpc: Optimize memrchr for power8
Vectorized loops are used for sizes greater than 32B to improve
performance over the power7 optimization.  This shows an average
25% improvement, depending on the position of the search
character.  The performance is the same for shorter strings.
2017-10-02 17:31:13 +05:30
Rajalakshmi Srinivasaraghavan
43e0ac24c8 powerpc: Optimize memchr for power8
Vectorized loops are used for sizes greater than 32B to improve
performance over the power7 optimization.
2017-06-21 10:55:12 +05:30
Rajalakshmi Srinivasaraghavan
dec4a7105e powerpc: Improve memcmp performance for POWER8
Vectorization improves performance over the current implementation.
Tested on powerpc64 and powerpc64le.
2017-05-18 11:21:20 +05:30
Rajalakshmi Srinivasaraghavan
6c6ab1fc49 powerpc64: strrchr optimization for power8
The P7 code is used for strings <= 32B; for strings > 32B, vectorized
loops are used.  This shows an average 25% improvement, depending on
the position of the search character.  The performance is the same for
shorter strings.
Tested on ppc64 and ppc64le.
2017-04-18 11:28:56 +05:30
Rajalakshmi Srinivasaraghavan
249dcdb71b powerpc: Optimized strncat for POWER8
With the new optimized strnlen for POWER8 [1], this patch adds a
POWER8 strncat that makes use of the optimized strlen and strnlen.
This is faster than the current POWER7 implementation for larger
strings.

Tested on powerpc64 and powerpc64le.

[1] https://sourceware.org/ml/libc-alpha/2017-03/msg00491.html

	* sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines): Add
	strncat-power8.
	* sysdeps/powerpc/powerpc64/multiarch/strncat.c (strncat): Add
	__strncat_power8 to ifunc list.
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
	(strncat): Add __strncat_power8 to list of strncat functions.
	* sysdeps/powerpc/powerpc64/multiarch/strncat-power8.c: New file.
2017-04-13 11:29:20 +05:30
Wainer dos Santos Moschetta
18b6e2c86c powerpc64: Add POWER8 strnlen
Added a POWER8 strnlen optimized for long strings.  It delivers the
same performance as the POWER7 implementation for short strings.

This takes advantage of reasonably performing unaligned loads
and bit permutes to check the first 1-16 bytes until
quadword aligned, then checks in 64-byte strides until unsafe,
then in 16-byte strides, truncating the count if need be.

Likewise, the POWER7 code is recycled for strings shorter than 32 bytes.

Tested on ppc64 and ppc64le.

	* sysdeps/powerpc/powerpc64/multiarch/Makefile
	(sysdep_routines): Add strnlen-power8.
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
	(strnlen): Add __strnlen_power8 to list of strnlen functions.
	* sysdeps/powerpc/powerpc64/multiarch/strnlen-power8.S:
	New file.
	* sysdeps/powerpc/powerpc64/multiarch/strnlen.c
	(__strnlen): Add __strnlen_power8 to ifunc list.
	* sysdeps/powerpc/powerpc64/power8/strnlen.S: New file.
2017-04-05 10:26:58 -03:00
Rajalakshmi Srinivasaraghavan
9314d3545e powerpc64: strchr/strchrnul optimization for power8
The P7 code is used for strings <= 32B; for strings > 32B, vectorized
loops are used.  This shows an average 25% improvement, depending on
the position of the search character.  The performance is the same for
shorter strings.
Tested on ppc64 and ppc64le.
2016-12-28 11:44:31 -02:00
Rajalakshmi Srinivasaraghavan
d89060d603 powerpc: strncmp optimization for power9
Compared to the power8 optimization, vectorized loops are used for
strings > 32B.

Tested on power9 ppc64le simulator.
2016-12-13 10:53:42 +05:30
Rajalakshmi Srinivasaraghavan
80ab6401a9 powerpc: strcmp optimization for power9
Compared to the power8 optimization, vectorized loops are used for
strings > 32B.

Tested on power9 ppc64le simulator.
2016-12-01 11:35:43 +05:30
raji
c8376f3e07 powerpc: strcasecmp/strncasecmp optimization for power8
This implementation utilizes vectors to improve performance
compared to the current byte-by-byte implementation for POWER7.
The performance improvement is up to 4x.  This patch is tested
on powerpc64 and powerpc64le.
2016-06-14 14:51:16 +05:30
Paul E. Murphy
8f1b841e45 powerpc: Add optimized strcspn for P8
A few minor adjustments to the P8 strspn give us
an almost equally optimized P8 strcspn.
2016-04-25 09:11:02 -05:00
Rajalakshmi Srinivasaraghavan
e413b14e18 powerpc: strcasestr optimization for power8
This patch optimizes the strcasestr function for power >= 8 systems.
It compares 16 bytes at a time using vector instructions; the average
improvement of this optimization is ~40%.  This patch is tested on
powerpc64 and powerpc64le.
2016-04-22 19:23:13 +05:30
Carlos Eduardo Seo
1b045ee53e powerpc: Optimization for strlen for POWER8.
This implementation takes advantage of vectorization to improve the
performance of the loop over the current POWER7 strlen implementation.
2016-04-15 17:19:19 -03:00
Paul E. Murphy
25dba0ad05 powerpc: Add optimized P8 strspn
This utilizes vectors and bitmasks.  For a small needle and a large
haystack, the performance improvement is up to 8x.  For short
strings (0-4B), the cost of computing the bitmask dominates,
and it is a tad slower.
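
The bitmask idea in portable C (a scalar model of the vectorized
version; the helper name is illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* One bit per possible byte value.  Building the map takes a pass
       over ACCEPT, which is why the setup cost dominates for 0-4B
       inputs.  */
    static size_t
    strspn_sketch (const char *s, const char *accept)
    {
      uint64_t map[4] = { 0, 0, 0, 0 };
      for (const unsigned char *a = (const unsigned char *) accept; *a; a++)
        map[*a >> 6] |= 1ULL << (*a & 63);

      const unsigned char *p = (const unsigned char *) s;
      while (map[*p >> 6] >> (*p & 63) & 1)
        p++;
      return (size_t) (p - (const unsigned char *) s);
    }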
2016-04-07 15:51:28 -05:00
Rajalakshmi Srinivasaraghavan
b42f8cad52 powerpc: strstr optimization
This patch optimizes the strstr function for power >= 7 systems.  The
performance gain is obtained using aligned memory accesses and the
cmpb instruction for quicker comparison.  The average improvement of
this optimization is ~40%.  Tested on ppc64 and ppc64le.

2015-07-16  Rajalakshmi Srinivasaraghavan  <raji@linux.vnet.ibm.com>

	* sysdeps/powerpc/powerpc64/multiarch/Makefile: Add strstr().
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.
	* sysdeps/powerpc/powerpc64/power7/strstr.S: New File.
	* sysdeps/powerpc/powerpc64/multiarch/strstr-power7.S: New File.
	* sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c: New File.
	* sysdeps/powerpc/powerpc64/multiarch/strstr.c: New File.
2015-07-16 13:43:51 -03:00
Adhemerval Zanella
b269211467 powerpc: wordcopy/memmove cleanup for ppc64
This patch cleans up some multiarch code related to the memmove
optimization.  Initial IFUNC support added specialized wordcopy
symbols, which turned into local IFUNC calls used by the default
memmove implementation.

This change removes them and uses the optimized memmove instead
on supported chips.
2015-02-09 06:42:28 -05:00
Adhemerval Zanella
18e270aada powerpc: Remove POWER7 wordcopy ifunc
This patch removes the POWER7 ifunc wordcopy functions
(_wordcopy_*_power7), since GLIBC now provides an optimized
memmove/bcopy for POWER7.
2015-02-09 06:42:28 -05:00
Adhemerval Zanella
3001e54c57 powerpc: multiarch Makefile cleanup for powerpc64
This patch cleans up the multiarch Makefile by moving the wide
character implementations to the correct wcsmbs rule.
2015-02-09 06:42:27 -05:00
Adhemerval Zanella
d3b00f468b powerpc: Optimized strncmp for POWER8/PPC64
This patch adds an optimized POWER8 strncmp.  The implementation
focuses on speeding up unaligned cases, following the ideas of the
power8 strcmp.

The algorithm first checks the initial 16 bytes, then aligns the first
source and uses unaligned loads on the second argument only.
Additional checks for page boundaries are done for unaligned cases
(where the sources' alignments differ).
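
The page-boundary test can be modeled in C as follows (a hypothetical
helper; the constants assume a 4K page and 16-byte loads):

    #include <stdint.h>

    /* A 16-byte unaligned load from P is safe only if it does not run
       into the next page.  */
    static inline int
    crosses_4k_page (const void *p)
    {
      return ((uintptr_t) p & 0xfff) > 0xff0;   /* 0xff0 == 4096 - 16 */
    }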
2015-01-13 14:35:40 -05:00
Adhemerval Zanella
8bedcb5f03 powerpc: Optimized strcmp for POWER8/PPC64
This patch adds an optimized POWER8 strcmp using unaligned accesses.
The algorithm first checks the initial 16 bytes, then aligns the first
source and uses unaligned loads on the second argument only.
Additional checks for page boundaries are done for unaligned cases.
2015-01-13 11:28:58 -05:00
Adhemerval Zanella
f06a4faf8a powerpc: Optimized st{r,p}ncpy for POWER8/PPC64
This patch adds an optimized POWER8 st{r,p}ncpy using unaligned
accesses.  It shows a 10%-80% improvement over the optimized POWER7
version that uses only aligned accesses, especially on unaligned
inputs.

The algorithm first reads and checks 16 bytes (if the inputs do not
cross a 4K page boundary).  It then realigns the source to 16 bytes
and issues a 16-byte read-and-compare loop to speed up null byte
checks for large strings.  Also, unlike the POWER7 optimization, the
null padding is done inline using possibly unaligned accesses,
instead of relying on a memset call.  A special case is added for
page-crossing reads.
2015-01-13 11:28:44 -05:00
Adhemerval Zanella
94c9680945 powerpc: Optimized strcat for POWER8/PPC64
With the new optimized strcpy for POWER8, this patch adds an optimized
strcat which uses it along with the default implementation at strings/.
2015-01-13 11:28:36 -05:00
Adhemerval Zanella
96d6fd6c40 powerpc: Optimized st{r,p}cpy for POWER8/PPC64
This patch adds an optimized POWER8 strcpy using unaligned accesses.
For strings up to 16 bytes the implementation first calculates the
string size, like strlen, and issues a memcpy.  For larger strings,
the source is first aligned to 16 bytes and then tested in a loop that
reads 16 bytes at a time and combines the cmpb results for speedup.  A
special case is added for page-crossing reads.
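
The cmpb doubleword primitive these loops combine can be modeled in C
(illustrative; the hardware does this in a single instruction):

    #include <stdint.h>

    /* cmpb: each result byte is 0xff where the corresponding bytes of
       A and B are equal, 0x00 otherwise.  Comparing against 0 thus
       flags every null byte in a doubleword.  */
    static inline uint64_t
    cmpb_model (uint64_t a, uint64_t b)
    {
      uint64_t r = 0;
      for (int i = 0; i < 64; i += 8)
        if ((((a ^ b) >> i) & 0xff) == 0)
          r |= 0xffULL << i;
      return r;
    }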

It shows a 30%-60% improvement over the optimized POWER7 version that
uses only aligned accesses.
2015-01-13 11:28:30 -05:00
Adhemerval Zanella
0f0a1c82f5 powerpc: Add powerpc64 strpbrk optimization
This patch makes the POWER7 optimized strpbrk generic by using
default doubleword stores to zero the hash, instead of VSX
instructions.  Performance on POWER7/POWER8 does not change.
2014-12-02 13:34:02 -05:00
Adhemerval Zanella
bb2542e0ae powerpc: Add powerpc64 strcspn optimization
This patch makes the POWER7 optimized strcspn generic by using
default doubleword stores to zero the hash, instead of VSX
instructions.  Performance on POWER7/POWER8 does not change.
2014-12-02 07:16:24 -05:00
Adhemerval Zanella
2e8a2de2da powerpc: Add powerpc64 strspn optimization
This patch makes the POWER7 optimized strspn generic by using
default doubleword stores to zero the hash, instead of VSX
instructions.  Performance on POWER7/POWER8 machines does not change.
2014-12-02 07:15:58 -05:00
Adhemerval Zanella
71ae86478e PowerPC: memset optimization for POWER8/PPC64
This patch adds an optimized memset implementation for POWER8.  For
sizes from 0 to 255 bytes, a word/doubleword algorithm similar to the
POWER7-optimized one is used.

For sizes greater than 255, two strategies are used:

1. If the constant is different from 0, the memory is written with
   Altivec vector instructions;

2. If the constant is 0, dcbz instructions are used.  The loop is
   unrolled to clear 512 bytes at a time.
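
Strategy 2 can be sketched with GCC inline assembly (a hypothetical
helper; it assumes the 128-byte cache blocks of these processors and
cache-block-aligned input, and the real code is assembly throughout):

    #include <stddef.h>

    /* Zero 512 bytes per iteration with four dcbz instructions.
       P must be 128B-aligned and N a multiple of 512.  */
    static void
    zero_blocks (char *p, size_t n)
    {
      for (; n != 0; p += 512, n -= 512)
        {
          __asm__ volatile ("dcbz 0,%0" : : "r" (p)       : "memory");
          __asm__ volatile ("dcbz 0,%0" : : "r" (p + 128) : "memory");
          __asm__ volatile ("dcbz 0,%0" : : "r" (p + 256) : "memory");
          __asm__ volatile ("dcbz 0,%0" : : "r" (p + 384) : "memory");
        }
    }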

Using vector instructions increases throughput considerably, with
double the performance for sizes larger than 1024 bytes.  The
unrolled dcbz loop also shows a performance improvement, doubling
throughput for sizes larger than 8192 bytes.
2014-09-10 07:39:46 -04:00
Adhemerval Zanella
3b473fecdf PowerPC: multiarch bzero cleanup for PPC64
This patch cleans up the multiarch bzero for powerpc64 by removing
the multiarch objects and using instead the embedded memset
implementation present in each multiarch optimization.  The generated
code is essentially the same, except for TB_TOCLESS (which is not
essential).
2014-09-10 07:39:46 -04:00
Adhemerval Zanella
17762f6625 PowerPC: optimized memmove for POWER7/PPC64
This patch adds an optimized memmove for POWER7/powerpc64.
Basically the idea is to use the POWER7 memcpy on non-overlapping
memory regions and an optimized backward memcpy for memory regions
that overlap (similar to the idea of string/memmove.c).

The backward memcpy algorithm used is similar to the one used for the
POWER7 memcpy, with adjustments made for alignment.  The difference is
that memory is always aligned to 16 bytes before VSX/Altivec
instructions are used.
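
The direction choice at the heart of this approach, in minimal C (a
sketch only; the real code dispatches to the optimized vector copies
instead of byte loops):

    #include <stddef.h>
    #include <stdint.h>

    void *
    memmove_sketch (void *dst, const void *src, size_t n)
    {
      char *d = dst;
      const char *s = src;
      /* Forward copy is safe when DST is not inside [SRC, SRC+N);
         the unsigned subtraction covers both d < s and d >= s + n.  */
      if ((uintptr_t) d - (uintptr_t) s >= n)
        while (n--)
          *d++ = *s++;
      else
        while (n--)          /* Overlap: copy backward from the end.  */
          d[n] = s[n];
      return dst;
    }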
2014-07-07 15:41:21 -05:00
Vidya Ranganathan
bc8ea38590 PowerPC: strcat optimization for PPC64/POWER7
This patch adds an ifunc power7 strcat symbol that uses the logic in
sysdeps/powerpc/strcat.c but calls the power7 strlen/strcpy symbols
instead of the default ones.
2014-07-02 14:04:21 -05:00
Vidya Ranganathan
e23d3d2690 PowerPC: Optimized strcmp for PPC64/POWER7
Optimization is achieved on 8-byte-aligned strings with doubleword
comparison using the cmpb instruction.  On unaligned strings, loop
unrolling is applied for a POWER7 gain.
2014-06-11 08:39:31 -05:00
Vidya Ranganathan
f360f94a05 PowerPC: strncpy/stpncpy optimization for PPC64/POWER7
The optimization is achieved by the following techniques:
  > data alignment [gain from aligned memory access on read/write]
  > loop unrolling/unwinding, from which POWER7 gains performance
    [gain by reduction of branch penalty]
  > zero padding done by calling the optimized memset
2014-05-06 09:54:25 -05:00
Adhemerval Zanella
6f23d0939e PowerPC: optimized strpbrk for POWER7
This patch adds an optimized strpbrk for POWER7 using a different
algorithm than the default implementation: it constructs a table based
on the 'accept' argument and uses this table to check for any
occurrence in the input string.  The idea is similar to what x86_64
uses.  For PowerPC some tunings were added, such as unrolled loops and
memory clearing using VSX instructions.
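
The table-based scan, sketched in C (a scalar model; the actual
implementation unrolls the loop and zeroes the table with VSX):

    #include <stddef.h>

    static char *
    strpbrk_sketch (const char *s, const char *accept)
    {
      unsigned char table[256] = { 0 };
      table[0] = 1;                     /* Also stop at the terminator.  */
      for (; *accept; accept++)
        table[(unsigned char) *accept] = 1;

      while (!table[(unsigned char) *s])
        s++;
      return *s ? (char *) s : NULL;
    }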
2014-03-20 19:46:13 -05:00
Adhemerval Zanella
6eaf95cbfa PowerPC: optimized strcspn for PPC64/POWER7
This patch adds an optimized strcspn for POWER7 using a different
algorithm than the default implementation: it constructs a table based
on the 'accept' argument and uses this table to check for any
occurrence in the input string.  The idea is similar to what x86_64
uses.  For PowerPC some tunings were added, such as unrolled loops and
aligning the table's stack memory to 16 bytes (so the VSX clear can
run without alignment issues).
2014-03-20 11:24:52 -05:00
Vidya Ranganathan
e65caf1f1d PowerPC: strspn optimization for PPC64/POWER7
The optimization is achieved by the following techniques:
  > hashing of the needle.
  > hashing avoids scanning duplicate entries in the needle across the string.
  > initializing the hash table with vector instructions (VSX) by quadword access.
  > unrolling when scanning for a character in the string across the hash table.
2014-03-11 08:54:33 -05:00