glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-12-04 02:40:06 +00:00

Author	SHA1	Message	Date
Adhemerval Zanella	1e9a550ba4	powerpc: Remove power7 strstr optimization The optimization is not faster than the generic algorithm, using the bench-strstr the geometric mean running on a POWER10 machine using gcc 13.1.1 is 482.47 while the default __strstr_ppc is 340.97 (which uses the generic implementation). Also, there is no need to redirect the internal str/mem call to optimized version, internal ifunc is supported and enabled for internal calls (meaning that the generic implementation will use any asm optimization if available). Checked on powerpc64le-linux-gnu. Reviewed-by: Peter Bergner <bergner@linux.ibm.com>	2024-02-23 08:50:00 -03:00
MAHESH BODAPATI	b9182c793c	powerpc : Add optimized memchr for POWER10 Optimized memchr for POWER10 based on existing rawmemchr and strlen. Reordering instructions and loop unrolling helped in getting better performance. Reviewed-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2023-12-14 14:40:14 -06:00
Amrita H S	3367d8e180	powerpc: Optimized strcmp for power10 This patch is based on __strcmp_power9 and __strlen_power10. Improvements from __strcmp_power9: 1. Uses new POWER10 instructions - This code uses lxvp to decrease contention on load by loading 32 bytes per instruction. 2. Performance implication - This version has around 30% better performance on average. - Performance regression is seen for a specific combination of sizes and alignments. Some of them is observed without changes also, while rest may be induced by the patch. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com> Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com>	2023-12-07 11:10:40 -06:00
Adhemerval Zanella	59db5735e6	powerpc: Disable stack protector in early static initialization Similar to `fb95c31638`, also disable for string-ppc64.c (pulled on rltd as the default string implementation). Checked on powerpc64-linux-gnu.	2023-04-03 17:42:08 -03:00
Adhemerval Zanella Netto	92fdb11ae7	powerpc: Remove powerpc64 strncmp variants The default, and power7 implementation just adds word aligned access when inputs have the same aligment. The unaligned case is still done by byte operations. This is already covered by the generic implementation, which also add the unaligned input optimization. Checked on powerpc64-linux-gnu built without multi-arch for powerpc64, power7, power8, and power9 (build for le). Reviewed-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2023-03-02 16:41:43 -03:00
Rajalakshmi Srinivasaraghavan	2f47198b04	powerpc64: Remove old strncmp optimization This patch cleans up the power4 strncmp optimization for powerpc64 which is unlikely to be used anywhere. Tested on ppc64le with and without --disable-multi-arch flag. Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-12-02 14:26:41 -06:00
Adhemerval Zanella	86a82cd57c	powerpc: Remove bcopy optimizations The symbols is not present in current POSIX specification and compiler already generates memmove call.	2022-02-23 14:06:49 -03:00
Lucas A. M. Magalhaes	a55e2da270	powerpc: Optimized memcmp for power10 This patch was based on the __memcmp_power8 and the recent __strlen_power10. Improvements from __memcmp_power8: 1. Don't need alignment code. On POWER10 lxvp and lxvl do not generate alignment interrupts, so they are safe for use on caching-inhibited memory. Notice that the comparison on the main loop will wait for both VSR to be ready. Therefore aligning one of the input address does not improve performance. In order to align both registers a vperm is necessary which add too much overhead. 2. Uses new POWER10 instructions This code uses lxvp to decrease contention on load by loading 32 bytes per instruction. The vextractbm is used to have a smaller tail code for calculating the return value. 3. Performance improvement This version has around 35% better performance on average. I saw no performance regressions for any length or alignment. Thanks Matheus for helping me out with some details. Co-authored-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>	2021-05-31 18:00:20 -03:00
Matheus Castanho	1a594aa986	powerpc: Add optimized rawmemchr for POWER10 Reuse code for optimized strlen to implement a faster version of rawmemchr. This takes advantage of the same benefits provided by the strlen implementation, but needs some extra steps. __strlen_power10 code should be unchanged after this change. rawmemchr returns a pointer to the char found, while strlen returns only the length, so we have to take that into account when preparing the return value. To quickly check 64B, the loop on __strlen_power10 merges the whole block into 16B by using unsigned minimum vector operations (vminub) and checks if there are any \0 on the resulting vector. The same code is used by rawmemchr if the char c is 0. However, this approach does not work when c != 0. We first need to subtract each byte by c, so that the value we are looking for is converted to a 0, then taking the minimum and checking for nulls works again. The new code branches after it has compared ~256 bytes and chooses which of the two strategies above will be used in the main loop, based on the char c. This extra branch adds some overhead (~5%) for length ~256, but is quickly amortized by the faster loop for larger sizes. Compared to __rawmemchr_power9, this version is ~20% faster for length < 256. Because of the optimized main loop, the improvement becomes ~35% for c != 0 and ~50% for c = 0 for strings longer than 256. Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com> Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>	2021-05-17 10:30:35 -03:00
Raoni Fassina Firmino	23fdf8178c	powerpc64le: Optimize memset for POWER10 This implementation is based on __memset_power8 and integrates a lot of suggestions from Anton Blanchard. The biggest difference is that it makes extensive use of stxvl to alignment and tail code to avoid branches and small stores. It has three main execution paths: a) "Short lengths" for lengths up to 64 bytes, avoiding as many branches as possible. b) "General case" for larger lengths, it has an alignment section using stxvl to avoid branches, a 128 bytes loop and then a tail code, again using stxvl with few branches. c) "Zeroing cache blocks" for lengths from 256 bytes upwards and set value being zero. It is mostly the __memset_power8 code but the alignment phase was simplified because, at this point, address is already 16-bytes aligned and also changed to use vector stores. The tail code was also simplified to reuse the general case tail. All unaligned stores use stxvl instructions that do not generate alignment interrupts on POWER10, making it safe to use on caching-inhibited memory. On average, this implementation provides something around 30% improvement when compared to __memset_power8. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>	2021-04-30 18:12:08 -03:00
Tulio Magno Quites Machado Filho	e941e0ae80	powerpc64le: Optimize memcpy for POWER10 This implementation is based on __memcpy_power8_cached and integrates suggestions from Anton Blanchard. It benefits from loads and stores with length for short lengths and for tail code, simplifying the code. All unaligned memory accesses use instructions that do not generate alignment interrupts on POWER10, making it safe to use on caching-inhibited memory. The main loop has also been modified in order to increase instruction throughput by reducing the dependency on updates from previous iterations. On average, this implementation provides around 30% improvement when compared to __memcpy_power7 and 10% improvement in comparison to __memcpy_power8_cached.	2021-04-30 18:12:08 -03:00
Lucas A. M. Magalhaes	dd59655e93	powerpc64le: Optimized memmove for POWER10 This patch was initially based on the __memmove_power7 with some ideas from strncpy implementation for Power 9. Improvements from __memmove_power7: 1. Use lxvl/stxvl for alignment code. The code for Power 7 uses branches when the input is not naturally aligned to the width of a vector. The new implementation uses lxvl/stxvl instead which reduces pressure on GPRs. It also allows the removal of branch instructions, implicitly removing branch stalls and mispredictions. 2. Use of lxv/stxv and lxvl/stxvl pair is safe to use on Cache Inhibited memory. On Power 10 vector load and stores are safe to use on CI memory for addresses unaligned to 16B. This code takes advantage of this to do unaligned loads. The unaligned loads don't have a significant performance impact by themselves. However doing so decreases register pressure on GPRs and interdependence stalls on load/store pairs. This also improved readability as there are now less code paths for different alignments. Finally this reduces the overall code size. 3. Improved performance. This version runs on average about 30% better than memmove_power7 for lengths larger than 8KB. For input lengths shorter than 8KB the improvement is smaller, it has on average about 17% better performance. This version has a degradation of about 50% for input lengths in the 0 to 31 bytes range when dest is unaligned. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>	2021-04-30 18:12:08 -03:00
Matheus Castanho	10624a97e8	powerpc: Add optimized strlen for POWER10 Improvements compared to POWER9 version: 1. Take into account first 16B comparison for aligned strings The previous version compares the first 16B and increments r4 by the number of bytes until the address is 16B-aligned, then starts doing aligned loads at that address. For aligned strings, this causes the first 16B to be compared twice, because the increment is 0. Here we calculate the next 16B-aligned address differently, which avoids that issue. 2. Use simple comparisons for the first ~192 bytes The main loop is good for big strings, but comparing 16B each time is better for smaller strings. So after aligning the address to 16 Bytes, we check more 176B in 16B chunks. There may be some overlaps with the main loop for unaligned strings, but we avoid using the more aggressive strategy too soon, and also allow the loop to start at a 64B-aligned address. This greatly benefits smaller strings and avoids overlapping checks if the string is already aligned at a 64B boundary. 3. Reduce dependencies between load blocks caused by address calculation on loop Doing a precise time tracing on the code showed many loads in the loop were stalled waiting for updates to r4 from previous code blocks. This implementation avoids that as much as possible by using 2 registers (r4 and r5) to hold addresses to be used by different parts of the code. Also, the previous code aligned the address to 16B, then to 64B by doing a few 48B loops (if needed) until the address was aligned. The main loop could not start until that 48B loop had finished and r4 was updated with the current address. Here we calculate the address used by the loop very early, so it can start sooner. The main loop now uses 2 pointers 128B apart to make pointer updates less frequent, and also unrolls 1 iteration to guarantee there is enough time between iterations to update the pointers, reducing stalled cycles. 4. Use new P10 instructions lxvp is used to load 32B with a single instruction, reducing contention in the load queue. vextractbm allows simplifying the tail code for the loop, replacing vbpermq and avoiding having to generate a permute control vector. Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com> Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com> Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com>	2021-04-22 16:18:06 -03:00
Raphael M Zinsly	7beee7b39a	powerpc: Add optimized stpncpy for POWER9 Add stpncpy support into the POWER9 strncpy. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>	2020-11-12 13:16:36 -03:00
Raphael M Zinsly	b9d83bf3eb	powerpc: Add optimized strncpy for POWER9 Similar to the strcpy P9 optimization, this version uses VSX to improve performance. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>	2020-11-12 13:12:24 -03:00
Paul E. Murphy	a23bd00f9d	powerpc64le: add optimized strlen for P9 This started as a trivial change to Anton's rawmemchr. I got carried away. This is a hybrid between P8's asympotically faster 64B checks with extremely efficient small string checks e.g <64B (and sometimes a little bit more depending on alignment). The second trick is to align to 64B by running a 48B checking loop 16B at a time until we naturally align to 64B (i.e checking 48/96/144 bytes/iteration based on the alignment after the first 5 comparisons). This allieviates the need to check page boundaries. Finally, explicly use the P7 strlen with the runtime loader when building P9. We need to be cautious about vector/vsx extensions here on P9 only builds.	2020-06-05 15:30:00 -05:00
Anton Blanchard	765de945ef	powerpc: Optimized rawmemchr for POWER9 This version uses vector instructions and is up to 60% faster on medium matches and up to 90% faster on long matches, compared to the POWER7 version. A few examples: __rawmemchr_power9 __rawmemchr_power7 Length 32, alignment 0: 2.27566 3.77765 Length 64, alignment 2: 2.46231 3.51064 Length 1024, alignment 0: 17.3059 32.6678	2020-05-18 17:08:54 -05:00
Anton Blanchard via Libc-alpha	aa70d05632	powerpc: Optimized stpcpy for POWER9 Add stpcpy support to the POWER9 strcpy. This is up to 40% faster on small strings and up to 90% faster on long relatively unaligned strings, compared to the POWER8 version. A few examples: __stpcpy_power9 __stpcpy_power8 Length 20, alignments in bytes 4/ 4: 2.58246 4.8788 Length 1024, alignments in bytes 1/ 6: 24.8186 47.8528	2020-05-18 08:26:22 -05:00
Anton Blanchard via Libc-alpha	3903704850	powerpc: Optimized strcpy for POWER9 This version uses VSX store vector with length instructions and is significantly faster on small strings and relatively unaligned large strings, compared to the POWER8 version. A few examples: __strcpy_power9 __strcpy_power8 Length 16, alignments in bytes 0/ 0: 2.52454 4.62695 Length 412, alignments in bytes 4/ 0: 11.6 22.9185	2020-05-18 08:26:22 -05:00
Adhemerval Zanella	f82ed45d7f	powerpc: Use generic wcsrchr optimization This patch removes the power6 wcsrchr optimization and uses generic implementation instead. Currently, both power6 and power7 IFUNC variant resulting binary are essentially the same and the generic implementation with unrolling loop set to 8 also results in similar performance. Checked on powerpc64-linux-gnu. * sysdeps/powerpc/Makefile [$(subdir) == wcsmbs] (CFLAGS-wcsrchr.c): New rule. * sysdeps/powerpc/power6/wcsrchr.c: Remove file. * sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr-power6.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr-power7.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr-ppc32.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcsrchr-power6.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcsrchr-power7.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcsrchr-ppc64.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcsrchr.c: Likewise. * sysdeps/powerpc/powerpc64/power6/wcsrchr.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/Makefile [$(subdir) == wcsmbs] (sysdeps_routines): Remove wcsrchr-power6 and wcsrchr-power7. (CFLAGS-wcsrchr-power7.c, CFLAGS-wcsrchr-power6.c): Remove rule. * sysdeps/powerpc/powerpc64/multiarch/Makefile: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c: Remove wcsrchr optimizations. * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.	2019-04-04 16:01:14 +07:00
Adhemerval Zanella	421e3005ca	powerpc: Use generic wcschr optimization This patch removes the power6 wcschr optimization and uses generic implementation instead. Currently, both power6 and power7 IFUNC variant resulting binary are essentially the same and the generic implementation with unrolling loop set to 8 also results in similar performance. Checked on powerpc64-linux-gnu. * sysdeps/powerpc/Makefile [$(subdir) == wcsmbs] (CFLAGS-wcschr.c): New rule. * sysdeps/powerpc/power6/wcschr.c: Remove file. * sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-power6.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-power7.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-ppc32.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/wcschr.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcschr-power6.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcschr-power7.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcschr-ppc64.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcschr.c: Likewise. * sysdeps/powerpc/powerpc64/power6/wcschr.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/Makefile [$(subdir) == wcsmbs] (sysdeps_routines): Remove wcschr-power6 and wcschr-power7. (CFLAGS-wcschr-power7.c, CFLAGS-wcschr-power6.c): Remove rule. * sysdeps/powerpc/powerpc64/multiarch/Makefile: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c: Remove wcschr optimizations. * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.	2019-04-04 16:01:14 +07:00
Adhemerval Zanella	447a1306c3	powerpc: Use generic wcscpy optimization This patch removes the power6 wcscpy optimization and uses generic implementation instead. Currently, both power6 and power7 IFUNC variant resulting binary are essentially the same and the generic implementation with unrolling loop set to 8 also results in similar performance. Checked on powerpc64-linux-gnu. * sysdeps/powerpc/Makefile [$(subdir) == wcsmbs] (CFLAGS-wcscpy.c): New rule. * sysdeps/powerpc/power6/wcscpy.c: Remove file. * sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-power6.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-power7.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-ppc32.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcscpy-power6.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcscpy-power7.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcscpy-ppc64.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/wcscpy.c: Likewise. * sysdeps/powerpc/powerpc64/power6/wcscpy.c: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/Makefile [$(subdir) == wcsmbs] (sysdeps_routines): Remove wcscpy-power6 and wcscpy-power7. (CFLAGS-wcscpy-power7.c, CFLAGS-wcscpy-power6.c): Remove rule. * sysdeps/powerpc/powerpc64/multiarch/Makefile: Likewise. * sysdeps/powerpc/powerpc32/power4/multiarch/ifunc-impl-list.c: Remove wcscpy optimizations. * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.	2019-04-04 16:01:14 +07:00
Rajalakshmi Srinivasaraghavan	7793ad7a2c	powerpc: Rearrange little endian specific files This patch moves little endian specific POWER9 optimization files to sysdeps/powerpc/powerpc64/le and creates POWER9 ifunc functions only for little endian.	2018-08-16 12:12:02 +05:30
Adhemerval Zanella	c9cd7b0ce5	powerpc: POWER8 memcpy optimization for cached memory On POWER8, unaligned memory accesses to cached memory has little impact on performance as opposed to its ancestors. It is disabled by default and will only be available when the tunable glibc.tune.cached_memopt is set to 1. __memcpy_power8_cached __memcpy_power7 ============================================================ max-size=4096: 33325.70 ( 12.65%) 38153.00 max-size=8192: 32878.20 ( 11.17%) 37012.30 max-size=16384: 33782.20 ( 11.61%) 38219.20 max-size=32768: 33296.20 ( 11.30%) 37538.30 max-size=65536: 33765.60 ( 10.53%) 37738.40 * manual/tunables.texi (Hardware Capability Tunables): Document glibc.tune.cached_memopt. * sysdeps/powerpc/cpu-features.c: New file. * sysdeps/powerpc/cpu-features.h: New file. * sysdeps/powerpc/dl-procinfo.c [!IS_IN(ldconfig)]: Add _dl_powerpc_cpu_features. * sysdeps/powerpc/dl-tunables.list: New file. * sysdeps/powerpc/ldsodefs.h: Include cpu-features.h. * sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h (INIT_ARCH): Initialize use_aligned_memopt. * sysdeps/powerpc/powerpc64/dl-machine.h [defined(SHARED && IS_IN(rtld))]: Restrict dl_platform_init availability and initialize CPU features used by tunables. * sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines): Add memcpy-power8-cached. * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Add __memcpy_power8_cached. * sysdeps/powerpc/powerpc64/multiarch/memcpy.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/memcpy-power8-cached.S: New file. Reviewed-by: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>	2017-12-11 17:39:42 -02:00
Rajalakshmi Srinivasaraghavan	59ba2d2b54	powerpc: Optimize memrchr for power8 Vectorized loops are used for sizes greater than 32B to improve performance over power7 optimization. This shows as an average of 25% improvement depending on the position of search character. The performance is same for shorter strings.	2017-10-02 17:31:13 +05:30
Rajalakshmi Srinivasaraghavan	43e0ac24c8	powerpc: Optimize memchr for power8 Vectorized loops are used for sizes greater than 32B to improve performance over power7 optimiztion.	2017-06-21 10:55:12 +05:30
Rajalakshmi Srinivasaraghavan	dec4a7105e	powerpc: Improve memcmp performance for POWER8 Vectorization improves performance over the current implementation. Tested on powerpc64 and powerpc64le.	2017-05-18 11:21:20 +05:30
Rajalakshmi Srinivasaraghavan	6c6ab1fc49	powerpc64: strrchr optimization for power8 P7 code is used for <=32B strings and for > 32B vectorized loops are used. This shows as an average 25% improvement depending on the position of search character. The performance is same for shorter strings. Tested on ppc64 and ppc64le.	2017-04-18 11:28:56 +05:30
Rajalakshmi Srinivasaraghavan	249dcdb71b	powerpc: Optimized strncat for POWER8 With new optimized strnlen for POWER8 [1], this patch adds strncat for power8 to make use of optimized strlen and strnlen. This is faster than POWER7 current implementation for larger strings. Tested on powerpc64 and powerpc64le. [1] https://sourceware.org/ml/libc-alpha/2017-03/msg00491.html * sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines): Add strncat-power8. * sysdeps/powerpc/powerpc64/multiarch/strncat.c (strncat): Add __strncat_power8 to ifunc list. * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c (strncat): Add __strncat_power8 to list of strncat functions. * sysdeps/powerpc/powerpc64/multiarch/strncat-power8.c: New file.	2017-04-13 11:29:20 +05:30
Wainer dos Santos Moschetta	18b6e2c86c	powerpc64: Add POWER8 strnlen Added strnlen POWER8 otimized for long strings. It delivers same performance as POWER7 implementation for short strings. This takes advantage of reasonably performing unaligned loads and bit permutes to check the first 1-16 bytes until quadword aligned, then checks in 64 bytes strides until unsafe, then 16 bytes, truncating the count if need be. Likewise, the POWER7 code is recycled for less than 32 bytes strings. Tested on ppc64 and ppc64le. * sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines): Add strnlen-power8. * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c (strnlen): Add __strnlen_power8 to list of strnlen functions. * sysdeps/powerpc/powerpc64/multiarch/strnlen-power8.S: New file. * sysdeps/powerpc/powerpc64/multiarch/strnlen.c (__strnlen): Add __strnlen_power8 to ifunc list. * sysdeps/powerpc/powerpc64/power8/strnlen.S: New file.	2017-04-05 10:26:58 -03:00
Rajalakshmi Srinivasaraghavan	9314d3545e	powerpc64: strchr/strchrnul optimization for power8 The P7 code is used for <=32B strings and for > 32B vectorized loops are used. This shows as an average 25% improvement depending on the position of search character. The performance is same for shorter strings. Tested on ppc64 and ppc64le.	2016-12-28 11:44:31 -02:00
Rajalakshmi Srinivasaraghavan	d89060d603	powerpc: strncmp optimization for power9 Vectorized loops are used for strings > 32B when compared to power8 optimization. Tested on power9 ppc64le simulator.	2016-12-13 10:53:42 +05:30
Rajalakshmi Srinivasaraghavan	80ab6401a9	powerpc: strcmp optimization for power9 Vectorized loops are used for strings > 32B when compared to power8 optimization. Tested on power9 ppc64le simulator.	2016-12-01 11:35:43 +05:30
raji	c8376f3e07	powerpc: strcasecmp/strncasecmp optmization for power8 This implementation utilizes vectors to improve performance compared to current byte by byte implementation for POWER7. The performance improvement is upto 4x. This patch is tested on powerpc64 and powerpc64le.	2016-06-14 14:51:16 +05:30
Paul E. Murphy	8f1b841e45	powerpc: Add optimized strcspn for P8 A few minor adjustments to the P8 strspn gives us an almost equally optimized P8 strcspn.	2016-04-25 09:11:02 -05:00
Rajalakshmi Srinivasaraghavan	e413b14e18	powerpc: strcasestr optmization for power8 This patch optimizes strcasestr function for power >= 8 systems. The average improvement of this optimization is ~40% and compares 16 bytes at a time using vector instructions. This patch is tested on powerpc64 and powerpc64le.	2016-04-22 19:23:13 +05:30
Carlos Eduardo Seo	1b045ee53e	powerpc: Optimization for strlen for POWER8. This implementation takes advantage of vectorization to improve performance of the loop over the current strlen implementation for POWER7.	2016-04-15 17:19:19 -03:00
Paul E. Murphy	25dba0ad05	powerpc: Add optimized P8 strspn This utilizes vectors and bitmasks. For small needle, large haystack, the performance improvement is upto 8x. For short strings (0-4B), the cost of computing the bitmask dominates, and is a tad slower.	2016-04-07 15:51:28 -05:00
Rajalakshmi Srinivasaraghavan	b42f8cad52	powerpc: strstr optimization This patch optimizes strstr function for power >= 7 systems. Performance gain is obtained using aligned memory access and usage of cmpb instruction for quicker comparison. The average improvement of this optimization is ~40%. Tested on ppc64 and ppc64le. 2015-07-16 Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com> * sysdeps/powerpc/powerpc64/multiarch/Makefile: Add strstr(). * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise. * sysdeps/powerpc/powerpc64/power7/strstr.S: New File. * sysdeps/powerpc/powerpc64/multiarch/strstr-power7.S: New File. * sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c: New File. * sysdeps/powerpc/powerpc64/multiarch/strstr.c: New File.	2015-07-16 13:43:51 -03:00
Adhemerval Zanella	b269211467	powerpc: wordcopy/memmove cleanup for ppc64 This patch cleanup some multiarch code related to memmmove optimization. Initial IFUNC support added specialized wordcopy symbols which turned in local IFUNC calls used by memmove default implementation. This change by removing then and used the optimized memmove instead for supported chips.	2015-02-09 06:42:28 -05:00
Adhemerval Zanella	18e270aada	powerpc: Remove POWER7 wordcopy ifunc This patch remove the POWER7 ifunc wordcopy function (_wordcopy_*_power7), since now GLIBC provides a optimized memmove/bcopy for POWER7.	2015-02-09 06:42:28 -05:00
Adhemerval Zanella	3001e54c57	powerpc: multiarch Makefile cleanup for powerpc64 This patch cleanups the multiarch Makefile by putting the wide chars implementation to correct wcsmbs rule.	2015-02-09 06:42:27 -05:00
Adhemerval Zanella	d3b00f468b	powerpc: Optimized strncmp for POWER8/PPC64 This patch adds an optimized POWER8 strncmp. The implementation focus on speeding up unaligned cases follwing the ideas of power8 strcmp. The algorithm first check the initial 16 bytes, then align the first function source and uses unaligned loads on second argument only. Aditional checks for page boundaries are done for unaligned cases (where sources alignment are different).	2015-01-13 14:35:40 -05:00
Adhemerval Zanella	8bedcb5f03	powerpc: Optimized strcmp for POWER8/PPC64 This patch adds an optimized POWER8 strcmp using unaligned accesses. The algorithm first check the initial 16 bytes, then align the first function source and uses unaligned loads on second argument only. Aditional checks for page boundaries are done for unaligned cases	2015-01-13 11:28:58 -05:00
Adhemerval Zanella	f06a4faf8a	powerpc: Optimized st{r,p}ncpy for POWER8/PPC64 This patch adds an optimized POWER8 st{r,p}ncpy using unaligned accesses. It shows 10%-80% improvement over the optimized POWER7 one that uses only aligned accesses, specially on unaligned inputs. The algorithm first read and check 16 bytes (if inputs do not cross a 4K page size). The it realign source to 16-bytes and issue a 16 bytes read and compare loop to speedup null byte checks for large strings. Also, different from POWER7 optimization, the null pad is done inline in the implementation using possible unaligned accesses, instead of realying on a memset call. Special case is added for page cross reads.	2015-01-13 11:28:44 -05:00
Adhemerval Zanella	94c9680945	powerpc: Optimized strcat for POWER8/PPC64 With new optimized strcpy for POWER8, this patch adds an optimized strcat which uses it along with default implementation at strings/.	2015-01-13 11:28:36 -05:00
Adhemerval Zanella	96d6fd6c40	powerpc: Optimized st{r,p}cpy for POWER8/PPC64 This patch adds an optimized POWER8 strcpy using unaligned accesses. For strings up to 16 bytes the implementation first calculate the string size, like strlen, and issues a memcpy. For larger strings, source is first aligned to 16 bytes and then tested over a loop that reads 16 bytes am combine the cmpb results for speedup. Special case is added for page cross reads. It shows 30%-60% improvement over the optimized POWER7 one that uses only aligned accesses.	2015-01-13 11:28:30 -05:00
Adhemerval Zanella	0f0a1c82f5	powerpc: Add powerpc64 strpbrk optimization This patch makes the POWER7 optimized strpbrk generic by using default doubleword stores to zero the hash, instead of VSX instructions. Performance on POWER7/POWER8 does not change.	2014-12-02 13:34:02 -05:00
Adhemerval Zanella	bb2542e0ae	powerpc: Add powerpc64 strcspn optimization This patch makes the POWER7 optimized strcspn generic by using default doubleword stores to zero the hash, instead of VSX instructions. Performance on POWER7/POWER8 does not change.	2014-12-02 07:16:24 -05:00
Adhemerval Zanella	2e8a2de2da	powerpc: Add powerpc64 strspn optimization This patch makes the POWER7 optimized strspn generic by using default doubleword stores to zero the hash, instead of VSX instructions. Performance on POWER7/POWER8 machines does not changed.	2014-12-02 07:15:58 -05:00

1 2

81 Commits