glibc/sysdeps/powerpc/powerpc64/multiarch/Makefile


ifeq ($(subdir),string)
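# Multiarch (IFUNC) variants of the string and memory routines; the best
# implementation for the running CPU is selected at load time.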
sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
                   memcpy-cell memcpy-power4 memcpy-ppc64 \
                   memcmp-power8 memcmp-power7 memcmp-power4 memcmp-ppc64 \
                   memset-power7 memset-power6 memset-power4 \
                   memset-ppc64 memset-power8 \
                   mempcpy-power7 mempcpy-ppc64 \
                   memchr-power8 memchr-power7 memchr-ppc64 \
                   memrchr-power8 memrchr-power7 memrchr-ppc64 \
                   rawmemchr-power7 rawmemchr-ppc64 \
                   strlen-power7 strlen-ppc64 \
                   strnlen-power8 strnlen-power7 strnlen-ppc64 \
                   strcasecmp-power7 strcasecmp_l-power7 \
                   strncase-power7 strncase_l-power7 \
                   strncmp-power8 strncmp-power7 \
                   strncmp-power4 strncmp-ppc64 \
                   strchr-power8 strchr-power7 strchr-ppc64 \
                   strchrnul-power8 strchrnul-power7 strchrnul-ppc64 \
                   strcpy-power8 strcpy-power7 strcpy-ppc64 stpcpy-power8 \
                   stpcpy-power7 stpcpy-ppc64 \
                   strrchr-power8 strrchr-power7 strrchr-ppc64 \
                   strncat-power8 strncat-power7 strncat-ppc64 \
                   strncpy-power7 strncpy-ppc64 \
                   stpncpy-power8 stpncpy-power7 stpncpy-ppc64 \
                   strcmp-power8 strcmp-power7 strcmp-ppc64 \
                   strcat-power8 strcat-power7 strcat-ppc64 \
                   memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
                   strncpy-power8 strstr-power7 strstr-ppc64 \
                   strspn-power8 strspn-ppc64 strcspn-power8 strcspn-ppc64 \
                   strlen-power8 strcasestr-power8 strcasestr-ppc64 \
                   strcasecmp-ppc64 strcasecmp-power8 strncase-ppc64 \
                   strncase-power8
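# POWER9 and POWER10 variants are built only for little-endian
# (powerpc64le) configurations.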
ifneq (,$(filter %le,$(config-machine)))
sysdep_routines += memcpy-power10 memmove-power10 memset-power10 \
                   strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
                   rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9 \
                   strlen-power10
endif
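# The POWER7 strncasecmp/strncasecmp_l variants are plain C sources; build
# them with POWER7 code generation and unrolled loops.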
CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
endif
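
Each object listed above provides one alternative implementation of a routine; at run time glibc dispatches to the most suitable variant through GNU IFUNC resolvers. The fragment below is a minimal, standalone sketch of that mechanism, not glibc's own code: the my_memcpy* names, the getauxval call in the resolver, and the PPC_FEATURE2_ARCH_2_07 test are illustrative assumptions for an ordinary powerpc64 Linux program built with GCC.

/* Sketch only: run-time selection of a routine via a GNU IFUNC resolver.
   Not glibc's implementation; names and the hwcap test are illustrative.  */
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP2, PPC_FEATURE2_ARCH_2_07 */

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Stand-ins for the optimized variants built from the objects above.  */
static void *my_memcpy_generic (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n-- > 0)
    *d++ = *s++;
  return dst;
}

static void *my_memcpy_power8 (void *dst, const void *src, size_t n)
{
  /* A real variant would contain POWER8-specific code.  */
  return my_memcpy_generic (dst, src, n);
}

/* The resolver runs once, during relocation, and its return value is the
   function every later call will use.  (glibc's own resolvers read hwcap
   bits already cached by the loader rather than calling getauxval.)  */
static memcpy_fn my_memcpy_resolver (void)
{
  unsigned long hwcap2 = getauxval (AT_HWCAP2);
  return (hwcap2 & PPC_FEATURE2_ARCH_2_07)
         ? my_memcpy_power8 : my_memcpy_generic;
}

void *my_memcpy (void *dst, const void *src, size_t n)
  __attribute__ ((ifunc ("my_memcpy_resolver")));

A call to my_memcpy then lands directly on whichever variant the resolver returned, with no per-call dispatch overhead.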