glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-11-21 20:40:05 +00:00

Author	SHA1	Message	Date
Noah Goldstein	c510681a69	x86/string: Use `movsl` instead of `movsd` in strncpy/strncat [BZ #32344 ] `ld`, starting at 2.40, emits a warning when using `movsd`. There is no change to the actual code produced. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-11-13 10:09:30 -06:00
Noah Goldstein	6754b5becf	x86/string: Use `movsl` instead of `movsd` [BZ #32344 ] `ld`, starting at 2.40, emits a warning when using `movsd`. There is no change to the actual code produced. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-11-08 17:23:05 -06:00
Noah Goldstein	483443d321	x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212 ] The loop should be aligned to 32-bytes so that it can ideally run out the DSB. This is particularly important on Skylake-Server where deficiencies in it's DSB implementation make it prone to not being able to run loops out of the DSB. For example running strcmp-evex on 200Mb string: 32-byte aligned loop: - 43,399,578,766 idq.dsb_uops not 32-byte aligned loop: - 6,060,139,704 idq.dsb_uops This results in a 25% performance degradation for the non-aligned version. The fix is to just ensure the code layout is such that the loop is aligned. (Which was previously the case but was accidentally dropped in `84e7c46df`). NB: The fix was actually 64-byte alignment. This is because 64-byte alignment generally produces more stable performance than 32-byte aligned code (cache line crosses can affect perf), so if we are going past 16-byte alignmnent, might as well go to 64. 64-byte alignment also matches most other functions we over-align, so it creates a common point of optimization. Times are reported as ratio of Time_With_Patch / Time_Without_Patch. Lower is better. The values being reported is the geometric mean of the ratio across all tests in bench-strcmp and bench-strncmp. Note this patch is only attempting to improve the Skylake-Server strcmp for long strings. The rest of the numbers are only to test for regressions. Tigerlake Results Strings <= 512: strcmp : 1.026 strncmp: 0.949 Tigerlake Results Strings > 512: strcmp : 0.994 strncmp: 0.998 Skylake-Server Results Strings <= 512: strcmp : 0.945 strncmp: 0.943 Skylake-Server Results Strings > 512: strcmp : 0.778 strncmp: 1.000 The 2.6% regression on TGL-strcmp is due to slowdowns caused by changes in alignment of code handling small sizes (most on the page-cross logic). These should be safe to ignore because 1) We previously only 16-byte aligned the function so this behavior is not new and was essentially up to chance before this patch and 2) this type of alignment related regression on small sizes really only comes up in tight micro-benchmark loops and is unlikely to have any affect on realworld performance. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-09-30 07:40:40 -07:00
Matthew Sterrett	294a892769	x86: Unifies 'strnlen-evex' and 'strnlen-evex512' implementations. This commit uses a common implementation 'strnlen-evex-base.S' for both 'strnlen-evex' and 'strnlen-evex512' This patch serves both to reduce the number of implementations, and it also does some small optimizations that benefit strnlen-evex and strnlen-evex512. All tests pass on x86. Benchmarks were taken on SKX. https://www.intel.com/content/www/us/en/products/sku/123613/intel-core-i97900x-xseries-processor-13-75m-cache-up-to-4-30-ghz/specifications.html Geometric mean for strnlen-evex over all benchmarks (N=10) was (new/old) 0.881 Geometric mean for strnlen-evex512 over all benchmarks (N=10) was (new/old) 0.953 Code Size Changes: strnlen-evex : +31 bytes strnlen-evex512 : +156 bytes Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-08-19 07:34:02 -07:00
Noah Goldstein	f446d90fe6	x86: Add `Avoid_STOSB` tunable to allow NT memset without ERMS The goal of this flag is to allow targets which don't prefer/have ERMS to still access the non-temporal memset implementation. There are 4 cases for tuning memset: 1) `Avoid_STOSB && Avoid_Non_Temporal_Memset` - Memset with temporal stores 2) `Avoid_STOSB && !Avoid_Non_Temporal_Memset` - Memset with temporal/non-temporal stores. Non-temporal path goes through `rep stosb` path. We accomplish this by setting `x86_rep_stosb_threshold` to `x86_memset_non_temporal_threshold`. 3) `!Avoid_STOSB && Avoid_Non_Temporal_Memset` - Memset with temporal stores/`rep stosb` 3) `!Avoid_STOSB && !Avoid_Non_Temporal_Memset` - Memset with temporal stores/`rep stosb`/non-temporal stores. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-08-15 08:19:15 -07:00
Noah Goldstein	7da0886247	x86: Fix bug in strchrnul-evex512 [BZ #32078 ] Issue was we were expecting not matches with CHAR before the start of the string in the page cross case. The check code in the page cross case: ``` and $0xffffffffffffffc0,%rax vmovdqa64 (%rax),%zmm17 vpcmpneqb %zmm17,%zmm16,%k1 vptestmb %zmm17,%zmm17,%k0{%k1} kmovq %k0,%rax inc %rax shr %cl,%rax je L(continue) ``` expects that all characters that neither match null nor CHAR will be 1s in `rax` prior to the `inc`. Then the `inc` will overflow all of the 1s where no relevant match was found. This is incorrect in the page-cross case, as the `vmovdqa64 (%rax),%zmm17` loads from before the start of the input string. If there are matches with CHAR before the start of the string, `rax` won't properly overflow. The fix is quite simple. Just replace: ``` inc %rax shr %cl,%rax ``` With: ``` sar %cl,%rax inc %rax ``` The arithmetic shift will clear any matches prior to the start of the string while maintaining the signbit so the 1s can properly overflow to zero in the case of no matches. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-08-15 08:11:33 -07:00
MayShao-oc	c19457aec6	x86_64: Optimize large size copy in memmove-ssse3 This patch optimizes large size copy using normal store when src > dst and overlap. Make it the same as the logic in memmove-vec-unaligned-erms.S. Current memmove-ssse3 use '__x86_shared_cache_size_half' as the non- temporal threshold, this patch updates that value to '__x86_shared_non_temporal_threshold'. Currently, the __x86_shared_non_temporal_threshold is cpu-specific, and different CPUs will have different values based on the related nt-benchmark results. However, in memmove-ssse3, the nontemporal threshold uses '__x86_shared_cache_size_half', which sounds unreasonable. The performance is not changed drastically although shows overall improvements without any major regressions or gains. Results on Zhaoxin KX-7000: bench-memcpy geometric_mean(N=20) New / Original: 0.999 bench-memcpy-random geometric_mean(N=20) New / Original: 0.999 bench-memcpy-large geometric_mean(N=20) New / Original: 0.978 bench-memmove geometric_mean(N=20) New / Original: 1.000 bench-memmmove-large geometric_mean(N=20) New / Original: 0.962 Results on Intel Core i5-6600K: bench-memcpy geometric_mean(N=20) New / Original: 1.001 bench-memcpy-random geometric_mean(N=20) New / Original: 0.999 bench-memcpy-large geometric_mean(N=20) New / Original: 1.001 bench-memmove geometric_mean(N=20) New / Original: 0.995 bench-memmmove-large geometric_mean(N=20) New / Original: 0.936 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-06-30 06:26:43 -07:00
Noah Goldstein	46b5e98ef6	x86: Add seperate non-temporal tunable for memset The tuning for non-temporal stores for memset vs memcpy is not always the same. This includes both the exact value and whether non-temporal stores are profitable at all for a given arch. This patch add `x86_memset_non_temporal_threshold`. Currently we disable non-temporal stores for non Intel vendors as the only benchmarks showing its benefit have been on Intel hardware. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-05-30 12:36:09 -05:00
Noah Goldstein	5bf0ab8057	x86: Improve large memset perf with non-temporal stores [RHEL-29312] Previously we use `rep stosb` for all medium/large memsets. This is notably worse than non-temporal stores for large (above a few MBs) memsets. See: https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing For data using different stategies for large memset on ICX and SKX. Using non-temporal stores can be up to 3x faster on ICX and 2x faster on SKX. Historically, these numbers would not have been so good because of the zero-over-zero writeback optimization that `rep stosb` is able to do. But, the zero-over-zero writeback optimization has been removed as a potential side-channel attack, so there is no longer any good reason to only rely on `rep stosb` for large memsets. On the flip size, non-temporal writes can avoid data in their RFO requests saving memory bandwidth. All of the other changes to the file are to re-organize the code-blocks to maintain "good" alignment given the new code added in the `L(stosb_local)` case. The results from running the GLIBC memset benchmarks on TGL-client for N=20 runs: Geometric Mean across the suite New / Old EXEX256: 0.979 Geometric Mean across the suite New / Old EXEX512: 0.979 Geometric Mean across the suite New / Old AVX2 : 0.986 Geometric Mean across the suite New / Old SSE2 : 0.979 Most of the cases are essentially unchanged, this is mostly to show that adding the non-temporal case didn't add any regressions to the other cases. The results on the memset-large benchmark suite on TGL-client for N=20 runs: Geometric Mean across the suite New / Old EXEX256: 0.926 Geometric Mean across the suite New / Old EXEX512: 0.925 Geometric Mean across the suite New / Old AVX2 : 0.928 Geometric Mean across the suite New / Old SSE2 : 0.924 So roughly a 7.5% speedup. This is lower than what we see on servers (likely because clients typically have faster single-core bandwidth so saving bandwidth on RFOs is less impactful), but still advantageous. Full test-suite passes on x86_64 w/ and w/o multiarch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-05-30 12:36:09 -05:00
Gabi Falk	dd5f891c1a	x86_64: Fix missing wcsncat function definition without multiarch (x86-64-v4) This code expects the WCSCAT preprocessor macro to be predefined in case the evex implementation of the function should be defined with a name different from __wcsncat_evex. However, when glibc is built for x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines WCSNCAT variable instead of WCSCAT to build it as wcsncat. Rename the variable to WCSNCAT, as it is actually a better naming choice for the variable in this case. Reported-by: Kenton Groombridge Link: https://bugs.gentoo.org/921945 Fixes: `64b8b6516b` ("x86: Add evex optimized functions for the wchar_t strcpy family") Signed-off-by: Gabi Falk <gabifalk@gmx.com> Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>	2024-05-08 07:37:59 -07:00
Adhemerval Zanella	721314c980	x86_64: Remove avx512 strstr implementation As indicated in a recent thread, this it is a simple brute-force algorithm that checks the whole needle at a matching character pair (and does so 1 byte at a time after the first 64 bytes of a needle). Also it never skips ahead and thus can match at every haystack position after trying to match all of the needle, which generic implementation avoids. As indicated by Wilco, a 4x larger needle and 16x larger haystack gives a clear 65x slowdown both basic_strstr and __strstr_avx512: "ifuncs": ["basic_strstr", "twoway_strstr", "__strstr_avx512", "__strstr_sse2_unaligned", "__strstr_generic"], { "len_haystack": 65536, "len_needle": 1024, "align_haystack": 0, "align_needle": 0, "fail": 1, "desc": "Difficult bruteforce needle", "timings": [4.0948e+07, 15094.5, 3.20818e+07, 108558, 10839.2] }, { "len_haystack": 1048576, "len_needle": 4096, "align_haystack": 0, "align_needle": 0, "fail": 1, "desc": "Difficult bruteforce needle", "timings": [2.69767e+09, 100797, 2.08535e+09, 495706, 82666.9] } PS: I don't have an AVX512 capable machine to verify this issues, but skimming through the code it does seems to follow what Wilco has described. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-03-27 13:48:16 -03:00
H.J. Lu	ef7f4b1fef	Apply the Makefile sorting fix Apply the Makefile sorting fix generated by sort-makefile-lines.py.	2024-02-15 11:19:56 -08:00
Adhemerval Zanella	491e55beab	x86: Expand the comment on when REP STOSB is used on memset Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-02-13 08:49:43 -08:00
Paul Eggert	dff8da6b3e	Update copyright dates with scripts/update-copyrights	2024-01-01 10:53:40 -08:00
Matthew Sterrett	e957308723	x86: Unifies 'strlen-evex' and 'strlen-evex512' implementations. This commit uses a common implementation 'strlen-evex-base.S' for both 'strlen-evex' and 'strlen-evex512' The motivation is to reduce the number of implementations to maintain. This incidentally gives a small performance improvement. All tests pass on x86. Benchmarks were taken on SKX. https://www.intel.com/content/www/us/en/products/sku/123613/intel-core-i97900x-xseries-processor-13-75m-cache-up-to-4-30-ghz/specifications.html Geometric mean for strlen-evex512 over all benchmarks (N=10) was (new/old) 0.939 Geometric mean for wcslen-evex512 over all benchmarks (N=10) was (new/old) 0.965 Code Size Changes: strlen-evex512.S : +24 bytes wcslen-evex512.S : +54 bytes Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2023-12-18 12:38:01 -06:00
Noah Goldstein	9469261cf1	x86: Only align destination to 1x VEC_SIZE in memset 4x loop Current code aligns to 2x VEC_SIZE. Aligning to 2x has no affect on performance other than potentially resulting in an additional iteration of the loop. 1x maintains aligned stores (the only reason to align in this case) and doesn't incur any unnecessary loop iterations. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>	2023-11-28 12:06:19 -06:00
Adhemerval Zanella	4862d546c0	x86: Use dl-symbol-redir-ifunc.h on cpu-tunables The dl-symbol-redir-ifunc.h redirects compiler-generated libcalls to arch-specific memory implementations to avoid ifunc calls where it is not yet possible. The memcmp-isa-default-impl.h aims to fix the same issue by calling the specific memset implementation directly. Using the memcmp symbol directly allows the compiler to inline the memset calls (especially because _dl_tunable_set_hwcaps uses constants values), generating better code. Checked on x86_64-linux-gnu. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>	2023-11-21 16:15:42 -03:00
Noah Goldstein	b7f8b6b64b	x86: Fix unchecked AVX512-VBMI2 usage in strrchr-evex-base.S strrchr-evex-base used `vpcompress{b\|d}` in the page cross logic but was missing the CPU_FEATURE checks for VBMI2 in the ifunc/ifunc-impl-list. The fix is either to add those checks or change the logic to not use `vpcompress{b\|d}`. Choosing the latter here so that the strrchr-evex implementation is usable on SKX. New implementation is a bit slower, but this is in a cold path so its probably okay.	2023-11-15 11:09:44 -06:00
Noah Goldstein	a3c50bf46a	x86: Prepare `strrchr-evex` and `strrchr-evex512` for AVX10 This commit refactors `strrchr-evex` and `strrchr-evex512` to use a common implementation: `strrchr-evex-base.S`. The motivation is `strrchr-evex` needed to be refactored to not use 64-bit masked registers in preperation for AVX10. Once vec-width masked register combining was removed, the EVEX and EVEX512 implementations can easily be implemented in the same file without any major overhead. The net result is performance improvements (measured on TGL) for both `strrchr-evex` and `strrchr-evex512`. Although, note there are some regressions in the test suite and it may be many of the cases that make the total-geomean of improvement/regression across bench-strrchr are cold. The point of the performance measurement is to show there are no major regressions, but the primary motivation is preperation for AVX10. Benchmarks where taken on TGL: https://www.intel.com/content/www/us/en/products/sku/213799/intel-core-i711850h-processor-24m-cache-up-to-4-80-ghz/specifications.html EVEX geometric_mean(N=5) of all benchmarks New / Original : 0.74 EVEX512 geometric_mean(N=5) of all benchmarks New / Original: 0.87 Full check passes on x86.	2023-10-06 00:18:55 -05:00
Frédéric Bérat	64f9857507	wchar: Avoid PLT entries with _FORTIFY_SOURCE The change is meant to avoid unwanted PLT entries for the wmemset and wcrtomb routines when _FORTIFY_SOURCE is set. On top of that, ensure that *_chk routines have their hidden builtin definitions available. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>	2023-07-05 16:59:48 +02:00
Frédéric Bérat	dd8486ffc1	string: Ensure _chk routines have their hidden builtin definition available If libc_hidden_builtin_{def,proto} isn't properly set for _chk routines, there are unwanted PLT entries in libc.so. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>	2023-07-05 16:59:48 +02:00
H.J. Lu	a8c8889978	x86-64: Use YMM registers in memcmpeq-evex.S Since the assembly source file with -evex suffix should use YMM registers, not ZMM registers, include x86-evex256-vecs.h by default to use YMM registers in memcmpeq-evex.S Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2023-06-01 09:21:14 -07:00
Paul Pluzhnikov	1e9d5987fd	Fix misspellings in sysdeps/x86_64 -- BZ 25337. Applying this commit results in bit-identical rebuild of libc.so.6 math/libm.so.6 elf/ld-linux-x86-64.so.2 mathvec/libmvec.so.1 Reviewed-by: Florian Weimer <fweimer@redhat.com>	2023-05-23 10:25:11 +00:00
Sergey Bugaev	8d873a4904	x86_64: Add rtld-stpncpy & rtld-strncpy Just like the other existing rtld-str* files, this provides rtld with usable versions of stpncpy and strncpy. Signed-off-by: Sergey Bugaev <bugaevc@gmail.com> Message-Id: <20230319151017.531737-22-bugaevc@gmail.com>	2023-04-03 01:17:56 +02:00
Adhemerval Zanella	22999b2f0f	string: Add libc_hidden_proto for memrchr Although static linker can optimize it to local call, it follows the internal scheme to provide hidden proto and definitions. Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>	2023-02-08 17:13:58 -03:00
Adhemerval Zanella	7ea510127e	string: Add libc_hidden_proto for strchrnul Although static linker can optimize it to local call, it follows the internal scheme to provide hidden proto and definitions. Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>	2023-02-08 17:13:56 -03:00
Noah Goldstein	b2c474f8de	x86: Fix strncat-avx2.S reading past length [BZ #30065 ] Occurs when `src` has no null-term. Two cases: 1) Zero-length check is doing: ``` test %rdx, %rdx jl L(zero_len) ``` which doesn't actually check zero (was at some point `decq` and the flag never got updated). The fix is just make the flag `jle` i.e: ``` test %rdx, %rdx jle L(zero_len) ``` 2) Length check in page-cross case checking if we should continue is doing: ``` cmpq %r8, %rdx jb L(page_cross_small) ``` which means we will continue searching for null-term if length ends at the end of a page and there was no null-term in `src`. The fix is to make the flag: ``` cmpq %r8, %rdx jbe L(page_cross_small) ```	2023-01-31 19:13:46 -06:00
Joseph Myers	6d7e8eda9b	Update copyright dates with scripts/update-copyrights	2023-01-06 21:14:39 +00:00
Noah Goldstein	b712be5264	x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863 ] In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b` are concurrently modified as `memcmp` runs, there can be a SIGSEGV in `L(ret_nonzero_vec_end_0)` because the sequential logic assumes that `(rdx - 32 + rax)` is a positive 32-bit integer. To be clear, this change does not mean the usage of `memcmp` is supported. The program behaviour is undefined (UB) in the presence of data races, and `memcmp` is incorrect when the values of `a` and/or `b` are modified concurrently (data race). This UB may manifest itself as a SIGSEGV. That being said, if we can allow the idiomatic use cases, like those in yottadb with opportunistic concurrency control (OCC), to execute without a SIGSEGV, at no cost to regular use cases, then we can aim to minimize harm to those existing users. The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant `addq %rdx, %rax`. The 1-extra byte of code size from using the 64-bit instruction doesn't contribute to overall code size as the next target is aligned and has multiple bytes of `nop` padding before it. As well all the logic between the add and `ret` still fits in the same fetch block, so the cost of this change is basically zero. The relevant sequential logic can be seen in the following pseudo-code: ``` /* * rsi = a * rdi = b * rdx = len - 32 / / cmp a[0:15] and b[0:15]. Since length is known to be [17, 32] in this case, this check is also assumed to cover a[0:(31 - len)] and b[0:(31 - len)]. / movups (%rsi), %xmm0 movups (%rdi), %xmm1 PCMPEQ %xmm0, %xmm1 pmovmskb %xmm1, %eax subl %ecx, %eax jnz L(END_NEQ) / cmp a[len-16:len-1] and b[len-16:len-1]. / movups 16(%rsi, %rdx), %xmm0 movups 16(%rdi, %rdx), %xmm1 PCMPEQ %xmm0, %xmm1 pmovmskb %xmm1, %eax subl %ecx, %eax jnz L(END_NEQ2) ret L(END2): / Position first mismatch. / bsfl %eax, %eax / The sequential version is able to assume this value is a positive 32-bit value because the first check included bytes in range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be greater than `31 - len` so the minimum value of `edx` + `eax` is `(len - 32) + (32 - len) >= 0`. In the concurrent case, however, `a` or `b` could have been changed so a mismatch in `eax` less or equal than `(31 - len)` is possible (the new low bound is `(16 - len)`. This can result in a negative 32-bit signed integer, which when zero extended to 64-bits is a random large value this out out of bounds. / addl %edx, %eax / Crash here because 32-bit negative number in `eax` zero extends to out of bounds 64-bit offset. */ movzbl 16(%rdi, %rax), %ecx movzbl 16(%rsi, %rax), %eax ``` This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e `addq %rdx, %rax`). This prevents the 32-bit zero extension and since `eax` is still a low bound of `16 - len` the `rdx + rax` is bound by `(len - 32) - (16 - len) >= -16`. Since we have a fixed offset of `16` in the memory access this must be in bounds.	2022-12-15 09:09:35 -08:00
H.J. Lu	e5672763c4	x86-64 strncpy: Properly handle the length parameter [BZ# 29839] On x32, the size_t parameter may be passed in the lower 32 bits of a 64-bit register with the non-zero upper 32 bits. The string/memory functions written in assembly can only use the lower 32 bits of a 64-bit register as length or must clear the upper 32 bits before using the full 64-bit register for length. This pach fixes strncpy for x32. Tested on x86-64 and x32. On x86-64, libc.so is the same with and without the fix. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-12-02 08:18:41 -08:00
H.J. Lu	f566b02852	x86-64 strncat: Properly handle the length parameter [BZ# 24097] On x32, the size_t parameter may be passed in the lower 32 bits of a 64-bit register with the non-zero upper 32 bits. The string/memory functions written in assembly can only use the lower 32 bits of a 64-bit register as length or must clear the upper 32 bits before using the full 64-bit register for length. This pach fixes strncat for x32. Tested on x86-64 and x32. On x86-64, libc.so is the same with and without the fix. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-12-02 08:18:10 -08:00
Noah Goldstein	52cf11004e	x86: Add avx2 optimized functions for the wchar_t strcpy family Implemented: wcscat-avx2 (+ 744 bytes wcscpy-avx2 (+ 539 bytes) wcpcpy-avx2 (+ 577 bytes) wcsncpy-avx2 (+1108 bytes) wcpncpy-avx2 (+1214 bytes) wcsncat-avx2 (+1085 bytes) Performance Changes: Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Best Old Implementation. Best Old Implementation was determined with the highest ISA implementation. wcscat-avx2 -> 0.975 wcscpy-avx2 -> 0.591 wcpcpy-avx2 -> 0.698 wcsncpy-avx2 -> 0.730 wcpncpy-avx2 -> 0.711 wcsncat-avx2 -> 0.954 Code Size Changes: This change increase the size of libc.so by ~5.5kb bytes. For reference the patch optimizing the normal strcpy family functions decreases libc.so by ~5.2kb. Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.	2022-11-08 19:22:33 -08:00
Noah Goldstein	64b8b6516b	x86: Add evex optimized functions for the wchar_t strcpy family Implemented: wcscat-evex (+ 905 bytes) wcscpy-evex (+ 674 bytes) wcpcpy-evex (+ 709 bytes) wcsncpy-evex (+1358 bytes) wcpncpy-evex (+1467 bytes) wcsncat-evex (+1213 bytes) Performance Changes: Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Best Old Implementation. Best Old Implementation was determined with the highest ISA implementation. wcscat-evex -> 0.991 wcscpy-evex -> 0.587 wcpcpy-evex -> 0.695 wcsncpy-evex -> 0.719 wcpncpy-evex -> 0.694 wcsncat-evex -> 0.979 Code Size Changes: This change increase the size of libc.so by ~6.3kb bytes. For reference the patch optimizing the normal strcpy family functions decreases libc.so by ~5.7kb. Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.	2022-11-08 19:22:33 -08:00
Noah Goldstein	642933158e	x86: Optimize and shrink st{r\|p}{n}{cat\|cpy}-avx2 functions Optimizations are: 1. Use more overlapping stores to avoid branches. 2. Reduce how unrolled the aligning copies are (this is more of a code-size save, its a negative for some sizes in terms of perf). 3. For st{r\|p}n{cat\|cpy} re-order the branches to minimize the number that are taken. Performance Changes: Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Old Implementation. strcat-avx2 -> 0.998 strcpy-avx2 -> 0.937 stpcpy-avx2 -> 0.971 strncpy-avx2 -> 0.793 stpncpy-avx2 -> 0.775 strncat-avx2 -> 0.962 Code Size Changes: function -> Bytes New / Bytes Old -> Ratio strcat-avx2 -> 685 / 1639 -> 0.418 strcpy-avx2 -> 560 / 903 -> 0.620 stpcpy-avx2 -> 592 / 939 -> 0.630 strncpy-avx2 -> 1176 / 2390 -> 0.492 stpncpy-avx2 -> 1268 / 2438 -> 0.520 strncat-avx2 -> 1042 / 2563 -> 0.407 Notes: 1. Because of the significant difference between the implementations they are split into three files. strcpy-avx2.S -> strcpy, stpcpy, strcat strncpy-avx2.S -> strncpy strncat-avx2.S > strncat I couldn't find a way to merge them without making the ifdefs incredibly difficult to follow. Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.	2022-11-08 19:22:33 -08:00
Noah Goldstein	f049f52dfe	x86: Optimize and shrink st{r\|p}{n}{cat\|cpy}-evex functions Optimizations are: 1. Use more overlapping stores to avoid branches. 2. Reduce how unrolled the aligning copies are (this is more of a code-size save, its a negative for some sizes in terms of perf). 3. Improve the loop a bit (similiar to what we do in strlen with 2x vpminu + kortest instead of 3x vpminu + kmov + test). 4. For st{r\|p}n{cat\|cpy} re-order the branches to minimize the number that are taken. Performance Changes: Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Old Implementation. stpcpy-evex -> 0.922 strcat-evex -> 0.985 strcpy-evex -> 0.880 strncpy-evex -> 0.831 stpncpy-evex -> 0.780 strncat-evex -> 0.958 Code Size Changes: function -> Bytes New / Bytes Old -> Ratio strcat-evex -> 819 / 1874 -> 0.437 strcpy-evex -> 700 / 1074 -> 0.652 stpcpy-evex -> 735 / 1094 -> 0.672 strncpy-evex -> 1397 / 2611 -> 0.535 stpncpy-evex -> 1489 / 2691 -> 0.553 strncat-evex -> 1184 / 2832 -> 0.418 Notes: 1. Because of the significant difference between the implementations they are split into three files. strcpy-evex.S -> strcpy, stpcpy, strcat strncpy-evex.S -> strncpy strncat-evex.S > strncat I couldn't find a way to merge them without making the ifdefs incredibly difficult to follow. 2. All implementations can be made evex512 by including "x86-evex512-vecs.h" at the top. 3. All implementations have an optional define: `USE_EVEX_MASKED_STORE` Setting to one uses evex-masked stores for handling short strings. This saves code size and branches. It's disabled for all implementations are the moment as there are some serious drawbacks to masked stores in certain cases, but that may be fixed on future architectures. Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.	2022-11-08 19:22:33 -08:00
Noah Goldstein	2d2493a644	x86: Use VMM API in memcmpeq-evex.S and minor changes Changes to generated code are: 1. In a few places use `vpcmpeqb` instead of `vpcmpneq` to save a byte of code size. 2. Add a branch for length <= (VEC_SIZE * 6) as opposed to doing the entire block of [VEC_SIZE * 4 + 1, VEC_SIZE * 8] in a single basic-block (the space to add the extra branch without changing code size is bought with the above change). Change (2) has roughly a 20-25% speedup for sizes in [VEC_SIZE * 4 + 1, VEC_SIZE * 6] and negligible to no-cost for [VEC_SIZE * 6 + 1, VEC_SIZE * 8] From N=10 runs on Tigerlake: align1,align2 ,length ,result ,New Time ,Cur Time ,New Time / Old Time 0 ,0 ,129 ,0 ,5.404 ,6.887 ,0.785 0 ,0 ,129 ,1 ,5.308 ,6.826 ,0.778 0 ,0 ,129 ,18446744073709551615 ,5.359 ,6.823 ,0.785 0 ,0 ,161 ,0 ,5.284 ,6.827 ,0.774 0 ,0 ,161 ,1 ,5.317 ,6.745 ,0.788 0 ,0 ,161 ,18446744073709551615 ,5.406 ,6.778 ,0.798 0 ,0 ,193 ,0 ,6.804 ,6.802 ,1.000 0 ,0 ,193 ,1 ,6.950 ,6.754 ,1.029 0 ,0 ,193 ,18446744073709551615 ,6.792 ,6.719 ,1.011 0 ,0 ,225 ,0 ,6.625 ,6.699 ,0.989 0 ,0 ,225 ,1 ,6.776 ,6.735 ,1.003 0 ,0 ,225 ,18446744073709551615 ,6.758 ,6.738 ,0.992 0 ,0 ,256 ,0 ,5.402 ,5.462 ,0.989 0 ,0 ,256 ,1 ,5.364 ,5.483 ,0.978 0 ,0 ,256 ,18446744073709551615 ,5.341 ,5.539 ,0.964 Rewriting with VMM API allows for memcmpeq-evex to be used with evex512 by including "x86-evex512-vecs.h" at the top. Complete check passes on x86-64.	2022-11-08 19:22:08 -08:00
Noah Goldstein	419c832aba	x86: Use VMM API in memcmp-evex-movbe.S and minor changes The only change to the existing generated code is `tzcnt` -> `bsf` to save a byte of code size here and there. Rewriting with VMM API allows for memcmp-evex-movbe to be used with evex512 by including "x86-evex512-vecs.h" at the top. Complete check passes on x86-64.	2022-11-08 19:19:35 -08:00
Sunil K Pandey	faaf733f49	x86_64: Implement evex512 version of strrchr and wcsrchr Changes from v1: Use vec api for register. Replace VPCMP with VPCMPEQ Restructure and remove 1 unconditional jump. Change page cross logic to use sall. This patch implements following evex512 version of string functions. evex512 version takes up to 30% less cycle as compared to evex, depending on length and alignment. - strrchr function using 512 bit vectors. - wcsrchr function using 512 bit vectors. Code size data: strrchr-evex.o 879 byte strrchr-evex512.o 601 byte (-32%) wcsrchr-evex.o 882 byte wcsrchr-evex512.o 572 byte (-35%) Placeholder function, not used by any processor at the moment. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-11-03 15:51:52 -07:00
Sunil K Pandey	e96971482d	x86-64: Improve evex512 version of strlen functions This patch improves following functionality - Replace VPCMP with VPCMPEQ. - Replace page cross check logic with sall. - Remove extra lea from align_more. - Remove uncondition loop jump. - Use bsf to check max length in first vector. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-10-30 13:09:56 -07:00
Sunil K Pandey	59e501f204	x86_64: Implement evex512 version of strchrnul, strchr and wcschr This patch implements following evex512 version of string functions. evex512 version takes up to 30% less cycle as compared to evex, depending on length and alignment. - strchrnul function using 512 bit vectors. - strchr function using 512 bit vectors. - wcschr function using 512 bit vectors. Code size data: strchrnul-evex.o 599 byte strchrnul-evex512.o 569 byte (-5%) strchr-evex.o 639 byte strchr-evex512.o 595 byte (-7%) wcschr-evex.o 644 byte wcschr-evex512.o 607 byte (-6%) Placeholder function, not used by any processor at the moment. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-10-25 22:39:35 -07:00
Noah Goldstein	8775479804	x86: Use `testb` for FSRM check in memmove-vec-unaligned-erms `testb` saves a bit of code size is the imm-operand can be encoded 1-bytes. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	f04f8373dd	x86: Use `testb` for case-locale check in str{n}casecmp-sse42 `testb` saves a bit of code size is the imm-operand can be encoded 1-bytes. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	7775574ce0	x86: Use `testb` for case-locale check in str{n}casecmp-sse2 `testb` saves a bit of code size is the imm-operand can be encoded 1-bytes. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	b6d02d6457	x86: Use `testb` for case-locale check in str{n}casecmp-avx2 `testb` saves a bit of code size is the imm-operand can be encoded 1-bytes. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	5ce9766417	x86: Add support for VEC_SIZE == 64 in strcmp-evex.S impl Unused at the moment, but evex512 strcmp, strncmp, strcasecmp{l}, and strncasecmp{l} functions can be added by including strcmp-evex.S with "x86-evex512-vecs.h" defined. In addition save code size a bit in a few places. 1. tzcnt ... -> bsf ... 2. vpcmp{b\|d} $0 ... -> vpcmpeq{b\|d} This saves a touch of code size but has minimal net affect. Full check passes on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	c25eb94aed	x86: Remove AVX512-BVMI2 instruction from strrchr-evex.S commit `b412213eee` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Tue Oct 18 17:44:07 2022 -0700 x86: Optimize strrchr-evex.S and implement with VMM headers Added `vpcompress{b\|d}` to the page-cross logic with is an AVX512-VBMI2 instruction. This is not supported on SKX. Since the page-cross logic is relatively cold and the benefit is minimal revert the page-cross case back to the old logic which is supported on SKX. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	b412213eee	x86: Optimize strrchr-evex.S and implement with VMM headers Optimization is: 1. Cache latest result in "fast path" loop with `vmovdqu` instead of `kunpckdq`. This helps if there are more than one matches. Code Size Changes: strrchr-evex.S : +30 bytes (Same number of cache lines) Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. strrchr-evex.S : 0.932 (From cases with higher match frequency) Full results attached in email. Full check passes on x86-64.	2022-10-19 17:31:03 -07:00
Noah Goldstein	4af6844aa5	x86: Optimize memrchr-evex.S Optimizations are: 1. Use the fact that lzcnt(0) -> VEC_SIZE for memchr to save a branch in short string case. 2. Save several instructions in len = [VEC_SIZE, 4 * VEC_SIZE] case. 3. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... Code Size Changes: memrchr-evex.S : -29 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. memrchr-evex.S : 0.949 (Mostly from improvements in small strings) Full results attached in email. Full check passes on x86-64.	2022-10-19 17:31:03 -07:00
Noah Goldstein	b79f8ff26a	x86: Optimize strnlen-evex.S and implement with VMM headers Optimizations are: 1. Use the fact that bsf(0) leaves the destination unchanged to save a branch in short string case. 2. Restructure code so that small strings are given the hot path. - This is a net-zero on the benchmark suite but in general makes sense as smaller sizes are far more common. 3. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 4. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. The optimizations (especially for point 2) make the strnlen and strlen code essentially incompatible so split strnlen-evex to a new file. Code Size Changes: strlen-evex.S : -23 bytes strnlen-evex.S : -167 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. strlen-evex.S : 0.992 (No real change) strnlen-evex.S : 0.947 Full results attached in email. Full check passes on x86-64.	2022-10-19 17:31:03 -07:00
Noah Goldstein	69717709ec	x86: Shrink / minorly optimize strchr-evex and implement with VMM headers Size Optimizations: 1. Condence hot path for better cache-locality. - This is most impact for strchrnul where the logic strings with len <= VEC_SIZE or with a match in the first VEC no fits entirely in the first cache line. 2. Reuse common targets in first 4x VEC and after the loop. 3. Don't align targets so aggressively if it doesn't change the number of fetch blocks it will require and put more care in avoiding the case where targets unnecessarily split cache lines. 4. Align the loop better for DSB/LSD 5. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 6. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. Code Size Changes: strchr-evex.S : -63 bytes strchrnul-evex.S: -48 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. strchr-evex.S (Fixed) : 0.971 strchr-evex.S (Rand) : 0.932 strchrnul-evex.S : 0.965 Full results attached in email. Full check passes on x86-64.	2022-10-19 17:31:03 -07:00

1 2 3 4 5 ...

502 Commits