glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-11-22 21:10:07 +00:00

Author	SHA1	Message	Date
Gabi Falk	dd5f891c1a	x86_64: Fix missing wcsncat function definition without multiarch (x86-64-v4) This code expects the WCSCAT preprocessor macro to be predefined in case the evex implementation of the function should be defined with a name different from __wcsncat_evex. However, when glibc is built for x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines WCSNCAT variable instead of WCSCAT to build it as wcsncat. Rename the variable to WCSNCAT, as it is actually a better naming choice for the variable in this case. Reported-by: Kenton Groombridge Link: https://bugs.gentoo.org/921945 Fixes: `64b8b6516b` ("x86: Add evex optimized functions for the wchar_t strcpy family") Signed-off-by: Gabi Falk <gabifalk@gmx.com> Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>	2024-05-08 07:37:59 -07:00
Adhemerval Zanella	721314c980	x86_64: Remove avx512 strstr implementation As indicated in a recent thread, it is a simple brute-force algorithm that checks the whole needle at a matching character pair (and does so 1 byte at a time after the first 64 bytes of a needle). Also it never skips ahead and thus can match at every haystack position after trying to match all of the needle, which the generic implementation avoids. As indicated by Wilco, a 4x larger needle and 16x larger haystack gives a clear 65x slowdown for both basic_strstr and __strstr_avx512: "ifuncs": ["basic_strstr", "twoway_strstr", "__strstr_avx512", "__strstr_sse2_unaligned", "__strstr_generic"], { "len_haystack": 65536, "len_needle": 1024, "align_haystack": 0, "align_needle": 0, "fail": 1, "desc": "Difficult bruteforce needle", "timings": [4.0948e+07, 15094.5, 3.20818e+07, 108558, 10839.2] }, { "len_haystack": 1048576, "len_needle": 4096, "align_haystack": 0, "align_needle": 0, "fail": 1, "desc": "Difficult bruteforce needle", "timings": [2.69767e+09, 100797, 2.08535e+09, 495706, 82666.9] } PS: I don't have an AVX512 capable machine to verify this issue, but skimming through the code it does seem to follow what Wilco has described. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2024-03-27 13:48:16 -03:00
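To make the cost described in 721314c980 concrete, here is a minimal sketch of a brute-force substring search that counts byte comparisons. It is not the removed AVX512 code; `naive_search` and its `cmps` counter are hypothetical names used only for illustration. Like the algorithm the commit describes, it restarts the needle comparison at every haystack position and never skips ahead.

```c
#include <stddef.h>

/* Hypothetical helper: returns the offset of the first match, or -1.
   *cmps receives the number of byte comparisons performed.  */
static long
naive_search (const char *hay, size_t hlen,
              const char *needle, size_t nlen, unsigned long *cmps)
{
  *cmps = 0;
  if (nlen > hlen)
    return -1;

  /* Try every haystack position; no information from a failed attempt
     is reused, so the worst case is about hlen * nlen comparisons.  */
  for (size_t i = 0; i + nlen <= hlen; i++)
    {
      size_t j = 0;
      while (j < nlen)
        {
          ++*cmps;
          if (hay[i + j] != needle[j])
            break;
          j++;
        }
      if (j == nlen)
        return (long) i;
    }
  return -1;
}
```

With a haystack of repeated `a` bytes and a needle that differs only in its last byte, every position costs a full needle comparison, which is the kind of "difficult bruteforce needle" the benchmark entries above exercise.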
H.J. Lu	ef7f4b1fef	Apply the Makefile sorting fix Apply the Makefile sorting fix generated by sort-makefile-lines.py.	2024-02-15 11:19:56 -08:00
Adhemerval Zanella	491e55beab	x86: Expand the comment on when REP STOSB is used on memset Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2024-02-13 08:49:43 -08:00
Paul Eggert	dff8da6b3e	Update copyright dates with scripts/update-copyrights	2024-01-01 10:53:40 -08:00
Matthew Sterrett	e957308723	x86: Unifies 'strlen-evex' and 'strlen-evex512' implementations. This commit uses a common implementation 'strlen-evex-base.S' for both 'strlen-evex' and 'strlen-evex512' The motivation is to reduce the number of implementations to maintain. This incidentally gives a small performance improvement. All tests pass on x86. Benchmarks were taken on SKX. https://www.intel.com/content/www/us/en/products/sku/123613/intel-core-i97900x-xseries-processor-13-75m-cache-up-to-4-30-ghz/specifications.html Geometric mean for strlen-evex512 over all benchmarks (N=10) was (new/old) 0.939 Geometric mean for wcslen-evex512 over all benchmarks (N=10) was (new/old) 0.965 Code Size Changes: strlen-evex512.S : +24 bytes wcslen-evex512.S : +54 bytes Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2023-12-18 12:38:01 -06:00
Noah Goldstein	9469261cf1	x86: Only align destination to 1x VEC_SIZE in memset 4x loop Current code aligns to 2x VEC_SIZE. Aligning to 2x has no effect on performance other than potentially resulting in an additional iteration of the loop. 1x maintains aligned stores (the only reason to align in this case) and doesn't incur any unnecessary loop iterations. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>	2023-11-28 12:06:19 -06:00
Adhemerval Zanella	4862d546c0	x86: Use dl-symbol-redir-ifunc.h on cpu-tunables The dl-symbol-redir-ifunc.h redirects compiler-generated libcalls to arch-specific memory implementations to avoid ifunc calls where it is not yet possible. The memcmp-isa-default-impl.h aims to fix the same issue by calling the specific memset implementation directly. Using the memcmp symbol directly allows the compiler to inline the memset calls (especially because _dl_tunable_set_hwcaps uses constant values), generating better code. Checked on x86_64-linux-gnu. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>	2023-11-21 16:15:42 -03:00
Noah Goldstein	b7f8b6b64b	x86: Fix unchecked AVX512-VBMI2 usage in strrchr-evex-base.S strrchr-evex-base used `vpcompress{b\|d}` in the page cross logic but was missing the CPU_FEATURE checks for VBMI2 in the ifunc/ifunc-impl-list. The fix is either to add those checks or change the logic to not use `vpcompress{b\|d}`. Choosing the latter here so that the strrchr-evex implementation is usable on SKX. New implementation is a bit slower, but this is in a cold path so it's probably okay.	2023-11-15 11:09:44 -06:00
Noah Goldstein	a3c50bf46a	x86: Prepare `strrchr-evex` and `strrchr-evex512` for AVX10 This commit refactors `strrchr-evex` and `strrchr-evex512` to use a common implementation: `strrchr-evex-base.S`. The motivation is `strrchr-evex` needed to be refactored to not use 64-bit masked registers in preparation for AVX10. Once vec-width masked register combining was removed, the EVEX and EVEX512 implementations can easily be implemented in the same file without any major overhead. The net result is performance improvements (measured on TGL) for both `strrchr-evex` and `strrchr-evex512`. Although, note there are some regressions in the test suite and it may be many of the cases that make the total-geomean of improvement/regression across bench-strrchr are cold. The point of the performance measurement is to show there are no major regressions, but the primary motivation is preparation for AVX10. Benchmarks were taken on TGL: https://www.intel.com/content/www/us/en/products/sku/213799/intel-core-i711850h-processor-24m-cache-up-to-4-80-ghz/specifications.html EVEX geometric_mean(N=5) of all benchmarks New / Original : 0.74 EVEX512 geometric_mean(N=5) of all benchmarks New / Original: 0.87 Full check passes on x86.	2023-10-06 00:18:55 -05:00
Frédéric Bérat	64f9857507	wchar: Avoid PLT entries with _FORTIFY_SOURCE The change is meant to avoid unwanted PLT entries for the wmemset and wcrtomb routines when _FORTIFY_SOURCE is set. On top of that, ensure that *_chk routines have their hidden builtin definitions available. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>	2023-07-05 16:59:48 +02:00
Frédéric Bérat	dd8486ffc1	string: Ensure _chk routines have their hidden builtin definition available If libc_hidden_builtin_{def,proto} isn't properly set for _chk routines, there are unwanted PLT entries in libc.so. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>	2023-07-05 16:59:48 +02:00
H.J. Lu	a8c8889978	x86-64: Use YMM registers in memcmpeq-evex.S Since the assembly source file with -evex suffix should use YMM registers, not ZMM registers, include x86-evex256-vecs.h by default to use YMM registers in memcmpeq-evex.S Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2023-06-01 09:21:14 -07:00
Paul Pluzhnikov	1e9d5987fd	Fix misspellings in sysdeps/x86_64 -- BZ 25337. Applying this commit results in bit-identical rebuild of libc.so.6 math/libm.so.6 elf/ld-linux-x86-64.so.2 mathvec/libmvec.so.1 Reviewed-by: Florian Weimer <fweimer@redhat.com>	2023-05-23 10:25:11 +00:00
Sergey Bugaev	8d873a4904	x86_64: Add rtld-stpncpy & rtld-strncpy Just like the other existing rtld-str* files, this provides rtld with usable versions of stpncpy and strncpy. Signed-off-by: Sergey Bugaev <bugaevc@gmail.com> Message-Id: <20230319151017.531737-22-bugaevc@gmail.com>	2023-04-03 01:17:56 +02:00
Adhemerval Zanella	22999b2f0f	string: Add libc_hidden_proto for memrchr Although static linker can optimize it to local call, it follows the internal scheme to provide hidden proto and definitions. Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>	2023-02-08 17:13:58 -03:00
Adhemerval Zanella	7ea510127e	string: Add libc_hidden_proto for strchrnul Although static linker can optimize it to local call, it follows the internal scheme to provide hidden proto and definitions. Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>	2023-02-08 17:13:56 -03:00
Noah Goldstein	b2c474f8de	x86: Fix strncat-avx2.S reading past length [BZ #30065 ] Occurs when `src` has no null-term. Two cases: 1) Zero-length check is doing: ``` test %rdx, %rdx jl L(zero_len) ``` which doesn't actually check zero (was at some point `decq` and the flag never got updated). The fix is just make the flag `jle` i.e: ``` test %rdx, %rdx jle L(zero_len) ``` 2) Length check in page-cross case checking if we should continue is doing: ``` cmpq %r8, %rdx jb L(page_cross_small) ``` which means we will continue searching for null-term if length ends at the end of a page and there was no null-term in `src`. The fix is to make the flag: ``` cmpq %r8, %rdx jbe L(page_cross_small) ```	2023-01-31 19:13:46 -06:00
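The two conditions fixed in b2c474f8de can be modelled in C. This is only a sketch: the function names are hypothetical, and the code mirrors the flag semantics of `test`/`jl`/`jle` and `cmp`/`jb`/`jbe` rather than the actual glibc source.

```c
#include <stdbool.h>
#include <stdint.h>

/* `test %rdx, %rdx; jl`: branch only when the sign bit is set, so a
   length of exactly zero falls through (the bug in case 1).  */
static bool
zero_len_taken_jl (uint64_t len)
{
  return (int64_t) len < 0;
}

/* `test %rdx, %rdx; jle`: branch when the value is zero or negative.  */
static bool
zero_len_taken_jle (uint64_t len)
{
  return (int64_t) len <= 0;
}

/* `cmpq %r8, %rdx; jb`: unsigned rdx < r8 (the bug in case 2).  */
static bool
page_cross_small_jb (uint64_t rdx, uint64_t r8)
{
  return rdx < r8;
}

/* `cmpq %r8, %rdx; jbe`: unsigned rdx <= r8, which also covers a
   length that ends exactly at the boundary.  */
static bool
page_cross_small_jbe (uint64_t rdx, uint64_t r8)
{
  return rdx <= r8;
}
```

In both cases the original condition missed only the equality case, which is why each fix is a one-instruction change.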
Joseph Myers	6d7e8eda9b	Update copyright dates with scripts/update-copyrights	2023-01-06 21:14:39 +00:00
Noah Goldstein	b712be5264	x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863 ] In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b` are concurrently modified as `memcmp` runs, there can be a SIGSEGV in `L(ret_nonzero_vec_end_0)` because the sequential logic assumes that `(rdx - 32 + rax)` is a positive 32-bit integer. To be clear, this change does not mean the usage of `memcmp` is supported. The program behaviour is undefined (UB) in the presence of data races, and `memcmp` is incorrect when the values of `a` and/or `b` are modified concurrently (data race). This UB may manifest itself as a SIGSEGV. That being said, if we can allow the idiomatic use cases, like those in yottadb with opportunistic concurrency control (OCC), to execute without a SIGSEGV, at no cost to regular use cases, then we can aim to minimize harm to those existing users. The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant `addq %rdx, %rax`. The 1-extra byte of code size from using the 64-bit instruction doesn't contribute to overall code size as the next target is aligned and has multiple bytes of `nop` padding before it. As well all the logic between the add and `ret` still fits in the same fetch block, so the cost of this change is basically zero. The relevant sequential logic can be seen in the following pseudo-code: ``` /* * rsi = a * rdi = b * rdx = len - 32 */ /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32] in this case, this check is also assumed to cover a[0:(31 - len)] and b[0:(31 - len)]. */ movups (%rsi), %xmm0 movups (%rdi), %xmm1 PCMPEQ %xmm0, %xmm1 pmovmskb %xmm1, %eax subl %ecx, %eax jnz L(END_NEQ) /* cmp a[len-16:len-1] and b[len-16:len-1]. */ movups 16(%rsi, %rdx), %xmm0 movups 16(%rdi, %rdx), %xmm1 PCMPEQ %xmm0, %xmm1 pmovmskb %xmm1, %eax subl %ecx, %eax jnz L(END_NEQ2) ret L(END2): /* Position first mismatch. */ bsfl %eax, %eax /* The sequential version is able to assume this value is a positive 32-bit value because the first check included bytes in range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be greater than `31 - len` so the minimum value of `edx` + `eax` is `(len - 32) + (32 - len) >= 0`. In the concurrent case, however, `a` or `b` could have been changed so a mismatch in `eax` less or equal than `(31 - len)` is possible (the new low bound is `(16 - len)`. This can result in a negative 32-bit signed integer, which when zero extended to 64-bits is a random large value this out out of bounds. */ addl %edx, %eax /* Crash here because 32-bit negative number in `eax` zero extends to out of bounds 64-bit offset. */ movzbl 16(%rdi, %rax), %ecx movzbl 16(%rsi, %rax), %eax ``` This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e `addq %rdx, %rax`). This prevents the 32-bit zero extension and since `eax` is still a low bound of `16 - len` the `rdx + rax` is bound by `(len - 32) - (16 - len) >= -16`. Since we have a fixed offset of `16` in the memory access this must be in bounds.	2022-12-15 09:09:35 -08:00
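The zero-extension hazard described in b712be5264 can be reproduced with plain integer arithmetic. The following is a sketch under stated assumptions: `offset_addl` and `offset_addq` are hypothetical helpers that model only the register arithmetic of the two instructions, with `rdx = len - 32` held as a sign-extended 64-bit value.

```c
#include <stdint.h>

/* Models `addl %edx, %eax`: the sum is formed in 32 bits and the upper
   32 bits of the destination register are cleared, so a negative 32-bit
   result becomes a large positive 64-bit offset.  */
static uint64_t
offset_addl (uint64_t rdx, uint64_t rax)
{
  return (uint64_t) (uint32_t) ((uint32_t) rdx + (uint32_t) rax);
}

/* Models `addq %rdx, %rax`: the sum is formed in 64 bits, so a negative
   result remains a small negative offset.  */
static int64_t
offset_addq (int64_t rdx, int64_t rax)
{
  return rdx + rax;
}
```

For example, with `len = 20` (so `rdx = -12`) and a mismatch index of 5 in `eax`, the 32-bit add yields 0xFFFFFFF9 as a 64-bit offset, while the 64-bit add yields -7; adding the fixed displacement of 16 then gives byte 9, which lies inside the 20-byte buffer.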
H.J. Lu	e5672763c4	x86-64 strncpy: Properly handle the length parameter [BZ# 29839] On x32, the size_t parameter may be passed in the lower 32 bits of a 64-bit register with the non-zero upper 32 bits. The string/memory functions written in assembly can only use the lower 32 bits of a 64-bit register as length or must clear the upper 32 bits before using the full 64-bit register for length. This patch fixes strncpy for x32. Tested on x86-64 and x32. On x86-64, libc.so is the same with and without the fix. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-12-02 08:18:41 -08:00
H.J. Lu	f566b02852	x86-64 strncat: Properly handle the length parameter [BZ# 24097] On x32, the size_t parameter may be passed in the lower 32 bits of a 64-bit register with the non-zero upper 32 bits. The string/memory functions written in assembly can only use the lower 32 bits of a 64-bit register as length or must clear the upper 32 bits before using the full 64-bit register for length. This patch fixes strncat for x32. Tested on x86-64 and x32. On x86-64, libc.so is the same with and without the fix. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-12-02 08:18:10 -08:00
Noah Goldstein	52cf11004e	x86: Add avx2 optimized functions for the wchar_t strcpy family Implemented: wcscat-avx2 (+ 744 bytes) wcscpy-avx2 (+ 539 bytes) wcpcpy-avx2 (+ 577 bytes) wcsncpy-avx2 (+1108 bytes) wcpncpy-avx2 (+1214 bytes) wcsncat-avx2 (+1085 bytes) Performance Changes: Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Best Old Implementation. Best Old Implementation was determined with the highest ISA implementation. wcscat-avx2 -> 0.975 wcscpy-avx2 -> 0.591 wcpcpy-avx2 -> 0.698 wcsncpy-avx2 -> 0.730 wcpncpy-avx2 -> 0.711 wcsncat-avx2 -> 0.954 Code Size Changes: This change increase the size of libc.so by ~5.5kb bytes. For reference the patch optimizing the normal strcpy family functions decreases libc.so by ~5.2kb. Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.	2022-11-08 19:22:33 -08:00
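The x32 length handling described in the two commits above comes down to discarding the upper half of the register before using it as a 64-bit length. A minimal C model follows, assuming a hypothetical helper name `x32_length`; on x86-64 a 32-bit register write such as `movl %edx, %edx` has the same effect, because it zeroes the upper 32 bits of the destination.

```c
#include <stdint.h>

/* Keep the low 32 bits of the incoming register value and clear the
   rest, as required before treating it as a full 64-bit length.  */
static uint64_t
x32_length (uint64_t reg)
{
  return (uint64_t) (uint32_t) reg;
}
```

A register holding `0xdeadbeef00000010` therefore yields a length of 16, whereas using the full register unmodified would be read as an enormous length.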
Noah Goldstein	64b8b6516b	x86: Add evex optimized functions for the wchar_t strcpy family Implemented: wcscat-evex (+ 905 bytes) wcscpy-evex (+ 674 bytes) wcpcpy-evex (+ 709 bytes) wcsncpy-evex (+1358 bytes) wcpncpy-evex (+1467 bytes) wcsncat-evex (+1213 bytes) Performance Changes: Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Best Old Implementation. Best Old Implementation was determined with the highest ISA implementation. wcscat-evex -> 0.991 wcscpy-evex -> 0.587 wcpcpy-evex -> 0.695 wcsncpy-evex -> 0.719 wcpncpy-evex -> 0.694 wcsncat-evex -> 0.979 Code Size Changes: This change increase the size of libc.so by ~6.3kb bytes. For reference the patch optimizing the normal strcpy family functions decreases libc.so by ~5.7kb. Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.	2022-11-08 19:22:33 -08:00
Noah Goldstein	642933158e	x86: Optimize and shrink st{r\|p}{n}{cat\|cpy}-avx2 functions Optimizations are: 1. Use more overlapping stores to avoid branches. 2. Reduce how unrolled the aligning copies are (this is more of a code-size save, it's a negative for some sizes in terms of perf). 3. For st{r\|p}n{cat\|cpy} re-order the branches to minimize the number that are taken. Performance Changes: Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Old Implementation. strcat-avx2 -> 0.998 strcpy-avx2 -> 0.937 stpcpy-avx2 -> 0.971 strncpy-avx2 -> 0.793 stpncpy-avx2 -> 0.775 strncat-avx2 -> 0.962 Code Size Changes: function -> Bytes New / Bytes Old -> Ratio strcat-avx2 -> 685 / 1639 -> 0.418 strcpy-avx2 -> 560 / 903 -> 0.620 stpcpy-avx2 -> 592 / 939 -> 0.630 strncpy-avx2 -> 1176 / 2390 -> 0.492 stpncpy-avx2 -> 1268 / 2438 -> 0.520 strncat-avx2 -> 1042 / 2563 -> 0.407 Notes: 1. Because of the significant difference between the implementations they are split into three files. strcpy-avx2.S -> strcpy, stpcpy, strcat strncpy-avx2.S -> strncpy strncat-avx2.S -> strncat I couldn't find a way to merge them without making the ifdefs incredibly difficult to follow. Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.	2022-11-08 19:22:33 -08:00
Noah Goldstein	f049f52dfe	x86: Optimize and shrink st{r\|p}{n}{cat\|cpy}-evex functions Optimizations are: 1. Use more overlapping stores to avoid branches. 2. Reduce how unrolled the aligning copies are (this is more of a code-size save, it's a negative for some sizes in terms of perf). 3. Improve the loop a bit (similar to what we do in strlen with 2x vpminu + kortest instead of 3x vpminu + kmov + test). 4. For st{r\|p}n{cat\|cpy} re-order the branches to minimize the number that are taken. Performance Changes: Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Old Implementation. stpcpy-evex -> 0.922 strcat-evex -> 0.985 strcpy-evex -> 0.880 strncpy-evex -> 0.831 stpncpy-evex -> 0.780 strncat-evex -> 0.958 Code Size Changes: function -> Bytes New / Bytes Old -> Ratio strcat-evex -> 819 / 1874 -> 0.437 strcpy-evex -> 700 / 1074 -> 0.652 stpcpy-evex -> 735 / 1094 -> 0.672 strncpy-evex -> 1397 / 2611 -> 0.535 stpncpy-evex -> 1489 / 2691 -> 0.553 strncat-evex -> 1184 / 2832 -> 0.418 Notes: 1. Because of the significant difference between the implementations they are split into three files. strcpy-evex.S -> strcpy, stpcpy, strcat strncpy-evex.S -> strncpy strncat-evex.S -> strncat I couldn't find a way to merge them without making the ifdefs incredibly difficult to follow. 2. All implementations can be made evex512 by including "x86-evex512-vecs.h" at the top. 3. All implementations have an optional define: `USE_EVEX_MASKED_STORE` Setting to one uses evex-masked stores for handling short strings. This saves code size and branches. It's disabled for all implementations at the moment as there are some serious drawbacks to masked stores in certain cases, but that may be fixed on future architectures. Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.	2022-11-08 19:22:33 -08:00
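The "overlapping stores to avoid branches" idea listed in the two commits above can be sketched in portable C. This is a generic illustration of the technique with 8-byte chunks, not the glibc vector code; `copy_8_to_16` is a hypothetical name. Any length from 8 to 16 bytes is handled by one load/store pair at the start and one at the end, which overlap in the middle when the length is below 16.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, where 8 <= n <= 16, without branching on the exact
   length.  Both chunks are loaded before either is stored.  */
static void
copy_8_to_16 (unsigned char *dst, const unsigned char *src, size_t n)
{
  uint64_t head, tail;

  memcpy (&head, src, 8);          /* first 8 bytes */
  memcpy (&tail, src + n - 8, 8);  /* last 8 bytes, may overlap head */
  memcpy (dst, &head, 8);
  memcpy (dst + n - 8, &tail, 8);
}
```

The overlapping region is simply written twice with the same data, which trades a small amount of redundant work for the removal of a length-dependent branch.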
Noah Goldstein	2d2493a644	x86: Use VMM API in memcmpeq-evex.S and minor changes Changes to generated code are: 1. In a few places use `vpcmpeqb` instead of `vpcmpneq` to save a byte of code size. 2. Add a branch for length <= (VEC_SIZE * 6) as opposed to doing the entire block of [VEC_SIZE * 4 + 1, VEC_SIZE * 8] in a single basic-block (the space to add the extra branch without changing code size is bought with the above change). Change (2) has roughly a 20-25% speedup for sizes in [VEC_SIZE * 4 + 1, VEC_SIZE * 6] and negligible to no-cost for [VEC_SIZE * 6 + 1, VEC_SIZE * 8] From N=10 runs on Tigerlake: align1,align2 ,length ,result ,New Time ,Cur Time ,New Time / Old Time 0 ,0 ,129 ,0 ,5.404 ,6.887 ,0.785 0 ,0 ,129 ,1 ,5.308 ,6.826 ,0.778 0 ,0 ,129 ,18446744073709551615 ,5.359 ,6.823 ,0.785 0 ,0 ,161 ,0 ,5.284 ,6.827 ,0.774 0 ,0 ,161 ,1 ,5.317 ,6.745 ,0.788 0 ,0 ,161 ,18446744073709551615 ,5.406 ,6.778 ,0.798 0 ,0 ,193 ,0 ,6.804 ,6.802 ,1.000 0 ,0 ,193 ,1 ,6.950 ,6.754 ,1.029 0 ,0 ,193 ,18446744073709551615 ,6.792 ,6.719 ,1.011 0 ,0 ,225 ,0 ,6.625 ,6.699 ,0.989 0 ,0 ,225 ,1 ,6.776 ,6.735 ,1.003 0 ,0 ,225 ,18446744073709551615 ,6.758 ,6.738 ,0.992 0 ,0 ,256 ,0 ,5.402 ,5.462 ,0.989 0 ,0 ,256 ,1 ,5.364 ,5.483 ,0.978 0 ,0 ,256 ,18446744073709551615 ,5.341 ,5.539 ,0.964 Rewriting with VMM API allows for memcmpeq-evex to be used with evex512 by including "x86-evex512-vecs.h" at the top. Complete check passes on x86-64.	2022-11-08 19:22:08 -08:00
Noah Goldstein	419c832aba	x86: Use VMM API in memcmp-evex-movbe.S and minor changes The only change to the existing generated code is `tzcnt` -> `bsf` to save a byte of code size here and there. Rewriting with VMM API allows for memcmp-evex-movbe to be used with evex512 by including "x86-evex512-vecs.h" at the top. Complete check passes on x86-64.	2022-11-08 19:19:35 -08:00
Sunil K Pandey	faaf733f49	x86_64: Implement evex512 version of strrchr and wcsrchr Changes from v1: Use vec api for register. Replace VPCMP with VPCMPEQ Restructure and remove 1 unconditional jump. Change page cross logic to use sall. This patch implements following evex512 version of string functions. evex512 version takes up to 30% less cycle as compared to evex, depending on length and alignment. - strrchr function using 512 bit vectors. - wcsrchr function using 512 bit vectors. Code size data: strrchr-evex.o 879 byte strrchr-evex512.o 601 byte (-32%) wcsrchr-evex.o 882 byte wcsrchr-evex512.o 572 byte (-35%) Placeholder function, not used by any processor at the moment. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-11-03 15:51:52 -07:00
Sunil K Pandey	e96971482d	x86-64: Improve evex512 version of strlen functions This patch improves following functionality - Replace VPCMP with VPCMPEQ. - Replace page cross check logic with sall. - Remove extra lea from align_more. - Remove unconditional loop jump. - Use bsf to check max length in first vector. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-10-30 13:09:56 -07:00
Sunil K Pandey	59e501f204	x86_64: Implement evex512 version of strchrnul, strchr and wcschr This patch implements following evex512 version of string functions. evex512 version takes up to 30% less cycle as compared to evex, depending on length and alignment. - strchrnul function using 512 bit vectors. - strchr function using 512 bit vectors. - wcschr function using 512 bit vectors. Code size data: strchrnul-evex.o 599 byte strchrnul-evex512.o 569 byte (-5%) strchr-evex.o 639 byte strchr-evex512.o 595 byte (-7%) wcschr-evex.o 644 byte wcschr-evex512.o 607 byte (-6%) Placeholder function, not used by any processor at the moment. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-10-25 22:39:35 -07:00
Noah Goldstein	8775479804	x86: Use `testb` for FSRM check in memmove-vec-unaligned-erms `testb` saves a bit of code size if the imm-operand can be encoded in 1 byte. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	f04f8373dd	x86: Use `testb` for case-locale check in str{n}casecmp-sse42 `testb` saves a bit of code size if the imm-operand can be encoded in 1 byte. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	7775574ce0	x86: Use `testb` for case-locale check in str{n}casecmp-sse2 `testb` saves a bit of code size if the imm-operand can be encoded in 1 byte. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	b6d02d6457	x86: Use `testb` for case-locale check in str{n}casecmp-avx2 `testb` saves a bit of code size if the imm-operand can be encoded in 1 byte. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	5ce9766417	x86: Add support for VEC_SIZE == 64 in strcmp-evex.S impl Unused at the moment, but evex512 strcmp, strncmp, strcasecmp{l}, and strncasecmp{l} functions can be added by including strcmp-evex.S with "x86-evex512-vecs.h" defined. In addition save code size a bit in a few places. 1. tzcnt ... -> bsf ... 2. vpcmp{b\|d} $0 ... -> vpcmpeq{b\|d} This saves a touch of code size but has minimal net effect. Full check passes on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	c25eb94aed	x86: Remove AVX512-VBMI2 instruction from strrchr-evex.S commit `b412213eee` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Tue Oct 18 17:44:07 2022 -0700 x86: Optimize strrchr-evex.S and implement with VMM headers Added `vpcompress{b\|d}` to the page-cross logic which is an AVX512-VBMI2 instruction. This is not supported on SKX. Since the page-cross logic is relatively cold and the benefit is minimal revert the page-cross case back to the old logic which is supported on SKX. Tested on x86-64.	2022-10-20 11:29:05 -07:00
Noah Goldstein	b412213eee	x86: Optimize strrchr-evex.S and implement with VMM headers Optimization is: 1. Cache latest result in "fast path" loop with `vmovdqu` instead of `kunpckdq`. This helps if there is more than one match. Code Size Changes: strrchr-evex.S : +30 bytes (Same number of cache lines) Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and > 1.0 is regression. strrchr-evex.S : 0.932 (From cases with higher match frequency) Full results attached in email. Full check passes on x86-64.	2022-10-19 17:31:03 -07:00
Noah Goldstein	4af6844aa5	x86: Optimize memrchr-evex.S Optimizations are: 1. Use the fact that lzcnt(0) -> VEC_SIZE for memchr to save a branch in short string case. 2. Save several instructions in len = [VEC_SIZE, 4 * VEC_SIZE] case. 3. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... Code Size Changes: memrchr-evex.S : -29 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and > 1.0 is regression. memrchr-evex.S : 0.949 (Mostly from improvements in small strings) Full results attached in email. Full check passes on x86-64.	2022-10-19 17:31:03 -07:00
Noah Goldstein	b79f8ff26a	x86: Optimize strnlen-evex.S and implement with VMM headers Optimizations are: 1. Use the fact that bsf(0) leaves the destination unchanged to save a branch in short string case. 2. Restructure code so that small strings are given the hot path. - This is a net-zero on the benchmark suite but in general makes sense as smaller sizes are far more common. 3. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 4. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. The optimizations (especially for point 2) make the strnlen and strlen code essentially incompatible so split strnlen-evex to a new file. Code Size Changes: strlen-evex.S : -23 bytes strnlen-evex.S : -167 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and > 1.0 is regression. strlen-evex.S : 0.992 (No real change) strnlen-evex.S : 0.947 Full results attached in email. Full check passes on x86-64.	2022-10-19 17:31:03 -07:00
Noah Goldstein	69717709ec	x86: Shrink / minorly optimize strchr-evex and implement with VMM headers Size Optimizations: 1. Condense hot path for better cache-locality. - This has the most impact for strchrnul, where the logic for strings with len <= VEC_SIZE or with a match in the first VEC now fits entirely in the first cache line. 2. Reuse common targets in first 4x VEC and after the loop. 3. Don't align targets so aggressively if it doesn't change the number of fetch blocks it will require and put more care in avoiding the case where targets unnecessarily split cache lines. 4. Align the loop better for DSB/LSD 5. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 6. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. Code Size Changes: strchr-evex.S : -63 bytes strchrnul-evex.S: -48 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and > 1.0 is regression. strchr-evex.S (Fixed) : 0.971 strchr-evex.S (Rand) : 0.932 strchrnul-evex.S : 0.965 Full results attached in email. Full check passes on x86-64.	2022-10-19 17:31:03 -07:00
Noah Goldstein	330881763e	x86: Optimize memchr-evex.S and implement with VMM headers Optimizations are: 1. Use the fact that tzcnt(0) -> VEC_SIZE for memchr to save a branch in short string case. 2. Restructure code so that small strings are given the hot path. - This is a net-zero on the benchmark suite but in general makes sense as smaller sizes are far more common. 3. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 4. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. The optimizations (especially for point 2) make the memchr and rawmemchr code essentially incompatible so split rawmemchr-evex to a new file. Code Size Changes: memchr-evex.S : -107 bytes rawmemchr-evex.S : -53 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and > 1.0 is regression. memchr-evex.S : 0.928 rawmemchr-evex.S : 0.986 (Less targets cross cache lines) Full results attached in email. Full check passes on x86-64.	2022-10-19 17:31:03 -07:00
Sunil K Pandey	451c6e5854	x86_64: Implement evex512 version of memchr, rawmemchr and wmemchr This patch implements following evex512 version of string functions. evex512 version takes up to 30% less cycle as compared to evex, depending on length and alignment. - memchr function using 512 bit vectors. - rawmemchr function using 512 bit vectors. - wmemchr function using 512 bit vectors. Code size data: memchr-evex.o 762 byte memchr-evex512.o 576 byte (-24%) rawmemchr-evex.o 461 byte rawmemchr-evex512.o 412 byte (-11%) wmemchr-evex.o 794 byte wmemchr-evex512.o 552 byte (-30%) Placeholder function, not used by any processor at the moment. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2022-10-18 13:26:33 -07:00
Noah Goldstein	be066536bd	x86: Update strlen-evex-base to use new reg/vec macros. To avoid duplicating the VMM / GPR / mask insn macros in all incoming evex512 files use the macros defined in 'reg-macros.h' and '{vec}-macros.h' This commit does not change libc.so Tested build on x86-64	2022-10-14 21:21:58 -07:00
Noah Goldstein	47f5d51461	x86: Remove now unused vec header macros. This commit does not change libc.so Tested build on x86-64	2022-10-14 21:21:58 -07:00
Noah Goldstein	a6784653f7	x86: Update memset to use new VEC macros Replace %VEC(n) -> %VMM(n) This commit does not change libc.so Tested build on x86-64	2022-10-14 21:21:58 -07:00
Noah Goldstein	4fb7d8a938	x86: Update memmove to use new VEC macros Replace %VEC(n) -> %VMM(n) This commit does not change libc.so Tested build on x86-64	2022-10-14 21:21:58 -07:00
Noah Goldstein	3088a66ff8	x86: Update memrchr to use new VEC macros Replace %VEC(n) -> %VMM(n) This commit does not change libc.so Tested build on x86-64	2022-10-14 21:21:58 -07:00
Noah Goldstein	52ab7604db	x86: Update VEC macros to complete API for evex/evex512 impls 1) Copy so that backport will be easier. 2) Make section only define if there is not a previous definition 3) Add `VEC_lo` definition for proper reg-width but in the ymm/zmm0-15 range. 4) Add macros for accessing GPRs based on VEC_SIZE This is to make it easier to do things like: ``` vpcmpb %VEC(0), %VEC(1), %k0 kmov{d\|q} %k0, %{eax\|rax} test %{eax\|rax} ``` It adds macro s.t any GPR can get the proper width with: `V{upcase_GPR_name}` and any mask insn can get the proper width with: `{upcase_mask_insn_without_postfix}` This commit does not change libc.so Tested build on x86-64	2022-10-14 21:21:58 -07:00
Adhemerval Zanella	5355f9ca7b	elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support Besides the option being gcc specific, this approach is still fragile and not future proof since we do not know if this will be the only optimization option gcc will add that transforms loops to memset (or any libcall). This patch adds a new header, dl-symbol-redir-ifunc.h, that can be used to redirect the compiler generated libcalls to port the generic memset implementation if required. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2022-10-10 10:32:28 -03:00

493 Commits