The following commit
commit 6f573a27b6
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Jun 23 01:19:34 2021 -0400
x86-64: Add wcslen optimize for sse4.1
Added wcsnlen-sse4.1 to the wcslen ifunc implementation list but did
not add wcslen-sse4.1 to the wcslen ifunc implementation list. This
commit fixes that by removing wcsnlen-sse4.1 from the wcslen ifunc
implementation list and adding wcslen-sse4.1 to it.
Testing:
test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as
well as all other tests in wcsmbs and string.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
commit 6f573a27b6
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Jun 23 01:19:34 2021 -0400
x86-64: Add wcslen optimize for sse4.1
added wcsnlen-sse4.1 to the wcslen ifunc implementation list. Since the
random value in the RSI register is larger than the wide-character
string length in the existing wcslen test, it didn't trigger the wcslen
test failure. Add a test to force 0 into the RSI register before calling
wcslen.
https://sourceware.org/bugzilla/show_bug.cgi?id=21782 dropped an ld
diagnostic for R_X86_64_PC32 referencing an undefined weak symbol in
-pie links. Arguably keeping the diagnostic like other ports is more
correct, since statically resolving movl foo(%rip), %eax to the
link-time zero address produces a corrupted output.
It turns out that --enable-static-pie builds do not depend on the ld
behavior. GCC generates GOT indirection for weak declarations for
-fPIE/-fPIC, so what ld does with the PC-relative relocation doesn't
really matter.
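For illustration, a minimal example of the codegen point above (the symbol
name is made up; this is not code from the tree):

#include <stddef.h>

/* An undefined weak object referenced from -fPIE/-fPIC code.  GCC reaches
   'foo' through its GOT entry rather than a direct PC-relative data access,
   and the GOT slot resolves to 0 when the weak symbol stays undefined,
   which is why the guard below is meaningful.  */
extern int foo __attribute__ ((weak));

int
read_foo (void)
{
  return &foo != NULL ? foo : -1;
}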
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
No bug. The way wcsnlen checks whether it is near the end of maxlen
is with the following macro:
mov %r11, %rsi; \
subq %rax, %rsi; \
andq $-64, %rax; \
testq $-64, %rsi; \
je L(strnlen_ret)
This works independently of whether s + maxlen overflows, so the
second overflow check is unnecessary for correctness and just extra
overhead in the common no-overflow case.
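A rough C model of that macro (illustrative only; it assumes the register
copied into %rsi holds s + maxlen and %rax holds the current position):

#include <stdint.h>

/* 'end' stands in for s + maxlen and 'cur' for %rax.  The test is done on
   the unsigned difference end - cur (subq %rax, %rsi), then checked for
   being below 64 bytes (testq $-64, %rsi; je ...).  If s + maxlen wrapped
   around, the difference just looks like a very large remaining length and
   the loop keeps going until the terminating null, which is why no separate
   overflow check is needed.  */
static int
near_maxlen (uintptr_t cur, uintptr_t end)
{
  uintptr_t remaining = end - cur;
  return (remaining & ~(uintptr_t) 63) == 0;
}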
test-strlen.c, test-wcslen.c, test-strnlen.c and test-wcsnlen.c are
all passing
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This commit fixes the bug mentioned in the previous commit.
The previous implementations of wmemchr in these files relied
on maxlen * sizeof(wchar_t) never overflowing, which is not guaranteed
by the standard. The new overflow tests added in the previous commit now
pass (as well as all the other tests).
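A minimal sketch of the kind of overflow involved, assuming a 4-byte
wchar_t (illustrative values, not taken from the test suite):

#include <stddef.h>
#include <wchar.h>

/* Scaling a huge maxlen to a byte count wraps around: here
   maxlen * sizeof (wchar_t) overflows size_t and comes out as 4, so code
   relying on the scaled value would treat a practically unbounded maxlen
   as a 4-byte limit.  */
size_t
scaled_limit_example (void)
{
  size_t maxlen = (size_t) -1 / 2 + 2;      /* huge maxlen */
  return maxlen * sizeof (wchar_t);         /* wraps to 4 on LP64 */
}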
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This commit fixes the bug mentioned in the previous commit.
The previous implementations of wmemchr in these files relied
on n * sizeof(wchar_t) never overflowing, which is not guaranteed by
the standard. The new overflow tests added in the previous commit now
pass (as well as all the other tests).
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
No bug. This commit adds the ifunc / build infrastructure
necessary for wcslen to prefer the sse4.1 implementation
in strlen-vec.S. test-wcslen.c is passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1
version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S
and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants.
This also removes the unused symbols, __GI___strlen_sse2 and
__GI___wcsnlen_sse4_1.
1. Replace
if ((((uintptr_t) &_d) & (__alignof (double) - 1)) != 0)
which may be optimized out by the compiler (see the sketch after this
list), with
int
__attribute__ ((weak, noclone, noinline))
is_aligned (void *p, int align)
{
  return (((uintptr_t) p) & (align - 1)) != 0;
}
2. Add TEST_STACK_ALIGN_INIT to TEST_STACK_ALIGN.
3. Add a common TEST_STACK_ALIGN_INIT to check 16-byte stack alignment
for both i386 and x86-64.
4. Update powerpc to use TEST_STACK_ALIGN_INIT.
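A minimal sketch of the failure mode item 1 avoids (the surrounding code is
illustrative, not the actual test):

#include <stdint.h>
#include <stdio.h>

extern int is_aligned (void *p, int align);

void
check (void)
{
  double _d = 0.0;
  /* Written inline, the compiler may assume &_d has the declared alignment
     of double and fold the test to a constant, so nothing is checked at
     run time.  */
  if ((((uintptr_t) &_d) & (__alignof (double) - 1)) != 0)
    puts ("inline check: misaligned");
  /* Calling the weak, noinline, noclone helper forces a genuine run-time
     comparison of the actual address (it returns nonzero when the pointer
     is misaligned).  */
  if (is_aligned (&_d, __alignof (double)))
    puts ("helper check: misaligned");
}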
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
No bug. This commit makes a few small improvements to
memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64
instead of 128. Either alignment will perform equally well in a loop
and 128 just increases the odds of having to do an extra iteration
which can be significant overhead for small values. 2) Align some
targets and the loop. 3) Remove an ALU instruction from the alignment process. 4)
Reorder the last 4x VEC so that they are stored after the loop. 5)
Move the condition for leq 8x VEC to before the alignment
process. test-memset and test-wmemset are both passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
No bug. This commit optimizes memcmp-evex.S. The optimizations include
adding a new vec compare path for small sizes, reorganizing the entry
control flow, removing some unnecessary ALU instructions from the main
loop, and most importantly replacing the heavy use of vpcmp + kand
logic with vpxor + vptern. test-memcmp and test-wmemcmp are both
passing.
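A rough scalar model of the vpxor + vptern change (illustrative C, not the
assembly):

#include <stdint.h>

/* Instead of a vpcmp per vector plus kand to combine mask registers, XOR
   each pair of chunks and OR the results together (vpxor feeding vpternlogd
   with an OR truth table), then test the accumulated value once.  */
static int
blocks_differ (const uint64_t *a, const uint64_t *b)
{
  uint64_t acc = 0;
  for (int i = 0; i < 4; i++)
    acc |= a[i] ^ b[i];       /* nonzero iff some byte differs */
  return acc != 0;            /* a single test instead of four compares */
}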
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
No bug. This commit optimizes memcmp-avx2.S. The optimizations include
adding a new vec compare path for small sizes, reorganizing the entry
control flow, and removing some unnecessary ALU instructions from the
main loop. test-memcmp and test-wmemcmp are both passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This is a follow-up patch to the fix for bug 19329. This adds relaxed
MO atomics to accesses that were previously data races but are now
race conditions, and where relaxed MO is sufficient.
The race conditions all follow the pattern that the write is behind the
dlopen lock, but a read can happen concurrently (e.g. during tls access)
without holding the lock. For slotinfo entries the read value only
matters if it reads from a synchronized write in dlopen or dlclose,
otherwise the related dtv entry is not valid to access so it is fine
to leave it in an inconsistent state. The same applies for
GL(dl_tls_max_dtv_idx) and GL(dl_tls_generation), but there the
algorithm relies on the fact that the read of the last synchronized
write is an increasing value.
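A hypothetical sketch of the pattern in C11 atomics (names and layout are
illustrative, not the actual glibc code):

#include <stdatomic.h>
#include <stddef.h>

static _Atomic size_t tls_generation;

/* Writer: runs with the dlopen lock held.  */
static void
writer_update (size_t newgen)
{
  atomic_store_explicit (&tls_generation, newgen, memory_order_relaxed);
}

/* Reader: may run concurrently during TLS access without the lock.
   Relaxed MO is enough because the reader only acts on values published
   by a synchronized dlopen/dlclose and the counter only increases.  */
static size_t
reader_peek (void)
{
  return atomic_load_explicit (&tls_generation, memory_order_relaxed);
}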
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
No bug.
This commit adds a new implementation for EVEX memchr that is not safe
for RTM because it uses vzeroupper. The benefit is that by using
ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop which is
faster than the RTM safe version which cannot use vpcmpeq because
there is no EVEX encoding for the instruction. All parts of the
implementation aside from the 4x loop are the same for the two
versions and the optimization is only relevant for large sizes.
Tigerlake:
size , algn , Pos , Cur T , New T , Win , Dif
512 , 6 , 192 , 9.2 , 9.04 , no-RTM , 0.16
512 , 7 , 224 , 9.19 , 8.98 , no-RTM , 0.21
2048 , 0 , 256 , 10.74 , 10.54 , no-RTM , 0.2
2048 , 0 , 512 , 14.81 , 14.87 , RTM , 0.06
2048 , 0 , 1024 , 22.97 , 22.57 , no-RTM , 0.4
2048 , 0 , 2048 , 37.49 , 34.51 , no-RTM , 2.98 <--
Icelake:
size , algn , Pos , Cur T , New T , Win , Dif
512 , 6 , 192 , 7.6 , 7.3 , no-RTM , 0.3
512 , 7 , 224 , 7.63 , 7.27 , no-RTM , 0.36
2048 , 0 , 256 , 8.48 , 8.38 , no-RTM , 0.1
2048 , 0 , 512 , 11.57 , 11.42 , no-RTM , 0.15
2048 , 0 , 1024 , 17.92 , 17.38 , no-RTM , 0.54
2048 , 0 , 2048 , 30.37 , 27.34 , no-RTM , 3.03 <--
test-memchr, test-wmemchr, and test-rawmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
An unknown vector operation error occurred with commit 2a76821c30. Fixed it
by writing the masked operand as "ymm{k1}{z}" rather than "ymm {k1} {z}"
(no spaces).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
And replace the generic algorithm with Brian Kernighan's one. GCC
optimizes it with popcnt if the architecture supports it, so there
is no need to add an extra POPCNT define to enable it.
This is really a micro-optimization that only adds complexity:
recent ABIs already support it (x86-64-v2 or powerpc64le) and it
simplifies the code for internal usage, since i686 does not allow an
internal iFUNC call.
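For reference, a minimal sketch of the Kernighan-style count mentioned above
(illustrative code, not the glibc source):

/* Each iteration clears the lowest set bit, so the loop runs once per set
   bit.  As noted above, GCC can compile this down to popcnt when the target
   supports it.  */
static int
countbits_kernighan (unsigned long x)
{
  int n = 0;
  while (x != 0)
    {
      x &= x - 1;     /* drop the lowest set bit */
      ++n;
    }
  return n;
}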
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
No bug. This commit optimizes memchr-evex.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
saving some ALU instructions in the alignment process, and most importantly
increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and
test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
No bug. This commit optimizes memchr-avx2.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
and saving a few instructions in the loop return. test-memchr,
test-rawmemchr, and test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
On x86_64, when configuring glibc with CFLAGS="-O2 -g -march=native",
some tests fail. After this patch, "make check" succeeds.
Tested on Intel Core i5-4590 with gcc 10.2.1.
No bug. This commit optimizes strchr-evex.S. The optimizations are
mostly small things such as saving an ALU instruction in the alignment process,
saving a few instructions in the loop return. The one significant
change is saving 2 instructions in the 4x loop. test-strchr,
test-strchrnul, test-wcschr, and test-wcschrnul are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
No bug. This commit optimizes strchr-avx2.S. The optimizations are all
small things such as saving an ALU instruction in the alignment process, saving a
few instructions in the loop return, saving some bytes in the main
loop, and increasing the ILP in the return cases. test-strchr,
test-strchrnul, test-wcschr, and test-wcschrnul are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
For some architectures, the two functions are aliased, so these
symbols need to be moved at the same time.
The symbols were moved using scripts/move-symbol-to-libc.py.
The current approach is to do these optimizations at a higher level,
in generic code, so that single-threaded cases can be specifically
targeted.
Furthermore, using IS_IN (libc) as a compile-time indicator that
all locks are private is no longer correct once process-shared lock
implementations are moved into libc.
The generic <lowlevellock.h> is not compatible with assembler code
(obviously), so it's necessary to remove two long-unused #includes.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove generic tlsdesc code related to lazy tlsdesc processing since
lazy tlsdesc relocation is no longer supported. This includes removing
GL(dl_load_lock) from _dl_make_tlsdesc_dynamic which is only called at
load time when that lock is already held.
Added a documentation comment too.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
No bug. This commit optimizes strlen-avx2.S. The optimizations are
mostly small things but they add up to roughly 10-30% performance
improvement for strlen. The results for strnlen are a bit more
ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen
are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
No bug. This commit optimizes strlen-evex.S. The
optimizations are mostly small things but they add up to roughly
10-30% performance improvement for strlen. The results for strnlen are
a bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and
test-wcsnlen are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
No bug. This commit adds an optimized path for the less_vec memset case
that uses the avx512vl/avx512bw masked store, avoiding the excessive
branching. test-memset and test-wmemset are passing.
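A rough sketch of the idea in C intrinsics (illustrative; the actual change
is in assembly and the helper name is made up):

#include <immintrin.h>
#include <stddef.h>

/* For n <= 32, build a mask with the low n bits set and store the broadcast
   byte through it, so the small-size cases need no branching on the exact
   length.  Requires AVX512VL, AVX512BW and BMI2.  */
static void
small_memset_masked (char *dst, int c, size_t n)
{
  __m256i v = _mm256_set1_epi8 ((char) c);
  __mmask32 m = (__mmask32) _bzhi_u32 (~0u, (unsigned int) n);
  _mm256_mask_storeu_epi8 (dst, m, v);
}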
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Since strchr-avx2.S updated by
commit 1f745ecc21
Author: noah <goldstein.w.n@gmail.com>
Date: Wed Feb 3 00:38:59 2021 -0500
x86-64: Refactor and improve performance of strchr-avx2.S
uses sarx:
c4 e2 72 f7 c0 sarx %ecx,%eax,%eax
for strchr-avx2 family functions, require BMI2 in ifunc-impl-list.c and
ifunc-avx2.h.
Since __strlen_evex and __strnlen_evex added by
commit 1fd8c163a8
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 5 06:24:52 2021 -0800
x86-64: Add ifunc-avx2.h functions with 256-bit EVEX
use sarx:
c4 e2 6a f7 c0 sarx %edx,%eax,%eax
require BMI2 for __strlen_evex and __strnlen_evex in ifunc-impl-list.c.
ifunc-avx2.h already requires BMI2 for EVEX implementation.
No bug. This commit updates the large memcpy case (no overlap). The
update is to perform memcpy on either 2 or 4 contiguous pages at
once. This 1) helps to alleviate the effects of false memory aliasing
when destination and source have a close 4k alignment and 2) in most
cases and for most DRAM units is a modestly more efficient access
pattern. These changes are a clear performance improvement for
VEC_SIZE=16/32, though more ambiguous for VEC_SIZE=64. test-memcpy,
test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all
pass.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Lazy tlsdesc relocation is racy because the static tls optimization and
tlsdesc management operations are done without holding the dlopen lock.
This is similar to commit b7cf203b5c
for aarch64, but it fixes a different race: bug 27137.
Another issue is that ld auditing ignores DT_BIND_NOW and thus tries to
relocate tlsdesc lazily, but that does not work in a BIND_NOW module
due to missing DT_TLSDESC_PLT. Unconditionally relocating tlsdesc at
load time fixes this bug 27721 too.
With this patch, the maximal known error for tgamma is now reduced to 9 ulps
for dbl-64, for all rounding modes. Since exhaustive testing is not possible
for dbl-64, it might be that there are still cases with an error larger than
9 ulps, but all known cases are fixed (intensive tests were done to find cases
with large errors).
Tested on x86_64 and powerpc (and by Adhemerval Zanella on aarch64, arm,
s390x, sparc, and i686).
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
For j0f/j1f/y0f/y1f, the largest error for all binary32
inputs is reduced to at most 9 ulps for all rounding modes.
The new code is enabled only when there is a cancellation at the very end of
the j0f/j1f/y0f/y1f computation, or for very large inputs, thus should not
give any visible slowdown on average. Two different algorithms are used:
* around the first 64 zeros of j0/j1/y0/y1, approximation polynomials of
degree 3 are used, computed using the Sollya tool (https://www.sollya.org/)
* for large inputs, an asymptotic formula from [1] is used (its
leading-order form is recalled after the reference below)
[1] Fast and Accurate Bessel Function Computation,
John Harrison, Proceedings of Arith 19, 2009.
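For orientation, the leading-order form of such asymptotic expansions is the
classical J0(x) ~ sqrt(2/(pi*x)) * cos(x - pi/4) and
Y0(x) ~ sqrt(2/(pi*x)) * sin(x - pi/4), with a phase of 3*pi/4 for j1/y1;
the formula taken from [1] adds further correction terms to reach the
accuracy quoted above.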
Inputs yielding the new largest errors are added to auto-libm-test-in,
and ulps are regenerated for various targets (thanks Adhemerval Zanella).
Tested on x86_64 with --disable-multi-arch and on powerpc64le-linux-gnu.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
config/i386/constraints.md in GCC has
(define_constraint "e"
  "32-bit signed integer constant, or a symbolic reference known
   to fit that range (for immediate operands in sign-extending x86-64
   instructions)."
  (match_operand 0 "x86_64_immediate_operand"))
Since movq takes a signed 32-bit immediate or a register source operand,
use "er", instead of "nr"/"ir", constraint for 32-bit signed integer
constant or register on movq.
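A hypothetical illustration of the constraint choice (not the actual glibc
macro):

/* With "er", the movq source may be either a register or a sign-extended
   32-bit immediate, exactly what the instruction can encode; "i"/"n" would
   also accept 64-bit constants that movq cannot take as an immediate.  */
static inline void
store_word (unsigned long *p, unsigned long v)
{
  asm ("movq %1, %0" : "=m" (*p) : "er" (v));
}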
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Update ifunc-memmove.h to select the function optimized with AVX512
instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort
with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX
optimized string/memory functions with
    xtest
    jz   1f
    vzeroall
    ret
1:
    vzeroupper
    ret
at function exit on processors with usable RTM, but without 256-bit EVEX
instructions to avoid VZEROUPPER inside a transactionally executing RTM
region.
Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function
exit.
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM
abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
Update ifunc-memmove.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to
select the function optimized with 256-bit EVEX instructions using
YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW
and BMI2 since VZEROUPPER isn't needed at function exit.
For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP
is set.
Finally remove all mpa-related files, headers, declarations, probes, unused
tables and update makefiles.
Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>