glibc

mirror of https://sourceware.org/git/glibc.git synced 2025-01-10 11:20:10 +00:00

Author	SHA1	Message	Date
H.J. Lu	c5c666f349	s_sincosf.h: Change pio4 type to float [BZ #28713 ] s_cosf.c and s_sinf.c have if (abstop12 (y) < abstop12 (pio4)) where abstop12 takes a float argument, but pio4 is static const double. pio4 is used only in calls to abstop12 and never in arithmetic. Apply -static const double pio4 = 0x1.921FB54442D18p-1; +static const float pio4 = 0x1.921FB6p-1f; to fix: FAIL: math/test-float-cos FAIL: math/test-float-sin FAIL: math/test-float-sincos FAIL: math/test-float32-cos FAIL: math/test-float32-sin FAIL: math/test-float32-sincos when compiling with GCC 12. Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> (cherry picked from commit `d3e4f5a101`)	2023-01-11 15:29:37 +01:00
H.J. Lu	dd0c72fb46	Regenerate ulps on x86_64 with GCC 12 Fix FAIL: math/test-float-clog10 FAIL: math/test-float32-clog10 on Intel Core i7-1165G7 with GCC 12. (cherry picked from commit `de8a0897e3`)	2023-01-11 14:35:10 +01:00
Fangrui Song	bbe4bbb6e8	elf: Drop elf/tls-macros.h in favor of __thread and tls_model attributes [BZ #28152 ] [BZ #28205 ] elf/tls-macros.h was added for TLS testing when GCC did not support __thread. __thread and tls_model attributes are mature now and have been used by many newer tests. Also delete tst-tls2.c which tests .tls_common (unused by modern GCC and unsupported by Clang/LLD). .tls_common and .tbss definition are almost identical after linking, so the runtime test doesn't add additional coverage. Assembler and linker tests should be on the binutils side. When LLD 13.0.0 is allowed in configure.ac (https://sourceware.org/pipermail/libc-alpha/2021-August/129866.html), `make check` result is on par with glibc built with GNU ld on aarch64 and x86_64. As a future clean-up, TLS_GD/TLS_LD/TLS_IE/TLS_IE macros can be removed from sysdeps/*/tls-macros.h. We can add optional -mtls-dialect={gnu2,trad} tests to ensure coverage. Tested on aarch64-linux-gnu, powerpc64le-linux-gnu, and x86_64-linux-gnu. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com> (cherry picked from commit `33c50ef428`)	2023-01-10 19:09:21 +01:00
Alan Modra	d0e2ac0c59	elf/tst-tlsopt-powerpc fails when compiled with -mcpu=power10 (BZ# 29776) Supports pcrel addressing of TLS GOT entry. Also tweak the non-pcrel asm constraint to better reflect how the reg is used. (cherry picked from commit `94628de778`)	2023-01-10 17:33:30 +01:00
Noah Goldstein	e3255e7d21	x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591 ] Previous implementation was adjusting length (rsi) to match bytes (eax), but since there is no bound to length this can cause overflow. Fix is to just convert the byte-count (eax) to length by dividing by sizeof (wchar_t) before the comparison. Full check passes on x86-64 and build succeeds w/ and w/o multiarch. (cherry picked from commit `b0969fa53a`)	2022-11-24 14:42:41 -08:00
Vladislav Khmelevsky	691f70b84a	elf: Fix rtld-audit trampoline for aarch64 This patch fixes two problems with audit: 1. The DL_OFFSET_RV_VPCS offset was mixed up with DL_OFFSET_RG_VPCS, resulting in x2 register value nulling in RG structure. 2. We need to preserve the x8 register before function call, but don't have to save it's new value and restore it before return. Anyway the final restore was using OFFSET_RV instead of OFFSET_RG value which is wrong (althoug doesn't affect anything). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `eb4181e9f4`)	2022-11-22 10:34:14 -03:00
Florian Weimer	d57cdc1b5a	Linux: Support __IPC_64 in sysvctl *ctl command arguments (bug 29771) Old applications pass __IPC_64 as part of the command argument because old glibc did not check for unknown commands, and passed through the arguments directly to the kernel, without adding __IPC_64. Applications need to continue doing that for old glibc compatibility, so this commit enables this approach in current glibc. For msgctl and shmctl, if no translation is required, make direct system calls, as we did before the time64 changes. If translation is required, mask __IPC_64 from the command argument. For semctl, the union-in-vararg argument handling means that translation is needed on all architectures. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `22a46dee24`)	2022-11-11 18:06:25 +01:00
Adhemerval Zanella	ca5df79545	linux: Fix generic struct_stat for 64 bit time (BZ# 29657) The generic Linux struct_stat misses the conditionals to use bits/struct_stat_time64_helper.h in the __USE_TIME_BITS64 for architecture that uses __TIMESIZE == 32 (currently csky and nios2). Since newer ports should not support 32 bit time_t, the generic implementation should be used as default. For arm, hppa, and sh a copy of default struct_stat is added, while for csky and nios a new one based on generic is used, along with conditionals to use bits/struct_stat_time64_helper.h. The default struct_stat is also replaced with the generic one. Checked on aarch64-linux-gnu and arm-linux-gnueabihf. (cherry picked from commit `7a6ca82f80`)	2022-10-25 16:12:20 -03:00
Aurelien Jarno	e570b865b5	x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementations The AVX2 strrchr and wcsrchr implementation uses the 'blsmsk' instruction which belongs to the BMI1 CPU feature and the 'shrx' instruction, which belongs to the BMI2 CPU feature. Fixes: `df7e295d18` ("x86: Optimize {str\|wcs}rchr-avx2") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `7e8283170c`)	2022-10-04 00:00:59 +02:00
Aurelien Jarno	36d6b9be3d	x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementation The AVX2 memrchr implementation uses the 'shlxl' instruction, which belongs to the BMI2 CPU feature and uses the 'lzcnt' instruction, which belongs to the LZCNT CPU feature. Fixes: `af5306a735` ("x86: Optimize memrchr-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `3c0c78afab`)	2022-10-04 00:00:59 +02:00
Aurelien Jarno	94b9c1b640	x86-64: Require BMI2 for AVX2 (raw\|w)memchr implementations The AVX2 memchr, rawmemchr and wmemchr implementations use the 'bzhi' and 'sarx' instructions, which belongs to the BMI2 CPU feature. Fixes: `acfd088a19` ("x86: Optimize memchr-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `e3e7fab7fe`)	2022-10-04 00:00:59 +02:00
Aurelien Jarno	67e863742d	x86-64: Require BMI2 for AVX2 wcs(n)cmp implementations The AVX2 wcs(n)cmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature. NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input. Partially fixes: `b77b06e0e2` ("x86: Optimize strcmp-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `f31a5a884e`)	2022-10-04 00:00:59 +02:00
Aurelien Jarno	b9cbb8dd48	x86-64: Require BMI2 for AVX2 strncmp implementation The AVX2 strncmp implementations uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature. NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input. Partially fixes: `b77b06e0e2` ("x86: Optimize strcmp-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `fc7de1d9b9`)	2022-10-04 00:00:59 +02:00
Aurelien Jarno	e1561d8cf0	x86-64: Require BMI2 for AVX2 strcmp implementation The AVX2 strcmp implementation uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature. NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input. Partially fixes: `b77b06e0e2` ("x86: Optimize strcmp-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `4d64c64457`)	2022-10-04 00:00:59 +02:00
Aurelien Jarno	414fc856ff	x86-64: Require BMI2 for AVX2 str(n)casecmp implementations The AVX2 str(n)casecmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature. NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input. Partially fixes: `b77b06e0e2` ("x86: Optimize strcmp-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `10f79d3670`)	2022-10-04 00:00:58 +02:00
Aurelien Jarno	95f5089d4a	x86: include BMI1 and BMI2 in x86-64-v3 level The "System V Application Binary Interface AMD64 Architecture Processor Supplement" mandates the BMI1 and BMI2 CPU features for the x86-64-v3 level. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `b80f16adbd`)	2022-10-04 00:00:58 +02:00
Wangyang Guo	ea69248445	nptl: Add backoff mechanism to spinlock loop When mutiple threads waiting for lock at the same time, once lock owner releases the lock, waiters will see lock available and all try to lock, which may cause an expensive CAS storm. Binary exponential backoff with random jitter is introduced. As try-lock attempt increases, there is more likely that a larger number threads compete for adaptive mutex lock, so increase wait time in exponential. A random jitter is also added to avoid synchronous try-lock from other threads. v2: Remove read-check before try-lock for performance. v3: 1. Restore read-check since it works well in some platform. 2. Make backoff arch dependent, and enable it for x86_64. 3. Limit max backoff to reduce latency in large critical section. v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h v5: Commit log updated for regression in large critical section. Result of pthread-mutex-locks bench Test Platform: Xeon 8280L (2 socket, 112 CPUs in total) First Row: thread number First Col: critical section length Values: backoff vs upstream, time based, low is better non-critical-length: 1 1 2 4 8 16 32 64 112 140 0 0.99 0.58 0.52 0.49 0.43 0.44 0.46 0.52 0.54 1 0.98 0.43 0.56 0.50 0.44 0.45 0.50 0.56 0.57 2 0.99 0.41 0.57 0.51 0.45 0.47 0.48 0.60 0.61 4 0.99 0.45 0.59 0.53 0.48 0.49 0.52 0.64 0.65 8 1.00 0.66 0.71 0.63 0.56 0.59 0.66 0.72 0.71 16 0.97 0.78 0.91 0.73 0.67 0.70 0.79 0.80 0.80 32 0.95 1.17 0.98 0.87 0.82 0.86 0.89 0.90 0.90 64 0.96 0.95 1.01 1.01 0.98 1.00 1.03 0.99 0.99 128 0.99 1.01 1.01 1.17 1.08 1.12 1.02 0.97 1.02 non-critical-length: 32 1 2 4 8 16 32 64 112 140 0 1.03 0.97 0.75 0.65 0.58 0.58 0.56 0.70 0.70 1 0.94 0.95 0.76 0.65 0.58 0.58 0.61 0.71 0.72 2 0.97 0.96 0.77 0.66 0.58 0.59 0.62 0.74 0.74 4 0.99 0.96 0.78 0.66 0.60 0.61 0.66 0.76 0.77 8 0.99 0.99 0.84 0.70 0.64 0.66 0.71 0.80 0.80 16 0.98 0.97 0.95 0.76 0.70 0.73 0.81 0.85 0.84 32 1.04 1.12 1.04 0.89 0.82 0.86 0.93 0.91 0.91 64 0.99 1.15 1.07 1.00 0.99 1.01 1.05 0.99 0.99 128 1.00 1.21 1.20 1.22 1.25 1.31 1.12 1.10 0.99 non-critical-length: 128 1 2 4 8 16 32 64 112 140 0 1.02 1.00 0.99 0.67 0.61 0.61 0.61 0.74 0.73 1 0.95 0.99 1.00 0.68 0.61 0.60 0.60 0.74 0.74 2 1.00 1.04 1.00 0.68 0.59 0.61 0.65 0.76 0.76 4 1.00 0.96 0.98 0.70 0.63 0.63 0.67 0.78 0.77 8 1.01 1.02 0.89 0.73 0.65 0.67 0.71 0.81 0.80 16 0.99 0.96 0.96 0.79 0.71 0.73 0.80 0.84 0.84 32 0.99 0.95 1.05 0.89 0.84 0.85 0.94 0.92 0.91 64 1.00 0.99 1.16 1.04 1.00 1.02 1.06 0.99 0.99 128 1.00 1.06 0.98 1.14 1.39 1.26 1.08 1.02 0.98 There is regression in large critical section. But adaptive mutex is aimed for "quick" locks. Small critical section is more common when users choose to use adaptive pthread_mutex. Signed-off-by: Wangyang Guo <wangyang.guo@intel.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `8162147872`)	2022-09-28 07:34:53 -07:00
Noah Goldstein	04efdcfac4	sysdeps: Add 'get_fast_jitter' interace in fast-jitter.h 'get_fast_jitter' is meant to be used purely for performance purposes. In all cases it's used it should be acceptable to get no randomness (see default case). An example use case is in setting jitter for retries between threads at a lock. There is a performance benefit to having jitter, but only if the jitter can be generated very quickly and ultimately there is no serious issue if no jitter is generated. The implementation generally uses 'HP_TIMING_NOW' iff it is inlined (avoid any potential syscall paths). Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `911c63a51c`)	2022-09-28 07:34:31 -07:00
Arjun Shankar	68507377f2	socket: Check lengths before advancing pointer in CMSG_NXTHDR The inline and library functions that the CMSG_NXTHDR macro may expand to increment the pointer to the header before checking the stride of the increment against available space. Since C only allows incrementing pointers to one past the end of an array, the increment must be done after a length check. This commit fixes that and includes a regression test for CMSG_FIRSTHDR and CMSG_NXTHDR. The Linux, Hurd, and generic headers are all changed. Tested on Linux on armv7hl, i686, x86_64, aarch64, ppc64le, and s390x. [BZ #28846] Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org> (cherry picked from commit `9c443ac455`)	2022-08-22 18:59:26 +02:00
Florian Weimer	1fcc7bfee2	alpha: Fix generic brk system call emulation in __brk_call (bug 29490) The kernel special-cases the zero argument for alpha brk, and we can use that to restore the generic Linux error handling behavior. Fixes commit `b57ab258c1` ("Linux: Introduce __brk_call for invoking the brk system call"). (cherry picked from commit `e7ad26ee3c`)	2022-08-22 11:12:40 +02:00
Joseph Myers	4ab59ce4e5	Update syscall lists for Linux 5.19 Linux 5.19 has no new syscalls, but enables memfd_secret in the uapi headers for RISC-V. Update the version number in syscall-names.list to reflect that it is still current for 5.19 and regenerate the arch-syscall.h headers with build-many-glibcs.py update-syscalls. Tested with build-many-glibcs.py. (cherry picked from commit `fccadcdf5b`)	2022-08-08 07:54:02 +02:00
Joseph Myers	b991af5063	Update syscall-names.list for Linux 5.18 Linux 5.18 has no new syscalls. Update the version number in syscall-names.list to reflect that it is still current for 5.18. Tested with build-many-glibcs.py. (cherry picked from commit `3d9926663c`)	2022-07-21 17:06:27 +02:00
Noah Goldstein	ccc54bd61c	x86: Add missing IS_IN (libc) check to strncmp-sse4_2.S Was missing to for the multiarch build rtld-strncmp-sse4_2.os was being built and exporting symbols: build/glibc/string/rtld-strncmp-sse4_2.os: 0000000000000000 T __strncmp_sse42 Introduced in: commit `11ffcacb64` Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Jun 21 12:10:50 2017 -0700 x86-64: Implement strcmp family IFUNC selectors in C (cherry picked from commit `96ac447d91`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	35f9c72c8b	x86: Move mem{p}{mov\|cpy}_{chk_}erms to its own file The primary memmove_{impl}_unaligned_erms implementations don't interact with this function. Putting them in same file both wastes space and unnecessarily bloats a hot code section. (cherry picked from commit `21925f6473`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	7079931c51	x86: Move and slightly improve memset_erms Implementation wise: 1. Remove the VZEROUPPER as memset_{impl}_unaligned_erms does not use the L(stosb) label that was previously defined. 2. Don't give the hotpath (fallthrough) to zero size. Code positioning wise: Move memset_{chk}_erms to its own file. Leaving it in between the memset_{impl}_unaligned both adds unnecessary complexity to the file and wastes space in a relatively hot cache section. (cherry picked from commit `4a3f29e7e4`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	f4598f0351	x86: Add definition for __wmemset_chk AVX2 RTM in ifunc impl list This was simply missing and meant we weren't testing it properly. (cherry picked from commit `2a1099020c`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	aadd0a1c7c	x86: Put wcs{n}len-sse4.1 in the sse4.1 text section Previously was missing but the two implementations shouldn't get in the sse2 (generic) text section. (cherry picked from commit `afc6e4328f`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	d201c59177	x86: Align entry for memrchr to 64-bytes. The function was tuned around 64-byte entry alignment and performs better for all sizes with it. As well different code boths where explicitly written to touch the minimum number of cache line i.e sizes <= 32 touch only the entry cache line. (cherry picked from commit `227afaa672`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	c51d8d383c	x86: Add BMI1/BMI2 checks for ISA_V3 check BMI1/BMI2 are part of the ISA V3 requirements: https://en.wikipedia.org/wiki/X86-64 And defined by GCC when building with `-march=x86-64-v3` (cherry picked from commit `8da9f346cb`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	ba1c3f23d9	x86: Cleanup bounds checking in large memcpy case 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x). Previously was using `__x86_rep_movsb_threshold` and should have been using `__x86_shared_non_temporal_threshold`. 2. Avoid reloading __x86_shared_non_temporal_threshold before the L(large_memcpy_4x) bounds check. 3. Document the second bounds check for L(large_memcpy_4x) more clearly. (cherry picked from commit `89a25c6f64`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	94b0dc9419	x86: Add bounds `x86_non_temporal_threshold` The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed by memmove-vec-unaligned-erms. The lower-bound is needed because memmove-vec-unaligned-erms unrolls the loop aggressively in the L(large_memset_4x) case. The upper-bound is needed because memmove-vec-unaligned-erms right-shifts the value of `x86_non_temporal_threshold` by LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow. The lack of lower-bound can be a correctness issue. The lack of upper-bound cannot. (cherry picked from commit `b446822b6a`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	9d50e162ee	x86: Add sse42 implementation to strcmp's ifunc This has been missing since the the ifuncs where added. The performance of SSE4.2 is preferable to to SSE2. Measured on Tigerlake with N = 20 runs. Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906 (cherry picked from commit `ff439c4717`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	6e008c884d	x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Move the setting of `rep_movsb_stop_threshold` to after the tunables have been collected so that the `rep_movsb_stop_threshold` (which is used to redirect control flow to the non_temporal case) will use any user value for `non_temporal_threshold` (set using glibc.cpu.x86_non_temporal_threshold) (cherry picked from commit `0355915514`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	fc54e1fae8	x86: Align varshift table to 32-bytes This ensures the load will never split a cache line. (cherry picked from commit `0f91811333`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	820504e3ed	x86: ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST expect no transactions Give fall-through path to `vzeroupper` and taken-path to `vzeroall`. Generally even on machines with RTM the expectation is the string-library functions will not be called in transactions. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `c28db9cb29`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	3c87383a20	x86: Shrink code size of memchr-evex.S This is not meant as a performance optimization. The previous code was far to liberal in aligning targets and wasted code size unnecissarily. The total code size saving is: 64 bytes There are no non-negligible changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 1.000 Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `56da3fe1dd`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	a910d7e164	x86: Shrink code size of memchr-avx2.S This is not meant as a performance optimization. The previous code was far to liberal in aligning targets and wasted code size unnecissarily. The total code size saving is: 59 bytes There are no major changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 0.967 Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `6dcbb7d95d`) x86: Fix page cross case in rawmemchr-avx2 [BZ #29234] commit `6dcbb7d95d` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Mon Jun 6 21:11:33 2022 -0700 x86: Shrink code size of memchr-avx2.S Changed how the page cross case aligned string (rdi) in rawmemchr. This was incompatible with how `L(cross_page_continue)` expected the pointer to be aligned and would cause rawmemchr to read data start started before the beginning of the string. What it would read was in valid memory but could count CHAR matches resulting in an incorrect return value. This commit fixes that issue by essentially reverting the changes to the L(page_cross) case as they didn't really matter. Test cases added and all pass with the new code (and where confirmed to fail with the old code). Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `2c9af8421d`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	b05bd59823	x86: Optimize memrchr-avx2.S The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 306 bytes Geometric Mean of all benchmarks New / Old: 0.760 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 10-20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 15-45% speedup. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `af5306a735`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	83a986e9fb	x86: Optimize memrchr-evex.S The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 35% speedup. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `b4209615a0`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	4901009dad	x86: Optimize memrchr-sse2.S The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion with this patch is this is acceptable given the "coldness" of this case (less than 4%) and generally performance improvement in the other far more common cases. 2. There are some regressions 5-15% for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). It is not always the case this is a regression. My intuition is some frontend quirk is partially explaining the data although I haven't been able to find the root cause. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `731feee386`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	e805606193	x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` The RTM vzeroupper mitigation has no way of replacing inline vzeroupper not before a return. This can be useful when hoisting a vzeroupper to save code size for example: ``` L(foo): cmpl %eax, %edx jz L(bar) tzcntl %eax, %eax addq %rdi, %rax VZEROUPPER_RETURN L(bar): xorl %eax, %eax VZEROUPPER_RETURN ``` Can become: ``` L(foo): COND_VZEROUPPER cmpl %eax, %edx jz L(bar) tzcntl %eax, %eax addq %rdi, %rax ret L(bar): xorl %eax, %eax ret ``` This code does not change any existing functionality. There is no difference in the objdump of libc.so before and after this patch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `dd5c483b25`)	2022-07-18 20:45:20 -07:00
Noah Goldstein	70be93d1c5	x86: Create header for VEC classes in x86 strings library This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `8a780a6b91`)	2022-07-18 20:45:20 -07:00
Raghuveer Devulapalli	82a707aeb7	x86_64: Add strstr function with 512-bit EVEX Adding a 512-bit EVEX version of strstr. The algorithm works as follows: (1) We spend a few cycles at the begining to peek into the needle. We locate an edge in the needle (first occurance of 2 consequent distinct characters) and also store the first 64-bytes into a zmm register. (2) We search for the edge in the haystack by looking into one cache line of the haystack at a time. This avoids having to read past a page boundary which can cause a seg fault. (3) If an edge is found in the haystack we first compare the first 64-bytes of the needle (already stored in a zmm register) before we proceed with a full string compare performed byte by byte. Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512) Geometric mean of all benchmarks: new / old = 0.66 Difficult skiptable(0) : new / old = 0.02 Difficult skiptable(1) : new / old = 0.01 Difficult 2-way : new / old = 0.25 Difficult testing first 2 : new / old = 1.26 Difficult skiptable(0) : new / old = 0.05 Difficult skiptable(1) : new / old = 0.06 Difficult 2-way : new / old = 0.26 Difficult testing first 2 : new / old = 1.05 Difficult skiptable(0) : new / old = 0.42 Difficult skiptable(1) : new / old = 0.24 Difficult 2-way : new / old = 0.21 Difficult testing first 2 : new / old = 1.04 Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `5082a287d5`) x86: Remove __mmask intrinsics in strstr-avx512.c The intrinsics are not available before GCC7 and using standard operators generates code of equivalent or better quality. Removed: _cvtmask64_u64 _kshiftri_mask64 _kand_mask64 Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958 (cherry picked from commit `f2698954ff`)	2022-07-18 20:45:20 -07:00
H.J. Lu	f6bc52f080	x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we can ignore their r_addends. Reviewed-by: Fangrui Song <maskray@google.com> (cherry picked from commit `f8587a6189`)	2022-07-18 20:45:20 -07:00
Sunil K Pandey	8ab861d295	x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen This patch implements following evex512 version of string functions. Perf gain for evex512 version is up to 50% as compared to evex, depending on length and alignment. Placeholder function, not used by any processor at the moment. - String length function using 512 bit vectors. - String N length using 512 bit vectors. - Wide string length using 512 bit vectors. - Wide string N length using 512 bit vectors. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `9c66efb86f`)	2022-07-18 20:45:20 -07:00
Adhemerval Zanella	eb9aa96fac	x86_64: Remove bzero optimization Both symbols are marked as legacy in POSIX.1-2001 and removed on POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE or _DEFAULT_SOURCE. GCC also replaces bcopy with a memmove and bzero with memset on default configuration (to actually get a bzero libc call the code requires to omit string.h inclusion and built with -fno-builtin), so it is highly unlikely programs are actually calling libc bzero symbol. On a recent Linux distro (Ubuntu 22.04), there is no bzero calls by the installed binaries. $ cat count_bstring.sh #!/bin/bash files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done` total=0 for file in $files; do symbols=`objdump -R $file 2>&1` if [ $? -eq 0 ]; then ncalls=`echo $symbols \| grep -w $1 \| wc -l` ((total=total+ncalls)) if [ $ncalls -gt 0 ]; then echo "$file: $ncalls" fi fi done echo "TOTAL=$total" $ ./count_bstring.sh bzero TOTAL=0 Checked on x86_64-linux-gnu. (cherry picked from commit `9403b71ae9`)	2022-07-18 20:45:20 -07:00
Sunil K Pandey	8d324019e6	x86_64: Remove end of line trailing spaces This commit remove trailing space introduced by following commit. commit `a775a7a3eb` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Wed Jun 23 01:56:29 2021 -0400 x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]	2022-07-18 20:45:09 -07:00
Adhemerval Zanella	4b246b2bbd	linux: Fix mq_timereceive check for 32 bit fallback code (BZ 29304) On success, mq_receive() and mq_timedreceive() return the number of bytes in the received message, so it requires to check if the value is larger than 0. Checked on i686-linux-gnu. (cherry picked from commit `71d87d85bf`)	2022-06-30 10:46:39 -03:00
Adhemerval Zanella	bb4148283f	nios2: Remove _dl_skip_args usage (BZ# 29187) Since `ad43cac44a` the generic code already shuffles the argv/envp/auxv on the stack to remove the ld.so own arguments and thus _dl_skip_args is always 0. So there is no need to adjust the argc or argv. Checked with qemu-user that arguments are correctly passed on both constructors and main program. Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `4868ba5d25`)	2022-06-10 09:15:00 -03:00
Adhemerval Zanella	96944f0f81	hppa: Remove _dl_skip_args usage (BZ# 29165) Different than other architectures, hppa creates an unrelated stack frame where ld.so argc/argv adjustments done by `ad43cac44a` is not done on the argc/argv saved/restore by _dl_start_user. Instead load _dl_argc and _dl_argv directlty instead of adjust them using _dl_skip_args value. Checked on hppa-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `6242602273`)	2022-06-10 09:13:54 -03:00

1 2 3 4 5 ...

14556 Commits