glibc

mirror of https://sourceware.org/git/glibc.git synced 2024-11-23 05:20:06 +00:00

Author	SHA1	Message	Date
Sunil K Pandey	2bf02c5843	x86-64: Add vector cbrt/cbrtf implementation to libmvec Implement vectorized cbrt/cbrtf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector cbrt/cbrtf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-29 11:38:02 -08:00
Sunil K Pandey	aa1809a1df	x86-64: Add vector sinh/sinhf implementation to libmvec Implement vectorized sinh/sinhf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector sinh/sinhf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-29 11:37:55 -08:00
Sunil K Pandey	76ddc74e86	x86-64: Add vector expm1/expm1f implementation to libmvec Implement vectorized expm1/expm1f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector expm1/expm1f with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-29 11:37:49 -08:00
Sunil K Pandey	ef7ea9c132	x86-64: Add vector cosh/coshf implementation to libmvec Implement vectorized cosh/coshf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector cosh/coshf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-29 11:37:42 -08:00
Sunil K Pandey	8b726453d5	x86-64: Add vector exp10/exp10f implementation to libmvec Implement vectorized exp10/exp10f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector exp10/exp10f with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-29 11:37:35 -08:00
Sunil K Pandey	3fc9ccc20b	x86-64: Add vector exp2/exp2f implementation to libmvec Implement vectorized exp2/exp2f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector exp2/exp2f with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-29 11:37:29 -08:00
Sunil K Pandey	37475ba883	x86-64: Add vector hypot/hypotf implementation to libmvec Implement vectorized hypot/hypotf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector hypot/hypotf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-29 11:37:21 -08:00
Sunil K Pandey	11c01de14c	x86-64: Add vector asin/asinf implementation to libmvec Implement vectorized asin/asinf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector asin/asinf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-29 11:37:03 -08:00
Sunil K Pandey	146310177a	x86-64: Add vector atan/atanf implementation to libmvec Implement vectorized atan/atanf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector atan/atanf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-29 11:36:46 -08:00
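The libmvec commits above add SSE, AVX, AVX2 and AVX512 variants "as per vector ABI". The sketch below derives the exported symbol names. It assumes the x86 Vector Function ABI mangling scheme: `_ZGV`, then an ISA letter, then `N` for unmasked, then the lane count, then one `v` per vector argument, then the scalar name. `vector_abi_name` and `vector_bits` are hypothetical helpers, and only unmasked variants are covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Register width implied by the ISA letter (assumed x86 vector ABI):
   'b' = SSE, 'c' = AVX, 'd' = AVX2, 'e' = AVX-512.  */
static unsigned
vector_bits (char isa)
{
  switch (isa)
    {
    case 'b':
      return 128;
    case 'c':
    case 'd':
      return 256;
    case 'e':
      return 512;
    default:
      return 0;
    }
}

/* Write the unmasked ("N") variant name for FUNC into BUF, e.g.
   "_ZGVdN4v_cbrt".  ELEM_SIZE is sizeof (double) or sizeof (float);
   NARGS is the number of vector ("v") parameters: 1 for cbrt, 2 for
   hypot.  Returns the snprintf result.  */
static int
vector_abi_name (char *buf, size_t len, char isa, size_t elem_size,
                 int nargs, const char *func)
{
  unsigned bits = vector_bits (isa);
  assert (bits != 0 && nargs >= 1 && nargs <= 4);
  unsigned lanes = bits / (unsigned) (elem_size * 8);
  char params[5] = "";                    /* zero-filled, so always terminated */
  memset (params, 'v', (size_t) nargs);   /* one 'v' per vector argument */
  return snprintf (buf, len, "_ZGV%cN%u%s_%s", isa, lanes, params, func);
}
```

For example, the AVX2 double variant of cbrt comes out as `_ZGVdN4v_cbrt`, and the AVX-512 float variant as `_ZGVeN16v_cbrtf`.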
Adhemerval Zanella	83b8d5027d	malloc: Remove memusage.h And use machine-sp.h instead. The Linux implementation is based on already provided CURRENT_STACK_FRAME (used on nptl code) and STACK_GROWS_UPWARD is replaced with _STACK_GROWS_UP.	2021-12-28 14:57:57 -03:00
Adhemerval Zanella	a75b1e35c5	malloc: Use hp-timing on libmemusage Instead of reimplementing it in the GETTIME macro.	2021-12-28 14:57:57 -03:00
Adhemerval Zanella	8c0664e2b8	elf: Add _dl_audit_pltexit It consolidates the code required to call la_pltexit audit callback. Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu. Reviewed-by: Florian Weimer <fweimer@redhat.com>	2021-12-28 08:40:38 -03:00
Samuel Thibault	ae49f218da	hurd: Fix static-PIE startup hurd initialization stages use RUN_HOOK to run various initialization functions. That is however using absolute addresses which need to be relocated, which is done later by csu. We can however easily make the linker compute relative addresses which thus don't need a relocation. The new SET_RELHOOK and RUN_RELHOOK macros implement this.	2021-12-28 10:28:22 +01:00
Noah Goldstein	cca457f9c5	x86: Optimize L(less_vec) case in memcmpeq-evex.S No bug. Optimizations are twofold. 1) Replace page cross and 0/1 checks with masked load instructions in L(less_vec). In applications this reduces branch-misses in the hot [0, 32] case. 2) Change control flow so that the L(less_vec) case gets the fall through. Change 2) helps calls in the [0, 32] size range but comes at the cost of calls in the [33, 64] size range. From profiles of GCC and Python3, 94%+ and 99%+ of calls are in the [0, 32] range, so this appears to be the right tradeoff. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-27 03:18:58 -06:00
Noah Goldstein	abddd61de0	x86: Optimize L(less_vec) case in memcmp-evex-movbe.S No bug. Optimizations are twofold. 1) Replace page cross and 0/1 checks with masked load instructions in L(less_vec). In applications this reduces branch-misses in the hot [0, 32] case. 2) Change control flow so that the L(less_vec) case gets the fall through. Change 2) helps calls in the [0, 32] size range but comes at the cost of calls in the [33, 64] size range. From profiles of GCC and Python3, 94%+ and 99%+ of calls are in the [0, 32] range, so this appears to be the right tradeoff. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-27 03:17:59 -06:00
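The two commits above replace the "page cross" check in L(less_vec) with masked loads. The sketch below shows what such a check computes. It assumes 4 KiB pages and a vector width passed in as `vec_size`. `load_may_cross_page` is a hypothetical helper for illustration, not the glibc assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* True if an unaligned load of VEC_SIZE bytes starting at P could touch
   the next page.  A full-width unmasked load must not cross into a page
   that may be unmapped, so branchy small-size paths test this first.
   AVX-512 masked loads suppress faults on masked-off elements, which is
   what allows a masked L(less_vec) path to skip this test.  */
static bool
load_may_cross_page (const void *p, unsigned vec_size)
{
  assert (vec_size <= PAGE_SIZE);
  uintptr_t offset = (uintptr_t) p & (PAGE_SIZE - 1);
  /* Bytes offset .. offset + vec_size - 1 stay in this page iff
     offset + vec_size <= PAGE_SIZE.  */
  return offset > PAGE_SIZE - vec_size;
}
```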
Sunil K Pandey	f20f980c71	x86-64: Add vector acos/acosf implementation to libmvec Implement vectorized acos/acosf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector acos/acosf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-22 13:03:14 -08:00
H.J. Lu	de8a0897e3	Regenerate ulps on x86_64 with GCC 12 Fix FAIL: math/test-float-clog10 FAIL: math/test-float32-clog10 on Intel Core i7-1165G7 with GCC 12.	2021-12-20 15:25:00 -08:00
Siddhesh Poyarekar	5afe4c0d69	Cleanup encoding in comments Replace non-UTF-8 and non-ASCII characters in comments with their UTF-8 equivalents so that files don't end up with mixed encodings. With this, all files (except tests that actually test different encodings) have a single encoding. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>	2021-12-13 10:01:45 +05:30
Florian Weimer	627f5ede70	Remove TLS_TCB_ALIGN and TLS_INIT_TCB_ALIGN TLS_INIT_TCB_ALIGN is not actually used. TLS_TCB_ALIGN was likely introduced to support a configuration where the thread pointer does not have the same alignment as THREAD_SELF. Only ia64 seems to use that, but for the stack/pointer guard, not for storing tcbhead_t. Some ports use TLS_TCB_OFFSET and TLS_PRE_TCB_SIZE to shift the thread pointer, potentially landing in a different residue class modulo the alignment, but the changes should not impact that. In general, given that TLS variables have their own alignment requirements, having different alignment for the (unshifted) thread pointer and struct pthread would potentially result in dynamic offsets, leading to more complexity. hppa had different values before: __alignof__ (tcbhead_t), which seems to be 4, and __alignof__ (struct pthread), which was 8 (old default) and is now 32. However, it defines THREAD_SELF as: `/* Return the thread descriptor for the current thread.  */ # define THREAD_SELF ({ struct pthread *__self; __self = __get_cr27(); __self - 1; })`. So the thread pointer points after struct pthread (hence __self - 1), and they have to have the same alignment on hppa as well. Similarly, on ia64, the definitions were different. We have: `# define TLS_PRE_TCB_SIZE (sizeof (struct pthread) + (PTHREAD_STRUCT_END_PADDING < 2 * sizeof (uintptr_t) ? ((2 * sizeof (uintptr_t) + __alignof__ (struct pthread) - 1) & ~(__alignof__ (struct pthread) - 1)) : 0))` and `# define THREAD_SELF ((struct pthread *) ((char *) __thread_self - TLS_PRE_TCB_SIZE))`. And TLS_PRE_TCB_SIZE is a multiple of the struct pthread alignment (confirmed by the new _Static_assert in sysdeps/ia64/libc-tls.c). On m68k, we have a larger gap between tcbhead_t and struct pthread. But as far as I can tell, the port is fine with that. The definition of TCB_OFFSET is sufficient to handle the shifted TCB scenario. 
This fixes commit `23c77f6018` ("nptl: Increase default TCB alignment to 32"). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-12-09 23:47:49 +01:00
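The ia64 TLS_PRE_TCB_SIZE definition in the commit above relies on the `(x + align - 1) & ~(align - 1)` round-up idiom. That idiom keeps the total a multiple of the struct pthread alignment. The sketch below checks this property at run time. `struct fake_pthread` is a hypothetical stand-in with the 32-byte alignment mentioned in the commit, and `pre_tcb_size` only mirrors the macro's shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round X up to the next multiple of ALIGN (a power of two).  */
static size_t
round_up_pow2 (size_t x, size_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (x + align - 1) & ~(align - 1);
}

/* Hypothetical stand-in for struct pthread with 32-byte alignment.
   sizeof is automatically a multiple of the alignment.  */
struct fake_pthread
{
  _Alignas (32) unsigned char data[200];
};

/* Same shape as the ia64 macro: descriptor size plus an optional
   padding term that is itself rounded to the alignment.  */
static size_t
pre_tcb_size (size_t end_padding)
{
  size_t pad = end_padding < 2 * sizeof (uintptr_t)
                   ? round_up_pow2 (2 * sizeof (uintptr_t),
                                    _Alignof (struct fake_pthread))
                   : 0;
  return sizeof (struct fake_pthread) + pad;
}
```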
Florian Weimer	8d1927d8dc	nptl: Introduce THREAD_GETMEM_VOLATILE This will be needed for rseq TCB access. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-12-09 09:49:32 +01:00
Florian Weimer	ce2248ab91	nptl: Introduce <tcb-access.h> for THREAD_* accessors These are common between most architectures. Only the x86 targets are outliers. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-12-09 09:49:32 +01:00
Noah Goldstein	4df1fa6ddc	x86-64: Use notl in EVEX strcmp [BZ #28646] We must use notl %edi here, as the lower bits are for CHAR comparisons that are potentially out of range and thus can be 0 without indicating a mismatch. This fixes BZ #28646. Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>	2021-12-03 21:14:11 -08:00
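The reasoning in the commit above can be sketched in C. It assumes a compare mask in which bit i is set when CHAR i matched and is non-null, and in which the low `skip` bits belong to out-of-range CHARs. Complementing the mask (like `notl`) and then discarding the out-of-range lanes finds real mismatches. Adding 1 to the raw mask would wrongly flag the zero low bits. `first_mismatch` is a hypothetical helper and `__builtin_ctz` is a GCC/Clang builtin. This is not the strcmp-evex.S code.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the index (relative to the first in-range lane) of the first
   mismatching or null CHAR, or -1 if all in-range lanes matched.  */
static int
first_mismatch (uint32_t eq_mask, unsigned skip, unsigned lanes)
{
  assert (skip < lanes && lanes <= 32);
  uint32_t bad = ~eq_mask; /* like notl: 1 = mismatch or null */
  bad >>= skip;            /* drop out-of-range low lanes */
  if (lanes - skip < 32)   /* drop lanes past the vector width */
    bad &= (UINT32_C (1) << (lanes - skip)) - 1;
  if (bad == 0)
    return -1;
  return __builtin_ctz (bad);
}
```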
Sunil K Pandey	c58d3b7d00	x86-64: Add vector sin/sinf to libmvec microbenchmark Add vector sin/sinf and input files to libmvec microbenchmark. libmvec-sin-inputs: 90% normal random distribution (range: (-DBL_MAX, DBL_MAX), mean: 0.0, sigma: 5.0); 10% uniform random distribution in range (-1000.0, 1000.0). libmvec-sinf-inputs: 90% normal random distribution (range: (-FLT_MAX, FLT_MAX), mean: 0.0f, sigma: 5.0f); 10% uniform random distribution in range (-1000.0f, 1000.0f). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-11-24 07:50:23 -08:00
Sunil K Pandey	6a556bac81	x86-64: Add vector pow/powf to libmvec microbenchmark Add vector pow/powf and input files to libmvec microbenchmark. libmvec-pow-inputs: arg1: 90% normal random distribution (range: (0.0, 256.0), mean: 0.0, sigma: 32.0); 10% uniform random distribution in range (0.0, 256.0). arg2: 90% normal random distribution (range: (-127.0, 127.0), mean: 0.0, sigma: 16.0); 10% uniform random distribution in range (-127.0, 127.0). libmvec-powf-inputs: arg1: 90% normal random distribution (range: (0.0f, 100.0f), mean: 0.0f, sigma: 16.0f); 10% uniform random distribution in range (0.0f, 100.0f). arg2: 90% normal random distribution (range: (-10.0f, 10.0f), mean: 0.0f, sigma: 8.0f); 10% uniform random distribution in range (-10.0f, 10.0f). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-11-24 07:49:14 -08:00
Sunil K Pandey	8ab8afb336	x86-64: Add vector log/logf to libmvec microbenchmark Add vector log/logf and input files to libmvec microbenchmark. libmvec-log-inputs: 70% normal random distribution (range: (0.0, DBL_MAX), mean: 1.0, sigma: 50.0); 30% uniform random distribution in range (0.0, DBL_MAX). libmvec-logf-inputs: 70% normal random distribution (range: (0.0f, FLT_MAX), mean: 1.0f, sigma: 50.0f); 30% uniform random distribution in range (0.0f, FLT_MAX). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-11-24 07:48:14 -08:00
Sunil K Pandey	37df38bd5f	x86-64: Add vector exp/expf to libmvec microbenchmark Add vector exp/expf and input files to libmvec microbenchmark. libmvec-exp-inputs: 90% normal random distribution (range: (-708.0, 709.0), mean: 0.0, sigma: 16.0); 10% uniform random distribution in range (-500.0, 500.0). libmvec-expf-inputs: 90% normal random distribution (range: (-87.0f, 88.0f), mean: 0.0f, sigma: 8.0f); 10% uniform random distribution in range (-50.0f, 50.0f). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-11-24 07:46:59 -08:00
Sunil K Pandey	4443695598	x86-64: Add vector cos/cosf to libmvec microbenchmark Add vector cos/cosf and input files to libmvec microbenchmark. libmvec-cos-inputs: 90% normal random distribution (range: (-DBL_MAX, DBL_MAX), mean: 0.0, sigma: 5.0); 10% uniform random distribution in range (-1000.0, 1000.0). libmvec-cosf-inputs: 90% normal random distribution (range: (-FLT_MAX, FLT_MAX), mean: 0.0f, sigma: 5.0f); 10% uniform random distribution in range (-1000.0f, 1000.0f). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-11-24 07:45:20 -08:00
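The benchmark inputs described above mix a range-limited normal distribution with a uniform one. The sketch below generates such a mix. It is not the Python generator from the libmvec infrastructure commit: `gen_inputs` and its helpers are hypothetical, and a few simplifications are assumed. The normal samples come from the Irwin-Hall approximation (sum of 12 uniforms minus 6) so that no libm is needed. Out-of-range normal samples are redrawn. The two populations are written in order rather than interleaved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t rng_state = 0x9E3779B97F4A7C15u; /* xorshift64 state, nonzero */

static double
uniform01 (void)
{
  rng_state ^= rng_state << 13;
  rng_state ^= rng_state >> 7;
  rng_state ^= rng_state << 17;
  return (double) (rng_state >> 11) * 0x1.0p-53; /* in [0, 1) */
}

/* Approximate normal sample via Irwin-Hall; bounded to mean +/- 6 sigma.  */
static double
approx_normal (double mean, double sigma)
{
  double s = 0.0;
  for (int i = 0; i < 12; i++)
    s += uniform01 ();
  return mean + sigma * (s - 6.0);
}

/* First NORMAL_PCT% of OUT: normal samples restricted to (LO, HI);
   the rest: uniform samples in [ULO, UHI).  */
static void
gen_inputs (double *out, size_t n, unsigned normal_pct, double lo, double hi,
            double mean, double sigma, double ulo, double uhi)
{
  assert (normal_pct <= 100 && lo < hi && ulo < uhi);
  size_t n_normal = n * normal_pct / 100;
  for (size_t i = 0; i < n; i++)
    {
      if (i < n_normal)
        {
          double x;
          do
            x = approx_normal (mean, sigma);
          while (!(x > lo && x < hi)); /* redraw out-of-range samples */
          out[i] = x;
        }
      else
        out[i] = ulo + (uhi - ulo) * uniform01 ();
    }
}
```

For example, `gen_inputs (buf, 1000, 90, -708.0, 709.0, 0.0, 16.0, -500.0, 500.0)` approximates the libmvec-exp-inputs description.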
Sunil K Pandey	a43c0b5483	x86-64: Create microbenchmark infrastructure for libmvec Add a Python script to generate libmvec microbenchmarks from the input values for each libmvec function using a skeleton benchmark template. It creates double and float benchmarks with vector lengths 1, 2, 4, 8, and 16 for each libmvec function. Vector length 1 corresponds to the scalar version of the function and is included for performance comparison with the vector versions. Co-authored-by: Haochen Jiang <haochen.jiang@intel.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-11-16 11:37:39 -08:00
Noah Goldstein	2f9062d717	x86: Shrink memcmp-sse4.S code size No bug. This implementation refactors memcmp-sse4.S primarily with minimizing code size in mind. It does this by removing the lookup table logic and removing the unrolled check for (256, 512] bytes. memcmp-sse4 code size reduction: -3487 bytes; wmemcmp-sse4 code size reduction: -1472 bytes. The current memcmp-sse4.S implementation has a large code size cost. This has serious adverse effects on the ICache / ITLB. While the implementation appears fast in micro-benchmarks, traces of real-world code have shown that the micro-benchmark speed does not translate when the ICache/ITLB are not primed, and that the cost of the code size has measurable negative effects on overall application performance. See https://research.google/pubs/pub48320/ for more details. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-11-10 20:12:10 -06:00
Noah Goldstein	a6b7502ec0	x86: Optimize memmove-vec-unaligned-erms.S No bug. The optimizations are as follows: 1) Always align entry to 64 bytes. This makes behavior more predictable and makes other frontend optimizations easier. 2) Make the L(more_8x_vec) cases 4k aliasing aware. This can have significant benefits in the case that: 0 < (dst - src) < [256, 512] 3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%] improvement and for FSRM [-10%, 25%]. In addition to these primary changes there is general cleanup throughout to optimize the aligning routines and control flow logic. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-11-06 16:18:03 -05:00
Fangrui Song	6720d36b66	x86-64: Replace movzx with movzbl Clang cannot assemble movzx in the AT&T dialect mode. ../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction movzx (%rsi), %ecx ^~~~ Change movzx to movzbl, which follows the AT&T dialect and is used elsewhere in the file. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-11-02 20:59:52 -07:00
H.J. Lu	14dbbf46a0	x86-64: Remove Prefer_AVX2_STRCMP Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1 VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1 VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up to 40% on Tiger Lake and Ice Lake.	2021-11-01 07:53:04 -07:00
H.J. Lu	c46e9afb2d	x86-64: Improve EVEX strcmp with masked load In strcmp-evex.S, to compare 2 32-byte strings, replace `VMOVU (%rdi, %rdx), %YMM0 VMOVU (%rsi, %rdx), %YMM1 /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */ VPCMP $4, %YMM0, %YMM1, %k0 VPCMP $0, %YMMZERO, %YMM0, %k1 VPCMP $0, %YMMZERO, %YMM1, %k2 /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */ kord %k1, %k2, %k1 /* Each bit in K1 represents a NULL or a mismatch.  */ kord %k0, %k1, %k1 kmovd %k1, %ecx testl %ecx, %ecx jne L(last_vector)` with `VMOVU (%rdi, %rdx), %YMM0 VPTESTM %YMM0, %YMM0, %k2 /* Each bit cleared in K1 represents a mismatch or a null CHAR in YMM0 and 32 bytes at (%rsi, %rdx).  */ VPCMP $0, (%rsi, %rdx), %YMM0, %k1{%k2} kmovd %k1, %ecx incl %ecx jne L(last_vector)`. It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake and Ice Lake. Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>	2021-11-01 07:52:56 -07:00
Noah Goldstein	1d56fd3bae	x86_64: Add memcmpeq.S to fix disable-multi-arch build The following commit: commit `cf4fd28ea4` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Tue Oct 26 19:43:18 2021 -0500 Broke --disable-multi-arch build for x86_64 because x86_64/memcmpeq.S was not defined outside of multiarch and the alias for __memcmpeq in x86_64/memcmp.S was removed. This commit fixes that issue by adding x86_64/memcmpeq.S. make xcheck passes on x86_64 with and without --disable-multi-arch	2021-10-28 16:35:50 -05:00
Noah Goldstein	9b7cfab180	x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S No bug. This commit adds a new optimized __memcmpeq implementation for evex. The primary optimizations are: 1) skipping the logic to find the difference of the first mismatched byte. 2) not updating src/dst addresses, as the non-equals logic does not need to be reused by different areas.	2021-10-27 13:03:46 -05:00
Noah Goldstein	b4ed69ba16	x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S No bug. This commit adds a new optimized __memcmpeq implementation for avx2. The primary optimizations are: 1) skipping the logic to find the difference of the first mismatched byte. 2) not updating src/dst addresses, as the non-equals logic does not need to be reused by different areas.	2021-10-27 13:03:46 -05:00
Noah Goldstein	fa7f63d8d6	x86_64: Add sse2 optimized __memcmpeq in memcmp-sse2.S No bug. This commit does not modify any of the memcmp implementation. It just adds __memcmpeq ifdefs to skip obvious cases where computing the proper 1/-1 required by memcmp is not needed.	2021-10-27 13:03:46 -05:00
Noah Goldstein	cf4fd28ea4	x86_64: Add support for __memcmpeq using sse2, avx2, and evex No bug. This commit adds support for __memcmpeq to be implemented separately from memcmp. Support is added for versions optimized with sse2, avx2, and evex.	2021-10-27 13:03:46 -05:00
Noah Goldstein	9894127d20	String: Add hidden defs for __memcmpeq() to enable internal usage No bug. This commit adds hidden defs for all declarations of __memcmpeq. This enables internal use of __memcmpeq within glibc without going through the PLT.	2021-10-26 16:51:29 -05:00
Noah Goldstein	44829b3ddb	String: Add support for __memcmpeq() ABI on all targets No bug. This commit adds support for __memcmpeq() as a new ABI for all targets. In this commit __memcmpeq() is implemented only as an alias to the corresponding target's memcmp() implementation. __memcmpeq() is added as a new symbol starting with GLIBC_2.35 and declared in string.h with comments explaining its behavior. Basic tests that it is callable and works were added in string/tester.c. As discussed in the proposal "Add new ABI '__memcmpeq()' to libc", __memcmpeq() is essentially a reserved namespace for bcmp(). This means it shares the same specification as memcmp(), except that the return value for non-equal byte sequences is any non-zero value. This is less strict than memcmp()'s return value specification and can be better optimized when a boolean return is all that is needed. __memcmpeq() is meant to be called only by compilers, and only if they can prove that the return value of a memcmp() call is used solely for its boolean value. All tests in string/tester.c passed. As well, the build succeeds on the x86_64-linux-gnu target.	2021-10-26 16:51:29 -05:00
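The __memcmpeq contract described above means the sign and magnitude of a non-zero result carry no meaning. An implementation therefore does not need to locate the first differing byte. The portable sketch below shows this. `memcmpeq_sketch` is a hypothetical function, not glibc's __memcmpeq. It ORs together the XOR of every byte pair instead of computing `s1[i] - s2[i]` at the first difference, as memcmp must.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 0 iff the N bytes at S1 and S2 are equal; otherwise some
   unspecified non-zero value (the only guarantee __memcmpeq gives).  */
static int
memcmpeq_sketch (const void *s1, const void *s2, size_t n)
{
  assert (n == 0 || (s1 != NULL && s2 != NULL));
  const unsigned char *a = s1, *b = s2;
  unsigned char diff = 0;
  for (size_t i = 0; i < n; i++)
    diff |= (unsigned char) (a[i] ^ b[i]); /* no early exit, no subtraction */
  return diff;
}
```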
Noah Goldstein	bad852b61b	x86: Replace sse2 instructions with avx in memcmp-evex-movbe.S This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'. It could potentially be dangerous to use SSE2 if this function is ever called without using 'vzeroupper' beforehand. While compilers appear to use 'vzeroupper' before function calls if AVX2 has been used, using SSE2 here is more brittle. Since it is not absolutely necessary, it should be avoided. It costs 2 extra bytes, but the extra bytes should only eat into alignment padding. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-10-23 13:02:42 -05:00
Sunil K Pandey	4f690aad9e	x86_64: Add missing libmvec ABI tests Add vector ABI tests for cos, exp, log, pow and sin functions. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-10-22 06:46:49 -07:00
H.J. Lu	d962cce139	x86-64: Add sysdeps/x86_64/fpu/Makeconfig 1. Add sysdeps/x86_64/fpu/Makeconfig to auto-generate libmvec.mk, which contains libmvec ABI test dependencies and CFLAGS, in the build directory. 2. Include libmvec.mk for libmvec ABI test dependencies and CFLAGS. Tested on SSE4, AVX, AVX2 and AVX512 machines. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>	2021-10-20 11:53:45 -07:00
H.J. Lu	9d3c9a046a	x86-64: Add test-vector-abi.h/test-vector-abi-sincos.h Add templates for vector ABI test and use them for vector sincos/sincosf ABI tests.	2021-10-14 11:59:12 -07:00
Adhemerval Zanella	d6d89608ac	elf: Fix dynamic-link.h usage on rtld.c The `4af6982e4c` fix does not fully handle RTLD_BOOTSTRAP usage in rtld.c due to two issues: 1. RTLD_BOOTSTRAP is also used in dl-machine.h on various architectures, and it changes the semantics of various machine relocation functions. 2. The elf_get_dynamic_info() change was done sideways: prior to `490e6c62aa`, get-dynamic-info.h was included by the first dynamic-link.h include without RTLD_BOOTSTRAP being defined. This means that the code within elf_get_dynamic_info() that uses RTLD_BOOTSTRAP is in fact unused. To fix 1., this patch now includes dynamic-link.h only once, with RTLD_BOOTSTRAP defined. The ELF_DYNAMIC_RELOCATE call will now have the relocation functions with the expected semantics for the loader. And to fix 2., part of `4af6982e4c` is reverted (the check argument of elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP pieces are removed. To reorganize the includes, the static TLS definition is moved to its own header to avoid a circular dependency (it is defined in dynamic-link.h, and dl-machine.h requires it, while other dynamic-link.h definitions require dl-machine.h definitions). Also, ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL are moved to their own header. Only ancient ABIs need special values (arm, i386, and mips), so a generic one is used as the default. The powerpc Elf64_FuncDesc is also moved to its own header, since csu code requires its definition (which would otherwise require either including the elf/ folder or adding a full path with elf/). Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32, and powerpc64le. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>	2021-10-14 14:52:07 -03:00
Noah Goldstein	e59ced2384	x86: Optimize memset-vec-unaligned-erms.S No bug. Optimizations are: 1. Change control flow for L(more_2x_vec) to fall through to the loop and jump for L(less_4x_vec) and L(less_8x_vec). This uses less code size and saves jumps for length > 4x VEC_SIZE. 2. For EVEX/AVX512, move L(less_vec) closer to entry. 3. Avoid complex address modes for length > 2x VEC_SIZE. 4. Slightly better aligning code for the loop from the perspective of code size and uops. 5. Align targets so they make full use of their fetch block and, if possible, cache line. 6. Try to reduce the total number of icache lines that will need to be pulled in for a given length. 7. Include a "local" version of the stosb target. For AVX2/EVEX/AVX512, jumping to the stosb target in the sse2 code section will almost certainly be to a new page. The new version does increase code size marginally by duplicating the target but should get better iTLB behavior as a result. test-memset, test-wmemset, and test-bzero are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-10-12 13:38:02 -05:00
Noah Goldstein	1bd8b8d58f	x86: Optimize memcmp-evex-movbe.S for frontend behavior and size No bug. The frontend optimizations are to: 1. Reorganize logically connected basic blocks so they are either in the same cache line or adjacent cache lines. 2. Avoid cases when basic blocks unnecessarily cross cache lines. 3. Try to 32-byte align any basic blocks possible without sacrificing code size. Smaller / less hot basic blocks are used for this. Overall code size shrank by 168 bytes. This should make up for any extra costs due to aligning to 64 bytes. In general, performance before deviated a great deal depending on whether entry alignment % 64 was 0, 16, 32, or 48. These changes essentially make it so that the current implementation is at least equal to the best alignment of the original for any arguments. The only additional optimization is in the page cross case. The branch on equals case was removed from the size == [4, 7] case. As well, the [4, 7] and [2, 3] cases were swapped, as [4, 7] is likely a hotter argument size. test-memcmp and test-wmemcmp are both passing.	2021-10-12 12:02:12 -05:00
Adhemerval Zanella	4af6982e4c	elf: Fix elf_get_dynamic_info definition Before `490e6c62aa` ('elf: Avoid nested functions in the loader [BZ #27220]'), elf_get_dynamic_info() was defined twice in rtld.c: in the first dynamic-link.h include and later within _dl_start(). The former definition did not define DONT_USE_BOOTSTRAP_MAP and was used in setup_vdso() (since it is a global definition), while the latter does define DONT_USE_BOOTSTRAP_MAP and was used for loader self-relocation. With that commit, the function is now included and defined once instead of being defined as a nested function. So rtld.c defines it without defining RTLD_BOOTSTRAP, which breaks at least powerpc32. This patch fixes it by moving the get-dynamic-info.h include out of dynamic-link.h, so that the caller can correctly set the expected semantics by defining STATIC_PIE_BOOTSTRAP, RTLD_BOOTSTRAP, and/or RESOLVE_MAP. It also required enabling some asserts only for the loader bootstrap, to avoid issues when called from setup_vdso(). As a side note, this is another issue with nested functions: it is not clear from the pre-processed output (-E -dD) how the function will be built and what its semantics are (since a nested function will be local, and extra C defines may change it). I checked on x86_64-linux-gnu (w/o --enable-static-pie), i686-linux-gnu, powerpc64-linux-gnu, powerpc-linux-gnu-power4, aarch64-linux-gnu, arm-linux-gnu, sparc64-linux-gnu, and s390x-linux-gnu. Reviewed-by: Fangrui Song <maskray@google.com>	2021-10-12 13:25:43 -03:00
Fangrui Song	bf433b849a	elf: Remove Intel MPX support (lazy PLT, ld.so profile, and LD_AUDIT) Intel MPX failed to gain wide adoption and has been deprecated for a while. GCC 9.1 removed Intel MPX support. Linux kernel removed MPX in 2019. This patch removes the support code from the dynamic loader. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>	2021-10-11 11:14:02 -07:00
Fangrui Song	490e6c62aa	elf: Avoid nested functions in the loader [BZ #27220] dynamic-link.h is included more than once in some elf/ files (rtld.c, dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested functions. This harms readability, and the nested-function usage is the biggest obstacle preventing a Clang build (Clang doesn't support GCC nested functions). The key idea for unnesting is to add extra parameters (struct link_map * and struct r_scope_elem *[]) to RESOLVE_MAP, ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a], elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired by Stan Shebs' ppc64/x86-64 implementation in the google/grte/v5-2.27/master branch, which uses mixed extra parameters and static variables.) Future simplification: * If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM, elf_machine_runtime_setup can drop the `scope` parameter. * If TLSDESC no longer needs to be in elf_machine_lazy_rel, elf_machine_lazy_rel can drop the `scope` parameter. Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32, sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64. In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu and riscv64-linux-gnu-rv64imafdc-lp64d. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2021-10-07 11:55:02 -07:00
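The unnesting idea described above passes what a nested function used to capture as explicit extra parameters. It can be shown in miniature. `struct map`, `resolve` and `relocate_all` are hypothetical and only loosely stand in for struct link_map, RESOLVE_MAP and ELF_DYNAMIC_RELOCATE. The GCC-only "before" form is kept in a comment because Clang rejects it.

```c
#include <assert.h>

/* Before (GCC nested function; captures M and BASE implicitly):

     int relocate_all (struct map *m, int base, int *out, int n)
     {
       int resolve (int idx) { return base + m->offsets[idx]; }
       ...
     }

   After: the captured variables become parameters of a file-scope
   function, which any C compiler accepts.  */

struct map /* hypothetical stand-in for struct link_map */
{
  int offsets[4];
};

static int
resolve (const struct map *m, int base, int idx)
{
  return base + m->offsets[idx];
}

static int
relocate_all (const struct map *m, int base, int *out, int n)
{
  assert (n >= 0 && n <= 4);
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      out[i] = resolve (m, base, i); /* former captures passed explicitly */
      sum += out[i];
    }
  return sum;
}
```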


1426 Commits