glibc

mirror of https://sourceware.org/git/glibc.git synced 2025-01-15 21:40:05 +00:00

Author	SHA1	Message	Date
Noah Goldstein	64479f11b7	x86: Move wcschr SSE2 implementation to multiarch/wcschr-sse2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	72a48ec0f7	x86: Move strcat SSE2 implementation to multiarch/strcat-sse2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	cd080d0741	x86: Move strchr SSE2 implementation to multiarch/strchr-sse2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	425647458b	x86: Move strrchr SSE2 implementation to multiarch/strrchr-sse2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	08af081ffd	x86: Move memrchr SSE2 implementation to multiarch/memrchr-sse2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	6b9006bfb0	x86: Move strcpy SSE2 implementation to multiarch/strcpy-sse2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	58e6cd4bcb	x86: Move strlen SSE2 implementation to multiarch/strlen-sse2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	60a583ec60	x86: Move strcmp SSE42 implementation to multiarch/strcmp-sse4_2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	427eaa2c85	x86: Move wcscmp SSE2 implementation to multiarch/wcscmp-sse2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	d561fbb041	x86: Move strcmp SSE2 implementation to multiarch/strcmp-sse2.S This commit doesn't affect libc.so.6, it's just housekeeping to prepare for adding explicit ISA level support. Because strcmp-sse2.S implements so many functions (more from avx2/evex/sse42) add a new file 'strcmp-naming.h' to assist in getting the correct symbol name for all the functions across multiarch/non-multiarch builds. Tested build on x86_64 and x86_32 with/without multiarch.	2022-07-13 14:55:31 -07:00
Noah Goldstein	30e57e0a21	x86: Rename STRCASECMP_NONASCII macro to STRCASECMP_L_NONASCII The previous macro name can be confusing given that both `__strcasecmp_l_nonascii` and `__strcasecmp_nonascii` are functions and we use the `_l` version.	2022-07-13 14:55:31 -07:00
Noah Goldstein	f2698954ff	x86: Remove __mmask intrinsics in strstr-avx512.c The intrinsics are not available before GCC7 and using standard operators generates code of equivalent or better quality. Removed: _cvtmask64_u64 _kshiftri_mask64 _kand_mask64 Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958	2022-07-12 15:41:14 -07:00
Noah Goldstein	9c38deec96	x86: Remove generic strncat, strncpy, and stpncpy implementations These functions all have optimized versions: __strncat_sse2_unaligned, __strncpy_sse2_unaligned, and stpncpy_sse2_unaligned which are faster than their respective generic implementations. Since the sse2 versions can run on baseline x86_64, we should use these as the baseline implementation and can remove the generic implementations. Geometric mean of N=20 runs of the entire benchmark suite on: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake) __strncat_sse2_unaligned / __strncat_generic: .944 __strncpy_sse2_unaligned / __strncpy_generic: .726 __stpncpy_sse2_unaligned / __stpncpy_generic: .650 Tested build with and without multiarch and full check with multiarch.	2022-07-12 11:44:12 -07:00
H.J. Lu	ec9013727d	x86-64: Remove redundant strcspn-generic/strpbrk-generic/strspn-generic Remove redundant strcspn-generic, strpbrk-generic and strspn-generic from sysdep_routines in sysdeps/x86_64/multiarch/Makefile added by commit `c69f960b01` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Sun Jul 3 21:28:07 2022 -0700 x86: Add support for building str{c|p}{brk|spn} with explicit ISA level since they have been added to sysdep_routines in sysdeps/x86_64/Makefile.	2022-07-08 16:06:04 -07:00
H.J. Lu	eedf7886ed	x86-64: Don't mark symbols as hidden in strcmp-XXX.S Don't mark symbols as hidden in strcmp-avx2.S, strcmp-evex.S and strcmp-sse42.S since they are marked as hidden in the IFUNC selectors.	2022-07-07 16:38:11 -07:00
Noah Goldstein	ae308947ff	x86: Add support for building {w}memcmp{eq} with explicit ISA level 1. Refactor files so that all implementations are in the multiarch directory - Moved the implementation portion of memcmp sse2 from memcmp.S to multiarch/memcmp-sse2.S - The non-multiarch file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds. Otherwise we go through the ifunc selector). 2. Add ISA level build guards to different implementations. - I.e. memcmp-avx2-movsb.S which is ISA level 3 will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memcmp-evex-movbe.S). 3. Add new multiarch/rtld-{w}memcmp{eq}.S that just include the non-multiarch {w}memcmp{eq}.S which will in turn select the best implementation based on the compiled ISA level. 4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.	2022-07-05 16:42:42 -07:00
Noah Goldstein	37ecc657b2	x86: Add support for building {w}memset{_chk} with explicit ISA level 1. Refactor files so that all implementations are in the multiarch directory - Moved the implementation portion of memset sse2 from memset.S to multiarch/memset-sse2.S - The non-multiarch file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds. Otherwise we go through the ifunc selector). 2. Add ISA level build guards to different implementations. - I.e. memset-avx2-unaligned-erms.S which is ISA level 3 will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memset-evex-unaligned-erms.S). 3. Add new multiarch/rtld-memset.S that just includes the non-multiarch memset.S which will in turn select the best implementation based on the compiled ISA level. 4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.	2022-07-05 16:42:42 -07:00
Noah Goldstein	b6a02c3606	x86: Add support for building {w}memmove{_chk} with explicit ISA level 1. Refactor files so that all implementations are in the multiarch directory - Moved the implementation portion of memmove sse2 from memmove.S to multiarch/memmove-sse2.S - The non-multiarch file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds. Otherwise we go through the ifunc selector). 2. Add ISA level build guards to different implementations. - I.e. memmove-avx2-unaligned-erms.S which is ISA level 3 will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memmove-evex-unaligned-erms.S). 3. Add new multiarch/rtld-memmove.S that just includes the non-multiarch memmove.S which will in turn select the best implementation based on the compiled ISA level. 4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch. isa raising memmove	2022-07-05 16:42:42 -07:00
Noah Goldstein	c69f960b01	x86: Add support for building str{c|p}{brk|spn} with explicit ISA level The changes for these functions are different than the others because the best implementation (sse4_2) requires the generic implementation as a fallback to be built as well. Changes are: 1. Add non-multiarch functions for str{c|p}{brk|spn}.c to statically select the best implementation based on the configured ISA build level. 2. Add stubs for str{c|p}{brk|spn}-generic and varshift.c in the sysdeps/x86_64 directory so that the sse4 implementation will have all of its dependencies for the non-multiarch / rtld build when ISA level >= 2. 3. Add new multiarch/rtld-strcspn.c that just includes the non-multiarch strcspn.c which will in turn select the best implementation based on the compiled ISA level. 4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.	2022-07-05 16:42:42 -07:00
Noah Goldstein	baeae86fb8	x86: Add comment explaining no Slow_SSE4_2 check in ifunc-sse4_2 Just for clarity's sake and so that if a future implementation is added we remember to add the check.	2022-07-05 16:42:42 -07:00
Noah Goldstein	96ac447d91	x86: Add missing IS_IN (libc) check to strncmp-sse4_2.S Was missing, so for the multiarch build rtld-strncmp-sse4_2.os was being built and exporting symbols: build/glibc/string/rtld-strncmp-sse4_2.os: 0000000000000000 T __strncmp_sse42 Introduced in: commit `11ffcacb64` Author: H.J. Lu <hjl.tools@gmail.com> Date: Wed Jun 21 12:10:50 2017 -0700 x86-64: Implement strcmp family IFUNC selectors in C	2022-06-29 19:47:52 -07:00
Noah Goldstein	0aa294fb88	x86: Add missing IS_IN (libc) check to strcspn-sse4.c Was missing, so for the multiarch build rtld-strcspn-sse4.os was being built and exporting symbols: build/glibc/string/rtld-strcspn-sse4.os: U ___m128i_shift_right U __strcspn_generic 0000000000000000 T __strcspn_sse42 U strlen build/glibc/string/rtld-varshift.os: 0000000000000000 R ___m128i_shift_right Introduced in: commit `06e51c8f3d` Author: H.J. Lu <hongjiu.lu@intel.com> Date: Fri Jul 3 02:48:56 2009 -0700 Add SSE4.2 support for strcspn, strpbrk, and strspn on x86-64.	2022-06-29 19:47:52 -07:00
Noah Goldstein	8cfbbbcdf9	x86: Add missing IS_IN (libc) check to memmove-ssse3.S Was missing, so for the multiarch build rtld-memmove-ssse3.os was being built and exporting symbols: >$ nm string/rtld-memmove-ssse3.os U __GI___chk_fail 0000000000000020 T __memcpy_chk_ssse3 0000000000000040 T __memcpy_ssse3 0000000000000020 T __memmove_chk_ssse3 0000000000000040 T __memmove_ssse3 0000000000000000 T __mempcpy_chk_ssse3 0000000000000010 T __mempcpy_ssse3 U __x86_shared_cache_size_half Introduced after 2.35 in: commit `26b2478322` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Thu Apr 14 11:47:40 2022 -0500 x86: Reduce code size of mem{move|pcpy|cpy}-ssse3	2022-06-29 19:47:52 -07:00
H.J. Lu	88070acdd0	x86-64: Properly indent X86_IFUNC_IMPL_ADD_VN arguments Properly indent X86_IFUNC_IMPL_ADD_VN arguments for memchr, rawmemchr and wmemchr. Co-authored-by: H.J. Lu <hjl.tools@gmail.com>	2022-06-29 19:47:52 -07:00
Noah Goldstein	58bcf7b71a	x86-64: Small improvements to dl-trampoline.S 1. Remove sse2 instructions when using the avx512 or avx version. 2. Fix up some format nits in how the address offsets were aligned. 3. Use more space efficient instructions in the conditional AVX restoral. - vpcmpeqq -> vpcmpeqb - cmp imm32, r; jz -> inc r; jz 4. Use `rep movsb` instead of `rep movsq`. The former is guaranteed to be fast with the ERMS flags, the latter is not. The latter also wastes an instruction in size setup.	2022-06-29 19:47:52 -07:00
Noah Goldstein	21925f6473	x86: Move mem{p}{mov|cpy}_{chk_}erms to its own file The primary memmove_{impl}_unaligned_erms implementations don't interact with this function. Putting them in same file both wastes space and unnecessarily bloats a hot code section.	2022-06-29 19:47:52 -07:00
Noah Goldstein	4a3f29e7e4	x86: Move and slightly improve memset_erms Implementation wise: 1. Remove the VZEROUPPER as memset_{impl}_unaligned_erms does not use the L(stosb) label that was previously defined. 2. Don't give the hotpath (fallthrough) to zero size. Code positioning wise: Move memset_{chk}_erms to its own file. Leaving it in between the memset_{impl}_unaligned both adds unnecessary complexity to the file and wastes space in a relatively hot cache section.	2022-06-29 19:47:52 -07:00
Noah Goldstein	2a1099020c	x86: Add definition for __wmemset_chk AVX2 RTM in ifunc impl list This was simply missing and meant we weren't testing it properly.	2022-06-29 19:47:52 -07:00
H.J. Lu	cfdc4df66c	x86-64: Only define used SSE/AVX/AVX512 run-time resolvers When glibc is built with x86-64 ISA level v3, SSE run-time resolvers aren't used. For x86-64 ISA level v4 build, both SSE and AVX resolvers are unused. Check the minimum x86-64 ISA level to exclude the unused run-time resolvers.	2022-06-27 14:17:52 -07:00
Noah Goldstein	4fc321dc58	x86: Fix backwards Prefer_No_VZEROUPPER check in ifunc-evex.h Add third argument to X86_ISA_CPU_FEATURES_ARCH_P macro so the runtime CPU_FEATURES_ARCH_P check can be inverted if the MINIMUM_X86_ISA_LEVEL is not high enough to constantly evaluate the check. Use this new macro to correct the backwards check in ifunc-evex.h	2022-06-27 08:35:51 -07:00
Noah Goldstein	d912127bde	x86: Rename strstr_sse2 to strstr_generic as it uses string/strstr.c This is in accordance with other files in the multiarch directory.	2022-06-27 08:35:51 -07:00
Noah Goldstein	d1e931125b	x86: Remove unused file wmemcmp-sse4 The memcmp-sse4 was removed in: commit `7cbc03d030` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Fri Apr 15 12:28:00 2022 -0500 x86: Remove memcmp-sse4.S so this file does nothing.	2022-06-27 08:35:51 -07:00
Noah Goldstein	afc6e4328f	x86: Put wcs{n}len-sse4.1 in the sse4.1 text section Previously was missing but the two implementations shouldn't get in the sse2 (generic) text section.	2022-06-27 08:35:51 -07:00
Noah Goldstein	227afaa672	x86: Align entry for memrchr to 64-bytes. The function was tuned around 64-byte entry alignment and performs better for all sizes with it. As well, different code paths were explicitly written to touch the minimum number of cache lines, i.e. sizes <= 32 touch only the entry cache line.	2022-06-27 08:35:51 -07:00
Noah Goldstein	bd42891bb3	x86: Remove faulty sanity tests for RTLD build with no multiarch The sanity tests were meant to ensure that the default implementation was only being built without multiarch with the exception of the multiarch/rtld-.S files. The code used IS_IN (rtld) to check if the build was for a multiarch/rtld-.S file, which is incorrect as IS_IN (rtld) is set for the non-multiarch build as well.	2022-06-23 11:14:08 -07:00
Noah Goldstein	3079f652d7	x86: Replace all sse instructions with vex equivalent in avx+ files Most of these don't really matter as there was no dirty upper state but we should generally avoid stray sse when it's not needed. The one case that really matters is in svml_d_tanh4_core_avx2.S: blendvps %xmm0, %xmm8, %xmm7 When there was a dirty upper state. Tested on x86_64-linux	2022-06-22 19:42:17 -07:00
Noah Goldstein	3edda6a0f0	x86: Add support for compiling {raw|w}memchr with high ISA level 1. Refactor files so that all implementations are in the multiarch directory. - Essentially moved sse2 {raw|w}memchr.S implementation to multiarch/{raw|w}memchr-sse2.S - The non-multiarch {raw|w}memchr.S file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds. Otherwise we go through the ifunc selector). 2. Add ISA level build guards to different implementations. - I.e. memchr-avx2.S which is ISA level 3 will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memchr-evex{-rtm}.S). 3. Add new multiarch/rtld-{raw}memchr.S that just include the non-multiarch {raw}memchr.S which will in turn select the best implementation based on the compiled ISA level. 4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped. - Guaranteed replacement essentially means that for any ISA level build there must be a function that the baseline of the ISA supports. So for {raw|w}memchr.S since there is no ISA level 2 function, the ISA level 2 build still includes the ISA level 1 (sse2) function. Once we reach the ISA level 3 build, however, {raw|w}memchr-avx2{-rtm}.S will always be sufficient so the ISA level 1 implementation ({raw|w}memchr-sse2.S) will not be built. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.	2022-06-22 19:41:35 -07:00
Noah Goldstein	703f434108	x86: Add defines / utilities for making ISA specific x86 builds 1. Factor out some of the ISA level defines in isa-level.c to standalone header isa-level.h 2. Add new headers with ISA level dependent macros for handling ifuncs. Note, this file does not change any code. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.	2022-06-22 19:41:35 -07:00
Noah Goldstein	c22eb807b0	x86: Rename generic functions with unique postfix for clarity No functions are changed. It just renames generic implementations from '{func}_sse2' to '{func}_generic'. This is just because the postfix "_sse2" was overloaded and was used for files that had hand-optimized sse2 assembly implementations and files that just redirected back to the generic implementation. Full xcheck passed on x86_64.	2022-06-16 20:17:45 -07:00
Fangrui Song	4ef05df5ef	x86-64: Handle fewer relocation types for RTLD_BOOTSTRAP The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only needs to handle RELATIVE, GLOB_DAT, and JUMP_SLOT. RELATIVE has been handled (by _ELF_DYNAMIC_DO_RELOC due to DT_RELACOUNT, or RELR), so the switch statement only needs to handle GLOB_DAT and JUMP_SLOT. We can drop these `#if[n]def RTLD_BOOTSTRAP` and add a large `# ifndef RTLD_BOOTSTRAP` instead.	2022-06-16 11:48:15 -07:00
Noah Goldstein	89a25c6f64	x86: Cleanup bounds checking in large memcpy case 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x). Previously was using `__x86_rep_movsb_threshold` and should have been using `__x86_shared_non_temporal_threshold`. 2. Avoid reloading __x86_shared_non_temporal_threshold before the L(large_memcpy_4x) bounds check. 3. Document the second bounds check for L(large_memcpy_4x) more clearly.	2022-06-15 14:25:55 -07:00
Fangrui Song	de38b2a343	elf: Remove ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA If an executable has copy relocations for extern protected data, that can only work if the library containing the definition is built with assumptions (a) the compiler emits GOT-generating relocations (b) the linker produces R__GLOB_DAT instead of R__RELATIVE. Otherwise the library uses its own definition directly and the executable accesses a stale copy. Note: the GOT relocations defeat the purpose of protected visibility as an optimization, but allow rtld to make the executable and library use the same copy when copy relocations are present, but it turns out this never worked perfectly. ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange semantics when both a.so and b.so define protected var and the executable copy relocates var: b.so accesses its own copy even with GLOB_DAT. The behavior change is from commit `62da1e3b00` (x86) and then copied to nios2 (`ae5eae7cfc`) and arc (`0e7d930c4c`). Without ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA, b.so accesses the copy relocated data like a.so. There is now a warning for copy relocation on protected symbol since commit `7374c02b68`. It's extremely unlikely anyone relies on the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA behavior, so let's remove it: this removes a check in the symbol lookup code.	2022-06-15 11:29:55 -07:00
Noah Goldstein	ff439c4717	x86: Add sse42 implementation to strcmp's ifunc This has been missing since the ifuncs were added. The performance of SSE4.2 is preferable to SSE2. Measured on Tigerlake with N = 20 runs. Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906	2022-06-14 20:58:09 -07:00
Wilco Dijkstra	fdaf78656f	Add bounds check to __libc_ifunc_impl_list Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC redundant and fixes several targets that will write outside the array. To avoid unnecessary large diffs, pass the maximum in the argument 'i' to IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing ones can be updated if desired. Passes buildmanyglibc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>	2022-06-10 17:13:29 +01:00
Noah Goldstein	cffb9414c5	x86: Optimize svml_s_tanhf4_core_sse4.S Optimizations are: 1. Reduce code size (-112 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Prefer registers which get short instruction encoding. 5. Reduce rodata size (-4k+ rodata is shared with avx2). Result is roughly a 15-16% speedup: Function, New Time, Old Time, New / Old _ZGVbN4v_tanhf, 3.158, 3.749, 0.842	2022-06-09 12:51:25 -07:00
Noah Goldstein	bcc41f66a4	x86: Optimize svml_s_tanhf8_core_avx2.S Optimizations are: 1. Reduce code size (-81 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Prefer registers which get short instruction encoding. 5. Reduce rodata size (-32 bytes). Result is roughly a 17-18% speedup: Function, New Time, Old Time, New / Old _ZGVdN8v_tanhf, 1.977, 2.402, 0.823	2022-06-09 12:51:22 -07:00
Noah Goldstein	3a49ce8799	x86: Add data file that can be shared by tanhf-avx2 and tanhf-sse4 tanhf-avx2 and tanhf-sse4 use the same data tables so we can save over 4kb using a shared datatable. This does increase the memory footprint of the sse4 version (as now all the targets are 32 bytes instead of 16), generally it seems worth the code size save. NB: This patch doesn't do anything itself, it is setup for future patches.	2022-06-09 12:51:15 -07:00
Noah Goldstein	e560b3c2d2	x86: Optimize svml_s_tanhf16_core_avx512.S Optimizations are: 1. Reduce code size (-67 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Reduce rodata usage (-448 bytes). Result is roughly a 14% speedup: Function, New Time, Old Time, New / Old _ZGVeN16v_tanhf, 0.649, 0.752, 0.863	2022-06-09 12:51:12 -07:00
Noah Goldstein	fe1915d4f6	x86: Improve svml_s_atanhf4_core_sse4.S Improvements are: 1. Reduce code size (-62 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Prefer registers which get short instruction encoding. 5. Reduce rodata usage (-16 bytes). The throughput improvement is not significant as the port 0 bottleneck is unavoidable. Function, New Time, Old Time, New / Old _ZGVbN4v_atanhf, 8.821, 8.903, 0.991	2022-06-09 12:51:09 -07:00
Noah Goldstein	65897e9916	x86: Improve svml_s_atanhf8_core_avx2.S Improvements are: 1. Reduce code size (-60 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Prefer registers which get short instruction encoding. 5. Shrink rodata usage (-32 bytes). The throughput improvement is not that significant (3-5%) as the port 0 bottleneck is unavoidable. Function, New Time, Old Time, New / Old _ZGVdN8v_atanhf, 2.799, 2.923, 0.958	2022-06-09 12:51:04 -07:00
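
Several entries in this log (for example 3edda6a0f0 and ae308947ff) describe ISA level build guards and a "guaranteed replacement" rule: an implementation below the compiled ISA level is skipped only when some better implementation is certain to run on the build baseline. The C sketch below models that rule as a plain function, as an illustration only. The `struct impl` type, the `memchr_impls` table, and `should_build` are hypothetical names invented for this example. glibc expresses the rule with preprocessor macros whose exact names and definitions are not given in the log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one implementation of a string function.  */
struct impl
{
  const char *name;   /* file name as mentioned in the commit message */
  int isa_level;      /* x86-64 ISA level the code requires (1..4) */
};

/* The memchr implementations named in commit 3edda6a0f0.  The table
   layout is an assumption made for this sketch, not glibc's own data.  */
static const struct impl memchr_impls[] = {
  { "memchr-sse2.S", 1 },
  { "memchr-avx2.S", 3 },
  { "memchr-evex.S", 4 },
};
#define N_MEMCHR_IMPLS (sizeof memchr_impls / sizeof memchr_impls[0])

/* Return 1 if impls[i] is still worth building when the library is
   compiled for BUILD_LEVEL, and 0 if it can be skipped.  */
static int
should_build (const struct impl *impls, size_t n, size_t i, int build_level)
{
  /* At or above the build level: it may still be the best choice on
     some CPU, so keep it.  */
  if (impls[i].isa_level >= build_level)
    return 1;

  /* Below the build level: skip it only if a higher-level
     implementation exists that every CPU at BUILD_LEVEL can run
     (the "guaranteed replacement").  */
  for (size_t j = 0; j < n; j++)
    if (impls[j].isa_level > impls[i].isa_level
        && impls[j].isa_level <= build_level)
      return 0;

  return 1;
}

/* Self-check mirroring the cases spelled out in the commit message.  */
static void
check_memchr_example (void)
{
  /* Level 2 build: no level 2 memchr exists, so sse2 is kept.  */
  assert (should_build (memchr_impls, N_MEMCHR_IMPLS, 0, 2) == 1);
  /* Level 3 build: avx2 always suffices, so sse2 is dropped.  */
  assert (should_build (memchr_impls, N_MEMCHR_IMPLS, 0, 3) == 0);
  /* Level 4 build: evex always suffices, so avx2 is dropped.  */
  assert (should_build (memchr_impls, N_MEMCHR_IMPLS, 1, 4) == 0);
}
```

Calling `check_memchr_example ()` exercises the three cases the commit message walks through. A real build would make this decision at compile time rather than at run time, which is why the log speaks of build guards and wrapper macros.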


1714 Commits