There is transition penalty when SSE instructions are mixed with 256-bit
AVX or 512-bit AVX512 load instructions. Since _dl_runtime_resolve_avx
and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
registers, there is transition penalty when SSE instructions are used
with lazy binding on AVX and AVX512 processors.
To avoid SSE transition penalty, if only the lower 128 bits of the first
8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
with the zero upper bits.
For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
or the upper 256 bits of ZMM registers are zero. We can restore only the
non-zero portion of vector registers with AVX/AVX512 load instructions
which will zero-extend upper bits of vector registers.
This patch adds _dl_runtime_resolve_sse_vex which saves and restores
XMM registers with 128-bit AVX store/load instructions. It is used to
preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
_dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
that we store and load only the non-zero portion of vector registers.
This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
_dl_runtime_profile_avx512 when only the lower 128 bits of vector
registers are used.
_dl_runtime_resolve_avx_slow is added and used for AVX processors which
don't support XGETBV with ECX == 1. Since there is no SSE transition
penalty on AVX512 processors which don't support XGETBV with ECX == 1,
_dl_runtime_resolve_avx512_slow isn't provided.
[BZ #20495]
[BZ #20508]
* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
processors, set Use_dl_runtime_resolve_slow and set
Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
New.
(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
(index_arch_Use_dl_runtime_resolve_opt): Likewise.
(index_arch_Use_dl_runtime_resolve_slow): Likewise.
* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
if Use_dl_runtime_resolve_opt is set. Use
_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
(_dl_runtime_resolve_opt): New. Defined for AVX and AVX512.
(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
New.
(_dl_runtime_resolve_opt): Likewise.
(_dl_runtime_profile): Define only if _dl_runtime_profile is
defined.
All other state bits, except for bit_YMM_state, are defined as (1 << N).
This patch changes bit_YMM_state from (2 << 1) to (1 << 2).
* sysdeps/x86/cpu-features.h (bit_YMM_state): Set to (1 << 2).
Although the Enhanced REP MOVSB/STOSB (ERMS) implementations of memmove,
memcpy, mempcpy and memset aren't used by the current processors, this
patch adds Prefer_ERMS check in memmove, memcpy, mempcpy and memset so
that they can be used in the future.
* sysdeps/x86/cpu-features.h (bit_arch_Prefer_ERMS): New.
(index_arch_Prefer_ERMS): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Return
__memcpy_erms for Prefer_ERMS.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
(__memmove_erms): Enabled for libc.a.
* ysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Return
__memmove_erms or Prefer_ERMS.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Return
__mempcpy_erms for Prefer_ERMS.
* sysdeps/x86_64/multiarch/memset.S (memset): Return
__memset_erms for Prefer_ERMS.
Skip counting logical threads for Intel processors if the HTT bit is 0
which indicates there is only a single logical processor.
* sysdeps/x86/cacheinfo.c (init_cacheinfo): Skip counting
logical threads if the HTT bit is 0.
* sysdeps/x86/cpu-features.h (bit_cpu_HTT): New.
(index_cpu_HTT): Likewise.
(reg_HTT): Likewise.
Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym. Remove
x86 ifunc-defines.sym and rtld-global-offsets.sym. No code changes on
i686 and x86-64.
* sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers):
Remove ifunc-defines.sym.
* sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers):
Likewise.
* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed.
* sysdeps/x86/rtld-global-offsets.sym: Likewise.
* sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise.
* sysdeps/x86/Makefile (gen-as-const-headers): Remove
rtld-global-offsets.sym.
* sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ...
* sysdeps/x86/cpu-features-offsets.sym: This.
* sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h>
instead of <ifunc-defines.h> and <rtld-global-offsets.h>.
The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which
has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB
(ERMS) bit to x86 cpu-features.
* sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New.
(index_cpu_ERMS): Likewise.
(reg_ERMS): Likewise.
On AMD processors, memcpy optimized with unaligned SSE load is
slower than emcpy optimized with aligned SSSE3 while other string
functions are faster with unaligned SSE load. A feature bit,
Fast_Unaligned_Copy, is added to select memcpy optimized with
unaligned SSE load.
[BZ #19583]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set
Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
processors. Set Fast_Copy_Backward for AMD Excavator
processors.
* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
New.
(index_arch_Fast_Unaligned_Copy): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
Since only Intel processors with AVX2 have fast unaligned load, we
should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
and call get_common_indeces for other processors.
Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to aoid loading
GLRO(dl_x86_cpu_features) in cpu-features.c.
[BZ #19583]
* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
inline. Check family before setting family, model and
extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable
bits here.
(init_cpu_features): Replace HAS_CPU_FEATURE and
HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load
for Intel processors with usable AVX2. Call get_common_indeces
for other processors with family == NULL.
* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
(CPU_FEATURES_ARCH_P): Likewise.
(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
index_* and bit_* macros are used to access cpuid and feature arrays o
struct cpu_features. It is very easy to use bits and indices of cpuid
array on feature array, especially in assembly codes. For example,
sysdeps/i386/i686/multiarch/bcopy.S has
HAS_CPU_FEATURE (Fast_Rep_String)
which should be
HAS_ARCH_FEATURE (Fast_Rep_String)
We change index_* and bit_* to index_cpu_*/index_arch_* and
bit_cpu_*/bit_arch_* so that we can catch such error at build time.
[BZ #19762]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
(bit_arch_*): This for feature array.
(bit_*): Renamed to ...
(bit_cpu_*): This for cpu array.
(index_*): Renamed to ...
(index_arch_*): This for feature array.
(index_*): Renamed to ...
(index_cpu_*): This for cpu array.
[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
[__ASSEMBLER__] (HAS_CPU_FEATURE)): Pass cpu to HAS_FEATURE.
[__ASSEMBLER__] (HAS_ARCH_FEATURE)): Pass arch to HAS_FEATURE.
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
bit_##name with index_cpu_##name and bit_cpu_##name.
[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
bit_##name with index_arch_##name and bit_arch_##name.
It shows improvement up to 28% over AVX2 memset (performance results
attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
* sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
* sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
index_Prefer_No_VZEROUPPER): New.
* sysdeps/x86/cpu-features.c (init_cpu_features): Set the
Prefer_No_VZEROUPPER for Knights Landing.
According to Silvermont software optimization guide, for 64-bit
applications, branch prediction performance can be negatively impacted
when the target of a branch is more than 4GB away from the branch. Add
the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable
pages with MAP_32BIT first. NB: MAP_32BIT will map to lower 2GB, not
lower 4GB, address. Prefer_MAP_32BIT_EXEC reduces bits available for
address space layout randomization (ASLR), which is always disabled for
SUID programs and can only be enabled by setting environment variable,
LD_PREFER_MAP_32BIT_EXEC.
On Fedora 23, this patch speeds up GCC 5 testsuite by 3% on Silvermont.
[BZ #19367]
* sysdeps/unix/sysv/linux/wordsize-64/mmap.c: New file.
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h: Likewise.
* sysdeps/unix/sysv/linux/x86_64/64/mmap.c: Likewise.
* sysdeps/x86/cpu-features.h (bit_Prefer_MAP_32BIT_EXEC): New.
(index_Prefer_MAP_32BIT_EXEC): Likewise.
We detect i586 and i686 features at run-time by checking CX8 and CMOV
CPUID features bits. We can use these information to select the best
implementation in ix86 multiarch. HAS_I586/HAS_I686 is true if i586/i686
instructions are available on the processor.
Due to the reordering and the other nifty extensions in i686, it is not
really good to use heavily i586 optimized code on an i686. It's better
to use i486 code if it isn't an i586. USE_I586/USE_I686 is true if
i586/i686 implementation should be used for the processor. USE_I586
is true only if i686 instructions aren't available. If i686 instructions
are available, we always choose i686 or i486 implementation, in that order,
and we never choose i586 implementation for i686-class processors.
* sysdeps/i386/init-arch.h: New file.
* sysdeps/i386/i586/init-arch.h: Likewise.
* sysdeps/i386/i686/init-arch.h: Likewise.
* sysdeps/x86/cpu-features.c (init_cpu_features): Set bit_I586
bit if CX8 is available. Set bit_I686 bit if CMOV is available.
* sysdeps/x86/cpu-features.h (bit_I586): New.
(bit_I686): Likewise.
(bit_CX8): Likewise.
(bit_CMOV): Likewise.
(index_CX8): Likewise.
(index_CMOV): Likewise.
(index_I586): Likewise.
(index_I686): Likewise.
(reg_CX8): Likewise.
(reg_CMOV): Likewise.
(HAS_I586): Defined as HAS_ARCH_FEATURE (I586) if i586 isn't
available at compile-time.
(HAS_I686): Defined as HAS_ARCH_FEATURE (I686) if i686 isn't
available at compile-time.
* sysdeps/x86/init-arch.h (USE_I586): New macro.
(USE_I686): Likewise.
cpuid, i586 and i686 instructions are available if the processor
specified by -march= supports them. We can use this information
to determine whether those instructions can be used safely.
* sysdeps/x86/cpu-features.c (init_cpu_features): Check
whether cpuid is available only if HAS_CPUID is 0.
* sysdeps/x86/cpu-features.h (HAS_CPUID): New.
(HAS_I586): Likewise.
(HAS_I686): Likewise.