The current IFUNC selection is based on microbenchmarks in glibc. It
should give the best performance for most workloads. But other choices
may have better performance for a particular workload or on the hardware
which wasn't available at the selection was made. The environment
variable, GLIBC_TUNABLES=glibc.tune.ifunc=-xxx,yyy,-zzz...., can be used
to enable CPU/ARCH feature yyy, disable CPU/ARCH feature yyy and zzz,
where the feature name is case-sensitive and has to match the ones in
cpu-features.h. It can be used by glibc developers to override the
IFUNC selection to tune for a new processor or improve performance for
a particular workload. It isn't intended for normal end users.
NOTE: the IFUNC selection may change over time. Please check all
multiarch implementations when experimenting.
Also, GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=NUMBER is
provided to set threshold to use non temporal store to NUMBER,
GLIBC_TUNABLES=glibc.tune.x86_data_cache_size=NUMBER to set data cache
size, GLIBC_TUNABLES=glibc.tune.x86_shared_cache_size=NUMBER to set
shared cache size.
* elf/dl-tunables.list (tune): Add ifunc,
x86_non_temporal_threshold,
x86_data_cache_size and x86_shared_cache_size.
* manual/tunables.texi: Document glibc.tune.ifunc,
glibc.tune.x86_data_cache_size, glibc.tune.x86_shared_cache_size
and glibc.tune.x86_non_temporal_threshold.
* sysdeps/unix/sysv/linux/x86/dl-sysdep.c: New file.
* sysdeps/x86/cpu-tunables.c: Likewise.
* sysdeps/x86/cacheinfo.c
(init_cacheinfo): Check and get data cache size, shared cache
size and non temporal threshold from cpu_features.
* sysdeps/x86/cpu-features.c [HAVE_TUNABLES] (TUNABLE_NAMESPACE):
New.
[HAVE_TUNABLES] Include <unistd.h>.
[HAVE_TUNABLES] Include <elf/dl-tunables.h>.
[HAVE_TUNABLES] (TUNABLE_CALLBACK (set_ifunc)): Likewise.
[HAVE_TUNABLES] (init_cpu_features): Use TUNABLE_GET to set
IFUNC selection, data cache size, shared cache size and non
temporal threshold.
* sysdeps/x86/cpu-features.h (cpu_features): Add data_cache_size,
shared_cache_size and non_temporal_threshold.
Since cpu_features is available, use it instead of dl_x86_cpu_features.
* sysdeps/x86/cacheinfo.c (intel_check_word): Accept cpu_features
and use it instead of dl_x86_cpu_features.
(handle_intel): Replace maxidx with cpu_features. Pass
cpu_features to intel_check_word.
(__cache_sysconf): Pass cpu_features to handle_intel.
(init_cacheinfo): Likewise. Use cpu_features instead of
dl_x86_cpu_features.
__x86_shared_non_temporal_threshold was set to 6 times of per-core
shared cache size, based on the large memcpy micro benchmark in glibc
on a 8-core processor. For a processor with more than 8 cores, the
threshold is too low. Set __x86_shared_non_temporal_threshold to the
3/4 of the total shared cache size so that it is unchanged on 8-core
processors. On processors with less than 8 cores, the threshold is
lower.
* sysdeps/x86/cacheinfo.c (__x86_shared_non_temporal_threshold):
Set to the 3/4 of the total shared cache size.
Remove is_intel, is_amd and max_cpuid macros. Use __get_cpu_features
to get cpu_features instead.
* sysdeps/x86/cacheinfo.c (is_intel): Removed.
(is_amd): Likewise.
(max_cpuid): Likewise.
(__cache_sysconf): Use __get_cpu_features to get cpu_features.
(init_cacheinfo): Likewise.
Since the maximum CPUID level of older Intel CPUs is 1, change
handle_intel to return -1, instead of assert, when the maximum
CPUID level is less than 2.
[BZ #20647]
* sysdeps/x86/cacheinfo.c (handle_intel): Return -1 if the
maximum CPUID level is less than 2.
For Intel processors, when there are both L2 and L3 caches, SMT level
type should be ued to count number of available logical processors
sharing L2 cache. If there is only L2 cache, core level type should
be used to count number of available logical processors sharing L2
cache. Number of available logical processors sharing L2 cache should
be used for non-inclusive L2 and L3 caches.
* sysdeps/x86/cacheinfo.c (init_cacheinfo): Count number of
available logical processors with SMT level type sharing L2
cache for Intel processors.
L2 cache is shared by 2 cores on Knights Landing, which has 4 threads
per core:
https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing
So L2 cache is shared by 8 threads on Knights Landing as reported by
CPUID. We should remove special L2 cache case for Knights Landing.
[BZ #18185]
* sysdeps/x86/cacheinfo.c (init_cacheinfo): Don't limit threads
sharing L2 cache to 2 for Knights Landing.
Intel CPUID with EAX == 11 returns:
ECX Bits 07 - 00: Level number. Same value in ECX input.
Bits 15 - 08: Level type.
^^^^^^^^^^^^^^^^^^^^^^^^ This is level type.
Bits 31 - 16: Reserved.
Intel processor level type mask should be 0xff00, not 0xff0.
[BZ #20119]
* sysdeps/x86/cacheinfo.c (init_cacheinfo): Correct Intel
processor level type mask for CPUID with EAX == 11.
Skip counting logical threads for Intel processors if the HTT bit is 0
which indicates there is only a single logical processor.
* sysdeps/x86/cacheinfo.c (init_cacheinfo): Skip counting
logical threads if the HTT bit is 0.
* sysdeps/x86/cpu-features.h (bit_cpu_HTT): New.
(index_cpu_HTT): Likewise.
(reg_HTT): Likewise.
Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86. No code changes on x86
and x86_64.
* sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c>
instead of <sysdeps/x86_64/cacheinfo.c>.
* sysdeps/x86_64/cacheinfo.c: Moved to ...
* sysdeps/x86/cacheinfo.c: Here.