Add support for MTE to memchr. Regression tested with xcheck and benchmarked
with glibc's benchtests on the Cortex-A53, Cortex-A72, and Neoverse N1.
The existing implementation assumes that any access to the pages in which the
string resides is safe. This assumption is not true when MTE is enabled. This
patch updates the algorithm to ensure that accesses remain within the bounds
of an MTE tag (16-byte chunks) and improves overall performance.
Co-authored-by: Gabor Kertesz <gabor.kertesz@arm.com>
Add support for MTE to strcpy. Regression tested with xcheck and benchmarked
with glibc's benchtests on the Cortex-A53, Cortex-A72, and Neoverse N1.
The existing implementation assumes that any access to the pages in which the
string resides is safe. This assumption is not true when MTE is enabled. This
patch updates the algorithm to ensure that accesses remain within the bounds
of an MTE tag (16-byte chunks) and improves overall performance.
Co-authored-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
1. Divide architecture features into the usable features and the preferred
features. The usable features are for correctness and can be exported in
a stable ABI. The preferred features are for performance and only for
glibc internal use.
2. Change struct cpu_features to
struct cpu_features
{
struct cpu_features_basic basic;
unsigned int *usable_p;
struct cpuid_registers cpuid[COMMON_CPUID_INDEX_MAX];
unsigned int usable[USABLE_FEATURE_INDEX_MAX];
unsigned int preferred[PREFERRED_FEATURE_INDEX_MAX];
...
};
and initialize usable_p to pointer to the usable arary so that
struct cpu_features
{
struct cpu_features_basic basic;
unsigned int *usable_p;
struct cpuid_registers cpuid[COMMON_CPUID_INDEX_MAX];
};
can be exported via a stable ABI. The cpuid and usable arrays can be
expanded with backward binary compatibility for both .o and .so files.
3. Add COMMON_CPUID_INDEX_7_ECX_1 for AVX512_BF16.
4. Detect ENQCMD, PKS, AVX512_VP2INTERSECT, MD_CLEAR, SERIALIZE, HYBRID,
TSXLDTRK, L1D_FLUSH, CORE_CAPABILITIES and AVX512_BF16.
5. Rename CAPABILITIES to ARCH_CAPABILITIES.
6. Check if AVX512_VP2INTERSECT, AVX512_BF16 and PKU are usable.
7. Update CPU feature detection test.
The -fno-math-errno is already added by default and the minimum
required GCC to build glibc (6.2) make the -ffinite-math-only
superflous.
Checked on aarch64-linux-gnu.
Checked with a build for riscv64-linux-gnu-rv64imac-lp64 (no
builtin support), riscv64-linux-gnu-rv64imafdc-lp64, and
riscv64-linux-gnu-rv64imafdc-lp64d.
The generic implementation is simplified by removing the
'optimization' for !_IEEE_FP_INEXACT (which does not handle
inexact neither some values).
Checked on alpha-linux-gnu.
The powerpc sqrt implementation is also simplified:
- the static constants are open coded within the implementation.
- for !USE_SQRT_BUILTIN the function is implemented directly on
__ieee754_sqrt (it avoid an superflous extra jump).
Checked on powerpc-linux-gnu and powerpc64le-linux-gnu.
The define is already set on the math-use-builtins-ceil.h, the patch
just removes the implementations (it was missed on c9feb1be93).
Checked on aarch64-linux-gnu.
Each symbol definitions are moved on a separated file and it
cover all symbol type definitions (float, double, long double,
and float128).
It allows to set support for architectures without the boiler
place of copying default values.
Checked with a build on the affected ABIs.
The generic implementation is slight worse (Itanium(R) Processor 9020):
Before new code:
"exp10f": {
"workload-spec2017.wrf (adapted)": {
"duration": 3.61582e+08,
"iterations": 2.384e+07,
"reciprocal-throughput": 14.8334,
"latency": 15.5006,
"max-throughput": 6.74153e+07,
"min-throughput": 6.45136e+07
}
}
With new code:
"exp10f": {
"workload-spec2017.wrf (adapted)": {
"duration": 3.85549e+08,
"iterations": 2.384e+07,
"reciprocal-throughput": 15.8391,
"latency": 16.5056,
"max-throughput": 6.31348e+07,
"min-throughput": 6.05857e+07
}
}
However it fixes all the issues on both:
math/test-float-exp10
math/test-float32-exp10
(all the issues wrong results for non default rounding modes).
The existing ia64 libm interface uses matherrf and matherrl in addition
to matherr for SVID error handling. However, there is no such error
handling support for exp10f in ia64 libm. So replacing it with the
generic implementation should be fine.
Checked on ia64-linux-gnu.
This patch changes the exp10f error handling semantics to only set
errno according to POSIX rules. New symbol version is introduced at
GLIBC_2.32. The old wrappers are kept for compat symbols.
There are some outliers that need special handling:
- ia64 provides an optimized implementation of exp10f that uses ia64
specific routines to set SVID compatibility. The new symbol version
is aliased to the exp10f one.
- m68k also provides an optimized implementation, and the new version
uses it instead of the sysdeps/ieee754/flt32 one.
- riscv and csky uses the generic template implementation that
does not provide SVID support. For both cases a new exp10f
version is not added, but rather the symbols version of the
generic sysdeps/ieee754/flt32 is adjusted instead.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, i686-linux-gnu,
powerpc64le-linux-gnu.
It is inspired by expf and reuses its tables and internal functions.
The error checks are inlined and errno setting is in separate tail
called functions, but the wrappers are kept in this patch to handle
the _LIB_VERSION==_SVID_ case.
Double precision arithmetics is used which is expected to be faster on
most targets (including soft-float) than using single precision and it
is easier to get good precision result with it.
Result for x86_64 (i7-4790K CPU @ 4.00GHz) are:
Before new code:
"exp10f": {
"workload-spec2017.wrf (adapted)": {
"duration": 4.0414e+09,
"iterations": 1.00128e+08,
"reciprocal-throughput": 26.6818,
"latency": 54.043,
"max-throughput": 3.74787e+07,
"min-throughput": 1.85038e+07
}
With new code:
"exp10f": {
"workload-spec2017.wrf (adapted)": {
"duration": 4.11951e+09,
"iterations": 1.23968e+08,
"reciprocal-throughput": 21.0581,
"latency": 45.4028,
"max-throughput": 4.74876e+07,
"min-throughput": 2.20251e+07
}
Result for aarch64 (A72 @ 2GHz) are:
Before new code:
"exp10f": {
"workload-spec2017.wrf (adapted)": {
"duration": 4.62362e+09,
"iterations": 3.3376e+07,
"reciprocal-throughput": 127.698,
"latency": 149.365,
"max-throughput": 7.831e+06,
"min-throughput": 6.69501e+06
}
With new code:
"exp10f": {
"workload-spec2017.wrf (adapted)": {
"duration": 4.29108e+09,
"iterations": 6.6752e+07,
"reciprocal-throughput": 51.2111,
"latency": 77.3568,
"max-throughput": 1.9527e+07,
"min-throughput": 1.29271e+07
}
Checked on x86_64-linux-gnu, powerpc64le-linux-gnu, aarch64-linux-gnu,
and sparc64-linux-gnu.
strcmp-avx2.S: In avx2 strncmp function, strings are compared in
chunks of 4 vector size(i.e. 32x4=128 byte for avx2). After first 4
vector size comparison, code must check whether it already passed
the given offset. This patch implement avx2 offset check condition
for strncmp function, if both string compare same for first 4 vector
size.
They were to be moved to sysdeps/pthread/Makefile in 45fce058f ('htl:
Enable more cancellation tests')
* nptl/Makefile: (tests): Remove tst-cancelx9.
(CFLAGS-tst-cancelx9.c): Remove.
Linux 5.7 has no new syscalls. Update the version number in
syscall-names.list to reflect that it is still current for 5.7.
Tested with build-many-glibcs.py.
This came to light when adding hard-flaot support to ARC glibc port
without hardware sqrt support causing glibc build to fail:
| ../sysdeps/ieee754/dbl-64/e_sqrt.c: In function '__ieee754_sqrt':
| ../sysdeps/ieee754/dbl-64/e_sqrt.c:58:54: error: unused variable 'ty' [-Werror=unused-variable]
| double y, t, del, res, res1, hy, z, zz, p, hx, tx, ty, s;
The reason being EMULV() macro uses the hardware provided
__builtin_fma() variant, leaving temporary variables 'p, hx, tx, hy, ty'
unused hence compiler warning and ensuing error.
The intent of the patch was to fix that error, but EMULV is pervasive
and used fair bit indirectly via othe rmacros, hence this patch.
Functionally it should not result in code gen changes and if at all
those would be better since the scope of those temporaries is greatly
reduced now
Built tested with aarch64-linux-gnu arm-linux-gnueabi arm-linux-gnueabihf hppa-linux-gnu x86_64-linux-gnu arm-linux-gnueabihf riscv64-linux-gnu-rv64imac-lp64 riscv64-linux-gnu-rv64imafdc-lp64 powerpc-linux-gnu microblaze-linux-gnu nios2-linux-gnu hppa-linux-gnu
Also as suggested by Joseph [1] used --strip and compared the libs with
and w/o patch and they are byte-for-byte unchanged (with gcc 9).
| for i in `find . -name libm-2.31.9000.so`;
| do
| echo $i; diff $i /SCRATCH/vgupta/gnu2/install/glibcs/$i ; echo $?;
| done
| ./aarch64-linux-gnu/lib64/libm-2.31.9000.so
| 0
| ./arm-linux-gnueabi/lib/libm-2.31.9000.so
| 0
| ./x86_64-linux-gnu/lib64/libm-2.31.9000.so
| 0
| ./arm-linux-gnueabihf/lib/libm-2.31.9000.so
| 0
| ./riscv64-linux-gnu-rv64imac-lp64/lib64/lp64/libm-2.31.9000.so
| 0
| ./riscv64-linux-gnu-rv64imafdc-lp64/lib64/lp64/libm-2.31.9000.so
| 0
| ./powerpc-linux-gnu/lib/libm-2.31.9000.so
| 0
| ./microblaze-linux-gnu/lib/libm-2.31.9000.so
| 0
| ./nios2-linux-gnu/lib/libm-2.31.9000.so
| 0
| ./hppa-linux-gnu/lib/libm-2.31.9000.so
| 0
| ./s390x-linux-gnu/lib64/libm-2.31.9000.so
[1] https://sourceware.org/pipermail/libc-alpha/2019-November/108267.html
Now that ldconfig defaults to the new format (only), check for it
first. Also apply the corruption check added in commit 2954daf00b
("Add more checks for valid ld.so.cache file (bug 18093)") to the
new-format-only case.
Suggested-by: Josh Triplett <josh@joshtriplett.org>
* sysdeps/mach/hurd/Makefile [subdir=misc] (sysdep_routines): Add
writev_nocancel writev_nocancel_nostatus.
* sysdeps/mach/hurd/not-cancel.h (__writev_nocancel_nostatus): Replace
macro with function declaration (with hidden prototype in libc).
(__writev_nocancel): New function declaration (with hidden prototype in libc).
* sysdeps/mach/hurd/writev_nocancel_nostatus.c: New file.
* sysdeps/posix/writev_nocancel.c: New file, includes writev.c to make a
nocancel variant that calls __write_nocancel.
* sysdeps/posix/writev.c (writev): Do not define alias if __writev is
renamed.
and add _nocancel variants.
* sysdeps/mach/hurd/write.c (__libc_write): Call __write_nocancel
surrounded by enabling async cancel, to replace implementation moved
to...
* sysdeps/mach/hurd/write_nocancel.c (__write_nocancel): ... here.
* sysdeps/mach/hurd/pwrite64.c (__libc_pwrite64): Call
__pwrite64_nocancel surrounded by enabling async cancel, to replace
implementation moved to...
* sysdeps/mach/hurd/pwrite64_nocancel.c (__pwrite64_nocancel): ... here.
* sysdeps/mach/hurd/Makefile (sysdep_routines): Add write_nocancel and
pwrite64_nocancel.
* sysdeps/mach/hurd/not-cancel.h (__write_nocancel,
__pwrite64_nocancel): Replace macros with prototypes with a hidden proto on
libc.
* sysdeps/mach/hurd/dl-sysdep.c (__write_nocancel): New alias, check
that it is not hidden.
* sysdeps/mach/hurd/Versions (libc.GLIBC_PRIVATE): Add __write_nocancel.
(ld.GLIBC_PRIVATE): Add __write_nocancel.
* sysdeps/mach/hurd/i386/localplt.data (__write_nocancel): Add
reference.
* sysdeps/htl/stdio-lock.h: New file, registers locking cleanup to htl.
* sysdeps/htl/libc-lockP.h: Include <libc-lock.h>.
(__libc_cleanup_region_start, __libc_cleanup_end,
__libc_cleanup_region_end): Override macros from <libc-lock.h> with
versions which register cleanup to htl.
(__pthread_get_cleanup_stack): Make reference weak for skipping
registration on in the static non-libpthread case.
If libpthread gets loaded dynamically, the stack needs to already contain the
cleanup handlers of the main thread.
* htl/libc_pthread_init.c (__pthread_cleanup_stack): New per-thread variable.
* htl/Versions (libc): Add __pthread_cleanup_stack as private symbol.
* htl/pt-internal.h (struct __pthread): Remove cancelation_handlers
field.
(__pthread_cleanup_stack): Add variable declaration.
* htl/pt-alloc.c (initialize_pthread): Remove initialization of
cancelation_handlers field.
* htl/pt-cleanup.c (__pthread_get_cleanup_stack): Return the address of
__pthread_cleanup_stack instead of that of the cancelation_handlers
field.
* htl/forward.c: Include <pt-internal.h>.
(dummy_list): Remove variable.
(__pthread_get_cleanup_stack): Return the address of __pthread_cleanup_stack
instead of that of dummy_list.
* sysdeps/mach/hurd/recv.c (__recv): Make the __socket_recv call
cancellable.
* sysdeps/mach/hurd/recvfrom.c (__recvfrom): Make the __socket_recv and
__socket_whatis_address calls cancellable.
* sysdeps/mach/hurd/recvmsg.c (__libc_recvmsg): Make the __socket_recv,
__socket_whatis_address, __io_reauthenticate, and __auth_user_authenticate calls
cancellable.
Added a check to detect the CPU value in preconfigure, so that glibc is
built with the correct --with-cpu value. And move existing checks into
preconfigure.ac.
Co-Authored-By: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
Co-Authored-By: Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>
libpthread_syms.a will contain the symbols that libc tries to get from
libpthread, to be used by the system, but also by tests.
* htl/libpthread.a, htl/libpthread_pic.a: Link libpthread_syms.a and Move EXTERN
references to...
* htl/libpthread_syms.a: ... new file. Add missing
__pthread_enable_asynccancel reference.
* htl/Makefile: Install libpthread_syms.a and link it into static tests.
Introduce an Arm MTE compatible strlen implementation.
The existing implementation assumes that any access to the pages in
which the string resides is safe. This assumption is not true when
MTE is enabled. This patch updates the algorithm to ensure that
accesses remain within the bounds of an MTE tag (16-byte chunks) and
improves overall performance on modern cores. On cores with less
efficient Advanced SIMD implementation such as Cortex-A53 it can
be slower.
Benchmarked on Cortex-A72, Cortex-A53, Neoverse N1.
Co-authored-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
Introduce an Arm MTE compatible strchr implementation.
The existing implementation assumes that any access to the pages in
which the string resides is safe. This assumption is not true when
MTE is enabled. This patch updates the algorithm to ensure that
accesses remain within the bounds of an MTE tag (16-byte chunks) and
improves overall performance.
Benchmarked on Cortex-A72, Cortex-A53, Neoverse N1.
Co-authored-by: Wilco Dijkstra <wilco.dijkstra@arm.com>