This patch set introduces optimized string, wcsmbs and memory functions for
S390/S390x. The functions are accelerated by the usage of the new z13 vector
instructions.
The Principles of Operations manual for IBM z13 is publically available:
http://publibfi.boulder.ibm.com/epubs/pdf/dz9zr010.pdf
The support for these instructions in assembler was introduced by commits:
-"[Committed] S/390: Add support for IBM z13."
(https://sourceware.org/ml/binutils/2015-01/msg00197.html)
-"[Committed] S/390: Add more IBM z13 instructions"
(https://sourceware.org/ml/binutils/2015-03/msg00088.html)
The first patches do preparation for the latter optimization patches.
The floating point exception handling - fetestexcept(), ... - is fixed and
the platform and hwcap strings are extended.
The current ifunc routines memset, memcpy and memcmp are refactored and the
ifunc test-framework is now enabled.
A S390 specific configure-check tests if the used binutils supports the new
vector instructions. The optimized functions are provided via ifunc if the
binutils supports the vector instructions. Otherwise a message is dumped to
configure output and only the currently used common code functions are
available.
The optimized functions are implemented in common for s390-32 and s390-64
and the few differences are handled via #ifdef.
The ifunc-resolvers are defined in files sysdeps/s390/multiarch/<func>.c,
which choose either the current implementation __<func>_c() or the vector
implementation __<func>_vx() depending on the HWCAP_S390_VX flag bit in
AT_HWCAP field. If the bit is set, the hardware and the kernel are supporting
vector registers and instructions. If the used binutils lacks vector-support,
then the default implementation in string or wcsmbs directory is included
here instead.
The file sysdeps/s390/multiarch/<func>-c.c includes the current implementation
and defines the function name __<func>_c.
The assembler files sysdeps/s390/multiarch/<func>-vx.S with the vector
instructions are using the directive '.machine "z13"' to allow building glibc
without option '-march=z13'. Additionally the directive '.machinemode
"zarch_nohighgprs"' is needed for the 31bit glibc. This mode does not set the
highgprs flag in ELF header, which would lead to an unloadable libc on a 31bit
kernel.
The most optimized string functions are structured in the same way:
The first 16 bytes of the string is loaded unaligned via vlbb - vector load
to block boundary (e.g. 4k). This instruction loads 16 bytes if possible.
In case of a page cross, it only loads the last bytes of the current page
without a segmentation fault.
Afterwards these first part of string is processed. If e.g. for strlen the end
of string is reached within this first part, the function returns. Otherwise
the pointer is aligned to 16 byte, so i can load a full vector register with vl
without checking for a page cross. Afterwards the first part of string is
processed. If e.g. for strlen the end of string is reached within this first
part, the function returns. Otherwise the pointer is aligned to 16 byte, so
a full vector register can be loaded with vl - vector load - without checking
for a page cross. The remaining string is processed in a four times unrolled
loop, because benchmark results measured improvements compared to a non
unrolled loop.
The optimized wide string functions can only handle 4byte aligned string
pointers. Although a wchar_t pointer should always be 4byte aligned, the most
current common code wide string functions can handle non aligned strings.
Thus the optimized functions will fall back to the common code functions in
case of a non aligned wide string to behave the same as before this patch.
Some string tests can test the string and the wide string version of a function.
The remaining ones are extended and new wide string tests are added.
This is the same in case of the benchtests.
ChangeLog:
* NEWS: New item for IBM z13 string optimizations.
The csqrt implementations in glibc can miss underflow exceptions when
the real or imaginary part of the result becomes tiny in the course of
scaling down (in particular, multiplication by 0.5) and that scaling
is exact although the relevant part of the mathematical result isn't.
This patch forces the exception in a similar way to previous fixes.
Tested for x86_64 and x86.
[BZ #18370]
* math/s_csqrt.c (__csqrt): Force underflow exception for results
whose real or imaginary part has small absolute value.
* math/s_csqrtf.c (__csqrtf): Likewise.
* math/s_csqrtl.c (__csqrtl): Likewise.
* math/auto-libm-test-in: Add more tests of csqrt.
* math/auto-libm-test-out: Regenerated.
* sysdeps/i386/fpu/libm-test-ulps: Update.
lang_lib (which reflects ISO 639-2/B (bibliographic) codes) and
lang_term (which reflects ISO 639-2/T (terminology) codes) should be
identical except for those languages for which ISO 639-2 specifies
separate bibliographic/terminology values.
I used this Library of Congress page as the source:
http://www.loc.gov/standards/iso639-2/php/code_list.php
The csqrt functions scale up small arguments to avoid underflows when
calling hypot functions. However, even when hypot does not underflow,
a subsequent calculation of 0.5 * hypot can underflow. This patch
duly increases the threshold and scale factor to avoid such underflows
as well.
Tested for x86_64, x86 and mips64.
[BZ #18823]
* math/s_csqrt.c (__csqrt): Increase threshold and scale factor
for scaling up small arguments.
* math/s_csqrtf.c (__csqrtf): Likewise.
* math/s_csqrtl.c (__csqrtl): Likewise.
* math/auto-libm-test-in: Add more tests of csqrt.
* math/auto-libm-test-out: Regenerated.
I think the last clause of the conditional,
|| __n <= __bos (__dest)
may be backward. The code should call the runtime-checking function
if __n is not constant, or if __n is known to be LARGER than the size
of the destination.
Various fma implementations have logic that, when computing fma (x, y,
z) where z is large (so care needs taking to avoid internal overflow)
but x * y is small, scale x * y up instead of down to avoid internal
underflows resulting from scaling down. (In these cases, x * y is
small enough that only its sign actually matters rather than the exact
value.)
The threshold for scaling up instead of down was correct for "if the
unscaled values were multiplied, the low part of the multiplication
could underflow", and the scaling was sufficient to ensure that the
low part of the multiplication did not underflow (given that cases of
very small x * y - less than half the least subnormal - were
previously dealt with). However, the choice in the functions wasn't
between scaling up or no scaling, but between scaling up and scaling
down (scaling down actually being needed when x * y isn't so small
compared to z and so the exact value does matter). Thus a larger
threshold is needed to ensure that scaling down doesn't produce values
the multiplication of whose low parts underflows. This patch
increases the thresholds accordingly.
Tested for x86_64, x86 and mips64 (with the MIPS version of s_fmal.c
removed so that the ldbl-128 version gets tested instead of the
soft-fp one).
[BZ #18824]
* sysdeps/ieee754/dbl-64/s_fma.c (__fma): Increase threshold for
scaling x * y up instead of down.
* sysdeps/ieee754/ldbl-128/s_fmal.c (__fmal): Likewise.
* sysdeps/ieee754/ldbl-96/s_fmal.c (__fmal): Likewise.
* math/auto-libm-test-in: Add more tests of fma.
* math/auto-libm-test-out: Regenerated.
Similar to various other bugs in this area, some tanh implementations
do not raise the underflow exception for subnormal arguments, when the
result is tiny and inexact. This patch forces the exception in a
similar way to previous fixes.
Tested for x86_64, x86, mips64 and powerpc.
[BZ #16520]
* sysdeps/ieee754/dbl-64/s_tanh.c: Include <float.h>.
(__tanh): Force underflow exception for arguments with small
absolute value.
* sysdeps/ieee754/flt-32/s_tanhf.c: Include <float.h>.
(__tanhf): Force underflow exception for arguments with small
absolute value.
* sysdeps/ieee754/ldbl-128/s_tanhl.c: Include <float.h>.
(__tanhl): Force underflow exception for arguments with small
absolute value.
* sysdeps/ieee754/ldbl-128ibm/s_tanhl.c: Include <float.h>.
(__tanhl): Force underflow exception for arguments with small
absolute value.
* sysdeps/ieee754/ldbl-96/s_tanhl.c: Include <float.h>.
(__tanhl): Force underflow exception for arguments with small
absolute value.
* math/auto-libm-test-in: Add more tests of tanh.
* math/auto-libm-test-out: Regenerated.
* sysdeps/i386/fpu/libm-test-ulps: Update.
https://sourceware.org/bugzilla/show_bug.cgi?id=18778
If dlopen fails to load an object that has triggered loading libpthread it
causes ld.so to unload libpthread because its DF_1_NODELETE flags has been
forcefully cleared. The next call to __rtdl_unlock_lock_recursive will crash
since pthread_mutex_unlock no longer exists.
This patch moves l->l_flags_1 &= ~DF_1_NODELETE out of loop through all loaded
libraries and performs the action only on inconsistent one.
[BZ #18778]
* elf/Makefile (tests): Add Add tst-nodelete2.
(modules-names): Add tst-nodelete2mod.
(tst-nodelete2mod.so-no-z-defs): New.
($(objpfx)tst-nodelete2): Likewise.
($(objpfx)tst-nodelete2.out): Likewise.
(LDFLAGS-tst-nodelete2): Likewise.
* elf/dl-close.c (_dl_close_worker): Move DF_1_NODELETE clearing
out of loop through all loaded libraries.
* elf/tst-nodelete2.c: New file.
* elf/tst-nodelete2mod.c: Likewise.
ldbl-128ibm tanhl uses a too-small threshold to decide when to return
+/-1, resulting in large errors. This patch changes it to a more
appropriate threshold (the requirement is for 2*exp(-2|x|) to be small
in terms of ulps of 1).
Tested for x86_64, x86 and powerpc.
[BZ #18790]
* sysdeps/ieee754/ldbl-128ibm/s_tanhl.c (__tanhl): Increase
threshold for returning +/- 1.
* math/auto-libm-test-in: Add more tests of tanh.
* math/auto-libm-test-out: Regenerated.
* sysdeps/i386/fpu/libm-test-ulps: Update.
ldbl-128ibm sinhl uses a too-big threshold to decide when to return
the argument, resulting in large errors. This patch fixes it to use a
more appropriate threshold.
Tested for x86_64, x86 and powerpc.
[BZ #18789]
* sysdeps/ieee754/ldbl-128ibm/e_sinhl.c (__ieee754_sinhl): Use
smaller threshold for returning the argument.
* math/auto-libm-test-in: Add more tests of sinh.
* math/auto-libm-test-out: Regenerated.
* sysdeps/i386/fpu/libm-test-ulps: Update.
The attached change fixes the miscompilation of sched_setaffinity() on
hppa. This is an old problem that was fixed on other architectures using
a similar approach to the attached change. See:
https://sourceware.org/ml/libc-hacker/2004-04/msg00016.html
Build tested on trunk. Patch has been applied to debian glibc for some time.
As noted in the bug, the asm operands need to be copied to register
variables to avoid operand reloads in the principal asm of the macro.
See the arm implementation for reference. Otherwise we get:
../sysdeps/unix/sysv/linux/hppa/bits/atomic.h:68:6: error:
can't find a register in class 'R1_REGS' while reloading 'asm'
Build tested on trunk with gcc-4.8. Similar patch has been tested
with 2.19 on Debian hppa-unknown-linux-gnu.
Similar to various other bugs in this area, some tan implementations
do not raise the underflow exception for subnormal arguments, when the
result is tiny and inexact. This patch forces the exception in a
similar way to previous fixes.
Tested for x86_64, x86, mips64 and powerpc.
[BZ #16517]
* sysdeps/ieee754/dbl-64/s_tan.c: Include <float.h>.
(tan): Force underflow exception for arguments with small absolute
value.
* sysdeps/ieee754/flt-32/k_tanf.c: Include <float.h>.
(__kernel_tanf): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-128/k_tanl.c: Include <float.h>.
(__kernel_tanl): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-128ibm/k_tanl.c: Include <float.h>.
(__kernel_tanl): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-96/k_tanl.c: Include <float.h>.
(__kernel_tanl): Force underflow exception for arguments with
small absolute value.
* math/auto-libm-test-in: Add more tests of tan.
* math/auto-libm-test-out: Regenerated.
Similar to various other bugs in this area, some sinh implementations
do not raise the underflow exception for subnormal arguments, when the
result is tiny and inexact. This patch forces the exception in a
similar way to previous fixes.
Tested for x86_64, x86, mips64 and powerpc.
[BZ #16519]
* sysdeps/ieee754/dbl-64/e_sinh.c: Include <float.h>.
(__ieee754_sinh): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/flt-32/e_sinhf.c: Include <float.h>.
(__ieee754_sinhf): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-128/e_sinhl.c: Include <float.h>.
(__ieee754_sinhl): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-128ibm/e_sinhl.c: Include <float.h>.
(__ieee754_sinhl): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-96/e_sinhl.c: Include <float.h>.
(__ieee754_sinhl): Force underflow exception for arguments with
small absolute value.
* math/auto-libm-test-in: Add more tests of sinh.
* math/auto-libm-test-out: Regenerated.
* sysdeps/i386/fpu/libm-test-ulps: Update.
In the "Kill regexp.h" thread, Joseph dug up more accurate information
about exactly which editions of the Single Unix Standard included and
deprecated this header.
The flt-32 implementation of powf wrongly uses x-1 instead of |x|-1
when computing log (x) for the case where |x| is close to 1 and y is
large. This patch fixes the logic accordingly. Relevant tests
existed for x close to 1, and corresponding tests are added for x
close to -1, as well as for some new variant cases.
Tested for x86_64 and x86.
[BZ #18647]
* sysdeps/ieee754/flt-32/e_powf.c (__ieee754_powf): For large y
and |x| close to 1, use absolute value of x when computing log.
* math/auto-libm-test-in: Add more tests of pow.
* math/auto-libm-test-out: Regenerated.
as discussed in the thread starting at
https://sourceware.org/ml/libc-alpha/2015-06/msg00098.html
it looks like the best options is to remove locale timezone information
from locales which currently provide it (in incomplete or incorrect
fashion) rather than to start duplicating tzdata info in glibc.
This patch adds __nonnull annotations for wcscat, wcsncat, wcscmp and wcsncmp.
These added annotations match the annoations for strcat, strncat, strcmp, strncmp in glibc.
<regexp.h> (not to be confused with <regex.h>) is an obsolete and
frankly horrible regular expression-matching API. It was part of SVID
but was withdrawn in Issue 5 (for reference, we're on Issue 7 now).
It doesn't do anything you can't do with <regex.h>, and using it
involves defining a bunch of macros before including the header.
Moreover, the code in regexp.h that uses those macros has been buggy
since its creation (in 1996) and no one has noticed, which indicates
to me that there are no users. (Specifically, RETURN() is used in a
whole bunch of cases where it should have been ERROR().)
The header is given a warning and marked deprecated for 2.22.
See:
https://sourceware.org/ml/libc-alpha/2015-07/msg00862.html and
https://sourceware.org/ml/libc-alpha/2015-07/msg00871.html.
On x86, linker in binutils 2.26 and newer consolidates R_*_JUMP_SLOT with
R_*_GLOB_DAT relocation against the same symbol. This patch extends
local PLT reference check to support alternate relocations.
[BZ #18078]
* scripts/check-localplt.awk: Support alternate relocations.
* scripts/localplt.awk: Also check relocations in DT_RELA/DT_REL
sections.
* sysdeps/unix/sysv/linux/i386/localplt.data: Mark free and
malloc entries with + REL R_386_GLOB_DAT.
* sysdeps/x86_64/localplt.data: New file.
Changes in support of -fno-plt also cause the elf/tst-audit* tests to
start passing on MIPS. This patch duly marks the relevant bug as
fixed in ChangeLog and NEWS.
The recently introduced TLS variables in the thread-local destructor
implementation (__cxa_thread_atexit_impl) used the default GD access
model, resulting in a call to __tls_get_addr. This causes a deadlock
with recent changes to the way TLS is initialized because DTV
allocations are delayed and hence despite knowing the offset to the
variable inside its TLS block, the thread has to take the global rtld
lock to safely update the TLS offset.
This causes deadlocks when a thread is instantiated and joined inside
a destructor of a dlopen'd DSO. The correct long term fix is to
somehow not take the lock, but that will need a lot deeper change set
to alter the way in which the big rtld lock is used.
Instead, this patch just eliminates the call to __tls_get_addr for the
thread-local variables inside libc.so, libpthread.so and rtld by
building all of their units with -mtls-model=initial-exec.
There were concerns that the static storage for TLS is limited and
hence we should not be using it. Additionally, dynamically loaded
modules may result in libc.so looking for this static storage pretty
late in static binaries. Both concerns are valid when using TLSDESC
since that is where one may attempt to allocate a TLS block from
static storage for even those variables that are not IE. They're not
very strong arguments for the traditional TLS model though, since it
assumes that the static storage would be used sparingly and definitely
not by default. Hence, for now this would only theoretically affect
ARM architectures.
The impact is hence limited to statically linked binaries that dlopen
modules that in turn load libc.so, all that on arm hardware. It seems
like a small enough impact to justify fixing the larger problem that
currently affects everything everywhere.
This still does not solve the original problem completely. That is,
it is still possible to deadlock on the big rtld lock with a small
tweak to the test case attached to this patch. That problem is
however not a regression in 2.22 and hence could be tackled as a
separate project. The test case is picked up as is from Alex's patch.
This change has been tested to verify that it does not cause any
issues on x86_64.
ChangeLog:
[BZ #18457]
* nptl/Makefile (tests): New test case tst-join7.
(modules-names): New test case module tst-join7mod.
* nptl/tst-join7.c: New file.
* nptl/tst-join7mod.c: New file.
* Makeconfig (tls-model): Pass -ftls-model=initial-exec for
all translation units in libc.so, libpthread.so and rtld.
When an TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.
This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.
I have also added a detailed note on concurrency which also aims to
justify why I chose the semantics I chose for accesses to
l_tls_dtor_count. Thanks to Torvald for his help in getting me
started on this and (literally) teaching my how to approach the
problem.
Change verified on x86_64; the test suite does not show any
regressions due to the patch.
ChangeLog:
[BZ #18657]
* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
* include/link.h (struct link_map): Add concurrency note for
L_TLS_DTOR_COUNT.
* stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit-nodelete.c: New test case.
* stdlib/Makefile (tests): Use it.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose. Add conditionals
to allow tst-tls-atexit-nodelete test case to use it.
Commit a059d359d8 changed the sigaction
struct to pass conform tests, but it ended up also changing the ABI for
32 bit builds. For 64 bit builds, changing the long to two ints works,
but for 32 bit builds, it inserts 4 extra bytes. This leads to many
packages randomly failing like bash that spews things like:
configure: line 471: wait_for: No record of process 0
Bracket the new member by a wordsize check to fix the ABI for 32bit.
X86 struct siginfo in kernel 3.19 has been changed by
commit ee1b58d36aa1b5a79eaba11f5c3633c88231da83
Author: Qiaowei Ren <qiaowei.ren@intel.com>
Date: Fri Nov 14 07:18:19 2014 -0800
mpx: Extend siginfo structure to include bound violation information
This patch adds new fields about bound violation into siginfo
structure. si_lower and si_upper are respectively lower bound
and upper bound when bound violation is caused.
This patch updates x86 struct siginfo to enable GDB with MPX support.
[BZ #18696]
* sysdeps/unix/sysv/linux/x86/bits/siginfo.h (_sigfault): Add
si_addr_bnd.
(si_lower): New.
(si_upper): Likewise.
The DF_1_NODELETE flag is set too late when opening a DSO, due to
which, if a DSO is already open, subsequently opening it with
RTLD_NODELETE fails to set the DF_1_NODELETE flag. This patch fixes
this by setting the flag immediately after bumping the opencount.
Verified on x86_64.
[BZ #18676]
* elf/tst-nodelete-opened.c: New test case.
* elf/tst-nodelete-opened-lib.c: New test case module.
* elf/Makefile (tests, modules-names): Use them.
* elf/dl-open.c (dl_open_worker): Set DF_1_NODELETE flag
early.