I noticed that there was no meaningful test coverage for
fegetexceptflag and fesetexceptflag (one test ensures that calls to
them compile and link, but nothing to verify they work correctly).
This patch adds tests for these functions.
fesetexceptflag is meant to set the relevant exception flag bits to
the saved state without causing enabled traps to be taken. On some
architectures, it is not possible to set exception flag bits without
causing enabled traps to occur. Such architectures need to define
EXCEPTION_SET_FORCES_TRAP to 1 in their math-tests.h, as is done in
this patch for powerpc. x86 avoids needing to define this because the
traps resulting from setting exception bits don't occur until the next
floating-point operation or fwait instruction.
Tested for x86_64, x86 and powerpc. Note that test-fexcept fails for
powerpc because of a pre-existing bug in fesetexceptflag for powerpc,
which I'll fix separately.
* math/test-fexcept-traps.c: New file.
* math/test-fexcept.c: Likewise.
* math/Makefile (tests): Add test-fexcept and test-fexcept-traps.
* sysdeps/generic/math-tests.h (EXCEPTION_SET_FORCES_TRAP): New
macro.
* sysdeps/powerpc/math-tests.h [!__NO_FPRS__]
(EXCEPTION_SET_FORCES_TRAP): Likewise.
As pointer out on the mailing list, the inline assembly code in
sysdeps/powerpc/ifunc-sel.h doesn't have a list of clobbered registers
and used wrong constraints.
This patch fixes that. I verified it doesn't introduce any change in the
generated code.
Changelog:
* sysdeps/powerpc/ifunc-sel.h (ifunc_sel): Add "11", "12", "cr0" to the
clobber list. Use "i" constraint instead of "X".
(ifunc_one): Add "12" to the clobber list. Use "i" constraint instead
of "X".
On 32-bit PowerPC GCC 6 always saves the PIC register on the stack in
the prologue and adjust the stack in the epilogue. It is therefore not
possible anymore to just exit the function in the inline asm code,
otherwise it corrupts the stack pointer. This causes the following tests
to fail when using GCC 6:
FAIL: elf/ifuncmain1
FAIL: elf/ifuncmain1pic
FAIL: elf/ifuncmain1picstatic
FAIL: elf/ifuncmain1pie
FAIL: elf/ifuncmain1staticpic
FAIL: elf/ifuncmain1staticpie
FAIL: elf/ifuncmain1vis
FAIL: elf/ifuncmain1vispic
FAIL: elf/ifuncmain1vispie
FAIL: elf/ifuncmain2pic
FAIL: elf/ifuncmain2picstatic
FAIL: elf/ifuncmain3
FAIL: elf/ifuncmain4picstatic
FAIL: elf/ifuncmain5
FAIL: elf/ifuncmain5picstatic
FAIL: elf/ifuncmain5staticpic
The solution is to replace the beqlr instructions by a beq to the end
of the inline asm code. This fixes all the above failures.
ChangeLog:
* sysdeps/powerpc/ifunc-sel.h (ifunc_sel): Replace beqlr instructions
by beq instructions jumping to the end of the function.
Commit a6a4395d fixed modf implementation by compiling s_modf.c and
s_modff.c with -fsignaling-nans. However these files are also included
from the pre-POWER5+ implementation, and thus these files should also
be compiled with -fsignaling-nans.
Changelog:
[BZ #20240]
* sysdeps/powerpc/powerpc32/power4/fpu/multiarch/Makefile
(CFLAGS-s_modf-ppc32.c): New variable.
(CFLAGS-s_modff-ppc32.c): Likewise.
* sysdeps/powerpc/powerpc64/fpu/multiarch/Makefile
(CFLAGS-s_modf-ppc64.c): Likewise.
(CFLAGS-s_modff-ppc64.c): Likewise.
If the input values are unaligned and if there are null characters in the
memory before the starting address of the input values, strcasecmp
gives incorrect return code. Fixed it by adding mask the bits that
are not part of the string.
This implementation is based on the one already used at
sysdeps/x86_64/fpu/e_expf.S.
This implementation improves the performance by ~14% on average in synthetic
benchmarks at the cost of decreasing accuracy to 1 ULP.
atomic_compare_and_exchange_bool_rel and
catomic_compare_and_exchange_bool_rel are removed and replaced with the
new C11-like atomic_compare_exchange_weak_release. The concurrent code
in nscd/cache.c has not been reviewed yet, so this patch does not add
detailed comments.
* nscd/cache.c (cache_add): Use new C11-like atomic operation instead
of atomic_compare_and_exchange_bool_rel.
* nptl/pthread_mutex_unlock.c (__pthread_mutex_unlock_full): Likewise.
* include/atomic.h (atomic_compare_and_exchange_bool_rel,
catomic_compare_and_exchange_bool_rel): Remove.
* sysdeps/aarch64/atomic-machine.h
(atomic_compare_and_exchange_bool_rel): Likewise.
* sysdeps/alpha/atomic-machine.h
(atomic_compare_and_exchange_bool_rel): Likewise.
* sysdeps/arm/atomic-machine.h
(atomic_compare_and_exchange_bool_rel): Likewise.
* sysdeps/mips/atomic-machine.h
(atomic_compare_and_exchange_bool_rel): Likewise.
* sysdeps/tile/atomic-machine.h
(atomic_compare_and_exchange_bool_rel): Likewise.
Some architectures have their own versions of fdim functions, which
are missing errno setting (bug 6796) and may also return sNaN instead
of qNaN for sNaN input, in the case of the x86 / x86_64 long double
versions (bug 20256).
These versions are not actually doing anything that a compiler
couldn't generate, just straightforward comparisons / arithmetic (and,
in the x86 / x86_64 case, testing for NaNs with fxam, which isn't
actually needed once you use an unordered comparison and let the NaNs
pass through the same subtraction as non-NaN inputs). This patch
removes the x86 / x86_64 / powerpc versions, so that those
architectures use the generic C versions, which correctly handle
setting errno and deal properly with sNaN inputs. This seems better
than dealing with setting errno in lots of .S versions.
The i386 versions also return results with excess range and precision,
which is not appropriate for a function exactly defined by reference
to IEEE operations. For errno setting to work correctly on overflow,
it's necessary to remove excess range with math_narrow_eval, which
this patch duly does in the float and double versions so that the
tests can reliably pass on x86. For float, this avoids any double
rounding issues as the long double precision is more than twice that
of float. For double, double rounding issues will need to be
addressed separately, so this patch does not fully fix bug 20255.
Tested for x86_64, x86 and powerpc.
[BZ #6796]
[BZ #20255]
[BZ #20256]
* math/s_fdim.c: Include <math_private.h>.
(__fdim): Use math_narrow_eval on result.
* math/s_fdimf.c: Include <math_private.h>.
(__fdimf): Use math_narrow_eval on result.
* sysdeps/i386/fpu/s_fdim.S: Remove file.
* sysdeps/i386/fpu/s_fdimf.S: Likewise.
* sysdeps/i386/fpu/s_fdiml.S: Likewise.
* sysdeps/i386/i686/fpu/s_fdim.S: Likewise.
* sysdeps/i386/i686/fpu/s_fdimf.S: Likewise.
* sysdeps/i386/i686/fpu/s_fdiml.S: Likewise.
* sysdeps/powerpc/fpu/s_fdim.c: Likewise.
* sysdeps/powerpc/fpu/s_fdimf.c: Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_fdim.c: Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_fdim.c: Likewise.
* sysdeps/x86_64/fpu/s_fdiml.S: Likewise.
* math/libm-test.inc (fdim_test_data): Expect errno setting on
overflow. Add sNaN tests.
This implementation utilizes vectors to improve performance
compared to current byte by byte implementation for POWER7.
The performance improvement is upto 4x. This patch is tested
on powerpc64 and powerpc64le.
The powerpc64 versions of ceil, floor, round, trunc, rint, nearbyint
and their float versions return sNaN for sNaN input when they should
return qNaN. This patch fixes them to add a NaN argument to itself to
quiet sNaNs before returning.
Tested for powerpc64.
[BZ #20160]
* sysdeps/powerpc/powerpc64/fpu/s_ceil.S (__ceil): Add NaN
argument to itself before returning the result.
* sysdeps/powerpc/powerpc64/fpu/s_ceilf.S (__ceilf): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_floor.S (__floor): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_floorf.S (__floorf): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_nearbyint.S (__nearbyint):
Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_nearbyintf.S (__nearbyintf):
Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_rint.S (__rint): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_rintf.S (__rintf): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_round.S (__round): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_roundf.S (__roundf): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_trunc.S (__trunc): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_truncf.S (__truncf): Likewise.
The powerpc32 versions of ceil, floor, round, trunc, rint, nearbyint
and their float versions return sNaN for sNaN input when they should
return qNaN. This patch fixes them to add a NaN argument to itself to
quiet sNaNs before returning. The powerpc64 versions, which have the
same bug, will be addressed separately.
Tested for powerpc32.
[BZ #20160]
* sysdeps/powerpc/powerpc32/fpu/s_ceil.S (__ceil): Add NaN
argument to itself before returning the result.
* sysdeps/powerpc/powerpc32/fpu/s_ceilf.S (__ceilf): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_floor.S (__floor): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_floorf.S (__floorf): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_nearbyint.S (__nearbyint):
Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_nearbyintf.S (__nearbyintf):
Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_rint.S (__rint): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_rintf.S (__rintf): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_round.S (__round): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_roundf.S (__roundf): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_trunc.S (__trunc): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_truncf.S (__truncf): Likewise.
The powerpc implementations of fabsl for ldbl-128ibm (both powerpc32
and powerpc64) wrongly raise the "invalid" exception for sNaN
arguments. fabs functions should be quiet for all inputs including
signaling NaNs. The problem is the use of a comparison instruction
fcmpu to determine if the high part of the argument is negative and so
the low part needs to be negated; such instructions raise "invalid"
for sNaNs.
There is a pure integer implementation of fabsl in
sysdeps/ieee754/ldbl-128ibm/s_fabsl.c. However, it's not necessary to
use it to avoid such exceptions. The fsel instruction does not raise
exceptions for sNaNs, and can be used in place of the original
comparison. (Note that if the high part is zero or a NaN, it does not
matter whether the low part is negated; the choice of whether the low
part of a zero is +0 or -0 does not affect the value, and the low part
of a NaN does not affect the value / payload either.)
The condition in GCC for fsel to be available is TARGET_PPC_GFXOPT,
corresponding to the _ARCH_PPCGR predefined macro. fsel is available
on all 64-bit processors supported by GCC. A few 32-bit processors
supported by GCC do not have TARGET_PPC_GFXOPT despite having hard
float support. To support those processors, integer code (similar to
that in copysignl) is included for the !_ARCH_PPCGR case for
powerpc32.
Tested for powerpc32 (configurations with and without _ARCH_PPCGR) and
powerpc64.
[BZ #20157]
* sysdeps/powerpc/powerpc32/fpu/s_fabsl.S (__fabsl): Use fsel to
determine whether to negate low half if [_ARCH_PPCGR], and integer
comparison otherwise.
* sysdeps/powerpc/powerpc64/fpu/s_fabsl.S (__fabsl): Use fsel to
determine whether to negate low half.
Continuing fixes for ceil, floor and trunc functions not to raise the
"inexact" exception, this patch fixes the versions used on older
powerpc64 processors. As was done with the round implementations some
time ago, the save of floating-point state is moved after the first
floating-point operation on the input to ensure that any "invalid"
exception from signaling NaN input is included in the saved state, and
then the whole state gets restored rather than just the rounding mode.
This has no effect on configurations using the power5+ code, since
such processors can do these operations with a single instruction (and
those instructions do not set "inexact", so are correct for TS 18661-1
semantics).
Tested for powerpc64.
[BZ #15479]
* sysdeps/powerpc/powerpc64/fpu/s_ceil.S (__ceil): Move save of
floating-point state after first floating-point operation on
input. Restore full floating-point state instead of just rounding
mode.
* sysdeps/powerpc/powerpc64/fpu/s_ceilf.S (__ceilf): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_floor.S (__floor): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_floorf.S (__floorf): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_trunc.S (__trunc): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_truncf.S (__truncf): Likewise.
Continuing fixes for ceil, floor and trunc functions not to raise the
"inexact" exception, this patch fixes the versions used on older
powerpc32 processors. As was done with the round implementations some
time ago, the save of floating-point state is moved after the first
floating-point operation on the input to ensure that any "invalid"
exception from signaling NaN input is included in the saved state, and
then the whole state gets restored rather than just the rounding mode.
This has no effect on configurations using the power5+ code, since
such processors can do these operations with a single instruction (and
those instructions do not set "inexact", so are correct for TS 18661-1
semantics).
Tested for powerpc32.
[BZ #15479]
* sysdeps/powerpc/powerpc32/fpu/s_ceil.S (__ceil): Move save of
floating-point state after first floating-point operation on
input. Restore full floating-point state instead of just rounding
mode.
* sysdeps/powerpc/powerpc32/fpu/s_ceilf.S (__ceilf): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_floor.S (__floor): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_floorf.S (__floorf): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_trunc.S (__trunc): Likewise.
* sysdeps/powerpc/powerpc32/fpu/s_truncf.S (__truncf): Likewise.
The file sysdeps/powerpc/sysdeps.h defines aliases for condition register
operands. E.g.: 'cr7' means condition register 7. On the one hand, this
increases readability, as it makes it easier for readers to know whether the
operand is a condition register, a general purpose register or an immediate.
On the other hand, this permits that condition registers be written as if they
were general purpose, and vice-versa, thus reducing the readability of the
code.
This commit removes some of these unintentional misuses.
The changes have no effect on the final code. Checked with objdump.
Call __memset_power8 to pad, with zeros, the remaining bytes in the
dest string on __strncpy_power8 and __stpncpy_power8. This improves
performance when n is larger than the input string, giving ~30% gain for
larger strings without impacting much shorter strings.
This patch optimizes strcasestr function for power >= 8 systems. The average
improvement of this optimization is ~40% and compares 16 bytes at a time
using vector instructions. This patch is tested on powerpc64 and powerpc64le.
This utilizes vectors and bitmasks. For small needle, large
haystack, the performance improvement is upto 8x. For short
strings (0-4B), the cost of computing the bitmask dominates,
and is a tad slower.
This patch removes the powerpc64 optimized strspn, strcspn, and
strpbrk assembly implementation now that the default C one
implements the same strategy. On internal glibc benchtests
current implementations shows similar performance with -O2.
Tested on powerpc64le (POWER8).
* sysdeps/powerpc/powerpc64/strcspn.S: Remove file.
* sysdeps/powerpc/powerpc64/strpbrk.S: Remove file.
* sysdeps/powerpc/powerpc64/strspn.S: Remove file.
The ldbl-128ibm implementation of nearbyintl uses logic that only
works in round-to-nearest mode. This contrasts with rintl, which
works in all rounding modes.
Now, arguably nearbyintl could simply be aliased to rintl, given that
spurious "inexact" is generally allowed for ldbl-128ibm, even for the
underlying arithmetic operations. But given that the only point of
nearbyintl is to avoid "inexact", this patch follows the more
conservative approach of adding conditionals to the rintl
implementation to make it suitable for use to implement nearbyintl,
then builds it for nearbyintl with USE_AS_NEARBYINTL defined. The
test test-nearbyint-except-2 shows up issues when traps on "inexact"
are enabled, which turn out to be problems with the powerpc
fenv_private.h implementation (two functions that should disable
exception traps potentially failing to do so in some cases); this
patch duly fixes that as well (I don't see any other existing cases
where this would be user-visible; there isn't much use of *_NOEX,
*hold* etc. in libm that requires exceptions to be discarded and not
trapped on).
Tested for powerpc.
[BZ #19790]
* sysdeps/ieee754/ldbl-128ibm/s_rintl.c [USE_AS_NEARBYINTL]
(rintl): Define as macro.
[USE_AS_NEARBYINTL] (__rintl): Likewise.
(__rintl) [USE_AS_NEARBYINTL]: Use SET_RESTORE_ROUND_NOEX instead
of fesetround. Ensure results are evaluated before end of scope.
* sysdeps/ieee754/ldbl-128ibm/s_nearbyintl.c: Define
USE_AS_NEARBYINTL and include s_rintl.c.
* sysdeps/powerpc/fpu/fenv_private.h (libc_feholdsetround_ppc):
Disable exception traps in new environment.
(libc_feholdsetround_ppc_ctx): Likewise.
The operand modifier %s on powerpc is an undocumented internal implementation
detail of GCC. Besides that, the GCC community wants to remove it. This patch
rewrites the expressions that use this modifier with logically equivalent
expressions that don't require it.
Explanation for the substitution:
The %s modifier takes an immediate operand and prints 32 less such immediate.
Thus, in the previous code, the expression resulted in:
32 - __builtin_ffs(e)
where e was guaranteed to have exactly a single bit set, by the following
expressions:
(e & (e-1) == 0) : e has at most one bit set.
(e != 0) : e is not zero, thus it has at least one bit set.
Since we guarantee that there is exactly only one bit set, the following
statement is true:
32 - __builtin_ffs(e) == __builtin_clz(e)
Thus, we can replace __builtin_ffs with __builtin_clz and remove the %s operand
modifier.
HWCAP-related code should had been updated when the 32 bits of HWCAP were
used. This patch updates the code in dl-procinfo.h to loop through all
the 32 bits in HWCAP and updates _dl_powerpc_cap_flags accordingly.
On running tests after from-scratch ulps regeneration, I found that
some libm tests failed with ulps in excess of those recorded in the
from-scratch regeneration, which should never happen unless those ulps
exceed the limit on ulps that can go in libm-test-ulps files.
Failure: Test: atan2_upward (inf, -inf)
Result:
is: 2.35619498e+00 0x1.2d97ccp+1
should be: 2.35619450e+00 0x1.2d97c8p+1
difference: 4.76837159e-07 0x1.000000p-21
ulp : 2.0000
max.ulp : 1.0000
Maximal error of `atan2_upward'
is : 2 ulp
accepted: 1 ulp
Failure: Test: carg_upward (-inf + inf i)
Result:
is: 2.35619498e+00 0x1.2d97ccp+1
should be: 2.35619450e+00 0x1.2d97c8p+1
difference: 4.76837159e-07 0x1.000000p-21
ulp : 2.0000
max.ulp : 1.0000
Maximal error of `carg_upward'
is : 2 ulp
accepted: 1 ulp
The problem comes from the addition of tests for the finite-math-only
versions of libm functions. Those tests share ulps with the default
function variants. make regen-ulps runs the default tests before the
finite-math-only tests, concatenating the resulting ulps before
feeding them to gen-libm-test.pl to generate a new libm-test-ulps
file. But gen-libm-test.pl always takes the last ulps value given for
any (function, type) pair. So, if the largest ulps for a function
come from non-finite inputs, a from-scratch regeneration loses those
ulps.
This patch fixes gen-libm-test.pl, in the case where there are
multiple ulps values for a (function, type) pair - which can only
happen as part of a regeneration - to take the largest ulps value
rather than the last one.
Tested for ARM / MIPS / powerpc-nofpu.
* math/gen-libm-test.pl (parse_ulps): Do not reduce
already-recorded ulps.
* sysdeps/arm/libm-test-ulps: Regenerated.
* sysdeps/mips/mips32/libm-test-ulps: Likewise.
* sysdeps/mips/mips64/libm-test-ulps: Likewise.
* sysdeps/powerpc/nofpu/libm-test-ulps: Likewise.
Work around a GCC behavior with hardware transactional memory built-ins.
GCC doesn't treat the PowerPC transactional built-ins as compiler
barriers, moving instructions past the transaction boundaries and
altering their atomicity.
Commit 67385a01d2 added a new feature for
powerpc, where we store HWCAP/Platform bits in the TCB. In the dynamic
linking case, we use the versioned symbol
'__parse_hwcap_and_convert_at_platform' to verify if this feature is
available. However, the same symbol was not exported to libc.a, making
it not possible for GCC to check for it prior to link time.
The optimized POWER7 logb implementation does not use the absolute
value of the word extracted from the input to apply the leading 0-bits
builtin (to ignore the float sign). This patch fixes it by
clearing the signal bit in the resulting word.
It fixes the subnormal tests failures when running on POWER7 ou
newer chip.
Tested on powerpc64le (POWER8).
[BZ# 19375]
* sysdeps/powerpc/power7/fpu/s_logb.c (__logb): Fix return for
negative subnormals.
This patch adds a new feature for powerpc. In order to get faster access to
the HWCAP/HWCAP2 bits and platform number (i.e. for implementing
__builtin_cpu_is () / __builtin_cpu_supports () in GCC) without the overhead of
reading from the auxiliary vector, we now reserve space for them in the TCB.
This is an ABI change for GLIBC 2.23.
A new versioned symbol '__parse_hwcap_and_convert_at_platform' is available to
get the data from the auxiliary vector and parse it, and store it for later use
in the TLS initialization code. This function is called very early
(in _dl_sysdep_start () via DL_PLATFORM_INFO for the dynamic linking case, and
in __libc_start_main () for the static linking case) to make sure the data is
available at the time of TLS initialization.
* sysdeps/powerpc/Makefile (sysdep-dl-routines): Add hwcapinfo.
(sysdep_routines): Likewise.
(sysdep-rtld-routines): Likewise.
[$(subdir) = nptl](tests): Add test-get_hwcap and test-get_hwcap-static
[$(subdir) = nptl](tests-static): test-get_hwcap-static
* sysdeps/powerpc/Versions: Added new
__parse_hwcap_and_convert_at_platform symbol to GLIBC-2.23.
* sysdeps/powerpc/hwcapinfo.c: New file.
(__tcb_parse_hwcap_and_convert_at_platform): New function to initialize
and parse hwcap, hwcap2 and platform number information.
* sysdeps/powerpc/hwcapinfo.h: New file. Creates global variables
to store HWCAP+HWCAP2 and platform number.
* sysdeps/powerpc/nptl/tcb-offsets.sym: Added new offsets
for HWCAP+HWCAP2 and platform number in the TCB.
* sysdeps/powerpc/nptl/tls.h: New functionality. Stores
the HWCAP, HWCAP2 and platform number in the TCB.
(dtv): Added new fields for HWCAP+HWCAP2 and platform number.
(TLS_INIT_TP): Included calls to add the hwcap and
at_platform values in the TCB in TP initialization.
(TLS_DEFINE_INIT_TP): Likewise.
(THREAD_GET_HWCAP): New macro.
(THREAD_SET_HWCAP): Likewise.
(THREAD_GET_AT_PLATFORM): Likewise.
(THREAD_SET_AT_PLATFORM): Likewise.
* sysdeps/powerpc/powerpc32/dl-machine.h:
(dl_platform_init): New function that calls
__parse_hwcap_and_convert_at_platform for the dymanic linking case for
powerpc32.
* sysdeps/powerpc/powerpc64/dl-machine.h: Likewise, for powerpc64.
* sysdeps/powerpc/test-get_hwcap-static.c: New file. Testcase for
this functionality, static linking case.
* sysdeps/powerpc/test-get_hwcap.c: New file. Likewise, dynamic
linking case.
* sysdeps/unix/sysv/linux/powerpc/libc-start.c: Added call to
__parse_hwcap_and_convert_at_platform for the static linking case.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist:
Included the new __parse_hwcap_and_convert_at_platform symbol in the
ABI list for GLIBC 2.23.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/ld-le.abilist:
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/ld.abilist:
Likewise.
This patch optimizes powerpc spinlock implementation by:
* Use the correct EH hint bit on the larx for supported ISA. For lock
acquisition, the thread that acquired the lock with a successful stcx
does not want to give away the write ownership on the cacheline. The
idea is to make the load reservation "sticky" about retaining write
authority to the line. That way, the store that must inevitably come
to release the lock can succeed quickly and not contend with other
threads issuing lwarx. If another thread does a store to the line
(false sharing), the winning thread must give up write authority to
the proper value of EH for the larx for a lock acquisition is 1.
* Increase contented lock performance by up to 40%, and no measurable
impact on uncontended locks on P8.
Thanks to Adhemerval Zanella who did most of the work. I've run some
tests, and addressed some minor feedback.
* sysdeps/powerpc/nptl/pthread_spin_lock.c (pthread_spin_lock):
Add lwarx hint, and use macro for acquire instruction.
* sysdeps/powerpc/nptl/pthread_spin_trylock.c (pthread_spin_trylock):
Likewise.
* sysdep/unix/sysv/linux/powerpc/pthread_spin_unlock.c: Move to ...
* sysdeps/powerpc/nptl/pthread_spin_unlock.c: ... here, and
update to use new atomic macros.
The powerpc hard-float round and roundf functions, both 32-bit and
64-bit, raise spurious "inexact" exceptions for integer arguments from
adding 0.5 and rounding to integer toward zero.
Since these functions already save and restore the rounding mode, it's
natural to make them restore the full floating-point state instead to
fix this bug, which this patch does. The save of the state is moved
after the first floating-point operation on the input so that any
"invalid" exceptions from signaling NaN inputs are properly
preserved. As a consequence of this approach to the fix, "inexact"
for noninteger arguments (disallowed by TS 18661-1 but not by C99/C11,
see bug 15479) is also avoided for these implementations; this is
*not* a general fix for bug 15479 since plenty of other
implementations of various functions still raise spurious "inexact"
for noninteger arguments.
This issue and fix do not apply to builds using power5+ versions of
round and roundf, which use the frin instruction and avoid "inexact"
exceptions that way.
This patch should get hard-float powerpc32 and powerpc64 (default
function implementations) back to a state where test-float and
test-double will pass after ulps regeneration.
Tested for powerpc32 and powerpc64.
[BZ #15479]
[BZ #19238]
* sysdeps/powerpc/powerpc32/fpu/s_round.S (__round): Save
floating-point state after first operation on input. Restore full
state rather than just rounding mode.
* sysdeps/powerpc/powerpc32/fpu/s_roundf.S (__roundf): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_round.S (__round): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_roundf.S (__roundf): Likewise.
Similar to bug 19134 for powerpc32, the powerpc64 implementations of
lround, lroundf, llround, llroundf can raise spurious "inexact"
exceptions for integer arguments from adding 0.5 then converting to
integer (this does not apply to the power5+ version for double, which
uses the frin instruction which is defined never to raise "inexact"; I
don't know why power5+ doesn't use that version for float as well).
This patch fixes the bug in a similar way to the powerpc32 bug, by
testing for integers (adding and subtracting 2^52 and comparing with
the value before that addition and subtraction) and not adding 0.5 in
that case.
The powerpc maintainers may wish to look at making power5+ / power6x /
power8 use frin for float lround / llround as well as for double,
unless there's some reason I've missed that this isn't beneficial.
Tested for powerpc64.
[BZ #19235]
* sysdeps/powerpc/powerpc64/fpu/s_llround.S (__llround): Do not
add 0.5 to integer arguments.
* sysdeps/powerpc/powerpc64/fpu/s_llroundf.S (__llroundf):
Likewise.
(.LC2): New object.
Similar to bug 15491 recently fixed for x86_64 / x86, the powerpc
(both powerpc32 and powerpc64) hard-float implementations of
nearbyintf and nearbyint wrongly clear an "inexact" exception that was
raised before the function was called; this shows up as failure of the
test math/test-nearbyint-except added when that bug was fixed. They
also wrongly leave traps on "inexact" disabled if they were enabled
before the function was called.
This patch fixes the bugs similar to how the x86 bug was fixed: saving
and restoring the whole floating-point state, both to restore the
original "inexact" flag state and to restore the original state of
whether traps on "inexact" were enabled. Because there's a convenient
point in the powerpc implementations to save state after any sNaN
arguments will have raised "invalid" but before "inexact" traps need
to be disabled, no special handling for "invalid" is needed as in the
x86 version.
Tested for powerpc64 and powerpc32, where it fixes the
math/test-nearbyint-except failure as well as fixing the new test
math/test-nearbyint-except-2 added by this patch. Also tested for
x86_64 and x86 that the new test passes.
If powerpc experts see a more efficient way of doing this
(e.g. instruction positioning that's better for pipelines on typical
processors) then of course followups optimizing the fix are welcome.
[BZ #19228]
* sysdeps/powerpc/powerpc32/fpu/s_nearbyint.S (__nearbyint): Save
and restore full floating-point state.
* sysdeps/powerpc/powerpc32/fpu/s_nearbyintf.S (__nearbyintf):
Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_nearbyint.S (__nearbyint):
Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_nearbyintf.S (__nearbyintf):
Likewise.
* math/test-nearbyint-except-2.c: New file.
* math/Makefile (tests): Add test-nearbyint-except-2.
Since '--no-tls-optimize' is available for Power in ld, we need to provide
__tls_get_addr () in static libc in order to avoid undefined references to this
symbol when that flag is used.
* sysdeps/powerpc/libc-tls.c: New file. Provides __tls_get_addr () in
static libc.
The skip_lock_out_of_tbegin_retries adaptive parameter was
not being used correctly, nor as described. This prevents
a fallback for all users of the lock if a transient abort
occurs within the accepted number of retries.
[BZ #19174]
* sysdeps/powerpc/nptl/elide.h (__elide_lock): Fix usage of
.skip_lock_out_of_tbegin_retries.
* sysdeps/unix/sysv/linux/powerpc/elision-lock.c
(__lll_lock_elision): Likewise, and respect a value of
try_tbegin <= 0.
The previous code used to evaluate the preprocessor token is_lock_free to
a variable before starting a transaction. This behavior can cause an
error if another thread got the lock (without using a transaction)
between the evaluation of the token and the beginning of the transaction.
This bug can be triggered with the following order of events:
1. The lock accessed by is_lock_free is free.
2. Thread T1 evaluates is_lock_free and stores into register R1 that the
lock is free.
3. Thread T2 acquires the same lock used in is_lock_free.
4. T1 begins the transaction, creating a memory barrier where is_lock_free
is false, but R1 is true.
5. T1 reads R1 and doesn't abort the transaction.
6. T1 calls ELIDE_UNLOCK, which reads false from is_lock_free and decides
to unlock a lock acquired by T2, leading to undefined behavior.
This patch delays the evaluation of is_lock_free to inside a transaction
by moving this part of the code to the macro ELIDE_LOCK.
[BZ #18743]
* sysdeps/powerpc/nptl/elide.h (__elide_lock): Move most of this
code to...
(ELIDE_LOCK): ...here.
(__get_new_count): New function with part of the code from
__elide_lock that updates the value of adapt_count after a
transaction abort.
(__elided_trylock): Moved this code to...
(ELIDE_TRYLOCK): ...here.
The powerpc32 implementation of lround and lroundf can produce
spurious exceptions from adding 0.5 then converting to integer. This
includes "inexact" from the conversion to integer (not allowed for
integer arguments to these functions), and, for larger integer
arguments, "inexact", and "overflow" when rounding upward, from the
addition. In addition, "inexact" is not allowed together with
"invalid" and so inexact addition must be avoided when the integer
will be out of range of 32-bit long, whether or not the argument is an
integer.
This patch fixes these problems. As in the powerpc64 llround
implementation, a check is added for too-large arguments; in the
powerpc64 case that means arguments at least 2^52 in magnitude (so
that 0.5 cannot be added exactly), while in this case it means
arguments for which the result would overflow "long". In those cases
a suitable overflowing value is used for the integer conversion
without adding 0.5, while for smaller arguments it's tested whether
the argument is an integer (by adding and subtracting 2^52 to the
absolute value and comparing with the original absolute value) to
avoid adding 0.5 to integers and generating spurious "inexact".
This code is not used when the power5+ sysdeps directories are used,
as there's a separate power5+ version of these functions..
Tested for powerpc. This gets test-float (for a default powerpc32
hard-float build without any --with-cpu) back to the point where it
should pass once powerpc ulps are regenerated; test-double still needs
another problem with exceptions fixed to get back to that point (and I
haven't looked lately at what default powerpc64 results are like).
[BZ #19134]
* sysdeps/powerpc/powerpc32/fpu/s_lround.S (.LC1): New object.
(.LC2): Likewise.
(.LC3): Likewise.
(__lround): Do not add 0.5 to integer or out-of-range arguments.
The powerpc32 implementations of llroundf and llround produce spurious
and missing exceptions (some arising from such exceptions from
conversions to long long, some present even when fctidz is used).
This patch fixes those problems in a similar way to the llrint /
llrintf fixes. The spurious exceptions in the fctidz case for large
arguments arise from a converted value that saturated as LLONG_MAX
being converted back to float or double (the conversion back being
inexact, but "inexact" must not be raised together with "invalid"),
and from the subtraction x - xrf also being inexact for sufficiently
large arguments (whether the saturation was to LLONG_MAX or
LLONG_MIN); those are fixed by returning early if the argument is
large enough that no rounding is needed.
This code is not used for --with-cpu=power4 builds (I suspect the code
used in that case may also produce spurious "inexact" exceptions, but
that's something to investigate later).
Tested for powerpc.
[BZ #19125]
* sysdeps/powerpc/powerpc32/fpu/s_llround.c: Include <limits.h>,
<math_private.h> and <stdint.h>.
(__llround): Avoid conversions to and from long long int, and
subtractions, where those might raise spurious exceptions.
* sysdeps/powerpc/powerpc32/fpu/s_llroundf.c: Include
<math_private.h> and <stdint.h>.
(__llroundf): Avoid conversions to and from long long int, and
subtractions, where those might raise spurious exceptions.
sysdeps/powerpc/fpu/ has versions of llround and llroundf that are
actually used only for powerpc32 because
sysdeps/powerpc/powerpc64/fpu/ has its own versions of those
functions. This patch moves them into sysdeps/powerpc/powerpc32/fpu
to reflect where they are actually used (in preparation for fixing
other problems with those functions).
Tested for powerpc that installed stripped shared libraries are
unchanged by this patch.
* sysdeps/powerpc/fpu/s_llround.c: Move to ....
* sysdeps/powerpc/powerpc32/fpu/s_llround.c: ...here.
* sysdeps/powerpc/fpu/s_llroundf.c: Move to ....
* sysdeps/powerpc/powerpc32/fpu/s_llroundf.c: ...here.
The versions of llrint and llrintf for older powerpc32 processors
convert the results of __rint / __rintf to long long int, resulting in
spurious exceptions from such casts in certain cases. This patch
makes glibc work around the problems with the libgcc conversions when
the compiler used to build glibc doesn't use the fctidz instruction
for them.
Tested for powerpc.
[BZ #16422]
* sysdeps/powerpc/powerpc32/fpu/configure.ac (libc_cv_ppc_fctidz):
New configure test.
* sysdeps/powerpc/powerpc32/fpu/configure: Regenerated.
* config.h.in [_LIBC] (HAVE_PPC_FCTIDZ): New macro.
* sysdeps/powerpc/powerpc32/fpu/s_llrint.c: Include <limits.h>,
<math_private.h> and <stdint.h>.
(__llrint): Avoid conversions to long long int where those might
raise spurious exceptions.
* sysdeps/powerpc/powerpc32/fpu/s_llrintf.c: Include
<math_private.h> and <stdint.h>.
(__llrintf): Avoid conversions to long long int where those might
raise spurious exceptions.
On powerpc32 hard-float, older processors (ones where fcfid is not
available for 32-bit code), GCC generates conversions from integers to
floating point that wrongly convert integer 0 to -0 instead of +0 in
FE_DOWNWARD mode. This in turn results in logb and a few other
functions wrongly returning -0 when they should return +0.
This patch works around this issue in glibc as I proposed in
<https://sourceware.org/ml/libc-alpha/2015-09/msg00728.html>, so that
the affected functions can be correct and the affected tests pass in
the absence of a GCC fix for this longstanding issue (GCC bug 67771 -
if fixed, of course we can put in GCC version conditionals, and
eventually phase out the workarounds). A new macro
FIX_INT_FP_CONVERT_ZERO is added in a new sysdeps header
fix-int-fp-convert-zero.h, and the powerpc32/fpu version of that
header defines the macro based on the results of a configure test for
whether such conversions use the fcfid instruction.
Tested for x86_64 (that installed stripped shared libraries are
unchanged by the patch) and powerpc (that HAVE_PPC_FCFID comes out to
0 as expected and that the relevant tests are fixed). Also tested a
build with GCC configured for -mcpu=power4 and verified that
HAVE_PPC_FCFID comes out to 1 in that case.
There are still some other issues to fix to get test-float and
test-double passing cleanly for older powerpc32 processors (apart from
the need for an ulps regeneration for powerpc). (test-ldouble will be
harder to get passing cleanly, but with a combination of selected
fixes to ldbl-128ibm code that don't involve significant performance
issues, allowing spurious underflow and inexact exceptions for that
format, and lots of XFAILing for the default case of unpatched libgcc,
it should be doable.)
[BZ #887]
[BZ #19049]
[BZ #19050]
* sysdeps/generic/fix-int-fp-convert-zero.h: New file.
* sysdeps/ieee754/dbl-64/e_log10.c: Include
<fix-int-fp-convert-zero.h>.
(__ieee754_log10): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/ieee754/dbl-64/e_log2.c: Include
<fix-int-fp-convert-zero.h>.
(__ieee754_log2): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/ieee754/dbl-64/s_erf.c: Include
<fix-int-fp-convert-zero.h>.
(__erfc): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/ieee754/dbl-64/s_logb.c: Include
<fix-int-fp-convert-zero.h>.
(__logb): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/ieee754/flt-32/e_log10f.c: Include
<fix-int-fp-convert-zero.h>.
(__ieee754_log10f): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/ieee754/flt-32/e_log2f.c: Include
<fix-int-fp-convert-zero.h>.
(__ieee754_log2f): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/ieee754/flt-32/s_erff.c: Include
<fix-int-fp-convert-zero.h>.
(__erfcf): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/ieee754/flt-32/s_logbf.c: Include
<fix-int-fp-convert-zero.h>.
(__logbf): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/ieee754/ldbl-128ibm/s_erfl.c: Include
<fix-int-fp-convert-zero.h>.
(__erfcl): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/ieee754/ldbl-128ibm/s_logbl.c: Include
<fix-int-fp-convert-zero.h>.
(__logbl): Adjust signs as needed if FIX_INT_FP_CONVERT_ZERO.
* sysdeps/powerpc/powerpc32/fpu/configure.ac: New file.
* sysdeps/powerpc/powerpc32/fpu/configure: New generated file.
* sysdeps/powerpc/powerpc32/fpu/fix-int-fp-convert-zero.h: New
file.
* config.h.in [_LIBC] (HAVE_PPC_FCFID): New macro.
The file sysdeps/powerpc/sysdeps.h defines aliases for register operands,
which add the letter 'r' as a prefix to a register name. E.g.: register 20
can be written as 'r20', instead of '20'. On the one hand, this increases
readability, as it makes it easier for readers to know whether the operand is a
register or an immediate. On the other hand, this permits that immediate
operands be written as if they were registers, and vice-versa, thus reducing
the readability of the code.
This commit removes some of these unintentional misuses.
This commit also increases readability of the code by adding the prefix 'cr' to
some uses of the control register.
Both changes have no effect on the final code. Checked with objdump.
* sysdeps/powerpc/powerpc64/power8/strncpy.S: Remove or add register
prefix from operands.
Similar to various other bugs in this area, hypot functions can fail
to raise the underflow exception when the result is tiny and inexact
but one or more low bits of the intermediate result that is scaled
down (or, in the i386 case, converted from a wider evaluation format)
are zero. This patch forces the exception in a similar way to
previous fixes.
Note that this issue cannot arise for implementations of hypotf using
double (or wider) for intermediate evaluation (if hypotf should
underflow, that means the double square root is being computed of some
number of the form N*2^-298, for 0 < N < 2^46, which is exactly
represented as a double, and whatever the rounding mode such a square
root cannot have a mantissa with all zeroes after the initial 23
bits). Thus no changes are made to hypotf implementations in this
patch, only to hypot and hypotl.
Tested for x86_64, x86, mips64 and powerpc.
[BZ #18803]
* sysdeps/i386/fpu/e_hypot.S: Use DEFINE_DBL_MIN.
(MO): New macro.
(__ieee754_hypot) [PIC]: Load PIC register.
(__ieee754_hypot): Use DBL_NARROW_EVAL_UFLOW_NONNEG instead of
DBL_NARROW_EVAL.
* sysdeps/ieee754/dbl-64/e_hypot.c (__ieee754_hypot): Use
math_check_force_underflow_nonneg in case where result might be
tiny.
* sysdeps/ieee754/ldbl-128/e_hypotl.c (__ieee754_hypotl):
Likewise.
* sysdeps/ieee754/ldbl-128ibm/e_hypotl.c (__ieee754_hypotl):
Likewise.
* sysdeps/ieee754/ldbl-96/e_hypotl.c (__ieee754_hypotl): Likewise.
* sysdeps/powerpc/fpu/e_hypot.c (__ieee754_hypot): Likewise.
* math/auto-libm-test-in: Add more tests of hypot.
* math/auto-libm-test-out: Regenerated.
It was noted in
<https://sourceware.org/ml/libc-alpha/2012-09/msg00305.html> that the
bits/*.h naming scheme should only be used for installed headers.
This patch renames bits/atomic.h to atomic-machine.h to follow that
convention.
This is the only change in this series that needs to change the
filename rather than simply removing a directory level (because both
atomic.h and bits/atomic.h exist at present).
Tested for x86_64 (testsuite, and that installed stripped shared
libraries are unchanged by the patch).
[BZ #14912]
* sysdeps/aarch64/bits/atomic.h: Move to ...
* sysdeps/aarch64/atomic-machine.h: ...here.
(_AARCH64_BITS_ATOMIC_H): Rename macro to
_AARCH64_ATOMIC_MACHINE_H.
* sysdeps/alpha/bits/atomic.h: Move to ...
* sysdeps/alpha/atomic-machine.h: ...here.
* sysdeps/arm/bits/atomic.h: Move to ...
* sysdeps/arm/atomic-machine.h: ...here. Update comments.
* bits/atomic.h: Move to ...
* sysdeps/generic/atomic-machine.h: ...here.
(_BITS_ATOMIC_H): Rename macro to _ATOMIC_MACHINE_H.
* sysdeps/i386/bits/atomic.h: Move to ...
* sysdeps/i386/atomic-machine.h: ...here.
* sysdeps/ia64/bits/atomic.h: Move to ...
* sysdeps/ia64/atomic-machine.h: ...here.
* sysdeps/m68k/coldfire/bits/atomic.h: Move to ...
* sysdeps/m68k/coldfire/atomic-machine.h: ...here.
(_BITS_ATOMIC_H): Rename macro to _ATOMIC_MACHINE_H.
* sysdeps/m68k/m680x0/m68020/bits/atomic.h: Move to ...
* sysdeps/m68k/m680x0/m68020/atomic-machine.h: ...here.
* sysdeps/microblaze/bits/atomic.h: Move to ...
* sysdeps/microblaze/atomic-machine.h: ...here.
* sysdeps/mips/bits/atomic.h: Move to ...
* sysdeps/mips/atomic-machine.h: ...here.
(_MIPS_BITS_ATOMIC_H): Rename macro to _MIPS_ATOMIC_MACHINE_H.
* sysdeps/powerpc/bits/atomic.h: Move to ...
* sysdeps/powerpc/atomic-machine.h: ...here. Update comments.
* sysdeps/powerpc/powerpc32/bits/atomic.h: Move to ...
* sysdeps/powerpc/powerpc32/atomic-machine.h: ...here. Update
comments. Include <atomic-machine.h> instead of <bits/atomic.h>.
* sysdeps/powerpc/powerpc64/bits/atomic.h: Move to ...
* sysdeps/powerpc/powerpc64/atomic-machine.h: ...here. Include
<atomic-machine.h> instead of <bits/atomic.h>.
* sysdeps/s390/bits/atomic.h: Move to ...
* sysdeps/s390/atomic-machine.h: ...here.
* sysdeps/sparc/sparc32/bits/atomic.h: Move to ...
* sysdeps/sparc/sparc32/atomic-machine.h: ...here.
(_BITS_ATOMIC_H): Rename macro to _ATOMIC_MACHINE_H.
* sysdeps/sparc/sparc32/sparcv9/bits/atomic.h: Move to ...
* sysdeps/sparc/sparc32/sparcv9/atomic-machine.h: ...here.
* sysdeps/sparc/sparc64/bits/atomic.h: Move to ...
* sysdeps/sparc/sparc64/atomic-machine.h: ...here.
* sysdeps/tile/bits/atomic.h: Move to ...
* sysdeps/tile/atomic-machine.h: ...here.
* sysdeps/tile/tilegx/bits/atomic.h: Move to ...
* sysdeps/tile/tilegx/atomic-machine.h: ...here. Include
<sysdeps/tile/atomic-machine.h> instead of
<sysdeps/tile/bits/atomic.h>.
(_BITS_ATOMIC_H): Rename macro to _ATOMIC_MACHINE_H.
* sysdeps/tile/tilepro/bits/atomic.h: Move to ...
* sysdeps/tile/tilepro/atomic-machine.h: ...here. Include
<sysdeps/tile/atomic-machine.h> instead of
<sysdeps/tile/bits/atomic.h>.
(_BITS_ATOMIC_H): Rename macro to _ATOMIC_MACHINE_H.
* sysdeps/unix/sysv/linux/arm/bits/atomic.h: Move to ...
* sysdeps/unix/sysv/linux/arm/atomic-machine.h: ...here. Include
<sysdeps/arm/atomic-machine.h> instead of
<sysdeps/arm/bits/atomic.h>.
* sysdeps/unix/sysv/linux/hppa/bits/atomic.h: Move to ...
* sysdeps/unix/sysv/linux/hppa/atomic-machine.h: ...here.
(_BITS_ATOMIC_H): Rename macro to _ATOMIC_MACHINE_H.
* sysdeps/unix/sysv/linux/m68k/coldfire/bits/atomic.h: Move to ...
* sysdeps/unix/sysv/linux/m68k/coldfire/atomic-machine.h: ...here.
(_BITS_ATOMIC_H): Rename macro to _ATOMIC_MACHINE_H.
* sysdeps/unix/sysv/linux/nios2/bits/atomic.h: Move to ...
* sysdeps/unix/sysv/linux/nios2/atomic-machine.h: ...here.
(_NIOS2_BITS_ATOMIC_H): Rename macro to _NIOS2_ATOMIC_MACHINE_H.
* sysdeps/unix/sysv/linux/sh/bits/atomic.h: Move to ...
* sysdeps/unix/sysv/linux/sh/atomic-machine.h: ...here.
* sysdeps/x86_64/bits/atomic.h: Move to ...
* sysdeps/x86_64/atomic-machine.h: ...here.
* include/atomic.h: Include <atomic-machine.h> instead of
<bits/atomic.h>.
Linux commit b4b56f9ecab40f3b4ef53e130c9f6663be491894 introduced
a new HWCAP2 bit to indicate that the kernel now aborts a memory
transaction when a syscall is made. This patch adds that bit to
sysdeps/powerpc/bits/hwcap.h.
2015-08-26 Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
* sysdeps/powerpc/bits/hwcap.h: Add PPC_FEATURE2_HTM_NOSC.
* sysdeps/powerpc/dl-procinfo.c:
(_dl_powerpc_cap_flags): Added descriptor for this hwcap
feature so it shows when LD_SHOW_AUXV=1.
Fix usage of tabort in generated syscalls. r0 has special meaning
when used with this instruction, thus it will not generate
persistent errors, nor return an error code. This mitigates poor
CPU usage when performing elided critical sections.
Additionally, transactions should be aborted when entering a user
invoked syscall. Otherwise the results of the transaction may be
undefined.
2015-08-25 Paul E. Murphy <murphyp@linux.vnet.ibm.com>
* sysdeps/powerpc/powerpc32/sysdep.h (ABORT_TRANSACTION): Use
register other than r0 for tabort, it has special meaning.
* sysdeps/powerpc/powerpc64/sysdep.h (ABORT_TRANSACTION): Likewise
* sysdeps/unix.sysv/linux/powerpc/syscall.S (syscall): Abort
transaction before starting syscall.
Instead of checking needle length, constant 'n' number of comparisons
is checked to fall back to default implementation. This patch is tested
on powerpc64 and powerpc64le.
2015-08-25 Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
* sysdeps/powerpc/powerpc64/power7/strstr.S: Handle worst case.
In powerpc64, memchr was always pointing to the internal __GI_memchr
implementation. This patch fixes that and makes it use the
optimized POWER7 version when adequate.
* sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c: Make
memchr not point to the internal __GI_memchr implementation.
This patch adds extra inline functions to change the Program Priority
Register from ISA 2.07.
2015-08-19 Gabriel F. T. Gomes <gftg@linux.vnet.ibm.com>
* sysdeps/powerpc/sys/platform/ppc.h (__ppc_set_ppr_med_high,
__ppc_set_ppr_very_low): New functions.
* manual/platform.texi: Add documentation about
__ppc_set_ppr_med_high and __ppc_set_ppr_very_low.
Some features in hwcap.h do not have matching string descriptors
to be displayed when LD_SHOW_AUXV=1. This patch fixes the problem.
2015-08-13 Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
* sysdeps/powerpc/dl-procinfo.c:
(_dl_powerpc_cap_flags): Added missing strings for some
hwcap features.
* sysdeps/powerpc/dl-procinfo.h: Updated hwcap bit count.
When building with --disable-multi-arch the memmove and strstr POWER7
optimization create and uses symbols that conflict with expect conform
tests.
* sysdeps/powerpc/powerpc64/power7/memmove.S (bcopy): Changing to
__bcopy and add a weak_alias to bcopy.
* sysdeps/powerpc/powerpc64/power7/strstr.S (strstr): Use __strnlen
for static build.
This patches uses the default strcpy/stpcpy implementation for
POWER7/PPC64. This is faster in mostly inputs for benchtests
and for multiarch the implementation uses the POWER7 strlen and
memcpy.
* string/stpcpy.c (__stpcpy): Use STPCPY to redefine symbol name and
cleanup macro usage.
* string/strcpy.c (strcpt): Use STRCPY to redefine symbol name.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy-power7.S: Remove file.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy-ppc64.S: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/strcpy-power7.S: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/strcpy-ppc64.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/stpcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/strcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/strcpy.c: Likewise.
* sysdeps/powerpc/powerpc64/stpcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/strcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
[SHARED && IS_IN (libc)]: Include <string/strcpy.c>.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
[SHARED && IS_IN (libc)]: Include <string/stpcpy.c>.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy-power7.c: New file.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy-ppc64.c: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/strcpy-power7.c: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/strcpy-ppc64.c: Likewise.
* sysdeps/powerpc/powerpc64/power7/strcpy.c: Likewise.
This patch fixes the strstr build with --disable-multi-arch option.
The optimization calls the __strstr_ppc symbol, which always build
for multiarch config but not if it is disable. This patch fixes it
by adding the default C implementation object with the expected
symbol name.
* sysdeps/powerpc/powerpc64/power7/Makefile [$(subdir) = string]
(sysdep_routines): Add strstr-ppc64.
* sysdeps/powerpc/powerpc64/power7/strstr-ppc64.c: New file.
This patch optimizes strstr function for power >= 7 systems. Performance
gain is obtained using aligned memory access and usage of cmpb
instruction for quicker comparison. The average improvement of this
optimization is ~40%. Tested on ppc64 and ppc64le.
2015-07-16 Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
* sysdeps/powerpc/powerpc64/multiarch/Makefile: Add strstr().
* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.
* sysdeps/powerpc/powerpc64/power7/strstr.S: New File.
* sysdeps/powerpc/powerpc64/multiarch/strstr-power7.S: New File.
* sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c: New File.
* sysdeps/powerpc/powerpc64/multiarch/strstr.c: New File.
Similar to various other bugs in this area, some sin and sincos
implementations do not raise the underflow exception for subnormal
arguments, when the result is tiny and inexact. This patch forces the
exception in a similar way to previous fixes.
Tested for x86_64, x86, mips64 and powerpc.
[BZ #16526]
[BZ #16538]
* sysdeps/ieee754/dbl-64/s_sin.c: Include <float.h>.
(__sin): Force underflow exception for arguments with small
absolute value.
* sysdeps/ieee754/flt-32/k_sinf.c: Include <float.h>.
(__kernel_sinf): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-128/k_sincosl.c: Include <float.h>.
(__kernel_sincosl): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-128/k_sinl.c: Include <float.h>.
(__kernel_sinl): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-128ibm/k_sincosl.c: Include <float.h>.
(__kernel_sincosl): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-128ibm/k_sinl.c: Include <float.h>.
(__kernel_sinl): Force underflow exception for arguments with
small absolute value.
* sysdeps/ieee754/ldbl-96/k_sinl.c: Include <float.h>.
(__kernel_sinl): Force underflow exception for arguments with
small absolute value.
* sysdeps/powerpc/fpu/k_sinf.c: Include <float.h>.
(__kernel_sinf): Force underflow exception for arguments with
small absolute value.
* math/auto-libm-test-in: Add more tests of sin and sincos.
* math/auto-libm-test-out: Regenerated.
Various code in glibc uses __strnlen instead of strnlen for namespace
reasons. However, __strnlen does not use libc_hidden_proto /
libc_hidden_def (as is normally done for any function defined and
called within the same library, whether or not exported from the
library and whatever namespace it is in), so the compiler does not
know that those calls are to a function within libc.
This patch uses libc_hidden_proto / libc_hidden_def with __strnlen.
On x86_64, it makes no difference to the installed stripped shared
libraries. On 32-bit x86, it causes __strnlen calls to go to the same
place as strnlen calls (the fallback strnlen implementation), rather
than through a PLT entry for the strnlen IFUNC; I'm not sure of the
logic behind when calls from within libc should use IFUNCs versus when
they should go direct to a particular function implementation, but
clearly it doesn't make sense for strnlen and __strnlen to be handled
differently in this regard.
Tested for x86_64 and x86 (testsuite, and comparison of installed
shared libraries as described above).
* string/strnlen.c [!STRNLEN] (__strnlen): Use libc_hidden_def.
* include/string.h (__strnlen): Use libc_hidden_proto.
* sysdeps/aarch64/strnlen.S (__strnlen): Use libc_hidden_def.
* sysdeps/i386/i686/multiarch/strnlen-c.c [SHARED]
(libc_hidden_def): Define __GI___strnlen as well as __GI_strnlen.
* sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-power7.S
(libc_hidden_def): Undefine and redefine.
* sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c
[SHARED] (libc_hidden_def): Define __GI___strnlen as well as
__GI_strnlen.
* sysdeps/powerpc/powerpc32/power7/strnlen.S (__strnlen): Use
libc_hidden_def.
* sysdeps/tile/tilegx/strnlen.c (__strnlen): Likewise.
This patch fix the static build for strftime, which uses __wcschr.
Current powerpc32 implementation defines the __wcschr be an alias to
__wcschr_ppc32 and current implementation misses the correct alias for
static build.
It also changes the default wcschr.c logic so a IFUNC implementation
should just define WCSCHR and undefine the required alias/internal
definitions.
IFUNC is difficult to correctly implement on any target needing a GOT
to support position independent code, due to the dependency on order
of dynamic relocations. ld.so should be changed to apply IFUNC
relocations last, globally, because without that it is actually
impossible to write an IFUNC resolver in C that works in all
situations. Case in point, vfork in libpthread.so is an IFUNC with
the resolver returning &__libc_vfork. (system and fork are similar.)
If another shared library, libA say, uses vfork then it is quite
possible that libpthread.so hasn't been dynamically relocated before
the unfortunate libA is dynamically relocated. In that case the GOT
entry for &__libc_vfork is still zero, so the IFUNC resolver returns
NULL. LD_BIND_NOW=1 results in libA PLT dynamic relocations being
applied using this NULL value and ld.so segfaults.
This patch hardens ld.so to not segfault on a NULL from an IFUNC
resolver. It also fixes a problem with undefined weak. If you leave
the plt entry as-is for undefined weak then if the entry is ever
called it will loop in ld.so rather than segfaulting.
* sysdeps/powerpc/powerpc64/dl-machine.h (elf_machine_fixup_plt):
Don't segfault if ifunc resolver returns a NULL. Do set plt to
zero for undefined weak.
(elf_machine_plt_conflict): Similarly.
This patch is glibc support for a PowerPC TLS optimization, inspired
by Alexandre Oliva's TLS optimization for other processors,
http://www.lsd.ic.unicamp.br/~oliva/writeups/TLS/RFC-TLSDESC-x86.txt
In essence, this optimization uses a zero module id in the tls_index
GOT entry to indicate that a TLS variable is allocated space in the
static TLS area. A special plt call linker stub for __tls_get_addr
checks for such a tls_index and if found, returns the offset
immediately. The linker communicates the fact that the special
__tls_get_addr stub is used by setting a bit in the dynamic tag
DT_PPC64_OPT/DT_PPC_OPT. glibc communicates to the linker that this
optimization is available by the presence of __tls_get_addr_opt.
tst-tlsmod2.so is built with -Wl,--no-tls-get-addr-optimize for
tst-tls-dlinfo, which otherwise would fail since it tests that no
static tls is allocated. The ld option --no-tls-get-addr-optimize has
been available since binutils-2.20 so doesn't need a configure test.
* NEWS: Advertise TLS optimization.
* elf/elf.h (R_PPC_TLSGD, R_PPC_TLSLD, DT_PPC_OPT, PPC_OPT_TLS): Define.
(DT_PPC_NUM): Increment.
* elf/dynamic-link.h (HAVE_STATIC_TLS): Define.
(CHECK_STATIC_TLS): Use here.
* sysdeps/powerpc/powerpc32/dl-machine.h (elf_machine_rela): Optimize
TLS descriptors.
* sysdeps/powerpc/powerpc64/dl-machine.h (elf_machine_rela): Likewise.
* sysdeps/powerpc/dl-tls.c: New file.
* sysdeps/powerpc/Versions: Add __tls_get_addr_opt.
* sysdeps/powerpc/tst-tlsopt-powerpc.c: New tls test.
* sysdeps/unix/sysv/linux/powerpc/Makefile: Add new test.
Build tst-tlsmod2.so with --no-tls-get-addr-optimize.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/ld.abilist: Update.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/ld.abilist: Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/ld-le.abilist: Likewise.
This feature doesn't depend on the linker, as can be seen from the
actual test. It's a compiler feature.
* sysdeps/powerpc/powerpc64/configure.ac: Correct "linker support
for overlapping .opd entries" to "support...".
* sysdeps/powerpc/powerpc64/configure: Regenerate
In bc0cdc498 the configure check for HAVE_ASM_PPC_REL16 was removed
on the grounds that the minimum binutils supports rel16 relocs. This
is true, but not all references to HAVE_ASM_PPC_REL16 in the sources
were removed.
* config.h.in: Remove HAVE_ASM_PPC_REL16.
* sysdeps/powerpc/powerpc32/tls-macros.h: Remove HAVE_ASM_PPC_REL16
and false branch of conditional.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/swapcontext-common.S:
Likewise.
This patch makes soft-fp use static assertions in place of conditional
calls to abort, in places where there are checks for conditions (on
the types for which a macro is used) that the code is not prepared to
handle. The fallback definition of _FP_STATIC_ASSERT (for kernel use
only, as only relevant to compilers not supported for building glibc)
is as in misc/sys/cdefs.h.
This means that soft-fp only ever calls abort for _FP_UNREACHABLE
calls in builds with GCC versions before 4.5. Thus, there is no need
for an abort declaration or <stdlib.h> include, since the kernel code
handles defining abort as a macro itself - and so this avoids any need
for an __KERNEL__ condition on the abort declaration to avoid it
breaking with the kernel's macro definition. That is, this patch is
intended to make glibc's soft-fp code suitable for kernel use with no
kernel-local changes to the soft-fp code needed at all.
Tested for powerpc-nofpu that installed stripped shared libraries are
unchanged by the patch. One explicit <stdlib.h> include had to be
added to a file that was relying on the include from soft-fp.h.
* soft-fp/soft-fp.h (_FP_STATIC_ASSERT): New macro.
[_LIBC]: Do not include <stdlib.h>.
[!_LIBC] (abort): Remove declaration.
* soft-fp/op-2.h (_FP_MUL_MEAT_2_120_240_double): Use
_FP_STATIC_ASSERT instead of conditionally calling abort.
* soft-fp/op-common.h (_FP_FROM_INT): Likewise.
(_FP_EXTEND_CNAN): Likewise.
(FP_TRUNC): Likewise.
(__FP_CLZ): Likewise.
* sysdeps/powerpc/nofpu/flt-rounds.c: Include <stdlib.h>.
With AIX port deprecated there is no need to check/define
HAVE_ASM_GLOBAL_DOT_NAME anymore since the current minimum binutils
supported (2.22) does not emit global symbol with dot.
This patch removes all the HAVE_ASM_GLOBAL_DOT_NAME definition and
checks for powerpc64 port.
This patch fixes the inline feraiseexcept and feclearexcept macros for
powerpc by casting the input argument to integer before operation on it.
It fixes BZ#17776.
This patch fixes the missing "__memcpy_ppc" symbol for memmove-ppc64
object in static builds. Since memcpy ifunc is not enabled in static
mode, the specialized symbols are not provided. The patch changed the
it to just "__memcpy" instead.
Similarly to sqrt in
<https://sourceware.org/ml/libc-alpha/2015-02/msg00353.html>, the
powerpc sqrtf implementation for when _ARCH_PPCSQ is not defined also
relies on a * b + c being contracted into a fused multiply-add.
Although this contraction is not explicitly disabled for e_sqrtf.c, it
still seems appropriate to make the file explicit about its
requirements by using __builtin_fmaf; this patch does so.
Furthermore, it turns out that doing so fixes the observed inaccuracy
and missing exceptions (that is, that without explicit __builtin_fmaf
usage, it was not being compiled as intended).
Tested for powerpc32 (hard float).
[BZ #17967]
* sysdeps/powerpc/fpu/e_sqrtf.c (__slow_ieee754_sqrtf): Use
__builtin_fmaf instead of relying on contraction of a * b + c.
As Adhemerval noted in
<https://sourceware.org/ml/libc-alpha/2015-01/msg00451.html>, the
powerpc sqrt implementation for when _ARCH_PPCSQ is not defined is
inaccurate in some cases.
The problem is that this code relies on fused multiply-add, and relies
on the compiler contracting a * b + c to get a fused operation. But
sysdeps/ieee754/dbl-64/Makefile disables contraction for e_sqrt.c,
because the implementation in that directory relies on *not* having
contracted operations.
While it would be possible to arrange makefiles so that an earlier
sysdeps directory can disable the setting in
sysdeps/ieee754/dbl-64/Makefile, it seems a lot cleaner to make the
dependence on fused operations explicit in the .c file. GCC 4.6
introduced support for __builtin_fma on powerpc and other
architectures with such instructions, so we can rely on that; this
patch duly makes the code use __builtin_fma for all such fused
operations.
Tested for powerpc32 (hard float).
2015-02-12 Joseph Myers <joseph@codesourcery.com>
[BZ #17964]
* sysdeps/powerpc/fpu/e_sqrt.c (__slow_ieee754_sqrt): Use
__builtin_fma instead of relying on contraction of a * b + c.
This patch optimizes strncpy for power7 for unaligned source or
destination address. The source or destination address is aligned
to doubleword and data is shifted based on the alignment and
added with the previous loaded data to be written as a doubleword.
For each load, cmpb instruction is used for faster null check.
The new optimization shows 10 to 70% of performance improvement
for longer string though it does not show big difference on string
size less than 16 due to additional checks.Hence this new algorithm
is restricted to string greater than 16.
Current minimum binutils supported (2.22) has ".machine altivec" support
as default, so there is no need to add a configure check for such
functionality. This patches removes the configure checks for it.
This patch cleanup some multiarch code related to memmmove
optimization. Initial IFUNC support added specialized wordcopy
symbols which turned in local IFUNC calls used by memmove default
implementation. The patch removes the internal IFUNC for wordcopy
symbols and uses local branches in the memmmove optimization instead.
This patch cleanup some multiarch code related to memmmove
optimization. Initial IFUNC support added specialized wordcopy
symbols which turned in local IFUNC calls used by memmove default
implementation.
This change by removing then and used the optimized memmove instead
for supported chips.
This patch simplify the default bcopy symbol for powerpc64 by just using
memmove instead of implementing using the default bcopy. Since the
symbol is deprecated, it trades speed by code size.
This patch fixes a bug introduced by 18f2945ae9, where it optimizes
the FPSCR set by just issuing a mtfs instruction if new flag is different
from older one. The issue is a typo, where the new flag should the the
new value, instead of the old one.
It fixes BZ#17885.
Some powerpc64 processors (e5500 core for instance) does not provide the
fsqrt instruction, however current check to use in math_private.h is
__WORDSIZE and _ARCH_PWR4 (ISA 2.02). This is patch change it to use
the compiler flag _ARCH_PPCSQ (which is the same condition GCC uses to
decide whether to generate fsqrt instruction).
It fixes BZ#16576.
GLIBC memset optimization for POWER8 uses the '.machine power8'
directive, which is only supported officially on binutils 2.24+. This
causes a build failure on older binutils.
Since the requirement of .machine power8 is to correctly assembly the
'mtvsrd' instruction and it is already handled by the MTVSRD_V1_R4
macro, there is no really needed of using it.
The patch replaces the power8 with power7 for .machine directive.
It fixes BZ#17869.
This patch fix the elf/ifuncmain6pie failure when building with GCC
4.9+. For some reason, the compiler removes the branch taken code at
resolve_ifunc (sysdeps/powerpc/powerpc64/dl-machine.h) as dead-code
and thus the testcase fails because the ifunc resolves branches to an
invalid memory location. It fixes by explicit adding a dependency of
value based on odp variable to avoid compiler optimization.
It fixes BZ#17868.
When fixing namespace issues for <fenv.h> functions I missed one call
to fesetenv for powerpc-nofpu. This patch changes this to a call to
__fesetenv.
Tested for powerpc-nofpu; it fixes the previously observed math.h
linknamespace test failures.
[BZ #17748]
* sysdeps/powerpc/nofpu/feholdexcpt.c (__feholdexcept): Call
__fesetenv instead of fesetenv.
This patch fixes a performance regression on the POWER7/PPC64 memcmp
porting for Little Endian. The LE code uses 'ldbrx' instruction to read
the memory on byte reversed form, however ISA 2.06 just provide the indexed
form which uses a register value as additional index, instead of a fixed value
enconded in the instruction.
And the port strategy for LE uses r0 index value and update the address
value on each compare loop interation. For large compare size values,
it adds 8 more instructions plus some more depending of trailing
size. This patch fixes it by adding pre-calculate indexes to remove the
address update on loops and tailing sizes.
For large sizes it shows a considerable gain, with double performance
pairing with BE.
This patch adds an optimized POWER8 strncmp. The implementation focus
on speeding up unaligned cases follwing the ideas of power8 strcmp.
The algorithm first check the initial 16 bytes, then align the first
function source and uses unaligned loads on second argument only.
Aditional checks for page boundaries are done for unaligned cases
(where sources alignment are different).
This patch optimized the POWER7 trailing check by avoiding using byte
read operations and instead use the doubleword already readed with
bitwise operations.
This patch adds an optimized POWER8 strcmp using unaligned accesses.
The algorithm first check the initial 16 bytes, then align the first
function source and uses unaligned loads on second argument only.
Aditional checks for page boundaries are done for unaligned cases
This patch adds an optimized POWER8 st{r,p}ncpy using unaligned accesses.
It shows 10%-80% improvement over the optimized POWER7 one that uses
only aligned accesses, specially on unaligned inputs.
The algorithm first read and check 16 bytes (if inputs do not cross a 4K
page size). The it realign source to 16-bytes and issue a 16 bytes read
and compare loop to speedup null byte checks for large strings. Also,
different from POWER7 optimization, the null pad is done inline in the
implementation using possible unaligned accesses, instead of realying on
a memset call. Special case is added for page cross reads.
With 3eb38795db (Simplify strncat) the generic algorithms uses
strlen, strnlen, and memcpy. This is faster than POWER7 current
implementation, especially for unaligned strings (where POWER7 code
uses byte-byte operations).
This patch removes the assembly implementation and uses a multiarch
specialization based on default algorithm calling optimized POWER7
symbols.
This patch adds an optimized POWER8 strcpy using unaligned accesses.
For strings up to 16 bytes the implementation first calculate the
string size, like strlen, and issues a memcpy. For larger strings,
source is first aligned to 16 bytes and then tested over a loop that
reads 16 bytes am combine the cmpb results for speedup. Special case is
added for page cross reads.
It shows 30%-60% improvement over the optimized POWER7 one that uses
only aligned accesses.
Linux kernel powerpc documentation states issuing a syscall inside a
transaction is not recommended and may lead to undefined behavior. It
also states syscalls does not abort transactoin neither they run in
transactional state.
To avoid side-effects being visible outside transactions, GLIBC with
lock elision enabled will issue a transaction abort instruction just
before all syscalls if hardware supports hardware transactions.
This patch adds support for lock elision using ISA 2.07 hardware
transactional memory for rwlocks. The logic is similar to the
one presented in pthread_mutex lock elision.
Concluding the fixes for C90 libm functions calling C99 fe* functions,
this patch fixes the case of feupdateenv by making it a weak alias for
__feupdateenv and making the affected code call __feupdateenv.
Tested for x86_64 (testsuite, and that installed stripped shared
libraries are unchanged by the patch). Also tested for ARM
(soft-float) that the math.h linknamespace tests now pass.
[BZ #17748]
* include/fenv.h (__feupdateenv): Use libm_hidden_proto.
* math/feupdateenv.c (__feupdateenv): Use libm_hidden_def.
* sysdeps/aarch64/fpu/feupdateenv.c (feupdateenv): Rename to
__feupdateenv and define as weak alias of __feupdateenv. Use
libm_hidden_weak.
* sysdeps/alpha/fpu/feupdateenv.c (__feupdateenv): Use
libm_hidden_def.
* sysdeps/arm/feupdateenv.c (feupdateenv): Rename to __feupdateenv
and define as weak alias of __feupdateenv. Use libm_hidden_weak.
* sysdeps/hppa/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/i386/fpu/feupdateenv.c (__feupdateenv): Use
libm_hidden_def.
* sysdeps/ia64/fpu/feupdateenv.c (feupdateenv): Rename to
__feupdateenv and define as weak alias of __feupdateenv. Use
libm_hidden_weak.
* sysdeps/m68k/fpu/feupdateenv.c (__feupdateenv): Use
libm_hidden_def.
* sysdeps/mips/fpu/feupdateenv.c (feupdateenv): Rename to
__feupdateenv and define as weak alias of __feupdateenv. Use
libm_hidden_weak.
* sysdeps/powerpc/fpu/feupdateenv.c (__feupdateenv): Use
libm_hidden_def.
* sysdeps/powerpc/nofpu/feupdateenv.c (__feupdateenv): Likewise.
* sysdeps/powerpc/powerpc32/e500/nofpu/feupdateenv.c
(__feupdateenv): Likewise.
* sysdeps/s390/fpu/feupdateenv.c (feupdateenv): Rename to
__feupdateenv and define as weak alias of __feupdateenv. Use
libm_hidden_weak.
* sysdeps/sh/sh4/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/sparc/fpu/feupdateenv.c (__feupdateenv): Use
libm_hidden_def.
* sysdeps/tile/math_private.h (__feupdateenv): New inline
function.
* sysdeps/x86_64/fpu/feupdateenv.c (__feupdateenv): Use
libm_hidden_def.
* sysdeps/generic/math_private.h (default_libc_feupdateenv): Call
__feupdateenv instead of feupdateenv.
(default_libc_feupdateenv_test): Likewise.
(libc_feresetround_ctx): Likewise.
Continuing the fixes for C90 libm functions calling C99 fe* functions,
this patch fixes the case of fesetround by making it a weak alias of
__fesetround and making the affected code call __fesetround. An
existing __fesetround function in fenv_libc.h for powerpc is renamed
to __fesetround_inline.
Tested for x86_64 (testsuite, and that disassembly of installed shared
libraries is unchanged by the patch). Also tested for ARM
(soft-float) that fesetround failures disappear from the linknamespace
test results (feupdateenv remains to be addressed to complete fixing
bug 17748).
[BZ #17748]
* include/fenv.h (__fesetround): Declare. Use libm_hidden_proto.
* math/fesetround.c (fesetround): Rename to __fesetround and
define as weak alias of __fesetround. Use libm_hidden_weak.
* sysdeps/aarch64/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/alpha/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/arm/fesetround.c (fesetround): Likewise.
* sysdeps/hppa/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/i386/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/ia64/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/m68k/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/mips/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/powerpc/fpu/fenv_libc.h (__fesetround): Rename to
__fesetround_inline.
* sysdeps/powerpc/fpu/fenv_private.h (libc_fesetround_ppc): Call
__fesetround_inline instead of __fesetround.
* sysdeps/powerpc/fpu/fesetround.c (fesetround): Rename to
__fesetround and define as weak alias of __fesetround. Use
libm_hidden_weak. Call __fesetround_inline instead of
__fesetround.
* sysdeps/powerpc/nofpu/fesetround.c (fesetround): Rename to
__fesetround and define as weak alias of __fesetround. Use
libm_hidden_weak.
* sysdeps/powerpc/powerpc32/e500/nofpu/fesetround.c (fesetround):
Likewise.
* sysdeps/s390/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/sh/sh4/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/sparc/fpu/fesetround.c (fesetround): Likewise.
* sysdeps/tile/math_private.h (__fesetround): New inline function.
* sysdeps/x86_64/fpu/fesetround.c (fesetround): Rename to
__fesetround and define as weak alias of __fesetround. Use
libm_hidden_weak.
* sysdeps/generic/math_private.h (default_libc_fesetround): Call
__fesetround instead of fesetround.
(default_libc_feholdexcept_setround): Likewise.
(libc_feholdsetround_ctx): Likewise.
(libc_feholdsetround_noex_ctx): Likewise.
Continuing the fixes for C90 libm functions calling C99 fe* functions,
this patch fixes the case of fesetenv by making it a weak alias of
__fesetenv and making the affected code (including various copies of
feupdateenv which also gets called from C90 functions) call
__fesetenv.
Tested for x86_64 (testsuite, and that disassembly of installed shared
libraries is unchanged by the patch). Also tested for ARM
(soft-float) that fesetenv failures disappear from the linknamespace
test results (fsetround and feupdateenv remain to be addressed to
complete fixing bug 17748).
[BZ #17748]
* include/fenv.h (__fesetenv): Use libm_hidden_proto.
* math/fesetenv.c (__fesetenv): Use libm_hidden_def.
* sysdeps/aarch64/fpu/fesetenv.c (fesetenv): Rename to __fesetenv
and define as weak alias of __fesetenv. Use libm_hidden_weak.
* sysdeps/alpha/fpu/fesetenv.c (__fesetenv): Use libm_hidden_def.
* sysdeps/arm/fesetenv.c (fesetenv): Rename to __fesetenv and
define as weak alias of __fesetenv. Use libm_hidden_weak.
* sysdeps/hppa/fpu/fesetenv.c (fesetenv): Likewise.
* sysdeps/i386/fpu/fesetenv.c (__fesetenv): Use libm_hidden_def.
* sysdeps/ia64/fpu/fesetenv.c (fesetenv): Rename to __fesetenv and
define as weak alias of __fesetenv. Use libm_hidden_weak.
* sysdeps/m68k/fpu/fesetenv.c (__fesetenv): Use libm_hidden_def.
* sysdeps/mips/fpu/fesetenv.c (fesetenv): Rename to __fesetenv and
define as weak alias of __fesetenv. Use libm_hidden_weak.
* sysdeps/powerpc/fpu/fesetenv.c (__fesetenv): Use
libm_hidden_def.
* sysdeps/powerpc/nofpu/fesetenv.c (__fesetenv): Likewise.
* sysdeps/powerpc/powerpc32/e500/nofpu/fesetenv.c (__fesetenv):
Likewise.
* sysdeps/s390/fpu/fesetenv.c (fesetenv): Rename to __fesetenv and
define as weak alias of __fesetenv. Use libm_hidden_weak.
* sysdeps/sh/sh4/fpu/fesetenv.c (fesetenv): Likewise.
* sysdeps/sparc/fpu/fesetenv.c (__fesetenv): Use libm_hidden_def.
* sysdeps/tile/math_private.h (__fesetenv): New inline function.
* sysdeps/x86_64/fpu/fesetenv.c (fesetenv): Rename to __fesetenv
and define as weak alias of __fesetenv. Use libm_hidden_weak.
* sysdeps/generic/math_private.h (default_libc_fesetenv): Use
__fesetenv instead of fesetenv.
(libc_feresetround_noex_ctx): Likewise.
* sysdeps/alpha/fpu/feupdateenv.c (__feupdateenv): Likewise.
* sysdeps/hppa/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/i386/fpu/feupdateenv.c (__feupdateenv): Likewise.
* sysdeps/ia64/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/m68k/fpu/feupdateenv.c (__feupdateenv): Likewise.
* sysdeps/mips/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/powerpc/nofpu/feupdateenv.c (__feupdateenv): Likewise.
* sysdeps/powerpc/powerpc32/e500/nofpu/feupdateenv.c
(__feupdateenv): Likewise.
* sysdeps/s390/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/sh/sh4/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/sparc/fpu/feupdateenv.c (__feupdateenv): Likewise.
* sysdeps/x86_64/fpu/feupdateenv.c (__feupdateenv): Likewise.
Continuing the fixes for C90 libm functions calling C99 fe* functions,
this patch fixes the case of feholdexcept by making it a weak alias of
__feholdexcept and making the affected code call __feholdexcept.
Tested for x86_64 (testsuite, and that disassembly of installed shared
libraries is unchanged by the patch). Also tested for ARM
(soft-float) that feholdexcept failures disappear from the
linknamespace test failures (fesetenv, fsetround and feupdateenv
remain to be addressed to complete fixing bug 17748).
[BZ #17748]
* include/fenv.h (__feholdexcept): Declare. Use
libm_hidden_proto.
* math/feholdexcpt.c (feholdexcept): Rename to __feholdexcept and
define as weak alias of __feholdexcept. Use libm_hidden_weak.
* sysdeps/aarch64/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/alpha/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/arm/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/hppa/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/i386/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/ia64/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/m68k/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/mips/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/powerpc/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/powerpc/nofpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/powerpc/powerpc32/e500/nofpu/feholdexcpt.c
(feholdexcept): Likewise.
* sysdeps/s390/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/sh/sh4/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/sparc/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/x86_64/fpu/feholdexcpt.c (feholdexcept): Likewise.
* sysdeps/generic/math_private.h (default_libc_feholdexcept): Use
__feholdexcept instead of feholdexcept.
(default_libc_feholdexcept_setround): Likewise.
Continuing the fixes for C90 libm functions calling C99 fe* functions,
this patch fixes the case of fegetround by making it a weak alias of
__fegetround and making the affected code call __fegetround.
Tested for x86_64 (testsuite, and that disassembly of installed shared
libraries is unchanged by the patch). Also tested for ARM
(soft-float) that fegetround failures disappear from the linknamespace
test failures (feholdexcept, fesetenv, fesetround and feupdateenv
remain to be addressed before bug 17748 is fully fixed, although this
patch may suffice to fix the failures in some cases, when the libc_fe*
functions are implemented but there is no architecture-specific sqrt
implementation in use so there were failures from fegetround used by
sqrt but no other such failures).
[BZ #17748]
* include/fenv.h (__fegetround): Declare. Use libm_hidden_proto.
* math/fegetround.c (fegetround): Rename to __fegetround and
define as weak alias of __fegetround. Use libm_hidden_weak.
* sysdeps/aarch64/fpu/fegetround.c (fegetround): Likewise.
* sysdeps/alpha/fpu/fegetround.c (fegetround): Likewise.
* sysdeps/arm/fegetround.c (fegetround): Likewise.
* sysdeps/hppa/fpu/fegetround.c (fegetround): Likewise.
* sysdeps/i386/fpu/fegetround.c (fegetround): Likewise.
* sysdeps/ia64/fpu/fegetround.c (fegetround): Likewise.
* sysdeps/m68k/fpu/fegetround.c (fegetround): Likewise.
* sysdeps/mips/fpu/fegetround.c (fegetround): Likewise.
* sysdeps/powerpc/fpu/fegetround.c (fegetround): Likewise.
Undefine after rather than before function definition; use
parentheses around function name in definition.
(__fegetround): Also undefine macro after function definition.
* sysdeps/powerpc/nofpu/fegetround.c (fegetround): Rename to
__fegetround and define as weak alias of __fegetround. Use
libm_hidden_weak. Do not undefine as macro.
* sysdeps/powerpc/powerpc32/e500/nofpu/fegetround.c (fegetround):
Likewise.
* sysdeps/s390/fpu/fegetround.c (fegetround): Rename to
__fegetround and define as weak alias of __fegetround. Use
libm_hidden_weak.
* sysdeps/sh/sh4/fpu/fegetround.c (fegetround): Likewise.
* sysdeps/sparc/fpu/fegetround.c (fegetround): Likewise.
* sysdeps/tile/math_private.h (__fegetround): New inline function.
* sysdeps/x86_64/fpu/fegetround.c (fegetround): Rename to
__fegetround and define as weak alias of __fegetround. Use
libm_hidden_weak.
* sysdeps/ieee754/dbl-64/e_sqrt.c (__ieee754_sqrt): Use
__fegetround instead of fegetround.
Some C90 libm functions call fegetenv via libc_feholdsetround*
functions in math_private.h. This patch makes them call __fegetenv
instead, making fegetenv into a weak alias for __fegetenv as needed.
Tested for x86_64 (testsuite, and that disassembly of installed shared
libraries is unchanged by the patch). Also tested for ARM
(soft-float) that fegetenv failures disappear from the linknamespace
test failures (however, similar fixes will also be needed for
fegetround, feholdexcept, fesetenv, fesetround and feupdateenv before
this set of namespace issues covered by bug 17748 is fully fixed and
those linknamespace tests start passing).
[BZ #17748]
* include/fenv.h (__fegetenv): Use libm_hidden_proto.
* math/fegetenv.c (__fegetenv): Use libm_hidden_def.
* sysdeps/aarch64/fpu/fegetenv.c (fegetenv): Rename to __fegetenv
and define as weak alias of __fegetenv. Use libm_hidden_weak.
* sysdeps/alpha/fpu/fegetenv.c (__fegetenv): Use libm_hidden_def.
* sysdeps/arm/fegetenv.c (fegetenv): Rename to __fegetenv and
define as weak alias of __fegetenv. Use libm_hidden_weak.
* sysdeps/hppa/fpu/fegetenv.c (fegetenv): Likewise.
* sysdeps/i386/fpu/fegetenv.c (__fegetenv): Use libm_hidden_def.
* sysdeps/ia64/fpu/fegetenv.c (fegetenv): Rename to __fegetenv and
define as weak alias of __fegetenv. Use libm_hidden_weak.
* sysdeps/m68k/fpu/fegetenv.c (__fegetenv): Use libm_hidden_def.
* sysdeps/mips/fpu/fegetenv.c (fegetenv): Rename to __fegetenv and
define as weak alias of __fegetenv. Use libm_hidden_weak.
* sysdeps/powerpc/fpu/fegetenv.c (__fegetenv): Use
libm_hidden_def.
* sysdeps/powerpc/nofpu/fegetenv.c (__fegetenv): Likewise.
* sysdeps/powerpc/powerpc32/e500/nofpu/fegetenv.c (__fegetenv):
Likewise.
* sysdeps/s390/fpu/fegetenv.c (fegetenv): Rename to __fegetenv and
define as weak alias of __fegetenv. Use libm_hidden_weak.
* sysdeps/sh/sh4/fpu/fegetenv.c (fegetenv): Likewise.
* sysdeps/sparc/fpu/fegetenv.c (__fegetenv): Use libm_hidden_def.
* sysdeps/tile/math_private.h (__fegetenv): New inline function.
* sysdeps/x86_64/fpu/fegetenv.c (fegetenv): Rename to __fegetenv
and define as weak alias of __fegetenv. Use libm_hidden_weak.
* sysdeps/generic/math_private.h (libc_feholdsetround_ctx): Use
__fegetenv instead of fegetenv.
(libc_feholdsetround_noex_ctx): Likewise.
This patch optimizes strcpy for ppc64/power7 for unaligned source or
destination address. The source or destination address is aligned
to doubleword and data is shifted based on the alignment and
added with the previous loaded data to be written as a doubleword.
For each load, cmpb instruction is used for faster null check.
The word aligned optimization is also removed, since the new unaligned
code path shows better results handling word-aligned strings.
More combination of unaligned inputs is also added in benchtest
to measure the improvement.The new optimization shows 2 to 80% of
performance improvement for longer string though it does not show
big difference on string size less than 16 due to additional checks.
The natural fix for some linknamespace test failures, where C90 libm
functions call C99 <fenv.h> functions, is to make fe* into weak
aliases for __fe* and call __fe* from within libm as needed.
To do this, the __fe* names need to be available for that purpose -
that is, they must not be used for something other than aliases of
fe*. On powerpc, however, __fegetround is an inline function in
fenv_libc.h, with no corresponding fegetround inline function;
fegetround has an equivalent macro expansion in bits/fenvinline.h, but
that is disabled if __NO_MATH_INLINES (which is defined for building
libm).
I see no need for that disabling; it's not even clear that
__NO_MATH_INLINES should affect <fenv.h>, and the results of
fegetround are completely defined so there is no semantic effect of
that disabling at all outside glibc. The x86 inline feraiseexcept is
conditioned on __USE_EXTERN_INLINES not __NO_MATH_INLINES (but that's
an inline function rather than a macro).
This patch removes the __NO_MATH_INLINES conditional on that
fegetround macro, so resulting in it being expanded inline inside
glibc. In turn, this means that direct calls to __fegetround from C99
functions in ldbl-128ibm can be changed to calls to fegetround, so
that nofpu fenv_libc.h files don't need to define __fegetround at all
and, by changing ldbl-128ibm files to use <fenv.h> not <fenv_libc.h>,
non-e500 nofpu no longer needs an fenv_libc.h file.
The other macros in fenvinline.h are left conditional on
__NO_MATH_INLINES, although since the only case where this should make
a difference is one involving undefined behavior (if the argument to
the function is not a valid exception macro).
The out-of-line definition for fegetround uses __fegetround (the
inline function removed by this patch). So this continues to work,
the fenvinline.h header is made to define __fegetround, and then to
define fegetround to call __fegetround.
Tested for powerpc32 (hard float) that installed stripped shared
libraries are unchanged by this patch; also tested that powerpc-nofpu
build still works. (This patch does not itself fix any bugs; it
simply cleans things up in preparation for separate bug fixes.)
* sysdeps/powerpc/bits/fenvinline.h (fegetround): Rename macro to
__fegetround and redefine to call __fegetround. Remove condition
on [!__NO_MATH_INLINES].
* sysdeps/powerpc/fpu/fenv_libc.h (__fegetround): Remove inline
function.
* sysdeps/powerpc/nofpu/fenv_libc.h: Remove file.
* sysdeps/powerpc/powerpc32/e500/nofpu/fenv_libc.h (__fegetround):
Remove macro.
* sysdeps/ieee754/ldbl-128ibm/s_llrintl.c: Include <fenv.h>
instead of <fenv_libc.h>.
(__llrintl): Call fegetround instead of __fegetround.
* sysdeps/ieee754/ldbl-128ibm/s_llroundl.c: Include <fenv.h>
instead of <fenv_libc.h>.
* sysdeps/ieee754/ldbl-128ibm/s_lrintl.c: Likewise.
(__lrintl): Call fegetround instead of __fegetround.
* sysdeps/ieee754/ldbl-128ibm/s_lroundl.c: Include <fenv.h>
instead of <fenv_libc.h>.
* sysdeps/ieee754/ldbl-128ibm/s_rintl.c: Likewise.
(__rintl): Call fegetround instead of __fegetround.
Various C90 and UNIX98 libm functions call feraiseexcept, which is not
in those standards. This causes linknamespace test failures - except
on x86 / x86_64, where feraiseexcept is inline (for the relevant
constant arguments) in bits/fenv.h.
This patch fixes this by making those functions call __feraiseexcept
instead. All changes are applied to all architectures rather than
considering the possibility that some might not be needed in some
cases (e.g. x86) as it seems most maintainable to keep architectures
consistent.
Where __feraiseexcept does not exist, it is added, with feraiseexcept
made a weak alias; where it is a strong alias, it is made weak.
libm_hidden_def / libm_hidden_proto are used with __feraiseexcept
(this might in some cases improve code generation for existing calls
to __feraiseexcept in some code on some architectures). Where there
are dummy feraiseexcept macros (on architectures without
floating-point exceptions support, to avoid compile errors from
references to undefined FE_* macros), corresponding dummy
__feraiseexcept macros are added. And on x86, to ensure
__feraiseexcept calls still get inlined, the inline function in
bits/fenv.h is refactored so that most of it can be reused in an
inline __feraiseexcept in a separate include/bits/fenv.h.
Calls are changed in C90/UNIX98 functions, but generally not in
functions missing from those standards. They are also changed in
libc_fe* functions (on the basis that those might be used in any libm
function), and in feupdateenv (on the same basis - may be used, via
default libc_*, in any libm function - of course feupdateenv will need
changing to __feupdateenv in a subsequent patch to make that fully
namespace-clean).
No __feraiseexcept is added corresponding to the feraiseexcept in
powerpc bits/fenvinline.h, because that macro definition is
conditional on !defined __NO_MATH_INLINES, and glibc libm is built
with -D__NO_MATH_INLINES, so changing internal calls to use
__feraiseexcept should make no difference.
Tested for x86_64 (testsuite; the only change in disassembly of
installed shared libraries is a slight code reordering in clog10, of
no apparent significance). Also tested for MIPS, where (in the
configuration tested) it eliminates math.h linknamespace failures for
n32 and n64 (some for o32 remain because of other issues).
[BZ #17723]
* include/fenv.h (__feraiseexcept): Use libm_hidden_proto.
* math/fraiseexcpt.c (__feraiseexcept): Use libm_hidden_def.
* sysdeps/aarch64/fpu/fraiseexcpt.c (feraiseexcept): Rename to
__feraiseexcept and define as weak alias of __feraiseexcept. Use
libm_hidden_weak.
* sysdeps/arm/fraiseexcpt.c (feraiseexcept): Likewise.
* sysdeps/hppa/fpu/fraiseexcpt.c (feraiseexcept): Likewise.
* sysdeps/i386/fpu/fraiseexcpt.c (__feraiseexcept): Use
libm_hidden_def.
* sysdeps/ia64/fpu/fraiseexcpt.c (feraiseexcept): Rename to
__feraiseexcept and define as weak alias of __feraiseexcept. Use
libm_hidden_weak.
* sysdeps/m68k/coldfire/fpu/fraiseexcpt.c (feraiseexcept):
Likewise.
* sysdeps/microblaze/math_private.h (__feraiseexcept): New macro.
* sysdeps/mips/fpu/fraiseexcpt.c (feraiseexcept): Rename to
__feraiseexcept and define as weak alias of __feraiseexcept. Use
libm_hidden_weak.
* sysdeps/powerpc/fpu/fraiseexcpt.c (__feraiseexcept): Use
libm_hidden_def.
* sysdeps/powerpc/nofpu/fraiseexcpt.c (__feraiseexcept): Likewise.
* sysdeps/powerpc/powerpc32/e500/nofpu/fraiseexcpt.c
(__feraiseexcept): Likewise.
* sysdeps/s390/fpu/fraiseexcpt.c (feraiseexcept): Rename to
__feraiseexcept and define as weak alias of __feraiseexcept. Use
libm_hidden_weak.
* sysdeps/sh/sh4/fpu/fraiseexcpt.c (feraiseexcept): Likewise.
* sysdeps/sparc/fpu/fraiseexcpt.c (__feraiseexcept): Use
libm_hidden_def.
* sysdeps/tile/math_private.h (__feraiseexcept): New macro.
* sysdeps/unix/sysv/linux/alpha/fraiseexcpt.S (__feraiseexcept):
Use libm_hidden_def.
* sysdeps/x86_64/fpu/fraiseexcpt.c (__feraiseexcept): Use
libm_hidden_def.
(feraiseexcept): Define as weak not strong alias. Use
libm_hidden_weak.
* sysdeps/x86/fpu/bits/fenv.h (__feraiseexcept_invalid_divbyzero):
New inline function. Factored out of ...
(feraiseexcept): ... here. Use __feraiseexcept_invalid_divbyzero.
* sysdeps/x86/fpu/include/bits/fenv.h: New file.
* math/e_scalb.c (invalid_fn): Call __feraiseexcept instead of
feraiseexcept.
* math/w_acos.c (__acos): Likewise.
* math/w_asin.c (__asin): Likewise.
* math/w_ilogb.c (__ilogb): Likewise.
* math/w_j0.c (y0): Likewise.
* math/w_j1.c (y1): Likewise.
* math/w_jn.c (yn): Likewise.
* math/w_log.c (__log): Likewise.
* math/w_log10.c (__log10): Likewise.
* sysdeps/aarch64/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/aarch64/fpu/math_private.h
(libc_feupdateenv_test_aarch64): Likewise.
* sysdeps/alpha/fpu/feupdateenv.c (__feupdateenv): Likewise.
* sysdeps/arm/fenv_private.h (libc_feupdateenv_test_vfp): Likewise.
* sysdeps/arm/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/ia64/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/m68k/fpu/feupdateenv.c (__feupdateenv): Likewise.
* sysdeps/mips/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/powerpc/fpu/e_sqrt.c (__slow_ieee754_sqrt): Likewise.
* sysdeps/s390/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/sh/sh4/fpu/feupdateenv.c (feupdateenv): Likewise.
* sysdeps/sparc/fpu/feupdateenv.c (__feupdateenv): Likewise.
Use of strftime, a C90 function, ends up bringing in wcschr, which is
not a C90 function. Although not a conformance bug (C90 reserves
wcs*), this is still contrary to glibc practice of avoiding relying on
those reservations; this patch arranges for the internal uses to use
__wcschr instead, with wcschr being a weak alias. This is more
complicated than some such patches because of the various IFUNC
definitions of wcschr (which include code redefining libc_hidden_def
in a way that involves creating __GI_wcschr manually and so also needs
to create __GI___wcschr after the change of internal uses to use
__wcschr).
Tested for x86_64 and 32-bit x86 (testsuite, and that disassembly of
installed shared libraries is unchanged by the patch).
2014-12-10 Joseph Myers <joseph@codesourcery.com>
Adhemerval Zanella <azanella@linux.vnet.ibm.com>
[BZ #17634]
* wcsmbs/wcschr.c [!WCSCHR] (wcschr): Define as __wcschr.
Undefine after defining function. Define as weak alias of
__wcschr. Use libc_hidden_weak.
* include/wchar.h (__wcschr): Declare. Use libc_hidden_proto.
* sysdeps/i386/i686/multiarch/wcschr-c.c [IS_IN (libc) && SHARED]
(libc_hidden_def): Also define __GI___wcschr alias.
* sysdeps/i386/i686/multiarch/wcschr.S (wcschr): Rename to
__wcschr and define as weak alias of __wcschr.
* sysdeps/powerpc/power6/wcschr.c [!WCSCHR] (WCSCHR): Define as
__wcschr.
[!WCSCHR] (DEFAULT_WCSCHR): Define.
[DEFAULT_WCSCHR] (__wcschr): Use libc_hidden_def.
[DEFAULT_WCSCHR] (wcschr): Define as weak alias of __wcschr. Use
libc_hidden_weak. Do not use libc_hidden_def.
* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-ppc32.c
[IS_IN (libc) && SHARED] (libc_hidden_def): Also define
__GI___wcschr alias.
* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr.c
[IS_IN (libc)] (wcschr): Define as macro expanding to
__redirect_wcschr.
[IS_IN (libc)] (__wcschr_ppc): Use __redirect_wcschr in typeof.
[IS_IN (libc)] (__wcschr_power6): Likewise.
[IS_IN (libc)] (__wcschr_power7): Likewise.
[IS_IN (libc)] (__libc_wcschr): New. Define with libc_ifunc
instead of wcschr.
[IS_IN (libc)] (wcschr): Undefine and define as weak alias of
__libc_wcschr.
[!IS_IN (libc)] (libc_hidden_def): Do not undefine and redefine.
* sysdeps/powerpc/powerpc64/multiarch/wcschr.c (wcschr): Rename to
__wcschr and define as weak alias of __wcschr. Use
libc_hidden_builtin_def.
* sysdeps/x86_64/wcschr.S (wcschr): Rename to __wcschr and define
as weak alias of __wcschr. Use libc_hidden_weak.
* time/alt_digit.c (_nl_get_walt_digit): Use __wcschr instead of
wcschr.
* time/era.c (_nl_init_era_entries): Likewise.
* conform/Makefile (test-xfail-ISO/time.h/linknamespace): Remove
variable.
(test-xfail-XPG3/time.h/linknamespace): Likewise.
(test-xfail-XPG4/time.h/linknamespace): Likewise.
This patch makes the POWER7 optimized strpbrk generic by using
default doubleword stores to zero the hash, instead of VSX
instructions. Performance on POWER7/POWER8 does not change.
This patch makes the POWER7 optimized strcspn generic by using
default doubleword stores to zero the hash, instead of VSX
instructions. Performance on POWER7/POWER8 does not change.
This patch makes the POWER7 optimized strspn generic by using
default doubleword stores to zero the hash, instead of VSX
instructions. Performance on POWER7/POWER8 machines does not changed.
This patch optimizes strtok and strtok_r for POWERPC64.
A table of 256 characters is created and marked based on
the 'accept' argument and used to check for any occurance on
the input string.Loop unrolling is also used to gain improvements.
On powerpc, atomic_exchange_and_add is implemented without any
barriers. This patchs adds the missing instruction and memory barrier
for acquire and release semanthics.
This sets __HAVE_64B_ATOMICS if provided. It also sets
USE_ATOMIC_COMPILER_BUILTINS to true if the existing atomic ops use the
__atomic* builtins (aarch64, mips partially) or if this has been
tested (x86_64); otherwise, this is set to false so that C11 atomics will
be based on the existing atomic operations.
This patch fixes the build of C mempcpy and stpcpy by disabling the
redirection to __mempcpy and __stpcpy asm names if
NO_MEMPCPY_STPCPY_REDIRECT is defined, and defining that macro in the
relevant source files.
Tested for powerpc32 that the build is fixed.
* include/string.h [NO_MEMPCPY_STPCPY_REDIRECT] (mempcpy): Do not
redeclare with asm name.
[NO_MEMPCPY_STPCPY_REDIRECT] (stpcpy): Likewise.
* string/mempcpy.c (NO_MEMPCPY_STPCPY_REDIRECT): Define before
including <string.h>.
* string/stpcpy.c (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
* sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c
[!NOT_IN_libc] (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
* sysdeps/powerpc/powerpc64/multiarch/mempcpy.c
[!NOT_IN_libc] (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
[SHARED && !NOT_IN_libc] (NO_MEMPCPY_STPCPY_REDIRECT): Likewise.
Completing the removal of the obsolete INTDEF / INTUSE mechanism, this
patch removes the final use - that for _dl_starting_up - replacing it
by rtld_hidden_def / rtld_hidden_proto. Having removed the last use,
the mechanism itself is also removed.
Tested for x86_64 that installed stripped shared libraries are
unchanged by the patch. (This is not much of a test since this
variable is only defined and used in the !HAVE_INLINED_SYSCALLS case.)
[BZ #14132]
* include/libc-symbols.h (INTUSE): Remove macro.
(INTDEF): Likewise.
(INTVARDEF): Likewise.
(_INTVARDEF): Likewise.
(INTDEF2): Likewise.
(INTVARDEF2): Likewise.
* elf/rtld.c [!HAVE_INLINED_SYSCALLS] (_dl_starting_up): Use
rtld_hidden_def instead of INTVARDEF.
* sysdeps/generic/ldsodefs.h [IS_IN_rtld]
(_dl_starting_up_internal): Remove declaration.
(_dl_starting_up): Use rtld_hidden_proto.
* elf/dl-init.c [!HAVE_INLINED_SYSCALLS] (_dl_starting_up): Remove
declaration.
[!HAVE_INLINED_SYSCALLS] (_dl_starting_up_internal): Likewise.
(_dl_init) [!HAVE_INLINED_SYSCALLS]: Don't use INTUSE with
_dl_starting_up.
* elf/dl-writev.h (_dl_writev): Likewise.
* sysdeps/powerpc/powerpc64/dl-machine.h [!HAVE_INLINED_SYSCALLS]
(DL_STARTING_UP_DEF): Use __GI__dl_starting_up instead of
_dl_starting_up_internal.
Continuing the removal of the obsolete INTDEF / INTUSE mechanism, this
patch eliminates its use for _dl_init. Since _dl_init was already
declared with hidden visibility, creating a second hidden alias for it
was completely pointless, so this patch replaces all uses of
_dl_init_internal with plain _dl_init instead of using hidden_proto /
hidden_def (which are only needed when you want a hidden alias for a
non-hidden symbol; it's quite possible there are cases where they are
used but don't need to be because the symbol in question is not part
of the public ABI and is only used within a single library, so using
attributes_hidden instead would suffice).
Tested for x86_64 that installed stripped shared libraries are
unchanged by the patch.
[BZ #14132]
* elf/dl-init.c (_dl_init): Don't use INTDEF.
* sysdeps/aarch64/dl-machine.h (RTLD_START): Use _dl_init instead
of _dl_init_internal.
* sysdeps/alpha/dl-machine.h (RTLD_START): Likewise.
* sysdeps/arm/dl-machine.h (RTLD_START): Likewise.
* sysdeps/hppa/dl-machine.h (RTLD_START): Likewise.
* sysdeps/i386/dl-machine.h (RTLD_START): Likewise.
* sysdeps/ia64/dl-machine.h (RTLD_START): Likewise.
* sysdeps/m68k/dl-machine.h (RTLD_START): Likewise.
* sysdeps/microblaze/dl-machine.h (RTLD_START): Likewise.
* sysdeps/mips/dl-machine.h (RTLD_START): Likewise.
* sysdeps/powerpc/powerpc32/dl-start.S (_start): Likewise.
* sysdeps/s390/s390-32/dl-machine.h (RTLD_START): Likewise.
* sysdeps/s390/s390-64/dl-machine.h (RTLD_START): Likewise.
* sysdeps/sh/dl-machine.h (RTLD_START): Likewise.
* sysdeps/sparc/sparc32/dl-machine.h (RTLD_START): Likewise.
* sysdeps/sparc/sparc64/dl-machine.h (RTLD_START): Likewise.
* sysdeps/tile/dl-start.S (_start): Likewise.
* sysdeps/x86_64/dl-machine.h (RTLD_START): Likewise.
* sysdeps/x86_64/x32/dl-machine.h (RTLD_START): Likewise.
Continuing the removal of the obsolete INTDEF / INTUSE mechanism, this
patch replaces its use for _dl_argv with rtld_hidden_data_def and
rtld_hidden_proto. Some places in .S files that previously used
_dl_argv_internal or INTUSE(_dl_argv) now use __GI__dl_argv directly
(there are plenty of existing examples of such direct use of __GI_*).
A single place in rtld.c previously used _dl_argv without INTUSE,
apparently accidentally, while the rtld_hidden_proto mechanism avoids
such accidential omissions. As a consequence, this patch *does*
change the contents of stripped ld.so. However, the installed
stripped shared libraries are identical to those you get if instead of
this patch you change that single _dl_argv use to use INTUSE, without
any other changes.
Tested for x86_64 (testsuite as well as comparison of installed
stripped shared libraries as described above).
[BZ #14132]
* sysdeps/generic/ldsodefs.h (_dl_argv): Use rtld_hidden_proto.
[IS_IN_rtld] (_dl_argv_internal): Do not declare.
(rtld_progname): Make macro definition unconditional.
* elf/rtld.c (_dl_argv): Use rtld_hidden_data_def instead of
INTDEF.
(dlmopen_doit): Do not use INTUSE with _dl_argv.
(dl_main): Likewise.
* elf/dl-sysdep.c (_dl_sysdep_start): Likewise.
* sysdeps/alpha/dl-machine.h (RTLD_START): Use __GI__dl_argv
instead of _dl_argv_internal.
* sysdeps/powerpc/powerpc32/dl-start.S (_dl_start_user): Use
__GI__dl_argv instead of INTUSE(_dl_argv).
* sysdeps/powerpc/powerpc64/dl-machine.h (RTLD_START): Use
__GI__dl_argv instead of _dl_argv_internal.
Continuing the removal of the obsolete INTDEF / INTUSE mechanism, this
patch removes the use of INTUSE to rename symbols in
sysdeps/powerpc/powerpc32/libgcc-compat.S. As the names in question
are purely internal to this particular object and not used anywhere
else, it doesn't matter at all whether __*_v_glibc20 or __*_internal
is used, so this patch just removes the macros in question.
Tested for powerpc32 that stripped installed shared libraries are
unchanged by this patch.
[BZ #14132]
* sysdeps/powerpc/powerpc32/libgcc-compat.S (__ashldi3_v_glibc20):
Remove macro definition.
(__ashrdi3_v_glibc20): Likewise.
(__lshrdi3_v_glibc20): Likewise.
(__cmpdi2_v_glibc20): Likewise.
(__ucmpdi2_v_glibc20): Likewise.
[!_SOFT_FLOAT && !__NO_FPRS__] (__fixdfdi_v_glibc20): Likewise.
[!_SOFT_FLOAT && !__NO_FPRS__] (__fixsfdi_v_glibc20): Likewise.
[!_SOFT_FLOAT && !__NO_FPRS__] (__fixunsdfdi_v_glibc20): Likewise.
[!_SOFT_FLOAT && !__NO_FPRS__] (__fixunssfdi_v_glibc20): Likewise.
[!_SOFT_FLOAT && !__NO_FPRS__] (__floatdidf_v_glibc20): Likewise.
[!_SOFT_FLOAT && !__NO_FPRS__] (__floatdisf_v_glibc20): Likewise.
This patch adds an optimized memset implementation for POWER8. For
sizes from 0 to 255 bytes, a word/doubleword algorithm similar to
POWER7 optimized one is used.
For size higher than 255 two strategies are used:
1. If the constant is different than 0, the memory is written with
altivec vector instruction;
2. If constant is 0, dbcz instructions are used. The loop is unrolled
to clear 512 byte at time.
Using vector instructions increases throughput considerable, with a
double performance for sizes larger than 1024. The dcbz loops unrolls
also shows performance improvement, by doubling throughput for sizes
larger than 8192 bytes.
This patch cleanups the multiarch bzero for powerpc64 by remove
the multiarch objects and use instead the the memset embedded
implementation presented in each multiarch optimization. The
code generate is essentially the same, but the TB_TOCLESS (which
is not essential).
generic HAVE_RM_CTX implementation which is used for ppc/e500 as well
has introduced calls to fegetenv which should be resolved internally
with in libm
Signed-off-by: Khem Raj <raj.khem@gmail.com>
* sysdeps/powerpc/powerpc32/e500/nofpu/fegetenv.c (fegetenv): Add
libm_hidden_ver.
On powerpc, floating-point environment macros are defined as pointers
to constants in the library that contain the bit-patterns of the
desired environment, instead of being magic constants cast to pointer
type.
For soft-float, the bit-patterns used for fenv_t are not laid out the
same as for hard-float. (e500 has a third layout used; that's not an
ABI issue because these values are only meaningful within a single
process, all of whose glibc libraries must come from the same build of
glibc.) While the __fe_dfl_env value for soft-float was appropriate
for the soft-float fenv_t representation, the other two constants had
the same bit-patterns as for hard-float. Those bit patterns had the
effect of having exceptions already raised, causing
math/test-fenv-return to fail; this patch fixes the patterns used.
(__fe_nonieee_env also had exceptions unmasked, though they should be
masked to match hard-float semantics. Since there is no separate
non-IEEE mode for soft-float, it's most appropriate for
__fe_nonieee_env to be the same as __fe_dfl_env; this patch makes it
an alias.)
Tested for powerpc-nofpu.
[BZ #17261]
* sysdeps/powerpc/nofpu/fenv_const.c (__fe_enabled_env): Change
value to 0.
(__fe_nonieee_env): Define as an alias for __fe_dfl_env.
Now that MEMCPY_OK_FOR_FWD_MEMMOVE should be define on memcopy.h there
is no need to specialized powerpc memmove implementation. This patch
moves the define set to powerpc memcopy and cleanup its definition on
powerpc code.
This patch changes power7 memcpy to use VSX instructions only when
memory is aligned to quardword. It is to avoid unaligned kernel traps
on non-cacheable memory (for instance, memory-mapped I/O).
This patch adds an optimized memmove optimization for POWER7/powerpc64.
Basically the idea is to use the memcpy for POWER7 on non-overlapped
memory regions and a optimized backward memcpy for memory regions
that overlap (similar to the idea of string/memmove.c).
The backward memcpy algorithm used is similar the one use for memcpy for
POWER7, with adjustments done for alignment. The difference is memory
is always aligned to 16 bytes before using VSX/altivec instructions.
This patch removes the powerpc specific logic in memmove and instead
include default implementation with MEMCPY_OK_FOR_FWD_MEMMOVE defined.
This lead in a increase performance, since the constraints to use
memcpy in powerpc code are too restrictive and memcpy can be used for
any forward memmove.
The PAGE_COPY_THRESHOLD macro is meant to be overridden by
architecture-specific pagecopy.h, but it is currently done only by
mach; all other architectures use the default. Check to see if the
macro is defined in addition to whether it is set to a non-zero value.
This patch adds an ifunc power7 strcat symbol that uses the logic on
sysdeps/powerpc/strcat.c but call power7 strlen/strcpy symbols instead
of default ones.
shlib-versions files can contain ABI lines that map triplets to a
canonical ABI name. This name was once used for various purposes
where test baseline files for different ABIs went in a single
directory; now these purposes use sysdeps files, generation of headers
which have per-ABI variants uses abi-variants and related Makefile
variables and the shlib-versions ABI names are unused. This patch
duly removes those lines and associated build system support for them.
Tested for x86_64 (both a full testsuite run and confirming the
installed shared libraries are unchanged by the patch).
* Makeconfig ($(common-objpfx)soversions.mk): Do not generate
abi-name definition.
* scripts/soversions.awk: Do not handle or generate ABI lines.
* shlib-versions: Remove ABI entries.
* sysdeps/powerpc/nofpu/shlib-versions: Remove file.
* sysdeps/x86_64/x32/shlib-versions: Remove ABI entry.
This patch defines ELF_MACHINE_NO_RELA on all architectures. Tested
only on x86_64 to verify that the sources before and after are
identical except for two instructions that pass the current line
number in dl-machine.h to assert_fail.
This patch makes non-ex-ports architectures set base_machine and
machine based on the original configured machine value in preconfigure
fragments, like ex-ports architectures, rather than in the toplevel
configure.ac.
Tested x86 that the disassembly of installed shared libraries is
unchanged by the patch.
* configure.ac (base_machine): Do not set specially for particular
machines here.
* configure: Regenerated.
* sysdeps/powerpc/preconfigure: Move machine and base_machine
settings from configure.ac.
* sysdeps/i386/preconfigure: New file.
* sysdeps/s390/preconfigure: Likewise.
* sysdeps/sh/preconfigure: Likewise.
* sysdeps/sparc/preconfigure: Likewise.
Linux commit dd58a092c4202f2bd490adab7285b3ff77f8e467 added the
PPC_FEATURE2_VEC_CRYPTO auvx capability to indicate whether to
hardware supports vector crypto hardware instructions. This patch
adds its definition to powerpc hwcap bits.
This patch makes files using __ASSUME_* macros include
<kernel-features.h> explicitly, rather than relying on some other
header (such as tls.h, lowlevellock.h or pthreadP.h) to include it
implicitly. (I omitted cases where I've already posted or am testing
the patch that stops the file from needing __ASSUME_* at all.) This
accords with the general principle of making source files include the
headers for anything they use, and also helps make it safe to remove
<kernel-features.h> includes from any file that doesn't use
__ASSUME_* (some of those may be stray includes left behind after
increasing the minimum kernel version, others may never have been
needed or may have become obsolete after some other change).
Tested x86_64 that the disassembly of installed shared libraries is
unchanged by this patch.
* nptl/pthread_cond_wait.c: Include <kernel-features.h>.
* nptl/pthread_rwlock_timedrdlock.c: Likewise.
* nptl/pthread_rwlock_timedwrlock.c: Likewise.
* nptl/sysdeps/unix/sysv/linux/lowlevelrobustlock.c: Likewise.
* nscd/nscd.c: Likewise.
* sysdeps/i386/nptl/tcb-offsets.sym: Likewise.
* sysdeps/powerpc/nptl/tcb-offsets.sym: Likewise.
* sysdeps/sh/nptl/tcb-offsets.sym: Likewise.
* sysdeps/x86_64/nptl/tcb-offsets.sym: Likewise.
Optimization is achieved on 8 byte aligned strings with double word
comparison using cmpb instruction. On unaligned strings loop unrolling
is applied for Power7 gain.
This patch fixes the optimized ppc64/power7 strncat strlen call for
static build without ifunc enabled. The strlen symbol to call in such
situation is just strlen, instead of __GI_strlen (since the __GI_
alias is just created for shared objects).
This patch fixes a similar issue to
736c304a1a, where for PPC32 if the symbol
is defined as hidden (memchr) then compiler will create a local branc
(symbol@local) and the linker will not create a required PLT call to
make the ifunc work. It changes the default hidden symbol (__GI_memchr)
to default memchr symbol for powerpc32 (__memchr_ppc32).
This patch fixes the __copysignf optimized macro meant to internal libm
usage when used with constant value. Without the explicit cast to
float, if it is used with const double value (for instance, on
s_casinhf.c) double constants will be used and it may lead to precision
issues in some algorithms.
It fixes the following failures on PPC64/POWER7:
Failure: Test: Real part of: cacos_downward (inf + 0 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_downward (inf - 0 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_downward (inf + 0.5 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_downward (inf - 0.5 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_towardzero (inf + 0 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_towardzero (inf - 0 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_towardzero (inf + 0.5 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_towardzero (inf - 0.5 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
The optimization is achieved by following techniques:
> data alignment [gain from aligned memory access on read/write]
> POWER7 gains performance with loop unrolling/unwinding
[gain by reduction of branch penalty].
> zero padding done by calling optimized memset
This patch changes de default symbol redirection for internal call of
memcpy, memset, memchr, and strlen to the IFUNC resolved ones. The
performance improvement is noticeable in algorithms that uses these
symbols extensible, like the regex functions.
This patch optimizes the FPSCR update on exception and rounding change
functions by just updating its value if new value if different from
current one. It also optimizes fedisableexcept and feenableexcept by
removing an unecessary FPSCR read.
This patch fixes some powerpc32 and powerpc64 builds with
--disable-multi-arch option along with different --with-cpu=powerN.
It cleanups the Implies directories by removing the multiarch
folder for non multiarch config and also fixing two assembly
implementations: powerpc64/power7/strncat.S that is calling the
wrong strlen; and power8/fpu/s_isnan.S that misses the hidden_def and
weak_alias directives.
This patch fixes the powerpc32 optimized nearbyint/nearbyintf bogus
results for FE_DOWNWARD rounding mode. This is due wrong instructions
sequence used in the rounding calculation (two subtractions instead of
adition and a subtraction).
Fixes BZ#16815.
This patch add an optimized strpbrk for POWER7 by using a different
algorithm than default implementation: it constructs a table based on
the 'accept' argument and use this table to check for any occurance on
the input string. The idea is similar as x86_64 uses.
For PowerPC some tunings were added, such as unroll loops and memory
clear using VSX instructions.
This patch add a optimized strcspn for POWER7 by using a different
algorithm than default implementation: it constructs a table based on
the 'accept' argument and use this table to check for any occurance
on the input string. The idea is similar as x86_64 uses.
For PowerPC some tunings were added, such as unroll loops and align
stack memory to table to 16 bytes (so VSX clean can ran without
alignment issues).
The roundl assembly implementation
(sysdeps/powerpc/powerpc64/fpu/s_roundl.S)
returns wrong results for some inputs where first double is a exact
integer and the precision is determined by second long double.
Checking on implementation comments and history, I am very confident the
assembly implementation was based on a version before commit
5c68d40169 that fixes BZ#2423 (Errors in
long double (ldbl-128ibm) rounding functions in glibc-2.4).
By just removing the implementation and make the build select
sysdeps/ieee754/ldbl-128ibm/s_roundl.c instead fixes the failing math.
This fixes 16707.
The nearbyintl assembly implementation
(sysdeps/powerpc/powerpc64/fpu/s_nearbyintl.S)
returns wrong results for some inputs where first double is a exact
integer and the precision is determined by second long double.
Checking on implementation comments and history, I am very confident the
assembly implementation was based on a version before commit
5c68d40169 that fixes BZ#2423 (Errors in
long double (ldbl-128ibm) rounding functions in glibc-2.4).
By just removing the implementation and make the build select
sysdeps/ieee754/ldbl-128ibm/s_nearbyintl.c instead fixes the failing
math.
Fixes BZ#16706.
The ceill assembly implementation (sysdeps/powerpc/powerpc64/fpu/s_ceill.S)
returns wrong results for some inputs where first double is a exact
integer and the precision is determined by second long double.
Checking on implementation comments and history, I am very confident the
assembly implementation was based on a version before commit
5c68d40169 that fixes BZ#2423 (Errors in
long double (ldbl-128ibm) rounding functions in glibc-2.4).
By just removing the implementation and make the build select
sysdeps/ieee754/ldbl-128ibm/s_ceill.c instead fixes the failing math.
Fixes BZ#16701.
This patch fixes an issue for powerpc32-fpu static build which fails
with an 'bzero' undefined reference. This patch adds bzero ifunc selector
for static builds and fixes the '__bzero_ppc' reference to default
memset symbol (since static memset build does not provide ifunc
selector).
Fixes BZ#16689.
This patch fixes an issue for powerpc64[le] static build where __bzero
is definied in multiple places (memset-ppc64.o and bzero.o). It is now
defined only in bzero.o and memset-ppc64.o only defined __bzero_ppc for
both dynamic and static library.
Fixes BZ#16683.
The optimization is achieved by following techniques:
> hashing of needle.
> hashing avoids scanning of duplicate entries in needle across the string.
> initializing the hash table with Vector instructions (VSX) by quadword access.
> unrolling when scanning for character in string across hash table.
The optimization is achieved by following techniques:
1. Doubleword aligned memory access and compares using
cmpb instruction.
2. Loop unrolling for byte load/store.
3. CPU pre-fetch to avoid cache miss.
This patch fix the optimized powerpc-fpu modf/modff implementation
when using in non-default rounding mode where the zero sign is not
as expected. It fixes the libm testsuite tests
modf_downward (0) == 0.00000000000000000000e+00
modf_downward (20) == 0.00000000000000000000e+00
modf_downward (21) == 0.00000000000000000000e+00
Where the sign returned was negative.
As recently discussed
<https://sourceware.org/ml/libc-alpha/2014-02/msg00670.html>, it
doesn't seem particularly useful for libm-test-ulps files to contain
huge amounts of data on ulps for individual tests; just the global
maximum observed ulps for each function, together with the
verification of exceptions, errno and special results such as
infinities and NaNs for each test, suffices to verify that a
function's behavior on the given test inputs is within the expected
accuracy. Removing this data reduces source tree churn caused by
updates to these files when libm tests are added, and reduces the
frequency with which testsuite additions actually need libm-test-ulps
changes at all.
Accordingly, this patch removes that data, so that individual tests
get checked against the global bounds for the given function and only
generate an error if those are exceeded. Tested x86_64 (including
verifying that if an ulps value is artificially reduced, the tests do
indeed fail as they should and "make regen-ulps" generates the
expected changes).
* math/libm-test.inc (struct ulp_data): Don't refer to ulps for
individual tests in comment.
(libm-test-ulps.h): Don't refer to test_ulps in #include comment.
(prev_max_error): New variable.
(prev_real_max_error): Likewise.
(prev_imag_max_error): Likewise.
(compare_ulp_data): Don't refer to test names in comment.
(find_test_ulps): Remove function.
(find_function_ulps): Likewise.
(find_complex_function_ulps): Likewise.
(init_max_error): Take function name as argument. Look up ulps
for that function.
(print_ulps): Remove function.
(print_max_error): Use prev_max_error instead of calling
find_function_ulps.
(print_complex_max_error): Use prev_real_max_error and
prev_imag_max_error instead of calling find_complex_function_ulps.
(check_float_internal): Take max_ulp parameter instead of calling
find_test_ulps. Don't call print_ulps.
(check_float): Update call to check_float_internal.
(check_complex): Update calls to check_float_internal.
(START): Pass argument to init_max_error.
* math/gen-libm-test.pl (%results): Don't include "kind"
information.
(parse_ulps): Don't handle ulps of individual tests.
(print_ulps_file): Likewise.
(output_ulps): Likewise.
* math/README.libm-test: Update.
* manual/libm-err-tab.pl (parse_ulps): Don't handle ulps of
individual tests.
* sysdeps/aarch64/libm-test-ulps: Remove individual test ulps.
* sysdeps/alpha/fpu/libm-test-ulps: Likewise.
* sysdeps/arm/libm-test-ulps: Likewise.
* sysdeps/i386/fpu/libm-test-ulps: Likewise.
* sysdeps/ia64/fpu/libm-test-ulps: Likewise.
* sysdeps/m68k/coldfire/fpu/libm-test-ulps: Likewise.
* sysdeps/m68k/m680x0/fpu/libm-test-ulps: Likewise.
* sysdeps/microblaze/libm-test-ulps: Likewise.
* sysdeps/mips/mips32/libm-test-ulps: Likewise.
* sysdeps/mips/mips64/libm-test-ulps: Likewise.
* sysdeps/powerpc/fpu/libm-test-ulps: Likewise.
* sysdeps/powerpc/nofpu/libm-test-ulps: Likewise.
* sysdeps/s390/fpu/libm-test-ulps: Likewise.
* sysdeps/sh/libm-test-ulps: Likewise.
* sysdeps/sparc/fpu/libm-test-ulps: Likewise.
* sysdeps/tile/libm-test-ulps: Likewise.
* sysdeps/x86_64/fpu/libm-test-ulps: Likewise.
* sysdeps/hppa/fpu/libm-test-ulps: Remove individual test ulps.
This patch optimizes strrchr() for ppc64. It uses aligned memory
access along with cmpb instruction and CPU prefetch to avoid
cache misses for speed improvement.
This patch add a optimized llround/llroundf implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
This patch add a optimized llrint/llrintf implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
This patch add a optimized finite/finitef implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
This patch add a optimized isinf/isinff implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
This patch add a optimized isnan/isnanf implementation for POWER8
using the new Move From VSR Doubleword instruction to gains some
cycles from FP to GRP register move.
elf/tst-auxv.c includes misc/sys/auxv.h, which ends up not actually
being included due to the guard overlap, and getauxval becomes an
implicit declaration and implicit pointer conversion which means, at
best, the test isn't actually testing what it thinks it is and, at
worst, it'll crash and burn on platforms where implict pointer
conversion is a Very Bad Thing.
* sysdeps/powerpc/bits/hwcap.h: Allow _SYSDEPS_SYSDEP_H guard as a
synonym for _SYS_AUXV_H to allow direct inclusion.
* sysdeps/sparc/bits/hwcap.h: Likewise.
* sysdeps/powerpc/sysdep.h: Define _SYSDEPS_SYSDEP_H instead of
_SYS_AUXV_H so we can include sysdep.h and sys/auxv.h together.
* sysdeps/sparc/sysdep.h: Likewise.
IEEE 754-2008 defines two ways in which tiny results can be detected,
"before rounding" (based on the infinite-precision result) and "after
rounding" (based on the result when rounded to normal precision as if
the exponent range were unbounded). All binary operations on an
architecture must use the same choice of how tininess is detected.
soft-fp has so far implemented only before-rounding tininess
detection. This patch adds support for after-rounding tininess
detection. A new macro _FP_TININESS_AFTER_ROUNDING is added that
sfp-machine.h must define (soft-fp is meant to be self-contained so
the existing tininess.h files aren't used here, though the information
going in sfp-machine.h has been taken from them). The soft-fp macros
dealing with raising underflow exceptions then handle the cases where
the choice matters specially, rounding a copy of the input to the
appropriate precision to see if a value that's tiny before rounding
isn't tiny after rounding.
Tested for mips64 using GCC trunk (which now uses soft-fp on MIPS, so
supporting exceptions and rounding modes for long double where not
previously supported - this is the immediate motivation for doing this
patch now) together with (a) a patch to sysdeps/mips/math-tests.h to
enable exceptions / rounding modes tests for long double for GCC 4.9
and later, and (b) corresponding changes applied to libgcc's soft-fp
and sfp-machine.h files. In the libgcc context this is also tested on
x86_64 (also an after-rounding architecture) with testcases for
__float128 that I intend to add to the GCC testsuite when updating
soft-fp there.
(To be clear: this patch does not fix any glibc bugs that were
user-visible in past releases, since after-rounding architectures
didn't use soft-fp in any affected case with support for
floating-point exceptions - so there is no corresponding Bugzilla bug.
Rather, it works together with the GCC changes to use soft-fp on MIPS
to allow previously absent long double functionality to work properly,
and allows soft-fp to be used in glibc on after-rounding architectures
in cases where it couldn't previously be used.)
* soft-fp/op-common.h (_FP_DECL): Mark exponent as possibly
unused.
(_FP_PACK_SEMIRAW): Determine tininess based on rounding shifted
value if _FP_TININESS_AFTER_ROUNDING and unrounded value is in
subnormal range.
(_FP_PACK_CANONICAL): Determine tininess based on rounding to
normal precision if _FP_TININESS_AFTER_ROUNDING and unrounded
value has largest subnormal exponent.
* soft-fp/soft-fp.h [FP_NO_EXCEPTIONS]
(_FP_TININESS_AFTER_ROUNDING): Undefine and redefine to 0.
* sysdeps/aarch64/soft-fp/sfp-machine.h
(_FP_TININESS_AFTER_ROUNDING): New macro.
* sysdeps/alpha/soft-fp/sfp-machine.h
(_FP_TININESS_AFTER_ROUNDING): Likewise.
* sysdeps/arm/soft-fp/sfp-machine.h (_FP_TININESS_AFTER_ROUNDING):
Likewise.
* sysdeps/mips/mips64/soft-fp/sfp-machine.h
(_FP_TININESS_AFTER_ROUNDING): Likewise.
* sysdeps/mips/soft-fp/sfp-machine.h
(_FP_TININESS_AFTER_ROUNDING): Likewise.
* sysdeps/powerpc/soft-fp/sfp-machine.h
(_FP_TININESS_AFTER_ROUNDING): Likewise.
* sysdeps/sh/soft-fp/sfp-machine.h (_FP_TININESS_AFTER_ROUNDING):
Likewise.
* sysdeps/sparc/sparc32/soft-fp/sfp-machine.h
(_FP_TININESS_AFTER_ROUNDING): Likewise.
* sysdeps/sparc/sparc64/soft-fp/sfp-machine.h
(_FP_TININESS_AFTER_ROUNDING): Likewise.
* sysdeps/tile/sfp-machine.h (_FP_TININESS_AFTER_ROUNDING):
Likewise.
This patch creates implicit rules to match the abifiles if
abilist-pattern is defined in the architecture Makefile. This allows
machine specific Makefiles to define different abifiles names
(for instance *-le.abilist for powerpc64le).
The truncl assembly implementation (sysdeps/powerpc/powerpc64/fpu/s_truncl.S)
returns wrong results for some inputs where first double is a exact integer
and the precision is determined by second long double.
Checking on implementation comments and history, I am very confident the
assembly implementation was based on a version before commit
5c68d40169 that fixes BZ#2423 (Errors in
long double (ldbl-128ibm) rounding functions in glibc-2.4).
By just removing the implementation and make the build select
sysdeps/ieee754/ldbl-128ibm/s_truncl.c instead it fixes tgammal
issues regarding wrong result sign.
This patch fixes bug 16390, incorrect signs of zero results from
ldbl-128ibm atan2l, soft-float only. The problem is a longstanding
GCC bug with fabsl not being correct for signed zero for soft float,
and the fix is using -fno-builtin-fabsl as a workaround, as already
done for various other source files. Tested powerpc-nofpu.
* sysdeps/powerpc/nofpu/Makefile [$(subdir) = math]
(CFLAGS-e_atan2l.c): Use -fno-builtin-fabsl.
sysdeps/powerpc/powerpc32/libgcc-compat.S makes certain symbols that
glibc once accidentally reexported from libgcc into compat symbols.
Where the exports were purely accidental, this is the right thing to
do. However, for powerpc-nofpu the soft-fp symbols are deliberately
exported from libc, given public versions in
sysdeps/powerpc/nofpu/Versions and used by libm in preference to the
libgcc versions that do not support the software exceptions and
rounding modes. The libc versions should also be usable by user
programs, though normally libgcc gets linked in first (meaning,
effectively, that the <fenv.h> functions are broken as regards their
expected effects on user arithmetic).
A longstanding todo item is to remove the functions in question from
libgcc (when built with recent enough glibc) - that is, remove them
from static libgcc and make them compat symbols in shared libgcc - so
that this works properly (this is one of the items mentioned at
<http://gcc.gnu.org/wiki/Software_floating_point> - parts of that page
are obviously out of date, but this item still applies). Doing this
requires first that the functions are actually available from libc for
new links, not just as compat symbols.
This patch stops the symbols in question being compat symbols for
powerpc-nofpu. The nofpu Versions entries for them are removed (the
symbols never were exported at GLIBC_2.3.2, only GLIBC_2.0, because
the compat symbols took precedence).
Tested powerpc-nofpu. The symbols are no longer compat symbols and
libm.so now properly gets undefined references to them (resolved to
libc.so) instead of the libgcc copies getting linked into libm as
before.
* sysdeps/powerpc/powerpc32/libgcc-compat.S
[_SOFT_FLOAT || __NO_FPRS__] (__fixdfdi_v_glibc20): Do not define
as a macro and a compat symbol.
[_SOFT_FLOAT || __NO_FPRS__] (__fixsfdi_v_glibc20): Likewise.
[_SOFT_FLOAT || __NO_FPRS__] (__fixunsdfdi_v_glibc20): Likewise.
[_SOFT_FLOAT || __NO_FPRS__] (__fixunssfdi_v_glibc20): Likewise.
[_SOFT_FLOAT || __NO_FPRS__] (__floatdidf_v_glibc20): Likewise.
[_SOFT_FLOAT || __NO_FPRS__] (__floaddisf_v_glibc20): Likewise.
[HAVE_DOT_HIDDEN && (_SOFT_FLOAT || __NO_FPRS__)] (__fixdfdi): Do
not use .hidden.
[HAVE_DOT_HIDDEN && (_SOFT_FLOAT || __NO_FPRS__)] (__fixsfdi):
Likewise.
[HAVE_DOT_HIDDEN && (_SOFT_FLOAT || __NO_FPRS__)] (__fixunsdfdi):
Likewise.
[HAVE_DOT_HIDDEN && (_SOFT_FLOAT || __NO_FPRS__)] (__fixunssfdi):
Likewise.
[HAVE_DOT_HIDDEN && (_SOFT_FLOAT || __NO_FPRS__)] (__floaddidf):
Likewise.
[HAVE_DOT_HIDDEN && (_SOFT_FLOAT || __NO_FPRS__)] (__floaddisf):
Likewise.
* sysdeps/powerpc/nofpu/Versions (libc): Remove __fixdfdi,
__fixsfdi, __fixunsdfdi, __fixunssfdi, __floatdidf and __floatdisf
from GLIBC_2.3.2.
For PPC64, all the wrappers at sysdeps are superfluous: they are
basically the same implementation from math/w_sqrt.c with the
'#ifdef _IEEE_LIBM'. And the power4 version just force the 'fsqrt'
instruction utilization with an inline assembly, which is already
handled by math_private.h __ieee754_sqrt implementation.
This patch add optimized __mpn_addmul, __mpn_addsub, __mpn_lshift, and
__mpn_mul_1 implementations for PowerPC64. They are originally from GMP
with adjustments for GLIBC.
This patch add static probes for setjmp/longjmp in the way gdb expects,fixing
the gdb.base/longjmp.exp gdb testcases.
It changes the symbol_name and use macros to to avoid change the probe names
and ending up adding more logic on GDB (since with the expected name
GDB work seamlessly).
The ELFv2 ABI changes the calling convention by passing and returning
structures in registers in more cases than the old ABI:
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01145.htmlhttp://gcc.gnu.org/ml/gcc-patches/2013-11/msg01147.html
For the most part, this does not affect glibc, since glibc assembler
files do not use structure parameters / return values. However, one
place is affected: the LD_AUDIT interface provides a structure to
the audit routine that contains all registers holding function
argument and return values for the intercepted PLT call.
Since the new ABI now sometimes uses registers to return values
that were never used for this purpose in the old ABI, this structure
has to be extended. To force audit routines to be modified for the
new ABI if necessary, the patch defines v2 variants of the la_ppc64
types and routines.
In addition, the patch contains two unrelated changes to the
PLT trampoline routines: it fixes a bug where FPR return values
were stored in the wrong place, and it removes the unnecessary
save/restore of CR.
This updates glibc for the changes in the ELFv2 relating to the
stack frame layout. These are described in more detail here:
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01149.htmlhttp://gcc.gnu.org/ml/gcc-patches/2013-11/msg01146.html
Specifically, the "compiler and linker doublewords" were removed,
which has the effect that the save slot for the TOC register is
now at offset 24 rather than 40 to the stack pointer.
In addition, a function may now no longer necessarily assume that
its caller has set up a 64-byte register save area its use.
To address the first change, the patch goes through all assembler
files and replaces immediate offsets in instructions accessing the
ABI-defined stack slots by symbolic offsets. Those already were
defined in ucontext_i.sym and used in some of the context routines,
but that doesn't really seem like the right place for those defines.
The patch instead defines those symbolic offsets in sysdeps.h,
in two variants for the old and new ABI, and uses them systematically
in all assembler files, not just the context routines.
The second change only affected a few assembler files that used
the save area to temporarily store some registers. In those
cases where this happens within a leaf function, this patch
changes the code to store those registers to the "red zone"
below the stack pointer. Otherwise, the functions already allocate
a stack frame, and the patch changes them to add extra space in
these frames as temporary space for the ELFv2 ABI.
This is a follow-on to the previous patch to support the ELFv2 ABI in the
dynamic loader, split off into its own patch since it is just an optional
optimization.
In the ELFv2 ABI, most functions define both a global and a local entry
point; the local entry requires r2 to be already set up by the caller
to point to the callee's TOC; while the global entry does not require
the caller to know about the callee's TOC, but it needs to set up r12
to the callee's entry point address.
Now, when setting up a PLT slot, the dynamic linker will usually need
to enter the target function's global entry point. However, if the
linker can prove that the target function is in the same DSO as the
PLT slot itself, and the whole DSO only uses a single TOC (which the
linker will let ld.so know via a DT_PPC64_OPT entry), then it is
possible to actually enter the local entry point address into the
PLT slot, for a slight improvement in performance.
Note that this uncovered a problem on the first call via _dl_runtime_resolve,
because that routine neglected to restore the caller's TOC before calling
the target function for the first time, since it assumed that function
would always reload its own TOC anyway ...
This patch adds support for the ELFv2 ABI feature to remove function
descriptors. See this GCC patch for in-depth discussion:
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01141.html
This mostly involves two types of changes: updating assembler source
files to the new logic, and updating the dynamic loader.
After the refactoring in the previous patch, most of the assembler source
changes can be handled simply by providing ELFv2 versions of the
macros in sysdep.h. One somewhat non-obvious change is in __GI__setjmp:
this used to "fall through" to the immediately following __setjmp ENTRY
point. This is no longer safe in the ELFv2 since ENTRY defines both
a global and a local entry point, and you cannot simply fall through
to a global entry point as it requires r12 to be set up.
Also, makecontext needs to be updated to set up registers according to
the new ABI for calling into the context's start routine.
The dynamic linker changes mostly consist of removing special code
to handle function descriptors. We also need to support the new PLT
and glink format used by the the ELFv2 linker, see:
https://sourceware.org/ml/binutils/2013-10/msg00376.html
In addition, the dynamic linker now verifies that the dynamic libraries
it loads match its own ABI.
The hack in VDSO_IFUNC_RET to "synthesize" a function descriptor
for vDSO routines is also no longer necessary for ELFv2.
This is the first patch to support the new ELFv2 ABI in glibc.
As preparation, this patch simply refactors some of the powerpc64 assembler
code to move all code related to creating function descriptors (.opd section)
or using function descriptors (function pointer call) into a central place
in sysdep.h.
Note that most locations creating .opd entries were already using macros
in sysdep.h, this patch simply extends this to the remaining places.
No relevant change in generated code expected.
This patch updates glibc in accordance with the binutils patch checked in here:
https://sourceware.org/ml/binutils/2013-10/msg00372.html
This changes the various R_PPC64_..._HI and _HA relocations to report
32-bit overflows. The motivation is that existing uses of @h / @ha
are to build up 32-bit offsets (for the "medium model" TOC access
that GCC now defaults to), and we'd really like to see failures at
link / load time rather than silent truncations.
For those rare cases where a modifier is needed to build up a 64-bit
constant, new relocations _HIGH / _HIGHA are supported.
The patch also fixes a bug in overflow checking for the R_PPC64_ADDR30
and R_PPC64_ADDR32 relocations.
This patch helps some math functions performance by adding the libc_fexxx
variant of inline functions to handle both FPU round and exception set/restore
and by using them on the libc_fexxx_ctx functions. It is based on already coded
fexxx family functions for PPC with fpu.
Here is the summary of performance improvements due this patch (measured on a
POWER7 machine):
Before:
cos(): ITERS:9.5895e+07: TOTAL:5116.03Mcy, MAX:77.6cy, MIN:49.792cy, 18744 calls/Mcy
exp(): ITERS:2.827e+07: TOTAL:5187.15Mcy, MAX:494.018cy, MIN:38.422cy, 5450.01 calls/Mcy
pow(): ITERS:6.1705e+07: TOTAL:5144.26Mcy, MAX:171.95cy, MIN:29.935cy, 11994.9 calls/Mcy
sin(): ITERS:8.6898e+07: TOTAL:5117.06Mcy, MAX:83.841cy, MIN:46.582cy, 16982 calls/Mcy
tan(): ITERS:2.9473e+07: TOTAL:5115.39Mcy, MAX:191.017cy, MIN:172.352cy, 5761.63 calls/Mcy
After:
cos(): ITERS:2.05265e+08: TOTAL:5111.37Mcy, MAX:78.754cy, MIN:24.196cy, 40158.5 calls/Mcy
exp(): ITERS:3.341e+07: TOTAL:5170.84Mcy, MAX:476.317cy, MIN:15.574cy, 6461.23 calls/Mcy
pow(): ITERS:7.6153e+07: TOTAL:5129.1Mcy, MAX:147.5cy, MIN:30.916cy, 14847.2 calls/Mcy
sin(): ITERS:1.58816e+08: TOTAL:5115.11Mcy, MAX:1490.39cy, MIN:22.341cy, 31048.4 calls/Mcy
tan(): ITERS:3.4964e+07: TOTAL:5114.18Mcy, MAX:177.422cy, MIN:146.115cy, 6836.68 calls/Mcy
Autoconf has been deprecating configure.in for quite a long time.
Rename all our configure.in and preconfigure.in files to .ac.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
This patch intends to unify both strcpy and stpcpy implementationsi
for PPC64 and PPC64/POWER7. The idead default powerpc64 implementation
is to provide both doubleword and word aligned memory access.
For PPC64/POWER7 is also provide doubleword and word memory access,
remove the branch hints, use the cmpb instruction for compare
doubleword/words, and add an optimization for inputs of same alignment.
* sysdeps/powerpc/powerpc32/dl-machine.c (__process_machine_rela):
Use stdint types in rather than __attribute__((mode())).
* sysdeps/powerpc/powerpc64/dl-machine.h (elf_machine_rela): Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00105.html
Like strnlen, memchr and memrchr had a number of defects fixed by this
patch as well as adding little-endian support. The first one I
noticed was that the entry to the main loop needlessly checked for
"are we done yet?" when we know the size is large enough that we can't
be done. The second defect I noticed was that the main loop count was
wrong, which in turn meant that the small loop needed to handle an
extra word. Thirdly, there is nothing to say that the string can't
wrap around zero, except of course that we'd normally hit a segfault
on trying to read from address zero. Fixing that simplified a number
of places:
- /* Are we done already? */
- addi r9,r8,8
- cmpld r9,r7
- bge L(null)
becomes
+ cmpld r8,r7
+ beqlr
However, the exit gets an extra test because I test for being on the
last word then if so whether the byte offset is less than the end.
Overall, the change is a win.
Lastly, memrchr used the wrong cache hint.
* sysdeps/powerpc/powerpc64/power7/memchr.S: Replace rlwimi with
insrdi. Make better use of reg selection to speed exit slightly.
Schedule entry path a little better. Remove useless "are we done"
checks on entry to main loop. Handle wrapping around zero address.
Correct main loop count. Handle single left-over word from main
loop inline rather than by using loop_small. Remove extra word
case in loop_small caused by wrong loop count. Add little-endian
support.
* sysdeps/powerpc/powerpc32/power7/memchr.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/memrchr.S: Likewise. Use proper
cache hint.
* sysdeps/powerpc/powerpc32/power7/memrchr.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/rawmemchr.S: Add little-endian
support. Avoid rlwimi.
* sysdeps/powerpc/powerpc32/power7/rawmemchr.S: Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00104.html
One of the things I noticed when looking at power7 timing is that rlwimi
is cracked and the two resulting insns have a register dependency.
That makes it a little slower than the equivalent rldimi.
* sysdeps/powerpc/powerpc64/memset.S: Replace rlwimi with
insrdi. Formatting.
* sysdeps/powerpc/powerpc64/power4/memset.S: Likewise.
* sysdeps/powerpc/powerpc64/power6/memset.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/memset.S: Likewise.
* sysdeps/powerpc/powerpc32/power4/memset.S: Likewise.
* sysdeps/powerpc/powerpc32/power6/memset.S: Likewise.
* sysdeps/powerpc/powerpc32/power7/memset.S: Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00103.html
LIttle-endian support for memcpy. I spent some time cleaning up the
64-bit power7 memcpy, in order to avoid the extra alignment traps
power7 takes for little-endian. It probably would have been better
to copy the linux kernel version of memcpy.
* sysdeps/powerpc/powerpc32/power4/memcpy.S: Add little endian support.
* sysdeps/powerpc/powerpc32/power6/memcpy.S: Likewise.
* sysdeps/powerpc/powerpc32/power7/memcpy.S: Likewise.
* sysdeps/powerpc/powerpc32/power7/mempcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/memcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/power4/memcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/power6/memcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/memcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/mempcpy.S: Likewise. Make better
use of regs. Use power7 mtocrf. Tidy function tails.
http://sourceware.org/ml/libc-alpha/2013-08/msg00102.html
This is a rather large patch due to formatting and renaming. The
formatting changes were to make it possible to compare power7 and
power4 versions of memcmp. Using different register defines came
about while I was wrestling with the code, trying to find spare
registers at one stage. I found it much simpler if we refer to a reg
by the same name throughout a function, so it's better if short-term
multiple use regs like rTMP are referred to using their register
number. I made the cr field usage changes when attempting to reload
rWORDn regs in the exit path to byte swap before comparing when
little-endian. That proved a bad idea due to the pipelining involved
in the main loop; Offsets to reload the regs were different first
time around the loop.. Anyway, I left the cr field usage changes in
place for consistency.
Aside from these more-or-less cosmetic changes, I fixed a number of
places where an early exit path restores regs unnecessarily, removed
some dead code, and optimised one or two exits.
* sysdeps/powerpc/powerpc64/power7/memcmp.S: Add little-endian support.
Formatting. Consistently use rXXX register defines or rN defines.
Use early exit labels that avoid restoring unused non-volatile regs.
Make cr field use more consistent with rWORDn compares. Rename
regs used as shift registers for unaligned loop, using rN defines
for short lifetime/multiple use regs.
* sysdeps/powerpc/powerpc64/power4/memcmp.S: Likewise.
* sysdeps/powerpc/powerpc32/power7/memcmp.S: Likewise. Exit with
addi 1,1,64 to pop stack frame. Simplify return value code.
* sysdeps/powerpc/powerpc32/power4/memcmp.S: Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00101.html
Adds little-endian support to optimised strchr assembly. I've also
tweaked the big-endian code a little. In power7/strchr.S there's a
check in the tail of the function that we didn't match 0 before
finding a c match, done by comparing leading zero counts. It's just
as valid, and quicker, to compare the raw output from cmpb.
Another little tweak is to use rldimi/insrdi in place of rlwimi for
the power7 strchr functions. Since rlwimi is cracked, it is a few
cycles slower. rldimi can be used on the 32-bit power7 functions
too.
* sysdeps/powerpc/powerpc64/power7/strchr.S (strchr): Add little-endian
support. Correct typos, formatting. Optimize tail. Use insrdi
rather than rlwimi.
* sysdeps/powerpc/powerpc32/power7/strchr.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/strchrnul.S (__strchrnul): Add
little-endian support. Correct typos.
* sysdeps/powerpc/powerpc32/power7/strchrnul.S: Likewise. Use insrdi
rather than rlwimi.
* sysdeps/powerpc/powerpc64/strchr.S (rTMP4, rTMP5): Define. Use
in loop and entry code to keep "and." results.
(strchr): Add little-endian support. Comment. Move cntlzd
earlier in tail.
* sysdeps/powerpc/powerpc32/strchr.S: Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00100.html
The strcpy changes for little-endian are quite straight-forward, just
a matter of rotating the last word differently.
I'll note that the powerpc64 version of stpcpy is just begging to be
converted to use 64-bit loads and stores..
* sysdeps/powerpc/powerpc64/strcpy.S: Add little-endian support:
* sysdeps/powerpc/powerpc32/strcpy.S: Likewise.
* sysdeps/powerpc/powerpc64/stpcpy.S: Likewise.
* sysdeps/powerpc/powerpc32/stpcpy.S: Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00099.html
More little-endian support. I leave the main strcmp loops unchanged,
(well, except for renumbering rTMP to something other than r0 since
it's needed in an addi insn) and modify the tail for little-endian.
I noticed some of the big-endian tail code was a little untidy so have
cleaned that up too.
* sysdeps/powerpc/powerpc64/strcmp.S (rTMP2): Define as r0.
(rTMP): Define as r11.
(strcmp): Add little-endian support. Optimise tail.
* sysdeps/powerpc/powerpc32/strcmp.S: Similarly.
* sysdeps/powerpc/powerpc64/strncmp.S: Likewise.
* sysdeps/powerpc/powerpc32/strncmp.S: Likewise.
* sysdeps/powerpc/powerpc64/power4/strncmp.S: Likewise.
* sysdeps/powerpc/powerpc32/power4/strncmp.S: Likewise.
* sysdeps/powerpc/powerpc64/power7/strncmp.S: Likewise.
* sysdeps/powerpc/powerpc32/power7/strncmp.S: Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00098.html
The existing strnlen code has a number of defects, so this patch is more
than just adding little-endian support. The changes here are similar to
those for memchr.
* sysdeps/powerpc/powerpc64/power7/strnlen.S (strnlen): Add
little-endian support. Remove unnecessary "are we done" tests.
Handle "s" wrapping around zero and extremely large "size".
Correct main loop count. Handle single left-over word from main
loop inline rather than by using small_loop. Correct comments.
Delete "zero" tail, use "end_max" instead.
* sysdeps/powerpc/powerpc32/power7/strnlen.S: Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00097.html
This is the first of nine patches adding little-endian support to the
existing optimised string and memory functions. I did spend some
time with a power7 simulator looking at cycle by cycle behaviour for
memchr, but most of these patches have not been run on cpu simulators
to check that we are going as fast as possible. I'm sure PowerPC can
do better. However, the little-endian support mostly leaves main
loops unchanged, so I'm banking on previous authors having done a
good job on big-endian.. As with most code you stare at long enough,
I found some improvements for big-endian too.
Little-endian support for strlen. Like most of the string functions,
I leave the main word or multiple-word loops substantially unchanged,
just needing to modify the tail.
Removing the branch in the power7 functions is just a tidy. .align
produces a branch anyway. Modifying regs in the non-power7 functions
is to suit the new little-endian tail.
* sysdeps/powerpc/powerpc64/power7/strlen.S (strlen): Add little-endian
support. Don't branch over align.
* sysdeps/powerpc/powerpc32/power7/strlen.S: Likewise.
* sysdeps/powerpc/powerpc64/strlen.S (strlen): Add little-endian support.
Rearrange tmp reg use to suit. Comment.
* sysdeps/powerpc/powerpc32/strlen.S: Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00090.html
This patch fixes symbol versioning in setjmp/longjmp. The existing
code uses raw versions, which results in wrong symbol versioning when
you want to build glibc with a base version of 2.19 for LE.
Note that the merging the 64-bit and 32-bit versions in novmx-lonjmp.c
and pt-longjmp.c doesn't result in GLIBC_2.0 versions for 64-bit, due
to the base in shlib_versions.
* sysdeps/powerpc/longjmp.c: Use proper symbol versioning macros.
* sysdeps/powerpc/novmx-longjmp.c: Likewise.
* sysdeps/powerpc/powerpc32/bsd-_setjmp.S: Likewise.
* sysdeps/powerpc/powerpc32/bsd-setjmp.S: Likewise.
* sysdeps/powerpc/powerpc32/fpu/__longjmp.S: Likewise.
* sysdeps/powerpc/powerpc32/fpu/setjmp.S: Likewise.
* sysdeps/powerpc/powerpc32/mcount.c: Likewise.
* sysdeps/powerpc/powerpc32/setjmp.S: Likewise.
* sysdeps/powerpc/powerpc64/setjmp.S: Likewise.
* nptl/sysdeps/unix/sysv/linux/powerpc/pt-longjmp.c: Likewise.
http://sourceware.org/ml/libc-alpha/2013-08/msg00089.html
Little-endian fixes for setjmp/longjmp. When writing these I noticed
the setjmp code corrupts the non volatile VMX registers when using an
unaligned buffer. Anton fixed this, and also simplified it quite a
bit.
The current code uses boilerplate for the case where we want to store
16 bytes to an unaligned address. For that we have to do a
read/modify/write of two aligned 16 byte quantities. In our case we
are storing a bunch of back to back data (consective VMX registers),
and only the start and end of the region need the read/modify/write.
[BZ #15723]
* sysdeps/powerpc/jmpbuf-offsets.h: Comment fix.
* sysdeps/powerpc/powerpc32/fpu/__longjmp-common.S: Correct
_dl_hwcap access for little-endian.
* sysdeps/powerpc/powerpc32/fpu/setjmp-common.S: Likewise. Don't
destroy vmx regs when saving unaligned.
* sysdeps/powerpc/powerpc64/__longjmp-common.S: Correct CR load.
* sysdeps/powerpc/powerpc64/setjmp-common.S: Likewise CR save. Don't
destroy vmx regs when saving unaligned.
http://sourceware.org/ml/libc-alpha/2013-08/msg00088.html
* sysdeps/powerpc/powerpc32/fpu/s_roundf.S: Increase alignment of
constants to usual value for .cst8 section, and remove redundant
high address load.
* sysdeps/powerpc/powerpc32/power4/fpu/s_llround.S: Use float
constant for 0x1p52. Load little-endian words of double from
correct stack offsets.