This patch adds an optimized memset implementation for POWER8. For
sizes from 0 to 255 bytes, a word/doubleword algorithm similar to
POWER7 optimized one is used.
For size higher than 255 two strategies are used:
1. If the constant is different than 0, the memory is written with
altivec vector instruction;
2. If constant is 0, dbcz instructions are used. The loop is unrolled
to clear 512 byte at time.
Using vector instructions increases throughput considerable, with a
double performance for sizes larger than 1024. The dcbz loops unrolls
also shows performance improvement, by doubling throughput for sizes
larger than 8192 bytes.
This patch cleanups the multiarch bzero for powerpc64 by remove
the multiarch objects and use instead the the memset embedded
implementation presented in each multiarch optimization. The
code generate is essentially the same, but the TB_TOCLESS (which
is not essential).
generic HAVE_RM_CTX implementation which is used for ppc/e500 as well
has introduced calls to fegetenv which should be resolved internally
with in libm
Signed-off-by: Khem Raj <raj.khem@gmail.com>
* sysdeps/powerpc/powerpc32/e500/nofpu/fegetenv.c (fegetenv): Add
libm_hidden_ver.
On powerpc, floating-point environment macros are defined as pointers
to constants in the library that contain the bit-patterns of the
desired environment, instead of being magic constants cast to pointer
type.
For soft-float, the bit-patterns used for fenv_t are not laid out the
same as for hard-float. (e500 has a third layout used; that's not an
ABI issue because these values are only meaningful within a single
process, all of whose glibc libraries must come from the same build of
glibc.) While the __fe_dfl_env value for soft-float was appropriate
for the soft-float fenv_t representation, the other two constants had
the same bit-patterns as for hard-float. Those bit patterns had the
effect of having exceptions already raised, causing
math/test-fenv-return to fail; this patch fixes the patterns used.
(__fe_nonieee_env also had exceptions unmasked, though they should be
masked to match hard-float semantics. Since there is no separate
non-IEEE mode for soft-float, it's most appropriate for
__fe_nonieee_env to be the same as __fe_dfl_env; this patch makes it
an alias.)
Tested for powerpc-nofpu.
[BZ #17261]
* sysdeps/powerpc/nofpu/fenv_const.c (__fe_enabled_env): Change
value to 0.
(__fe_nonieee_env): Define as an alias for __fe_dfl_env.
Now that MEMCPY_OK_FOR_FWD_MEMMOVE should be define on memcopy.h there
is no need to specialized powerpc memmove implementation. This patch
moves the define set to powerpc memcopy and cleanup its definition on
powerpc code.
This patch changes power7 memcpy to use VSX instructions only when
memory is aligned to quardword. It is to avoid unaligned kernel traps
on non-cacheable memory (for instance, memory-mapped I/O).
This patch adds an optimized memmove optimization for POWER7/powerpc64.
Basically the idea is to use the memcpy for POWER7 on non-overlapped
memory regions and a optimized backward memcpy for memory regions
that overlap (similar to the idea of string/memmove.c).
The backward memcpy algorithm used is similar the one use for memcpy for
POWER7, with adjustments done for alignment. The difference is memory
is always aligned to 16 bytes before using VSX/altivec instructions.
This patch removes the powerpc specific logic in memmove and instead
include default implementation with MEMCPY_OK_FOR_FWD_MEMMOVE defined.
This lead in a increase performance, since the constraints to use
memcpy in powerpc code are too restrictive and memcpy can be used for
any forward memmove.
The PAGE_COPY_THRESHOLD macro is meant to be overridden by
architecture-specific pagecopy.h, but it is currently done only by
mach; all other architectures use the default. Check to see if the
macro is defined in addition to whether it is set to a non-zero value.
This patch adds an ifunc power7 strcat symbol that uses the logic on
sysdeps/powerpc/strcat.c but call power7 strlen/strcpy symbols instead
of default ones.
shlib-versions files can contain ABI lines that map triplets to a
canonical ABI name. This name was once used for various purposes
where test baseline files for different ABIs went in a single
directory; now these purposes use sysdeps files, generation of headers
which have per-ABI variants uses abi-variants and related Makefile
variables and the shlib-versions ABI names are unused. This patch
duly removes those lines and associated build system support for them.
Tested for x86_64 (both a full testsuite run and confirming the
installed shared libraries are unchanged by the patch).
* Makeconfig ($(common-objpfx)soversions.mk): Do not generate
abi-name definition.
* scripts/soversions.awk: Do not handle or generate ABI lines.
* shlib-versions: Remove ABI entries.
* sysdeps/powerpc/nofpu/shlib-versions: Remove file.
* sysdeps/x86_64/x32/shlib-versions: Remove ABI entry.
This patch defines ELF_MACHINE_NO_RELA on all architectures. Tested
only on x86_64 to verify that the sources before and after are
identical except for two instructions that pass the current line
number in dl-machine.h to assert_fail.
This patch makes non-ex-ports architectures set base_machine and
machine based on the original configured machine value in preconfigure
fragments, like ex-ports architectures, rather than in the toplevel
configure.ac.
Tested x86 that the disassembly of installed shared libraries is
unchanged by the patch.
* configure.ac (base_machine): Do not set specially for particular
machines here.
* configure: Regenerated.
* sysdeps/powerpc/preconfigure: Move machine and base_machine
settings from configure.ac.
* sysdeps/i386/preconfigure: New file.
* sysdeps/s390/preconfigure: Likewise.
* sysdeps/sh/preconfigure: Likewise.
* sysdeps/sparc/preconfigure: Likewise.
Linux commit dd58a092c4202f2bd490adab7285b3ff77f8e467 added the
PPC_FEATURE2_VEC_CRYPTO auvx capability to indicate whether to
hardware supports vector crypto hardware instructions. This patch
adds its definition to powerpc hwcap bits.
This patch makes files using __ASSUME_* macros include
<kernel-features.h> explicitly, rather than relying on some other
header (such as tls.h, lowlevellock.h or pthreadP.h) to include it
implicitly. (I omitted cases where I've already posted or am testing
the patch that stops the file from needing __ASSUME_* at all.) This
accords with the general principle of making source files include the
headers for anything they use, and also helps make it safe to remove
<kernel-features.h> includes from any file that doesn't use
__ASSUME_* (some of those may be stray includes left behind after
increasing the minimum kernel version, others may never have been
needed or may have become obsolete after some other change).
Tested x86_64 that the disassembly of installed shared libraries is
unchanged by this patch.
* nptl/pthread_cond_wait.c: Include <kernel-features.h>.
* nptl/pthread_rwlock_timedrdlock.c: Likewise.
* nptl/pthread_rwlock_timedwrlock.c: Likewise.
* nptl/sysdeps/unix/sysv/linux/lowlevelrobustlock.c: Likewise.
* nscd/nscd.c: Likewise.
* sysdeps/i386/nptl/tcb-offsets.sym: Likewise.
* sysdeps/powerpc/nptl/tcb-offsets.sym: Likewise.
* sysdeps/sh/nptl/tcb-offsets.sym: Likewise.
* sysdeps/x86_64/nptl/tcb-offsets.sym: Likewise.
Optimization is achieved on 8 byte aligned strings with double word
comparison using cmpb instruction. On unaligned strings loop unrolling
is applied for Power7 gain.
This patch fixes the optimized ppc64/power7 strncat strlen call for
static build without ifunc enabled. The strlen symbol to call in such
situation is just strlen, instead of __GI_strlen (since the __GI_
alias is just created for shared objects).
This patch fixes a similar issue to
736c304a1a, where for PPC32 if the symbol
is defined as hidden (memchr) then compiler will create a local branc
(symbol@local) and the linker will not create a required PLT call to
make the ifunc work. It changes the default hidden symbol (__GI_memchr)
to default memchr symbol for powerpc32 (__memchr_ppc32).
This patch fixes the __copysignf optimized macro meant to internal libm
usage when used with constant value. Without the explicit cast to
float, if it is used with const double value (for instance, on
s_casinhf.c) double constants will be used and it may lead to precision
issues in some algorithms.
It fixes the following failures on PPC64/POWER7:
Failure: Test: Real part of: cacos_downward (inf + 0 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_downward (inf - 0 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_downward (inf + 0.5 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_downward (inf - 0.5 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_towardzero (inf + 0 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_towardzero (inf - 0 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_towardzero (inf + 0.5 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
Failure: Test: Real part of: cacos_towardzero (inf - 0.5 i)
Result:
is: 1.19209289550781250000e-07 0x1.00000000000000000000p-23
should be: 0.00000000000000000000e+00 0x0.00000000000000000000p+0
The optimization is achieved by following techniques:
> data alignment [gain from aligned memory access on read/write]
> POWER7 gains performance with loop unrolling/unwinding
[gain by reduction of branch penalty].
> zero padding done by calling optimized memset
This patch changes de default symbol redirection for internal call of
memcpy, memset, memchr, and strlen to the IFUNC resolved ones. The
performance improvement is noticeable in algorithms that uses these
symbols extensible, like the regex functions.
This patch optimizes the FPSCR update on exception and rounding change
functions by just updating its value if new value if different from
current one. It also optimizes fedisableexcept and feenableexcept by
removing an unecessary FPSCR read.
This patch fixes some powerpc32 and powerpc64 builds with
--disable-multi-arch option along with different --with-cpu=powerN.
It cleanups the Implies directories by removing the multiarch
folder for non multiarch config and also fixing two assembly
implementations: powerpc64/power7/strncat.S that is calling the
wrong strlen; and power8/fpu/s_isnan.S that misses the hidden_def and
weak_alias directives.
This patch fixes the powerpc32 optimized nearbyint/nearbyintf bogus
results for FE_DOWNWARD rounding mode. This is due wrong instructions
sequence used in the rounding calculation (two subtractions instead of
adition and a subtraction).
Fixes BZ#16815.