__strncpy_power9 initializes VR 18 with zeroes to be used throughout the
code, including when zero-padding the destination string. However, the
v18 reference was also used for stxv and stxvl, which take a VSX
register as operand, so the code ended up storing from the uninitialized
VSR 18 instead.
Both occurrences have been changed to use the proper VSX number for VR 18
(i.e. VSR 50).
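The renumbering follows from how the register files overlap: the 32
vector registers (VR 0-31) occupy the upper half of the 64 VSX registers
(VSR 32-63). A minimal sketch of that mapping, where the helper name
vr_to_vsr is hypothetical and used only for illustration:

```c
#include <assert.h>

/* VSX exposes 64 vector-scalar registers (VSR 0-63).  The 32 vector
   registers (VR 0-31) overlay the upper half, VSR 32-63.
   vr_to_vsr is a hypothetical helper, not a glibc function.  */
static unsigned int
vr_to_vsr (unsigned int vr)
{
  assert (vr < 32);   /* Only VR 0-31 exist.  */
  return vr + 32;
}
```

With this mapping, vr_to_vsr (18) gives 50, which is why an operand
written as plain 18 in a VSX instruction names a different register
than v18.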
Tested on powerpc, powerpc64 and powerpc64le.
Signed-off-by: Kewen Lin <linkw@gcc.gnu.org>
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted of lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah. I don't
know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
This patch modifies the current POWER9 implementation of strcpy and
stpcpy to optimize it for POWER9/10.
Since no new POWER10 instructions are used, the original POWER9 strcpy is
modified instead of creating a new implementation for POWER10. This
implementation is based on both the original POWER9 implementation of
strcpy and the preamble of the new POWER10 implementation of strlen.
The changes also affect stpcpy, which uses the same implementation with
some additional code before returning.
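The relationship between the two entry points can be sketched in
portable C. This is only an illustration of the idea, not the assembly
in this patch; copy_core, sketch_strcpy and sketch_stpcpy are
hypothetical names:

```c
/* Hypothetical shared core: copies SRC including its terminating NUL
   and returns a pointer to the NUL written in DST.  */
static char *
copy_core (char *dst, const char *src)
{
  while ((*dst = *src) != '\0')
    {
      dst++;
      src++;
    }
  return dst;
}

/* strcpy returns the start of the destination.  */
char *
sketch_strcpy (char *dst, const char *src)
{
  copy_core (dst, src);
  return dst;
}

/* stpcpy returns the address of the terminating NUL, so the only
   difference is the value computed before returning.  */
char *
sketch_stpcpy (char *dst, const char *src)
{
  return copy_core (dst, src);
}
```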
On POWER9, averaging improvements across the benchmark
inputs (length/source alignment/destination alignment), for an
experiment that ran the benchmark five times, bench-strcpy showed an
improvement of 5.23%, and bench-stpcpy showed an improvement of 6.59%.
On POWER10, bench-strcpy showed 13.16%, and bench-stpcpy showed 13.59%.
The changes are:
1. Removed the null string optimization.
Although this results in a few extra cycles for the null string, in
combination with the second change, this resulted in improvements
for other cases.
2. Adapted the preamble from strlen for POWER10.
This is the part of the function that handles up to the first 16 bytes
of the string.
3. Increased number of unrolled iterations in the main loop to 6.
Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
Tested-by: Matheus Castanho <msc@linux.ibm.com>
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted of lines saying "FOO: warning:
copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
Programmatically generate simple wrappers for interesting libm *f128
objects. Selected functions are transcendental functions or
those with trivial compiler builtins. This can result in a 2-3x
speedup (e.g. logf128 and expf128).
A second set of implementation files are generated which include
the first implementation encountered along the search path. This
usually works, except when a wrapper is overridden and makefile
search order slightly diverges from include order. Likewise,
wrapper object files are created for each generated file. These
hold the ifunc selection routines which export ABI.
Next, several shared headers are intercepted to control renaming.
Asm function redirects are used first, and sometimes macro renames
if the former is impractical.
Notably, if the requested machine supports hardware IEEE128 (i.e. POWER9
and newer) this ifunc machinery is disabled. Likewise existing
ifunc support for float128 is consolidated into this (e.g. sqrtf128
and fmaf128).
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
Add stpncpy support into the POWER9 strncpy.
Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
Similar to the strcpy P9 optimization, this version uses VSX to improve
performance.
Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
This started as a trivial change to Anton's rawmemchr. I got
carried away. This is a hybrid between P8's asymptotically
faster 64B checks with extremely efficient small string checks
e.g. <64B (and sometimes a little bit more depending on alignment).
The second trick is to align to 64B by running a 48B checking loop
16B at a time until we naturally align to 64B (i.e. checking 48/96/144
bytes/iteration based on the alignment after the first 5 comparisons).
This alleviates the need to check page boundaries.
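One reading of the 48/96/144 figures: starting from a 16B-aligned
address, each 48B pass moves the offset within a 64B block back by 16B
(48 is congruent to -16 modulo 64), so at most three passes are needed.
A minimal sketch of that arithmetic, assuming this interpretation;
passes_to_align64 is a hypothetical helper, not code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Count the 48-byte passes needed for a 16-byte-aligned address to
   land on a 64-byte boundary.  Each pass advances the address by 48,
   which shifts its offset within a 64-byte block by -16.  */
static unsigned int
passes_to_align64 (uintptr_t addr)
{
  assert (addr % 16 == 0);   /* The sketch assumes 16B alignment.  */
  unsigned int passes = 0;
  while (addr % 64 != 0)
    {
      addr += 48;
      passes++;
    }
  return passes;
}
```

Under this reading, offsets of 16, 32 and 48 bytes into a 64B block
take 1, 2 and 3 passes, i.e. 48, 96 and 144 bytes checked.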
Finally, explicitly use the P7 strlen with the runtime loader when building
P9. We need to be cautious about vector/vsx extensions here on P9 only
builds.
This defines the macro such that it should behave best on all
supported powerpc targets. Likewise, this allows us to remove the
ppc64le specific s_fmaf128.c.
I have verified powerpc64le multiarch and powerpc64le power9
no-multiarch builds continue to generate optimized fmaf128.
This version uses vector instructions and is up to 60% faster on medium
matches and up to 90% faster on long matches, compared to the POWER7
version. A few examples:
                           __rawmemchr_power9  __rawmemchr_power7
Length   32, alignment 0:  2.27566             3.77765
Length   64, alignment 2:  2.46231             3.51064
Length 1024, alignment 0:  17.3059             32.6678
Add stpcpy support to the POWER9 strcpy. This is up to 40% faster on
small strings and up to 90% faster on long relatively unaligned strings,
compared to the POWER8 version. A few examples:
                                        __stpcpy_power9  __stpcpy_power8
Length   20, alignments in bytes 4/ 4:  2.58246          4.8788
Length 1024, alignments in bytes 1/ 6:  24.8186          47.8528
This version uses VSX store vector with length instructions and is
significantly faster on small strings and relatively unaligned large
strings, compared to the POWER8 version. A few examples:
                                       __strcpy_power9  __strcpy_power8
Length  16, alignments in bytes 0/ 0:  2.52454          4.62695
Length 412, alignments in bytes 4/ 0:  11.6             22.9185
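The benefit of a store with length can be illustrated in portable C:
the final, partial block is written with a single length-limited store
instead of a byte-by-byte tail loop. This is a sketch of the idea only,
not the assembly in this patch; store_with_length and
sketch_strcpy_blocks are hypothetical names, and memcpy merely stands
in for the vector instruction:

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 16

/* Hypothetical stand-in for a "store vector with length" instruction:
   writes only the first LEN bytes of a 16-byte block.  */
static void
store_with_length (char *dst, const char block[BLOCK], size_t len)
{
  memcpy (dst, block, len);
}

char *
sketch_strcpy_blocks (char *dst, const char *src)
{
  char *ret = dst;
  for (;;)
    {
      char block[BLOCK] = { 0 };
      size_t n = 0;
      /* Portable C must not read past the terminator, so this sketch
         gathers the block byte by byte; a vector implementation would
         presumably load the whole block at once.  */
      while (n < BLOCK && (block[n] = src[n]) != '\0')
        n++;
      if (n < BLOCK)
        {
          /* Terminator found at index n: one store of n + 1 bytes
             finishes the copy without touching bytes past the NUL.  */
          store_with_length (dst, block, n + 1);
          return ret;
        }
      /* No terminator in this block: store all 16 bytes and go on.  */
      store_with_length (dst, block, BLOCK);
      src += BLOCK;
      dst += BLOCK;
    }
}
```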
strcmp is used while resolving PLT references. Vector registers
should not be used during this. The P9 strcmp makes heavy use of
vector registers, so it should be avoided in rtld.
This prevents quiet vector register corruption when glibc is configured
with --disable-multi-arch and --with-cpu=power9. This can be seen with
test-float64x-compat_totalordermag during the first call into
totalordermagf64x@GLIBC_2.27.
Add a guard to fall back to the power8 implementation when building
power9 strcmp for libraries other than libc.
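The shape of such a guard can be sketched with the C preprocessor. The
macro BUILDING_FOR_LIBC and the function selected_strcmp_impl are
hypothetical and exist only for this illustration; the actual guard in
the assembly source may be written differently:

```c
/* Hypothetical build-time switch: nonzero only when the file is being
   compiled as part of libc itself, zero for rtld and other libraries.  */
#ifndef BUILDING_FOR_LIBC
# define BUILDING_FOR_LIBC 0
#endif

#if BUILDING_FOR_LIBC
/* Inside libc the vector-heavy variant is safe to use.  */
# define SELECTED_IMPL "power9"
#else
/* Elsewhere (e.g. the dynamic loader) fall back to the variant that
   avoids vector registers.  */
# define SELECTED_IMPL "power8"
#endif

/* Report which variant this translation unit was built with.  */
const char *
selected_strcmp_impl (void)
{
  return SELECTED_IMPL;
}
```

Because the choice is made at build time, no runtime check is needed
in the loader's copy of the function.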
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Adds a POWER9 version of fmaf128 that uses the xsmaddqp
instruction.
Co-authored-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch adds a new macro, libm_alias_finite, to define all _finite
symbols. It sets each _finite symbol as a compat symbol based on its first
version (obtained from the definition in the build-generated first-versions.h).
The <fn>f128_finite symbols were introduced in GLIBC 2.26 and so need
special treatment in code that is shared between long double and float128.
It is done by adding a list, similar to internal symbol redefinition,
in sysdeps/ieee754/float128/float128_private.h.
Alpha also needs some tricky changes to ensure we still emit 2 compat
symbols for sqrt(f).
Passes buildmanyglibc.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
This patch moves little endian specific POWER9 optimization files to
sysdeps/powerpc/powerpc64/le and creates POWER9 ifunc functions
only for little endian.
The creation of the divergent sysdeps directory for powerpc64le
  commit 2f7f3cd8cd
  Author: Paul E. Murphy <murphyp@linux.vnet.ibm.com>
  Date:   Fri Jul 15 18:04:40 2016 -0500
      powerpc64le: Create divergent sysdep directory for powerpc64le.
allowed float128 to be enabled for powerpc64le (little-endian) and not
for powerpc64 (big-endian). Since the only intended difference between
them was the presence or absence of the float128 interface, the sysdeps
directory for powerpc64le explicitly reused the files from powerpc64
(through the use of Implies files).
Although this works, it also means that files under the powerpc64
directory might be preferred over files under powerpc64le. For
instance, on a build for powerpc64le with target set to power9, a file
from powerpc64/power5 might get built, even though a file with the same
name exists in powerpc64le/power8. That happens because the processor
hierarchy was only defined in the sysdeps directory for powerpc64 (and
borrowed by powerpc64le).
This patch fixes this behavior, by creating new subdirectories under
powerpc64 (i.e. powerpc64/be and powerpc64/le) and creating new Implies
files to provide the hierarchy of processors for powerpc64 and
powerpc64le separately. These changes have no effect on installed,
stripped binaries (which remain unchanged).
Tested that installed stripped binaries are unchanged and that there are
no regressions on powerpc64 and powerpc64le.