Initially based on the versions found in wcsmbs/* ; these files have
been changed by hand unrolling, and adding some additional variables
to allow some read-ahead to occur, which then relieves some of the
wait-for-increment/wait-for-load/wait-for-compare-results pressure
that was slowing down every iteration through the while-loop.
For 64-bit Power7, These changes give an approx 20% throughput boost
for the wcschr and wcsrchr functions; and approx 40% boost for the
wcscpy function. 32-bit improvements appear to be slightly better
with ~ %30 and ~ %45 respectively. Results for Power6 closely match
those for power7.
Assorted tweaking, twisting and tuning to squeeze a few additional cycles
out of the memchr code. Changes include bypassing the shift pairs
(sld,srd) when they are not required, and unrolling the small_loop that
handles short and trailing strings.
Per scrollpipe data measuring aligned strings for 64-bit, these changes
save between five and eight cycles (9-13% overall) for short strings (<32),
Longer aligned strings see slight improvement of 1-3% due to bypassing the
shifts and the instruction rearranging.
In the past the "-ftree-loop-linear" switch provided a measurable
improvement in performance for certain functions. At some point it
was assigned as the responsibility of Graphite in GCC. It has been
found that even with Graphite enabled these flags no longer perform
any appreciable improvement over the baseline.
Graphite now has some open bugs which need to be fixed in order for it
to provide measurable performance improvements but it lacks active
development. As a result some compiler distributors may disable
Graphite. If Graphite is disabled then building GLIBC will fail if
the "-ftree-loop-linear" switch is used.
This patch removes the use of "-ftree-loop-linear" as unnecessary.
This patch provides optimized logb (1.2x on PPC32 and 2.5x on PPC64),
logbf (1.1x on PPC32 and 2.2x on PPC64), and logbl (1.3x on PPC32 and
50% on PPC64) for the POWER7 processor.
* sysdeps/powerpc/powerpc32/crti.S: New file.
* sysdeps/powerpc/powerpc32/crtn.S: New file.
* sysdeps/powerpc/powerpc64/crti.S: New file.
* sysdeps/powerpc/powerpc64/crtn.S: New file.
This patch tries to organize the implies files for ppc, since there are
a number of processors and most of them are compatible with each other
(backwards compatible).
Having in mind that we start the search for processor-specific files in
the sysdeps/unix/sysv/linux tree
(sysdeps/unix/sysv/linux/powerpc/powerpc[32|64]/[processor]/fpu to be
exact), we would like to grab any linux-specific code from that tree
prior to going through the other tree (sysdeps/powerpc/...).
For that, i removed the Implies files that were originally inside the
fpu directories and placed then in the non-fpu directories (still inside
the unix/sysv/linux tree). If no processor-specific/linux-specific files
could be found, we "imply" the other tree's (sysdeps/powerpc/...) fpu
directory for that specific processor AND also the non-fpu directory for
that same tree.
If, again, no processor-specific code is found, we read another Implies
file that will point to the most compatible processor that we should
grab code from, and so on, until we reach the power4 processor.
So, in summary, the Implies files will live inside these directories
now:
* sysdeps/unix/sysv/linux/powerpc/powerpc[32|64]/[processor]
* sysdeps/powerpc/powerpc[32|64]/[processor]
Practical example of the order we will use to pick power6-specific code
with the new structure.
sysdeps/unix/sysv/linux/powerpc/powerpc[32|64]/power6/fpu ->
sysdeps/unix/sysv/linux/powerpc/powerpc[32|64]/power6 ->
sysdeps/powerpc/powerpc[32|64]/power6/fpu ->
sysdeps/powerpc/powerpc[32|64]/power6 ->
sysdeps/powerpc/powerpc[32|64]/power5+/fpu ->
sysdeps/powerpc/powerpc[32|64]/power5+ ->
sysdeps/powerpc/powerpc[32|64]/power5/fpu ->
sysdeps/powerpc/powerpc[32|64]/power5 ->
sysdeps/powerpc/powerpc[32|64]/power4/fpu ->
sysdeps/powerpc/powerpc[32|64]/power4 (from here, it'll go to the
generic path as usual)
2008-11-13 Ryan S. Arnold <rsa@us.ibm.com>
[BZ #6411]
* sysdeps/powerpc/fpu/Makefile: Added test case tst-setcontext-fpscr.
* sysdeps/powerpc/fpu/feholdexcpt.c (_FPU_MASK_ALL): Define to replace
magic numbers.
* sysdeps/powerpc/fpu/fenv_libc.h (fesetenv_register): Dynamically
choose mtfsf insn based on PPC_FEATURE_HAS_DFP.
(relax_fenv_state): Same as above.
(FPSCR_29): Reserve bit in ISA 2.05.
(FPSCR_NI): Provide define for compat.
* sysdeps/powerpc/fpu/fesetenv.c (_FPU_MASK_ALL): Define to replace
magic numbers.
* sysdeps/powerpc/fpu/feupdateenv.c (_FPU_MASK_ALL): Define to replace
magic numbers.
* sysdeps/powerpc/fpu/tst-setcontext-fpscr.c: New file. Test case to
test setcontext and swapcontext with dynamic 64-bit FPSCR detection.
* sysdeps/powerpc/powerpc32/fpu/__longjmp-common.S (__longjmp): Adjust
access to hwcap to account for hwcap size increase to uint64_t.
* sysdeps/powerpc/powerpc32/fpu/setjmp-common.S (__sigsetjmp ):
Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/getcontext-common.S
(*setcontext): Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/power6/fpu/setcontext.S:
New file.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/power6/fpu/swapcontext.S:
New file.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/setcontext-common.S
(*setcontext): dynamically select mtfsf insn based on
PPC_FEATURE_HAS_DFP. Adjust access to hwcap to account for hwcap size
increase to uint64_t.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/swapcontext-common.S
(*swapcontext): dynamically select mtfsf insn based on
PPC_FEATURE_HAS_DFP. Adjust access to hwcap to account for hwcap size
increase to uint64_t.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/power6/fpu/setcontext.S:
New file.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/power6/fpu/swapcontext.S:
New file.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/setcontext.S
(*setcontext): dynamically select mtfsf insn based on
PPC_FEATURE_HAS_DFP.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/swapcontext.S
(*swapcontext): dynamically select mtfsf insn based on
PPC_FEATURE_HAS_DFP.
* sysdeps/powerpc/powerpc32/fpu/s_lround.S (__lround): Fixed erroneous
result when x is +/-nextafter(+/-0.5,-/+1) i.e. all 1's in the
mantissa.
* sysdeps/powerpc/powerpc32/power4/fpu/s_llround.S (__llround):
Likewise. Also account for when x is an odd number between 2^52
and 2^53-1.
* sysdeps/powerpc/powerpc64/fpu/s_llround.S (__llround): Likewise.
* sysdeps/powerpc/powerpc64/fpu/s_llroundf.S (__llroundf): Likewise.
* math/libm-test.inc (lround_test, llround_test): Added test cases to
detect aforementioned erroneous conditions.
* sysdeps/powerpc/powerpc32/power4/hp-timing.h: New file.
* sysdeps/powerpc/powerpc64/hp-timing.h [_ARCH_PWR4] (HP_TIMING_NOW):
For ISA 2.01 and later replace mftb with mfspr 268.
* sysdeps/i386/i686/memcpy.S: Optimize copying of equally aligned
buffers.