Commit Graph

1456 Commits

Author SHA1 Message Date
Adhemerval Zanella
056fc1c0e3 elf: Issue la_symbind for bind-now (BZ #23734)
The audit symbind callback is not called for binaries built with
-Wl,-z,now or when LD_BIND_NOW=1 is used.  This patch issues it in
those cases, but not the PLT tracking callbacks (plt_enter and
plt_exit), since calling them would change the expected program
semantics (where no PLT is expected) and would have performance
implications (such as for BZ#15533).

LAV_CURRENT is also bumped to indicate the audit ABI change (where
la_symbind flags are set by the loader to indicate no possible PLT
trace).

To handle the powerpc64 ELFv1 function descriptor, _dl_audit_symbind
needs to know whether bind-now is used so that the symbol value is
updated to the function text segment instead of the OPD (for lazy binding
this is done by PPC64_LOAD_FUNCPTR on _dl_runtime_resolve).

Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
powerpc64-linux-gnu.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 32612615c5)

Resolved conflicts:
	NEWS - Manual merge.
2022-04-12 13:32:59 -04:00
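To illustrate what 056fc1c0e3 makes observable, here is a minimal sketch of an audit module for a 64-bit target. It assumes that the "no possible PLT trace" flags mentioned in the message are LA_SYMB_NOPLTENTER and LA_SYMB_NOPLTEXIT from <link.h>; the message itself does not name them, so treat that as an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stdint.h>
#include <stdio.h>

/* Handshake with the loader: accept the audit ABI version it offers.  */
unsigned int
la_version (unsigned int version)
{
  return version;
}

/* Called when a symbol is bound on 64-bit targets.  */
uintptr_t
la_symbind64 (Elf64_Sym *sym, unsigned int ndx, uintptr_t *refcook,
              uintptr_t *defcook, unsigned int *flags, const char *symname)
{
  (void) ndx;
  (void) refcook;
  (void) defcook;
  assert (sym != NULL && flags != NULL);

  /* Assumption: both bits set means no PLT tracing will happen.  */
  int no_plt = (*flags & LA_SYMB_NOPLTENTER) && (*flags & LA_SYMB_NOPLTEXIT);
  fprintf (stderr, "bind %s%s\n", symname, no_plt ? " (no PLT tracing)" : "");

  /* Returning the symbol value unchanged leaves the binding as is.  */
  return sym->st_value;
}
```

Such a module would typically be built as a shared object and loaded through LD_AUDIT.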
Adhemerval Zanella
a8e211daea elf: Add _dl_audit_pltexit
It consolidates the code required to call la_pltexit audit
callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 8c0664e2b8)

Resolved conflicts:
	sysdeps/hppa/dl-runtime.c
2022-04-08 14:18:12 -04:00
Adhemerval Zanella
b868b45f67 elf: Fix dynamic-link.h usage on rtld.c
The 4af6982e4c fix does not fully handle RTLD_BOOTSTRAP usage on
rtld.c due to two issues:

  1. RTLD_BOOTSTRAP is also used on dl-machine.h on various
     architectures and it changes the semantics of various machine
     relocation functions.

  2. The elf_get_dynamic_info() change was done sideways: prior
     to 490e6c62aa, get-dynamic-info.h was included by the first
     dynamic-link.h include *without* RTLD_BOOTSTRAP being defined.
     It means that the code within elf_get_dynamic_info() that uses
     RTLD_BOOTSTRAP is in fact unused.

To fix 1. this patch now includes dynamic-link.h only once with
RTLD_BOOTSTRAP defined.  The ELF_DYNAMIC_RELOCATE call will now have
the relocation functions with the expected semantics for the loader.

And to fix 2. part of 4af6982e4c is reverted (the check argument of
elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP
pieces are removed.

To reorganize the includes the static TLS definition is moved to
its own header to avoid a circular dependency (it is defined on
dynamic-link.h and dl-machine.h requires it at the same time that other
dynamic-link.h definitions require dl-machine.h definitions).

Also ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL
are moved to their own header.  Only ancient ABIs need special values
(arm, i386, and mips), so a generic one is used as default.

The powerpc Elf64_FuncDesc is also moved to its own header, since
csu code required its definition (which would require either include
elf/ folder or add a full path with elf/).

Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32,
and powerpc64le.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit d6d89608ac)

Resolved conflicts:
	elf/rtld.c
2022-04-08 14:18:11 -04:00
Fangrui Song
b19de59d62 elf: Avoid nested functions in the loader [BZ #27220]
dynamic-link.h is included more than once in some elf/ files (rtld.c,
dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested
functions. This harms readability and the nested functions usage
is the biggest obstacle preventing a Clang build (Clang doesn't support GCC
nested functions).

The key idea for unnesting is to add extra parameters (struct link_map *
and struct r_scope_elm *[]) to RESOLVE_MAP,
ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a],
elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired
by Stan Shebs' ppc64/x86-64 implementation in the
google/grte/v5-2.27/master which uses mixed extra parameters and static
variables.)

Future simplification:
* If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM,
  elf_machine_runtime_setup can drop the `scope` parameter.
* If TLSDESC no longer needs to be in elf_machine_lazy_rel,
  elf_machine_lazy_rel can drop the `scope` parameter.

Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32,
sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64.
In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu
and riscv64-linux-gnu-rv64imafdc-lp64d.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 490e6c62aa)
2022-04-08 14:18:11 -04:00
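The unnesting idea in b19de59d62 can be sketched in a few lines. The types and the function below are hypothetical stand-ins, not the loader's real struct link_map, struct r_scope_elem, or RESOLVE_MAP; the point is only that context a GCC nested function would capture from the enclosing frame is passed as explicit parameters instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the link map and the scope array.  */
struct demo_map { const char *name; };
struct demo_scope { size_t nmaps; struct demo_map **maps; };

/* Before (GCC extension only): a nested helper reads `map' and `scope'
   from the enclosing function:

     void relocate (struct demo_map *map, struct demo_scope *scope)
     {
       struct demo_map *resolve (size_t i) { return scope->maps[i]; }
       ...
     }

   After: the helper lives at file scope and receives its context
   through extra parameters, which any C compiler accepts.  */
static struct demo_map *
demo_resolve (struct demo_map *map, struct demo_scope *scope, size_t i)
{
  assert (scope != NULL);
  /* Fall back to the map being relocated when the index is out of range.  */
  return i < scope->nmaps ? scope->maps[i] : map;
}
```

The cost is longer parameter lists at every call site, which matches the "future simplification" notes in the message about dropping the `scope` parameter where it is no longer needed.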
Fangrui Song
4b5b8a1cdf powerpc: Delete unneeded ELF_MACHINE_BEFORE_RTLD_RELOC
Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
(cherry picked from commit 8e2557a2b8)
2022-04-08 14:18:11 -04:00
Paul A. Clarke
82b1acd9de powerpc: Fix unrecognized instruction errors with recent binutils
Recent versions of binutils (with commit
b25f942e18d6ecd7ec3e2d2e9930eb4f996c258a) stopped preserving "sticky"
options across a base `.machine` directive, nullifying the use of
passing "-many" through GCC to the assembler.  As a result, some
instructions which were recognized even under older, more stringent
`.machine` directives become unrecognized instructions in that
context.

In `sysdeps/powerpc/tst-set_ppr.c`, the use of the `mfppr32` extended
mnemonic became unrecognized, as the default compilation with GCC for
32bit powerpc adds a `.machine ppc` in the resulting assembly, so the
command line option `-Wa,-many` is essentially ignored, and the ISA 2.06
instructions and mnemonics, like `mfppr32`, are unrecognized.

The compilation of `sysdeps/powerpc/tst-set_ppr.c` fails with:
Error: unrecognized opcode: `mfppr32'

Add appropriate `.machine` directives in the assembly to bracket the
`mfppr32` instruction.

A 2019 fix (commit 9250e6610f) added `-many` to the compilation in
the above test's Makefile when GCC itself stopped passing `-many` to
the assembler.  That part of the fix no longer has any effect,
so remove it.

Reported-by: Joseph Myers <joseph@codesourcery.com>
(cherry picked from commit ee874f44fd)
2022-01-18 23:05:42 +01:00
Paul A. Clarke
e64235ff42 powerpc: Fix unrecognized instruction errors with recent GCC
Recent binutils commit b25f942e18d6ecd7ec3e2d2e9930eb4f996c258a
changes the behavior of `.machine` directives to override, rather
than augment, the base CPU. This can result in _reduced_ functionality
when, for example, compiling for default machine "power8", but explicitly
asking for ".machine power5", which loses Altivec instructions.

In tst-ucontext-ppc64-vscr.c, while the instructions provoking the new
error messages are bracketed by ".machine power5", which is ostensibly
Power ISA 2.03 (POWER5), the POWER5 processor did not support the
VSX subset, so these instructions are not recognized as "power5".

Error: unrecognized opcode: `vspltisb'
Error: unrecognized opcode: `vpkuwus'
Error: unrecognized opcode: `mfvscr'
Error: unrecognized opcode: `stvx'

Manually adding the VSX subset via ".machine altivec" is sufficient.

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
(cherry picked from commit 064b475a2e)
2022-01-10 22:48:53 +01:00
Matheus Castanho
387bff63dc powerpc64[le]: Fix CFI and LR save address for asm syscalls [BZ #28532]
Syscalls based on the assembly templates are missing CFI for r31, which gets
clobbered when scv is used, and info for LR is inaccurate, placed in the wrong
LOC and not using the proper offset. LR was also being saved to the callee's
frame, while the ABI mandates it to be saved to the caller's frame. These are
fixed by this commit.

After this change:

$ readelf -wF libc.so.6 | grep 0004b9d4.. -A 7 && objdump --disassemble=kill libc.so.6
00004a48 0000000000000020 00004a4c FDE cie=00000000 pc=000000000004b9d4..000000000004ba3c
   LOC           CFA      r31   ra
000000000004b9d4 r1+0     u     u
000000000004b9e4 r1+48    u     u
000000000004b9e8 r1+48    c-16  u
000000000004b9fc r1+48    c-16  c+16
000000000004ba08 r1+48    c-16
000000000004ba18 r1+48    u
000000000004ba1c r1+0     u

libc.so.6:     file format elf64-powerpcle

Disassembly of section .text:

000000000004b9d4 <kill>:
   4b9d4:       1f 00 4c 3c     addis   r2,r12,31
   4b9d8:       2c c3 42 38     addi    r2,r2,-15572
   4b9dc:       25 00 00 38     li      r0,37
   4b9e0:       d1 ff 21 f8     stdu    r1,-48(r1)
   4b9e4:       20 00 e1 fb     std     r31,32(r1)
   4b9e8:       98 8f ed eb     ld      r31,-28776(r13)
   4b9ec:       10 00 ff 77     andis.  r31,r31,16
   4b9f0:       1c 00 82 41     beq     4ba0c <kill+0x38>
   4b9f4:       a6 02 28 7d     mflr    r9
   4b9f8:       40 00 21 f9     std     r9,64(r1)
   4b9fc:       01 00 00 44     scv     0
   4ba00:       40 00 21 e9     ld      r9,64(r1)
   4ba04:       a6 03 28 7d     mtlr    r9
   4ba08:       08 00 00 48     b       4ba10 <kill+0x3c>
   4ba0c:       02 00 00 44     sc
   4ba10:       00 00 bf 2e     cmpdi   cr5,r31,0
   4ba14:       20 00 e1 eb     ld      r31,32(r1)
   4ba18:       30 00 21 38     addi    r1,r1,48
   4ba1c:       18 00 96 41     beq     cr5,4ba34 <kill+0x60>
   4ba20:       01 f0 20 39     li      r9,-4095
   4ba24:       40 48 23 7c     cmpld   r3,r9
   4ba28:       20 00 e0 4d     bltlr+
   4ba2c:       d0 00 63 7c     neg     r3,r3
   4ba30:       08 00 00 48     b       4ba38 <kill+0x64>
   4ba34:       20 00 e3 4c     bnslr+
   4ba38:       c8 32 fe 4b     b       2ed00 <__syscall_error>
        ...
   4ba44:       40 20 0c 00     .long 0xc2040
   4ba48:       68 00 00 00     .long 0x68
   4ba4c:       06 00 5f 5f     rlwnm   r31,r26,r0,0,3
   4ba50:       6b 69 6c 6c     xoris   r12,r3,26987

(cherry picked from commit d120fb9941)
2021-11-30 15:34:03 -03:00
Florian Weimer
7c241325d6 Force building with -fno-common
As a result, it is not necessary to specify __attribute__ ((nocommon))
on individual definitions.

GCC 10 defaults to -fno-common on all architectures except ARC,
but this change is compatible with older GCC versions and ARC, too.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2021-07-09 20:09:14 +02:00
Anton Blanchard
01d7806282 powerpc64le: Fix typo in configure
The configure script checks for -mlong-double-128 but mentions -mlongdouble
when it fails.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-07-08 21:59:28 -03:00
Tulio Magno Quites Machado Filho
20f0491c67 powerpc64: Remove strcspn ifunc from the loader
5 years ago, commit 8f1b841e45
unintentionally added an ifunc to the loader.
That modification has not caused any harm so far, but it doesn't add any
value either, because the hwcap information is available later during
libc initialization.

Suggested-by: Anton Blanchard <anton@ozlabs.org>
2021-07-08 21:59:28 -03:00
Joseph Myers
f517610f3a Update powerpc-nofpu libm-test-ulps
2021-07-07 15:35:04 +00:00
Pedro Franco de Carvalho
813c6ec808 powerpc: optimize strcpy/stpcpy for POWER9/10
This patch modifies the current POWER9 implementation of strcpy and
stpcpy to optimize it for POWER9/10.

Since no new POWER10 instructions are used, the original POWER9 strcpy is
modified instead of creating a new implementation for POWER10.  This
implementation is based on both the original POWER9 implementation of
strcpy and the preamble of the new POWER10 implementation of strlen.

The changes also affect stpcpy, which uses the same implementation with
some additional code before returning.

On POWER9, averaging improvements across the benchmark
inputs (length/source alignment/destination alignment), for an
experiment that ran the benchmark five times, bench-strcpy showed an
improvement of 5.23%, and bench-stpcpy showed an improvement of 6.59%.

On POWER10, bench-strcpy showed 13.16%, and bench-stpcpy showed 13.59%.

The changes are:

1. Removed the null string optimization.

   Although this results in a few extra cycles for the null string, in
   combination with the second change, this resulted in improvements
   for other cases.

2. Adapted the preamble from strlen for POWER10.

   This is the part of the function that handles up to the first 16 bytes
   of the string.

3. Increased number of unrolled iterations in the main loop to 6.

Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
Tested-by: Matheus Castanho <msc@linux.ibm.com>
2021-07-01 17:58:53 -03:00
Matheus Castanho
ebae2f5a6f Add build option to disable usage of scv on powerpc
Commit 68ab82f566 added support for the scv
syscall ABI on powerpc.  Since then systems that have kernel and processor
support started using scv.  However adding the proper support for a new syscall
ABI requires changes to several other projects (e.g. qemu, valgrind, strace,
kernel), which are gradually receiving support.

Meanwhile, having a way to disable scv on glibc at build time can be useful for
distros that may encounter conflicts with projects that still do not support the
scv ABI, buying time until proper support is added.

This commit adds a --disable-scv option that disables scv support and uses sc
for all syscalls, like before commit 68ab82f566.

Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
2021-06-10 16:23:25 -03:00
Florian Weimer
699361795f Remove stale references to libdl.a
Since commit 0c1c3a771e
("dlfcn: Move dlopen into libc") libdl.a is empty, so linking
against it is no longer necessary.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-06-09 19:14:02 +02:00
Lucas A. M. Magalhaes
a55e2da270 powerpc: Optimized memcmp for power10
This patch was based on the __memcmp_power8 and the recent
__strlen_power10.

Improvements from __memcmp_power8:

1. Don't need alignment code.

   On POWER10 lxvp and lxvl do not generate alignment interrupts, so
   they are safe for use on caching-inhibited memory.  Notice that the
   comparison on the main loop will wait for both VSR to be ready.
   Therefore aligning one of the input addresses does not improve
   performance.  In order to align both registers a vperm is necessary,
   which adds too much overhead.

2. Uses new POWER10 instructions

   This code uses lxvp to decrease contention on load by loading 32 bytes
   per instruction.
   The vextractbm is used to have a smaller tail code for calculating the
   return value.

3. Performance improvement

   This version has around 35% better performance on average. I saw no
   performance regressions for any length or alignment.

Thanks Matheus for helping me out with some details.

Co-authored-by: Matheus Castanho <msc@linux.ibm.com>
Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
2021-05-31 18:00:20 -03:00
Nicholas Piggin
7de36744ee powerpc: Fix handling of scv return error codes [BZ #27892]
When using scv for templated ASM syscalls, current code interprets any
negative return value as error, but the only valid error codes are in
the range -4095..-1 according to the ABI.

This commit also fixes 'signal.gen.test' strace test, where the issue
was first identified.

Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
2021-05-24 17:01:40 -03:00
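The range rule from 7de36744ee can be written as a single unsigned comparison. This is a sketch with a hypothetical helper name, not the code from the patch; it mirrors the `li r9,-4095` / `cmpld r3,r9` pair visible in the `kill` disassembly listed under 387bff63dc.

```c
#include <assert.h>
#include <stdbool.h>

/* True only for return values in the range -4095..-1.  Casting to
   unsigned maps that range to the top 4095 values of unsigned long,
   so one comparison covers both bounds.  */
static bool
demo_is_syscall_error (long ret)
{
  return (unsigned long) ret >= (unsigned long) -4095L;
}

/* Usage: a large negative value is a valid result, not an error.  */
static void
demo_check_syscall_error (void)
{
  assert (demo_is_syscall_error (-1));
  assert (demo_is_syscall_error (-4095));
  assert (!demo_is_syscall_error (-4096));
}
```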
H.J. Lu
79aec84102 Properly check stack alignment [BZ #27901]
1. Replace

if ((((uintptr_t) &_d) & (__alignof (double) - 1)) != 0)

which may be optimized out by compiler, with

int
__attribute__ ((weak, noclone, noinline))
is_aligned (void *p, int align)
{
  return (((uintptr_t) p) & (align - 1)) != 0;
}

2. Add TEST_STACK_ALIGN_INIT to TEST_STACK_ALIGN.
3. Add a common TEST_STACK_ALIGN_INIT to check 16-byte stack alignment
for both i386 and x86-64.
4. Update powerpc to use TEST_STACK_ALIGN_INIT.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2021-05-24 07:42:12 -07:00
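A small sketch of the mask test used in 79aec84102, under a hypothetical name. Note that the expression quoted in the message evaluates to nonzero when the address is not aligned, so the sketch names the helper accordingly. The weak/noclone/noinline attributes from the message are omitted here because they only matter for defeating compiler optimization.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero when P is NOT aligned to ALIGN, which must be a
   power of two.  Same mask expression as in the commit message.  */
static int
demo_misaligned (const void *p, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (((uintptr_t) p) & (align - 1)) != 0;
}
```

For a 16-byte requirement, an address such as 0x1000 yields 0 and 0x1001 yields 1.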
Florian Weimer
d337345ce1 powerpc64le: Check HWCAP bits against compiler build flags
When built with GCC 11.1 and -mcpu=power9, ld.so prints this error
message when running on POWER8:

Fatal glibc error: CPU lacks ISA 3.00 support (POWER9 or later required)
2021-05-19 11:09:57 +02:00
Matheus Castanho
1a594aa986 powerpc: Add optimized rawmemchr for POWER10
Reuse code for optimized strlen to implement a faster version of rawmemchr.
This takes advantage of the same benefits provided by the strlen implementation,
but needs some extra steps. __strlen_power10 code should be unchanged after this
change.

rawmemchr returns a pointer to the char found, while strlen returns only the
length, so we have to take that into account when preparing the return value.

To quickly check 64B, the loop on __strlen_power10 merges the whole block into
16B by using unsigned minimum vector operations (vminub) and checks if there are
any \0 on the resulting vector. The same code is used by rawmemchr if the char c
is 0. However, this approach does not work when c != 0.  We first need to
subtract c from each byte, so that the value we are looking for is converted to a
0; then taking the minimum and checking for nulls works again.

The new code branches after it has compared ~256 bytes and chooses which of the
two strategies above will be used in the main loop, based on the char c. This
extra branch adds some overhead (~5%) for length ~256, but is quickly amortized
by the faster loop for larger sizes.

Compared to __rawmemchr_power9, this version is ~20% faster for length < 256.
Because of the optimized main loop, the improvement becomes ~35% for c != 0
and ~50% for c = 0 for strings longer than 256.

Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
2021-05-17 10:30:35 -03:00
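The subtract-then-minimum trick described in 1a594aa986 can be modelled in scalar C. This is only an illustrative model with a hypothetical function name: the real code works on 16-byte vector registers with vminub, whereas the loops below do the same per-byte arithmetic one element at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model: does the 64-byte block at P contain the byte C?  */
static int
demo_block64_has_byte (const unsigned char *p, unsigned char c)
{
  assert (p != NULL);
  unsigned char merged[16];

  for (size_t i = 0; i < 16; i++)
    {
      /* Subtract c modulo 256, so a matching byte becomes 0.  */
      unsigned char m = (unsigned char) (p[i] - c);
      for (size_t lane = 1; lane < 4; lane++)
        {
          unsigned char v = (unsigned char) (p[16 * lane + i] - c);
          if (v < m)
            m = v;   /* per-byte unsigned minimum across the 4 lanes */
        }
      merged[i] = m;
    }

  /* A zero in the merged 16 bytes means some lane held c there.  */
  for (size_t i = 0; i < 16; i++)
    if (merged[i] == 0)
      return 1;
  return 0;
}
```

When c is 0 the subtraction is a no-op, which is why the strlen loop can be reused unchanged for that case.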
Raoni Fassina Firmino
17a73a6d8b powerpc64le: Fix ifunc selection for memset, memmove, bzero and bcopy
The hwcap2 check for the aforementioned functions should check for
both PPC_FEATURE2_ARCH_3_1 and PPC_FEATURE2_HAS_ISEL but was
mistakenly checking for any one of them, enabling the ISA 3.1 version of
the functions on incompatible processors, like POWER8.

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2021-05-07 15:52:23 -03:00
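The "any" versus "all" mistake fixed by 17a73a6d8b is easy to reproduce. The sketch below uses placeholder bit values and hypothetical names; the real PPC_FEATURE2_* constants come from the system headers and are deliberately not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bits for illustration only, not the real hwcap2 values.  */
#define DEMO_ARCH_3_1  (1UL << 0)
#define DEMO_HAS_ISEL  (1UL << 1)

/* Buggy form: true when ANY of the two bits is set.  */
static bool
demo_any_bit (unsigned long hwcap2)
{
  return (hwcap2 & (DEMO_ARCH_3_1 | DEMO_HAS_ISEL)) != 0;
}

/* Intended form: true only when BOTH bits are set.  */
static bool
demo_all_bits (unsigned long hwcap2)
{
  const unsigned long mask = DEMO_ARCH_3_1 | DEMO_HAS_ISEL;
  return (hwcap2 & mask) == mask;
}

/* Usage: a CPU reporting only the ISEL bit must not qualify.  */
static void
demo_check_hwcap (void)
{
  assert (demo_any_bit (DEMO_HAS_ISEL));     /* the bug: selected */
  assert (!demo_all_bits (DEMO_HAS_ISEL));   /* the fix: rejected */
}
```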
Adhemerval Zanella
db373e4c57 Remove architecture specific sched_cpucount optimizations
And replace the generic algorithm with Brian Kernighan's one.
GCC optimizes it with popcnt if the architecture supports it, so there
is no need to add the extra POPCNT define to enable it.

This is really a micro-optimization that only adds complexity:
recent ABIs already support it (x86-64-v2 or power64le) and it
simplifies the code for internal usage, since i686 does not allow an
internal iFUNC call.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
2021-05-07 13:35:29 -03:00
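For reference, Brian Kernighan's bit-counting loop mentioned in db373e4c57 looks like this. The function names and the array-of-words interface are hypothetical simplifications, not the actual __sched_cpucount signature.

```c
#include <assert.h>
#include <stddef.h>

/* Kernighan's method: each iteration clears the lowest set bit, so
   the loop runs once per set bit rather than once per bit position.  */
static int
demo_count_bits (unsigned long word)
{
  int count = 0;
  while (word != 0)
    {
      word &= word - 1;
      count++;
    }
  return count;
}

/* Sum the set bits over an array of mask words, as a CPU set would
   be counted.  */
static int
demo_cpucount (const unsigned long *words, size_t nwords)
{
  assert (words != NULL || nwords == 0);
  int total = 0;
  for (size_t i = 0; i < nwords; i++)
    total += demo_count_bits (words[i]);
  return total;
}
```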
Raoni Fassina Firmino
23fdf8178c powerpc64le: Optimize memset for POWER10
This implementation is based on __memset_power8 and integrates a lot
of suggestions from Anton Blanchard.

The biggest difference is that it makes extensive use of stxvl in the
alignment and tail code to avoid branches and small stores.  It has
three main execution paths:

a) "Short lengths" for lengths up to 64 bytes, avoiding as many
   branches as possible.

b) "General case" for larger lengths, it has an alignment section
   using stxvl to avoid branches, a 128 bytes loop and then a tail
   code, again using stxvl with few branches.

c) "Zeroing cache blocks" for lengths from 256 bytes upwards and set
   value being zero.  It is mostly the __memset_power8 code but the
   alignment phase was simplified because, at this point, address is
   already 16-bytes aligned and also changed to use vector stores.
   The tail code was also simplified to reuse the general case tail.

All unaligned stores use stxvl instructions that do not generate
alignment interrupts on POWER10, making it safe to use on
caching-inhibited memory.

On average, this implementation provides something around 30%
improvement when compared to __memset_power8.

Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2021-04-30 18:12:08 -03:00
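The three execution paths listed in 23fdf8178c can be summarized as a selection function. This is a sketch built only from the thresholds stated in the message (64 and 256 bytes); the names are hypothetical and the order of checks in the real assembly may differ.

```c
#include <assert.h>
#include <stddef.h>

enum demo_path { DEMO_SHORT, DEMO_GENERAL, DEMO_ZERO_BLOCKS };

/* Pick a path from the fill value and the length.  */
static enum demo_path
demo_memset_path (int value, size_t len)
{
  if (len <= 64)
    return DEMO_SHORT;          /* a) short lengths */
  if (len >= 256 && (unsigned char) value == 0)
    return DEMO_ZERO_BLOCKS;    /* c) zeroing cache blocks */
  return DEMO_GENERAL;          /* b) general case */
}

/* Usage: only a zero fill of 256 bytes or more takes path c).  */
static void
demo_check_paths (void)
{
  assert (demo_memset_path (0, 256) == DEMO_ZERO_BLOCKS);
  assert (demo_memset_path (1, 256) == DEMO_GENERAL);
}
```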
Tulio Magno Quites Machado Filho
e941e0ae80 powerpc64le: Optimize memcpy for POWER10
This implementation is based on __memcpy_power8_cached and integrates
suggestions from Anton Blanchard.
It benefits from loads and stores with length for short lengths and for
tail code, simplifying the code.

All unaligned memory accesses use instructions that do not generate
alignment interrupts on POWER10, making it safe to use on
caching-inhibited memory.

The main loop has also been modified in order to increase instruction
throughput by reducing the dependency on updates from previous iterations.

On average, this implementation provides around 30% improvement when
compared to __memcpy_power7 and 10% improvement in comparison to
__memcpy_power8_cached.
2021-04-30 18:12:08 -03:00
Lucas A. M. Magalhaes
dd59655e93 powerpc64le: Optimized memmove for POWER10
This patch was initially based on the __memmove_power7 with some ideas
from strncpy implementation for Power 9.

Improvements from __memmove_power7:

1. Use lxvl/stxvl for alignment code.

   The code for Power 7 uses branches when the input is not naturally
   aligned to the width of a vector. The new implementation uses
   lxvl/stxvl instead which reduces pressure on GPRs. It also allows
   the removal of branch instructions, implicitly removing branch stalls
   and mispredictions.

2. The lxv/stxv and lxvl/stxvl pairs are safe to use on Cache Inhibited
   memory.

   On Power 10 vector load and stores are safe to use on CI memory for
   addresses unaligned to 16B. This code takes advantage of this to
   do unaligned loads.

   The unaligned loads don't have a significant performance impact by
   themselves. However doing so decreases register pressure on GPRs
   and interdependence stalls on load/store pairs. This also improved
   readability as there are now fewer code paths for different alignments.
   Finally this reduces the overall code size.

3. Improved performance.

   This version runs on average about 30% better than memmove_power7
   for lengths larger than 8KB. For input lengths shorter than 8KB
   the improvement is smaller: on average about 17% better
   performance.

   This version has a degradation of about 50% for input lengths
   in the 0 to 31 bytes range when dest is unaligned.

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2021-04-30 18:12:08 -03:00
Raphael Moreira Zinsly
25cb72820a powerpc: Add log IFUNC multiarch support for POWER10
Checked on ppc64le built without --with-cpu, with --with-cpu=power9
and with --disable-multi-arch.

Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
2021-04-26 10:10:29 -03:00
Florian Weimer
4baf02b332 nptl: Move pthread_spin_trylock into libc
The symbol was moved using scripts/move-symbol-to-libc.py.
2021-04-23 17:06:48 +02:00
Florian Weimer
da8e3710d8 nptl: Move pthread_spin_lock into libc
The symbol was moved using scripts/move-symbol-to-libc.py.
2021-04-23 17:06:46 +02:00
Florian Weimer
ce4b3b7bef nptl: Move pthread_spin_init, Move pthread_spin_unlock into libc
For some architectures, the two functions are aliased, so these
symbols need to be moved at the same time.

The symbols were moved using scripts/move-symbol-to-libc.py.
2021-04-23 17:06:44 +02:00
Matheus Castanho
10624a97e8 powerpc: Add optimized strlen for POWER10
Improvements compared to POWER9 version:

1. Take into account first 16B comparison for aligned strings

   The previous version compares the first 16B and increments r4 by the number
   of bytes until the address is 16B-aligned, then starts doing aligned loads at
   that address. For aligned strings, this causes the first 16B to be compared
   twice, because the increment is 0. Here we calculate the next 16B-aligned
   address differently, which avoids that issue.

2. Use simple comparisons for the first ~192 bytes

   The main loop is good for big strings, but comparing 16B each time is better
   for smaller strings.  So after aligning the address to 16 Bytes, we check
   176B more in 16B chunks.  There may be some overlaps with the main loop for
   unaligned strings, but we avoid using the more aggressive strategy too soon,
   and also allow the loop to start at a 64B-aligned address.  This greatly
   benefits smaller strings and avoids overlapping checks if the string is
   already aligned at a 64B boundary.

3. Reduce dependencies between load blocks caused by address calculation on loop

   Doing a precise time tracing on the code showed many loads in the loop were
   stalled waiting for updates to r4 from previous code blocks.  This
   implementation avoids that as much as possible by using 2 registers (r4 and
   r5) to hold addresses to be used by different parts of the code.

   Also, the previous code aligned the address to 16B, then to 64B by doing a
   few 48B loops (if needed) until the address was aligned. The main loop could
   not start until that 48B loop had finished and r4 was updated with the
   current address. Here we calculate the address used by the loop very early,
   so it can start sooner.

   The main loop now uses 2 pointers 128B apart to make pointer updates less
   frequent, and also unrolls 1 iteration to guarantee there is enough time
   between iterations to update the pointers, reducing stalled cycles.

4. Use new P10 instructions

   lxvp is used to load 32B with a single instruction, reducing contention in
   the load queue.

   vextractbm allows simplifying the tail code for the loop, replacing
   vbpermq and avoiding having to generate a permute control vector.

Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com>
Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
2021-04-22 16:18:06 -03:00
Florian Weimer
1d95b035c7 nptl: Move __pthread_unwind_next into libc
It's necessary to stub out __libc_disable_asynccancel and
__libc_enable_asynccancel via rtld-stubbed-symbols because the new
direct references to the unwinder result in symbol conflicts when the
rtld exception handling from libc is linked in during the construction
of librtld.map.

unwind-forcedunwind.c is merged into unwind-resume.c.  libc now needs
the functions that were previously only used in libpthread.

The GLIBC_PRIVATE exports of __libc_longjmp and __libc_siglongjmp are
no longer needed, so switch them to hidden symbols.

The symbol __pthread_unwind_next has been moved using
scripts/move-symbol-to-libc.py.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-04-21 19:49:50 +02:00
Tulio Magno Quites Machado Filho
667d9c8d55 powerpc: Update libm test ulps
Update after commit 43576de04a.
2021-04-09 17:41:22 -03:00
Paul Zimmermann
9acda61d94 Fix the inaccuracy of j0f/j1f/y0f/y1f [BZ #14469, #14470, #14471, #14472]
For j0f/j1f/y0f/y1f, the largest error for all binary32
inputs is reduced to at most 9 ulps for all rounding modes.

The new code is enabled only when there is a cancellation at the very end of
the j0f/j1f/y0f/y1f computation, or for very large inputs, thus should not
give any visible slowdown on average.  Two different algorithms are used:

* around the first 64 zeros of j0/j1/y0/y1, approximation polynomials of
  degree 3 are used, computed using the Sollya tool (https://www.sollya.org/)

* for large inputs, an asymptotic formula from [1] is used

[1] Fast and Accurate Bessel Function Computation,
    John Harrison, Proceedings of Arith 19, 2009.

Inputs yielding the new largest errors are added to auto-libm-test-in,
and ulps are regenerated for various targets (thanks Adhemerval Zanella).

Tested on x86_64 with --disable-multi-arch and on powerpc64le-linux-gnu.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-04-02 06:15:48 +02:00
Andreas Schwab
5ccea9a011 powerpc64le: Use ifunc for _Float128 functions also in libc
This fixes missing definition of math functions in libc in a static link
that are no longer built for libm after commit 4898d9712b ("Avoid adding
duplicated symbols into static libraries").
2021-04-01 10:55:42 +02:00
Raphael Moreira Zinsly
a7d88506c2 powerpc: Add optimized llogb* for POWER9
The POWER9 builtins used to improve the ilogb* functions can be
used in the llogb* functions as well.
2021-03-16 12:19:09 -03:00
Raphael Moreira Zinsly
56c81132cc powerpc: Add optimized ilogb* for POWER9
The instructions xsxexpdp and xsxexpqp introduced on POWER9 extract
the exponent from a double-precision and quad-precision floating-point
respectively, thus they can be used to improve ilogb, ilogbf and ilogbf128.
2021-03-16 12:19:09 -03:00
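As a portable model of what 56c81132cc speeds up, the sketch below pulls the exponent field out of a double by hand. It assumes that double is IEEE 754 binary64 and that the instruction's result corresponds to the biased exponent field; both are assumptions to check against the Power ISA, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the 11-bit biased exponent field of a binary64 value.  */
static unsigned int
demo_biased_exponent (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);   /* well-defined way to read the bits */
  return (unsigned int) ((bits >> 52) & 0x7ff);
}

/* ilogb for finite, normal, nonzero inputs only.  Zero, subnormal,
   infinite, and NaN inputs need the special handling that a full
   ilogb implementation provides.  */
static int
demo_ilogb_normal (double x)
{
  unsigned int e = demo_biased_exponent (x);
  assert (e != 0 && e != 0x7ff);
  return (int) e - 1023;             /* remove the binary64 bias */
}
```

A single instruction replacing the shift, mask, and memory round trip is where the gain for ilogb, ilogbf, and ilogbf128 would come from.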
Matheus Castanho
c82e691c56 powerpc: Update libm-test-ulps
Generated with 'make regen-ulps' on POWER8.

Tested on powerpc, powerpc64, and powerpc64le
2021-03-16 09:23:41 -03:00
Florian Weimer
82215c1e25 powerpc: Regenerate ulps
This time on a POWER8 machine.
2021-03-03 18:39:17 +01:00
Matheus Castanho
40d055a2dd powerpc: Update libm-test-ulps
Generated with 'make regen-ulps'

Tested on powerpc, powerpc64, and powerpc64le
2021-03-02 10:08:07 -03:00
Florian Weimer
9fc813e1a3 Implement <unwind-link.h> for dynamically loading the libgcc_s unwinder
This will be used to consolidate the libgcc_s access for backtrace
and pthread_cancel.

Unlike the existing backtrace implementations, it provides some
hardening based on pointer mangling.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2021-03-01 15:58:01 +01:00
Florian Weimer
035c012e32 Reduce the statically linked startup code [BZ #23323]
It turns out the startup code in csu/elf-init.c has a perfect pair of
ROP gadgets (see Marco-Gisbert and Ripoll-Ripoll, "return-to-csu: A
New Method to Bypass 64-bit Linux ASLR").  These functions are not
needed in dynamically-linked binaries because DT_INIT/DT_INIT_ARRAY
are already processed by the dynamic linker.  However, the dynamic
linker skipped the main program for some reason.  For maximum
backwards compatibility, this is not changed, and instead, the main
map is consulted from __libc_start_main if the init function argument
is a NULL pointer.

For statically linked binaries, the old approach based on linker
symbols is still used because there is nothing else available.

A new symbol version __libc_start_main@@GLIBC_2.34 is introduced because
new binaries running on an old libc would not run their ELF
constructors, leading to difficult-to-debug issues.
2021-02-25 12:13:02 +01:00
Raoni Fassina Firmino
5ee506ed35 powerpc64: Workaround sigtramp vdso return call
A not so recent kernel change[1] changed how the trampoline
`__kernel_sigtramp_rt64` is used to call signal handlers.

This was exposed by the test misc/tst-sigcontext-get_pc.

Before kernel 5.9, the kernel set LR to the trampoline address and
jumped directly to the signal handler, and at the end the signal
handler, as any other function, would `blr` to the address set.  In
other words, the trampoline was executed only at the end of the signal
handler, and the only thing it did was call sigreturn.  Since
kernel 5.9, the kernel sets CTR to the signal handler and calls
the trampoline code.  The trampoline then does `bctrl` to the address
in CTR, setting LR to the next instruction in the middle of the
trampoline.  When the signal handler returns, the rest of the
trampoline code executes the same code as before.

Here is the full trampoline code as of kernel 5.11.0-rc5 for
reference:

    V_FUNCTION_BEGIN(__kernel_sigtramp_rt64)
    .Lsigrt_start:
            bctrl   /* call the handler */
            addi    r1, r1, __SIGNAL_FRAMESIZE
            li      r0,__NR_rt_sigreturn
            sc
    .Lsigrt_end:
    V_FUNCTION_END(__kernel_sigtramp_rt64)

This new behavior breaks how `backtrace()` used to detect the
trampoline frame to correctly reconstruct the stack frame when it is
called from inside a signal handler.

This workaround relies on the fact that the trampoline code is at the very
least two (maybe 3?) instructions in size (as it is in the 32-bit
version, only `li` and `sc`), so it is safe to check that the return
address is in the range __kernel_sigtramp_rt64 .. + 4.

[1] subject: powerpc/64/signal: Balance return predictor stack in signal trampoline
    commit: 0138ba5783ae0dcc799ad401a1e8ac8333790df9
    url: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0138ba5783ae0dcc799ad401a1e8ac8333790df9

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-01-28 13:57:50 -03:00
Florian Weimer
527c89cd32 powerpc64: Select POWER9 machine for the scv instruction
It is not available with the baseline ISA.

Fixes commit 68ab82f566
("powerpc: Runtime selection between sc and scv for syscalls").

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2021-01-22 10:45:27 +01:00
Joseph Myers
a031b3abad Update powerpc-nofpu libm-test-ulps. 2021-01-18 20:21:07 +00:00
Paul Eggert
2b778ceb40 Update copyright dates with scripts/update-copyrights
I used these shell commands:

../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted of lines saying "FOO: warning:
copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
2021-01-02 12:17:34 -08:00
Matheus Castanho
68ab82f566 powerpc: Runtime selection between sc and scv for syscalls
Linux kernel v5.9 added support for system calls using the scv
instruction for POWER9 and later.  The new codepath provides better
performance (see below) compared to using sc.  For the
foreseeable future, both sc and scv mechanisms will co-exist, so this
patch enables glibc to do a runtime check and use scv when it is
available.

Before issuing the system call to the kernel, we check hwcap2 in the TCB
for PPC_FEATURE2_SCV to see if scv is supported by the kernel.  If not,
we fall back to sc and keep the old behavior.

The kernel implements a different error return convention for scv, so
when returning from a system call we need to handle the return value
differently depending on the instruction we used to enter the kernel.
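
As a rough illustration of the two conventions, the sketch below
assumes what the kernel's powerpc syscall ABI notes describe: sc flags
failure in CR0.SO and leaves a positive errno in r3, while scv returns
a negative errno (-4095..-1) in r3.  The struct and function names are
hypothetical and do not correspond to glibc macros.

```c
#include <stdbool.h>

/* Hypothetical model of what the kernel leaves behind after a
   system call; not a glibc type.  */
struct raw_result
{
  long r3;      /* value left in r3 by the kernel */
  bool cr0_so;  /* CR0 summary-overflow bit; only meaningful for sc */
};

/* Normalize both conventions to "negative errno on failure".  */
static inline long
normalize_result (struct raw_result res, bool used_scv)
{
  if (used_scv)
    /* scv (assumed): r3 already holds -errno on failure.  */
    return res.r3;
  /* sc (assumed): CR0.SO flags failure and r3 holds a positive errno.  */
  return res.cr0_so ? -res.r3 : res.r3;
}
```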

For syscalls implemented in ASM, entry and exit are implemented by
different macros (PSEUDO and PSEUDO_RET, resp.), which may be used in
sequence (e.g. for templated syscalls) or with other instructions in
between (e.g. clone).  To avoid accessing the TCB a second time on
PSEUDO_RET to check which instruction we used, the value read from
hwcap2 is cached on a non-volatile register.

This is not needed when using INTERNAL_SYSCALL macro, since entry and
exit are bundled into the same inline asm directive.

The dynamic loader may issue syscalls before the TCB has been set up,
so it always uses sc with no extra checks.  For the static case, there
is no compile-time way to determine if we are inside startup code,
so we also check the value of the thread pointer before effectively
accessing the TCB.  For such situations in which the availability of
scv cannot be determined, sc is always used.
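
The selection logic can be pictured with the minimal sketch below.  It
is not the assembly or inline asm from the patch: `struct sketch_tcb`
is a hypothetical stand-in for the TCB field caching hwcap2, and
`SKETCH_FEATURE2_SCV` is assumed to mirror the value of
PPC_FEATURE2_SCV.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed to mirror PPC_FEATURE2_SCV; defined locally for the sketch.  */
#define SKETCH_FEATURE2_SCV 0x00100000UL

/* Hypothetical stand-in for the TCB field that caches hwcap2.  */
struct sketch_tcb
{
  unsigned long hwcap2;
};

static inline bool
can_use_scv (const struct sketch_tcb *tcb)
{
  /* Startup code: the thread pointer is not set up yet, so the
     availability of scv cannot be determined and sc is used.  */
  if (tcb == NULL)
    return false;
  return (tcb->hwcap2 & SKETCH_FEATURE2_SCV) != 0;
}
```

Treating a missing TCB as "no scv" matches the conservative fallback
described above for startup code.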

Support for scv in syscalls implemented in their own ASM file (clone and
vfork) will be added later. For now simply use sc as before.

Average performance over 1M calls for each syscall "type":
  - stat: C wrapper calling INTERNAL_SYSCALL
  - getpid: templated ASM syscall
  - syscall: call to gettid using syscall function

  Standard:
     stat : 1.573445 us / ~3619 cycles
   getpid : 0.164986 us / ~379 cycles
  syscall : 0.162743 us / ~374 cycles

  With scv:
     stat : 1.537049 us / ~3535 cycles <~ -84 cycles  / -2.32%
   getpid : 0.109923 us / ~253 cycles  <~ -126 cycles / -33.25%
  syscall : 0.116410 us / ~268 cycles  <~ -106 cycles / -28.34%

Tested on powerpc, powerpc64, powerpc64le (with and without scv)

Tested-by: Lucas A. M. Magalhães <lamm@linux.ibm.com>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2020-12-30 18:26:25 -03:00
Florian Weimer
2aa8ec7dd7 powerpc: Regenerate ulps
For new inputs added in commit cad5ad81d2,
as seen on a POWER8 system.
2020-12-22 19:22:44 +01:00
Florian Weimer
4c38c1a229 powerpc64le: Add glibc-hwcaps support
The "power10" and "power9" subdirectories are selected in a way
that matches the -mcpu=power10 and -mcpu=power9 options of GCC.
2020-12-04 14:50:49 +01:00
Paul E. Murphy
33fc34521d powerpc64le: ifunc select *f128 routines in multiarch mode
Programmatically generate simple wrappers for interesting libm *f128
objects.  Selected functions are transcendental functions or
those with trivial compiler builtins.  This can result in a 2-3x
speedup (e.g. logf128 and expf128).

A second set of implementation files is generated, each of which
includes the first implementation encountered along the search path.
This usually works, except when a wrapper is overridden and makefile
search order slightly diverges from include order.  Likewise,
wrapper object files are created for each generated file.  These
hold the ifunc selection routines which export ABI.

Next, several shared headers are intercepted to control renaming:
asm function redirects are used first, and sometimes macro renames
if the former is impractical.

Notably, if the requested machine supports hardware IEEE128 (i.e. POWER9
and newer) this ifunc machinery is disabled.  Likewise existing
ifunc support for float128 is consolidated into this (e.g. sqrtf128
and fmaf128).
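
The idea of a selection routine can be sketched in portable C as
below.  This is not the generated wrapper code: a plain function
pointer stands in for the real ifunc mechanism, the two `impl_*`
functions are placeholders, and `SKETCH_HAS_IEEE128` is assumed to
mirror the hwcap2 bit PPC_FEATURE2_HAS_IEEE128.

```c
/* Assumed to mirror PPC_FEATURE2_HAS_IEEE128; defined locally.  */
#define SKETCH_HAS_IEEE128 0x00400000UL

typedef const char *(*impl_fn) (void);

/* Placeholder implementations; real ones would be the *f128 routines
   built for different machines.  */
static const char *impl_soft (void) { return "software"; }
static const char *impl_hw (void) { return "hardware"; }

/* Resolver: pick the variant once, based on the hardware capabilities.  */
static inline impl_fn
resolve_impl (unsigned long hwcap2)
{
  return (hwcap2 & SKETCH_HAS_IEEE128) != 0 ? impl_hw : impl_soft;
}
```

With a real ifunc the dynamic loader runs the resolver while binding
the symbol, so callers pay no per-call selection cost.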

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2020-11-30 09:56:14 -06:00
Matheus Castanho
1e0a7fd099 powerpc: Make PT_THREAD_POINTER available to assembly code
PT_THREAD_POINTER is currently defined inside a #ifndef __ASSEMBLER__ block, but
its usage should not be limited to C code, as it can be useful when accessing
the TLS from assembly code as well.

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
2020-11-24 14:15:01 -03:00