Commit Graph

320 Commits

Siddhesh Poyarekar
b5bd5bfe88 glibc.malloc.check: Wean away from malloc hooks
The malloc-check debugging feature is tightly integrated into glibc
malloc, so thanks to an idea from Florian Weimer, much of the malloc
implementation has been moved into libc_malloc_debug.so to support
malloc-check.  Due to this, glibc malloc and malloc-check can no
longer work together; they use altogether different (but identical)
structures for heap management.  This should not make a difference,
though, since the malloc-check hook is not disabled anywhere except in
malloc_set_state, which does so early enough that it shouldn't cause
any problems.

The malloc check tunable is now in the debug DSO and has no effect
when the DSO is not preloaded.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
2021-07-22 18:38:08 +05:30
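As an illustration of the preload requirement described in the commit above, a tiny heap-overflow reproducer that glibc.malloc.check is meant to catch could look like the sketch below (the DSO path and tunable value are assumptions, not taken from the commit):

/* Sketch: a heap overflow that glibc.malloc.check is meant to catch.
   Assumed invocation (illustrative only):
     GLIBC_TUNABLES=glibc.malloc.check=3 \
     LD_PRELOAD=libc_malloc_debug.so ./overflow  */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p = malloc(16);
    if (p == NULL)
        return 1;
    memset(p, 'A', 17);   /* writes one byte past the allocation */
    free(p);              /* malloc-check should report the corruption here */
    return 0;
}
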
Wilco Dijkstra
6a34c928c2 AArch64: Add hp-timing.h
Add hp-timing.h using the cntvct_el0 counter. Return timing in nanoseconds
so it is fully compatible with generic hp-timing. Don't set HP_TIMING_INLINE
in the dynamic linker since it adds unnecessary overheads and some ancient
kernels may not handle emulating cntvct_el0 correctly. Currently cntvct_el0 is
only used for timing in the benchtests.

Reviewed-by: Szabolcs Nagy  <szabolcs.nagy@arm.com>
2021-07-01 15:42:05 +01:00
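A minimal user-space sketch of the cntvct_el0-based timing idea from the commit above (not the actual hp-timing.h code): read the counter and its frequency, then convert ticks to nanoseconds.

/* Sketch: read the AArch64 generic timer and convert ticks to
   nanoseconds, mirroring the idea (not the code) of hp-timing.h.  */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t cntvct_ticks(void)
{
    uint64_t t;
    __asm__ __volatile__ ("mrs %0, cntvct_el0" : "=r" (t));
    return t;
}

static inline uint64_t cntfrq_hz(void)
{
    uint64_t f;
    __asm__ __volatile__ ("mrs %0, cntfrq_el0" : "=r" (f));
    return f;
}

int main(void)
{
    uint64_t start = cntvct_ticks();
    /* ... code under test ... */
    uint64_t end = cntvct_ticks();
    /* May overflow for very long intervals; fine for a sketch.  */
    uint64_t ns = (end - start) * 1000000000u / cntfrq_hz();
    printf("%llu ns\n", (unsigned long long) ns);
    return 0;
}
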
Wilco Dijkstra
252cad02d4 AArch64: Improve strnlen performance
Optimize strnlen by avoiding UMINV which is slow on most cores. On Neoverse N1
large strings are 1.8x faster than the current version, and bench-strnlen is
50% faster overall. This version is MTE compatible.

Reviewed-by: Szabolcs Nagy  <szabolcs.nagy@arm.com>
2021-07-01 15:32:36 +01:00
H.J. Lu
3213ed770c Update math: redirect roundeven function
Redirect target specific roundeven functions for aarch64, ldbl-128ibm
and riscv.
2021-06-27 07:56:57 -07:00
Wilco Dijkstra
6a86bc0992 AArch64: Add support for roundeven[f]
Add inline assembler for the roundeven functions.
Passes GLIBC regression.  Note GCC does not inline the builtin (PR100966),
so this cannot be used for now.
2021-06-08 13:33:09 +01:00
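As a hedged illustration of the inline-assembler approach mentioned above, roundeven on AArch64 maps naturally onto FRINTN (round to nearest, ties to even); the sketch below is illustrative only and not the glibc code.

/* Sketch: roundeven via the AArch64 FRINTN instruction, which rounds
   to nearest with ties to even.  Illustrative only.  */
#include <stdio.h>

static inline double roundeven_sketch(double x)
{
    double r;
    __asm__ ("frintn %d0, %d1" : "=w" (r) : "w" (x));
    return r;
}

static inline float roundevenf_sketch(float x)
{
    float r;
    __asm__ ("frintn %s0, %s1" : "=w" (r) : "w" (x));
    return r;
}

int main(void)
{
    /* Ties go to the even neighbour: 2.5 -> 2, 3.5 -> 4.  */
    printf("%g %g\n", roundeven_sketch(2.5), roundeven_sketch(3.5));
    printf("%g\n", (double) roundevenf_sketch(0.5f));   /* 0 */
    return 0;
}
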
Naohiro Tamura
4f26956d5b aarch64: Added optimized memset for A64FX
This patch optimizes the performance of memset for A64FX [1], which
implements ARMv8-A SVE and has a 64KB L1 cache per core and an 8MB L2
cache per NUMA node.

The performance optimization makes use of the Scalable Vector Registers
with several techniques such as loop unrolling, memory access
alignment, cache zero fill and prefetch.

The SVE assembler code for memset is implemented as Vector Length
Agnostic code, so in theory it can run on any SoC that supports the
ARMv8-A SVE standard.

We confirmed that all test cases pass by running 'make check' and
'make xcheck', not only on A64FX but also on ThunderX2.

We also confirmed, by running 'make bench', that the performance of the
SVE 512-bit vector registers is roughly 4 times better than the
Advanced SIMD 128-bit registers and 8 times better than the scalar
64-bit registers.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-05-27 09:47:53 +01:00
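The Vector Length Agnostic idea from the commit above can be sketched with the ACLE SVE intrinsics; this is a simplified C illustration of a VLA memset loop, not the A64FX assembly, and assumes compilation with -march=armv8.2-a+sve.

/* Sketch: a Vector Length Agnostic memset loop with SVE ACLE
   intrinsics.  Build with e.g. -march=armv8.2-a+sve; illustrative
   only, not the A64FX implementation.  */
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

void *sve_memset_sketch(void *dst, int c, size_t n)
{
    uint8_t *d = dst;
    svuint8_t v = svdup_n_u8((uint8_t) c);

    /* svcntb() is the vector length in bytes, so the same binary runs
       on 128-bit up to 2048-bit SVE implementations.  */
    for (size_t i = 0; i < n; i += svcntb())
    {
        svbool_t pg = svwhilelt_b8_u64(i, n);   /* predicate covers the tail */
        svst1_u8(pg, d + i, v);
    }
    return dst;
}
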
Naohiro Tamura
fa527f345c aarch64: Added optimized memcpy and memmove for A64FX
This patch optimizes the performance of memcpy/memmove for A64FX [1],
which implements ARMv8-A SVE and has a 64KB L1 cache per core and an
8MB L2 cache per NUMA node.

The performance optimization makes use of the Scalable Vector Registers
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and software pipelining.

The SVE assembler code for memcpy/memmove is implemented as Vector
Length Agnostic code, so in theory it can run on any SoC that supports
the ARMv8-A SVE standard.

We confirmed that all test cases pass by running 'make check' and
'make xcheck', not only on A64FX but also on ThunderX2.

We also confirmed, by running 'make bench', that the performance of the
SVE 512-bit vector registers is roughly 4 times better than the
Advanced SIMD 128-bit registers and 8 times better than the scalar
64-bit registers.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-05-27 09:47:53 +01:00
Naohiro Tamura
bd4317fbd6 aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
This patch defines BTI_C and BTI_J macros conditionally for
performance.
If HAVE_AARCH64_BTI is true, BTI_C and BTI_J are defined as HINT
instructions for ARMv8.5 BTI (Branch Target Identification).
If HAVE_AARCH64_BTI is false, both BTI_C and BTI_J are defined as
NOP.
2021-05-26 12:01:06 +01:00
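A hedged sketch of what such conditional macros can look like in an assembly header; the HINT encodings are architectural ('bti c' is hint #34, 'bti j' is hint #36), but the exact glibc definitions may differ.

/* Sketch of conditional BTI macros for assembly sources.  hint #34 is
   'bti c' and hint #36 is 'bti j'; both execute as NOPs on cores
   without BTI.  The exact glibc definitions may differ.  */
#if HAVE_AARCH64_BTI
# define BTI_C  hint 34   /* bti c: valid target for indirect calls */
# define BTI_J  hint 36   /* bti j: valid target for indirect jumps */
#else
# define BTI_C  nop
# define BTI_J  nop
#endif
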
Naohiro Tamura
77d175e14e config: Added HAVE_AARCH64_SVE_ASM for aarch64
This patch checks whether the assembler supports '-march=armv8.2-a+sve'
to generate SVE code, and if so defines the HAVE_AARCH64_SVE_ASM macro.
2021-05-26 12:01:06 +01:00
Szabolcs Nagy
2208066603 elf: Remove lazy tlsdesc relocation related code
Remove generic tlsdesc code related to lazy tlsdesc processing since
lazy tlsdesc relocation is no longer supported.  This includes removing
GL(dl_load_lock) from _dl_make_tlsdesc_dynamic which is only called at
load time when that lock is already held.

Added a documentation comment too.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-04-21 14:35:53 +01:00
Szabolcs Nagy
e06e6554c3 aarch64: update libm test ulps
Update after commit 43576de04a.
2021-04-08 08:24:30 +01:00
Szabolcs Nagy
69499bb6ee aarch64: free tlsdesc data on dlclose [BZ #27403]
DL_UNMAP_IS_SPECIAL and DL_UNMAP were not defined. The definitions are
now copied from arm, since the same is needed on aarch64. The cleanup
of tlsdesc data is handled by the custom _dl_unmap.

Fixes bug 27403.
2021-04-06 14:35:05 +01:00
Paul Zimmermann
9acda61d94 Fix the inaccuracy of j0f/j1f/y0f/y1f [BZ #14469, #14470, #14471, #14472]
For j0f/j1f/y0f/y1f, the largest error for all binary32
inputs is reduced to at most 9 ulps for all rounding modes.

The new code is enabled only when there is a cancellation at the very end of
the j0f/j1f/y0f/y1f computation, or for very large inputs, thus should not
give any visible slowdown on average.  Two different algorithms are used:

* around the first 64 zeros of j0/j1/y0/y1, approximation polynomials of
  degree 3 are used, computed using the Sollya tool (https://www.sollya.org/)

* for large inputs, an asymptotic formula from [1] is used

[1] Fast and Accurate Bessel Function Computation,
    John Harrison, Proceedings of Arith 19, 2009.

Inputs yielding the new largest errors are added to auto-libm-test-in,
and ulps are regenerated for various targets (thanks Adhemerval Zanella).

Tested on x86_64 with --disable-multi-arch and on powerpc64le-linux-gnu.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-04-02 06:15:48 +02:00
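For illustration, the leading term of the large-input asymptotic expansion referenced in [1] has the familiar closed form shown in the C sketch below; this is only the first-order term, not the full formula or the glibc code.

/* Sketch: leading-order large-x asymptotics,
     j0(x) ~ sqrt(2/(pi*x)) * cos(x - pi/4)
     y0(x) ~ sqrt(2/(pi*x)) * sin(x - pi/4).
   The paper uses further correction terms and careful argument
   reduction; this is only the first term, for illustration.  */
#define _DEFAULT_SOURCE   /* for the j0/y0 and M_PI declarations */
#include <math.h>
#include <stdio.h>

static double j0_asympt(double x)
{
    return sqrt(2.0 / (M_PI * x)) * cos(x - M_PI / 4.0);
}

static double y0_asympt(double x)
{
    return sqrt(2.0 / (M_PI * x)) * sin(x - M_PI / 4.0);
}

int main(void)
{
    double x = 1000.0;
    printf("j0(%g) ~ %.6f (libm: %.6f)\n", x, j0_asympt(x), j0(x));
    printf("y0(%g) ~ %.6f (libm: %.6f)\n", x, y0_asympt(x), y0(x));
    return 0;
}
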
Szabolcs Nagy
1dc17ea8f8 aarch64: Optimize __libc_mtag_tag_zero_region
This is a target hook for memory tagging; the original was a naive
implementation. It uses the same algorithm as __libc_mtag_tag_region,
but with instructions that also zero the memory.  This was not
benchmarked on a real CPU, but it is expected to be faster than the
naive implementation.
2021-03-26 11:03:06 +00:00
Szabolcs Nagy
23fd760add aarch64: Optimize __libc_mtag_tag_region
This is a target hook for memory tagging; the original was a naive
implementation. The optimized version relies on "dc gva" to tag 64
bytes at a time for large allocations and optimizes small cases without
adding too many branches. This was not benchmarked on a real CPU, but
it is expected to be faster than the naive implementation.
2021-03-26 11:03:06 +00:00
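A rough C sketch of the "dc gva" idea from the commit above, assuming the tag-granule block matches the usual 64-byte DC ZVA block size and that pointer and size are already aligned; this is illustrative only, not the actual assembly.

/* Sketch: tag a region one block at a time with "dc gva", which sets
   the allocation tags of a DC ZVA sized block (commonly 64 bytes) to
   the tag held in the address.  Assumes ptr and size are block
   aligned and that the toolchain accepts MTE instructions
   (e.g. -march=armv8.5-a+memtag).  Illustrative only.  */
#include <stddef.h>
#include <stdint.h>

void tag_region_sketch(void *ptr, size_t size)
{
    uintptr_t p = (uintptr_t) ptr;
    uintptr_t end = p + size;

    for (; p < end; p += 64)
        __asm__ __volatile__ ("dc gva, %0" : : "r" (p) : "memory");
}
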
Szabolcs Nagy
383bc24028 aarch64: inline __libc_mtag_new_tag
This is a common operation when heap tagging is enabled, so inline the
instructions instead of using an extern call.
2021-03-26 11:03:06 +00:00
Szabolcs Nagy
40dc773f92 aarch64: inline __libc_mtag_address_get_tag
This is a common operation when heap tagging is enabled, so inline the
instruction instead of using an extern call.

The .inst directive is used instead of the instruction mnemonic (or the
ACLE intrinsics) because malloc.c is not compiled for the
armv8.5-a+memtag architecture; runtime CPU support detection is used
instead.

Prototypes are removed from the comments as they were not always
correct.
2021-03-26 11:03:06 +00:00
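For reference, the ACLE intrinsics that such .inst sequences stand in for look roughly like the sketch below when a file can be built for armv8.5-a+memtag; this is illustrative and not the malloc.c code.

/* Sketch: the ACLE MTE intrinsics that such inlined .inst sequences
   correspond to (irg and ldg).  Requires building the file with
   -march=armv8.5-a+memtag; illustrative only, not the malloc.c code.  */
#include <arm_acle.h>

void *new_tag_sketch(void *p)
{
    /* irg: return p with a freshly generated random allocation tag.  */
    return __arm_mte_create_random_tag(p, 0);
}

void *get_tag_sketch(void *p)
{
    /* ldg: read the allocation tag of the granule containing p and
       return p with that tag inserted.  */
    return __arm_mte_get_tag(p);
}
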
Szabolcs Nagy
c076a0bc69 malloc: Only support zeroing and not arbitrary memset with mtag
The memset api is suboptimal and does not provide much benefit. Memory
tagging only needs a zeroing memset (and only for memory that's sized
and aligned to multiples of the tag granule), so change the internal
api and the target hooks accordingly.  This is to simplify the
implementation of the target hook.

Reviewed-by: DJ Delorie <dj@redhat.com>
2021-03-26 11:03:06 +00:00
Wilco Dijkstra
db3f7bb558 math: Remove slow paths from asin and acos [BZ #15267]
This patch series removes all remaining slow paths and related code.
First the asin/acos, tan, atan and atan2 implementations are updated, and
the final patch removes the unused mpa files, headers and probes. Passes
build-many-glibcs.py.

Remove slow paths from asin/acos. Add ULP annotations based on previous slow
path checks (which are approximate). Update AArch64 and x86_64 libm-test-ulps.

Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
2021-03-11 14:26:36 +00:00
Szabolcs Nagy
9fb07fd4e1 aarch64: update ulps.
For new test cases in
commit 5a051454a9
2021-03-01 12:29:42 +00:00
Florian Weimer
035c012e32 Reduce the statically linked startup code [BZ #23323]
It turns out the startup code in csu/elf-init.c has a perfect pair of
ROP gadgets (see Marco-Gisbert and Ripoll-Ripoll, "return-to-csu: A
New Method to Bypass 64-bit Linux ASLR").  These functions are not
needed in dynamically-linked binaries because DT_INIT/DT_INIT_ARRAY
are already processed by the dynamic linker.  However, the dynamic
linker skipped the main program for some reason.  For maximum
backwards compatibility, this is not changed, and instead, the main
map is consulted from __libc_start_main if the init function argument
is a NULL pointer.

For statically linked binaries, the old approach based on linker
symbols is still used because there is nothing else available.

A new symbol version __libc_start_main@@GLIBC_2.34 is introduced because
new binaries running on an old libc would not run their ELF
constructors, leading to difficult-to-debug issues.
2021-02-25 12:13:02 +01:00
Szabolcs Nagy
04c6a8073d aarch64: Fix the list of tested IFUNC variants [BZ #26818]
Some IFUNC variants are not compatible with BTI and MTE so don't
set them as usable for testing and benchmarking on a BTI or MTE
enabled system.

As far as IFUNC selectors are concerned a system is BTI enabled if
the cpu supports it and glibc was built with BTI branch protection.

Most IFUNC variants are BTI compatible, but thunderx2 memcpy and
memmove use a jump table with indirect jump, without a BTI j.

Fixes bug 26818.
2021-01-25 16:15:54 +00:00
Szabolcs Nagy
c3c4a25e65 aarch64: Move and update the definition of MTE_ENABLED
The hwcap value is now in Linux 5.10 and in glibc's bits/hwcap.h, so use
that definition.

Move the definition to init-arch.h so all ifunc selectors can use it
and expose an "mte" shorthand for mte enabled runtime.

For now we allow user code to enable tag checks and use PROT_MTE
mappings without libc involvement; this is not a guaranteed ABI, but it
can be useful for testing and debugging with MTE.
2021-01-25 15:35:43 +00:00
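Outside the ifunc machinery, the same runtime check can be written with getauxval; a small sketch follows (the HWCAP2_MTE fallback definition is an assumption in case older headers lack it):

/* Sketch: detect MTE at runtime the same way the ifunc selectors do,
   via the HWCAP2 word of the auxiliary vector.  */
#include <stdio.h>
#include <sys/auxv.h>

#ifndef HWCAP2_MTE
# define HWCAP2_MTE (1UL << 18)   /* fallback; matches the Linux uapi value */
#endif

int main(void)
{
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
    printf("MTE %s\n", (hwcap2 & HWCAP2_MTE) ? "available" : "not available");
    return 0;
}
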
Shuo Wang
28f2ce2772 aarch64: revert memcpy optimize for kunpeng to avoid performance degradation
In commit 863d775c48, kunpeng920 was added to the default memcpy version;
however, there is a performance degradation when the copy size is large,
e.g. 100k.  This is the result, tested in glibc-2.28:
              before backport   after backport   Performance improvement
memcpy_1k          0.005             0.005              0.00%
memcpy_10k         0.032             0.029             10.34%
memcpy_100k        0.356             0.429            -17.02%
memcpy_1m          7.470            11.153            -33.02%

This is the demo:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char a[1024*1024] = {12};
char b[1024*1024] = {13};
int main(int argc, char *argv[])
{
    int i = atoi(argv[1]);      /* number of iterations */
    int j;
    int size = atoi(argv[2]);   /* copy size in KB */

    for (j = 0; j < i; j++)
        memcpy(b, a, size*1024);
    return 0;
}

# gcc -g -O0 memcpy.c -o memcpy
# time taskset -c 10 ./memcpy 100000 1024

Co-authored-by: liqingqing <liqingqing3@huawei.com>
2021-01-21 16:44:15 +00:00
Szabolcs Nagy
374cef32ac configure: Check for static PIE support
Add SUPPORT_STATIC_PIE that targets can define if they support
static PIE. This requires PI_STATIC_AND_HIDDEN support and various
linker features as described in

  commit 9d7a3741c9
  Add --enable-static-pie configure option to build static PIE [BZ #19574]

Currently defined on x86_64, i386 and aarch64 where static PIE is
known to work.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-01-21 15:54:50 +00:00
Szabolcs Nagy
2f056e8a5d aarch64: define PI_STATIC_AND_HIDDEN
AArch64 always uses pc-relative access to static and hidden object
symbols, but the config setting was previously missing.

This affects the ld.so startup code.
2021-01-08 11:14:02 +00:00
Wilco Dijkstra
9e97f239ea Remove dbl-64/wordsize-64 (part 2)
Remove the wordsize-64 implementations by merging them into the main dbl-64
directory.  The second patch just moves all wordsize-64 files and removes a
few wordsize-64 uses in comments and Implies files.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-01-07 15:26:26 +00:00
Shuo Wang
f5082c7010 aarch64: push the set of rules before falling into slow path
The set of CFI rules for the instructions is supposed to be saved (pushed) before falling into the slow path.

Tested in glibc-2.28 before fixing:
Thread 2 "xxxxxxx" hit Breakpoint 1, _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:149
149		stp	x1,  x2, [sp, #-32]!
Missing separate debuginfos, use: dnf debuginfo-install libgcc-7.3.0-20190804.h24.aarch64
(gdb) ni
_dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:150
150		stp	x3,  x4, [sp, #16]
(gdb)
_dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:157
157		mrs	x4, tpidr_el0
(gdb)
158		ldr	PTR_REG (1), [x0,#TLSDESC_ARG]
(gdb)
159		ldr	PTR_REG (0), [x4,#TCBHEAD_DTV]
(gdb)
160		ldr	PTR_REG (3), [x1,#TLSDESC_GEN_COUNT]
(gdb)
161		ldr	PTR_REG (2), [x0,#DTV_COUNTER]
(gdb)
162		cmp	PTR_REG (3), PTR_REG (2)
(gdb)
163		b.hi	2f
(gdb)
165		ldp	PTR_REG (2), PTR_REG (3), [x1,#TLSDESC_MODID]
(gdb)
166		add	PTR_REG (0), PTR_REG (0), PTR_REG (2), lsl #(PTR_LOG_SIZE + 1)
(gdb)
167		ldr	PTR_REG (0), [x0] /* Load val member of DTV entry.  */
(gdb)
168		cmp	PTR_REG (0), #TLS_DTV_UNALLOCATED
(gdb)
169		b.eq	2f
(gdb) bt
#0  _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:169
#1  0x0000ffffbe4fbb44 in OurFunction (threadId=4294967295)
    at /home/test/test_function.c:30
#2  0x0000000000400c08 in initaaa () at thread.c:58
#3  0x0000000000400c50 in thread_proc (param=0x0) at thread.c:71
#4  0x0000ffffbf6918bc in start_thread (arg=0xfffffffff29f) at pthread_create.c:486
#5  0x0000ffffbf5669ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) ni
_dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:184
184		stp	x29, x30, [sp,#-16*NSAVEXREGPAIRS]!
(gdb) bt
#0  _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:184
#1  0x0000ffffbe4fbb44 in OurFunction (threadId=4294967295)
    at /home/test/test_function.c:30
#2  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Co-authored-by: liqingqing <liqingqing3@huawei.com>
2021-01-05 09:25:19 +00:00
Shuo Wang
cd6274089f aarch64: fix stack missing after sp is updated
After sp is updated, the CFA offset should be set before the next instruction.
Tested in glibc-2.28:
Thread 2 "xxxxxxx" hit Breakpoint 1, _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:149
149		stp	x1,  x2, [sp, #-32]!
Missing separate debuginfos, use: dnf debuginfo-install libgcc-7.3.0-20190804.h24.aarch64
(gdb) bt
#0  _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:149
#1  0x0000ffffbe4fbb44 in OurFunction (threadId=3194870184)
    at /home/test/test_function.c:30
#2  0x0000000000400c08 in initaaa () at thread.c:58
#3  0x0000000000400c50 in thread_proc (param=0x0) at thread.c:71
#4  0x0000ffffbf6918bc in start_thread (arg=0xfffffffff29f) at pthread_create.c:486
#5  0x0000ffffbf5669ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) ni
_dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:150
150		stp	x3,  x4, [sp, #16]
(gdb) bt
#0  _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:150
#1  0x0000ffffbe4fbb44 in OurFunction (threadId=3194870184)
    at /home/test/test_function.c:30
#2  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) ni
_dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:157
157		mrs	x4, tpidr_el0
(gdb) bt
#0  _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:157
#1  0x0000ffffbe4fbb44 in OurFunction (threadId=3194870184)
    at /home/test/test_function.c:30
#2  0x0000000000400c08 in initaaa () at thread.c:58
#3  0x0000000000400c50 in thread_proc (param=0x0) at thread.c:71
#4  0x0000ffffbf6918bc in start_thread (arg=0xfffffffff29f) at pthread_create.c:486
#5  0x0000ffffbf5669ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Signed-off-by: liqingqing <liqingqing3@huawei.com>
Signed-off-by: Shuo Wang <wangshuo47@huawei.com>
2021-01-04 15:37:06 +00:00
Paul Eggert
2b778ceb40 Update copyright dates with scripts/update-copyrights
I used these shell commands:

../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted of lines saying "FOO: warning:
copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
2021-01-02 12:17:34 -08:00
Szabolcs Nagy
45b1e17e91 aarch64: use PTR_ARG and SIZE_ARG instead of DELOUSE
DELOUSE was added to asm code to make it compatible with non-LP64
ABIs, but it is an unfortunate name and the code was not compatible
with ABIs where pointer and size_t are different. Glibc currently
only supports the LP64 ABI so these macros are not really needed or
tested, but for now the name is changed to be more meaningful instead
of removing them completely.

Some DELOUSE macros were dropped: clone, strlen and strnlen used it
unnecessarily.

The out of tree ILP32 patches are currently not maintained and will
likely need a rework to rebase them on top of the time64 changes.
2020-12-31 16:50:58 +00:00
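A heavily hedged sketch of the intent behind the renamed macros: on LP64 they expand to nothing, while a hypothetical ILP32 ABI would need to zero-extend 32-bit pointer and size arguments; the real glibc definitions may differ.

/* Sketch of the intent behind PTR_ARG/SIZE_ARG in assembly sources.
   On LP64 (the only supported ABI) they expand to nothing; on a
   hypothetical ILP32 ABI they would zero-extend a 32-bit pointer or
   size argument held in register n.  The real definitions may differ.  */
#ifdef __ILP32__
# define PTR_ARG(n)   mov w##n, w##n   /* zero-extend pointer argument */
# define SIZE_ARG(n)  mov w##n, w##n   /* zero-extend size_t argument */
#else
# define PTR_ARG(n)
# define SIZE_ARG(n)
#endif
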
Szabolcs Nagy
682cdd6e1a aarch64: update ulps.
For new test cases in
commit cad5ad81d2
2020-12-21 16:40:34 +00:00
Richard Earnshaw
d27f0e5d88 aarch64: Add aarch64-specific files for memory tagging support
This final patch provides the architecture-specific implementation of
the memory-tagging support hooks for aarch64.
2020-12-21 15:25:25 +00:00
Szabolcs Nagy
4033f21eb2 aarch64: remove the strlen_asimd symbol
This symbol is not in the implementation-reserved namespace for static
linking and it was never used: it seems it was mistakenly added in the
original strlen_asimd commit 436e4d5b96
2020-12-15 14:42:45 +00:00
Guillaume Gardet
d4136903a2 aarch64: fix static PIE start code for BTI [BZ #27068]
A 'bti c' instruction was missing from rcrt1.o, which made all -static-pie
binaries fail at program startup on BTI-enabled systems.

Fixes bug 27068.
2020-12-15 13:48:45 +00:00
Szabolcs Nagy
cd543b5eb3 aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831]
Re-mmap executable segments if possible instead of using mprotect
to add PROT_BTI. This allows using BTI protection with security
policies that prevent mprotect with PROT_EXEC.

If the fd of the ELF module is not available because it was kernel
mapped, then mprotect is used and failures are ignored.  To protect
the main executable even when mprotect is filtered, the Linux kernel
will have to be changed to add PROT_BTI to it.

The delayed failure reporting is mainly needed because currently
_dl_process_gnu_properties does not propagate failures such that
the required cleanups happen. Using the link_map_machine struct for
error propagation is not ideal, but this seemed to be the least
intrusive solution.

Fixes bug 26831.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2020-12-11 15:46:02 +00:00
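The mmap-versus-mprotect idea can be sketched as below; the function name and structure are assumptions for illustration, not the actual ld.so logic, and PROT_BTI carries the aarch64 Linux uapi value.

/* Sketch: add PROT_BTI to an already mapped executable segment,
   preferring mmap over mprotect so that security policies which block
   mprotect with PROT_EXEC still allow it.  Simplified; not the actual
   ld.so logic.  */
#include <sys/mman.h>

#ifndef PROT_BTI
# define PROT_BTI 0x10   /* aarch64 Linux uapi value */
#endif

/* start/len/off describe the mapped segment; fd is the ELF file, or -1
   if the module was kernel mapped.  prot is the segment's p_flags-based
   protection.  */
int add_prot_bti_sketch(void *start, size_t len, off_t off, int fd, int prot)
{
    prot |= PROT_BTI;
    if (fd != -1)
    {
        /* Re-map the segment in place with the extra protection flag.  */
        if (mmap(start, len, prot, MAP_PRIVATE | MAP_FIXED, fd, off)
            != MAP_FAILED)
            return 0;
    }
    /* Fall back to mprotect; the caller may ignore a failure here.  */
    return mprotect(start, len, prot);
}
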
Szabolcs Nagy
c00452d775 elf: Pass the fd to note processing
To handle GNU property notes on aarch64 some segments need to
be mmaped again, so the fd of the loaded ELF module is needed.

When the fd is not available (kernel loaded modules), then -1
is passed.

The fd is passed to both _dl_process_pt_gnu_property and
_dl_process_pt_note for consistency. Target specific note
processing functions are updated accordingly.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2020-12-11 15:45:37 +00:00
Szabolcs Nagy
8b8f616e6a aarch64: align address for BTI protection [BZ #26988]
Handle unaligned executable load segments (the bfd linker is not
expected to produce such binaries, but other linkers may).

Computing the mapping bounds follows _dl_map_object_from_fd more
closely now.

Fixes bug 26988.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2020-12-11 15:04:39 +00:00
Szabolcs Nagy
72739c79f6 aarch64: Fix missing BTI protection from dependencies [BZ #26926]
The _dl_open_check and _rtld_main_check hooks are not called on the
dependencies of a loaded module, so BTI protection was missed on
every module other than the main executable and directly dlopened
libraries.

The fix just iterates over dependencies to enable BTI.

Fixes bug 26926.
2020-12-11 14:52:13 +00:00
Florian Weimer
1daccf403b nptl: Move stack list variables into _rtld_global
Now __thread_gscope_wait (the function behind THREAD_GSCOPE_WAIT,
formerly __wait_lookup_done) can be implemented directly in ld.so,
eliminating the unprotected GL (dl_wait_lookup_done) function
pointer.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2020-11-16 19:33:30 +01:00
Florian Weimer
5edf3d9fd6 aarch64: Add unwind information to _start (bug 26853)
This adds CFI directives which communicate that the stack ends
with this function.

Fixes bug 26853.
2020-11-09 11:31:04 +01:00
Szabolcs Nagy
e156dabc76 aarch64: Add variant PCS lazy binding test [BZ #26798]
This test fails without bug 26798 fixed because some integer registers
likely get clobbered by lazy binding and variant PCS only allows x16
and x17 to be clobbered at call time.

The test requires binutils 2.32.1 or newer for handling variant PCS
symbols. SVE registers are not covered by this test, to avoid the
complexity of handling multiple compile- and runtime feature support
cases.
2020-11-02 09:39:24 +00:00
Szabolcs Nagy
558251bd87 aarch64: Fix DT_AARCH64_VARIANT_PCS handling [BZ #26798]
The variant PCS support was ineffective because in the common case
linkmap->l_mach.plt == 0 but then the symbol table flags were ignored
and normal lazy binding was used instead of resolving the relocs early.
(This was a misunderstanding about how GOT[1] is set up by the linker.)

In practice this mainly affects SVE calls when the vector length is
more than 128 bits, then the top bits of the argument registers get
clobbered during lazy binding.

Fixes bug 26798.
2020-11-02 09:39:24 +00:00
Wilco Dijkstra
e11ed9d2b4 AArch64: Use __memcpy_simd on Neoverse N2/V1
Add CPU detection of Neoverse N2 and Neoverse V1, and select __memcpy_simd as
the memcpy/memmove ifunc.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2020-10-14 14:27:50 +01:00
Szabolcs Nagy
238032ead6 aarch64: enforce >=64K guard size [BZ #26691]
There are several compiler implementations that allow large stack
allocations to jump over the guard page at the end of the stack and
corrupt memory beyond that. See CVE-2017-1000364.

Compilers can emit code to probe the stack such that the guard page
cannot be skipped, but on aarch64 the probe interval is 64K by default
instead of the minimum supported page size (4K).

This patch enforces at least a 64K guard on aarch64 unless the guard
is disabled by setting its size to 0.  For backward compatibility
reasons the increased guard is not reported, so it is only observable
by exhausting the address space or parsing /proc/self/maps on Linux.

On other targets the patch has no effect. If the stack probe interval
is larger than a page size on a target, then ARCH_MIN_GUARD_SIZE can
be defined to get a large enough stack guard on libc-allocated stacks.

The patch does not affect threads with user allocated stacks.

Fixes bug 26691.
2020-10-02 09:57:44 +01:00
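From an application's point of view the guard is still requested the usual way; a small sketch using the standard pthread attribute API (the 64K rounding is internal and, as the commit notes, not reported back):

/* Sketch: requesting a guard size with the standard pthread API.  On
   aarch64 glibc the effective guard will be at least 64K unless a size
   of 0 is used to disable it; pthread_attr_getguardsize still returns
   the requested value.  */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    return arg;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t th;

    pthread_attr_init(&attr);
    pthread_attr_setguardsize(&attr, 4096);   /* request a 4K guard */

    pthread_create(&th, &attr, worker, NULL);
    pthread_join(th, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
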
Wilco Dijkstra
bd394d131c AArch64: Improve backwards memmove performance
On some microarchitectures performance of the backwards memmove improves if
the stores use STR with decreasing addresses.  So change the memmove loop
in memcpy_advsimd.S to use 2x STR rather than STP.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2020-08-28 17:51:40 +01:00
Szabolcs Nagy
12b2fd0ef9 aarch64: update ulps.
For new j0 test.
2020-08-13 13:02:35 +01:00
Szabolcs Nagy
2dc33b928b aarch64: Use future HWCAP2_MTE in ifunc resolver
Make glibc MTE-safe on systems where MTE is available. This allows
using heap tagging with an LD_PRELOADed malloc implementation that
enables MTE. We don't document this as guaranteed contract yet, so
glibc may not be MTE safe when HWCAP2_MTE is set (older glibcs
certainly aren't). This is mainly for testing and debugging.

The HWCAP flag is not exposed in public headers until Linux adds it
to its uapi. The HWCAP value reservation will be in Linux 5.9.
2020-07-27 12:54:22 +01:00
Szabolcs Nagy
7ebd114211 aarch64: Respect p_flags when protecting code with PROT_BTI
Use PROT_READ and PROT_WRITE according to the load segment p_flags
when adding PROT_BTI.

This is before processing relocations which may drop PROT_BTI in
case of textrels.  Executable stacks are not protected via PROT_BTI
either.  PROT_BTI is hardening in case memory corruption has happened;
its value is reduced if writable and executable memory is available,
so missing it on such memory is fine, but we should respect the
p_flags and should not drop PROT_WRITE.
2020-07-24 08:52:22 +01:00
Wilco Dijkstra
f46ef33ad1 AArch64: Improve strlen_asimd performance (bug 25824)
Optimize strlen using a mix of scalar and SIMD code.  On modern
microarchitectures large strings are 2.6 times faster than the existing
strlen_asimd and 35% faster than the new MTE version of strlen.

On a random strlen benchmark using small sizes the speedup is 7% vs
strlen_asimd and 40% vs the MTE strlen.  This fixes the main strlen
regressions on Cortex-A53 and other cores with a simple Neon unit.

Rename __strlen_generic to __strlen_mte, and select strlen_asimd when
MTE is not enabled (this is waiting on support for a HWCAP_MTE bit).

This fixes big-endian bug 25824. Passes GLIBC regression tests.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2020-07-17 15:07:23 +01:00