Commit Graph

348 Commits

Author SHA1 Message Date
Adhemerval Zanella
83b8d5027d malloc: Remove memusage.h
And use machine-sp.h instead.  The Linux implementation is based on
already provided CURRENT_STACK_FRAME (used on nptl code) and
STACK_GROWS_UPWARD is replaced with _STACK_GROWS_UP.
2021-12-28 14:57:57 -03:00
Adhemerval Zanella
92ff345137 Remove atomic-machine.h atomic typedefs
Now that memusage.c uses generic types we can remove them.
2021-12-28 14:57:57 -03:00
Adhemerval Zanella
8c0664e2b8 elf: Add _dl_audit_pltexit
It consolidates the code required to call la_pltexit audit
callback.

Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
2021-12-28 08:40:38 -03:00
Adhemerval Zanella
691d9ae9e6 Remove ununsed tcb-offset
Some architectures do not use the auto-generated tcb-offsets.h.
2021-12-17 17:47:29 -03:00
Adhemerval Zanella
ecb94e9587 aarch64: Add math-use-builtins-f{max,min}.h
It allows to remove the arch-specific implementations.
2021-12-13 10:08:46 -03:00
Florian Weimer
627f5ede70 Remove TLS_TCB_ALIGN and TLS_INIT_TCB_ALIGN
TLS_INIT_TCB_ALIGN is not actually used.  TLS_TCB_ALIGN was likely
introduced to support a configuration where the thread pointer
has not the same alignment as THREAD_SELF.  Only ia64 seems to use
that, but for the stack/pointer guard, not for storing tcbhead_t.
Some ports use TLS_TCB_OFFSET and TLS_PRE_TCB_SIZE to shift
the thread pointer, potentially landing in a different residue class
modulo the alignment, but the changes should not impact that.

In general, given that TLS variables have their own alignment
requirements, having different alignment for the (unshifted) thread
pointer and struct pthread would potentially result in dynamic
offsets, leading to more complexity.

hppa had different values before: __alignof__ (tcbhead_t), which
seems to be 4, and __alignof__ (struct pthread), which was 8
(old default) and is now 32.  However, it defines THREAD_SELF as:

/* Return the thread descriptor for the current thread.  */
# define THREAD_SELF \
  ({ struct pthread *__self;			\
	__self = __get_cr27();			\
	__self - 1;				\
   })

So the thread pointer points after struct pthread (hence __self - 1),
and they have to have the same alignment on hppa as well.

Similarly, on ia64, the definitions were different.  We have:

# define TLS_PRE_TCB_SIZE \
  (sizeof (struct pthread)						\
   + (PTHREAD_STRUCT_END_PADDING < 2 * sizeof (uintptr_t)		\
      ? ((2 * sizeof (uintptr_t) + __alignof__ (struct pthread) - 1)	\
	 & ~(__alignof__ (struct pthread) - 1))				\
      : 0))
# define THREAD_SELF \
  ((struct pthread *) ((char *) __thread_self - TLS_PRE_TCB_SIZE))

And TLS_PRE_TCB_SIZE is a multiple of the struct pthread alignment
(confirmed by the new _Static_assert in sysdeps/ia64/libc-tls.c).

On m68k, we have a larger gap between tcbhead_t and struct pthread.
But as far as I can tell, the port is fine with that.  The definition
of TCB_OFFSET is sufficient to handle the shifted TCB scenario.

This fixes commit 23c77f6018
("nptl: Increase default TCB alignment to 32").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-12-09 23:47:49 +01:00
Florian Weimer
ce2248ab91 nptl: Introduce <tcb-access.h> for THREAD_* accessors
These are common between most architectures.  Only the x86 targets
are outliers.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-12-09 09:49:32 +01:00
Florian Weimer
23c77f6018 nptl: Increase default TCB alignment to 32
rseq support will use a 32-byte aligned field in struct pthread,
so the whole struct needs to have at least that alignment.

nptl/tst-tls3mod.c uses TCB_ALIGNMENT, therefore include <descr.h>
to obtain the fallback definition.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-12-03 20:43:31 +01:00
Wilco Dijkstra
b31bd11454 AArch64: Improve A64FX memcpy
v2 is a complete rewrite of the A64FX memcpy. Performance is improved
by streamlining the code, aligning all large copies and using a single
unrolled loop for all sizes. The code size for memcpy and memmove goes
down from 1796 bytes to 868 bytes. Performance is better in all cases:
bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33%
faster for large sizes, bench-memcpy-walk is 25% faster for small sizes
and 20% for the largest sizes. The geomean of all tests in bench-memcpy
is 5.1% faster, and total time is reduced by 4%.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-12-02 18:36:03 +00:00
Wilco Dijkstra
b51eb35c57 AArch64: Optimize memcmp
Rewrite memcmp to improve performance. On small and medium inputs performance
is 10-20% better. Large inputs use a SIMD loop processing 64 bytes per
iteration, which is 30-50% faster depending on the size.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-12-02 18:35:53 +00:00
Noah Goldstein
9894127d20 String: Add hidden defs for __memcmpeq() to enable internal usage
No bug.

This commit adds hidden defs for all declarations of __memcmpeq. This
enables usage of __memcmpeq without the PLT for usage internal to
GLIBC.
2021-10-26 16:51:29 -05:00
Noah Goldstein
44829b3ddb String: Add support for __memcmpeq() ABI on all targets
No bug.

This commit adds support for __memcmpeq() as a new ABI for all
targets. In this commit __memcmpeq() is implemented only as an alias
to the corresponding targets memcmp() implementation. __memcmpeq() is
added as a new symbol starting with GLIBC_2.35 and defined in string.h
with comments explaining its behavior. Basic tests that it is callable
and works where added in string/tester.c

As discussed in the proposal "Add new ABI '__memcmpeq()' to libc"
__memcmpeq() is essentially a reserved namespace for bcmp(). The means
is shares the same specifications as memcmp() except the return value
for non-equal byte sequences is any non-zero value. This is less
strict than memcmp()'s return value specification and can be better
optimized when a boolean return is all that is needed.

__memcmpeq() is meant to only be called by compilers if they can prove
that the return value of a memcmp() call is only used for its boolean
value.

All tests in string/tester.c passed. As well build succeeds on
x86_64-linux-gnu target.
2021-10-26 16:51:29 -05:00
Adhemerval Zanella
d6d89608ac elf: Fix dynamic-link.h usage on rtld.c
The 4af6982e4c fix does not fully handle RTLD_BOOTSTRAP usage on
rtld.c due two issues:

  1. RTLD_BOOTSTRAP is also used on dl-machine.h on various
     architectures and it changes the semantics of various machine
     relocation functions.

  2. The elf_get_dynamic_info() change was done sideways, previously
     to 490e6c62aa get-dynamic-info.h was included by the first
     dynamic-link.h include *without* RTLD_BOOTSTRAP being defined.
     It means that the code within elf_get_dynamic_info() that uses
     RTLD_BOOTSTRAP is in fact unused.

To fix 1. this patch now includes dynamic-link.h only once with
RTLD_BOOTSTRAP defined.  The ELF_DYNAMIC_RELOCATE call will now have
the relocation fnctions with the expected semantics for the loader.

And to fix 2. part of 4af6982e4c is reverted (the check argument
elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP
pieces are removed.

To reorganize the includes the static TLS definition is moved to
its own header to avoid a circular dependency (it is defined on
dynamic-link.h and dl-machine.h requires it at same time other
dynamic-link.h definition requires dl-machine.h defitions).

Also ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL
are moved to its own header.  Only ancient ABIs need special values
(arm, i386, and mips), so a generic one is used as default.

The powerpc Elf64_FuncDesc is also moved to its own header, since
csu code required its definition (which would require either include
elf/ folder or add a full path with elf/).

Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32,
and powerpc64le.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-10-14 14:52:07 -03:00
Fangrui Song
490e6c62aa elf: Avoid nested functions in the loader [BZ #27220]
dynamic-link.h is included more than once in some elf/ files (rtld.c,
dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested
functions. This harms readability and the nested functions usage
is the biggest obstacle prevents Clang build (Clang doesn't support GCC
nested functions).

The key idea for unnesting is to add extra parameters (struct link_map
*and struct r_scope_elm *[]) to RESOLVE_MAP,
ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a],
elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired
by Stan Shebs' ppc64/x86-64 implementation in the
google/grte/v5-2.27/master which uses mixed extra parameters and static
variables.)

Future simplification:
* If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM,
  elf_machine_runtime_setup can drop the `scope` parameter.
* If TLSDESC no longer need to be in elf_machine_lazy_rel,
  elf_machine_lazy_rel can drop the `scope` parameter.

Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32,
sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64.
In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu
and riscv64-linux-gnu-rv64imafdc-lp64d.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-10-07 11:55:02 -07:00
Szabolcs Nagy
c75b106145 aarch64: update libm test ulps
Update after

 commit 6bbf729832.
 Fixed inaccuracy of j0f (BZ #28185)
2021-10-05 13:44:27 +01:00
Naohiro Tamura
381b29616a aarch64: Disable A64FX memcpy/memmove BTI unconditionally
This patch disables A64FX memcpy/memmove BTI instruction insertion
unconditionally such as A64FX memset patch [1] for performance.

[1] commit 07b427296b

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-09-24 13:26:59 +01:00
Sergey Bugaev
c484da9087 elf: Remove THREAD_GSCOPE_IN_TCB
All the ports now have THREAD_GSCOPE_IN_TCB set to 1. Remove all
support for !THREAD_GSCOPE_IN_TCB, along with the definition itself.

Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
Message-Id: <20210915171110.226187-4-bugaevc@gmail.com>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
2021-09-16 01:04:20 +02:00
Naohiro Tamura
1d9f99ce1b AArch64: Update A64FX memset not to degrade at 16KB
This patch updates unroll8 code so as not to degrade at the peak
performance 16KB for both FX1000 and FX700.

Inserted 2 instructions at the beginning of the unroll8 loop,
cmp and branch, are a workaround that is found heuristically.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2021-09-06 10:23:24 +01:00
Szabolcs Nagy
f873adf3df Revert "AArch64: Update A64FX memset not to degrade at 16KB"
Because of wrong commit author. Will recommit it with right author.

This reverts commit 23777232c2.
2021-09-06 10:23:25 +01:00
Siddhesh Poyarekar
30891f35fa Remove "Contributed by" lines
We stopped adding "Contributed by" or similar lines in sources in 2012
in favour of git logs and keeping the Contributors section of the
glibc manual up to date.  Removing these lines makes the license
header a bit more consistent across files and also removes the
possibility of error in attribution when license blocks or files are
copied across since the contributed-by lines don't actually reflect
reality in those cases.

Move all "Contributed by" and similar lines (Written by, Test by,
etc.) into a new file CONTRIBUTED-BY to retain record of these
contributions.  These contributors are also mentioned in
manual/contrib.texi, so we just maintain this additional record as a
courtesy to the earlier developers.

The following scripts were used to filter a list of files to edit in
place and to clean up the CONTRIBUTED-BY file respectively.  These
were not added to the glibc sources because they're not expected to be
of any use in future given that this is a one time task:

https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dc
https://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2021-09-03 22:06:44 +05:30
Naohiro Tamura via Libc-alpha
23777232c2 AArch64: Update A64FX memset not to degrade at 16KB
This patch updates unroll8 code so as not to degrade at the peak
performance 16KB for both FX1000 and FX700.

Inserted 2 instructions at the beginning of the unroll8 loop,
cmp and branch, are a workaround that is found heuristically.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2021-09-03 15:59:46 +01:00
Fangrui Song
710ba420fd Remove sysdeps/*/tls-macros.h
They provide TLS_GD/TLS_LD/TLS_IE/TLS_IE macros for TLS testing.  Now
that we have migrated to __thread and tls_model attributes, these macros
are unused and the tls-macros.h files can retire.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-08-18 09:15:20 -07:00
Fangrui Song
43d06ed218 aarch64: Make elf_machine_{load_address,dynamic} robust [BZ #28203]
The AArch64 ABI is largely platform agnostic and does not specify
_GLOBAL_OFFSET_TABLE_[0] ([1]). glibc ld.so turns out to be probably the
only user of _GLOBAL_OFFSET_TABLE_[0] and GNU ld defines the value
to the link-time address _DYNAMIC. [2]

In 2012, __ehdr_start was implemented in GNU ld and gold in binutils
2.23.  Using adrp+add / (-mcmodel=tiny) adr to access
__ehdr_start/_DYNAMIC gives us a robust way to get the load address and
the link-time address of _DYNAMIC.

[1]: From a psABI maintainer, https://bugs.llvm.org/show_bug.cgi?id=49672#c2
[2]: LLD's aarch64 port does not set _GLOBAL_OFFSET_TABLE_[0] to the
link-time address _DYNAMIC.
LLD is widely used on aarch64 Android and ChromeOS devices.  Software
just works without the need for _GLOBAL_OFFSET_TABLE_[0].

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-08-11 09:00:38 -07:00
Wilco Dijkstra
a5db6a5cae [5/5] AArch64: Improve A64FX memset medium loops
Simplify the code for memsets smaller than L1. Improve the unroll8 and
L1_prefetch loops.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:46:20 +01:00
Wilco Dijkstra
e69d9981f8 [4/5] AArch64: Improve A64FX memset by removing unroll32
Remove unroll32 code since it doesn't improve performance.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:44:27 +01:00
Wilco Dijkstra
186092c6ba [3/5] AArch64: Improve A64FX memset for remaining bytes
Simplify handling of remaining bytes. Avoid lots of taken branches and complex
whilelo computations, instead unconditionally write vectors from the end.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:42:07 +01:00
Wilco Dijkstra
9bc2ed8f46 [2/5] AArch64: Improve A64FX memset for large sizes
Improve performance of large memsets. Simplify alignment code. For zero memset
use DC ZVA, which almost doubles performance. For non-zero memsets use the
unroll8 loop which is about 10% faster.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:39:37 +01:00
Wilco Dijkstra
07b427296b [1/5] AArch64: Improve A64FX memset for small sizes
Improve performance of small memsets by reducing instruction counts and
improving code alignment. Bench-memset shows 35-45% performance gain for
small sizes.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:30:27 +01:00
Siddhesh Poyarekar
b5bd5bfe88 glibc.malloc.check: Wean away from malloc hooks
The malloc-check debugging feature is tightly integrated into glibc
malloc, so thanks to an idea from Florian Weimer, much of the malloc
implementation has been moved into libc_malloc_debug.so to support
malloc-check.  Due to this, glibc malloc and malloc-check can no
longer work together; they use altogether different (but identical)
structures for heap management.  This should not make a difference
though since the malloc check hook is not disabled anywhere.
malloc_set_state does, but it does so early enough that it shouldn't
cause any problems.

The malloc check tunable is now in the debug DSO and has no effect
when the DSO is not preloaded.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
2021-07-22 18:38:08 +05:30
Wilco Dijkstra
6a34c928c2 AArch64: Add hp-timing.h
Add hp-timing.h using the cntvct_el0 counter. Return timing in nanoseconds
so it is fully compatible with generic hp-timing. Don't set HP_TIMING_INLINE
in the dynamic linker since it adds unnecessary overheads and some ancient
kernels may not handle emulating cntcvt correctly. Currently cntvct_el0 is
only used for timing in the benchtests.

Reviewed-by: Szabolcs Nagy  <szabolcs.nagy@arm.com>
2021-07-01 15:42:05 +01:00
Wilco Dijkstra
252cad02d4 AArch64: Improve strnlen performance
Optimize strnlen by avoiding UMINV which is slow on most cores. On Neoverse N1
large strings are 1.8x faster than the current version, and bench-strnlen is
50% faster overall. This version is MTE compatible.

Reviewed-by: Szabolcs Nagy  <szabolcs.nagy@arm.com>
2021-07-01 15:32:36 +01:00
H.J. Lu
3213ed770c Update math: redirect roundeven function
Redirect target specific roundeven functions for aarch64, ldbl-128ibm
and riscv.
2021-06-27 07:56:57 -07:00
Wilco Dijkstra
6a86bc0992 AArch64: Add support for roundeven[f]
Add inline assembler for the roundeven functions.
Passes GLIBC regression.  Note GCC does not inline the builtin (PR100966),
so this cannot be used for now.
2021-06-08 13:33:09 +01:00
Naohiro Tamura
4f26956d5b aarch64: Added optimized memset for A64FX
This patch optimizes the performance of memset for A64FX [1] which
implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache
per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill and prefetch.

SVE assembler code for memset is implemented as Vector Length Agnostic
code so theoretically it can be run on any SOC which supports ARMv8-A
SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-05-27 09:47:53 +01:00
Naohiro Tamura
fa527f345c aarch64: Added optimized memcpy and memmove for A64FX
This patch optimizes the performance of memcpy/memmove for A64FX [1]
which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
cache per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and software pipelining.

SVE assembler code for memcpy/memmove is implemented as Vector Length
Agnostic code so theoretically it can be run on any SOC which supports
ARMv8-A SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-05-27 09:47:53 +01:00
Naohiro Tamura
bd4317fbd6 aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
This patch defines BTI_C and BTI_J macros conditionally for
performance.
If HAVE_AARCH64_BTI is true, BTI_C and BTI_J are defined as HINT
instruction for ARMv8.5 BTI (Branch Target Identification).
If HAVE_AARCH64_BTI is false, both BTI_C and BTI_J are defined as
NOP.
2021-05-26 12:01:06 +01:00
Naohiro Tamura
77d175e14e config: Added HAVE_AARCH64_SVE_ASM for aarch64
This patch checks if assembler supports '-march=armv8.2-a+sve' to
generate SVE code or not, and then define HAVE_AARCH64_SVE_ASM macro.
2021-05-26 12:01:06 +01:00
Szabolcs Nagy
2208066603 elf: Remove lazy tlsdesc relocation related code
Remove generic tlsdesc code related to lazy tlsdesc processing since
lazy tlsdesc relocation is no longer supported.  This includes removing
GL(dl_load_lock) from _dl_make_tlsdesc_dynamic which is only called at
load time when that lock is already held.

Added a documentation comment too.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-04-21 14:35:53 +01:00
Szabolcs Nagy
e06e6554c3 aarch64: update libm test ulps
Update after commit 43576de04a.
2021-04-08 08:24:30 +01:00
Szabolcs Nagy
69499bb6ee aarch64: free tlsdesc data on dlclose [BZ #27403]
DL_UNMAP_IS_SPECIAL and DL_UNMAP were not defined. The definitions are
now copied from arm, since the same is needed on aarch64. The cleanup
of tlsdesc data is handled by the custom _dl_unmap.

Fixes bug 27403.
2021-04-06 14:35:05 +01:00
Paul Zimmermann
9acda61d94 Fix the inaccuracy of j0f/j1f/y0f/y1f [BZ #14469, #14470, #14471, #14472]
For j0f/j1f/y0f/y1f, the largest error for all binary32
inputs is reduced to at most 9 ulps for all rounding modes.

The new code is enabled only when there is a cancellation at the very end of
the j0f/j1f/y0f/y1f computation, or for very large inputs, thus should not
give any visible slowdown on average.  Two different algorithms are used:

* around the first 64 zeros of j0/j1/y0/y1, approximation polynomials of
  degree 3 are used, computed using the Sollya tool (https://www.sollya.org/)

* for large inputs, an asymptotic formula from [1] is used

[1] Fast and Accurate Bessel Function Computation,
    John Harrison, Proceedings of Arith 19, 2009.

Inputs yielding the new largest errors are added to auto-libm-test-in,
and ulps are regenerated for various targets (thanks Adhemerval Zanella).

Tested on x86_64 with --disable-multi-arch and on powerpc64le-linux-gnu.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2021-04-02 06:15:48 +02:00
Szabolcs Nagy
1dc17ea8f8 aarch64: Optimize __libc_mtag_tag_zero_region
This is a target hook for memory tagging, the original was a naive
implementation. Uses the same algorithm as __libc_mtag_tag_region,
but with instructions that also zero the memory.  This was not
benchmarked on real cpu, but expected to be faster than the naive
implementation.
2021-03-26 11:03:06 +00:00
Szabolcs Nagy
23fd760add aarch64: Optimize __libc_mtag_tag_region
This is a target hook for memory tagging, the original was a naive
implementation. The optimized version relies on "dc gva" to tag 64
bytes at a time for large allocations and optimizes small cases without
adding too many branches. This was not benchmarked on real cpu, but
expected to be faster than the naive implementation.
2021-03-26 11:03:06 +00:00
Szabolcs Nagy
383bc24028 aarch64: inline __libc_mtag_new_tag
This is a common operation when heap tagging is enabled, so inline the
instructions instead of using an extern call.
2021-03-26 11:03:06 +00:00
Szabolcs Nagy
40dc773f92 aarch64: inline __libc_mtag_address_get_tag
This is a common operation when heap tagging is enabled, so inline the
instruction instead of using an extern call.

The .inst directive is used instead of the name of the instruction (or
acle intrinsics) because malloc.c is not compiled for armv8.5-a+memtag
architecture, runtime cpu support detection is used.

Prototypes are removed from the comments as they were not always
correct.
2021-03-26 11:03:06 +00:00
Szabolcs Nagy
c076a0bc69 malloc: Only support zeroing and not arbitrary memset with mtag
The memset api is suboptimal and does not provide much benefit. Memory
tagging only needs a zeroing memset (and only for memory that's sized
and aligned to multiples of the tag granule), so change the internal
api and the target hooks accordingly.  This is to simplify the
implementation of the target hook.

Reviewed-by: DJ Delorie <dj@redhat.com>
2021-03-26 11:03:06 +00:00
Wilco Dijkstra
db3f7bb558 math: Remove slow paths from asin and acos [BZ #15267]
This patch series removes all remaining slow paths and related code.
First asin/acos, tan, atan, atan2 implementations are updated, and the final
patch removes the unused mpa files, headers and probes. Passes buildmanyglibc.

Remove slow paths from asin/acos. Add ULP annotations based on previous slow
path checks (which are approximate). Update AArch64 and x86_64 libm-test-ulps.

Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
2021-03-11 14:26:36 +00:00
Szabolcs Nagy
9fb07fd4e1 aarch64: update ulps.
For new test cases in
commit 5a051454a9
2021-03-01 12:29:42 +00:00
Florian Weimer
035c012e32 Reduce the statically linked startup code [BZ #23323]
It turns out the startup code in csu/elf-init.c has a perfect pair of
ROP gadgets (see Marco-Gisbert and Ripoll-Ripoll, "return-to-csu: A
New Method to Bypass 64-bit Linux ASLR").  These functions are not
needed in dynamically-linked binaries because DT_INIT/DT_INIT_ARRAY
are already processed by the dynamic linker.  However, the dynamic
linker skipped the main program for some reason.  For maximum
backwards compatibility, this is not changed, and instead, the main
map is consulted from __libc_start_main if the init function argument
is a NULL pointer.

For statically linked binaries, the old approach based on linker
symbols is still used because there is nothing else available.

A new symbol version __libc_start_main@@GLIBC_2.34 is introduced because
new binaries running on an old libc would not run their ELF
constructors, leading to difficult-to-debug issues.
2021-02-25 12:13:02 +01:00
Szabolcs Nagy
04c6a8073d aarch64: Fix the list of tested IFUNC variants [BZ #26818]
Some IFUNC variants are not compatible with BTI and MTE so don't
set them as usable for testing and benchmarking on a BTI or MTE
enabled system.

As far as IFUNC selectors are concerned a system is BTI enabled if
the cpu supports it and glibc was built with BTI branch protection.

Most IFUNC variants are BTI compatible, but thunderx2 memcpy and
memmove use a jump table with indirect jump, without a BTI j.

Fixes bug 26818.
2021-01-25 16:15:54 +00:00