The include cleanup on dl-minimal.c removed too much for some
targets.
Also for Hurd, __sbrk is removed from localplt.data now that
tunables allocate memory through mmap.
Checked with a build for all affected architectures.
The rtld_malloc functions are moved to their own file so they can be
used in csu code. Also, the functions are renamed to __minimal_*
(since they are now used not only in loader code).
Using __minimal_malloc in tunables_strdup() avoids potential
issues with sbrk() calls while processing the tunables (I see
sporadic elf/tst-dso-ordering9 failures on powerpc64le, with different
tests failing due to ASLR).
Also, using __minimal_malloc over plain mmap optimizes the memory
allocation in both the static and dynamic cases (since it will use any
unused space either in the last page of the data segments, avoiding an
mmap() call, or from the previous mmap() call).
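The reuse of leftover space can be illustrated with a minimal bump
allocator. This is only a sketch under stated assumptions, not the glibc
implementation: the names sketch_minimal_malloc, alloc_ptr and alloc_end
are hypothetical, the alignment is fixed at 16 bytes, and the reuse of the
last page of the data segments is left out.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical state; not the names used in glibc.  */
static char *alloc_ptr;   /* Next free byte in the current region.  */
static char *alloc_end;   /* One past the end of the current region.  */

/* Round P up to a multiple of ALIGN (a power of two).  */
static uintptr_t
round_up (uintptr_t p, size_t align)
{
  return (p + align - 1) & ~(uintptr_t) (align - 1);
}

/* Bump allocator: serve the request from the tail of the previous
   mapping when it fits, and call mmap only when it does not.  */
static void *
sketch_minimal_malloc (size_t n)
{
  const size_t align = 16;
  assert (n > 0);

  uintptr_t start = round_up ((uintptr_t) alloc_ptr, align);
  if (alloc_ptr == NULL
      || start > (uintptr_t) alloc_end
      || n > (uintptr_t) alloc_end - start)
    {
      size_t page = (size_t) sysconf (_SC_PAGESIZE);
      if (n > SIZE_MAX - page)
        return NULL;
      size_t len = (size_t) round_up (n, page);
      void *p = mmap (NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED)
        return NULL;
      alloc_end = (char *) p + len;
      start = (uintptr_t) p;
    }
  alloc_ptr = (char *) start + n;
  return (void *) start;
}
```

Two small requests in a row land in the same mapping, so the second one
needs no further mmap() call.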
Checked on x86_64-linux-gnu, i686-linux-gnu, and powerpc64le-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
A recent GCC mainline change introduces errors of the form:
vfprintf-internal.c: In function 'group_number':
vfprintf-internal.c:2093:15: error: 'memmove' specified bound between 9223372036854775808 and 18446744073709551615 exceeds maximum object size 9223372036854775807 [-Werror=stringop-overflow=]
2093 | memmove (w, s, (front_ptr -s) * sizeof (CHAR_T));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a genuine bug in the glibc code: s > front_ptr is always true
at this point in the code, and the intent is clearly for the
subtraction to be the other way round. The other arguments to the
memmove call here also appear to be wrong; w and s point just *after*
the destination and source for copying the rest of the number, so the
size needs to be subtracted to get appropriate pointers for the
copying. Adjust the memmove call to conform to the apparent intent of
the code, so fixing the -Wstringop-overflow error.
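The effect of the swapped subtraction can be shown in isolation. The
following is a sketch, not the glibc code: copy_rest and swapped_size are
hypothetical helpers, and plain char stands in for CHAR_T. It assumes, as
described above, that w and s point just after the destination and source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the remaining digits [front_ptr, s) so that they end just
   before W, and return the new start of the output.  */
static char *
copy_rest (char *w, const char *s, const char *front_ptr)
{
  assert (s >= front_ptr);
  size_t n = (size_t) (s - front_ptr);   /* Not front_ptr - s.  */
  w -= n;
  memmove (w, front_ptr, n * sizeof (char));
  return w;
}

/* The size that the swapped subtraction would request: a negative
   difference converted to size_t becomes a huge value.  */
static size_t
swapped_size (const char *s, const char *front_ptr)
{
  return (size_t) (front_ptr - s) * sizeof (char);
}
```

With s three elements past front_ptr, swapped_size yields a value above
SIZE_MAX / 2, which is the kind of bound the GCC diagnostic reports.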
Now, if the original code were ever executed, a buffer overrun would
result. However, I believe this code (introduced in commit
edc1686af0, "vfprintf: Reuse work_buffer
in group_number", so in glibc 2.26) is unreachable in prior glibc
releases (so there is no need for a bug in Bugzilla, no need to
consider any backports unless someone wants to build older glibc
releases with GCC 12, and no possibility of this buffer overrun
resulting in a security issue).
work_buffer is 1000 bytes / 250 wide characters. This case is only
reachable if an initial part of the number, plus a grouped copy of the
rest of the number, fail to fit in that space; that is, if the grouped
number fails to fit in the space. In the wide character case,
grouping is always one wide character, so even with a locale (of which
there aren't any in glibc) grouping every digit, a number would need
to occupy at least 125 wide characters to overflow, and a 64-bit
integer occupies at most 23 characters in octal including a leading 0.
In the narrow character case, the multibyte encoding of the grouping
separator would need to be at least 42 bytes to overflow, again
supposing grouping every digit, but MB_LEN_MAX is 16. So even if we
admit the case of artificially constructed locales not shipped with
glibc, given that such a locale would need to use one of the character
sets supported by glibc, this code cannot be reached at present. (And
POSIX only actually specifies the ' flag for grouping for decimal
output, though glibc acts on it for other bases as well.)
With binary output (if you consider use of grouping there to be
valid), you'd need a 15-byte multibyte character for overflow; I don't
know if any supported character set has such a character (if, again,
we admit constructed locales using grouping every digit and a grouping
separator chosen to have a multibyte encoding as long as possible, as
well as accepting use of grouping with binary), but given that we have
this code at all (clearly it's not *correct*, or in accordance with
the principle of avoiding arbitrary limits, to skip grouping on
running out of internal space like that), I don't think it should need
any further changes for binary printf support to go in.
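The binary-output figure can be checked with a small helper. This is a
sketch under two assumptions: grouping every digit means one separator
between each pair of adjacent digits, and overflow means the grouped
number alone exceeds the 1000-byte work_buffer. grouped_length is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for DIGITS digits with a SEP_LEN-byte separator
   between every pair of adjacent digits.  */
static size_t
grouped_length (size_t digits, size_t sep_len)
{
  assert (digits > 0);
  return digits + (digits - 1) * sep_len;
}
```

For a 64-digit binary number, a 14-byte separator gives 946 bytes and a
15-byte separator gives 1009 bytes, so 15 is the first length that exceeds
1000 under these assumptions.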
On the other hand, support for large sizes of _BitInt in printf (see
the N2858 proposal) *would* require something to be done about such
arbitrary limits (presumably using dynamic allocation in printf again,
for sufficiently large _BitInt arguments only - currently only
floating-point uses dynamic allocation, and, as previously discussed,
that could actually be replaced by bounded allocation given smarter
code).
Tested with build-many-glibcs.py for aarch64-linux-gnu (GCC mainline).
Also tested natively for x86_64.
The collate-test.c triggers UB with a signed integer overflow,
which results in an error on some architectures (powerpc32).
Checked on x86_64, i686, and powerpc.
commit d585ba47fc
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon Nov 1 00:49:48 2021 -0500
string: Make tests birdirectional test-memcpy.c
This commit updates the memcpy tests to test both dst > src and dst <
src. This is because there is logic in the code based on the
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
significantly increased the number of tests. On Intel Core i7-1165G7,
test-memcpy takes 120 seconds to run when the machine is idle. Double
TIMEOUT to (8 * 60) for test-memcpy to avoid a timeout when the machine
is under heavy load.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
The close_range () function implements the same API as the Linux and
FreeBSD syscalls. It operates atomically and reliably. The specified
upper bound is clamped to the actual size of the file descriptor table;
it is expected that the most common use case is with last = UINT_MAX.
Like in the Linux syscall, it is also possible to pass the
CLOSE_RANGE_CLOEXEC flag to mark the file descriptors in the range
cloexec instead of actually closing them.
Also, add a Hurd version of the closefrom () function. Since, unlike on
Linux, close_range () cannot fail due to being unsupported by the
running kernel, a fallback implementation is never necessary.
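The described semantics (clamping of the upper bound, and marking instead
of closing) can be modelled over a toy descriptor table. This is a sketch
only: struct model_fd, model_close_range and MODEL_CLOEXEC_FLAG are
hypothetical stand-ins and do not call the real close_range ().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for CLOSE_RANGE_CLOEXEC.  */
#define MODEL_CLOEXEC_FLAG 1u

struct model_fd
{
  bool open;
  bool cloexec;
};

/* Apply the operation to descriptors FIRST..LAST of a toy table.
   Returns 0 on success and -1 for an invalid range.  */
static int
model_close_range (struct model_fd *table, unsigned int table_size,
                   unsigned int first, unsigned int last,
                   unsigned int flags)
{
  assert (table != NULL);
  if (first > last)
    return -1;
  if (table_size == 0 || first >= table_size)
    return 0;
  /* Clamp the upper bound to the actual size of the table.  */
  if (last >= table_size)
    last = table_size - 1;
  for (unsigned int fd = first; fd <= last; ++fd)
    {
      if (!table[fd].open)
        continue;
      if (flags & MODEL_CLOEXEC_FLAG)
        table[fd].cloexec = true;   /* Mark instead of closing.  */
      else
        table[fd].open = false;
    }
  return 0;
}
```

Passing UINT_MAX as the upper bound, the expected common case, touches
every descriptor from first to the end of the table.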
Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
Message-Id: <20211106153524.82700-1-bugaevc@gmail.com>
No bug.
The optimizations are as follows:
1) Always align entry to 64 bytes. This makes behavior more
predictable and makes other frontend optimizations easier.
2) Make the L(more_8x_vec) cases 4k aliasing aware. This can have
significant benefits in the case that:
0 < (dst - src) < [256, 512]
3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%]
improvement and for FSRM [-10%, 25%].
In addition to these primary changes there is general cleanup
throughout to optimize the aligning routines and control flow logic.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This commit adds a new partial overlap benchmark. This is generally
the most interesting performance case for memmove and was missing.
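A partial overlap here means that the source and destination ranges share
some, but not all, of their bytes. The sketch below sets up such a case
and checks the result of memmove; it is an illustration only, with
hypothetical helper names, and it checks correctness rather than timing
the call as a benchmark would.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Position-dependent byte pattern.  */
static unsigned char
pattern (size_t i)
{
  return (unsigned char) (i * 7 + 1);
}

/* Move N bytes from BUF to BUF + SHIFT, where 0 < SHIFT < N gives a
   partial overlap with dst > src.  BUF must hold N + SHIFT bytes.
   Returns 1 if the moved bytes are intact and 0 otherwise.  */
static int
check_partial_overlap (unsigned char *buf, size_t n, size_t shift)
{
  assert (shift > 0 && shift < n);
  for (size_t i = 0; i < n + shift; ++i)
    buf[i] = pattern (i);
  memmove (buf + shift, buf, n);
  for (size_t i = 0; i < n; ++i)
    if (buf[shift + i] != pattern (i))
      return 0;
  return 1;
}
```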
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This commit adds more cases to the common memcpy/memmove
benchmarks. The most significant cases are the half page offsets. The
current versions leave dst and src near page aligned, which leads to
false 4k aliasing on x86_64. This can add noise due to false
dependencies from one run to the next. As well, this seems like more
of an edge case than a common case so it shouldn't be the only thing
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This commit updates the memcpy tests to test both dst > src and dst <
src. This is because there is logic in the code based on the
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
generate-md5 was removed by
commit d73f5331ce
Author: Roland McGrath <roland@gnu.org>
Date: Fri May 2 02:20:45 2003 +0000
2003-05-01 Roland McGrath <roland@redhat.com>
Remove its last trace. This fixes BZ #28554.
Starting from GCC 12, the .init_array and .fini_array sections are enabled
unconditionally by
commit 13a39886940331149173b25d6ebde0850668d8b9
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Jun 8 16:09:24 2021 -0700
Always enable DT_INIT_ARRAY/DT_FINI_ARRAY on Linux
Configure GCC with --enable-initfini-array to enable them when using GCC
release branches.
Fixes BZ #27945.
Separated debuginfo files have PT_DYNAMIC with p_filesz == 0. We
need to check for that before the _dl_map_segments call because
that could attempt to write to mappings that extend beyond the end
of the file, resulting in SIGBUS.
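The condition being checked can be sketched as a scan over the program
headers. This is an illustration, not the loader code: has_empty_dynamic
is a hypothetical helper, and Elf64_Phdr is used directly for brevity
where the loader would use a word-size-neutral type.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if any PT_DYNAMIC segment has no bytes in the file,
   as is the case for separated debuginfo files.  */
static bool
has_empty_dynamic (const Elf64_Phdr *phdr, size_t phnum)
{
  assert (phdr != NULL || phnum == 0);
  for (size_t i = 0; i < phnum; ++i)
    if (phdr[i].p_type == PT_DYNAMIC && phdr[i].p_filesz == 0)
      return true;
  return false;
}
```

A caller would reject the object when this returns true, before any
segment is mapped.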
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Bugfix 27256 has introduced another issue:
When converting from the ISO-2022-JP-3 encoding, it is possible
to force iconv to emit an extra NUL character on internal state reset.
To do this, it is sufficient to feed iconv an escape sequence
that switches the active character set.
The simplified check 'data->__statep->__count != ASCII_set'
introduced by the aforementioned bugfix picks up that case and
behaves as if a '\0' character has been queued, thus emitting it.
To eliminate this issue, these steps are taken:
* Restore the original condition
'(data->__statep->__count & ~7) != ASCII_set'.
It is necessary since bits 0-2 may contain
the number of buffered input characters.
* Check that the queued character is not NUL.
A similar step is taken for the main conversion loop.
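The two steps can be sketched on a plain integer. The bit layout here is
partly an assumption made for illustration: bits 0-2 holding the number of
buffered input characters comes from the description above, while the
position of the active set (bits 3-5) and of the queued character (bits 6
and up), and all SKETCH_* names, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum
{
  SKETCH_ASCII_SET = 0,        /* Hypothetical value.  */
  SKETCH_OTHER_SET = 1 << 3    /* Hypothetical non-ASCII set.  */
};

/* Simplified check: any set low bit makes the state look as if it
   differs from the initial ASCII state.  */
static bool
simplified_check (int count)
{
  return count != SKETCH_ASCII_SET;
}

/* Restored check: ignore bits 0-2 before comparing.  */
static bool
restored_check (int count)
{
  return (count & ~7) != SKETCH_ASCII_SET;
}

/* Emit the queued character only when the state is not the initial
   one and the queued character is not NUL.  */
static bool
should_emit_queued (int count)
{
  assert (count >= 0);
  return restored_check (count) && (count >> 6) != 0;
}
```

A state that has only switched the character set passes the restored
check but has nothing queued, so nothing is emitted on reset.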
The bundled test case follows this logic:
* Try to convert ISO-2022-JP-3 escape sequence
switching active character set
* Reset internal state by providing NULL as input buffer
* Ensure that nothing has been converted.
Signed-off-by: Nikita Popov <npv1310@gmail.com>
There are a few places where only known numeric values are acceptable for
`asm` parameters, yet the constraint "i" is used. "i" can include
"symbolic constants whose values will be known only at assembly time or
later."
Use "n" instead of "i" where known numeric values are required.
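A minimal, target-neutral illustration of the "n" constraint follows. It
is a sketch: SKETCH_SHIFT and pass_immediate are hypothetical, and the asm
template is deliberately empty so that it assembles on any target; the
only point is that the operand must be an integer whose value is known at
compile time.

```c
#include <assert.h>

#define SKETCH_SHIFT 4   /* Hypothetical numeric constant.  */

static inline int
pass_immediate (int x)
{
  /* "n" accepts SKETCH_SHIFT because its numeric value is known when
     compiling.  The empty template leaves X unchanged.  */
  __asm__ ("" : "+r" (x) : "n" (SKETCH_SHIFT));
  return x;
}
```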
Suggested-by: Segher Boessenkool <segher@kernel.crashing.org>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
It allows building both glibc and the tests with lld (since lld does not
support R_RISCV_ALIGN linker relaxation).
Checked with a build for riscv32-linux-gnu-rv32imafdc-ilp32d and
riscv64-linux-gnu-rv64imafdc-lp64d.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Clang cannot assemble movzx in the AT&T dialect mode.
../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction
movzx (%rsi), %ecx
^~~~
Change movzx to movzbl, which follows the AT&T dialect and is used
elsewhere in the file.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This refactor moves four functions out of a nested scope and converts
them into static always_inline functions. collseqwc, table_size,
symb_table, extra are now initialized to zero because they are passed as
function arguments.
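The shape of such a conversion can be sketched as follows. The names
lookup and sum_first are hypothetical and unrelated to the regcomp code;
the nested form is shown only in a comment because it relies on a GNU C
extension.

```c
#include <assert.h>
#include <stddef.h>

/* Before: a nested function that captures TABLE and TABLE_SIZE from
   the enclosing scope.

     int sum_first (const int *table, size_t table_size, size_t n)
     {
       auto inline int lookup (size_t i)
       { return table[i % table_size]; }
       ...
     }

   After: a file-scope function that receives the former captures as
   explicit arguments.  */
static inline __attribute__ ((always_inline)) int
lookup (const int *table, size_t table_size, size_t i)
{
  return table[i % table_size];
}

static int
sum_first (const int *table, size_t table_size, size_t n)
{
  assert (table != NULL && table_size > 0);
  int sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += lookup (table, table_size, i);
  return sum;
}
```

Because the former captures are now arguments, the caller must have them
initialized before every call, which matches the zero initialization
mentioned above.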
On x86-64, .text is 16 bytes larger, likely due to the 4 stores.
This is nothing compared to the amount of work that regcomp has to do
looking up the collation weights, or other functions.
If the non-buildable `sysdeps/generic/dl-machine.h` doesn't count,
this patch removes the last `auto inline` usage from glibc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The patch removes the ELF_DURING_STARTUP optimization and assumes
that .rel.dyn and .rel.plt might not be adjacent. This allows some
code simplification since relocation will be handled independently
where it is done on bootstrap.
At least on x86_64, I cannot measure any performance implications.
Running the following command 10000 times
LD_DEBUG=statistics ./elf/ld.so ./libc.so
And filtering the "total startup time in dynamic loader" result,
the geometric mean is:
                 patched   master
Ryzen 7 5900x      24140    24952
i7-4510U           45957    45982
(The results do show some variation; I did not do any statistical
analysis.)
It also allows building arm with lld, since it inserts ".ARM.exidx"
between ".rel.dyn" and ".rel.plt" for the loader.
Checked on x86_64-linux-gnu and arm-linux-gnueabihf.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Based on random input arguments. About 85% of the tuples have exponents
of the two arguments close together (+-1 range).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Instead of inputs based on the algorithm implementation details.
About 85% of the tuples have exponents of the two arguments close
together (+-1 range).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The lld linker does not support TLSDESC for arm. The have-arm-tls-desc
is a leftover of 56583289b1 to support NaCL.
Reviewed-by: Fangrui Song <maskray@google.com>
lld does not support R_ARM_GOTOFF32 to a preemptible symbol (_dl_argv
has default visibility). Use the internal alias instead (one option
would be to use HIDDEN_JUMPTARGET, but the macro is not defined for
!__ASSEMBLER__, and I made this patch arm-specific to avoid having to
check extensively whether it might break something on other
architectures).
Checked on arm-linux-gnueabihf.
Reviewed-by: Fangrui Song <maskray@google.com>
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
In strcmp-evex.S, to compare 2 32-byte strings, replace
VMOVU (%rdi, %rdx), %YMM0
VMOVU (%rsi, %rdx), %YMM1
/* Each bit in K0 represents a mismatch in YMM0 and YMM1. */
VPCMP $4, %YMM0, %YMM1, %k0
VPCMP $0, %YMMZERO, %YMM0, %k1
VPCMP $0, %YMMZERO, %YMM1, %k2
/* Each bit in K1 represents a NULL in YMM0 or YMM1. */
kord %k1, %k2, %k1
/* Each bit in K1 represents a NULL or a mismatch. */
kord %k0, %k1, %k1
kmovd %k1, %ecx
testl %ecx, %ecx
jne L(last_vector)
with
VMOVU (%rdi, %rdx), %YMM0
VPTESTM %YMM0, %YMM0, %k2
/* Each bit cleared in K1 represents a mismatch or a null CHAR
in YMM0 and 32 bytes at (%rsi, %rdx). */
VPCMP $0, (%rsi, %rdx), %YMM0, %k1{%k2}
kmovd %k1, %ecx
incl %ecx
jne L(last_vector)
It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake
and Ice Lake.
Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
Add acosf function to bench-math and copy acosf-inputs to benchtests.
The motivation for this patch is to prepare for upcoming new libmvec
functions. The float and double versions of libmvec functions stay
together.
The acosf-inputs file is generated from the acos-inputs file using the
following scaling formula:
f = d * (FLT_MAX/DBL_MAX)
where d is the input (double) and f is the output (float). If a scaled
float value is a duplicate in the new input file, the nextafterf()
function is used to find the next float value, ensuring no duplicates.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Improve the random memcpy benchmark. Double the number of tests and increase
the size of the memory region to test between 32KB and 1024KB. This improves
accuracy on modern cores. Clean up formatting of the frequency array.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
My build-many-glibcs.py bot shows -Waggressive-loop-optimizations
errors building the glibc testsuite for 32-bit architectures with GCC
mainline, which seem to have appeared between GCC commits
4abc0c196b10251dc80d0743ba9e8ab3e56c61ed and
d8edfadfc7a9795b65177a50ce44fd348858e844:
In function 'dynarray_long_noscratch_resize',
inlined from 'test_long_overflow' at tst-dynarray.c:489:5,
inlined from 'do_test' at tst-dynarray.c:571:3:
../malloc/dynarray-skeleton.c:391:36: error: iteration 1073741823 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
391 | DYNARRAY_ELEMENT_INIT (&list->u.dynarray_header.array[i]);
tst-dynarray.c:39:37: note: in definition of macro 'DYNARRAY_ELEMENT_INIT'
39 | #define DYNARRAY_ELEMENT_INIT(e) (*(e) = 23)
| ^
In file included from tst-dynarray.c:42:
../malloc/dynarray-skeleton.c:389:37: note: within this loop
389 | for (size_t i = old_size; i < size; ++i)
| ~~^~~~~~
In function 'dynarray_long_resize',
inlined from 'test_long_overflow' at tst-dynarray.c:479:5,
inlined from 'do_test' at tst-dynarray.c:571:3:
../malloc/dynarray-skeleton.c:391:36: error: iteration 1073741823 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
391 | DYNARRAY_ELEMENT_INIT (&list->u.dynarray_header.array[i]);
tst-dynarray.c:27:37: note: in definition of macro 'DYNARRAY_ELEMENT_INIT'
27 | #define DYNARRAY_ELEMENT_INIT(e) (*(e) = 17)
| ^
In file included from tst-dynarray.c:28:
../malloc/dynarray-skeleton.c:389:37: note: within this loop
389 | for (size_t i = old_size; i < size; ++i)
| ~~^~~~~~
I don't know what GCC change made these errors appear, or why they
only appear for 32-bit architectures. However, the warnings appear to
be both true (that iteration would indeed involve undefined behavior
if executed) and useless in this particular case (that iteration is
never executed, because the allocation size overflows and so the
allocation fails - but the check for allocation size overflow is in a
separate source file and so can't be seen by the compiler when
compiling this test). So use the DIAG_* macros to disable
-Waggressive-loop-optimizations around the calls in question to
dynarray_long_resize and dynarray_long_noscratch_resize in this test.
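The general technique of disabling one warning around a call can be
sketched with the underlying GCC pragmas. This is an illustration under
the assumption that the DIAG_* macros wrap pragmas of this kind; fill and
call_with_warning_disabled are hypothetical and unrelated to
tst-dynarray.c.

```c
#include <assert.h>
#include <stddef.h>

static int storage[4];

/* Initialize elements OLD_SIZE..SIZE-1, like the resize loop shown
   in the diagnostic above.  */
static void
fill (int *array, size_t old_size, size_t size)
{
  for (size_t i = old_size; i < size; ++i)
    array[i] = 23;
}

static int
call_with_warning_disabled (void)
{
  assert (sizeof storage / sizeof storage[0] == 4);
  /* Disable the warning for this call only, then restore the
     previous diagnostic state.  */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Waggressive-loop-optimizations"
  fill (storage, 0, 4);
#pragma GCC diagnostic pop
  return storage[3];
}
```

Keeping the push/pop pair tight around the call leaves the warning active
for the rest of the file.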
Tested with build-many-glibcs.py (GCC mainline) for arm-linux-gnueabi,
where it restores a clean testsuite build.
Compiling mmap_internal fails when we use -1 for MMAP2_PAGE_UNIT
on 32-bit architectures. The error is as follows:
../sysdeps/unix/sysv/linux/mmap_internal.h:30:8: error: unknown type name 'uint64_t'
   30 | static uint64_t page_unit;
      |        ^~~~~~~~
Fix by including stdint.h.
Since some linkers (for instance lld for i386) do not support it
for all architectures.
Checked on i686-linux-gnu.
Reviewed-by: Fangrui Song <maskray@google.com>
GCC does not print the correct linker when -fuse-ld=lld is used with
the -print-prog-name=ld:
$ gcc -v 2>&1 | tail -n 1
gcc version 11.2.0 (Ubuntu 11.2.0-7ubuntu2)
$ gcc
ld
This is different than for gold:
$ gcc -fuse-ld=gold -print-prog-name=ld
ld.gold
Using ld.lld as the static linker name prints the expected result.
This is only required when -fuse-ld=lld is used; if lld is used as
the 'ld' program (through a symlink), LIBC_PROG_BINUTILS works
as expected.
Checked on x86_64-linux-gnu.
Reviewed-by: Fangrui Song <maskray@google.com>
These tests take the address of a protected symbol (foo_protected)
and lld does not support copy relocations on protected data symbols.
Checked on x86_64-linux-gnu.
Reviewed-by: Fangrui Song <maskray@google.com>
Hoist the NULL check for malloc_usable_size into its entry points in
malloc-debug and malloc and assume non-NULL in all callees. This fixes
BZ #28506.
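The hoisting pattern can be sketched as follows. This is an illustration
only: struct sketch_chunk and both functions are hypothetical and do not
reflect the malloc internals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for allocator metadata.  */
struct sketch_chunk
{
  size_t usable;
};

/* Callee: relies on the entry point having rejected NULL.  */
static size_t
usable_size_nonnull (const void *mem)
{
  assert (mem != NULL);
  return ((const struct sketch_chunk *) mem)->usable;
}

/* Entry point: the only place that checks for NULL.  */
static size_t
sketch_malloc_usable_size (const void *mem)
{
  if (mem == NULL)
    return 0;
  return usable_size_nonnull (mem);
}
```

With the check in the entry point, every callee can dereference its
argument without repeating the test.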
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
The following commit:
commit cf4fd28ea4
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Tue Oct 26 19:43:18 2021 -0500
Broke --disable-multi-arch build for x86_64 because x86_64/memcmpeq.S
was not defined outside of multiarch and the alias for __memcmpeq in
x86_64/memcmp.S was removed.
This commit fixes that issue by adding x86_64/memcmpeq.S.
make xcheck passes on x86_64 with and without --disable-multi-arch
There are several packages like sysvinit and buildroot that expect
-lutil to work. Rather than impacting them with having to change
the linker flags provide an empty libutil.a.
A non-local STV_DEFAULT defined symbol is by default preemptible in a
shared object. j/jal cannot target a preemptible symbol. On other
architectures, such a jump instruction either causes a PLT entry
[BZ #18822] or, if short-ranged, is sometimes rejected by the linker
(but not by GNU ld's riscv port [ld PR/28509]).
Use HIDDEN_JUMPTARGET to target a non-preemptible symbol instead.
With this patch, ld.so and libc.so can be linked with LLD if source
files are compiled/assembled with -mno-relax/-Wa,-mno-relax.
Acked-by: Palmer Dabbelt <palmer@dabbelt.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
No bug. This commit adds new optimized __memcmpeq implementation for
evex.
The primary optimizations are:
1) skipping the logic to find the difference of the first mismatched
byte.
2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.
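What skipping the difference computation means can be shown with a
portable reference sketch. This is not the evex implementation:
sketch_memcmpeq is a hypothetical name, and the byte loop only
illustrates that a zero or non-zero result needs no knowledge of which
byte differs first.

```c
#include <assert.h>
#include <stddef.h>

/* Return zero if the N bytes at A and B are equal and a non-zero
   value otherwise.  The sign of the result carries no meaning.  */
static int
sketch_memcmpeq (const void *a, const void *b, size_t n)
{
  const unsigned char *pa = a;
  const unsigned char *pb = b;
  unsigned char diff = 0;
  assert (n == 0 || (a != NULL && b != NULL));
  /* Accumulate all differences; never locate the first one.  */
  for (size_t i = 0; i < n; ++i)
    diff |= (unsigned char) (pa[i] ^ pb[i]);
  return diff;
}
```

A caller may only test the result against zero, which is what allows the
ordering logic to be dropped.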
No bug. This commit adds new optimized __memcmpeq implementation for
avx2.
The primary optimizations are:
1) skipping the logic to find the difference of the first mismatched
byte.
2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.
No bug. This commit does not modify any of the memcmp
implementation. It just adds __memcmpeq ifdefs to skip obvious cases
where computing the proper 1/-1 required by memcmp is not needed.
No bug. This commit adds support for __memcmpeq to be implemented
separately from memcmp. Support is added for versions optimized with
sse2, avx2, and evex.
No bug. This commit adds __memcmpeq benchmarks. The benchmarks just
use the existing ones in memcmp. This will be useful for testing
implementations of __memcmpeq that do not just alias memcmp.
No bug. This commit just adds __memcmpeq as a build target so that
implementations for __memcmpeq that are not just aliases to memcmp can
be supported.
No bug.
This commit adds tests for the new function __memcmpeq. The new tests
use the existing tests in 'test-memcmp.c' but relax the result
requirement to only check for zero or non-zero returns.
All string tests, including test-memcmpeq, pass.