This patch introduces two new convenience functions to set the default
thread attributes used for creating threads. This allows a programmer
to set the default thread attributes just once in a process and then
run pthread_create without additional attributes.
GCC 4.8 enables -ftree-loop-distribute-patterns at -O3 by default and
this optimization may transform loops into memset/memmove calls. Without
proper handling this may generate unexpected PLT calls on GLIBC.
This patch fixes by create memset/memmove alias to internal GLIBC
__GI_memset/__GI_memmove symbols.
The most common use case of math functions is with default rounding
mode, i.e. rounding to nearest. Setting and restoring rounding mode
is an unnecessary overhead for this, so I've added support for a
context, which does the set/restore only if the FP status needs a
change. The code is written such that only x86 uses these. Other
architectures should be unaffected by it, but would definitely benefit
if the set/restore has as much overhead relative to the rest of the
code, as the x86 bits do.
Here's a summary of the performance improvement due to these
improvements; I've only mentioned functions that use the set/restore
and have benchmark inputs for x86_64:
Before:
cos(): ITERS:4.69335e+08: TOTAL:28884.6Mcy, MAX:4080.28cy, MIN:57.562cy, 16248.6 calls/Mcy
exp(): ITERS:4.47604e+08: TOTAL:28796.2Mcy, MAX:207.721cy, MIN:62.385cy, 15543.9 calls/Mcy
pow(): ITERS:1.63485e+08: TOTAL:28879.9Mcy, MAX:362.255cy, MIN:172.469cy, 5660.86 calls/Mcy
sin(): ITERS:3.89578e+08: TOTAL:28900Mcy, MAX:704.859cy, MIN:47.583cy, 13480.2 calls/Mcy
tan(): ITERS:7.0971e+07: TOTAL:28902.2Mcy, MAX:1357.79cy, MIN:388.58cy, 2455.55 calls/Mcy
After:
cos(): ITERS:6.0014e+08: TOTAL:28875.9Mcy, MAX:364.283cy, MIN:45.716cy, 20783.4 calls/Mcy
exp(): ITERS:5.48578e+08: TOTAL:28764.9Mcy, MAX:191.617cy, MIN:51.011cy, 19071.1 calls/Mcy
pow(): ITERS:1.70013e+08: TOTAL:28873.6Mcy, MAX:689.522cy, MIN:163.989cy, 5888.18 calls/Mcy
sin(): ITERS:4.64079e+08: TOTAL:28891.5Mcy, MAX:6959.3cy, MIN:36.189cy, 16062.8 calls/Mcy
tan(): ITERS:7.2354e+07: TOTAL:28898.9Mcy, MAX:1295.57cy, MIN:380.698cy, 2503.7 calls/Mcy
So the improvements are:
cos: 27.9089%
exp: 22.6919%
pow: 4.01564%
sin: 19.1585%
tan: 1.96086%
The downside of the change is that it will have an adverse performance
impact on non-default rounding modes, but I think the tradeoff is
justified.
This is the initial support for string function performance tests,
along with copying tests for memcpy and memcpy-ifunc as proof of
concept. The string function benchmarks perform operations at
different alignments and for different sizes and compare performance
between plain operations and the optimized string operations. Due to
this their output is incompatible with the function benchmarks where
we're interested in fastest time, throughput, etc.
In future, the correctness checks in the benchmark tests can be
removed. Same goes for the performance measurements in the
string/test-*.
__clock_gettime and other __clock_* functions could result in an extra
PLT reference within libc.so if it actually gets used. None of the
code currently uses them, which is why this probably went unnoticed.
When setting BENCH_DURATION in CPPFLAGS-nonlib, append to the variable
instead of assigning to it, to avoid overwriting earlier set flags,
notably the -DNOT_IN_libc=1 flag.
In 128-bit IBM long double the precision of the type
decreases as you approach subnormal numbers, equaling
that of a double for subnormal numbers. Therefore
adjust the computation in ulp to use 2^(MIN_EXP - MANT_DIG)
which is correct for FP_SUBNORMAL for all types.
Resolves: #15465
The program name may be unavailable if the user application tampers
with argc and argv[]. Some parts of the dynamic linker caters for
this while others don't, so this patch consolidates the check and
fallback into a single macro and updates all users.
Added descriptive titles to the Belarusian,
English (American), and Chinese (simplified)
po/pot files.
---
2013-05-28 Carlos O'Donell <carlos@redhat.com>
* po/be.po: Add descriptive title.
* po/zh_CN.po: Likewise.
* po/header.pot: Likewise.
When mkstemp fails, the error message the user gets back is:
cannot create temporary file: No such file or directory
That isn't terribly useful in figuring out why, so include the full
filename we tried to create in the error output.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
It is the magnitude of the return value which lies
in [0.5, 1), not the return value itself.
---
2013-05-28 Ben North <ben@redfrontdoor.org>
* manual/arith.texi (frexp): It is the magnitude of the return
value which lies in [0.5, 1), not the return value itself.
The current value used for ulp near zero is wrong,
and this commit fixes it such that ulp(0) is the smallest
subnormal value nearest to zero, which makes the most
sense for testing values near zero. Note that this is not
what Java does; they use the nearest normal value, which
is less accurate than what we want for glibc. Note that
there is no correct implementation of ulp since there
is no strict mathmatical definition that is accepted by
all groups using IEEE 754.
Previously with the large ulp values near zero there
were tests that previously passed, but were in fact
billions of ulp away from the precise answer. With this
commit we now need to disable one of the cpow tests which
is revealed to be inaccurate (bug 14473).
---
2013-05-24 Carlos O'Donell <carlos@redhat.com>
* math/libm-test.inc (MAX_EXP): Define.
(ULPDIFF): Define.
(ulp): New function.
(check_float_internal): Use ULPDIFF.
(cpow_test): Disable failing test.
(check_ulp): Test ulp() implemetnation.
(main): Call check_ulp before starting tests.
Fixes 15381.
Using wide character function is on byte oriented memstream is undefined
behaviour. This behaviour was masked by not initializing wide struct
info. We now initialize it to cause a predictable crash.
In dl-hwcaps.c the comment read that rounding was done
to ElfW(Addr), but it's actually rounded to ElfW(Word).
In ldconfig.c we make each comment a sentence and
mention that the "tls" pseudo-hwcap is just for legacy
installations where TLS was optional.
---
2013-05-22 Carlos O'Donell <carlos@redhat.com>
* elf/ldconfig.c (is_hwcap_platform): Make comments full setences.
(main): Mention "tls" pseudo-hwcap is legacy.
* elf/dl-hwcaps.c (_dl_important_hwcaps): Correct rounding comment.
This patch fixes two issues, and perhaps should be two distinct commits,
but I present it here as one for the sake of completeness.
Commit 006dd86111 fails to check malloc's
return in intl/dcigettext.c (_nl_find_msg):
~~~
freemem_size = INITIAL_BLOCK_SIZE;
newmem = (transmem_block_t *) malloc (freemem_size);
...
newmem->next = transmem_list;
transmem_list = newmem;
~~~
If malloc fails then newmem is NULL then newmem->next results in a
fault.
The fix is easy enough, check for newmem != NULL, and fall through to
the error condition below which returns (char *) -1 e.g. resource error.
The problem is that returning (char *) -1 will break all sorts of other
code, so while what we did is correct, the real failure case fix is
slightly broader.
There are 4 other places where _nl_find_msg is called, one is OK, the
other three are fixed to handle -1 error return value.
No regressions on x86-64 or x86.
However, no regressions isn't really a useful metric for this code.
The change was tested as documented here:
http://sourceware.org/glibc/wiki/Testing/WhiteBox
using SystemTap for fault injection to simulate malloc failure.
---
2013-05-03 Carlos O'Donell <carlos at redhat.com>
[BZ #15441]
* intl/dcigettext.c (DCIGETTEXT): Skip translating if _nl_find_msg
returns -1.
(_nl_find_msg): Return -1 if recursive call returned -1. If newmem is
null return -1.
* intl/loadmsgcat.c (_nl_load_domain): If _nl_find_msg returns -1 abort
loading the domain.
This helps testing for regression of BZ#15339. Creation of network
isolated environments is a privileged operation and therefore is not
included to the test.
Fixes BZ #15339.
NSS_STATUS_UNAVAIL may mean that a necessary input resource is not
available. This could occur in a number of cases including when the
network is down, system runs out of file descriptors, etc. The
correct differentiator in such a case is the h_errno, which gives the
nature of failure. In case of failures other than a simple 'not
found', we set h_errno as NETDB_INTERNAL and let errno be the
identifier for the exact error.
This implementation speed up memset in several ways. First is avoiding
expensive computed jump. Second is using fact that arguments of memset
are most of time aligned to 8 bytes.
Benchmark results on:
kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_result27_04_13.tar.bz2
We add new memcpy version that uses unaligned loads which are fast
on modern processors. This allows second improvement which is avoiding
computed jump which is relatively expensive operation.
Tests available here:
http://kam.mff.cuni.cz/~ondra/memcpy_profile_result27_04_13.tar.bz2
Resolves: #15424
The compiler would optimize the benchmark function call out of the
loop and call it only once, resulting in blazingly fast times for some
benchmarks (notably atan, sin and cos). Mark the inputs as volatile
so that the code is forced to read again from the input for each
iteration.
[BZ #15442] This adds support for the inverse interpretation of the
quiet bit of IEEE 754 floating-point NaN data that some processors
use. This includes in particular MIPS architecture processors; the
payload used for the canonical qNaN encoding is updated accordingly
so as not to interfere with the quiet bit.
The EXTRACT_WORDS64 and INSERT_WORDS64 macros use movd for a 64-bit
operation. Somehow gcc manages to turn this into movq, but LLVM won't.
2013-05-15 Peter Collingbourne <pcc@google.com>
* sysdeps/x86_64/fpu/math_private.h (MOVQ): New macro.
(EXTRACT_WORDS64) Use where appropriate.
(INSERT_WORDS64) Likewise.
While these instructions accept memory operands, only one operand
may be a memory operand. Giving two operands xm constraints gives
the compiler the option of using memory for both operands, which
would result in invalid assembly code. Using x for all operands is
more appropriate, as most x86_64 calling conventions will pass the
arguments in registers anyway.
2013-05-15 Peter Collingbourne <pcc@google.com>
* sysdeps/x86_64/fpu/multiarch/s_fma.c (__fma_fma4): Replace xm
constraints with x constraints.
* sysdeps/x86_64/fpu/multiarch/s_fmaf.c (__fmaf_fma4): Likewise.
it is impossible to create an alias of a common symbol (as
compat_symbol does), because common symbols do not have a section or
an offset until linked. GNU as tolerates aliases of common symbols by
simply creating another common symbol, but other assemblers (notably
LLVM's integrated assembler) are less tolerant.
2013-05-15 Peter Collingbourne <pcc@google.com>
* malloc/obstack.c (_obstack_compat): Add initializer.
-
Loading of the vDSO pseudo-hwcap from the type 2 GNU note is
a rather arcane and poorly documented process. Given that I had
a chance to review this code today I thought I would add all
of the things I had to lookup to verify the validity of the
process.
With a single .note.GNU the vDSO can register up to 64 flags,
though in practice you are limited to 64 - _DL_FIRST_EXTRA
bits which on x86 is 12 bits.
The only use of this that I know of is in the Xen support
in Linux where they use the 1st bit to indicate "nosegneg".
I see "We use bit 1 to avoid bugs in some versions of glibc
when bit 0 is used; the choice is otherwise arbitrary.", but
no reference to a glibc bug anywhere. The code as-is should
support bit zero, so we still have that free for future use.
The kernel, glibc, and ld.so.cache must coordinate to ensure
that bit values don't go too high and are used consistently.
---
2013-05-13 Carlos O'Donell <carlos@redhat.com>
* elf/dl-hwcaps.c (_dl_important_hwcaps): Comment vDSO hwcap loading.
* elf/ldconfig.c (is_hwcap_platform): Comment each hwcap check.
(main): Comment "tls" pseudo-hwcap.
HP_TIMING uses native timestamping instructions if available, thus
greatly reducing the overhead of recording start and end times for
function calls. For architectures that don't have HP_TIMING
available, we fall back to the clock_gettime bits. One may also
override this by invoking the benchmark as follows:
make USE_CLOCK_GETTIME=1 bench
and get the benchmark results using clock_gettime. One has to do
`make bench-clean` to ensure that the benchmark programs are rebuilt.
The algorithm for scanning dependencies upon dlclose is
less than immediately obvious. This patch adds two bits
of comments that explain why you start the dependency
search at l_initfini[1], and why you need to restart
the search.
---
2013-05-09 Carlos O'Donell <carlos@redhat.com>
* elf/dl-close.c (_dl_close_worker): Add comments.
Rewrite the first paragraph to talk about users not humans,
and to use correct English.
Clarify that it is the mapping of messages to IDs that
impacts the design of the message translation API.
---
2013-05-07 Carlos O'Donell <carlos@redhat.com>
* manual/message.texi (Message Translation): Talk about users.
Message to key mapping impacts design.
PowerPC kernel now provides a vDSO implementation for time syscall
(commit fcb41a2030abe0eb716ef0798035ef9562097f42). This patch changes
time syscall wrapper to use the vDSO when available. It also changes
the default non vDSO time on PowerPC to use sysdeps/posix/time.c
(since gettimeofday is a vDSO call).
* sysdeps/gnu/netinet/tcp.h (TCP_TIMESTAMP): New value, from
Linux 3.9.
* sysdeps/unix/sysv/linux/bits/socket.h (PF_VSOCK, AF_VSOCK):
Add.
(PF_MAX): Adjust for VSOCK change.
We add yesstr and nostr to three more locales.
We ignore the issue of capitalization of the first
character in yesstr and nostr. All locales will need
to be revisited to make this uniform policy change.
---
2013-05-02 Carlos O'Donell <carlos@redhat.com>
[BZ #15264]
* localedata/locales/en_CA (LC_MESSAGES): Define yesstr and nostr.
* localedata/locales/es_AR (LC_MESSAGES): Copy es_ES.
* localedata/locales/es_ES (LC_MESSAGES): Define yesstr and nostr.
Use the __gnu_inline__ attribute in _FORTIFY_SOURCE's __extern_always_inline
macro whenever the compiler supports it. Previously this macro only included
the __gnu_inline__ attribute in C++ mode for gcc >= 4.3. However,
__gnu_inline__ semantics are always desired for the __extern_always_inline
functions, and are available in g++ 4.2 (and some releases of g++ 4.1, and
also in Clang, which claims to be g++ 4.2).
This change stops g++-4.2 from emitting weak definitions for the fortify
wrapper functions if they can't be inlined, and also improves Clang
compatibility.
Some math functions have distinct performance characteristics in
specific domains of inputs, where some inputs return via a fast path
while other inputs require multiple precision calculations, that too
at different precision levels. The way to implement different domains
was to have a separate source file and benchmark definition, resulting
in separate programs.
This clutters up the benchmark, so this change allows these domains to
be consolidated into the same input file. To do this, the input file
format is now enhanced to allow comments with a preceding # and
directives with two # at the begining of a line. A directive that
looks like:
tells the benchmark generation script that what follows is a different
domain of inputs. The value of the 'name' directive (in this case,
foo) is used in the output. The two input domains are then executed
sequentially and their results collated separately. with the above
directive, there would be two lines in the result that look like:
func(): ....
func(foo): ...
The idea to run benchmarks for a constant number of iterations is
problematic. While the benchmarks may run for 10 seconds on x86_64,
they could run for about 30 seconds on powerpc and worse, over 3
minutes on arm. Besides that, adding a new benchmark is cumbersome
since one needs to find out the number of iterations needed for a
sufficient runtime.
A better idea would be to run each benchmark for a specific amount of
time. This patch does just that. The run time defaults to 10 seconds
and it is configurable at command line:
make BENCH_DURATION=5 bench