Joseph pointed out in the bug report (and in an earlier thread) that
systemtap probes cause build time warnings like the following:
../sysdeps/ieee754/dbl-64/e_atan2.c:602:4: warning: the address of
'p' will always evaluate as 'true' [-Waddress]
due to the fact that we're now passing non-weak variables to
LIBC_PROBE in the libm probes. This happens only on configurations
that do not enable systemtap. The macro definition of LIBC_PROBE in
this case only acts as a sanity checker to ensure that the number
parameters passed to LIBC_PROBE is equal to the argument count
parameter passed before it. This can be done in a much simpler manner
by just adding a macro definition for each number of arguments. I am
assuming here that we don't really want to bother with supporting
LIBC_PROBE with an indeterminate number of arguments and if there is a
need for a probe to have more data than what is currently supported (4
arguments), one could simply add an additional macro here.
Rather than open coding the masks, add helper macros to do the magic.
This makes code easier to read.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Check wheter the compiler has the option -fno-tree-loop-distribute-patterns
to inhibit loop transformation to library calls and uses it on memset
and memmove default implementation to avoid recursive calls.
The most common use case of math functions is with default rounding
mode, i.e. rounding to nearest. Setting and restoring rounding mode
is an unnecessary overhead for this, so I've added support for a
context, which does the set/restore only if the FP status needs a
change. The code is written such that only x86 uses these. Other
architectures should be unaffected by it, but would definitely benefit
if the set/restore has as much overhead relative to the rest of the
code, as the x86 bits do.
Here's a summary of the performance improvement due to these
improvements; I've only mentioned functions that use the set/restore
and have benchmark inputs for x86_64:
Before:
cos(): ITERS:4.69335e+08: TOTAL:28884.6Mcy, MAX:4080.28cy, MIN:57.562cy, 16248.6 calls/Mcy
exp(): ITERS:4.47604e+08: TOTAL:28796.2Mcy, MAX:207.721cy, MIN:62.385cy, 15543.9 calls/Mcy
pow(): ITERS:1.63485e+08: TOTAL:28879.9Mcy, MAX:362.255cy, MIN:172.469cy, 5660.86 calls/Mcy
sin(): ITERS:3.89578e+08: TOTAL:28900Mcy, MAX:704.859cy, MIN:47.583cy, 13480.2 calls/Mcy
tan(): ITERS:7.0971e+07: TOTAL:28902.2Mcy, MAX:1357.79cy, MIN:388.58cy, 2455.55 calls/Mcy
After:
cos(): ITERS:6.0014e+08: TOTAL:28875.9Mcy, MAX:364.283cy, MIN:45.716cy, 20783.4 calls/Mcy
exp(): ITERS:5.48578e+08: TOTAL:28764.9Mcy, MAX:191.617cy, MIN:51.011cy, 19071.1 calls/Mcy
pow(): ITERS:1.70013e+08: TOTAL:28873.6Mcy, MAX:689.522cy, MIN:163.989cy, 5888.18 calls/Mcy
sin(): ITERS:4.64079e+08: TOTAL:28891.5Mcy, MAX:6959.3cy, MIN:36.189cy, 16062.8 calls/Mcy
tan(): ITERS:7.2354e+07: TOTAL:28898.9Mcy, MAX:1295.57cy, MIN:380.698cy, 2503.7 calls/Mcy
So the improvements are:
cos: 27.9089%
exp: 22.6919%
pow: 4.01564%
sin: 19.1585%
tan: 1.96086%
The downside of the change is that it will have an adverse performance
impact on non-default rounding modes, but I think the tradeoff is
justified.
__clock_gettime and other __clock_* functions could result in an extra
PLT reference within libc.so if it actually gets used. None of the
code currently uses them, which is why this probably went unnoticed.
Since we want to use this in installed headers, move it to the installed
sys/cdefs.h. This requires a slight tweaking of the name (add trailing
underscores).
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
These prototypes are duplicated in many places. Add a dedicated
header for holding prototypes for program-specific functions to
avoid that.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
This feature is specifically for the C++ compiler to offload calling
thread_local object destructors on thread program exit, to glibc.
This is to overcome the possible complication of destructors of
thread_local objects getting called after the DSO in which they're
defined is unloaded by the dynamic linker. The DSO is marked as
'unloadable' if it has a constructed thread_local object and marked as
'unloadable' again when all the constructed thread_local objects
defined in it are destroyed.
These internal knobs are not exposed as part of the public ABI, so mark
them hidden to avoid generating relocations against them.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>