The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp2m1f.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow). The
only change is to handle FLT_MAX_EXP for FE_DOWNWARD or FE_TOWARDZERO.
The benchmark inputs are based on exp2f ones.
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 40.6042 48.7104 -19.96%
x86_64v2 40.7506 35.9032 11.90%
x86_64v3 35.2301 31.7956 9.75%
i686 102.094 94.6657 7.28%
aarch64 18.2704 15.1387 17.14%
power10 11.9444 8.2402 31.01%
reciprocal-throughput master patched improvement
x86_64 20.8683 16.1428 22.64%
x86_64v2 19.5076 10.4474 46.44%
x86_64v3 19.2106 10.4014 45.86%
i686 56.4054 59.3004 -5.13%
aarch64 12.0781 7.3953 38.77%
power10 6.5306 5.9388 9.06%
The generic implementation calls __ieee754_exp2f and x86_64 provides
an optimized ifunc version (built with -mfma -mavx2, not correctly
rounded). This explains the performance difference for x86_64.
Same for i686, where the ABI provides an optimized __ieee754_exp2f
version built with '-msse2 -mfpmath=sse'. When built wth same
flags, the new algorithm shows a better performance:
master patched improvement
latency 102.094 91.2823 10.59%
reciprocal-throughput 56.4054 52.7984 6.39%
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp10m1f.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow). I mostly
fixed some small issues in corner cases (sNaN handling, -INFINITY,
a specific overflow check).
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 45.4690 49.5845 -9.05%
x86_64v2 46.1604 36.2665 21.43%
x86_64v3 37.8442 31.0359 17.99%
i686 121.367 93.0079 23.37%
aarch64 21.1126 15.0165 28.87%
power10 12.7426 8.4929 33.35%
reciprocal-throughput master patched improvement
x86_64 19.6005 17.4005 11.22%
x86_64v2 19.6008 11.1977 42.87%
x86_64v3 17.5427 10.2898 41.34%
i686 59.4215 60.9675 -2.60%
aarch64 13.9814 7.9173 43.37%
power10 6.7814 6.4258 5.24%
The generic implementation calls __ieee754_exp10f which has an
optimized version, although it is not correctly rounded, which is
the main culprit of the the latency difference for x86_64 and
throughp for i686.
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode).
This can be checked by exhaustive tests in a few minutes since there are
less than 2^32 values to check against for example GNU MPFR.
This patch also adds some bench values for tgammaf.
Tested on x86_64 and x86 (cfarm26).
With the initial GNU libc code it gave on an Intel(R) Core(TM) i7-8700:
"tgammaf": {
"": {
"duration": 3.50188e+09,
"iterations": 2e+07,
"max": 602.891,
"min": 65.1415,
"mean": 175.094
}
}
With the new code:
"tgammaf": {
"": {
"duration": 3.30825e+09,
"iterations": 5e+07,
"max": 211.592,
"min": 32.0325,
"mean": 66.1649
}
}
With the initial GNU libc code it gave on cfarm26 (i686):
"tgammaf": {
"": {
"duration": 3.70505e+09,
"iterations": 6e+06,
"max": 2420.23,
"min": 243.154,
"mean": 617.509
}
}
With the new code:
"tgammaf": {
"": {
"duration": 3.24497e+09,
"iterations": 1.8e+07,
"max": 1238.15,
"min": 101.155,
"mean": 180.276
}
}
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Changes in v2:
- include <math.h> (fix the linknamespace failures)
- restored original benchtests/strcoll-inputs/filelist#en_US.UTF-8 file
- restored original wrapper code (math/w_tgammaf_compat.c),
except for the dealing with the sign
- removed the tgammaf/float entries in all libm-test-ulps files
- address other comments from Joseph Myers
(https://sourceware.org/pipermail/libc-alpha/2024-July/158736.html)
Changes in v3:
- pass NULL argument for signgam from w_tgammaf_compat.c
- use of math_narrow_eval
- added more comments
Changes in v4:
- initialize local_signgam to 0 in math/w_tgamma_template.c
- replace sysdeps/ieee754/dbl-64/gamma_productf.c by dummy file
Changes in v5:
- do not mention local_signgam any more in math/w_tgammaf_compat.c
- initialize local_signgam to 1 instead of 0 in w_tgamma_template.c
and added comment
Changes in v6:
- pass NULL as 2nd argument of __ieee754_gammaf_r in
w_tgammaf_compat.c, and check for NULL in e_gammaf_r.c
Changes in v7:
- added Signed-off-by line for Alexei Sibidanov (author of the code)
Changes in v8:
- added Signed-off-by line for Paul Zimmermann (submitted of the patch)
Changes in v9:
- address comments from review by Adhemerval Zanella
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Several copies of the licenses in files contained whitespace related
problems. Two cases are addressed here, the first is two spaces
after a period which appears between "PURPOSE." and "See". The other
is a space after the last forward slash in the URL. Both issues are
corrected and the licenses now match the official textual description
of the license (and the other license in the sources).
Since these whitespaces changes do not alter the paragraph structure of
the license, nor create new sentences, they do not change the license.
These functions are exp10m1, exp2m1, log10p1, log2p1.
Also regenerated ulps on x86_64.
For each format, there are 4 values, one for each rounding mode.
(For the intel96 format, there are 8 values, 4 for Intel hardware,
and 4 for AMD hardware. However, regen-ulps was only run on Intel.
It should be run in a separate patch on a AMD x86_64.)
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the exp2m1 and exp10m1 functions (exp2(x)-1 and
exp10(x)-1, like expm1).
As with other such functions, these use type-generic templates that
could be replaced with faster and more accurate type-specific
implementations in future. Test inputs are copied from those for
expm1, plus some additions close to the overflow threshold (copied
from exp2 and exp10) and also some near the underflow threshold.
exp2m1 has the unusual property of having an input (M_MAX_EXP) where
whether the function overflows (under IEEE semantics) depends on the
rounding mode. Although these could reasonably be XFAILed in the
testsuite (as we do in some cases for arguments very close to a
function's overflow threshold when an error of a few ulps in the
implementation can result in the implementation not agreeing with an
ideal one on whether overflow takes place - the testsuite isn't smart
enough to handle this automatically), since these functions aren't
required to be correctly rounding, I made the implementation check for
and handle this case specially.
The Makefile ordering expected by lint-makefiles for the new functions
is a bit peculiar, but I implemented it in this patch so that the test
passes; I don't know why log2 also needed moving in one Makefile
variable setting when it didn't in my previous patches, but the
failure showed a different place was expected for that function as
well.
The powerpc64le IFUNC setup seems not to be as self-contained as one
might hope; it shouldn't be necessary to add IFUNCs for new functions
such as these simply to get them building, but without setting up
IFUNCs for the new functions, there were undefined references to
__GI___expm1f128 (that IFUNC machinery results in no such function
being defined, but doesn't stop include/math.h from doing the
redirection resulting in the exp2m1f128 and exp10m1f128
implementations expecting to call it).
Tested for x86_64 and x86, and with build-many-glibcs.py.
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the log10p1 functions (log10(1+x): like log1p, but for
base-10 logarithms).
This is directly analogous to the log2p1 implementation (except that
whereas log2p1 has a smaller underflow range than log1p, log10p1 has a
larger underflow range). The test inputs are copied from those for
log1p and log2p1, plus a few more inputs in that wider underflow
range.
Tested for x86_64 and x86, and with build-many-glibcs.py.
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the logp1 functions (aliases for log1p functions - the
name is intended to be more consistent with the new log2p1 and
log10p1, where clearly it would have been very confusing to name those
functions log21p and log101p). As aliases rather than new functions,
the content of this patch is somewhat different from those actually
adding new functions.
Tests are shared with log1p, so this patch *does* mechanically update
all affected libm-test-ulps files to expect the same errors for both
functions.
The vector versions of log1p on aarch64 and x86_64 are *not* updated
to have logp1 aliases (and thus there are no corresponding header,
tests, abilist or ulps changes for vector functions either). It would
be reasonable for such vector aliases and corresponding changes to
other files to be made separately. For now, the log1p tests instead
avoid testing logp1 in the vector case (a Makefile change is needed to
avoid problems with grep, used in generating the .c files for vector
function tests, matching more than one ALL_RM_TEST line in a file
testing multiple functions with the same inputs, when it assumes that
the .inc file only has a single such line).
Tested for x86_64 and x86, and with build-many-glibcs.py.
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the log2p1 functions (log2(1+x): like log1p, but for
base-2 logarithms).
This illustrates the intended structure of implementations of all
these function families: define them initially with a type-generic
template implementation. If someone wishes to add type-specific
implementations, it is likely such implementations can be both faster
and more accurate than the type-generic one and can then override it
for types for which they are implemented (adding benchmarks would be
desirable in such cases to demonstrate that a new implementation is
indeed faster).
The test inputs are copied from those for log1p. Note that these
changes make gen-auto-libm-tests depend on MPFR 4.2 (or later).
The bulk of the changes are fairly generic for any such new function.
(sysdeps/powerpc/nofpu/Makefile only needs changing for those
type-generic templates that use fabs.)
Tested for x86_64 and x86, and with build-many-glibcs.py.
When -mapxf is used to build glibc, the resulting glibc will never run
on FMA4 machines. Exclude FMA4 IFUNC functions when -mapxf is used.
This requires GCC which defines __APX_F__ for -mapxf with commit:
1df56719bd8 x86: Define __APX_F__ for -mapxf
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
The implementations of trunc functions using x87 floating point (i386 and
x86_64 long double only) traps when FE_INEXACT is enabled. Although
this is a GNU extension outside the scope of the C standard, other
architectures that also support traps do not show this behavior.
The fix moves the implementation to a common one that holds any
exceptions with a 'fnclex' (libc_feholdexcept_setround_387).
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The implementations of floor functions using x87 floating point (i386 and
86_64 long double only) traps when FE_INEXACT is enabled. Although
this is a GNU extension outside the scope of the C standard, other
architectures that also support traps do not show this behavior.
The fix moves the implementation to a common one that holds any
exceptions with a 'fnclex' (libc_feholdexcept_setround_387).
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The implementations of ceil functions using x87 floating point (i386 and
x86_64 long double only) traps when FE_INEXACT is enabled. Although
this is a GNU extension outside the scope of the C standard, other
architectures that also support traps do not show this behavior.
The fix moves the implementation to a common one that holds any
exceptions with a 'fnclex' (libc_feholdexcept_setround_387).
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
When glibc is built with ISA level 3 or higher by default, the resulting
glibc binaries won't run on SSE or FMA4 processors. Exclude SSE, AVX and
FMA4 variants in libm multiarch when ISA level 3 or higher is enabled by
default.
When glibc is built with ISA level 2 enabled by default, only keep SSE4.1
variant.
Fixes BZ 31335.
NB: elf/tst-valgrind-smoke test fails with ISA level 4, because valgrind
doesn't support AVX512 instructions:
https://bugs.kde.org/show_bug.cgi?id=383010
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
According to ISO C23 (7.6.4.4), fesetexcept is supposed to set
floating-point exception flags without raising a trap (unlike
feraiseexcept, which is supposed to raise a trap if feenableexcept
was called with the appropriate argument).
The flags can be set in the 387 unit or in the SSE unit. When we need
to clear a flag, we need to do so in both units, due to the way
fetestexcept is implemented.
When we need to set a flag, it is sufficient to do it in the SSE unit,
because that is guaranteed to not trap. However, on i386 CPUs that have
only a 387 unit, set the flags in the 387, as long as this cannot trap.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
On Skylake, it changes log1p bench performance by:
Before After Improvement
max 63.349 58.347 8%
min 4.448 5.651 -30%
mean 12.0674 10.336 14%
The minimum code path is
if (hx < 0x3FDA827A) /* x < 0.41422 */
{
if (__glibc_unlikely (ax >= 0x3ff00000)) /* x <= -1.0 */
{
...
}
if (__glibc_unlikely (ax < 0x3e200000)) /* |x| < 2**-29 */
{
math_force_eval (two54 + x); /* raise inexact */
if (ax < 0x3c900000) /* |x| < 2**-54 */
{
...
}
else
return x - x * x * 0.5;
FMA and non-FMA code sequences look similar. Non-FMA version is slightly
faster. Since log1p is called by asinh and atanh, it improves asinh
performance by:
Before After Improvement
max 75.645 63.135 16%
min 10.074 10.071 0%
mean 15.9483 14.9089 6%
and improves atanh performance by:
Before After Improvement
max 91.768 75.081 18%
min 15.548 13.883 10%
mean 18.3713 16.8011 8%
On Skylake, it improves expm1 bench performance by:
Before After Improvement
max 70.204 68.054 3%
min 20.709 16.2 22%
mean 22.1221 16.7367 24%
NB: Add
extern long double __expm1l (long double);
extern long double __expm1f128 (long double);
for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined since __expm1 may be expanded in their declarations which
causes the build failure.
Based on feedback by Mike Gilbert <floppym@gentoo.org>
Linux-6.1.38-dist x86_64 AMD Phenom-tm- II X6 1055T Processor
-march=amdfam10
failures occur for x32 ABI
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Applying this commit results in a bit-identical rebuild of
mathvec/libmvec.so.1 (which is the only binary that gets rebuilt).
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This patch enables libmvec on AArch64. The proposed change is mainly
implementing build infrastructure to add the new routines to ABI,
tests and benchmarks. I have demonstrated how this all fits together
by adding implementations for vector cos, in both single and double
precision, targeting both Advanced SIMD and SVE.
The implementations of the routines themselves are just loops over the
scalar routine from libm for now, as we are more concerned with
getting the plumbing right at this point. We plan to contribute vector
routines from the Arm Optimized Routines repo that are compliant with
requirements described in the libmvec wiki.
Building libmvec requires minimum GCC 10 for SVE ACLE. To avoid raising
the minimum GCC by such a big jump, we allow users to disable libmvec
if their compiler is too old.
Note that at this point users have to manually call the vector math
functions. This seems to be acceptable to some downstream users.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
The divss instruction clobbers its first argument, and the constraints
need to reflect that. Fortunately, with GCC 12, generated code does
not actually change, so there is no externally visible bug.
Suggested-by: Jakub Jelinek <jakub@redhat.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
This allows other targets to use the same inputs for their own libmvec
microbenchmarks without having to duplicate them in their own
subdirectory.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Update libm test ulps for
commit 3efbf11fdf
Author: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Date: Tue Feb 14 11:24:59 2023 +0100
update auto-libm-test-out-hypot
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
vfprintf is entangled with vfwprintf (of course), __printf_fp,
__printf_fphex, __vstrfmon_l_internal, and the strfrom family of
functions. The latter use the internal snprintf functionality,
so vsnprintf is converted as well.
The simples conversion is __printf_fphex, followed by
__vstrfmon_l_internal and __printf_fp, and finally
__vfprintf_internal and __vfwprintf_internal. __vsnprintf_internal
and strfrom* are mostly consuming the new interfaces, so they
are comparatively simple.
__printf_fp is a public symbol, so the FILE *-based interface
had to preserved.
The __printf_fp rewrite does not change the actual binary-to-decimal
conversion algorithm, and digits are still not emitted directly to
the target buffer. However, the staging buffer now uses bytes
instead of wide characters, and one buffer copy is eliminated.
The changes are at least performance-neutral in my testing.
Floating point printing and snprintf improved measurably, so that
this Lua script
for i=1,5000000 do
print(i, i * math.pi)
end
runs about 5% faster for me. To preserve fprintf performance for
a simple "%d" format, this commit has some logic changes under
LABEL (unsigned_number) to avoid additional function calls. There
are certainly some very easy performance improvements here: binary,
octal and hexadecimal formatting can easily avoid the temporary work
buffer (the number of digits can be computed ahead-of-time using one
of the __builtin_clz* built-ins). Decimal formatting can use a
specialized version of _itoa_word for base 10.
The existing (inconsistent) width handling between strfmon and printf
is preserved here. __print_fp_buffer_1 would have to use
__translated_number_width to achieve ISO conformance for printf.
Test expectations in libio/tst-vtables-common.c are adjusted because
the internal staging buffer merges all virtual function calls into
one.
In general, stack buffer usage is greatly reduced, particularly for
unbuffered input streams. __printf_fp can still use a large buffer
in binary128 mode for %g, though.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Most of these don't really matter as there was no dirty upper state
but we should generally avoid stray sse when its not needed.
The one case that really matters is in svml_d_tanh4_core_avx2.S:
blendvps %xmm0, %xmm8, %xmm7
When there was a dirty upper state.
Tested on x86_64-linux
Optimizations are:
1. Reduce code size (-112 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Reduce rodata size (-4k+ rodata is shared with avx2).
Result is roughly a 15-16% speedup:
Function, New Time, Old Time, New / Old
_ZGVbN4v_tanhf, 3.158, 3.749, 0.842
Optimizations are:
1. Reduce code size (-81 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Reduce rodata size (-32 bytes).
Result is roughly a 17-18% speedup:
Function, New Time, Old Time, New / Old
_ZGVdN8v_tanhf, 1.977, 2.402, 0.823
tanhf-avx2 and tanhf-sse4 use the same data tables so we can save
over 4kb using a shared datatable. This does increase the memory
footprint of the sse4 version (as now all the targets are 32 bytes
instead of 16), generally it seems worth the code size save.
NB: This patch doesn't do anything itself, it is setup for future
patches.
Optimizations are:
1. Reduce code size (-67 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Reduce rodata usage (-448 bytes).
Result is roughly a 14% speedup:
Function, New Time, Old Time, New / Old
_ZGVeN16v_tanhf, 0.649, 0.752, 0.863
Improvements are:
1. Reduce code size (-62 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Reduce rodata usage (-16 bytes).
The throughput improvement is not significant as the port 0 bottleneck
is unavoidable.
Function, New Time, Old Time, New / Old
_ZGVbN4v_atanhf, 8.821, 8.903, 0.991
Improvements are:
1. Reduce code size (-60 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Shrink rodata usage (-32 bytes).
The throughput improvement is not that significant (3-5%) as the
port 0 bottleneck is unavoidable.
Function, New Time, Old Time, New / Old
_ZGVdN8v_atanhf, 2.799, 2.923, 0.958
Improvements are:
1. Reduce code size (-64 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Reduce rodata size ([-128, -188] bytes).
The throughput improvement is not significant as the port 0 bottleneck
is unavoidable.
Function, New Time, Old Time, New / Old
_ZGVeN16v_atanhf, 1.39, 1.408, 0.987
The compiler may substitute calls to sin or cos with calls to sincos, thus
we should have the same optimized implementations for sincos. The
optimized implementations may produce results that differ, that also makes
sure that the sincos call aggrees with the sin and cos calls.
Both float, double, and _Float128 are assumed to be supported
(float and double already only uses builtins). Only long double
is parametrized due GCC bug 29253 which prevents its usage on
powerpc.
It allows to remove i686, ia64, x86_64, powerpc, and sparc arch
specific implementation.
On ia64 it also fixes the sNAN handling:
math/test-float64x-fabs
math/test-ldouble-fabs
Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
powerpc64-linux-gnu, sparc64-linux-gnu, and ia64-linux-gnu.
Improve libmvec benchmark integration so that in future other
architectures may be able to run their libmvec benchmarks as well. This
now allows libmvec benchmarks to be run with `make BENCHSET=bench-math`.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
The libmvec benchmarks print a message indicating that a certain CPU
feature is unsupported and exit prematurelyi, which breaks the JSON in
bench.out.
Handle this more elegantly in the bench makefile target by adding
support for an UNSUPPORTED exit status (77) so that bench.out continues
to have output for valid tests.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>