Use __builtin_fma more in dbl-64 code.

sysdeps/ieee754/dbl-64/dla.h can use a macro DLA_FMS for more efficient double-width operations when fused multiply-subtract is supported. However, this macro is only defined for x86_64, conditional on architecture-specific __FMA4__. This patch makes the code use __builtin_fma conditional on __FP_FAST_FMA, as used elsewhere in glibc. Tested for x86_64, x86 and powerpc. On powerpc (where this is causing fused operations to be used where they weren't previously) I see an increase from 1ulp to 2ulp in the imaginary part of clog10: testing double (without inline functions) Failure: Test: Imaginary part of: clog10 (0x1.7a858p+0 - 0x6.d940dp-4 i) Result: is: -1.2237865208199886e-01 -0x1.f5435146bb61ap-4 should be: -1.2237865208199888e-01 -0x1.f5435146bb61cp-4 difference: 2.7755575615628914e-17 0x1.0000000000000p-55 ulp : 2.0000 max.ulp : 1.0000 Maximal error of real part of: clog10 is : 3 ulp accepted: 3 ulp Maximal error of imaginary part of: clog10 is : 2 ulp accepted: 1 ulp This is actually resulting from atan2 becoming *more* accurate (atan2 (-0x6.d940dp-4, 0x1.7a858p+0) should ideally be -0x1.208cd6e841554p-2 but was -0x1.208cd6e841555p-2 from a powerpc libm built before this change, and is -0x1.208cd6e841554p-2 from a powerpc libm built after this change). Since these functions are not expected to be correctly rounding by glibc's accuracy goals, neither result is a problem, but this does imply that some of this code, although designed to be correctly rounding, is not in fact correctly rounding (possibly because of GCC creating fused operations where the code does not expect it, something we've only disabled for specific functions where it was found to cause large errors). (Of course as previously discussed I think we should remove the slow cases where an error analysis shows this wouldn't increase the errors much above 0.5ulp; it's only functions such as cratan2 that are expected to be correctly rounding, not atan2.) * sysdeps/ieee754/dbl-64/dla.h [__FP_FAST_FMA] (DLA_FMS): Define macro to use __builtin_fma. * sysdeps/x86_64/fpu/dla.h: Remove file.
2024-11-08 14:20:07 +00:00 · 2016-09-30 15:49:51 +00:00 · 2016-09-30 15:49:51 +00:00 · f280fa6d17
commit f280fa6d17
parent 8278d50ce7
3 changed files with 8 additions and 8 deletions
--- a/4
+++ b/4
@ -1,5 +1,9 @@
 2016-09-30  Joseph Myers  <joseph@codesourcery.com>

+	* sysdeps/ieee754/dbl-64/dla.h [__FP_FAST_FMA] (DLA_FMS): Define
+	macro to use __builtin_fma.
+	* sysdeps/x86_64/fpu/dla.h: Remove file.
+
 	* sysdeps/ieee754/ldbl-128ibm/bits/iscanonical.h
 	[__NO_LONG_DOUBLE_MATH] (__iscanonicall): Do not declare.
 	[__NO_LONG_DOUBLE_MATH] (iscanonical): Define to evaluate to 1.
--- a/sysdeps/ieee754/dbl-64/dla.h
+++ b/sysdeps/ieee754/dbl-64/dla.h
@ -57,6 +57,10 @@
 	   z=(x)-(y);  zz=(fabs(x)>fabs(y)) ? (((x)-(z))-(y)) : ((x)-((y)+(z)));


+#ifdef __FP_FAST_FMA
+# define DLA_FMS(x, y, z) __builtin_fma (x, y, -(z))
+#endif
+
 /* Exact multiplication of two single-length floating point numbers,   */
 /* Veltkamp. The macro produces a double-length number (z,zz) that     */
 /* satisfies z+zz = x*y exactly. p,hx,tx,hy,ty are temporary           */
--- a/sysdeps/x86_64/fpu/dla.h
+++ b/sysdeps/x86_64/fpu/dla.h
@ -1,8 +0,0 @@
-#include <features.h>
-
-#ifdef __FMA4__
-# define DLA_FMS(x,y,z) \
-  __builtin_fma (x, y, -(z))
-#endif
-
-#include "sysdeps/ieee754/dbl-64/dla.h"