We have a single thread that runs a no-op initialization once and then
repeatedly runs checks of the initialization (i.e., an acquire load and
conditional jump) in a tight loop. This gives us, on average, the
best-case latency of pthread_once (the initialization is the
exactly-once slow path, and we're not looking at initialization-related
synchronization overheads in this case).
Without -fno-builtin, the compiler may optimize away the calls to
ffs/ffsll.
ChangeLog:
2014-04-01 Will Newton <will.newton@linaro.org>
* benchtests/Makefile (CFLAGS-bench-ffs.c): Add
-fno-builtin. (CFLAGS-bench-ffsll.c): Likewise.
Add benchtests for ffs and ffsll. There is no benchtest for ffsl as
it is identical to one of the other functions.
2014-03-31 Will Newton <will.newton@linaro.org>
* benchtests/Makefile (bench): Add ffs and ffsll to list
of tests.
* benchtests/ffs-inputs: New file.
* benchtests/ffsll-inputs: Likewise.
This patch adds an option to get detailed benchmark output for
functions. Invoking the benchmark with 'make DETAILED=1 bench' causes
each benchmark program to store a mean execution time for each input
it works on. This is useful to give a more comprehensive picture of
performance of functions compared to just the single mean figure.
This patch changes the output format of the main benchmark output file
(bench.out) to an extensible format. I chose JSON over XML because, in
addition to being extensible, it is not too verbose. It also has good
support in Python.
The significant functional change is that timing information is now a
JSON attribute instead of a string. To do that, a separate program
prints out a JSON snippet mentioning the type of timing (hp_timing or
clock_gettime). The mean timing has also changed from iterations per
unit to actual timing per iteration.
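As a rough illustration of the format change described above, the sketch
below converts a total time and iteration count into time per iteration
and keeps the timing type as a separate JSON attribute. The key names
(`timing_type`, `functions`, `mean`) and the sample numbers are
assumptions made for illustration; they are not the actual bench.out
schema.

```python
import json


def mean_time_per_iteration(total_time, iterations):
    """Return the mean time spent per iteration."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return total_time / iterations


def make_report(timing_type, results):
    """Build a JSON document from {function: (total_time, iterations)}.

    The key names are hypothetical. They only illustrate keeping the
    timing type as its own attribute instead of part of a string.
    """
    report = {"timing_type": timing_type, "functions": {}}
    for name, (total_time, iterations) in sorted(results.items()):
        report["functions"][name] = {
            "iterations": iterations,
            "mean": mean_time_per_iteration(total_time, iterations),
        }
    return json.dumps(report, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Made-up sample numbers, not real measurements.
    print(make_report("hp_timing", {"pow": (5000.0, 100)}))
```

Because the timing type is a plain attribute, a consumer can read it with
`json.loads` without parsing it out of a free-form string.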
In <https://sourceware.org/ml/libc-alpha/2014-01/msg00196.html> I
noted it was necessary to add includes of Makeconfig early in various
subdirectory makefiles for the tests-special variable settings added
by that patch to be conditional on configuration information. No-one
commented on the general question there of whether Makeconfig should
always be included immediately after the definition of subdir.
This patch implements that early inclusion of Makeconfig in each
directory (which is a lot easier than consistent placement of includes
of Rules). Includes are added if needed, or moved up if already
present. Subdirectory "all:" targets are removed, since Makeconfig
provides one.
There is potential for further cleanups I haven't done. Rules and
Makerules have code such as
ifneq "$(findstring env,$(origin headers))" ""
headers :=
endif
to override to empty any value of various variables that came from the
environment. I think there is a case for Makeconfig setting all the
subdirectory variables (other than subdir) to empty to ensure no
outside value is going to take effect if a subdirectory fails to
define a variable. (A list of such variables, possibly out of date
and incomplete, is in manual/maint.texi.) Rules and Makerules would
give errors if Makeconfig hadn't already been included, instead of
including it themselves. The special code to override values coming
from the environment would then be obsolete and could be removed.
Tested x86_64, including that installed binaries are identical before
and after the patch.
* argp/Makefile: Include Makeconfig immediately after defining
subdir.
* assert/Makefile: Likewise.
* benchtests/Makefile: Likewise.
* catgets/Makefile: Likewise.
* conform/Makefile: Likewise.
* crypt/Makefile: Likewise.
* csu/Makefile: Likewise.
(all): Remove target.
* ctype/Makefile: Include Makeconfig immediately after defining
subdir.
* debug/Makefile: Likewise.
* dirent/Makefile: Likewise.
* dlfcn/Makefile: Likewise.
* gmon/Makefile: Likewise.
* gnulib/Makefile: Likewise.
* grp/Makefile: Likewise.
* gshadow/Makefile: Likewise.
* hesiod/Makefile: Likewise.
* hurd/Makefile: Likewise.
(all): Remove target.
* iconvdata/Makefile: Include Makeconfig immediately after
defining subdir.
* inet/Makefile: Likewise.
* intl/Makefile: Likewise.
* io/Makefile: Likewise.
* libio/Makefile: Likewise.
(all): Remove target.
* locale/Makefile: Include Makeconfig immediately after defining
subdir.
* login/Makefile: Likewise.
* mach/Makefile: Likewise.
(all): Remove target.
* malloc/Makefile: Include Makeconfig immediately after defining
subdir.
(all): Remove target.
* manual/Makefile: Include Makeconfig immediately after defining
subdir.
* math/Makefile: Likewise.
* misc/Makefile: Likewise.
* nis/Makefile: Likewise.
* nss/Makefile: Likewise.
* po/Makefile: Likewise.
(all): Remove target.
* posix/Makefile: Include Makeconfig immediately after defining
subdir.
* pwd/Makefile: Likewise.
* resolv/Makefile: Likewise.
* resource/Makefile: Likewise.
* rt/Makefile: Likewise.
* setjmp/Makefile: Likewise.
* shadow/Makefile: Likewise.
* signal/Makefile: Likewise.
* socket/Makefile: Likewise.
* soft-fp/Makefile: Likewise.
* stdio-common/Makefile: Likewise.
* stdlib/Makefile: Likewise.
* streams/Makefile: Likewise.
* string/Makefile: Likewise.
* sunrpc/Makefile: Likewise.
(all): Remove target.
* sysvipc/Makefile: Include Makeconfig immediately after defining
subdir.
* termios/Makefile: Likewise.
* time/Makefile: Likewise.
* timezone/Makefile: Likewise.
(all): Remove target.
* wcsmbs/Makefile: Include Makeconfig immediately after defining
subdir.
* wctype/Makefile: Likewise.
libidn/ChangeLog:
* Makefile: Include Makeconfig immediately after defining subdir.
localedata/ChangeLog:
* Makefile: Include Makeconfig immediately after defining subdir.
(all): Remove target.
nptl/ChangeLog:
* Makefile: Include Makeconfig immediately after defining subdir.
nptl_db/ChangeLog:
* Makefile: Include Makeconfig immediately after defining subdir.
This patch adds some more directives to the benchmark inputs file,
moving functionality from the Makefile and making the code generation
script a bit cleaner. The function argument and return types that
were earlier added as variables in the Makefile and passed to the
script via command line arguments are now the 'args' and 'ret'
directives respectively. 'args' should be a colon-separated list of
argument types (skipped if the function doesn't accept any arguments)
and 'ret' should be the return type.
Additionally, an 'includes' directive may have a comma-separated list
of headers to include in the source. For example, the pow input file
now looks like this:
42.0, 42.0
1.0000000000000020, 1.5
I did this to unclutter the benchtests Makefile a bit and eventually
eliminate dependency of the tests on the Makefile and have tests
depend on their respective include files only.
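To make the directive handling more concrete, here is a minimal sketch of
how a script might read 'args', 'ret' and 'includes' from an inputs file.
It assumes directives are written as `## key: value`; that syntax, the
function name `parse_directives`, and the sample text are assumptions for
illustration and do not reproduce the real generation script.

```python
def parse_directives(text):
    """Split an inputs file into directives and rows of input values.

    Assumes each directive is a line of the form '## key: value'.
    """
    directives = {"args": [], "ret": None, "includes": []}
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            # Split on the first colon only, so 'double:double' survives.
            key, _, value = line[2:].partition(":")
            key, value = key.strip(), value.strip()
            if key == "args":
                # Colon-separated list of argument types.
                directives["args"] = [t.strip() for t in value.split(":") if t.strip()]
            elif key == "includes":
                # Comma-separated list of headers.
                directives["includes"] = [h.strip() for h in value.split(",") if h.strip()]
            elif key == "ret":
                directives["ret"] = value
            continue
        # Any other line is one set of comma-separated inputs.
        rows.append([field.strip() for field in line.split(",")])
    return directives, rows


# Hypothetical sample file contents, written for this example only.
SAMPLE = """\
## args: double:double
## ret: double
## includes: math.h
42.0, 42.0
1.0000000000000020, 1.5
"""

if __name__ == "__main__":
    print(parse_directives(SAMPLE))
```

Keeping the types and headers next to the inputs means a reader of the
inputs file sees everything needed to generate the benchmark in one place.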
The benchmark for memcpy got disabled accidentally. Re-enable it.
ChangeLog:
2013-09-06 Will Newton <will.newton@linaro.org>
* benchtests/Makefile (string-bench): Add memcpy.
LDFLAGS puts the library too early in the command line if --as-needed
is being used. Use LDLIBS instead.
ChangeLog:
2013-09-04 Will Newton <will.newton@linaro.org>
* benchtests/Makefile: Use LDLIBS instead of LDFLAGS.
This is the initial support for string function performance tests,
along with copying tests for memcpy and memcpy-ifunc as proof of
concept. The string function benchmarks perform operations at
different alignments and for different sizes and compare performance
between plain operations and the optimized string operations. Because
of this, their output is incompatible with that of the function
benchmarks, where we're interested in fastest time, throughput, etc.
In the future, the correctness checks in the benchmark tests can be
removed. The same goes for the performance measurements in
string/test-*.
When setting BENCH_DURATION in CPPFLAGS-nonlib, append to the variable
instead of assigning to it, to avoid overwriting earlier set flags,
notably the -DNOT_IN_libc=1 flag.
HP_TIMING uses native timestamping instructions if available, thus
greatly reducing the overhead of recording start and end times for
function calls. For architectures that don't have HP_TIMING
available, we fall back to the clock_gettime bits. One may also
override this by invoking the benchmark as follows:
make USE_CLOCK_GETTIME=1 bench
and get the benchmark results using clock_gettime. One has to do
`make bench-clean` to ensure that the benchmark programs are rebuilt.
Some math functions have distinct performance characteristics in
specific domains of inputs, where some inputs return via a fast path
while other inputs require multiple precision calculations, that too
at different precision levels. The way to implement different domains
was to have a separate source file and benchmark definition, resulting
in separate programs.
This clutters up the benchmark, so this change allows these domains to
be consolidated into the same input file. To do this, the input file
format is now enhanced to allow comments with a preceding # and
directives with two # at the beginning of a line. A directive that
looks like:
## name: foo
tells the benchmark generation script that what follows is a different
domain of inputs. The value of the 'name' directive (in this case,
foo) is used in the output. The two input domains are then executed
sequentially and their results collated separately. With the above
directive, there would be two lines in the result that look like:
func(): ....
func(foo): ...
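The domain handling described above could be sketched roughly as follows.
The sketch treats lines starting with two # as directives and other lines
starting with # as comments, and groups inputs under the most recent
'name' directive. The `## name: value` syntax and the helper names
`split_domains` and `result_labels` are assumptions for illustration.

```python
def split_domains(text):
    """Group input lines by domain.

    A 'name' directive starts a new domain. Inputs that appear before
    any 'name' directive belong to the unnamed domain, keyed by "".
    """
    domains = {"": []}
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            key, _, value = line[2:].partition(":")
            if key.strip() == "name":
                current = value.strip()
                domains.setdefault(current, [])
            continue
        if line.startswith("#"):
            # Plain comment, ignored.
            continue
        domains[current].append(line)
    return domains


def result_labels(func, domains):
    """Return one result label per domain, in file order."""
    return ["%s(%s)" % (func, name) for name in domains]


# Hypothetical inputs, written for this example only.
SAMPLE_DOMAINS = """\
# inputs for the default domain
1.0
2.0
## name: foo
3.0
"""

if __name__ == "__main__":
    print(result_labels("func", split_domains(SAMPLE_DOMAINS)))
```

Each domain keeps its own list of inputs, so timings can be collected and
reported per domain rather than mixed into a single figure.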
Running benchmarks for a constant number of iterations is
problematic. While the benchmarks may run for 10 seconds on x86_64,
they could run for about 30 seconds on powerpc and, worse, over 3
minutes on arm. Besides that, adding a new benchmark is cumbersome
since one needs to find out the number of iterations needed for a
sufficient runtime.
A better idea would be to run each benchmark for a specific amount of
time. This patch does just that. The run time defaults to 10 seconds
and is configurable on the command line:
make BENCH_DURATION=5 bench
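The idea of bounding a benchmark by time instead of iteration count can
be sketched as below. This is only an illustration in Python, not the
benchmark skeleton itself; the function name `run_for` and the injectable
`clock` parameter are assumptions. Reading the clock after every call
adds overhead, so a real harness would likely time calls in batches.

```python
import time


def run_for(duration, func, clock=time.monotonic):
    """Call func repeatedly until `duration` seconds have elapsed.

    Returns (iterations, elapsed_seconds). At least one call is made.
    """
    start = clock()
    iterations = 0
    while True:
        func()
        iterations += 1
        elapsed = clock() - start
        if elapsed >= duration:
            return iterations, elapsed


if __name__ == "__main__":
    # Short duration so the demo finishes quickly.
    iters, elapsed = run_for(0.01, lambda: sum(range(100)))
    print("mean time per iteration: %g s" % (elapsed / iters))
```

With this shape, a slow machine simply completes fewer iterations in the
same wall-clock time, and a new benchmark needs no tuned iteration count.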
Appending benchmark program output on every run causes problems when
a benchmark run is cancelled, since that leaves a partially written
file behind. This file gets used again on the next run, so new results
are appended to old results.
It would have been possible to remove the file before every benchmark
run, but it is easier to just write the output to bench.out-tmp only
once.
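A minimal sketch of the write-once approach, assuming the results are
collected first and then written in a single pass. Opening the temporary
file in truncating mode discards anything left over from a cancelled run.
The helper name `write_results` and the final rename to the output name
are assumptions for illustration, not a description of the Makefile rule.

```python
import os
import tempfile


def write_results(results, out_path):
    """Write all result lines in one pass, discarding stale temp data."""
    tmp_path = out_path + "-tmp"
    # Mode "w" truncates, so a partial file from a cancelled run is dropped.
    with open(tmp_path, "w") as tmp:
        for line in results:
            tmp.write(line + "\n")
    # Assumption: publish the finished file under its final name.
    os.replace(tmp_path, out_path)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "bench.out")
    # Made-up result lines, not real measurements.
    write_results(["pow: 1.0", "exp: 2.0"], target)
    with open(target) as handle:
        print(handle.read(), end="")
```

Since nothing is ever appended, an interrupted run cannot contaminate the
results of the next one.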
Benchmark programs are generated using parameters from the Makefile,
so it is necessary to rebuild them whenever the parameters in the
Makefile are updated. Hence, added a dependency for the generated C
source on the Makefile so that it gets regenerated when the Makefile
is updated.
Separate benchmarks for the fast and slow implementations of pow and
exp since measuring both together doesn't make sense. Adjust the
iterations for pow and exp accordingly so that they run long enough
for the measurements to be meaningful.
The branch prediction hints actually hurt performance in this case.
The assembly implementation makes two assumptions: 1. 'fabs (x) < 2^52'
is unlikely and 2. 'x > 0.0' is unlikely (if 1. is true). Since it is a
general floating point function, the expected input is not bounded, so
it is better to let the hardware handle the branches.