This patch makes the sqrt benchmark use -fno-builtin, as already done
for benchmarks of ffs and ffsll, so that it actually benchmarks the
glibc function as (presumably) intended even in the presence of the
compiler inlining sqrt.
Tested for x86_64 and also used for benchmarking my ARM sqrt patch.
* benchtests/Makefile (CFLAGS-bench-sqrt.c): New variable.
Clear the destination buffer updated by the previous run in bench-memcpy.c
and test-memcpy.c to catch the error when the following implementations do
not copy anything.
[BZ #19907]
* benchtests/bench-memcpy.c (do_one_test): Clear the destination
buffer updated by the previous run.
* string/test-memcpy.c (do_one_test): Likewise.
* benchtests/bench-memmove.c (do_one_test): Add a comment.
* string/test-memmove.c (do_one_test): Likewise.
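The reason for clearing the destination can be illustrated with a small
sketch. It is written in Python rather than C for brevity, and the names
broken_copy, good_copy and check are hypothetical, not taken from
bench-memcpy.c or test-memcpy.c. An implementation that copies nothing
passes the check when the destination still holds the result of the
previous run, and fails once the destination is cleared first.

```python
def good_copy(dst, src, n):
    """Reference implementation: copy n bytes from src to dst."""
    dst[:n] = src[:n]
    return dst


def broken_copy(dst, src, n):
    """Faulty implementation (hypothetical): copies nothing at all."""
    return dst


def check(impl, dst, src, n, clear_first):
    """Run impl and report whether dst matches src afterwards.

    When clear_first is set, the destination is zeroed before the call so
    that stale data from an earlier run cannot mask a missing copy.
    """
    if clear_first:
        dst[:n] = bytes(n)
    impl(dst, src, n)
    return dst[:n] == src[:n]


src = bytearray(b"benchmark")
dst = bytearray(len(src))

# First run with a working implementation fills the destination.
first_ok = check(good_copy, dst, src, len(src), clear_first=True)
# Without clearing, the broken implementation is not detected.
stale_ok = check(broken_copy, dst, src, len(src), clear_first=False)
# With clearing, the broken implementation is detected.
cleared_ok = check(broken_copy, dst, src, len(src), clear_first=True)
```

The sketch assumes the source data is not all zero bytes; otherwise
zeroing the destination would not distinguish the two cases.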
This patch adds full support for cross-building benchmarks. Some
benchmarks like those that need locales to be generated cannot be
built and are hence skipped for cross builds.
Tested by cross building for aarch64 on x86_64 and then running the
generated benchmark on aarch64.
* benchtests/Makefile (wcsmbs-benchset): Include only for
native builds and runs.
(LOCALES): Likewise.
(bench-build): Build timing-type here instead of the bench
target. Generate locale only for native builds.
* benchtests/README: Add note for cross-building.
For situations where we are cross-building or where we want to avoid
building on the target system, we want a way to only build benchmarks
and then copy them over to the target system to run them. I have also
added a simple enhancement for the 'bench' target where all benchmark
binaries are built and then the benchmarks executed.
Tested on arm.
Makefile.in (bench-build): New target.
Rules (PHONY): Add bench-build target.
benchtests/Makefile (bench): Depend on bench-build.
(bench-build): New target.
From the bug:
Obsolete locale. The ISO-639 code for Hebrew was changed from 'iw'
to 'he' in 1989, according to Bruno Haible on libc-alpha 2003-09-01.
Reported-by: Chris Leonard <cjlhomeaddress@gmail.com>
benchtests should use $(test-via-rtld-prefix) and $(+link-tests) like
other glibc tests.
[BZ #19783]
* benchtests/Makefile (run-bench): Replace $(rtld-prefix) with
$(test-via-rtld-prefix).
($(binaries-bench)): Replace $(+link) with $(+link-tests).
The ffs and ffsll functions were listed as math functions when they
are actually defined in strings.h and string.h respectively. Shuffle
around the Makefile variables a bit and make a separate space for ffs
and ffsll.
The sincos benchmark has only about a dozen inputs, which do not measure
the impact of changes to the various passes. Since sincos inherits most
of its code properties from sin and cos, copy those inputs in to get
more comprehensive coverage.
Prevent function calls that don't return anything from being optimized
out by the compiler by marking their input variables as used.
This prevents the sincos function call from being optimized out in the
benchmark.
ChangeLog:
2015-09-18 Wilco Dijkstra <wdijkstr@arm.com>
* benchtests/Makefile: Add bench-math-inlines, link with libm.
* benchtests/bench-math-inlines.c: New benchmark.
* benchtests/bench-util.h: New file.
* benchtests/bench-util.c: New file.
* benchtests/bench-skeleton.c: Add include of bench-util.c/h.
This patch provides optimized versions of strcmp and wcscmp with the z13
vector instructions.
The architecture-specific string.h had a typo, which led to omitting the
inline version in this file if __USE_STRING_INLINES is defined.
Tested this inline version by tweaking test-strcmp.c.
ChangeLog:
* sysdeps/s390/multiarch/strcmp-vx.S: New File.
* sysdeps/s390/multiarch/strcmp.c: Likewise.
* sysdeps/s390/multiarch/wcscmp-c.c: Likewise.
* sysdeps/s390/multiarch/wcscmp-vx.S: Likewise.
* sysdeps/s390/multiarch/wcscmp.c: Likewise.
* sysdeps/s390/s390-32/multiarch/strcmp.c: Likewise.
* sysdeps/s390/s390-64/multiarch/strcmp.c: Likewise.
* sysdeps/s390/multiarch/Makefile (sysdep_routines): Add strcmp and
wcscmp functions.
* sysdeps/s390/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Add ifunc test for strcmp, wcscmp.
* string/strcmp.c (STRCMP): Define and use macro.
* benchtests/bench-wcscmp.c: New File.
* benchtests/Makefile (wcsmbs-bench): Add wcscmp.
* sysdeps/s390/bits/string.h: Fix typo: _HAVE_STRING_ARCH_strcmp
instead of _HAVE_STRING_ARCH_memchr.
This patch provides optimized versions of strlen and wcslen with the z13 vector
instructions.
The helper macro IFUNC_VX_IMPL is introduced and is used to register all
__<func>_c() and __<func>_vx() functions within __libc_ifunc_impl_list()
to the ifunc test framework.
ChangeLog:
* sysdeps/s390/multiarch/Makefile: New File.
* sysdeps/s390/multiarch/strlen-c.c: Likewise.
* sysdeps/s390/multiarch/strlen-vx.S: Likewise.
* sysdeps/s390/multiarch/strlen.c: Likewise.
* sysdeps/s390/multiarch/wcslen-c.c: Likewise.
* sysdeps/s390/multiarch/wcslen-vx.S: Likewise.
* sysdeps/s390/multiarch/wcslen.c: Likewise.
* string/strlen.c (STRLEN): Define and use macro.
* sysdeps/s390/multiarch/ifunc-impl-list.c
(IFUNC_VX_IMPL): New macro function.
(__libc_ifunc_impl_list): Add ifunc test for strlen, wcslen.
* benchtests/Makefile (wcsmbs-bench): New variable.
(string-bench-all): Add wcsmbs-bench.
* benchtests/bench-wcslen.c: New File.
This script is a sample implementation that uses import_bench to
construct two benchmark objects and compare them. If detailed timing
information is available (when one does `make DETAILED=1 bench`), it
writes out graphs for all functions it benchmarks and prints
significant differences in timings of the two benchmark runs. If
detailed timing information is not available, it points out
significant differences in aggregate times.
Call this script as follows:
compare_bench.py schema_file.json bench1.out bench2.out
Alternatively, if one wants to set a different threshold for warnings
(default is a 10% difference):
compare_bench.py schema_file.json bench1.out bench2.out 25
The threshold in the example above is 25%. schema_file.json is the
JSON schema for the benchmark output file (which is
$srcdir/benchtests/scripts/benchout.schema.json) and bench1.out and
bench2.out are the two benchmark output files to compare.
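The threshold check described above can be sketched as follows. This is
only an illustration of a percentage-difference test under stated
assumptions; the function name significant_difference and its signature
are hypothetical and are not taken from compare_bench.py.

```python
def significant_difference(old, new, threshold=10.0):
    """Return the percentage change from old to new if its magnitude
    exceeds threshold (in percent), otherwise None.

    Assumes old is a non-zero aggregate time from the first run and new
    is the corresponding time from the second run.
    """
    if old == 0:
        raise ValueError("baseline timing must be non-zero")
    change = (new - old) * 100.0 / old
    return change if abs(change) > threshold else None


# A 5% slowdown is below the default 10% threshold.
small = significant_difference(100.0, 105.0)
# A 30% slowdown exceeds a 25% threshold.
large = significant_difference(100.0, 130.0, 25)
```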
The key functionality here is the compress_timings function which
groups together points that are close together into a single point
that is the mean of all its representative points. Any point in such
a group is at most 1.5x the smallest point in that group. The
detailed derivation is a comment in the function.
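The grouping property can be illustrated with a minimal sketch. It is
not the algorithm used by compress_timings, whose derivation lives in
the function itself; it is a simple greedy pass that only demonstrates
the invariant that every point in a group is at most 1.5x the smallest
point in that group. The name compress_sketch and the ratio parameter
are hypothetical, and timings are assumed to be positive.

```python
def compress_sketch(points, ratio=1.5):
    """Group nearby timings and replace each group by its mean.

    Walks the points in ascending order and starts a new group whenever
    a point is more than ratio times the smallest point of the current
    group.
    """
    groups = []
    for p in sorted(points):
        if groups and p <= ratio * groups[-1][0]:
            groups[-1].append(p)
        else:
            groups.append([p])
    return [sum(g) / len(g) for g in groups]


# Groups formed: [10, 11, 14], [30, 40] and [100].
compressed = compress_sketch([10, 11, 14, 30, 40, 100])
```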
* benchtests/scripts/compare_bench.py: New file.
* benchtests/scripts/import_bench.py (mean): New function.
(split_list): Likewise.
(do_for_all_timings): Likewise.
(compress_timings): Likewise.
This is the beginning of a module to import and process benchmark
outputs. The module currently supports importing of a bench.out and
validating it against a schema file. In future this could grow a set
of routines that benchmark consumers may find useful to build their
own analysis tools. I have altered validate_bench to use this module
too.
* benchtests/scripts/import_bench.py: New file.
* benchtests/scripts/validate_benchout.py: Import import_bench
instead of jsonschema.
(validate_bench): Remove function.
(main): Use import_bench.
This patch optimizes strcpy for ppc64/power7 for unaligned source or
destination addresses. The source or destination address is aligned
to a doubleword, and the data is shifted based on the alignment and
combined with the previously loaded data to be written as a doubleword.
For each load, the cmpb instruction is used for a faster null check.
The word-aligned optimization is also removed, since the new unaligned
code path shows better results handling word-aligned strings.
More combinations of unaligned inputs are also added to the benchtest
to measure the improvement. The new optimization shows a 2% to 80%
performance improvement for longer strings, though it does not show
a big difference for string sizes less than 16 due to additional checks.
TEST_IFUNC is only tested in two headers, bench-string.h and
test-string.h, after it gets defined by those headers, and it never
gets undefined.
Thus no defines of TEST_IFUNC are needed, and the *-ifunc.c tests that
just define TEST_IFUNC and include other tests are also redundant, as
is the code to remove $(tests-ifunc) and $(xtests-ifunc) conditionally
from tests and xtests. This patch removes the useless defines and
tests of TEST_IFUNC and the associated useless tests and makefile
code. It thereby fixes a series of warnings
"../string/test-string.h:21:0: warning: "TEST_IFUNC" redefined" where
test-string.h defines TEST_IFUNC to empty, other files define it to 1
and this produces warnings.
Tested for x86_64.
* debug/test-stpcpy_chk-ifunc.c: Remove file.
* debug/test-strcpy_chk-ifunc.c: Likewise.
* wcsmbs/test-wcschr-ifunc.c: Likewise.
* wcsmbs/test-wcscmp-ifunc.c: Likewise.
* wcsmbs/test-wcscpy-ifunc.c: Likewise.
* wcsmbs/test-wcslen-ifunc.c: Likewise.
* wcsmbs/test-wcsrchr-ifunc.c: Likewise.
* wcsmbs/test-wmemcmp-ifunc.c: Likewise.
* Rules [$(multi-arch) = no] (tests): Do not filter out
$(tests-ifunc).
[$(multi-arch) = no] (xtests): Do not filter out $(xtests-ifunc).
* debug/Makefile (tests-ifunc): Remove variable.
(tests): Do not add $(tests-ifunc).
* wcsmbs/Makefile (tests-ifunc): Remove variable.
(tests): Do not add $(tests-ifunc).
* benchtests/bench-string.h (TEST_IFUNC): Remove macro.
[TEST_IFUNC]: Remove conditionals.
* string/test-string.h (TEST_IFUNC): Remove macro.
[TEST_IFUNC]: Remove conditionals.
Add a microbenchmark for measuring malloc and free performance with
varying numbers of threads. The benchmark allocates and frees buffers
of random sizes in a random order and measures the overall execution
time and RSS. Variants of the benchmark are run with 1, 8, 16 and
32 threads.
The random block sizes used follow an inverse square distribution
which is intended to mimic the behaviour of real applications which
tend to allocate many more small blocks than large ones.
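One way to draw sizes from an inverse square distribution is inverse
transform sampling, sketched below. This is an illustration under
stated assumptions, not the code in bench-malloc-threads.c: the function
name inverse_square_size and the bounds 4 and 32768 used in the example
are assumptions made here.

```python
import random


def inverse_square_size(min_size, max_size, rng=random):
    """Sample an integer size with density proportional to 1/x**2.

    For a density proportional to x**-2 on [min_size, max_size], the
    inverse of the cumulative distribution is
    x = 1 / (1/min_size - u * (1/min_size - 1/max_size)), u in [0, 1).
    """
    u = rng.random()
    inv_min = 1.0 / min_size
    inv_max = 1.0 / max_size
    size = int(1.0 / (inv_min - u * (inv_min - inv_max)))
    # Clamp to guard against floating-point rounding at the edges.
    return max(min_size, min(max_size, size))


# Assumed bounds for illustration only.
rng = random.Random(0)
sizes = [inverse_square_size(4, 32768, rng) for _ in range(10000)]
small_fraction = sum(1 for s in sizes if s < 64) / len(sizes)
```

With these bounds most samples are small: roughly half fall below 8
bytes, while blocks near the upper bound are rare.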
ChangeLog:
2014-11-05 Will Newton <will.newton@linaro.org>
* benchtests/Makefile (bench-malloc): Add malloc thread
scalability benchmark.
* benchtests/bench-malloc-threads.c: New file.
This patch adds an optimized memset implementation for POWER8. For
sizes from 0 to 255 bytes, a word/doubleword algorithm similar to the
POWER7 optimized one is used.
For sizes larger than 255 bytes, two strategies are used:
1. If the constant is different from 0, the memory is written with
Altivec vector instructions;
2. If the constant is 0, dcbz instructions are used. The loop is unrolled
to clear 512 bytes at a time.
Using vector instructions increases throughput considerably, doubling
performance for sizes larger than 1024. Unrolling the dcbz loop also
shows a performance improvement, doubling throughput for sizes
larger than 8192 bytes.