Use the generic C memset/memcpy/memmove in benchtests since comparing
against a slow byte-oriented implementation makes no sense.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2019-08-29 Wilco Dijkstra <wdijkstr@arm.com>
* benchtests/bench-memcpy.c (simple_memcpy): Remove.
(generic_memcpy): Include generic C memcpy.
* benchtests/bench-memmove.c (simple_memmove): Remove.
(generic_memmove): Include generic C memmove.
* benchtests/bench-memset.c (simple_memset): Remove.
(generic_memset): Include generic C memset.
* benchtests/bench-memset-large.c (simple_memset): Remove.
(generic_memset): Include generic C memset.
* benchtests/bench-memset-walk.c (simple_memset): Remove.
(generic_memset): Include generic C memset.
* string/memcpy.c (MEMCPY): Add defines to enable redirection.
* string/memset.c (MEMSET): Likewise.
* sysdeps/x86_64/memcopy.h: Remove empty file.
* benchtests/Makefile (bench-math): Add logb.
* benchtests/logb-inputs: New file.
* benchtests/logbf-inputs: New file.
Reviewed-by: Gabriel F. T. Gomes <gabrielftg@linux.ibm.com>
Inputs are based on argument reductions from generic and powerpc
implementation.
* benchtests/Makefile (bench-math): Add hypot.
* benchtests/hypot-inputs: New file.
Reviewed-by: Gabriel F. T. Gomes <gabrielftg@linux.ibm.com>
Some benchmarks with a very short runtime show significantly
different results across runs on Aarch64 - up to tens of percents.
Increasing the runtime to 100ms+ makes the deviation under 5%.
Tested on Aarch64 and x86-64.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* benchtests/bench-memccpy.c: Replace INNER_LOOP_ITERS
with INNER_LOOP_ITERS_LARGE.
* benchtests/bench-memchr.c: Likewise.
* benchtests/bench-rawmemchr.c: Likewise.
* benchtests/bench-strcat.c: Likewise.
* benchtests/bench-strchr.c: Likewise.
* benchtests/bench-string.h: Likewise.
* benchtests/bench-strlen.c: Likewise.
* benchtests/bench-strncpy.c: Likewise.
* benchtests/bench-strnlen.c: Likewise.
This patch fixes the following gcc 9 warnings for "make xcheck" / "make bench":
-string/tst-strcasestr.c:
../include/bits/../../misc/bits/error.h:42:5: error: ‘%s’ directive argument is null [-Werror=format-overflow=]
-argp/argp-test.c:
argp-test.c:130:20: error: ‘%d’ directive writing between 1 and 11 bytes into a region of size 10 [-Werror=format-overflow=]
argp-test.c:130:19: note: directive argument in the range [-2147483648, 122]
argp-test.c:130:5: note: ‘sprintf’ output between 2 and 12 bytes into a destination of size 10
-nss/tst-field.c:
tst-field.c:52:7: error: ‘%s’ directive argument is null [-Werror=format-overflow=]
-benchtests/bench-strstr.c:
../include/bits/../../misc/bits/error.h:42:5: error: ‘%s’ directive argument is null [-Werror=format-overflow=]
-benchtests/bench-malloc-simple.c:
bench-malloc-simple.c:93:16: error: iteration 3 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
ChangeLog:
[BZ #24556]
* string/test-strcasestr.c (check_result): Add NULL check.
* nss/tst-field.c (check_rewrite): Likewise.
* benchtests/bench-strstr.c (do_one_test): Likewise.
* string/test-strstr.c (check_result): Likewise.
* argp/argp-test.c (popt): Increase size of buf to 12.
* benchtests/bench-malloc-simple.c (bench):
Do not initialize tests array out of bounds.
* benchtests/Makefile (bench-math): Add isnan, isinf, and isfinite.
(CFLAGS-bench-isnan.c, CFLAGS-bench-isinf.c,
CFLAGS-bench-isfinite.c): New rule.
* benchtests/isnan-input: New file.
* benchtests/isinf-input: New file.
* benchtests/isfinite-input: New file.
Reviewed-by: Gabriel F. T. Gomes <gabrielftg@linux.ibm.com>
Benchmark needles which exhibit worst-case performance. This shows that
basic_strstr is quadratic and thus unsuitable for large needles.
On the other hand the Two-way and new strstr implementations are linear with
increasing needle sizes. The slowest cases of the two implementations are
within a factor of 2 on several different microarchitectures. Two-way is
slowest on inputs which cause a branch mispredict on almost every character.
The new strstr is slowest on inputs which almost match and result in many
calls to memcmp. Thanks to Szabolcs for providing various hard needles.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* benchtests/bench-strstr.c (test_hard_needle): New function.
Improve string benchtest timing. Many tests run for 0.01s which is way too
short to give accurate results. Other tests take over 40 seconds which is
way too long. Significantly increase the iterations of the short running
tests. Reduce number of alignment variations in the long running memcpy walk
tests so they take less than 5 seconds.
As a result most tests take at least 0.1s and all finish within 5 seconds.
* benchtests/bench-memcpy-random.c (do_one_test): Use medium iterations.
* benchtests/bench-memcpy-walk.c (test_main): Reduce alignment tests.
* benchtests/bench-memmem.c (do_one_test): Use small iterations.
* benchtests/bench-memmove-walk.c (test_main): Reduce alignment tests.
* benchtests/bench-memset-walk.c (test_main): Reduce alignment tests.
* benchtests/bench-strcasestr.c (do_one_test): Use small iterations.
* benchtests/bench-string.h (INNER_LOOP_ITERS): Increase iterations.
(INNER_LOOP_ITERS_MEDIUM): New define.
(INNER_LOOP_ITERS_SMALL): New define.
* benchtests/bench-strpbrk.c (do_one_test): Use medium iterations.
* benchtests/bench-strsep.c (do_one_test): Use small iterations.
* benchtests/bench-strspn.c (do_one_test): Use medium iterations.
* benchtests/bench-strstr.c (do_one_test): Use small iterations.
* benchtests/bench-strtok.c (do_one_test): Use small iterations.
Benchmarks should reflect distribution build policies, so it makes
sense to honor the BIND_NOW configuration for them.
This commit keeps using $(+link-tests), so that the benchmarks are
linked according to the --enable-hardcoded-path-in-tests configure
option.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reduce the total time taken by benchtests. The malloc thread test takes 4
minutes to run which is significantly more than most other tests. Reduce
this to a more reasonable 40 seconds. The math tests take 10 seconds each,
eventhough all they do is loop on the same input. Anything more than 1
second runtime is way overkill, so set the limit to 1 second.
* benchtests/Makefile (BENCH_DURATION): Set to 1 second.
* benchtests/bench-malloc-thread.c (BENCH_DURATION): Set to 10 seconds.
Replace slow byte-oriented tests in several string benchmarks with the
generic implementations from the string/ directory so the comparisons
are more realistic and useful.
* benchtests/bench-stpcpy.c (SIMPLE_STPCPY): Remove function.
(generic_stpcpy): New function.
* benchtests/bench-stpncpy.c (SIMPLE_STPNCPY): Remove function.
(generic_stpncpy): New function.
* benchtests/bench-strcat.c (SIMPLE_STRCAT): Remove function.
(generic_strcat): New function.
* benchtests/bench-strcpy.c (SIMPLE_STRCPY): Remove function.
(generic_strcpy): New function.
* benchtests/bench-strncat.c (SIMPLE_STRNCAT): Remove function.
(STUPID_STRNCAT): Remove function.
(generic_strncat): New function.
* benchtests/bench-strncpy.c (SIMPLE_STRNCPY): Remove function.
(STUPID_STRNCPY): Remove function.
(generic_strncpy): New function.
* benchtests/bench-strnlen.c (SIMPLE_STRNLEN): Remove function.
(generic_strnlen): New function.
(memchr_strnlen): New function.
* benchtests/bench-strlen.c (generic_strlen): Define for WIDE.
(memchr_strlen): Likewise.
Improve bench-strstr by using an extract from the manual as the input
to make the test more realistic. Use the same input for both found and
fail cases rather than using a memset of '0' for most of the string,
which measures performance of strchr rather than strstr. Add result
checking to catch potential errors. Remove the repeated tests at slightly
different alignments and add more large needle and haystack testcases.
Replace stupid_strstr with an efficient basic implementation. Add the
Two-way implementation to simplify comparisons with much faster generic
implementations.
* benchtests/bench-strstr.c (input): Add realistic input text.
(stupid_strstr): Remove function.
(basic_strstr): Add function.
(twoway_strstr): Add function.
(do_one_test): Add result checking.
(do_test): Use new input text. Remove accidental early matches.
(test_main): Improve range of tests, reduce unaligned cases.
Improve bench-memmem by replacing simple_memmem with a more efficient
implementation. Add the Two-way implementation to enable direct comparison
with the optimized memmem.
* benchtests/bench-memmem.c (simple_memmem): Remove function.
(basic_memmem): Add function.
(twoway_memmem): Add function.
Add missing generic hp_timing support. It uses clock_gettime (CLOCK_MONOTONIC)
which has unspecified starting time, nano-second accuracy, and should faster on
architectures that implementes the symbol as vDSO.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu. I also
checked the builds for all afected ABIs.
* benchtests/Makefile (USE_CLOCK_GETTIME) Remove.
* benchtests/README: Update description.
* benchtests/bench-timing.h: Default to hp-timing.
* sysdeps/generic/hp-timing.h (HP_TIMING_DIFF, HP_TIMING_ACCUM_NT,
HP_TIMING_PRINT): Remove.
(HP_TIMING_NOW): Add generic implementation.
(hp_timing_t): Change to uint64_t.
This patch refactor how hp-timing is used on loader code for statistics
report. The HP_TIMING_AVAIL and HP_SMALL_TIMING_AVAIL are removed and
HP_TIMING_INLINE is used instead to check for hp-timing avaliability.
For alpha, which only defines HP_SMALL_TIMING_AVAIL, the HP_TIMING_INLINE
is set iff for IS_IN(rtld).
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu. I also
checked the builds for all afected ABIs.
* benchtests/bench-timing.h: Replace HP_TIMING_AVAIL with
HP_TIMING_INLINE.
* nptl/descr.h: Likewise.
* elf/rtld.c (RLTD_TIMING_DECLARE, RTLD_TIMING_NOW, RTLD_TIMING_DIFF,
RTLD_TIMING_ACCUM_NT, RTLD_TIMING_SET): Define.
(dl_start_final_info, _dl_start_final, dl_main, print_statistics):
Abstract hp-timing usage with RTLD_* macros.
* sysdeps/alpha/hp-timing.h (HP_TIMING_INLINE): Define iff IS_IN(rtld).
(HP_TIMING_AVAIL, HP_SMALL_TIMING_AVAIL): Remove.
* sysdeps/generic/hp-timing.h (HP_TIMING_AVAIL, HP_SMALL_TIMING_AVAIL,
HP_TIMING_NONAVAIL): Likewise.
* sysdeps/ia64/hp-timing.h (HP_TIMING_AVAIL, HP_SMALL_TIMING_AVAIL):
Likewise.
* sysdeps/powerpc/powerpc32/power4/hp-timing.h (HP_TIMING_AVAIL,
HP_SMALL_TIMING_AVAIL): Likewise.
* sysdeps/powerpc/powerpc64/hp-timing.h (HP_TIMING_AVAIL,
HP_SMALL_TIMING_AVAIL): Likewise.
* sysdeps/sparc/sparc32/sparcv9/hp-timing.h (HP_TIMING_AVAIL,
HP_SMALL_TIMING_AVAIL): Likewise.
* sysdeps/sparc/sparc64/hp-timing.h (HP_TIMING_AVAIL,
HP_SMALL_TIMING_AVAIL): Likewise.
* sysdeps/x86/hp-timing.h (HP_TIMING_AVAIL, HP_SMALL_TIMING_AVAIL):
Likewise.
* sysdeps/generic/hp-timing-common.h: Update comment with
HP_TIMING_AVAIL removal.
Add a malloc micro benchmark to enable accurate testing of the
various paths in malloc and free. The benchmark does a varying
number of allocations of a given block size, then frees them again.
It tests 3 different scenarios: single-threaded using main arena,
multi-threaded using thread-arena, main arena with SINGLE_THREAD_P
false.
* benchtests/Makefile: Add malloc-simple benchmark.
* benchtests/bench-malloc-simple.c: New benchmark.
The ORIG_SRC argument is likely a useless relic from the original
correctness tests that are not needed in the benchmarks. Remove the
argument and use S1 to point to the source to avoid confusion.
* benchtests/bench-memmove.c (do_one_test): Remove unused
ORIG_SRC.
(do_test): Adjust.
* benchtests/bench-memmove-large.c (do_one_test): Remove unused
ORIG_SRC.
(do_test): Adjust.
The current bench-strlen compares against a slow byte-oriented strlen which
is not useful given it's too easy to beat. Remove it and compare against the
generic C strlen version and memchr.
* benchtests/bench-strlen.c (generic_strlen): New function.
(memchr_strlen): New function.
Non-consumable data, alias data not related to benchmarks, should be sent to
the standard error, thus pipelines can work as expected.
* benchtests/scripts/compare_bench.py (do_compare): write to stderr in case
stat is not present.
* benchtests/scripts/compare_bench.py (plot_graphs): write to stderr in case
timings field is not present. Also string showing the output filename goes
into the stderr.
Allows user to pick a statistic, defaulting to min and mean, from command
line. At the same time, if stat does not exit, catch the run-time exception
and keep comparing the rest of benchmarked functions. Finally, take care of
division-by-zero exceptions and as the latter, keep comparing the rest of the
functions, turning the script a bit more fault tolerant thus useful.
* benchtests/scripts/compare_bench.py (do_compare): Catch KeyError and
ZeroDivisorError exceptions.
* benchtests/scripts/compare_bench.py (compare_runs): Use stats argument to
loop through user provided statistics.
* benchtests/scripts/compare_bench.py (main): Include the --stats argument.
Allows other functions to be processed, making the script a bit more fault
tolerant thus useful.
* benchtests/scripts/compare_bench.py (compare_runs): Continue instead of return.
This patch makes Python 3.4 or later a required tool for building
glibc, so allowing changes of awk, perl etc. code used in the build
and test to Python code without any such changes needing makefile
conditionals or to handle older Python versions.
This patch makes the configure test for Python check the version and
give an error if Python is missing or too old, and removes makefile
conditionals that are no longer needed. It does not itself convert
any code from another language to Python, and does not remove any
compatibility with older Python versions from existing scripts.
Tested for x86_64.
* configure.ac (PYTHON_PROG): Use AC_CHECK_PROG_VER. Set
critic_missing for versions before 3.4.
* configure: Regenerated.
* manual/install.texi (Tools for Compilation): Document
requirement for Python to build glibc.
* INSTALL: Regenerated.
* Rules [PYTHON]: Make code unconditional.
* benchtests/Makefile [PYTHON]: Likewise.
* conform/Makefile [PYTHON]: Likewise.
* manual/Makefile [PYTHON]: Likewise.
* math/Makefile [PYTHON]: Likewise.
RDTSCP waits until all previous instructions have executed and all
previous loads are globally visible before reading the counter. RDTSC
doesn't wait until all previous instructions have been executed before
reading the counter. All x86 processors since 2010 support RDTSCP
instruction. This patch adds RDTSCP support to benchtests.
* benchtests/Makefile (CPPFLAGS-nonlib): Add -DUSE_RDTSCP if
USE_RDTSCP is defined.
* sysdeps/x86/hp-timing.h (HP_TIMING_NOW): Use RDTSCP if
USE_RDTSCP is defined.
Otherwise, we see the following runtime error when using the parameter:
File "./glibc/benchtests/scripts/compare_bench.py", line 46, in do_compare
if d > threshold:
TypeError: '>' not supported between instances of 'float' and 'str'
* benchtests/scripts/compare_bench.py (main): set float type on
threshold argument.
Add the workload test properties (max-throughput, latency, etc.) to
the schema to prevent benchmark output validation from failing.
* benchtests/scripts/benchout.schema.json (properties): Add
new properties.
Add the duration and iterations attributes to the workloads tests to
make the json schema parser happy
* benchtests/bench-skeleton.c (main): Add duration and
iterations attributes.
Drop realloc_bufs in favour of making alloc_bufs transparently
reallocate the buffers if it had allocated before. Also consolidate
computation of buffer lengths so that they don't get repeated on every
reallocation.
* benchtests/bench-string.h (buf1_size, buf2_size): New
variables.
(init_sizes): New function.
(test_init): Use it.
(alloc_buf, exit_error): New functions.
(alloc_bufs): Use ALLOC_BUF.
(realloc_bufs): Remove.
* benchtests/bench-memcmp.c (do_test): Adjust.
* benchtests/bench-memset-large.c (do_test): Likewise.
* benchtests/bench-memset-walk.c (do_test): Likewise.
* benchtests/bench-memset.c (do_test): Likewise.
* benchtests/bench-strncmp.c (do_test): Likewise.
Python 2 does not have a FileNotFoundError so drop it in favour of
simply printing out the last (and most informative) line of the
exception.
* benchtests/scripts/compare_strings.py: Import traceback.
(parse_file): Pretty-print error.
The argparse library is used on compare_bench script to improve command line
argument parsing. The 'schema validation file' is now optional, reducing by
one the number of required parameters.
* benchtests/scripts/compare_bench.py (__main__): use the argparse
library to improve command line parsing.
(__main__): make schema file as optional parameter (--schema),
defaulting to benchtests/scripts/benchout.schema.json.
(main): move out of the parsing stuff to __main_ and leave it
only as caller of main comparison functions.
Improve strstr performance. Strstr tends to be slow because it uses
many calls to memchr and a slow byte loop to scan for the next match.
Performance is significantly improved by using strnlen on larger blocks
and using strchr to search for the next matching character. strcasestr
can also use strnlen to scan ahead, and memmem can use memchr to check
for the next match.
On the GLIBC bench tests the performance gains on Cortex-A72 are:
strstr: +25%
strcasestr: +4.3%
memmem: +18%
On a 256KB dataset strstr performance improves by 67%, strcasestr by 47%.
Reviewd-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
On x86-64, there may be multiple IFUNC implementations for a given
function. But we may be only interested in a subset of them. This
patch adds -f/--functions argument to compare a subset of IFUNC
implementations.
* benchtests/scripts/compare_strings.py (process_results): Add
funcs argument. Compare only functions which are selected.
(main): Check if base function is among selected functions.
Pass selected functions to process_results.
(__main__): Add -f/--functions argument.
Catch runtime exceptions in case the user provided: wrong base
function, attribute(s) or input file. In any of the latter, quit
immediately with non-zero return code.
* benchtests/scripts/compare_string.py: (process_results) Catch
exception in non-existent base_func and catch exception in
non-existent attribute.
(parse_file) Catch exception in non-existent input file.
Having a string comparison report with neither diff numbers nor header
yields a more useful output to be consumed by other tools.
* benchtests/scripts/compare_string.py: Add --no-diff and --no-header
options to avoid diff calculation and omit header, respectively.
(main): process --no-diff and --no-header
This is a minor style change to move the definition of I to its usage
scope instead of at the top of the function. This is consistent with
glibc style guidelines and more importantly it was getting in the way
of my testing.
* benchtests/bench-memcpy-walk.c (do_test): Move declaration
of I into loop header.
* benchtests/bench-memmove-walk.c (do_test): Likewise.
Add an undefine of attribute_hidden since it may be defined in some cases
(it must be defined since it is used by some hp-timing configurations).
* benchtests/bench-timing.h (attribute_hidden): Undefine.
Currently the benchtests are run with internal GLIBC headers, which is incorrect.
Defining _ISOMAC in the makefile ensures the internal headers are bypassed.
Fix all tests which were relying on internal defines or includes.
* benchtests/Makefile: Define _ISOMAC.
* benchtests/bench-strcoll.c: Add missing sys/stat.h include.
* benchtests/bench-string.h: Define inhibit_loop_to_libcall macro.
* benchtests/bench-strstr.c: Define empty libc_hidden_builtin_def.
* benchtests/bench-strtok.c (oldstrtok): Use rawmemchr.
* benchtests/bench-timing.h: Define attribute_hidden.
The 0 length strncmp is interesting for correctness but not for
performance.
* benchtests/bench-strncmp.c (test_main): Remove 0 length tests.
(do_test_limit): Likewise.
Don't reuse buffers for different strncmp implementations since the
earlier implementation will end up warming the cache for the later
one. Eventually there should be a more elegant way to do this.
* benchtests/bench-strncmp.c (do_test_limit): Reallocate buffers
for every implementation.
(do_test): Likewise.