Keeping the same buffers along with copying the same size of data into
the same location means that the first routine is typically the
slowest since it has to bear the cost of fetching data into to cache.
Reallocating buffers stabilizes numbers by a bit.
* benchtests/bench-string.h (realloc_bufs): New function.
(test_init): Call it.
* benchtests/bench-memset-large.c (do_test): Likewise.
* benchtests/bench-memset.c (do_test): Likewise.
Make the memset benchmarks (bench-memset and bench-memset-large) print
their output in JSON so that they can be evaluated using the
compare_strings.py script.
* benchtests/bench-memset-large.c: Print output in JSON
format.
* benchtests/bench-memset.c: Likewise.
The test run is unnecessary and interferes with the benchmark. The
tests are done during make check, so they're unnecessary here.
* benchtests/bench-memccpy.c (do_one_test): Remove checks.
* benchtests/bench-memchr.c (do_one_test): Likewise.
* benchtests/bench-memcpy-large.c (do_one_test): Likewise.
* benchtests/bench-memcpy.c (do_one_test): Likewise.
* benchtests/bench-memmove-large.c (do_one_test): Likewise.
* benchtests/bench-memmove.c (do_one_test): Likewise.
* benchtests/bench-memset-large.c (do_one_test): Likewise.
* benchtests/bench-memset.c (do_one_test): Likewise.
* benchtests/bench-string.h (test_init): Remove memsets.
This patch adds an optimized memset implementation for POWER8. For
sizes from 0 to 255 bytes, a word/doubleword algorithm similar to
POWER7 optimized one is used.
For size higher than 255 two strategies are used:
1. If the constant is different than 0, the memory is written with
altivec vector instruction;
2. If constant is 0, dbcz instructions are used. The loop is unrolled
to clear 512 byte at time.
Using vector instructions increases throughput considerable, with a
double performance for sizes larger than 1024. The dcbz loops unrolls
also shows performance improvement, by doubling throughput for sizes
larger than 8192 bytes.
Switch the string benchmarks to using bench-timing.h instead
of hp-timing.h directly. This allows the string benchmarks to
be run usefully on architectures such as ARM that do not have
support for hp-timing.h.
In order to do this the tests have been changed from timing each
individual call and picking the lowest execution time recorded to
timing a number of calls and taking the mean execution time.
ChangeLog:
2013-09-04 Will Newton <will.newton@linaro.org>
* benchtests/bench-timing.h (TIMING_PRINT_MEAN): New macro.
* benchtests/bench-string.h: Include bench-timing.h instead
of including hp-timing.h directly. (INNER_LOOP_ITERS): New
define. (HP_TIMING_BEST): Delete macro. (test_init): Remove
call to HP_TIMING_DIFF_INIT.
* benchtests/bench-memccpy.c: Use bench-timing.h macros
instead of hp-timing.h macros.
* benchtests/bench-memchr.c: Likewise.
* benchtests/bench-memcmp.c: Likewise.
* benchtests/bench-memcpy.c: Likewise.
* benchtests/bench-memmem.c: Likewise.
* benchtests/bench-memmove.c: Likewise.
* benchtests/bench-memset.c: Likewise.
* benchtests/bench-rawmemchr.c: Likewise.
* benchtests/bench-strcasecmp.c: Likewise.
* benchtests/bench-strcasestr.c: Likewise.
* benchtests/bench-strcat.c: Likewise.
* benchtests/bench-strchr.c: Likewise.
* benchtests/bench-strcmp.c: Likewise.
* benchtests/bench-strcpy.c: Likewise.
* benchtests/bench-strcpy_chk.c: Likewise.
* benchtests/bench-strlen.c: Likewise.
* benchtests/bench-strncasecmp.c: Likewise.
* benchtests/bench-strncat.c: Likewise.
* benchtests/bench-strncmp.c: Likewise.
* benchtests/bench-strncpy.c: Likewise.
* benchtests/bench-strnlen.c: Likewise.
* benchtests/bench-strpbrk.c: Likewise.
* benchtests/bench-strrchr.c: Likewise.
* benchtests/bench-strspn.c: Likewise.
* benchtests/bench-strstr.c: Likewise.
Check wheter the compiler has the option -fno-tree-loop-distribute-patterns
to inhibit loop transformation to library calls and uses it on memset
and memmove default implementation to avoid recursive calls.