The ORIG_SRC argument is likely a useless relic from the original
correctness tests that are not needed in the benchmarks. Remove the
argument and use S1 to point to the source to avoid confusion.
* benchtests/bench-memmove.c (do_one_test): Remove unused
ORIG_SRC.
(do_test): Adjust.
* benchtests/bench-memmove-large.c (do_one_test): Remove unused
ORIG_SRC.
(do_test): Adjust.
The current bench-strlen compares against a slow byte-oriented strlen which
is not useful given it's too easy to beat. Remove it and compare against the
generic C strlen version and memchr.
* benchtests/bench-strlen.c (generic_strlen): New function.
(memchr_strlen): New function.
Non-consumable data, alias data not related to benchmarks, should be sent to
the standard error, thus pipelines can work as expected.
* benchtests/scripts/compare_bench.py (do_compare): write to stderr in case
stat is not present.
* benchtests/scripts/compare_bench.py (plot_graphs): write to stderr in case
timings field is not present. Also string showing the output filename goes
into the stderr.
Allows user to pick a statistic, defaulting to min and mean, from command
line. At the same time, if stat does not exit, catch the run-time exception
and keep comparing the rest of benchmarked functions. Finally, take care of
division-by-zero exceptions and as the latter, keep comparing the rest of the
functions, turning the script a bit more fault tolerant thus useful.
* benchtests/scripts/compare_bench.py (do_compare): Catch KeyError and
ZeroDivisorError exceptions.
* benchtests/scripts/compare_bench.py (compare_runs): Use stats argument to
loop through user provided statistics.
* benchtests/scripts/compare_bench.py (main): Include the --stats argument.
Allows other functions to be processed, making the script a bit more fault
tolerant thus useful.
* benchtests/scripts/compare_bench.py (compare_runs): Continue instead of return.
This patch makes Python 3.4 or later a required tool for building
glibc, so allowing changes of awk, perl etc. code used in the build
and test to Python code without any such changes needing makefile
conditionals or to handle older Python versions.
This patch makes the configure test for Python check the version and
give an error if Python is missing or too old, and removes makefile
conditionals that are no longer needed. It does not itself convert
any code from another language to Python, and does not remove any
compatibility with older Python versions from existing scripts.
Tested for x86_64.
* configure.ac (PYTHON_PROG): Use AC_CHECK_PROG_VER. Set
critic_missing for versions before 3.4.
* configure: Regenerated.
* manual/install.texi (Tools for Compilation): Document
requirement for Python to build glibc.
* INSTALL: Regenerated.
* Rules [PYTHON]: Make code unconditional.
* benchtests/Makefile [PYTHON]: Likewise.
* conform/Makefile [PYTHON]: Likewise.
* manual/Makefile [PYTHON]: Likewise.
* math/Makefile [PYTHON]: Likewise.
RDTSCP waits until all previous instructions have executed and all
previous loads are globally visible before reading the counter. RDTSC
doesn't wait until all previous instructions have been executed before
reading the counter. All x86 processors since 2010 support RDTSCP
instruction. This patch adds RDTSCP support to benchtests.
* benchtests/Makefile (CPPFLAGS-nonlib): Add -DUSE_RDTSCP if
USE_RDTSCP is defined.
* sysdeps/x86/hp-timing.h (HP_TIMING_NOW): Use RDTSCP if
USE_RDTSCP is defined.
Otherwise, we see the following runtime error when using the parameter:
File "./glibc/benchtests/scripts/compare_bench.py", line 46, in do_compare
if d > threshold:
TypeError: '>' not supported between instances of 'float' and 'str'
* benchtests/scripts/compare_bench.py (main): set float type on
threshold argument.
Add the workload test properties (max-throughput, latency, etc.) to
the schema to prevent benchmark output validation from failing.
* benchtests/scripts/benchout.schema.json (properties): Add
new properties.
Add the duration and iterations attributes to the workloads tests to
make the json schema parser happy
* benchtests/bench-skeleton.c (main): Add duration and
iterations attributes.
Drop realloc_bufs in favour of making alloc_bufs transparently
reallocate the buffers if it had allocated before. Also consolidate
computation of buffer lengths so that they don't get repeated on every
reallocation.
* benchtests/bench-string.h (buf1_size, buf2_size): New
variables.
(init_sizes): New function.
(test_init): Use it.
(alloc_buf, exit_error): New functions.
(alloc_bufs): Use ALLOC_BUF.
(realloc_bufs): Remove.
* benchtests/bench-memcmp.c (do_test): Adjust.
* benchtests/bench-memset-large.c (do_test): Likewise.
* benchtests/bench-memset-walk.c (do_test): Likewise.
* benchtests/bench-memset.c (do_test): Likewise.
* benchtests/bench-strncmp.c (do_test): Likewise.
Python 2 does not have a FileNotFoundError so drop it in favour of
simply printing out the last (and most informative) line of the
exception.
* benchtests/scripts/compare_strings.py: Import traceback.
(parse_file): Pretty-print error.
The argparse library is used on compare_bench script to improve command line
argument parsing. The 'schema validation file' is now optional, reducing by
one the number of required parameters.
* benchtests/scripts/compare_bench.py (__main__): use the argparse
library to improve command line parsing.
(__main__): make schema file as optional parameter (--schema),
defaulting to benchtests/scripts/benchout.schema.json.
(main): move out of the parsing stuff to __main_ and leave it
only as caller of main comparison functions.
Improve strstr performance. Strstr tends to be slow because it uses
many calls to memchr and a slow byte loop to scan for the next match.
Performance is significantly improved by using strnlen on larger blocks
and using strchr to search for the next matching character. strcasestr
can also use strnlen to scan ahead, and memmem can use memchr to check
for the next match.
On the GLIBC bench tests the performance gains on Cortex-A72 are:
strstr: +25%
strcasestr: +4.3%
memmem: +18%
On a 256KB dataset strstr performance improves by 67%, strcasestr by 47%.
Reviewd-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
On x86-64, there may be multiple IFUNC implementations for a given
function. But we may be only interested in a subset of them. This
patch adds -f/--functions argument to compare a subset of IFUNC
implementations.
* benchtests/scripts/compare_strings.py (process_results): Add
funcs argument. Compare only functions which are selected.
(main): Check if base function is among selected functions.
Pass selected functions to process_results.
(__main__): Add -f/--functions argument.
Catch runtime exceptions in case the user provided: wrong base
function, attribute(s) or input file. In any of the latter, quit
immediately with non-zero return code.
* benchtests/scripts/compare_string.py: (process_results) Catch
exception in non-existent base_func and catch exception in
non-existent attribute.
(parse_file) Catch exception in non-existent input file.
Having a string comparison report with neither diff numbers nor header
yields a more useful output to be consumed by other tools.
* benchtests/scripts/compare_string.py: Add --no-diff and --no-header
options to avoid diff calculation and omit header, respectively.
(main): process --no-diff and --no-header
This is a minor style change to move the definition of I to its usage
scope instead of at the top of the function. This is consistent with
glibc style guidelines and more importantly it was getting in the way
of my testing.
* benchtests/bench-memcpy-walk.c (do_test): Move declaration
of I into loop header.
* benchtests/bench-memmove-walk.c (do_test): Likewise.
Add an undefine of attribute_hidden since it may be defined in some cases
(it must be defined since it is used by some hp-timing configurations).
* benchtests/bench-timing.h (attribute_hidden): Undefine.
Currently the benchtests are run with internal GLIBC headers, which is incorrect.
Defining _ISOMAC in the makefile ensures the internal headers are bypassed.
Fix all tests which were relying on internal defines or includes.
* benchtests/Makefile: Define _ISOMAC.
* benchtests/bench-strcoll.c: Add missing sys/stat.h include.
* benchtests/bench-string.h: Define inhibit_loop_to_libcall macro.
* benchtests/bench-strstr.c: Define empty libc_hidden_builtin_def.
* benchtests/bench-strtok.c (oldstrtok): Use rawmemchr.
* benchtests/bench-timing.h: Define attribute_hidden.
The 0 length strncmp is interesting for correctness but not for
performance.
* benchtests/bench-strncmp.c (test_main): Remove 0 length tests.
(do_test_limit): Likewise.
Don't reuse buffers for different strncmp implementations since the
earlier implementation will end up warming the cache for the later
one. Eventually there should be a more elegant way to do this.
* benchtests/bench-strncmp.c (do_test_limit): Reallocate buffers
for every implementation.
(do_test): Likewise.
Remove the slow paths from pow. Like several other double precision math
functions, pow is exactly rounded. This is not required from math functions
and causes major overheads as it requires multiple fallbacks using higher
precision arithmetic if a result is close to 0.5ULP. Ridiculous slowdowns
of up to 100000x have been reported when the highest precision path triggers.
All GLIBC math tests pass on AArch64 and x64 (with ULP of pow set to 1).
The worst case error is ~0.506ULP. A simple test over a few hundred million
values shows pow is 10% faster on average. This fixes BZ #13932.
[BZ #13932]
* sysdeps/ieee754/dbl-64/uexp.h (err_1): Remove.
* benchtests/pow-inputs: Update comment for slow path cases.
* manual/probes.texi (slowpow_p10): Delete removed probe.
(slowpow_p10): Likewise.
* math/Makefile: Remove halfulp.c and slowpow.c.
* sysdeps/aarch64/libm-test-ulps: Set ULP of pow to 1.
* sysdeps/generic/math_private.h (__exp1): Remove error argument.
(__halfulp): Remove.
(__slowpow): Remove.
* sysdeps/i386/fpu/halfulp.c: Delete file.
* sysdeps/i386/fpu/slowpow.c: Likewise.
* sysdeps/ia64/fpu/halfulp.c: Likewise.
* sysdeps/ia64/fpu/slowpow.c: Likewise.
* sysdeps/ieee754/dbl-64/e_exp.c (__exp1): Remove error argument,
improve comments and add error analysis.
* sysdeps/ieee754/dbl-64/e_pow.c (__ieee754_pow): Add error analysis.
(power1): Remove function:
(log1): Remove error argument, add error analysis.
(my_log2): Remove function.
* sysdeps/ieee754/dbl-64/halfulp.c: Delete file.
* sysdeps/ieee754/dbl-64/slowpow.c: Likewise.
* sysdeps/m68k/m680x0/fpu/halfulp.c: Likewise.
* sysdeps/m68k/m680x0/fpu/slowpow.c: Likewise.
* sysdeps/powerpc/power4/fpu/Makefile: Remove CPPFLAGS-slowpow.c.
* sysdeps/x86_64/fpu/libm-test-ulps: Set ULP of pow to 1.
* sysdeps/x86_64/fpu/multiarch/Makefile: Remove slowpow-fma.c,
slowpow-fma4.c, halfulp-fma.c, halfulp-fma4.c.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma.c (__slowpow): Remove define.
* sysdeps/x86_64/fpu/multiarch/e_pow-fma4.c (__slowpow): Likewise.
* sysdeps/x86_64/fpu/multiarch/halfulp-fma.c: Delete file.
* sysdeps/x86_64/fpu/multiarch/halfulp-fma4.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowpow-fma.c: Likewise.
* sysdeps/x86_64/fpu/multiarch/slowpow-fma4.c: Likewise.
Keeping the buffers the same across test runs gives later invocations
the advantage since they access cached data. Reallocate so that all
test runs are on equal grounds.
* benchtests/bench-memcmp.c (do_test): Call realloc_buf for
every test run.
This patch adds BENCHSET variable to benchtests/Makefile in order to
provide the capability to run a list of subsets of benchmark tests, ie;
make bench BENCHSET="bench-pthread bench-math malloc-thread"
This helps users to benchmark specific glibc area
ChangeLog:
* benchtests/Makefile:Add BENCHSET to allow subsets of
benchmarks to be run.
* benchtests/README: Add documentation for: Running subsets of
benchmarks.
Signed-off-by: Victor Rodriguez <victor.rodriguez.bahena@intel.com>
Signed-off-by: Icarus Sparry <icarus.w.sparry@intel.com>
Reviewed-By: Siddhesh Poyarekar <siddhesh@sourceware.org>
When executing bench-math the benchmark output is invalid with this
error msg:
Invalid benchmark output: 'workload-spec2006.wrf' does not match any of
the regexes: '^[_a-zA-Z0-9]*$¹ or Invalid benchmark output: Additional
properties are not allowed ('workload-spec2006.wrf' was unexpected)
The error was seen when running the test:
workload-spec2006.wrf, 'stack=1024,guard=1' and 'stack=1024,guard=2'.
The problem is that the current regex's do not accept the hyphen, dot, equal
and comma in the output.
This patch changes the regex in benchout.schema.json to accept symbols in
benchmark tests names.
ChangeLog:
* benchtests/scripts/benchout.schema.json: Fix regex to accept a
wider range of tests names.
Signed-off-by: Victor Rodriguez <victor.rodriguez.bahena@intel.com>
Reviewed-By: Siddhesh Poyarekar <siddhesh@sourceware.org>
Benchmark workload-spec2006.wrf does not produce max, min or mean
results but instead produce throughput. This is represented in
benchtests/bench-skeleton.c. This patch adjust benchout.schema.json to consider
bench.out from bench-math benchmarks as valid
ChangeLog:
* benchtests/scripts/benchout.schema.json: Add throughput as accepted
result from property and remove "max", min" and "mean" from required
properties based on benchtests/bench-skeleton.c.
Signed-off-by: Victor Rodriguez <victor.rodriguez.bahena@intel.com>
Reviewed-By: Siddhesh Poyarekar <siddhesh@sourceware.org>
Numbers for very small sizes (< 128B) are much noisier for non-cached
benchmarks like the walk benchmarks, so don't include them.
* benchtests/bench-memcpy-walk.c (START_SIZE): Set to 128.
* benchtests/bench-memmove-walk.c (START_SIZE): Likewise.
* benchtests/bench-memset-walk.c (START_SIZE): Likewise.
Make the walking benchmarks walk only backwards since copying both
ways is biased in favour of implementations that use non-temporal
stores for larger sizes; falkor is one of them. This also fixes up
bugs in computation of the result which ended up multiplying the
length with the timing result unnecessarily.
* benchtests/bench-memcpy-walk.c (do_one_test): Copy only
backwards. Fix timing computation.
* benchtests/bench-memmove-walk.c (do_one_test): Likewise.
* benchtests/bench-memset-walk.c (do_one_test): Walk backwards
on memset by N at a time. Fix timing computation.
This benchmark is an attempt to eliminate cache effects from string
benchmarks. The benchmark walks both ways through a large memory area
and copies different sizes of memory and alignments one at a time
instead of looping around in the same memory area. This is a good
metric to have alongside the simple memmove benchmark (which is only
really useful for smaller sizes) especially for larger sizes where the
likelihood of the call being done only once is pretty high.
This benchmark is different from memcpy in that it also tests
overlapping copies.
* benchtests/bench-memmove-walk.c: New file.
* benchtests/Makefile (string-benchset): Add it.
This benchmark is an attempt to eliminate cache effects from string
benchmarks. The benchmark walks backward through a large memory area
and sets different sizes of memory and alignments one at a time
instead of looping around in the same memory area. This is a good
metric to have alongside the simple memset benchmark (which is only
really useful for smaller sizes) especially for larger sizes where the
likelihood of the call being done only once is pretty high.
* benchtests/bench-memset-walk.c: New file.
* benchtests/Makefile (string-benchset): Add it.
This benchmark is an attempt to eliminate cache effects from string
benchmarks. The benchmark walks both ways through a large memory area
and copies different sizes of memory and alignments one at a time
instead of looping around in the same memory area. This is a good
metric to have alongside the other memcpy benchmarks, especially for
larger sizes where the likelihood of the call being done only once is
pretty high.
* benchtests/bench-memcpy-walk.c: New file.
* benchtests/Makefile (string-benchset): Add it.
exp2f and log2f benchmark traces are just copies of the existing
expf and logf traces from wrf_r.
* benchtests/Makefile: Add exp2f and log2f benchmarks.
* benchtests/exp2f-inputs: Copy of expf-inputs.
* benchtests/log2f-inputs: Copy of logf-inputs.
Add a trace for logf. This is a reduced trace based on 2.8 billion
samples extracted from wrf_r.
* benchtests/Makefile: Add logf benchmark.
* benchtests/logf-inputs: Add reduced trace from wrf_r.
Add a trace for expf. This is a reduced trace based on 2.4 billion
samples extracted from wrf_r.
* benchtests/Makefile: Add expf benchmark.
* benchtests/expf-inputs: Add reduced trace from wrf_r.
This patch adds benchtests for the trunc and truncf functions. The
inputs listed are fairly arbitrary; I do not assert they are
representative of any particular application.
* benchtests/Makefile (bench-math): Add trunc and truncf.
(CFLAGS-bench-trunc.c): New variable.
(CFLAGS-bench-truncf.c): Likewise.
* benchtests/trunc-inputs: New file.
* benchtests/truncf-inputs: Likewise.
The compare_strings.py option unconditionally generates a graph PNG
image of the input data, which can be unnecessary and slow. Put this
behind an optional flag -g.
* benchtests/scripts/compare_strings.py: New option -g.
(draw_graph): Print a message that a graph is being generated.
(process_results): Generate graph only if -g is passed.
(main): Process option -g.
Make the column widths for the outputs fixed so that they look a
little less messy. They will still look bad with lots of IFUNCs (like
on x86) but it's still a step forward.
* benchtests/scripts/compare_strings.py (process_results):
Better spacing for output.
Make the script more usable by adding proper command line options
along with a way to query the options. The script is capable of doing
a bunch of things right now like choosing a base for comparison,
choosing to generate graphs, etc. and they should be accessible via
command line switches.
* benchtests/scripts/compare_strings.py: Use argparse.
* benchtests/README: Document existence of compare_strings.py.
Keeping the same buffers along with copying the same size of data into
the same location means that the first routine is typically the
slowest since it has to bear the cost of fetching data into to cache.
Reallocating buffers stabilizes numbers by a bit.
* benchtests/bench-string.h (realloc_bufs): New function.
(test_init): Call it.
* benchtests/bench-memset-large.c (do_test): Likewise.
* benchtests/bench-memset.c (do_test): Likewise.
Make the memset benchmarks (bench-memset and bench-memset-large) print
their output in JSON so that they can be evaluated using the
compare_strings.py script.
* benchtests/bench-memset-large.c: Print output in JSON
format.
* benchtests/bench-memset.c: Likewise.
This code page is identical to code page 850 except that X'D5'
has been changed from LI61 (dotless i) to SC20 (euro symbol).
The code points from /x01 to /x1f in the /localedata/charmaps/IBM858
file have the same mapping as those in localedata/charmaps/ANSI_X3.4-1968.
That means they disagree with with
ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00858.txt
in that range.
For example, localedata/charmaps/IBM858 and localedata/charmaps/ANSI_X3.4-1968 have:
“<U0001> /x01 START OF HEADING (SOH)”
whereas CP00858.txt has:
“01 SS000000 Smiling Face”
That means that CP00858.txt is not really ASCII-compatible and to make
it ASCII-compatible we deviate fro CP00858.txt in the code points from /x01
to /x1f.
[BZ #21084]
* benchtests/strcoll-inputs/filelist#en_US.UTF-8: Add IBM858 and ibm858.c.
* iconvdata/Makefile: Add IBM858.
* iconvdata/gconv-modules: Add IBM858.
* iconvdata/ibm858.c: New file.
* iconvdata/tst-tables.sh: Add IBM858
* localedata/charmaps/IBM858: New file.
This patch further improves math function benchmarking by adding a latency
test in addition to throughput. This enables more accurate comparisons of the
math functions. The latency test works by creating a dependency on the previous
iteration: func_res = F (func_res * zero + input[i]). The multiply by zero
avoids changing the input.
It reports reciprocal throughput and latency in nanoseconds (depending on the
timing header used) and max/min throughput in iterations per second:
"workload-spec2006.wrf": {
"reciprocal-throughput": 100,
"latency": 200,
"max-throughput": 1.0e+07,
"min-throughput": 5.0e+06
}
* benchtests/bench-skeleton.c (main): Add support for
latency benchmarking.
* benchtests/scripts/bench.py: Add support for latency benchmarking.
Make the memmove benchmarks (bench-memmove and bench-memmove-large)
print their output in JSON so that they can be evaluated using the
compare_strings.py script.
* benchtests/bench-memmove-large.c: Print output in JSON
format.
* benchtests/bench-memmove.c: Likewise.
The test run is unnecessary and interferes with the benchmark. The
tests are done during make check, so they're unnecessary here.
* benchtests/bench-memccpy.c (do_one_test): Remove checks.
* benchtests/bench-memchr.c (do_one_test): Likewise.
* benchtests/bench-memcpy-large.c (do_one_test): Likewise.
* benchtests/bench-memcpy.c (do_one_test): Likewise.
* benchtests/bench-memmove-large.c (do_one_test): Likewise.
* benchtests/bench-memmove.c (do_one_test): Likewise.
* benchtests/bench-memset-large.c (do_one_test): Likewise.
* benchtests/bench-memset.c (do_one_test): Likewise.
* benchtests/bench-string.h (test_init): Remove memsets.
The compare_strings.py script generates a graph for the benchmarks it
performs a comparison on and that fails if X is not available. Avoid
the error and ensure that only the graph is generated and saved as a
PNG file.
* benchtests/scripts/compare_strings.py: Avoid display error
when generating graph.
This patch allows one to provide the function name using an optional
-base option to compare all other functions against. This is useful
when pitching one implementation of a string function against
alternatives. In the absence of this option, comparisons are done
against the first ifunc in the list.
* benchtests/scripts/compare_strings.py (main): Add an
optional -base option.
(process_results): New argument base_func.
The hardcoded 'memcpy' name turns up in other derived tests like
mempcpy.
* benchtests/bench-memcpy.c (test_main): Use TEST_NAME instead of
hardcoding memcpy.
* benchtests/bench-memcpy-large.c (test_name): Likewise.
* benchtests/bench-memcpy-random.c (test_name): Likewise.
Read the memcpy results in json and print out the results in tabular
form, in addition to generating a graph of the results to compare all
of the implementations.
The format of the output is extensible enough to allow this kind of
analysis to be done on other string functions as well.
* benchtests/scripts/benchout_strings.schema.json: New file.
* benchtests/scripts/compare_strings.py: New file.
Print the benchmark output for various memcpy benchmarks in json so
that it can be predictably parsed and analyzed.
* benchtests/bench-memcpy-large.c: Include json-lib.h.
(do_one_test): Print json.
(do_test): Likewise.
(test_main): Likewise.
* benchtests/bench-memcpy-random.c: Include json-lib.h.
(do_one_test): Print json.
(do_test): Likewise.
(test_main): Likewise.
* benchtests/bench-memcpy.c: Include json-lib.h.
(do_one_test): Print json.
(do_test): Likewise.
(test_main): Likewise.
Enhance the json module in benchtests to print signed and unsigned
integers and string array elements.
* benchtests/json-lib.h: Include inttypes.h.
(json_attr_int, json_attr_int, json_element_string,
json_element_int, json_element_uint): New functions.
* benchtests/json-lib.c: (json_attr_int, json_attr_int,
json_element_string, json_element_int, json_element_uint): New
functions.
Add a workload for powf. This is a reduced trace based on 2.3 billion
samples extracted from wrf. The distribution of values, in particular
frequency of commonly used operands is the same as in the full trace.
* benchtests/powf-inputs: Add reduced trace from wrf.
Improve support for math function benchmarking. This patch adds
a feature that allows accurate benchmarking of traces extracted
from real workloads. This is done by iterating over all samples
rather than repeating each sample many times (which completely
ignores branch prediction and cache effects). A trace can be
added to existing math function inputs via
"## name: workload-<name>", followed by the trace.
* benchtests/README: Describe workload feature.
* benchtests/bench-skeleton.c (main): Add support for
benchmarking traces from workloads.
Add powf() bench test with input which covers these cases:
- positive base to positive exponent
- exponent 0
- negative base to even exponent
- exponent 1
- exponent -1
- squared
- squareroot
- 1 to negative exponent
- -1 to negative exponent
- base 0
- -1 to even exponent
- small base
- small exponent
* benchtests/Makefile (bench-math): Add powf.
* benchtests/powf-inputs: New file.
Current allocate_stack logic for create stacks is to first mmap all
the required memory with the desirable memory and then mprotect the
guard area with PROT_NONE if required. Although it works as expected,
it pessimizes the allocation because it requires the kernel to actually
increase commit charge (it counts against the available physical/swap
memory available for the system).
The only issue is to actually check this change since side-effects are
really Linux specific and to actually account them it would require a
kernel specific tests to parse the system wide information. On the kernel
I checked /proc/self/statm does not show any meaningful difference for
vmm and/or rss before and after thread creation. I could only see
really meaningful information checking on system wide /proc/meminfo
between thread creation: MemFree, MemAvailable, and Committed_AS shows
large difference without the patch. I think trying to use these
kind of information on a testcase is fragile.
The BZ#18988 reports shows that the commit pages are easily seen with
mlockall (MCL_FUTURE) (with lock all pages that become mapped in the
process) however a more straighfoward testcase shows that pthread_create
could be faster using this patch:
--
static const int inner_count = 256;
static const int outer_count = 128;
static
void *thread1(void *arg)
{
return NULL;
}
static
void *sleeper(void *arg)
{
pthread_t ts[inner_count];
for (int i = 0; i < inner_count; i++)
pthread_create (&ts[i], &a, thread1, NULL);
for (int i = 0; i < inner_count; i++)
pthread_join (ts[i], NULL);
return NULL;
}
int main(void)
{
pthread_attr_init(&a);
pthread_attr_setguardsize(&a, 1<<20);
pthread_attr_setstacksize(&a, 1134592);
pthread_t ts[outer_count];
for (int i = 0; i < outer_count; i++)
pthread_create(&ts[i], &a, sleeper, NULL);
for (int i = 0; i < outer_count; i++)
pthread_join(ts[i], NULL);
assert(r == 0);
}
return 0;
}
--
On x86_64 (4.4.0-45-generic, gcc 5.4.0) running the small benchtests
I see:
$ time ./test
real 0m3.647s
user 0m0.080s
sys 0m11.836s
While with the patch I see:
$ time ./test
real 0m0.696s
user 0m0.040s
sys 0m1.152s
So I added a pthread_create benchtest (thread_create) which check
the thread creation latency. As for the simple benchtests, I saw
improvements in thread creation on all architectures I tested the
change.
Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
arm-linux-gnueabihf, powerpc64le-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu.
[BZ #18988]
* benchtests/thread_create-inputs: New file.
* benchtests/thread_create-source.c: Likewise.
* support/xpthread_attr_setguardsize.c: Likewise.
* support/Makefile (libsupport-routines): Add
xpthread_attr_setguardsize object.
* support/xthread.h: Add xpthread_attr_setguardsize prototype.
* benchtests/Makefile (bench-pthread): Add thread_create.
* nptl/allocatestack.c (allocate_stack): Call mmap with PROT_NONE and
then mprotect the required area.
bench-memchr.c is shared with bench-memrchr.c. This patch adds some
tests for positions close to the beginning for memrchr, which are
equivalent to positions close to the end for memchr.
* benchtests/bench-memchr.c (do_test): Print out both length
and position.
(test_main): Also test the position close to the beginning for
memrchr.
cppflags-iterator.mk no longer has anything to do with CPPFLAGS; all
it does is set libof-$(foo) for a list of files. extra-modules.mk
does the same thing, but with a different input variable, and doesn't
let the caller control the module. Therefore, this patch gives
cppflags-iterator.mk a better name, removes extra-modules.mk, and
updates all uses of both.
* extra-modules.mk: Delete file.
* cppflags-iterator.mk: Rename to ...
* libof-iterator.mk: ...this. Adjust comments.
* Makerules, extra-lib.mk, benchtests/Makefile, elf/Makefile
* elf/rtld-Rules, iconv/Makefile, locale/Makefile, malloc/Makefile
* nscd/Makefile, sunrpc/Makefile, sysdeps/s390/Makefile:
Use libof-iterator.mk instead of cppflags-iterator.mk or
extra-modules.mk.
* benchtests/strcoll-inputs/filelist#en_US.UTF-8: Remove
extra-modules.mk and cppflags-iterator.mk, add libof-iterator.mk.
of the size and alignment is based on a trace of SPEC2006. Instead of
repeating the same copy over and over again like the existing tests, it times
several thousand different copies to more accurately estimate the overhead of
branch prediction.
* benchtests/Makefile (string-benchset): Add memcpy-random.
* benchtests/bench-memcpy-random.c: New file.
Add a configure check that looks for python3 and python in that order
since we had agreed in the past to prefer python3 over python in all
our code. The patch also adjusts invocations through the various
Makefiles to use the set variable.
* configure.ac: Check for python3 or python.
* configure: Regenerated.
* config.make.in (PYTHON): New variable.
* benchtests/Makefile: Don't define PYTHON.
(bench): Define target only if PYTHON was defined.
* Rules: Don't define PYTHON.
Define pretty printer targets only if PYTHON was defined.
(tests-printers): Add to tests-unsupported if PYTHON is not
found.
(python-flags, python-invoke): Remove.
(tests-printers-out): Use PYTHON instead of python-invoke.
Currently strsep calls strpbrk is is now a veneer to strcspn. Calling
strcspn directly is faster. Since it handles a delimiter string of size
1 as a special case, this is not needed in strsep itself. Although this
means there is a slightly higher overhead if the delimiter size is 1,
all other cases are slightly faster. The overall performance gain is 5-10%
on AArch64.
The string/bits/string2.h header contains optimizations for constant
delimiters of size 1-3. Benchmarking these showed similar performance for
size 1 (since in all cases strchr/strchrnul is used), while size 2 and 3
can give up to 2x speedup for small input strings. However if these cases
are common it seems much better to add this optimization to strcspn.
So move these header optimizations to string-inlines.c.
Improve the strsep benchmark so that it actually benchmarks something.
The current version contains a delimiter character at every position in the
input string, so there is very little work to do, and the extremely inefficent
simple_strsep implementation appears fastest in every case. The new version
has either no match in the input for the fail case and a match halfway in the
input for the success case. The input is then restored so that each iteration
does exactly the same amount of work. Reduce the number of testcases since
simple_strsep takes a lot of time now.
* benchtests/bench-strsep.c (oldstrsep): Add old implementation.
(do_one_test) Restore original string so iteration works.
* string/string-inlines.c (do_test): Create better input strings.
(test_main) Reduce number of testruns.
* string/string-inlines.c (__old_strsep_1c): New function.
(__old_strsep_2c): Likewise.
(__old_strsep_3c): Likewise.
* string/strsep.c (__strsep): Remove case of small delim string.
Call strcspn directly rather than strpbrk.
* string/bits/string2.h (__strsep): Remove define.
(__strsep_1c): Remove.
(__strsep_2c): Remove.
(__strsep_3c): Remove.
(strsep): Remove.
* sysdeps/unix/sysv/linux/internal_statvfs.c
(__statvfs_getflags): Rename to __strsep.
This patch adds fmaxf and fminf benchtests. It is based on
math/s_fmax_template.c implementation which checks for basically four
different classes:
1. if x is greater or equal than y.
2. if x is less than y.
3. if x or y is signaling.
4. if y is nan.
Cases 1 and 2 are used for default input number (by mixing normal double
numbers and infinity), while case 3 and 4 are used each for on for a
benchmark class.
Checked on x86_64-linux-gnu and powerpc64-linux-gnu.
* benchtests/Makefile (bench-math): Add fminf and fmaxf.
(CFLAGS-bench-fmaxf.c): New rule.
(CFLAGS-bench-fminf.c): Likewise.
* benchtests/fmaxf-inputs: New file.
* benchtests/fminf-inputs: Likewise.
This patch adds fmax and fmin benchtests. It is based math/s_fmax_template.c
implementation which checks for basically four different classes:
1. if x is greater or equal than y.
2. if x is less than y.
3. if x or y is signaling.
4. if y is nan.
Cases 1 and 2 are used for default input number (by mixing normal double
numbers and infinity), while case 3 and 4 are used each for on for a
benchmark class.
Checked on x86_64-linux-gnu and powerpc64-linux-gnu.
* benchtests/Makefile (bench-math): Add fmin and fmax.
(CFLAGS-bench-fmax.c): New rule.
(CFLAGS-bench-fmin.c): New rule.
* benchtests/fmax-inputs: New file.
* benchtests/fmin-inputs: Likewise.
Benchsets in benchtests use test-skeleton, so they too need to be
linked against the new libsupport DSO.
* benchtests/Makefile (binaries-benchset): Depend on libsupport
DSO.
calls strcspn, call strcspn directly so we get the end of the token without
an extra call to rawmemchr. Also avoid an unnecessary call to strcspn after
the last token by adding an early exit for an empty string. Change strtok
to tailcall strtok_r to avoid unnecessary code duplication.
Remove the special header optimization for strtok_r of a 1-character
constant string - both strspn and strcspn contain optimizations for this
case. Benchmarking this showed similar performance in the worst case,
but up to 5.5x better performance in the "found" case for large inputs.
* benchtests/bench-strtok.c (oldstrtok): Add old implementation.
* string/strtok.c (strtok): Change to tailcall __strtok_r.
* string/strtok_r.c (__strtok_r): Optimize for performance.
* string/string-inlines.c (__old_strtok_r_1c): New function.
* string/bits/string2.h (__strtok_r): Move to string-inlines.c.
This patch makes the sqrt benchmark use -fno-builtin, as already done
for benchmarks of ffs and ffsll, so that it actually benchmarks the
glibc function as (presumably) intended even in the presence of the
compiler inlining sqrt.
Tested for x86_64 and also used for benchmarking my ARM sqrt patch.
* benchtests/Makefile (CFLAGS-bench-sqrt.c): New variable.
Clear the destination buffer updated by the previous run in bench-memcpy.c
and test-memcpy.c to catch the error when the following implementations do
not copy anything.
[BZ #19907]
* benchtests/bench-memcpy.c (do_one_test): Clear the destination
buffer updated by the previous run.
* string/test-memcpy.c (do_one_test): Likewise.
* benchtests/bench-memmove.c (do_one_test): Add a comment.
* string/test-memmove.c (do_one_test): Likewise.
This patch adds full support for cross-building benchmarks. Some
benchmarks like those that need locales to be generated cannot be
built and are hence skipped for cross builds.
Tested by cross building for aarch64 on x86_64 and then running the
generated benchmark on aarch64.
* benchtests/Makefile (wcsmbs-benchset): Include only for
native builds and runs.
(LOCALES): Likewise.
(bench-build): Build timing-type here instead of the bench
target. Generate locale only for native builds.
* benchtests/README: Add note for cross-building.
For situations where we are cross-building or where we want to avoid
building on the target system, we want a way to only build benchmarks
and then copy them over to the target system to run them. I have also
added a simple enhancement for the 'bench' target where all benchmark
binaries are built and then the benchmarks executed.
Tested on arm.
Makefile.in (bench-build): New target.
Rules (PHONY): Add bench-build target.
benchtests/Makefile (bench): Depend on bench-build.
(bench-build): New target.
From the bug:
Obsolete locale. The ISO-639 code for Hebrew was changed from 'iw'
to 'he' in 1989, according to Bruno Haible on libc-alpha 2003-09-01.
Reported-by: Chris Leonard <cjlhomeaddress@gmail.com>
benchtests should use $(test-via-rtld-prefix) and $(+link-tests) like
other glibc tests.
[BZ #19783]
* benchtests/Makefile (run-bench): Replace $(rtld-prefix) with
$(test-via-rtld-prefix).
($(binaries-bench)): Replace $(+link) with $(+link-tests).
The ffs and ffsll functions were listed as math functions when they
are actually defined in strings.h and string.h respectively. Shuffle
around the Makefile variables a bit and make a separate space for ffs
and ffsll.
The sincos benchmark has only about a dozen inputs that don't measure
the impact of changes to various passes. Since much of the code
properties are inherited from sin and cos, copy those inputs in to get
more comprehensive coverage.
Prevent function calls that don't return anything from being optimized
out by the compiler by marking its input variables as used.
This prevents the sincos function call from being optimized out in the
benchmark.
ChangeLog:
2015-09-18 Wilco Dijkstra <wdijkstr@arm.com>
* benchtests/Makefile: Add bench-math-inlines, link with libm.
* benchtests/bench-math-inlines.c: New benchmark.
* benchtests/bench-util.h: New file.
* benchtests/bench-util.c: New file.
* benchtests/bench-skeleton.c: Add include of bench-util.c/h.