skia2/bin/compare

#!/usr/bin/env python

import argparse
import sys

have_scipy = True
try:
    import scipy.stats
except:
    have_scipy = False

SIGNIFICANCE_THRESHOLD = 0.0001

parser = argparse.ArgumentParser(
    formatter_class=argparse.RawDescriptionHelpFormatter,
    description='Compare performance of two runs from nanobench.')
parser.add_argument('--use_means', action='store_true', default=False,
                    help='Use means to calculate performance ratios.')
parser.add_argument('baseline', help='Baseline file.')
parser.add_argument('experiment', help='Experiment file.')
args = parser.parse_args()

a,b = {},{}
for (path, d) in [(args.baseline, a), (args.experiment, b)]:
    for line in open(path):
        try:
            tokens = line.split()
            if tokens[0] != "Samples:":
                continue
            samples  = tokens[1:-1]
            label    = tokens[-1]
            d[label] = map(float, samples)
        except:
            pass

common = set(a.keys()).intersection(b.keys())

def mean(xs):
    return sum(xs) / len(xs)

ps = []
for key in common:
    p, asem, bsem = 0, 0, 0
    m = mean if args.use_means else min
    am, bm = m(a[key]), m(b[key])
    if have_scipy:
        _, p = scipy.stats.mannwhitneyu(a[key], b[key])
        asem, bsem = scipy.stats.sem(a[key]), scipy.stats.sem(b[key])
    ps.append((bm/am, p, key, am, bm, asem, bsem))
ps.sort(reverse=True)

def humanize(ns):
    for threshold, suffix in [(1e9, 's'), (1e6, 'ms'), (1e3, 'us'), (1e0, 'ns')]:
        if ns > threshold:
            return "%.3g%s" % (ns/threshold, suffix)

maxlen = max(map(len, common))

# We print only signficant changes in benchmark timing distribution.
bonferroni = SIGNIFICANCE_THRESHOLD / len(ps)  # Adjust for the fact we've run multiple tests.
for ratio, p, key, am, bm, asem, bsem in ps:
    if p < bonferroni:
        str_ratio = ('%.2gx' if ratio < 1 else '%.3gx') % ratio
        if args.use_means:
            print '%*s\t%6s(%6s) -> %6s(%6s)\t%s' % (maxlen, key, humanize(am), humanize(asem),
                                                     humanize(bm), humanize(bsem), str_ratio)
        else:
            print '%*s\t%6s -> %6s\t%s' % (maxlen, key, humanize(am), humanize(bm), str_ratio)
Add nanobench stats scripts to Skia repo. These are the scripts I've been homegrowing for measuring perf impact. I think we found them useful today as a way of sifting through the noise. BUG=skia: Review URL: https://codereview.chromium.org/703713002 2014-11-24 20:39:59 +00:00			`#!/usr/bin/env python`

Change to use mean and to use stderr. BUG=skia: Review URL: https://codereview.chromium.org/1228783003 2015-07-09 17:50:24 +00:00			`import argparse`
Add nanobench stats scripts to Skia repo. These are the scripts I've been homegrowing for measuring perf impact. I think we found them useful today as a way of sifting through the noise. BUG=skia: Review URL: https://codereview.chromium.org/703713002 2014-11-24 20:39:59 +00:00			`import sys`
Make bin/c and bin/compare work on Windows. - Call python explicitly. - Drop numpy dependency (on numpy.mean.... come on.) - Make scipy dependency optional. Depends on https://codereview.chromium.org/1419073003 to really work. BUG=skia: Doesn't change code. NOTRY=true Review URL: https://codereview.chromium.org/1416833004 2015-10-28 16:45:44 +00:00
			`have_scipy = True`
			`try:`
			`import scipy.stats`
			`except:`
			`have_scipy = False`
Add nanobench stats scripts to Skia repo. These are the scripts I've been homegrowing for measuring perf impact. I think we found them useful today as a way of sifting through the noise. BUG=skia: Review URL: https://codereview.chromium.org/703713002 2014-11-24 20:39:59 +00:00
			`SIGNIFICANCE_THRESHOLD = 0.0001`

Change to use mean and to use stderr. BUG=skia: Review URL: https://codereview.chromium.org/1228783003 2015-07-09 17:50:24 +00:00			`parser = argparse.ArgumentParser(`
			`formatter_class=argparse.RawDescriptionHelpFormatter,`
			`description='Compare performance of two runs from nanobench.')`
			`parser.add_argument('--use_means', action='store_true', default=False,`
			`help='Use means to calculate performance ratios.')`
			`parser.add_argument('baseline', help='Baseline file.')`
			`parser.add_argument('experiment', help='Experiment file.')`
			`args = parser.parse_args()`

Add nanobench stats scripts to Skia repo. These are the scripts I've been homegrowing for measuring perf impact. I think we found them useful today as a way of sifting through the noise. BUG=skia: Review URL: https://codereview.chromium.org/703713002 2014-11-24 20:39:59 +00:00			`a,b = {},{}`
Change to use mean and to use stderr. BUG=skia: Review URL: https://codereview.chromium.org/1228783003 2015-07-09 17:50:24 +00:00			`for (path, d) in [(args.baseline, a), (args.experiment, b)]:`
Add nanobench stats scripts to Skia repo. These are the scripts I've been homegrowing for measuring perf impact. I think we found them useful today as a way of sifting through the noise. BUG=skia: Review URL: https://codereview.chromium.org/703713002 2014-11-24 20:39:59 +00:00			`for line in open(path):`
			`try:`
Don't suppress nanobench output table in verbose mode Changes verbose mode to print both the table and the individual sample values. No need to hold back information in verbose mode. BUG=skia: Review URL: https://codereview.chromium.org/1208763003 2015-06-26 20:32:53 +00:00			`tokens = line.split()`
			`if tokens[0] != "Samples:":`
			`continue`
			`samples = tokens[1:-1]`
			`label = tokens[-1]`
Add nanobench stats scripts to Skia repo. These are the scripts I've been homegrowing for measuring perf impact. I think we found them useful today as a way of sifting through the noise. BUG=skia: Review URL: https://codereview.chromium.org/703713002 2014-11-24 20:39:59 +00:00			`d[label] = map(float, samples)`
			`except:`
			`pass`

			`common = set(a.keys()).intersection(b.keys())`

Make bin/c and bin/compare work on Windows. - Call python explicitly. - Drop numpy dependency (on numpy.mean.... come on.) - Make scipy dependency optional. Depends on https://codereview.chromium.org/1419073003 to really work. BUG=skia: Doesn't change code. NOTRY=true Review URL: https://codereview.chromium.org/1416833004 2015-10-28 16:45:44 +00:00			`def mean(xs):`
			`return sum(xs) / len(xs)`

Add nanobench stats scripts to Skia repo. These are the scripts I've been homegrowing for measuring perf impact. I think we found them useful today as a way of sifting through the noise. BUG=skia: Review URL: https://codereview.chromium.org/703713002 2014-11-24 20:39:59 +00:00			`ps = []`
			`for key in common:`
Make bin/c and bin/compare work on Windows. - Call python explicitly. - Drop numpy dependency (on numpy.mean.... come on.) - Make scipy dependency optional. Depends on https://codereview.chromium.org/1419073003 to really work. BUG=skia: Doesn't change code. NOTRY=true Review URL: https://codereview.chromium.org/1416833004 2015-10-28 16:45:44 +00:00			`p, asem, bsem = 0, 0, 0`
			`m = mean if args.use_means else min`
			`am, bm = m(a[key]), m(b[key])`
			`if have_scipy:`
			`_, p = scipy.stats.mannwhitneyu(a[key], b[key])`
compare has a syntax error where it is missing the path to sem(). BUG=skia: Review URL: https://codereview.chromium.org/1420963010 2015-11-06 18:35:37 +00:00			`asem, bsem = scipy.stats.sem(a[key]), scipy.stats.sem(b[key])`
Change to use mean and to use stderr. BUG=skia: Review URL: https://codereview.chromium.org/1228783003 2015-07-09 17:50:24 +00:00			`ps.append((bm/am, p, key, am, bm, asem, bsem))`
Add nanobench stats scripts to Skia repo. These are the scripts I've been homegrowing for measuring perf impact. I think we found them useful today as a way of sifting through the noise. BUG=skia: Review URL: https://codereview.chromium.org/703713002 2014-11-24 20:39:59 +00:00			`ps.sort(reverse=True)`

			`def humanize(ns):`
			`for threshold, suffix in [(1e9, 's'), (1e6, 'ms'), (1e3, 'us'), (1e0, 'ns')]:`
			`if ns > threshold:`
			`return "%.3g%s" % (ns/threshold, suffix)`

			`maxlen = max(map(len, common))`

			`# We print only signficant changes in benchmark timing distribution.`
			`bonferroni = SIGNIFICANCE_THRESHOLD / len(ps) # Adjust for the fact we've run multiple tests.`
Change to use mean and to use stderr. BUG=skia: Review URL: https://codereview.chromium.org/1228783003 2015-07-09 17:50:24 +00:00			`for ratio, p, key, am, bm, asem, bsem in ps:`
Add nanobench stats scripts to Skia repo. These are the scripts I've been homegrowing for measuring perf impact. I think we found them useful today as a way of sifting through the noise. BUG=skia: Review URL: https://codereview.chromium.org/703713002 2014-11-24 20:39:59 +00:00			`if p < bonferroni:`
Don't count a leading 1 as a signficant digit in the ratio. What used to look like this: desk_pokemonwiki.skp 9.38ms -> 9.76ms 1x tabl_pravda.skp 237us -> 241us 1x desk_css3gradients.skp 249us -> 254us 1x .... desk_fontwipe.skp 39.6us -> 38.7us 0.98x tabl_digg.skp 922us -> 893us 0.97x tabl_gmail.skp 20.7us -> 20us 0.96x Now will print more like this: desk_pokemonwiki.skp 9.38ms -> 9.76ms 1.04x tabl_pravda.skp 237us -> 241us 1.02x desk_css3gradients.skp 249us -> 254us 1.02x .... desk_fontwipe.skp 39.6us -> 38.7us 0.98x tabl_digg.skp 922us -> 893us 0.97x tabl_gmail.skp 20.7us -> 20us 0.96x BUG=skia: Review URL: https://codereview.chromium.org/756643004 2014-11-24 22:44:23 +00:00			`str_ratio = ('%.2gx' if ratio < 1 else '%.3gx') % ratio`
Change to use mean and to use stderr. BUG=skia: Review URL: https://codereview.chromium.org/1228783003 2015-07-09 17:50:24 +00:00			`if args.use_means:`
			`print '%*s\t%6s(%6s) -> %6s(%6s)\t%s' % (maxlen, key, humanize(am), humanize(asem),`
			`humanize(bm), humanize(bsem), str_ratio)`
			`else:`
			`print '%*s\t%6s -> %6s\t%s' % (maxlen, key, humanize(am), humanize(bm), str_ratio)`