Unfortunately, immintrin.h (which is also included by SkTypes)
includes xmmintrin.h which includes mm_malloc.h which includes
stdlib.h for malloc even though, from the implementation, it is
difficult to see why.
Fortunately, arm_neon.h does not seem to be involved in such
shenanigans, so building for Android will keep things sane.
TBR=reed@google.com
Doesn't change Skia API, just moves an include.
Review URL: https://codereview.chromium.org/1313203003
Prior to this CL, if a client wanted to decode scanlines, they had to
create an SkCodec in order to get an SkScanlineDecoder. This introduces
complications if input data is not easily shared between the two
objects.
Instead, add methods to SkScanlineDecoder for creating a new one from
input data, and remove the creation functions from SkCodec.
Update DM and tests.
Review URL: https://codereview.chromium.org/1267583002
(and a couple presubmit fixes)
This allows us to turn back on -Werror for LLVM coverage builds,
and more generally supports building with Clang 3.7.
No public API changes.
TBR=reed@google.com
BUG=skia:
Review URL: https://codereview.chromium.org/1232463006
Make getScanlineDecoder return a new object each time, which is
owned by the caller, and independent from any existing scanline
decoders and the SkCodec itself.
Since the SkCodec already contains the entire state machine, and it
is used by the scanline decoders, simply create a new SkCodec which
is now owned by the scanline decoder.
Move code that cleans up after using a scanline decoder into its
destructor
One side effect is that creating the first scanline decoder requires
a duplication of the stream and re-reading the header. (With some
more complexity/changes, we could pass the state machine to the
scanline decoder and make the SkCodec recreate its own state machine
instead.) The typical client of the scanline decoder (region decoder)
uses an SkMemoryStream, so the duplication is cheap, although we
should consider the extra time to reread the header/recreate the state
machine. (If/when we use the scanline decoder for other purposes,
where the stream may not be cheaply duplicated, we should consider
passing the state machine.)
One (intended) result of this change is that a client can create a
new scanline decoder in a new thread, and decode different pieces of
the image simultaneously.
In SkPngCodec::decodePalette, use fBitDepth rather than a parameter.
Review URL: https://codereview.chromium.org/1230033004
SkImageGenerator makes some assumptions that are not necessarily valid
for SkCodec. For example, SkCodec does not assume that it can always be
rewound.
We also have an ongoing question of what an SkCodec should report as
its default settings (i.e. the return from getInfo). It makes sense for
an SkCodec to report that its pixels are unpremultiplied, if that is
the case for the underlying data, but if a client of SkImageGenerator
uses the default settings (as many do), they will receive
unpremultiplied pixels which cannot (currently) be drawn with Skia. We
may ultimately decide to revisit SkCodec reporting an SkImageInfo, but
I have left it unchanged for now.
Import features of SkImageGenerator used by SkCodec into SkCodec.
I have left SkImageGenerator unchanged for now, but it no longer needs
Result or Options. This will require changes to Chromium.
Manually handle the lifetime of fScanlineDecoder, so SkScanlineDecoder.h
can include SkCodec.h (where Result is), and SkCodec.h does not need
to include it (to delete fScanlineDecoder).
In many places, make the following simple changes:
- Now include SkScanlineDecoder.h, which is no longer included by
SkCodec.h
- Use the enums in SkCodec, rather than SkImageGenerator
- Stop including SkImageGenerator.h where no longer needed
Review URL: https://codereview.chromium.org/1220733013
This makes nanobench picture recording benchmarks somewhat useful again,
as opposed to all taking about 5us to run no matter the content.
ATTN Sheriff: this will probably trigger perf.skia.org alerts.
BUG=skia:
Review URL: https://codereview.chromium.org/1219873002
All of the CodecSubset benches fail when the color type is
kIndex8. We need to pass a color table to allocPixels()
when we want to decode to kIndex8 or it will throw a failure.
BUG=skia:
Review URL: https://codereview.chromium.org/1213983003
Changes verbose mode to print both the table and the individual sample
values. No need to hold back information in verbose mode.
BUG=skia:
Review URL: https://codereview.chromium.org/1208763003
Adds a nanobench mode that takes samples for a fixed amount of time,
rather than taking a fixed amount of samples.
BUG=skia:
Review URL: https://codereview.chromium.org/1204153002
Now that Sk4px exists, there's a lot less sense in eeking out every
cycle of speed from SkPMFloat: if we need to go _really_ fast, we
should use Sk4px. SkPMFloat's going to be used for things that are
already slow: large-range intermediates, divides, sqrts, etc.
A [0,1] range is easier to work with, and can even be faster if we
eliminate enough *255 and *1/255 steps. This is particularly true
on ARM, where NEON can do the *255 and /255 steps for us while
converting float<->int.
We have lots of experimental SkPMFloat <-> SkPMColor APIs that
I'm now removing. Of the existing APIs, roundClamp() is the sanest,
so I've kept only that, now called round(). The 4-at-a-time APIs
never panned out, so they're gone.
There will be small diffs on:
colormatrix coloremoji colorfilterimagefilter fadefilter imagefilters_xfermodes imagefilterscropexpand imagefiltersgraph tileimagefilter
BUG=skia:
Review URL: https://codereview.chromium.org/1201343004
Improves the GPU measuring accuracy of nanobench by using fence syncs.
Fence syncs are very widely supported and available on almost every
platform.
NO_MERGE_BUILDS
BUG=skia:
Review URL: https://codereview.chromium.org/1194783003
I think these changes to the subset benchmarks cover what we discussed yesterday.
I removed the divisor benchmarks (2x2, 3x3) and changed the single subset benchmarks.
Also, we will no longer benchmark subset decodes on small images.
BUG=skia:
Review URL: https://codereview.chromium.org/1188223002
Let's make CPU-bound .SKP benching mimic Chrome's tiles.
Unfortunately, the CPU code also performs a lot better with those big wide tiles...
BUG=skia:
Review URL: https://codereview.chromium.org/1189863002
This makes it easier to benchmark _mpd variants in a profiler.
E.g.,
<profiler> out/Release/nanobench --images --config 8888 --loops -1 --match sp_desk_nytimes
BUG=skia:
Review URL: https://codereview.chromium.org/1184673006
I haven't figured out a pithy way to have these apply to only classes
originating from SkNx, so let's just remove them. There aren't too
many use cases, and it's not really any less readable without them.
Semantically, this is a no-op.
BUG=skia:
Review URL: https://codereview.chromium.org/1167153002
The existing bench only tests the fast path, but we're looking to speed
up the general case. It'd be nice to be able to measure that speedup.
BUG=skia:
Review URL: https://codereview.chromium.org/1146953003
Constructing the gm tests and benches causes many calls to font loads.
This is visible as profiling samples in fontconfig and freetype on Linux
for all profiling runs of nanobench. This complicates analysis of
test-cases that are suspected of being slow due to font-related issues.
Move the font loading to GM::onOnceBeforeDraw and Benchmark::onPreDraw.
This way the code is not executed if the testcase does not match the
nanobench --match filter. This way the samples in font-related code are
more easy to identify as legitimate occurances caused by the testcase.
This should not cause differences in timings, because:
* Benchmark::preDraw / onPreDraw is defined to be run outside the timer
* GM::runAsBench is not enabled for any of the modified testcases. Also
nanobench untimed warmup round should run the onOnceBeforeDraw.
(and there are other GM::runAsBench gms already doing loading in
onOnceBeforeDraw).
Changes the behavior:
In TextBench:
Before, the test would report two different gms with the same name if
the color emoji font was not loaded successfully.
After, the test always reports all tests as individual names.
Generally:
The errors from loading fonts now print inbetween each testcase, as
opposed to printing during construction phase. Sample output:
( 143/145 MB 1872) 14.7ms 8888 gm quadclosepathResource /fonts/Funkster.ttf not a valid font.
( 160/160 MB 1831) 575µs 8888 gm surfacenewResource /fonts/Funkster.ttf not a valid font.
( 163/165 MB 1816) 12.5ms 8888 gm linepathResource /fonts/Funkster.ttf not a valid font.
( 263/411 MB 1493) 118ms 8888 gm typefacestyles_kerningResource /fonts/Funkster.ttf not a valid font.
( 374/411 MB 1231) 7.16ms 565 gm getpostextpathResource /fonts/Funkster.ttf not a valid font.
( 323/411 MB 1179) 4.92ms 565 gm stringartResource /fonts/Funkster.ttf not a valid font.
( 347/493 MB 917) 191ms 565 gm patch_gridResource /fonts/Funkster.ttf not a valid font.
( 375/493 MB 857) 23.9ms gpu gm clipdrawdrawCannot render path (0)
( 393/493 MB 706) 2.91ms unit test ParsePath------ png error IEND: CRC error
( 394/493 MB 584) 166ms gpu gm hairmodesResource /fonts/Funkster.ttf not a valid font.
Resource /fonts/Funkster.ttf not a valid font.
Resource /fonts/Funkster.ttf not a valid font.
...
Review URL: https://codereview.chromium.org/1144023002
Make GrResourceCache performance less sensitive to key length change.
The memcmp in GrResourceKey is called when SkTDynamicHash jumps the
slots to find the hash by a index. Avoid most of the memcmps by
comparing the hash first.
This is important because small changes in key data length can cause
big performance regressions. The theory is that key length change causes
different hash values. These hash values might trigger memcmps that
originally weren't there, causing the regression.
Adds few specialized benches to grresourcecache_add to test different
key lengths. The tests are run only on release, because on debug the
SkTDynamicHash validation takes too long, and adding many such delays
to development test runs would be unproductive. On release the tests
are quite fast.
Effect of this patch to the added tests on amd64:
grresourcecache_find_10 738us -> 768us 1.04x
grresourcecache_find_2 472us -> 476us 1.01x
grresourcecache_find_25 841us -> 845us 1x
grresourcecache_find_4 565us -> 531us 0.94x
grresourcecache_find_54 1.18ms -> 1.1ms 0.93x
grresourcecache_find_5 834us -> 749us 0.9x
grresourcecache_find_3 620us -> 542us 0.87x
grresourcecache_add_25 2.74ms -> 2.24ms 0.82x
grresourcecache_add_56 3.23ms -> 2.56ms 0.79x
grresourcecache_add_54 3.34ms -> 2.62ms 0.78x
grresourcecache_add_5 2.68ms -> 2.1ms 0.78x
grresourcecache_add_10 2.7ms -> 2.11ms 0.78x
grresourcecache_add_2 1.85ms -> 1.41ms 0.76x
grresourcecache_add 1.84ms -> 1.4ms 0.76x
grresourcecache_add_4 1.99ms -> 1.49ms 0.75x
grresourcecache_add_3 2.11ms -> 1.55ms 0.73x
grresourcecache_add_55 39ms -> 13.9ms 0.36x
grresourcecache_find_55 23.2ms -> 6.21ms 0.27x
On arm64 the results are similar.
On arm_v7_neon, the results lack the discontinuity at 55:
grresourcecache_add 4.06ms -> 4.26ms 1.05x
grresourcecache_add_2 4.05ms -> 4.23ms 1.05x
grresourcecache_find 1.28ms -> 1.3ms 1.02x
grresourcecache_find_56 3.35ms -> 3.32ms 0.99x
grresourcecache_find_2 1.31ms -> 1.29ms 0.99x
grresourcecache_find_54 3.28ms -> 3.24ms 0.99x
grresourcecache_add_5 6.38ms -> 6.26ms 0.98x
grresourcecache_add_55 8.44ms -> 8.24ms 0.98x
grresourcecache_add_25 7.03ms -> 6.86ms 0.98x
grresourcecache_find_25 2.7ms -> 2.59ms 0.96x
grresourcecache_find_4 1.45ms -> 1.38ms 0.95x
grresourcecache_find_10 2.52ms -> 2.39ms 0.95x
grresourcecache_find_55 3.54ms -> 3.33ms 0.94x
grresourcecache_find_5 2.5ms -> 2.32ms 0.93x
grresourcecache_find_3 1.57ms -> 1.43ms 0.91x
The extremely slow case, 55, is postulated to be due to the index jump
collisions running the memcmp. This is not visible on arm_v7_neon probably due
to hash function producing different results for 32 bit architectures.
This change is needed for extending path cache key in Gr
NV_path_rendering codepath. Extending is needed in order to add dashed
paths to the path cache.
Review URL: https://codereview.chromium.org/1132723003
I'm thinking of using this in perf with something like:
ratio(fill(filter("test=foo")), fill(filter("test=control")))
Does that make sense to you?
Not sure that this is really a good control bench on all bots,
but I propose we just run it a bit and find out if it needs work.
BUG=skia:
Review URL: https://codereview.chromium.org/1129823003
The benches for N <= 10 get around 2x faster on my N7 and N9. I believe this
is because of the reduced function-call-then-function-pointer-call overhead on
the N7, and additionally because it seems autovectorization beats our NEON code
for small N on the N9.
My desktop is unchanged, though that's probably because N=10 lies well within a
region where memset's performance is essentially constant: N=100 takes only
about 2x as long as N=1 and N=10, which perform nearly identically.
BUG=skia:
Review URL: https://codereview.chromium.org/1073863002
The colorfilter is applied to a single (paint's) color, so the bench does not
measure the filter at all, but simply the blit of a color.
BUG=skia:
TBR=
Review URL: https://codereview.chromium.org/1055383002
Now that all SkCodecs can rewind (assuming the stream is rewindable),
we do not need to special case it.
Pointed out by Derek in the code review that added this.
TBR=djsollen
Review URL: https://codereview.chromium.org/1058633002
CodecBench:
Add new class for timing using SkCodec.
DecodingBench:
Include creating a decoder inside the loop. This is to have a better
comparison against SkCodec. SkCodec's factory function does not
necessarily read the same amount as SkImageDecoder's, so in order to
have a meaningful comparison, read the entire stream from the
beginning. Also for comparison, create a new SkStream from the
SkData each time.
Add a debugging check to make sure we have an SkImageDecoder.
Add include guards.
nanobench.cpp:
Decode using SkCodec.
When decoding using SkImageDecoder, exclude benches where we decoded
to a different color type than requested. SkImageDecoder may decide to
decode to a different type, in which case the name is misleading.
TODOs:
Now that we ignore color types that do not match the desired
color type, we should add Index8. This also means calling the more
complex version of getPixels so CodecBench can support kIndex8.
BUG=skia:3257
Review URL: https://codereview.chromium.org/1044363002
The primary feature this delivers is SkNf and SkNd for arbitrary power-of-two N. Non-specialized types or types larger than 128 bits should now Just Work (and we can drop in a specialization to make them faster). Sk4s is now just a typedef for SkNf<4, SkScalar>; Sk4d is SkNf<4, double>, Sk2f SkNf<2, float>, etc.
This also makes implementing new specializations easier and more encapsulated. We're now using template specialization, which means the specialized versions don't have to leak out so much from SkNx_sse.h and SkNx_neon.h.
This design leaves us room to grow up, e.g to SkNf<8, SkScalar> == Sk8s, and to grown down too, to things like SkNi<8, uint16_t> == Sk8h.
To simplify things, I've stripped away most APIs (swizzles, casts, reinterpret_casts) that no one's using yet. I will happily add them back if they seem useful.
You shouldn't feel bad about using any of the typedef Sk4s, Sk4f, Sk4d, Sk2s, Sk2f, Sk2d, Sk4i, etc. Here's how you should feel:
- Sk4f, Sk4s, Sk2d: feel awesome
- Sk2f, Sk2s, Sk4d: feel pretty good
No public API changes.
TBR=reed@google.com
BUG=skia:3592
Review URL: https://codereview.chromium.org/1048593002