Commit Graph

36 Commits

Author SHA1 Message Date
mtklein
78559a78f9 Use ARMv8 CRC32 instructions for SkOpts::hash().
For large inputs, this runs ~11x faster than Murmur3.
My bench drops from 1µs to 88ns.

Like x86-64, this runs fastest if we work in 24 byte chunks.  16 byte chunks
run at about 0.75x this speed, 8 byte chunks at about 0.4x (which would still
be about 5x faster than Murmur3).
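
A rough sketch of the 24-byte chunking idea, not the actual SkOpts code: three independent accumulators each consume 8 bytes per iteration, so no step has to wait on the result of another. The names `crc32c_byte`, `crc32c_8bytes`, `crc32c_u32`, and `hash_24byte_chunks` are hypothetical, the way the three lanes are folded together is an assumption, and the CRC step is done in software here as a stand-in for the hardware CRC32 instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Software stand-in for one hardware CRC32C step over a single byte
// (reflected Castagnoli polynomial). Real code would use CPU instructions.
static uint32_t crc32c_byte(uint32_t crc, uint8_t byte) {
    crc ^= byte;
    for (int i = 0; i < 8; i++) {
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

// Stand-in for a CRC32 instruction that consumes 8 bytes at once.
static uint32_t crc32c_8bytes(uint32_t crc, const uint8_t* p) {
    for (int i = 0; i < 8; i++) { crc = crc32c_byte(crc, p[i]); }
    return crc;
}

// Stand-in for a CRC32 instruction that consumes a 32-bit value, low byte first.
static uint32_t crc32c_u32(uint32_t crc, uint32_t value) {
    for (int i = 0; i < 4; i++) {
        crc = crc32c_byte(crc, static_cast<uint8_t>(value >> (8 * i)));
    }
    return crc;
}

// Hypothetical hash: three independent lanes per 24-byte chunk.
uint32_t hash_24byte_chunks(const void* data, size_t bytes, uint32_t seed) {
    assert(data != nullptr || bytes == 0);
    const uint8_t* p = static_cast<const uint8_t*>(data);
    uint32_t a = seed, b = seed, c = seed;
    while (bytes >= 24) {
        a = crc32c_8bytes(a, p + 0);   // lane 0
        b = crc32c_8bytes(b, p + 8);   // lane 1, independent of lane 0
        c = crc32c_8bytes(c, p + 16);  // lane 2, independent of lanes 0 and 1
        p += 24;
        bytes -= 24;
    }
    // Fold the three lanes into one value (one possible choice, assumed here).
    uint32_t h = crc32c_u32(crc32c_u32(a, b), c);
    // Handle the remaining 0..23 bytes one at a time.
    while (bytes-- > 0) { h = crc32c_byte(h, *p++); }
    return h;
}
```

With 16- or 8-byte chunks the same loop would carry two lanes or one, which leaves fewer independent steps in flight per iteration.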

This'll require plumbing support for opts_crc32 into Chrome first before it can roll.

perf.skia.org charts we want to watch: https://perf.skia.org/#5490

Search for compute_hash in these logs to see the difference:
baseline: https://luci-milo.appspot.com/swarming/task/30ba22f3dfe30e10/steps/nanobench/0/stdout
trybot: https://luci-milo.appspot.com/swarming/task/30bbc406cbf62d10/steps/nanobench/0/stdout

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2260823002
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review-Url: https://codereview.chromium.org/2260823002
2016-08-22 08:53:45 -07:00
mtklein
9c3887c975 Step 1 of opts_crc32 rollout.
- This adds a crc32_sources variable to opts.gypi and hooks it into
    our GYP build with a dummy source file.

Step 2 will hook crc32_sources into Chrome's GN build.
Step 3 will actually land some code using CRC32 instructions.
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2259233002

Review-Url: https://codereview.chromium.org/2259233002
2016-08-19 08:07:51 -07:00
mtklein
4e97607d9a Use sse4.2 CRC32 instructions to hash when available.
About 9x faster than Murmur3 for long inputs.

Most of this is a mechanical change from SkChecksum::Murmur3(...) to SkOpts::hash(...).

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2208903002
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;master.client.skia.compile:Build-Ubuntu-GCC-x86_64-Release-CMake-Trybot,Build-Mac-Clang-x86_64-Release-CMake-Trybot

Review-Url: https://codereview.chromium.org/2208903002
2016-08-08 09:06:28 -07:00
herb
4d1dd6643f Add stub for avx.
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2087343002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review-Url: https://codereview.chromium.org/2087343002
2016-06-23 09:40:30 -07:00
mtklein
809ccf37ec Remove NEON runtime detection support.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1952953004
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review-Url: https://codereview.chromium.org/1952953004
2016-05-05 10:58:39 -07:00
mtklein
b4a7dc99b1 Port S32A_opaque blit row to SkOpts.
This should be a pixel-for-pixel (i.e. bug-for-bug) port.

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1820313002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review URL: https://codereview.chromium.org/1820313002
2016-03-23 06:29:12 -07:00
mtklein
81bb79b7b9 Remove SkNx AVX code. It is not really used. Getting in the way of refactoring.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1679053002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review URL: https://codereview.chromium.org/1679053002
2016-02-08 15:50:22 -08:00
msarett
0dfffbeeec Revert of AVX 2 SrcOver blits: color32, blitmask. (patchset id:450001 of https://codereview.chromium.org/1532613002/ )
Reason for revert:
Bot failures

Original issue's description:
> AVX 2 SrcOver blits: color32, blitmask.
>
> As a follow up to the SSE 4.1 CL, this should look pretty familiar.
>
> I've made some organizational changes around how we load, store, pack, and unpack data that I think make things clearer and more orthogonal, and it'll make it easier to try out a pmaddubsw lerp.  I have backported these changes to the SSE 4.1 code, and I hope that I can actually get a lot of this code templated for sharing between the two later.
>
> Perf changes (relative to SSE 4.1):
> Xfermode_SrcOver:      1650 -> 1180  (0.71x)  // large opaque blit
> Xfermode_SrcOver_aa:   1794 -> 1653  (0.92x)  // large opaque + small transparent
> text_16_AA_{FF,BK,WT}: 1.72 -> 1.59  (0.92x)  // small opaque blit
> text_16_AA_88:         1.83 -> 1.77  (0.97x)  // small transparent blit
>
> This should be a big throughput win, and a small latency win.
> This should all be pixel-exact to the previous SSE 4.1 code.
>
>
> GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1532613002
> CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Ubuntu-GCC-x86_64-Release-CMake-Trybot,Build-Mac10.9-Clang-x86_64-Release-CMake-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/5d2117015eb271e09faf4a7ddd89093c9d618a36

TBR=herb@google.com,mtklein@google.com,mtklein@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true

Review URL: https://codereview.chromium.org/1632713002
2016-01-25 08:54:50 -08:00
mtklein
5d2117015e AVX 2 SrcOver blits: color32, blitmask.
As a follow up to the SSE 4.1 CL, this should look pretty familiar.

I've made some organizational changes around how we load, store, pack, and unpack data that I think make things clearer and more orthogonal, and it'll make it easier to try out a pmaddubsw lerp.  I have backported these changes to the SSE 4.1 code, and I hope that I can actually get a lot of this code templated for sharing between the two later.

Perf changes (relative to SSE 4.1):
Xfermode_SrcOver:      1650 -> 1180  (0.71x)  // large opaque blit
Xfermode_SrcOver_aa:   1794 -> 1653  (0.92x)  // large opaque + small transparent
text_16_AA_{FF,BK,WT}: 1.72 -> 1.59  (0.92x)  // small opaque blit
text_16_AA_88:         1.83 -> 1.77  (0.97x)  // small transparent blit

This should be a big throughput win, and a small latency win.
This should all be pixel-exact to the previous SSE 4.1 code.

GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1532613002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Ubuntu-GCC-x86_64-Release-CMake-Trybot,Build-Mac10.9-Clang-x86_64-Release-CMake-Trybot

Review URL: https://codereview.chromium.org/1532613002
2016-01-25 08:37:30 -08:00
mtklein
084db25d47 float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX.
Xfermode_ColorDodge_aa	10.3ms -> 7.85ms	0.76x
 Xfermode_SoftLight_aa	13.8ms -> 10.2ms	0.74x
 Xfermode_ColorBurn_aa	10.7ms -> 7.82ms	0.73x
    Xfermode_SoftLight	33.6ms -> 23.2ms	0.69x
   Xfermode_ColorDodge	  25ms -> 16.5ms	0.66x
    Xfermode_ColorBurn	26.1ms -> 16.6ms	0.63x

Ought to be no pixel diffs:
https://gold.skia.org/search2?issue=1432903002&unt=true&query=source_type%3Dgm&master=false

Incidental stuff:

I made the SkNx(T) constructors implicit to make writing math expressions simpler.
This allows us to write expressions like
  Sk4f v;
  ...
  v = v*4;
rather than
  Sk4f v;
  ...
  v = v * Sk4f(4);

As written it only works when the constant is on the right-hand side,
so expressions like `(Sk4f(1) - da)` have to stay for now.  I plan on
following up with a CL that lets those become `(1 - da)` too.
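
A minimal sketch of why the constant only works on the right-hand side, using a hypothetical `Vec4f` type rather than the real SkNx code: with an implicit broadcast constructor and member operators, the right operand can be converted from a scalar, but the left operand must already be a vector.

```cpp
#include <cassert>

// Hypothetical 4-lane float vector standing in for Sk4f.
struct Vec4f {
    float lanes[4];

    // Implicit (non-explicit) broadcast constructor: a scalar converts to a vector.
    Vec4f(float s) : lanes{s, s, s, s} {}
    Vec4f(float a, float b, float c, float d) : lanes{a, b, c, d} {}

    // Member operators: only the right-hand operand is eligible for conversion.
    Vec4f operator*(const Vec4f& o) const {
        return Vec4f(lanes[0] * o.lanes[0], lanes[1] * o.lanes[1],
                     lanes[2] * o.lanes[2], lanes[3] * o.lanes[3]);
    }
    Vec4f operator-(const Vec4f& o) const {
        return Vec4f(lanes[0] - o.lanes[0], lanes[1] - o.lanes[1],
                     lanes[2] - o.lanes[2], lanes[3] - o.lanes[3]);
    }
};

// Constant on the right: 4 is converted to Vec4f(4) implicitly.
Vec4f scale_by_four(const Vec4f& v) { return v * 4; }

// Constant on the left: `1 - da` would not compile with member operators,
// so the explicit Vec4f(1) has to stay.
Vec4f one_minus(const Vec4f& da) { return Vec4f(1) - da; }

// Small usage check for the two helpers above.
void example_usage() {
    assert(scale_by_four(Vec4f(1, 2, 3, 4)).lanes[2] == 12.0f);
    assert(one_minus(Vec4f(0.25f)).lanes[0] == 0.75f);
}
```

Making the operators non-member functions is one common way to let both operands convert, though whether the follow-up CL takes that route is not stated here.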

BUG=skia:4117
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Review URL: https://codereview.chromium.org/1432903002
2015-11-11 11:39:09 -08:00
mtklein
46b42f5304 Add empty files to empty targets to appease XCode.
BUG=skia:4117

Review URL: https://codereview.chromium.org/1312703004
2015-08-31 14:21:08 -07:00
mtklein
5141d90796 Add build targets for advanced Intel instruction sets (1 of 3).
CL (1 of 3) adds empty lists in our .gypi,
and builds the files in those empty lists with the appropriate flags.

CL (2 of 3) will have Chrome's GYP and GN files read these lists,
and build them with the appropriate flags.

CL (3 of 3) will add runtime detection and stub files to the lists
with empty Init_sse42(), Init_avx(), Init_avx2() methods.

After that, we should be able to use SSE 4.2, AVX, and AVX2 if desired.
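
The overall shape of this plan might look something like the sketch below. It is only an illustration under stated assumptions: the `opts::fill32` proc, the `CpuFeatures` struct, and the namespaces are hypothetical, and real CPU detection is replaced by a struct passed in by the caller. Only the names Init_sse42(), Init_avx(), and Init_avx2() come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Baseline implementation, compiled with the default flags.
namespace portable {
    void fill32(uint32_t* dst, uint32_t value, size_t n) {
        for (size_t i = 0; i < n; i++) { dst[i] = value; }
    }
}

// Stand-in for code that would live in a file built with SSE 4.2 flags.
namespace sse42 {
    void fill32(uint32_t* dst, uint32_t value, size_t n) {
        for (size_t i = 0; i < n; i++) { dst[i] = value; }
    }
}

namespace opts {
    using Fill32Proc = void (*)(uint32_t*, uint32_t, size_t);

    // Every proc starts out pointing at the portable version.
    Fill32Proc fill32 = portable::fill32;

    // Hypothetical result of runtime CPU detection.
    struct CpuFeatures {
        bool sse42 = false;
        bool avx   = false;
        bool avx2  = false;
    };

    // Each Init_*() swaps in whatever its instruction set can do better.
    void Init_sse42() { fill32 = sse42::fill32; }
    void Init_avx()   {}  // empty stub, as in the plan above
    void Init_avx2()  {}  // empty stub, as in the plan above

    void Init(const CpuFeatures& cpu) {
        if (cpu.sse42) { Init_sse42(); }
        if (cpu.avx)   { Init_avx(); }
        if (cpu.avx2)  { Init_avx2(); }
    }
}

// Small usage check: callers always go through the function pointer.
void example_usage() {
    opts::CpuFeatures cpu;
    cpu.sse42 = true;
    opts::Init(cpu);
    uint32_t buf[4] = {0, 0, 0, 0};
    opts::fill32(buf, 7, 4);
    assert(buf[3] == 7);
}
```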

Some motivation:
  - SSE 4.2 adds some sweet string-oriented methods that
    can help us write fast high quality 32-bit hashes.
  - AVX is SSE doubled, e.g. 8 floats or two SkPMFloat at a time.
  - AVX2 is SSE2 doubled, e.g. 8 pixels at a time.

BUG=skia:4117

Review URL: https://codereview.chromium.org/1290423007
2015-08-24 10:32:02 -07:00
mtklein
9483763793 Remove SkOpts_sse2.cpp.
It's sort of pointless: all our clients that will have SSE2 at runtime have it
unconditionally at compile time, so the functions in namespace portable will
pick up the SSE2 code.  The procs in SkOpts_sse2.o were just duplicate code.

A couple of the procs we had in _sse2.cpp can benefit slightly when compiled
with SSSE3.  I've moved those to _ssse3.cpp.  This should lead to small speedups
on platforms like Linux and Windows that have a baseline of SSE2.

Similarly, I've removed the call to Init_neon() when NEON is available globally... it's a no-op.

Renaming namespace portable to something clearer is TBD.

BUG=skia:4117

Review URL: https://codereview.chromium.org/1294213002
2015-08-18 05:18:53 -07:00
mtklein
b6394746ff Port SkTextureCompression opts to SkOpts
Pretty vanilla translation.  I cleaned up who calls whom a little.
Used to be utils -> opts -> utils, now it's just utils -> opts.

I may follow up with a pass over the NEON code for readability
and to clean up dead code.

This turns on NEON A8->R11EAC conversion for ARMv8.
Unit tests which now hit the NEON code still pass.
I can't find any related bench.

BUG=skia:4117

Review URL: https://codereview.chromium.org/1273103002
2015-08-06 08:17:16 -07:00
mtklein
d029ded92d Port morphology to SkOpts.
Nothing too fancy.

Direction enums become enum classes so they don't get all confused.  An
alternative is to create one single Direction enum that both blur and
morphology opts use.
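
To illustrate the point about enum classes, here is a small sketch with hypothetical names (`BlurDirection`, `MorphDirection`, `blur_step`, `morph_step`); the real enums and their enumerators may differ. Scoped enums with identical enumerator names can coexist, and neither converts to the other or to int.

```cpp
#include <cassert>
#include <type_traits>

// Two scoped enums with the same enumerator names; with plain enums in the
// same scope these names would collide.
enum class BlurDirection  { kX, kY };
enum class MorphDirection { kX, kY };

// Neither converts implicitly to the other, or to int.
static_assert(!std::is_convertible<MorphDirection, BlurDirection>::value,
              "a morphology direction cannot be passed as a blur direction");
static_assert(!std::is_convertible<BlurDirection, int>::value,
              "scoped enums do not decay to int");

// Pixel step between neighbors along the given direction (hypothetical helper).
int blur_step(BlurDirection dir, int rowPixels) {
    assert(rowPixels > 0);
    return dir == BlurDirection::kX ? 1 : rowPixels;
}

// Same idea for morphology; passing a BlurDirection here would not compile.
int morph_step(MorphDirection dir, int rowPixels) {
    assert(rowPixels > 0);
    return dir == MorphDirection::kX ? 1 : rowPixels;
}
```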

BUG=skia:4117

Review URL: https://codereview.chromium.org/1267343004
2015-08-04 14:09:09 -07:00
mtklein
dce5ce4276 Port SkBlurImage opts to SkOpts.
+268 -535 lines

I also rearranged the code a little bit to encapsulate itself better,
mostly replacing static helper functions with lambdas.  This also
let me merge the SSE2 and SSE4.1 code paths.

BUG=skia:4117

Review URL: https://codereview.chromium.org/1264103004
2015-08-04 08:49:21 -07:00
mtklein
490b61569d Port SkXfermode opts to SkOpts.h
Renames Sk4pxXfermode.h to SkXfermode_opts.h,
and refactors it a tiny bit internally.

This moves xfermode optimization from being "compile-time everywhere but NEON"
to simply "runtime everywhere".  I don't anticipate any effect on perf or
correctness.

BUG=skia:4117

Review URL: https://codereview.chromium.org/1264543006
2015-07-31 11:50:27 -07:00
mtklein
7eb0945af2 Port SkUtils opts to SkOpts.
With this new arrangement, the benefits of inlining sk_memset16/32 have changed.

On x86, they're not significantly different, except for small N<=10 where the inlined code is significantly slower.
On ARMv7 with NEON, our custom code is still significantly faster for N>10 (up to 2x faster).  For small N<=10 inlining is still significantly faster.
On ARMv7 without NEON, our custom code is still ridiculously faster (up to 10x) than inlining for N>10, though for small N<=10 inlining is still a little faster.

We were not using the NEON memset16 and memset32 procs on ARMv8.  At first blush, that seems to be an oversight, but if so it's an extremely lucky one.  The ARMv8 code generation for our memset16/32 procs is total garbage, leaving those methods ~8x slower than just inlining the memset, using the compiler's autovectorization.

So, no need to inline any more on x86, and still inline for N<=10 on ARMv7.  Always inline for ARMv8.
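
The resulting policy could be sketched roughly as below. The `InlinePolicy` enum, `fill32` wrapper, `memset32_proc`, and the call counter are all hypothetical names for illustration; only the N<=10 threshold and the per-platform choices come from the measurements above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Which fills to do inline; picked per platform at compile time.
enum class InlinePolicy {
    kNever,      // e.g. x86: always call the out-of-line proc
    kSmallOnly,  // e.g. ARMv7: inline for N <= 10, otherwise call the proc
    kAlways,     // e.g. ARMv8: always inline and let the compiler vectorize
};

constexpr size_t kSmallLimit = 10;

// Counter used only to make the dispatch visible in this sketch.
static int g_proc_calls = 0;

// Stand-in for the custom out-of-line memset32 proc.
static void memset32_proc(uint32_t* dst, uint32_t value, size_t n) {
    g_proc_calls++;
    for (size_t i = 0; i < n; i++) { dst[i] = value; }
}

template <InlinePolicy kPolicy>
inline void fill32(uint32_t* dst, uint32_t value, size_t n) {
    assert(dst != nullptr || n == 0);
    const bool do_inline =
        kPolicy == InlinePolicy::kAlways ||
        (kPolicy == InlinePolicy::kSmallOnly && n <= kSmallLimit);
    if (do_inline) {
        for (size_t i = 0; i < n; i++) { dst[i] = value; }
        return;
    }
    memset32_proc(dst, value, n);
}
```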

BUG=skia:4117

Review URL: https://codereview.chromium.org/1270573002
2015-07-31 10:46:50 -07:00
mtklein
8317a1832f Lay groundwork for SkOpts.
This doesn't really do anything yet.  It's just the CPU detection code, skeleton new .cpp files, and a few little .gyp tweaks.

BUG=skia:4117

Committed: https://skia.googlesource.com/skia/+/ce2c5055cee5d5d3c9fc84c1b3eeed4b4d84a827

Review URL: https://codereview.chromium.org/1255193002
2015-07-30 07:30:16 -07:00
mtklein
56b78a7a2a Revert of Lay groundwork for SkOpts. (patchset id:40001 of https://codereview.chromium.org/1255193002/)
Reason for revert:
Chromium doesn't call SkGraphics::Init().  This setup won't work.

Original issue's description:
> Lay groundwork for SkOpts.
>
> This doesn't really do anything yet.  It's just the CPU detection code, skeleton new .cpp files, and a few little .gyp tweaks.
>
> BUG=skia:4117
>
> Committed: https://skia.googlesource.com/skia/+/ce2c5055cee5d5d3c9fc84c1b3eeed4b4d84a827

TBR=djsollen@google.com
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:4117

Review URL: https://codereview.chromium.org/1261743002
2015-07-27 12:03:23 -07:00
mtklein
ce2c5055ce Lay groundwork for SkOpts.
This doesn't really do anything yet.  It's just the CPU detection code, skeleton new .cpp files, and a few little .gyp tweaks.

BUG=skia:4117

Review URL: https://codereview.chromium.org/1255193002
2015-07-27 10:52:33 -07:00
mtklein
54f313ccb8 Clean up dead xfermode opts code.
Now that SK_SUPPORT_LEGACY_XFERMODES is unused, tons of code becomes dead.

Nothing is needed in opts/ anymore for x86.
We still do runtime NEON detection, which just duplicates Sk4pxXfermode.

TBR=reed@google.com

BUG=skia:

Review URL: https://codereview.chromium.org/1230023011
2015-07-20 07:14:19 -07:00
mtklein
c5e0891029 Replace NEON assembly memset16 and memset32 with intrinsic versions.
According to bench/MemsetBench.cpp, I've got them somewhere between 10% slower
and a percent or two faster than the old assembly.

BUG=skia:

CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot

Review URL: https://codereview.chromium.org/1075003002
2015-04-10 06:24:58 -07:00
mtklein
a67572ff91 Remove ARM assembly memsets.
Step 1 of a zillion in the quest for NEON on iOS,
and step 1 of a different zillion in the Great Assembly Purge.

ios, arm, arm64, arm_v7, arm_v7_neon all build.

BUG=skia:

Review URL: https://codereview.chromium.org/1072063002
2015-04-09 09:16:28 -07:00
scroggo
3e5622764a Add copyright headers to remaining gyp files.
Prevents some PRESUBMIT errors.

TBR=mtklein@google.com

Review URL: https://codereview.chromium.org/1035523003
2015-03-25 10:22:41 -07:00
henrik.smiding
3704df347a Remove SSE2 ColorRect32 code/files
Removes the disabled SSE2 optimization of ColorRect32 and deletes
the two files containing the code.
Measured on both Core Haswell and Atom Silvermont, and only got
some minuscule improvement compared to the default implementation.

Also tried to write a new, ultimate, version of this optimization,
but only got ~5% improvement on ColorRect32-heavy tests.

Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>

Review URL: https://codereview.chromium.org/957433002
2015-02-25 07:37:14 -08:00
mtklein
d160192fd9 Revert of GYP groundwork for half-float opts support. (patchset id:1 of https://codereview.chromium.org/915693002/)
Reason for revert:
Going to punt on 16-bit float support for now.  Can't figure out ARM 64.

Original issue's description:
> GYP groundwork for half-float opts support.
>
> This sets us up with two new opts targets with the immediate goal of adding half-float (SkHalf.h) opts:
>   - opts_neon_fp16: uses hardware support on most ARM chips with NEON to do 4 conversions at a time;
>   - opts_avx: uses hardware support on Intel chips with AVX to do 8 conversions at a time.
>
> opts_avx will be a handy thing to have around later too, especially if we want to work with floats.
>
> This doesn't actually add any new source files to these libraries yet, so they're no-ops for now.
> I'll need to write a parallel change to Chrome's GN and GYPs before we can start adding sources.
>
> This also rolls GYP up to head, to get support for EnableEnhancedInstructionSet: '3' on Windows,
> which is how we turn on AVX there.  There's no Mac-specific flag, so we use OTHER_CPLUSPLUSFLAGS.
>
> BUG=skia:
>
> TBR=reed@google.com
>
> Committed: https://skia.googlesource.com/skia/+/46b80833394d7919cadf2abf2b93802141dd21c5

TBR=reed@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:

Review URL: https://codereview.chromium.org/912223002
2015-02-10 18:18:18 -08:00
mtklein
94aca69d1e Revert of add dummy avx file so xcode will build (patchset id:1 of https://codereview.chromium.org/916563002/)
Reason for revert:
Going to punt on 16-bit float support for now.  Can't figure out ARM 64.

Original issue's description:
> add dummy avx file so xcode will build
>
> BUG=skia:
>
> Committed: https://skia.googlesource.com/skia/+/a0921f2563701d54e4e022de99f2705f4ada8a6e

TBR=reed@google.com
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:

Review URL: https://codereview.chromium.org/912213002
2015-02-10 18:17:17 -08:00
reed
a0921f2563 add dummy avx file so xcode will build
BUG=skia:

Review URL: https://codereview.chromium.org/916563002
2015-02-10 13:39:15 -08:00
mtklein
46b8083339 GYP groundwork for half-float opts support.
This sets us up with two new opts targets with the immediate goal of adding half-float (SkHalf.h) opts:
  - opts_neon_fp16: uses hardware support on most ARM chips with NEON to do 4 conversions at a time;
  - opts_avx: uses hardware support on Intel chips with AVX to do 8 conversions at a time.

opts_avx will be a handy thing to have around later too, especially if we want to work with floats.

This doesn't actually add any new source files to these libraries yet, so they're no-ops for now.
I'll need to write a parallel change to Chrome's GN and GYPs before we can start adding sources.

This also rolls GYP up to head, to get support for EnableEnhancedInstructionSet: '3' on Windows,
which is how we turn on AVX there.  There's no Mac-specific flag, so we use OTHER_CPLUSPLUSFLAGS.

BUG=skia:

TBR=reed@google.com

Review URL: https://codereview.chromium.org/915693002
2015-02-10 09:17:05 -08:00
stephana
4bf1ce2709 Revert of Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset id:1 of https://codereview.chromium.org/873553003/)
Reason for revert:
Reverted the wrong CL.

Original issue's description:
> Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset  id:300001 of https://codereview.chromium.org/874863002/)
>
> Reason for revert:
> This causes a bug on the 'hittestpath' GM on MacMini 4,1
>
> See:
>
> https://gold.skia.org/#/triage/hittestpath?head=0
>
> for details.
>
> Original issue's description:
> > SSE4 opaque blend using intrinsics instead of assembly.
> >
> > Since we had such a hard time with the assembly versions of this blit (to the
> > point that we have them completely disabled everywhere), I thought I'd take
> > a shot at writing a version of the blit using intrinsics.
> >
> > The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> > to skip the blend when the 16 src pixels we consider each loop are all opaque
> > or all transparent.  _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> > all those alphas.
> >
> > It's worth looking to see if we can backport this type of logic to SSE2 using
> > _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
> >
> > My local performance testing doesn't show this to be an unambiguous win
> > (there are probably microbenchmarks and SKPs where we'd be better off just
> > powering through the blend rather than looking at alphas), but the potential
> > does seem tantalizing enough to let skiaperf vet it on the bots.  (< 1.0x is a win.)
> >
> > DM says it draws pixel perfect compared to the old code.
> >
> > Microbenchmarks:
> >                bitmap_RGBA_8888_A_source_stripes_two	  14us -> 14.4us	1.03x
> >              bitmap_RGBA_8888_A_source_stripes_three	14.3us -> 14.5us	1.01x
> >                        bitmap_RGBA_8888_scale_bilerp	61.9us -> 62.2us	1.01x
> > bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp	 102us ->  101us	0.99x
> >                 bitmap_RGBA_8888_scale_rotate_bilerp	 103us ->  101us	0.99x
> >                               bitmap_RGBA_8888_scale	18.4us -> 18.2us	0.99x
> >              bitmap_RGBA_8888_A_scale_rotate_bicubic	  71us ->   70us	0.99x
> >          bitmap_RGBA_8888_update_scale_rotate_bilerp	 103us ->  101us	0.99x
> >               bitmap_RGBA_8888_A_scale_rotate_bilerp	 112us ->  109us	0.98x
> >                     bitmap_RGBA_8888_update_volatile	5.72us -> 5.58us	0.98x
> >                                     bitmap_RGBA_8888	5.73us -> 5.58us	0.97x
> >                              bitmap_RGBA_8888_update	5.78us ->  5.6us	0.97x
> >                      bitmap_RGBA_8888_A_scale_bilerp	70.7us ->   68us	0.96x
> >                     bitmap_RGBA_8888_A_scale_bicubic	23.7us -> 21.8us	0.92x
> >                                   bitmap_RGBA_8888_A	13.9us -> 10.9us	0.78x
> >                     bitmap_RGBA_8888_A_source_opaque	  14us -> 6.29us	0.45x
> >                bitmap_RGBA_8888_A_source_transparent	  14us -> 3.65us	0.26x
> >
> > Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> > the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> > be a decent predictor of real-world impact.
> >
> > BUG=chromium:399842
> >
> > Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
> >
> > CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot
> >
> > Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493
>
> TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
> NOPRESUBMIT=true
> NOTREECHECKS=true
> NOTRY=true
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/4988891a1173cd405bf1c1dd3a3668c451f45e4c

TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842

Review URL: https://codereview.chromium.org/894083002
2015-02-02 10:02:48 -08:00
stephana
4988891a11 Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset id:300001 of https://codereview.chromium.org/874863002/)
Reason for revert:
This causes a bug on the 'hittestpath' GM on MacMini 4,1

See:

https://gold.skia.org/#/triage/hittestpath?head=0

for details.

Original issue's description:
> SSE4 opaque blend using intrinsics instead of assembly.
>
> Since we had such a hard time with the assembly versions of this blit (to the
> point that we have them completely disabled everywhere), I thought I'd take
> a shot at writing a version of the blit using intrinsics.
>
> The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> to skip the blend when the 16 src pixels we consider each loop are all opaque
> or all transparent.  _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> all those alphas.
>
> It's worth looking to see if we can backport this type of logic to SSE2 using
> _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
>
> My local performance testing doesn't show this to be an unambiguous win
> (there are probably microbenchmarks and SKPs where we'd be better off just
> powering through the blend rather than looking at alphas), but the potential
> does seem tantalizing enough to let skiaperf vet it on the bots.  (< 1.0x is a win.)
>
> DM says it draws pixel perfect compared to the old code.
>
> Microbenchmarks:
>                bitmap_RGBA_8888_A_source_stripes_two	  14us -> 14.4us	1.03x
>              bitmap_RGBA_8888_A_source_stripes_three	14.3us -> 14.5us	1.01x
>                        bitmap_RGBA_8888_scale_bilerp	61.9us -> 62.2us	1.01x
> bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp	 102us ->  101us	0.99x
>                 bitmap_RGBA_8888_scale_rotate_bilerp	 103us ->  101us	0.99x
>                               bitmap_RGBA_8888_scale	18.4us -> 18.2us	0.99x
>              bitmap_RGBA_8888_A_scale_rotate_bicubic	  71us ->   70us	0.99x
>          bitmap_RGBA_8888_update_scale_rotate_bilerp	 103us ->  101us	0.99x
>               bitmap_RGBA_8888_A_scale_rotate_bilerp	 112us ->  109us	0.98x
>                     bitmap_RGBA_8888_update_volatile	5.72us -> 5.58us	0.98x
>                                     bitmap_RGBA_8888	5.73us -> 5.58us	0.97x
>                              bitmap_RGBA_8888_update	5.78us ->  5.6us	0.97x
>                      bitmap_RGBA_8888_A_scale_bilerp	70.7us ->   68us	0.96x
>                     bitmap_RGBA_8888_A_scale_bicubic	23.7us -> 21.8us	0.92x
>                                   bitmap_RGBA_8888_A	13.9us -> 10.9us	0.78x
>                     bitmap_RGBA_8888_A_source_opaque	  14us -> 6.29us	0.45x
>                bitmap_RGBA_8888_A_source_transparent	  14us -> 3.65us	0.26x
>
> Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> be a decent predictor of real-world impact.
>
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
>
> CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493

TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842

Review URL: https://codereview.chromium.org/873553003
2015-02-02 09:52:43 -08:00
mtklein
6dbfb21a6c SSE4 opaque blend using intrinsics instead of assembly.
Since we had such a hard time with the assembly versions of this blit (to the
point that we have them completely disabled everywhere), I thought I'd take
a shot at writing a version of the blit using intrinsics.

The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
to skip the blend when the 16 src pixels we consider each loop are all opaque
or all transparent.  _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
all those alphas.
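
The skip logic can be sketched in scalar code as follows. This is not the intrinsics version: the AND/OR reduction over 16 pixels stands in for the ptest check, and `srcover_pixel` and `blit_row_srcover` are hypothetical names. It assumes premultiplied 32-bit pixels with alpha in the top byte, which is what makes skipping fully transparent runs exact.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reference SrcOver for one premultiplied pixel, alpha assumed in the top byte.
static uint32_t srcover_pixel(uint32_t src, uint32_t dst) {
    const uint32_t inv = 255 - (src >> 24);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const uint32_t s = (src >> shift) & 0xFF;
        const uint32_t d = (dst >> shift) & 0xFF;
        uint32_t c = s + (d * inv + 127) / 255;
        if (c > 255) { c = 255; }  // guard against non-premultiplied input
        out |= c << shift;
    }
    return out;
}

// Blend a row, looking at 16 source pixels per loop iteration.
void blit_row_srcover(uint32_t* dst, const uint32_t* src, size_t n) {
    assert((dst != nullptr && src != nullptr) || n == 0);
    while (n >= 16) {
        // Scalar stand-in for the ptest check on 16 pixels.
        uint32_t all_and = 0xFFFFFFFFu, all_or = 0;
        for (int i = 0; i < 16; i++) {
            all_and &= src[i];
            all_or  |= src[i];
        }
        if ((all_or >> 24) == 0) {
            // All transparent: dst is left unchanged, no blend needed.
        } else if ((all_and >> 24) == 0xFF) {
            // All opaque: the result is just the source pixels.
            std::memcpy(dst, src, 16 * sizeof(uint32_t));
        } else {
            for (int i = 0; i < 16; i++) { dst[i] = srcover_pixel(src[i], dst[i]); }
        }
        src += 16;
        dst += 16;
        n   -= 16;
    }
    // Leftover pixels are blended one at a time.
    for (size_t i = 0; i < n; i++) { dst[i] = srcover_pixel(src[i], dst[i]); }
}
```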

It's worth looking to see if we can backport this type of logic to SSE2 using
_mm_movemask_epi8, or up to 32 pixels at a time using AVX.

My local performance testing doesn't show this to be an unambiguous win
(there are probably microbenchmarks and SKPs where we'd be better off just
powering through the blend rather than looking at alphas), but the potential
does seem tantalizing enough to let skiaperf vet it on the bots.  (< 1.0x is a win.)

DM says it draws pixel perfect compared to the old code.

Microbenchmarks:
               bitmap_RGBA_8888_A_source_stripes_two	  14us -> 14.4us	1.03x
             bitmap_RGBA_8888_A_source_stripes_three	14.3us -> 14.5us	1.01x
                       bitmap_RGBA_8888_scale_bilerp	61.9us -> 62.2us	1.01x
bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp	 102us ->  101us	0.99x
                bitmap_RGBA_8888_scale_rotate_bilerp	 103us ->  101us	0.99x
                              bitmap_RGBA_8888_scale	18.4us -> 18.2us	0.99x
             bitmap_RGBA_8888_A_scale_rotate_bicubic	  71us ->   70us	0.99x
         bitmap_RGBA_8888_update_scale_rotate_bilerp	 103us ->  101us	0.99x
              bitmap_RGBA_8888_A_scale_rotate_bilerp	 112us ->  109us	0.98x
                    bitmap_RGBA_8888_update_volatile	5.72us -> 5.58us	0.98x
                                    bitmap_RGBA_8888	5.73us -> 5.58us	0.97x
                             bitmap_RGBA_8888_update	5.78us ->  5.6us	0.97x
                     bitmap_RGBA_8888_A_scale_bilerp	70.7us ->   68us	0.96x
                    bitmap_RGBA_8888_A_scale_bicubic	23.7us -> 21.8us	0.92x
                                  bitmap_RGBA_8888_A	13.9us -> 10.9us	0.78x
                    bitmap_RGBA_8888_A_source_opaque	  14us -> 6.29us	0.45x
               bitmap_RGBA_8888_A_source_transparent	  14us -> 3.65us	0.26x

Running over our ~70 SKP web page captures, this looks like we spend 0.7x
the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
be a decent predictor of real-world impact.

BUG=chromium:399842

Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785

CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot

Review URL: https://codereview.chromium.org/874863002
2015-01-27 14:35:18 -08:00
mtklein
f7069d58fc Split src/opts source lists out of opts.gyp.
This should make it easier to keep our opts.gyp in sync with Chrome's GYP and GN.

BUG=skia:

Landing this without review as a mega-tryjob.
TBR=reed@google.com

Committed: https://skia.googlesource.com/skia/+/c98fe3aa4f8c97c462c0eb6d9106fc37e48d7f82

Review URL: https://codereview.chromium.org/870353003
2015-01-26 18:55:58 -08:00
mtklein
0933725e49 Revert of Split src/opts source lists out of opts.gyp. (patchset id:1 of https://codereview.chromium.org/870353003/)
Reason for revert:
Android Makefiles broken

Original issue's description:
> Split src/opts source lists out of opts.gyp.
>
> This should make it easier to keep our opts.gyp in sync with Chrome's GYP and GN.
>
> BUG=skia:
>
> Landing this without review as a mega-tryjob.
> TBR=reed@google.com
>
> Committed: https://skia.googlesource.com/skia/+/c98fe3aa4f8c97c462c0eb6d9106fc37e48d7f82

TBR=mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:

Review URL: https://codereview.chromium.org/880783002
2015-01-26 18:15:31 -08:00
mtklein
c98fe3aa4f Split src/opts source lists out of opts.gyp.
This should make it easier to keep our opts.gyp in sync with Chrome's GYP and GN.

BUG=skia:

Landing this without review as a mega-tryjob.
TBR=reed@google.com

Review URL: https://codereview.chromium.org/870353003
2015-01-26 18:05:37 -08:00