14 Commits

Author SHA1 Message Date
mtklein
c5e0891029 Replace NEON assembly memset16 and memset32 with intrinsic versions.
According to bench/MemsetBench.cpp, I've got them somewhere between 10% slower
and a percent or two faster than the old assembly.
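
For orientation, here is a minimal portable sketch of what memset16 and memset32 compute. It is not the NEON intrinsic code from this change, which is not shown on this page. The names memset16_reference and memset32_reference are hypothetical; a scalar reference like this is the sort of thing an intrinsic version could be checked against.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference (not Skia's code): store `value` into
// `count` consecutive 16-bit slots starting at `dst`.
void memset16_reference(uint16_t* dst, uint16_t value, std::size_t count) {
    assert(dst != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = value;
    }
}

// Same idea for 32-bit slots.
void memset32_reference(uint32_t* dst, uint32_t value, std::size_t count) {
    assert(dst != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = value;
    }
}
```

Unlike the standard memset, which replicates a single byte, these fill with a whole 16-bit or 32-bit pattern, which is why they need their own routines.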

BUG=skia:

CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot

Review URL: https://codereview.chromium.org/1075003002
2015-04-10 06:24:58 -07:00
mtklein
a67572ff91 Remove ARM assembly memsets.
Step 1 of a zillion in the quest for NEON on iOS,
and step 1 of a different zillion in the Great Assembly Purge.

ios, arm, arm64, arm_v7, arm_v7_neon all build.

BUG=skia:

Review URL: https://codereview.chromium.org/1072063002
2015-04-09 09:16:28 -07:00
scroggo
3e5622764a Add copyright headers to remaining gyp files.
Prevents some PRESUBMIT errors.

TBR=mtklein@google.com

Review URL: https://codereview.chromium.org/1035523003
2015-03-25 10:22:41 -07:00
henrik.smiding
3704df347a Remove SSE2 ColorRect32 code/files
Removes the disabled SSE2 optimization of ColorRect32 and deletes
the two files containing the code.
Measured on both Core Haswell and Atom Silvermont, and only got
some minuscule improvement compared to the default implementation.

Also tried to write a new, ultimate, version of this optimization,
but only got ~5% improvement on ColorRect32-heavy tests.

Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>

Review URL: https://codereview.chromium.org/957433002
2015-02-25 07:37:14 -08:00
mtklein
d160192fd9 Revert of GYP groundwork for half-float opts support. (patchset #1 id:1 of https://codereview.chromium.org/915693002/)
Reason for revert:
Going to punt on 16-bit float support for now.  Can't figure out ARM 64.

Original issue's description:
> GYP groundwork for half-float opts support.
>
> This sets us up two new opts targets with the immediate goal of adding half-float (SkHalf.h) opts:
>   - opts_neon_fp16: uses hardware support on most ARM chips with NEON to do 4 conversions at a time;
>   - opts_avx: uses hardware support on Intel chips with AVX to do 8 conversions at a time.
>
> opts_avx will be a handy thing to have around later too, especially if we want to work with floats.
>
> This doesn't actually add any new source files to these libraries yet, so they're no-ops for now.
> I'll need to write a parallel change to Chrome's GN and GYPs before we can start adding sources.
>
> This also rolls GYP up to head, to get support for EnableEnhancedInstructionSet: '3' on Windows,
> which is how we turn on AVX there.  There's no Mac-specific flag, so we use OTHER_CPLUSPLUSFLAGS.
>
> BUG=skia:
>
> TBR=reed@google.com
>
> Committed: https://skia.googlesource.com/skia/+/46b80833394d7919cadf2abf2b93802141dd21c5

TBR=reed@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:

Review URL: https://codereview.chromium.org/912223002
2015-02-10 18:18:18 -08:00
mtklein
94aca69d1e Revert of add dummy avx file so xcode will build (patchset #1 id:1 of https://codereview.chromium.org/916563002/)
Reason for revert:
Going to punt on 16-bit float support for now.  Can't figure out ARM 64.

Original issue's description:
> add dummy avx file so xcode will build
>
> BUG=skia:
>
> Committed: https://skia.googlesource.com/skia/+/a0921f2563701d54e4e022de99f2705f4ada8a6e

TBR=reed@google.com
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:

Review URL: https://codereview.chromium.org/912213002
2015-02-10 18:17:17 -08:00
reed
a0921f2563 add dummy avx file so xcode will build
BUG=skia:

Review URL: https://codereview.chromium.org/916563002
2015-02-10 13:39:15 -08:00
mtklein
46b8083339 GYP groundwork for half-float opts support.
This sets us up two new opts targets with the immediate goal of adding half-float (SkHalf.h) opts:
  - opts_neon_fp16: uses hardware support on most ARM chips with NEON to do 4 conversions at a time;
  - opts_avx: uses hardware support on Intel chips with AVX to do 8 conversions at a time.
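
As a point of reference for what these conversions compute, here is a minimal scalar sketch of IEEE 754 half-to-float conversion. It is not the SkHalf.h code and uses no NEON or AVX hardware support; it converts one value at a time where the targets above would do 4 or 8. The names half_to_float and halves_to_floats are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical scalar reference: decode one IEEE 754 binary16 value.
// Layout: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
float half_to_float(uint16_t h) {
    const float sign = (h & 0x8000u) ? -1.0f : 1.0f;
    const int exponent = (h >> 10) & 0x1F;
    const int mantissa = h & 0x3FF;

    if (exponent == 0x1F) {  // infinity or NaN
        if (mantissa != 0) {
            return std::numeric_limits<float>::quiet_NaN();
        }
        return sign * std::numeric_limits<float>::infinity();
    }
    if (exponent == 0) {  // zero or subnormal: mantissa * 2^-24
        return sign * std::ldexp(static_cast<float>(mantissa), -24);
    }
    // Normal: (1 + mantissa/1024) * 2^(exponent-15)
    //       = (1024 + mantissa) * 2^(exponent-25)
    return sign * std::ldexp(static_cast<float>(1024 + mantissa), exponent - 25);
}

// Convert `count` halves; a SIMD version would process several per step.
void halves_to_floats(const uint16_t* src, float* dst, std::size_t count) {
    assert((src != nullptr && dst != nullptr) || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = half_to_float(src[i]);
    }
}
```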

opts_avx will be a handy thing to have around later too, especially if we want to work with floats.

This doesn't actually add any new source files to these libraries yet, so they're no-ops for now.
I'll need to write a parallel change to Chrome's GN and GYPs before we can start adding sources.

This also rolls GYP up to head, to get support for EnableEnhancedInstructionSet: '3' on Windows,
which is how we turn on AVX there.  There's no Mac-specific flag, so we use OTHER_CPLUSPLUSFLAGS.

BUG=skia:

TBR=reed@google.com

Review URL: https://codereview.chromium.org/915693002
2015-02-10 09:17:05 -08:00
stephana
4bf1ce2709 Revert of Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #1 id:1 of https://codereview.chromium.org/873553003/)
Reason for revert:
Reverted the wrong CL.

Original issue's description:
> Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #16 id:300001 of https://codereview.chromium.org/874863002/)
>
> Reason for revert:
> This causes a bug on the 'hittestpath' GM on MacMini 4,1
>
> See:
>
> https://gold.skia.org/#/triage/hittestpath?head=0
>
> for details.
>
> Original issue's description:
> > SSE4 opaque blend using intrinsics instead of assembly.
> >
> > Since we had such a hard time with the assembly versions of this blit (to the
> > point that we have them completely disabled everywhere), I thought I'd take
> > a shot at writing a version of the blit using intrinsics.
> >
> > The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> > to skip the blend when the 16 src pixels we consider each loop are all opaque
> > or all transparent.  _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> > all those alphas.
> >
> > It's worth looking to see if we can backport this type of logic to SSE2 using
> > _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
> >
> > My local performance testing doesn't show this to be an unambiguous win
> > (there are probably microbenchmarks and SKPs where we'd be better off just
> > powering through the blend rather than looking at alphas), but the potential
> > does seem tantalizing enough to let skiaperf vet it on the bots.  (< 1.0x is a win.)
> >
> > DM says it draws pixel-perfect compared to the old code.
> >
> > Microbenchmarks:
> >                bitmap_RGBA_8888_A_source_stripes_two	  14us -> 14.4us	1.03x
> >              bitmap_RGBA_8888_A_source_stripes_three	14.3us -> 14.5us	1.01x
> >                        bitmap_RGBA_8888_scale_bilerp	61.9us -> 62.2us	1.01x
> > bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp	 102us ->  101us	0.99x
> >                 bitmap_RGBA_8888_scale_rotate_bilerp	 103us ->  101us	0.99x
> >                               bitmap_RGBA_8888_scale	18.4us -> 18.2us	0.99x
> >              bitmap_RGBA_8888_A_scale_rotate_bicubic	  71us ->   70us	0.99x
> >          bitmap_RGBA_8888_update_scale_rotate_bilerp	 103us ->  101us	0.99x
> >               bitmap_RGBA_8888_A_scale_rotate_bilerp	 112us ->  109us	0.98x
> >                     bitmap_RGBA_8888_update_volatile	5.72us -> 5.58us	0.98x
> >                                     bitmap_RGBA_8888	5.73us -> 5.58us	0.97x
> >                              bitmap_RGBA_8888_update	5.78us ->  5.6us	0.97x
> >                      bitmap_RGBA_8888_A_scale_bilerp	70.7us ->   68us	0.96x
> >                     bitmap_RGBA_8888_A_scale_bicubic	23.7us -> 21.8us	0.92x
> >                                   bitmap_RGBA_8888_A	13.9us -> 10.9us	0.78x
> >                     bitmap_RGBA_8888_A_source_opaque	  14us -> 6.29us	0.45x
> >                bitmap_RGBA_8888_A_source_transparent	  14us -> 3.65us	0.26x
> >
> > Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> > the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> > be a decent predictor of real-world impact.
> >
> > BUG=chromium:399842
> >
> > Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
> >
> > CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot
> >
> > Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493
>
> TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
> NOPRESUBMIT=true
> NOTREECHECKS=true
> NOTRY=true
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/4988891a1173cd405bf1c1dd3a3668c451f45e4c

TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842

Review URL: https://codereview.chromium.org/894083002
2015-02-02 10:02:48 -08:00
stephana
4988891a11 Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #16 id:300001 of https://codereview.chromium.org/874863002/)
Reason for revert:
This causes a bug on the 'hittestpath' GM on MacMini 4,1

See:

https://gold.skia.org/#/triage/hittestpath?head=0

for details.

Original issue's description:
> SSE4 opaque blend using intrinsics instead of assembly.
>
> Since we had such a hard time with the assembly versions of this blit (to the
> point that we have them completely disabled everywhere), I thought I'd take
> a shot at writing a version of the blit using intrinsics.
>
> The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> to skip the blend when the 16 src pixels we consider each loop are all opaque
> or all transparent.  _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> all those alphas.
>
> It's worth looking to see if we can backport this type of logic to SSE2 using
> _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
>
> My local performance testing doesn't show this to be an unambiguous win
> (there are probably microbenchmarks and SKPs where we'd be better off just
> powering through the blend rather than looking at alphas), but the potential
> does seem tantalizing enough to let skiaperf vet it on the bots.  (< 1.0x is a win.)
>
> DM says it draws pixel-perfect compared to the old code.
>
> Microbenchmarks:
>                bitmap_RGBA_8888_A_source_stripes_two	  14us -> 14.4us	1.03x
>              bitmap_RGBA_8888_A_source_stripes_three	14.3us -> 14.5us	1.01x
>                        bitmap_RGBA_8888_scale_bilerp	61.9us -> 62.2us	1.01x
> bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp	 102us ->  101us	0.99x
>                 bitmap_RGBA_8888_scale_rotate_bilerp	 103us ->  101us	0.99x
>                               bitmap_RGBA_8888_scale	18.4us -> 18.2us	0.99x
>              bitmap_RGBA_8888_A_scale_rotate_bicubic	  71us ->   70us	0.99x
>          bitmap_RGBA_8888_update_scale_rotate_bilerp	 103us ->  101us	0.99x
>               bitmap_RGBA_8888_A_scale_rotate_bilerp	 112us ->  109us	0.98x
>                     bitmap_RGBA_8888_update_volatile	5.72us -> 5.58us	0.98x
>                                     bitmap_RGBA_8888	5.73us -> 5.58us	0.97x
>                              bitmap_RGBA_8888_update	5.78us ->  5.6us	0.97x
>                      bitmap_RGBA_8888_A_scale_bilerp	70.7us ->   68us	0.96x
>                     bitmap_RGBA_8888_A_scale_bicubic	23.7us -> 21.8us	0.92x
>                                   bitmap_RGBA_8888_A	13.9us -> 10.9us	0.78x
>                     bitmap_RGBA_8888_A_source_opaque	  14us -> 6.29us	0.45x
>                bitmap_RGBA_8888_A_source_transparent	  14us -> 3.65us	0.26x
>
> Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> be a decent predictor of real-world impact.
>
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
>
> CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493

TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842

Review URL: https://codereview.chromium.org/873553003
2015-02-02 09:52:43 -08:00
mtklein
6dbfb21a6c SSE4 opaque blend using intrinsics instead of assembly.
Since we had such a hard time with the assembly versions of this blit (to the
point that we have them completely disabled everywhere), I thought I'd take
a shot at writing a version of the blit using intrinsics.

The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
to skip the blend when the 16 src pixels we consider each loop are all opaque
or all transparent.  _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
all those alphas.

It's worth looking to see if we can backport this type of logic to SSE2 using
_mm_movemask_epi8, or up to 32 pixels at a time using AVX.
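
The skip logic described above can be sketched in portable scalar code. This is only an illustration of the idea, not the SSE4 implementation from this change: it uses no intrinsics, and it assumes premultiplied 32-bit pixels with alpha in the top byte. The names classify_alphas, srcover_one and blit_row_srcover are hypothetical, and the rounding in srcover_one is a simple choice that need not match Skia's.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class BlockKind { kAllTransparent, kAllOpaque, kMixed };

// Classify a block by ANDing and ORing all pixels, then looking only at
// the alpha byte (assumed to be bits 24..31).
BlockKind classify_alphas(const uint32_t* src, std::size_t n) {
    uint32_t all_and = 0xFFFFFFFFu;
    uint32_t all_or = 0;
    for (std::size_t i = 0; i < n; ++i) {
        all_and &= src[i];
        all_or |= src[i];
    }
    if ((all_or >> 24) == 0) return BlockKind::kAllTransparent;
    if ((all_and >> 24) == 0xFF) return BlockKind::kAllOpaque;
    return BlockKind::kMixed;
}

// Premultiplied src-over for one pixel: out = s + d * (255 - alpha) / 255.
uint32_t srcover_one(uint32_t s, uint32_t d) {
    const uint32_t inv = 255 - (s >> 24);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const uint32_t sc = (s >> shift) & 0xFF;
        const uint32_t dc = (d >> shift) & 0xFF;
        uint32_t c = sc + (dc * inv + 127) / 255;
        if (c > 255) c = 255;  // clamp in case the input is not premultiplied
        out |= c << shift;
    }
    return out;
}

// Blend a row 16 pixels at a time, skipping the per-pixel math when a
// whole block is fully transparent (no-op) or fully opaque (plain copy).
void blit_row_srcover(uint32_t* dst, const uint32_t* src, std::size_t count) {
    const std::size_t kBlock = 16;
    while (count > 0) {
        const std::size_t n = count < kBlock ? count : kBlock;
        switch (classify_alphas(src, n)) {
            case BlockKind::kAllTransparent:
                break;  // premultiplied transparent src leaves dst unchanged
            case BlockKind::kAllOpaque:
                for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
                break;
            case BlockKind::kMixed:
                for (std::size_t i = 0; i < n; ++i) {
                    dst[i] = srcover_one(src[i], dst[i]);
                }
                break;
        }
        src += n;
        dst += n;
        count -= n;
    }
}
```

The payoff depends on the input: rows that are mostly opaque or mostly transparent skip nearly all blending, while rows with mixed alpha pay for the classification on top of the blend, which fits the mixed microbenchmark results reported in this message.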

My local performance testing doesn't show this to be an unambiguous win
(there are probably microbenchmarks and SKPs where we'd be better off just
powering through the blend rather than looking at alphas), but the potential
does seem tantalizing enough to let skiaperf vet it on the bots.  (< 1.0x is a win.)

DM says it draws pixel-perfect compared to the old code.

Microbenchmarks:
               bitmap_RGBA_8888_A_source_stripes_two	  14us -> 14.4us	1.03x
             bitmap_RGBA_8888_A_source_stripes_three	14.3us -> 14.5us	1.01x
                       bitmap_RGBA_8888_scale_bilerp	61.9us -> 62.2us	1.01x
bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp	 102us ->  101us	0.99x
                bitmap_RGBA_8888_scale_rotate_bilerp	 103us ->  101us	0.99x
                              bitmap_RGBA_8888_scale	18.4us -> 18.2us	0.99x
             bitmap_RGBA_8888_A_scale_rotate_bicubic	  71us ->   70us	0.99x
         bitmap_RGBA_8888_update_scale_rotate_bilerp	 103us ->  101us	0.99x
              bitmap_RGBA_8888_A_scale_rotate_bilerp	 112us ->  109us	0.98x
                    bitmap_RGBA_8888_update_volatile	5.72us -> 5.58us	0.98x
                                    bitmap_RGBA_8888	5.73us -> 5.58us	0.97x
                             bitmap_RGBA_8888_update	5.78us ->  5.6us	0.97x
                     bitmap_RGBA_8888_A_scale_bilerp	70.7us ->   68us	0.96x
                    bitmap_RGBA_8888_A_scale_bicubic	23.7us -> 21.8us	0.92x
                                  bitmap_RGBA_8888_A	13.9us -> 10.9us	0.78x
                    bitmap_RGBA_8888_A_source_opaque	  14us -> 6.29us	0.45x
               bitmap_RGBA_8888_A_source_transparent	  14us -> 3.65us	0.26x
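
The last column of the table reads as new time divided by old time, so values below 1.0x are wins. A minimal sketch of that arithmetic follows; the helper names time_ratio and is_win are hypothetical and not part of any benchmark tool.

```cpp
#include <cassert>
#include <cmath>

// Ratio of the new time to the old time; both in the same unit (here us).
double time_ratio(double before_us, double after_us) {
    assert(before_us > 0.0);
    return after_us / before_us;
}

// A ratio below 1.0 means the new code is faster.
bool is_win(double ratio) {
    return ratio < 1.0;
}
```

For example, the bitmap_RGBA_8888_A_source_opaque row gives 6.29 / 14, which rounds to the 0.45x shown.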

Running over our ~70 SKP web page captures, this looks like we spend 0.7x
the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
be a decent predictor of real-world impact.

BUG=chromium:399842

Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785

CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot

Review URL: https://codereview.chromium.org/874863002
2015-01-27 14:35:18 -08:00
mtklein
f7069d58fc Split src/opts source lists out of opts.gyp.
This should make it easier to keep our opts.gyp in sync with Chrome's GYP and GN.

BUG=skia:

Landing this without review as a mega-tryjob.
TBR=reed@google.com

Committed: https://skia.googlesource.com/skia/+/c98fe3aa4f8c97c462c0eb6d9106fc37e48d7f82

Review URL: https://codereview.chromium.org/870353003
2015-01-26 18:55:58 -08:00
mtklein
0933725e49 Revert of Split src/opts source lists out of opts.gyp. (patchset #1 id:1 of https://codereview.chromium.org/870353003/)
Reason for revert:
Android Makefiles broken

Original issue's description:
> Split src/opts source lists out of opts.gyp.
>
> This should make it easier to keep our opts.gyp in sync with Chrome's GYP and GN.
>
> BUG=skia:
>
> Landing this without review as a mega-tryjob.
> TBR=reed@google.com
>
> Committed: https://skia.googlesource.com/skia/+/c98fe3aa4f8c97c462c0eb6d9106fc37e48d7f82

TBR=mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:

Review URL: https://codereview.chromium.org/880783002
2015-01-26 18:15:31 -08:00
mtklein
c98fe3aa4f Split src/opts source lists out of opts.gyp.
This should make it easier to keep our opts.gyp in sync with Chrome's GYP and GN.

BUG=skia:

Landing this without review as a mega-tryjob.
TBR=reed@google.com

Review URL: https://codereview.chromium.org/870353003
2015-01-26 18:05:37 -08:00