skia2

Author	SHA1	Message	Date
mtklein	b6394746ff	Port SkTextureCompression opts to SkOpts Pretty vanilla translation. I cleaned up who calls whom a little. Used to be utils -> opts -> utils, now it's just utils -> opts. I may follow up with a pass over the NEON code for readability and to clean up dead code. This turns on NEON A8->R11EAC conversion for ARMv8. Unit tests which now hit the NEON code still pass. I can't find any related bench. BUG=skia:4117 Review URL: https://codereview.chromium.org/1273103002	2015-08-06 08:17:16 -07:00
mtklein	d029ded92d	Port morphology to SkOpts. Nothing too fancy. Direction enums become enum classes so they don't get all confused. An alternative is to create one single Direction enum that both blur and morphology opts use. BUG=skia:4117 Review URL: https://codereview.chromium.org/1267343004	2015-08-04 14:09:09 -07:00
mtklein	dce5ce4276	Port SkBlurImage opts to SkOpts. +268 -535 lines I also rearranged the code a little bit to encapsulate itself better, mostly replacing static helper functions with lambdas. This also let me merge the SSE2 and SSE4.1 code paths. BUG=skia:4117 Review URL: https://codereview.chromium.org/1264103004	2015-08-04 08:49:21 -07:00
mtklein	490b61569d	Port SkXfermode opts to SkOpts.h Renames Sk4pxXfermode.h to SkXfermode_opts.h, and refactors it a tiny bit internally. This moves xfermode optimization from being "compile-time everywhere but NEON" to simply "runtime everywhere". I don't anticipate any effect on perf or correctness. BUG=skia:4117 Review URL: https://codereview.chromium.org/1264543006	2015-07-31 11:50:27 -07:00
mtklein	7eb0945af2	Port SkUtils opts to SkOpts. With this new arrangement, the benefits of inlining sk_memset16/32 have changed. On x86, they're not significantly different, except for small N<=10 where the inlined code is significantly slower. On ARMv7 with NEON, our custom code is still significantly faster for N>10 (up to 2x faster). For small N<=10 inlining is still significantly faster. On ARMv7 without NEON, our custom code is still ridiculously faster (up to 10x) than inlining for N>10, though for small N<=10 inlining is still a little faster. We were not using the NEON memset16 and memset32 procs on ARMv8. At first blush, that seems to be an oversight, but if so it's an extremely lucky one. The ARMv8 code generation for our memset16/32 procs is total garbage, leaving those methods ~8x slower than just inlining the memset, using the compiler's autovectorization. So, no need to inline any more on x86, and still inline for N<=10 on ARMv7. Always inline for ARMv8. BUG=skia:4117 Review URL: https://codereview.chromium.org/1270573002	2015-07-31 10:46:50 -07:00
mtklein	8317a1832f	Lay groundwork for SkOpts. This doesn't really do anything yet. It's just the CPU detection code, skeleton new .cpp files, and a few little .gyp tweaks. BUG=skia:4117 Committed: https://skia.googlesource.com/skia/+/ce2c5055cee5d5d3c9fc84c1b3eeed4b4d84a827 Review URL: https://codereview.chromium.org/1255193002	2015-07-30 07:30:16 -07:00
mtklein	56b78a7a2a	Revert of Lay groundwork for SkOpts. (patchset #3 id:40001 of https://codereview.chromium.org/1255193002/) Reason for revert: Chromium doesn't call SkGraphics::Init(). This setup won't work. Original issue's description: > Lay groundwork for SkOpts. > > This doesn't really do anything yet. It's just the CPU detection code, skeleton new .cpp files, and a few little .gyp tweaks. > > BUG=skia:4117 > > Committed: https://skia.googlesource.com/skia/+/ce2c5055cee5d5d3c9fc84c1b3eeed4b4d84a827 TBR=djsollen@google.com NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia:4117 Review URL: https://codereview.chromium.org/1261743002	2015-07-27 12:03:23 -07:00
mtklein	ce2c5055ce	Lay groundwork for SkOpts. This doesn't really do anything yet. It's just the CPU detection code, skeleton new .cpp files, and a few little .gyp tweaks. BUG=skia:4117 Review URL: https://codereview.chromium.org/1255193002	2015-07-27 10:52:33 -07:00
mtklein	54f313ccb8	Clean up dead xfermode opts code. Now that SK_SUPPORT_LEGACY_XFERMODES is unused, tons of code becomes dead. Nothing is needed in opts/ anymore for x86. We still do runtime NEON detection, which just duplicates Sk4pxXfermode. TBR=reed@google.com BUG=skia: Review URL: https://codereview.chromium.org/1230023011	2015-07-20 07:14:19 -07:00
mtklein	c5e0891029	Replace NEON assembly memset16 and memset32 with intrinsic versions. According to bench/MemsetBench.cpp, I've got them somewhere between 10% slower and a percent or two faster than the old assembly. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot Review URL: https://codereview.chromium.org/1075003002	2015-04-10 06:24:58 -07:00
mtklein	a67572ff91	Remove ARM assembly memsets. Step 1 of a zillion in the quest for NEON on iOS, and step 1 of a different zillion in the Great Assembly Purge. ios, arm, arm64, arm_v7, arm_v7_neon all build. BUG=skia: Review URL: https://codereview.chromium.org/1072063002	2015-04-09 09:16:28 -07:00
scroggo	3e5622764a	Add copyright headers to remaining gyp files. Prevents some PRESUBMIT errors. TBR=mtklein@google.com Review URL: https://codereview.chromium.org/1035523003	2015-03-25 10:22:41 -07:00
henrik.smiding	3704df347a	Remove SSE2 ColorRect32 code/files Removes the disabled SSE2 optimization of ColorRect32 and deletes the two files containing the code. Measured on both Core Haswell and Atom Silvermont, and only got some miniscule improvement compared to the default implementation. Also tried to write a new, ultimate, version of this optimization, but only got ~5% improvement on ColorRect32-heavy tests. Signed-off-by: Henrik Smiding <henrik.smiding@intel.com> Review URL: https://codereview.chromium.org/957433002	2015-02-25 07:37:14 -08:00
mtklein	d160192fd9	Revert of GYP groudwork for half-float opts support. (patchset #1 id:1 of https://codereview.chromium.org/915693002/) Reason for revert: Going to punt on 16-bit float support for now. Can't figure out ARM 64. Original issue's description: > GYP groudwork for half-float opts support. > > This sets us up two new opts targets with the immediate goal of adding half-float (SkHalf.h) opts: > - opts_neon_fp16: uses hardware support on most ARM chips with NEON to do 4 conversions at a time; > - opts_avx: uses hardware support on Intel chips with AVX to do 8 conversions at a time. > > opts_avx will be a handy thing to have around later too, especially if we want to work with floats. > > This doesn't actually add any new source files to these libraries yet, so they're no-ops for now. > I'll need to write a parallel change to Chrome's GN and GYPs before we can start adding sources. > > This also rolls GYP up to head, to get suppport for EnableEnhancedInstructionSet: '3' on Windows, > which is how we turn on AVX there. There's no Mac-specific flag, so we use OTHER_CPLUSPLUSFLAGS. > > BUG=skia: > > TBR=reed@google.com > > Committed: https://skia.googlesource.com/skia/+/46b80833394d7919cadf2abf2b93802141dd21c5 TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/912223002	2015-02-10 18:18:18 -08:00
mtklein	94aca69d1e	Revert of add dummy avx file so xcode will build (patchset #1 id:1 of https://codereview.chromium.org/916563002/) Reason for revert: Going to punt on 16-bit float support for now. Can't figure out ARM 64. Original issue's description: > add dummy avx file so xcode will build > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/a0921f2563701d54e4e022de99f2705f4ada8a6e TBR=reed@google.com NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/912213002	2015-02-10 18:17:17 -08:00
reed	a0921f2563	add dummy avx file so xcode will build BUG=skia: Review URL: https://codereview.chromium.org/916563002	2015-02-10 13:39:15 -08:00
mtklein	46b8083339	GYP groudwork for half-float opts support. This sets us up two new opts targets with the immediate goal of adding half-float (SkHalf.h) opts: - opts_neon_fp16: uses hardware support on most ARM chips with NEON to do 4 conversions at a time; - opts_avx: uses hardware support on Intel chips with AVX to do 8 conversions at a time. opts_avx will be a handy thing to have around later too, especially if we want to work with floats. This doesn't actually add any new source files to these libraries yet, so they're no-ops for now. I'll need to write a parallel change to Chrome's GN and GYPs before we can start adding sources. This also rolls GYP up to head, to get suppport for EnableEnhancedInstructionSet: '3' on Windows, which is how we turn on AVX there. There's no Mac-specific flag, so we use OTHER_CPLUSPLUSFLAGS. BUG=skia: TBR=reed@google.com Review URL: https://codereview.chromium.org/915693002	2015-02-10 09:17:05 -08:00
stephana	4bf1ce2709	Revert of Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #1 id:1 of https://codereview.chromium.org/873553003/) Reason for revert: Reverted the wrong CL. Original issue's description: > Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #16 id:300001 of https://codereview.chromium.org/874863002/) > > Reason for revert: > This causes a bug on the 'hittestpath' GM on MacMini 4,1 > > See: > > https://gold.skia.org/#/triage/hittestpath?head=0 > > for details. > > Original issue's description: > > SSE4 opaque blend using intrinsics instead of assembly. > > > > Since we had such a hard time with the assembly versions of this blit (to the > > point that we have them completely disabled everywhere), I thought I'd take > > a shot at writing a version of the blit using intrinsics. > > > > The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*) > > to skip the blend when the 16 src pixels we consider each loop are all opaque > > or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract > > all those alphas. > > > > It's worth looking to see if we can backport this type of logic to SSE2 using > > _mm_movemask_epi8, or up to 32 pixels at a time using AVX. > > > > My local performance testing doesn't show this to be an unambiguous win > > (there are probably microbenchmarks and SKPs where we'd be better off just > > powering through the blend rather than looking at alphas), but the potential > > does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.) > > > > DM says it draws pixel perfect compare to the old code. > > > > Microbenchmarks: > > bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x > > bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x > > bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x > > bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x > > bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x > > bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x > > bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x > > bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x > > bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x > > bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x > > bitmap_RGBA_8888 5.73us -> 5.58us 0.97x > > bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x > > bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x > > bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x > > bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x > > bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x > > bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x > > > > Running over our ~70 SKP web page captures, this looks like we spend 0.7x > > the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should > > be a decent predictor of real-world impact. > > > > BUG=chromium:399842 > > > > Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785 > > > > CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot > > > > Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493 > > TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org > NOPRESUBMIT=true > NOTREECHECKS=true > NOTRY=true > BUG=chromium:399842 > > Committed: https://skia.googlesource.com/skia/+/4988891a1173cd405bf1c1dd3a3668c451f45e4c TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=chromium:399842 Review URL: https://codereview.chromium.org/894083002	2015-02-02 10:02:48 -08:00
stephana	4988891a11	Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #16 id:300001 of https://codereview.chromium.org/874863002/) Reason for revert: This causes a bug on the 'hittestpath' GM on MacMini 4,1 See: https://gold.skia.org/#/triage/hittestpath?head=0 for details. Original issue's description: > SSE4 opaque blend using intrinsics instead of assembly. > > Since we had such a hard time with the assembly versions of this blit (to the > point that we have them completely disabled everywhere), I thought I'd take > a shot at writing a version of the blit using intrinsics. > > The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*) > to skip the blend when the 16 src pixels we consider each loop are all opaque > or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract > all those alphas. > > It's worth looking to see if we can backport this type of logic to SSE2 using > _mm_movemask_epi8, or up to 32 pixels at a time using AVX. > > My local performance testing doesn't show this to be an unambiguous win > (there are probably microbenchmarks and SKPs where we'd be better off just > powering through the blend rather than looking at alphas), but the potential > does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.) > > DM says it draws pixel perfect compare to the old code. > > Microbenchmarks: > bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x > bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x > bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x > bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x > bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x > bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x > bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x > bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x > bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x > bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x > bitmap_RGBA_8888 5.73us -> 5.58us 0.97x > bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x > bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x > bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x > bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x > bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x > bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x > > Running over our ~70 SKP web page captures, this looks like we spend 0.7x > the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should > be a decent predictor of real-world impact. > > BUG=chromium:399842 > > Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785 > > CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot > > Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493 TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=chromium:399842 Review URL: https://codereview.chromium.org/873553003	2015-02-02 09:52:43 -08:00
mtklein	6dbfb21a6c	SSE4 opaque blend using intrinsics instead of assembly. Since we had such a hard time with the assembly versions of this blit (to the point that we have them completely disabled everywhere), I thought I'd take a shot at writing a version of the blit using intrinsics. The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*) to skip the blend when the 16 src pixels we consider each loop are all opaque or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract all those alphas. It's worth looking to see if we can backport this type of logic to SSE2 using _mm_movemask_epi8, or up to 32 pixels at a time using AVX. My local performance testing doesn't show this to be an unambiguous win (there are probably microbenchmarks and SKPs where we'd be better off just powering through the blend rather than looking at alphas), but the potential does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.) DM says it draws pixel perfect compare to the old code. Microbenchmarks: bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x bitmap_RGBA_8888 5.73us -> 5.58us 0.97x bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x Running over our ~70 SKP web page captures, this looks like we spend 0.7x the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should be a decent predictor of real-world impact. BUG=chromium:399842 Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785 CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot Review URL: https://codereview.chromium.org/874863002	2015-01-27 14:35:18 -08:00
mtklein	f7069d58fc	Split src/opts source lists out of opts.gyp. This should make it easier to keep our opts.gyp in sync with Chrome's GYP and GN. BUG=skia: Landing this without review as a mega-tryjob. TBR=reed@google.com Committed: https://skia.googlesource.com/skia/+/c98fe3aa4f8c97c462c0eb6d9106fc37e48d7f82 Review URL: https://codereview.chromium.org/870353003	2015-01-26 18:55:58 -08:00
mtklein	0933725e49	Revert of Split src/opts source lists out of opts.gyp. (patchset #1 id:1 of https://codereview.chromium.org/870353003/) Reason for revert: Android Makefiles broken Original issue's description: > Split src/opts source lists out of opts.gyp. > > This should make it easier to keep our opts.gyp in sync with Chrome's GYP and GN. > > BUG=skia: > > Landing this without review as a mega-tryjob. > TBR=reed@google.com > > Committed: https://skia.googlesource.com/skia/+/c98fe3aa4f8c97c462c0eb6d9106fc37e48d7f82 TBR=mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/880783002	2015-01-26 18:15:31 -08:00
mtklein	c98fe3aa4f	Split src/opts source lists out of opts.gyp. This should make it easier to keep our opts.gyp in sync with Chrome's GYP and GN. BUG=skia: Landing this without review as a mega-tryjob. TBR=reed@google.com Review URL: https://codereview.chromium.org/870353003	2015-01-26 18:05:37 -08:00

23 Commits