While investigating my bug (skia:4052) I saw this TODO and figured
it'd make me feel better about an otherwise unsuccessful investigation.
This speeds up HardLight and Overlay (same code) by about 15% with SSE, mostly
by rewriting the logic from 1 cheap comparison and 2 expensive div255() calls
to 2 cheap comparisons and 1 expensive div255().
NEON speeds up by a more modest ~3%.
BUG=skia:
Review URL: https://codereview.chromium.org/1230663005
Mostly this is about ergonomics, making it easier to do good operations and hard / impossible to do bad ones.
- SkAlpha / SkPMColor constructors become static factories.
- Remove div255TruncNarrow(), rename div255RoundNarrow() to div255(). In practice we always want to round, and the narrowing to 8-bit is contextually obvious.
- Rename fastMulDiv255Round() approxMulDiv255() to stress it's approximate-ness over its speed. Drop Round for the same reason as above... we should always round.
- Add operator overloads so we don't have to keep throwing in seemingly-random Sk4px() or Sk4px::Wide() casts.
- use operator*() for 8-bit x 8-bit -> 16-bit math. It's always what we want, and there's generally no 8x8->8 alternative.
- MapFoo can take a const Func&. Don't think it makes a big difference, but nice to do.
BUG=skia:
Review URL: https://codereview.chromium.org/1202013002
I haven't figured out a pithy way to have these apply to only classes
originating from SkNx, so let's just remove them. There aren't too
many use cases, and it's not really any less readable without them.
Semantically, this is a no-op.
BUG=skia:
Review URL: https://codereview.chromium.org/1167153002
0x8001 / 0x7fff don't seem to work, but we were close: 0x8000 does.
I plan to use this to implement the Difference xfermode,
and it seems generally handy.
BUG=skia:
Review URL: https://codereview.chromium.org/1133933004
patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001)
This looks quite launchable. radial_gradient3, min of 100 samples:
N5: 985µs -> 946µs
MBP: 395µs -> 279µs
On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table?
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot,Build-Ubuntu-GCC-Arm7-Release-Android_NoNeon-Trybot
Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ec
Review URL: https://codereview.chromium.org/1109643002
Reason for revert:
compile failures.
Original issue's description:
> Mike's radial gradient CL with better float -> int.
>
> patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001)
>
> This looks quite launchable. radial_gradient3, min of 100 samples:
> N5: 985µs -> 946µs
> MBP: 395µs -> 279µs
>
> On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table?
>
> BUG=skia:
>
> CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ecTBR=reed@google.com,robertphillips@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1109883003
patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001)
This looks quite launchable. radial_gradient3, min of 100 samples:
N5: 985µs -> 946µs
MBP: 395µs -> 279µs
On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table?
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot
Review URL: https://codereview.chromium.org/1109643002
These will underly the SkPMFloat-like class for uint16_t components.
Sk4h will back a single-pixel version, and Sk8h any larger number than that.
BUG=skia:
Review URL: https://codereview.chromium.org/1088883005
The primary feature this delivers is SkNf and SkNd for arbitrary power-of-two N. Non-specialized types or types larger than 128 bits should now Just Work (and we can drop in a specialization to make them faster). Sk4s is now just a typedef for SkNf<4, SkScalar>; Sk4d is SkNf<4, double>, Sk2f SkNf<2, float>, etc.
This also makes implementing new specializations easier and more encapsulated. We're now using template specialization, which means the specialized versions don't have to leak out so much from SkNx_sse.h and SkNx_neon.h.
This design leaves us room to grow up, e.g to SkNf<8, SkScalar> == Sk8s, and to grown down too, to things like SkNi<8, uint16_t> == Sk8h.
To simplify things, I've stripped away most APIs (swizzles, casts, reinterpret_casts) that no one's using yet. I will happily add them back if they seem useful.
You shouldn't feel bad about using any of the typedef Sk4s, Sk4f, Sk4d, Sk2s, Sk2f, Sk2d, Sk4i, etc. Here's how you should feel:
- Sk4f, Sk4s, Sk2d: feel awesome
- Sk2f, Sk2s, Sk4d: feel pretty good
No public API changes.
TBR=reed@google.com
BUG=skia:3592
Review URL: https://codereview.chromium.org/1048593002