Please see if this looks usable. It may even give a perf boost if you use it, even without custom implementations for each instruction set.
I've been trying this morning to beat this naive loop implementation, but so far no luck with either _SSE2.h or _SSSE3.h. It's possible this is an artifact of the microbenchmark, because we're not doing anything between the conversions. I'd like to see how this fits into real code, what assembly's generated, what the hot spots are, etc.
I've updated the tests to test these new APIs, and splintered off a pair of new benchmarks that use the new APIs. This required some minor rejiggering in the benches.
BUG=skia:
Review URL: https://codereview.chromium.org/978213003
Instead of set(SkPMColor), add a constructor SkPMFloat(SkPMColor).
Replace setA(), setR(), etc. with a 4 float constructor.
And, promise to stick to SkPMColor order.
BUG=skia:
Review URL: https://codereview.chromium.org/977773002
SSE rounds for free (that was a happy accident: they also have a truncating version).
NEON does not, nor obviously the portable code, so they add 0.5 before truncating.
NOPRESUBMIT=true
BUG=skia:
Review URL: https://codereview.chromium.org/974643002
This pushes the cost of the *255 and *1/255 conversions onto only those code
paths that need it. We're not doing it any more efficiently than can be done
with Sk4f.
In microbenchmark isolation, this is about a 15% speedup.
BUG=skia:
NOPRESUBMIT=true
Review URL: https://codereview.chromium.org/973603002