better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8)) >> 8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits
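
For illustration, here is a scalar C++ sketch of the two formulations. The helper names are hypothetical (not Skia APIs); they only emulate, per 16-bit lane, what the NEON rounding instructions vrshrq_n_u16 and vraddhn_u16 are understood to compute. It assumes x is a product of two bytes, so 0 <= x <= 255*255.

```cpp
#include <cstdint>

// Reference: round-to-nearest division by 255.
constexpr uint32_t div255_reference(uint32_t x) { return (x + 127) / 255; }

// Old formulation: add 128, then add the value to its own high byte, then shift.
constexpr uint32_t div255_three_step(uint32_t x) {
    uint32_t v = x + 128;            // 1) x += 128
    return (v + (v >> 8)) >> 8;      // 2) v>>8, 3) add and shift right 8 more bits
}

// Hypothetical scalar stand-in for one lane of vrshrq_n_u16(x, 8):
// a rounding shift right, i.e. (x + 128) >> 8.
constexpr uint32_t rounding_shift_right_8(uint32_t x) { return (x + 128) >> 8; }

// Hypothetical scalar stand-in for one lane of vraddhn_u16(a, b):
// add, add the rounding constant 128, keep the high byte of the 16-bit sum.
constexpr uint32_t rounding_add_high_half(uint32_t a, uint32_t b) {
    return ((a + b + 128) & 0xFFFFu) >> 8;
}

// New formulation: each rounding operation supplies one of the +128 terms.
constexpr uint32_t div255_two_step(uint32_t x) {
    return rounding_add_high_half(x, rounding_shift_right_8(x));
}

// Exhaustively compare all three over the range of byte*byte products.
bool div255_all_agree() {
    for (uint32_t x = 0; x <= 255u * 255u; ++x) {
        if (div255_two_step(x) != div255_reference(x)) return false;
        if (div255_three_step(x) != div255_reference(x)) return false;
    }
    return true;
}
```

The largest intermediate sum is 65025 + 254 + 128 = 65407, which still fits in 16 bits, so nothing wraps within a lane for inputs in this range.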

On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes.  When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost.

This should now mean that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's exact instead of approximate.  So we should eliminate approxMulDiv255(), but I'll leave that to another CL, as it'll need Blink rebaselines.

This CL should not change GMs or Blink.
https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dgm&master=false

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot

Review URL: https://codereview.chromium.org/1502843002
mtklein 2015-12-07 08:21:11 -08:00 committed by Commit bot
parent a544eda5dd
commit 9d344069c5

@@ -58,9 +58,9 @@ inline Sk4px Sk4px::Wide::addNarrowHi(const Sk16h& other) const {
 }
 inline Sk4px Sk4px::Wide::div255() const {
-    // Calculated as ((x+128) + ((x+128)>>8)) >> 8.
-    auto v = *this + Sk16h(128);
-    return v.addNarrowHi(v>>8);
+    // Calculated as (x + (x+128)>>8 +128) >> 8. The 'r' in each instruction provides each +128.
+    return Sk16b(vcombine_u8(vraddhn_u16(this->fLo.fVec, vrshrq_n_u16(this->fLo.fVec, 8)),
+                             vraddhn_u16(this->fHi.fVec, vrshrq_n_u16(this->fHi.fVec, 8))));
 }
 inline Sk4px Sk4px::alphas() const {