We really only need to_unorm(),
and that's fine with trunc(mad(x, scale, 0.5)).
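
For reference, a minimal sketch of that rounding in Python. The to_unorm(bits, x) signature and the scale of 2^bits - 1 are assumptions for illustration, and x is taken to be in [0, 1].

```python
import math

def mad(x, y, z):
    # Multiply-add: x*y + z.
    return x * y + z

def to_unorm(bits, x):
    # Assumed scale: the largest value representable in `bits` bits.
    scale = (1 << bits) - 1
    # Adding 0.5 and truncating rounds non-negative values to nearest.
    return math.trunc(mad(x, scale, 0.5))
```

This only rounds correctly for non-negative inputs, which is all a unorm conversion needs.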
Change-Id: I1561c678501963a9ae53c22994fc906159fc7199
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/275075
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Mike Klein <mtklein@google.com>
This does open us up to a little bit of possible inconsistency of
rounding when right on an x.5 (sometimes we'll +0.5 and trunc, sometimes
round to nearest, sometimes round according to the default mode which is
usually round to nearest) but I think that inconsistency may be worth
the free register not needing a splat(0.5f) buys us.
A few invisible diffs.
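
A small illustration of where the two styles can disagree. This uses Python's round(), which rounds halves to even, as a stand-in for a hardware round-to-nearest mode; the helper names are made up for this sketch.

```python
import math

def round_by_trunc(x):
    # "+0.5 and trunc": halves always round up (for non-negative x).
    return math.trunc(x + 0.5)

def round_nearest_even(x):
    # Python's round() on floats rounds exact halves to the even neighbor.
    return round(x)
```

The two agree everywhere except exactly on some x.5 values, e.g. 2.5.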
Change-Id: I9af092c937ccf7c5891c2ab3cb298d217e4a9e9f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/253725
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Mike Reed <reed@google.com>
This plumbs through round but doesn't use it. I want that change to be
its own CL. It's nice to have assembler support and the name changes
even if I revert using round.
Change-Id: I6d67ec5c63546069eb7cc1c91599b599bafcda66
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/253724
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Change-Id: I58b52d3e1d05d0834be30e00d991636e227cbf0b
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/230836
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Arg strides are the reason JIT happens lazily in Program::eval() today
instead of proactively in Builder::done() or Program's constructor. It
also just really doesn't make sense to delay this information... it's
not like you can change it up sanely between calls to eval().
The argument index now comes implicitly from the order of calling arg().
This may seem logically independent, but it prevents a weird situation
where you could use the same argument index twice with different
strides... not sure what that would mean.
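
Roughly the shape of the idea, as a toy Python model. This Builder class is hypothetical and is not the real skvm::Builder; it only shows the index coming from call order and the stride being known up front.

```python
class Builder:
    """Toy model: each arg() call declares the next argument and its stride."""

    def __init__(self):
        self.strides = []

    def arg(self, stride):
        # The argument index is implied by call order, so one index can
        # never end up registered twice with different strides.
        self.strides.append(stride)
        return len(self.strides) - 1
```

With every stride recorded by the time the builder is done, nothing about the arguments is left to discover at eval() time.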
Change-Id: I0f5d46e94a1ca112a72675c5492f17c0dd825ce0
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/227390
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
I was just reading the ARM docs and realized that their BIC ("BIt
Clear") is the same as SSE's ANDN ("AND Not") instruction. It's kind of
a neat little tool to have lying around... comes up more than you'd
think, and it's sometimes the clearest way to express what you're doing,
as in the changed program here where the comment is "mask away the low
bits". That's a bit_clear with a mask for what you want to clear away!
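
The operation itself, sketched in Python on 32-bit values. The function name follows the message; the explicit 32-bit masking is only there because Python integers are unbounded.

```python
MASK32 = 0xFFFFFFFF

def bit_clear(x, mask):
    # Clear every bit of x that is set in mask: x & ~mask.
    return x & ~mask & MASK32
```

One thing to watch when lowering: ARM's BIC inverts its second source operand, while x86's ANDN/PANDN inverts its first, so the operand order flips between the two.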
And the real reason to write this up is that I want to have a CL to
point to that shows how to add an instruction top to bottom.
Change-Id: I99690ed9c1009427b3986955e7ae6264de4d215c
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/223120
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Mike Reed <reed@google.com>
- 32x8 i32 add,sub,mul
- add I32_Naive bench/test builder to get better i32 mul coverage
- minor refactoring all over
Change-Id: I13cc19ff37a2da0bcff289ba51baac08f456d6c5
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/222485
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
I'm staring at this assembly,
    vmovups (%rsi), %ymm3
    vpsrld  $24, %ymm3, %ymm4
    vpslld  $16, %ymm4, %ymm15
    vorps   %ymm4, %ymm15, %ymm4
    vpsubw  %ymm4, %ymm0, %ymm4
just knowing that it could be
    vmovups (%rsi), %ymm3
    vpshufb 0x??(%rip), %ymm3, %ymm4
    vpsubw  %ymm4, %ymm0, %ymm4
That is, instead of shifting, shifting, and bit-oring
to create the 0a0a scale factor from ymm3, we could just
byte shuffle directly using some pre-baked control pattern
(stored at the end of the program like other constants).
pshufb lets you arbitrarily remix bytes from its argument and
zero bytes, and NEON has a similar family of vtbl instructions,
even including that same feature of injecting zeroes.
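
To make the equivalence concrete, here is a Python emulation of a 16-byte pshufb next to the shift/shift/or version. The control pattern is my reconstruction for little-endian 32-bit pixels with alpha in the top byte, not necessarily the constant the JIT emits.

```python
import struct

def pshufb(src, control):
    # Each control byte selects a source byte by its low 4 bits,
    # or injects zero when its high bit is set.
    assert len(src) == 16 and len(control) == 16
    return bytes(0 if c & 0x80 else src[c & 0x0F] for c in control)

# Per 32-bit lane: [alpha, 0, alpha, 0] in memory order, i.e. 0a0a as a u32.
CONTROL = bytes(b for lane in range(4)
                  for b in (4 * lane + 3, 0x80, 4 * lane + 3, 0x80))

def splat_alpha_with_shifts(px):
    a = px >> 24          # 000a
    return a | (a << 16)  # 0a0a

pixels = [0x80112233, 0xFF445566, 0x00778899, 0x7FAABBCC]
shuffled = list(struct.unpack("<4I", pshufb(struct.pack("<4I", *pixels), CONTROL)))
```

Both paths produce the same 0a0a value in every lane; the shuffle just does it in one instruction.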
I think I've got this working, and the speedup is great,
from 0.19 to 0.16 ns/px for I32_SWAR, and
from 0.43 to 0.38 ns/px for I32.
Change-Id: Iab850275e826b4187f0efc9495a4b9eab4402c38
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220871
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
Now that we've got shr_16x2, extract(..., 8, splat(0x00ff00ff)) is
better done as shr_16x2(..., 8). This swaps a 16-bit shift in for
the 32-bit shift, a wash, but lets us drop the bit_and at the end,
saving one whole instruction.
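
A quick Python check of why the two forms agree. Both helpers are scalar models of one 32-bit lane, written for this note rather than taken from the VM.

```python
def extract(x, bits, mask):
    # Shift the whole 32-bit value, then mask.
    return (x >> bits) & mask

def shr_16x2(x, bits):
    # Shift each 16-bit half independently; zeroes fill each half's top bits.
    hi = (x >> 16) & 0xFFFF
    lo = x & 0xFFFF
    return ((hi >> bits) << 16) | (lo >> bits)
```

The 16-bit shift already zeroes the top byte of each half, which is exactly what the 0x00ff00ff mask was doing.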
This places I32_SWAR a tiny little bit faster than the code in Opts,
like .19 ns/px vs .20 ns/px for Opts.
Change-Id: I4160dc03ecc8b855c0773a927f1510ad5cbb4b87
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220856
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
This is the final bunny I've got in my hat, I think...
Remembering that none of the s += d*invA adds can overflow,
we can use a single 32-bit add to add them all at once.
This means we don't have to unpack the src pixel into rb/ga
halves. We need only extract the alpha for invA.
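
A sketch of why that works, in Python with made-up helper names. When no 8-bit channel sum exceeds 255, no carry crosses a channel boundary, so one 32-bit add matches four separate channel adds.

```python
def add_per_channel(s, t):
    # Add the four 8-bit channels separately.
    # Precondition: no channel sum exceeds 0xFF.
    out = 0
    for shift in (0, 8, 16, 24):
        a = (s >> shift) & 0xFF
        b = (t >> shift) & 0xFF
        out |= (a + b) << shift
    return out

def add_32(s, t):
    # One plain 32-bit add over the packed pixel.
    return (s + t) & 0xFFFFFFFF
```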
This brings I32_SWAR even with the Opts code!
curr/maxrss  loops  min      median   mean     max      stddev  samples     config        bench
36/36 MB     133    0.206ns  0.211ns  0.208ns  0.211ns  1%      ▁▇▁█▁▇▁▇▁▇  nonrendering  SkVM_4096_I32_SWAR
37/37 MB     152    0.432ns  0.432ns  0.434ns  0.444ns  1%      ▃▁▁▁▁▃▁▁█▁  nonrendering  SkVM_4096_I32
37/37 MB     50     0.781ns  0.794ns  0.815ns  0.895ns  5%      ▆▂█▃▅▂▂▁▂▁  nonrendering  SkVM_4096_F32
37/37 MB     76     0.773ns  0.78ns   0.804ns  0.907ns  6%      ▄█▅▁▁▁▁▂▁▁  nonrendering  SkVM_4096_RP
37/37 MB     268    0.201ns  0.203ns  0.203ns  0.204ns  0%      █▇▆▆▆▆▁▆▆▆  nonrendering  SkVM_4096_Opts
Change-Id: Ibf0a9c5d90b35f1e9cf7265868bd18b7e0a76c43
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220805
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
I figure the easiest way to expose 16-bit operations
is to expose 16x2 pair operations... this means we
can continue to always work with the same size vector.
Switching from 32-bit multiplies to 16-bit multiplies
is going to deliver the most oomph... they cost roughly
half what 32-bit multiplies do on x86.
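
What a 16x2 pair operation means, modeled in Python on a single 32-bit lane. mul_16x2 here is an illustrative name and its wrap-around behavior is an assumption of this sketch.

```python
def split_16x2(x):
    # View a 32-bit value as a (high, low) pair of 16-bit values.
    return (x >> 16) & 0xFFFF, x & 0xFFFF

def join_16x2(hi, lo):
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def mul_16x2(x, y):
    # Multiply the halves independently, keeping the low 16 bits of each product.
    xh, xl = split_16x2(x)
    yh, yl = split_16x2(y)
    return join_16x2(xh * yh, xl * yl)
```

The value stays 32 bits wide throughout, so the vector width never changes; only the interpretation of each lane does.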
Speed now:
I32_SWAR: 0.27 ns/px
I32: 0.43 ns/px
F32: 0.76 ns/px
RP: 0.8 ns/px
Opts: 0.2 ns/px
Change-Id: I8350c71722a9bde714ba18f97b8687fe35cc749f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220709
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
I just kind of remembered that if we're doing (xy+x)/256
and x is a destination channel and y is 255-sa, then you
can get the +x for free by multiplying by 256-sa instead.
    d * (255-sa) + d
  = d * (255-sa + 1)
  = d * (256-sa)
Duh. This is a trick we play in a lot of legacy code and
I've just now realized it's exactly equivalent to the trick
I want to play here... sigh.
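
The same identity checked exhaustively in Python. The function names are just for this note.

```python
def scale_two_step(d, sa):
    # (x*y + x) / 256 with x = d and y = 255 - sa.
    return (d * (255 - sa) + d) // 256

def scale_folded(d, sa):
    # The +d folded into the multiplier.
    return (d * (256 - sa)) // 256

mismatches = [(d, sa) for d in range(256) for sa in range(256)
              if scale_two_step(d, sa) != scale_folded(d, sa)]
```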
Folding this math in kind of makes mul/mad_unorm8 moot.
Speed's getting good:
I32_SWAR: 0.3 ns/px
I32 : 0.55 ns/px
F32 : 0.8 ns/px
RP : 0.8 ns/px
Opts : 0.2 ns/px
Change-Id: I4d10db51ea80a3258c36e97b6b334ad253804613
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220708
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
The mask-only special case for extract is wrong...
it never looked at its input!
This not only makes things correct-er, but oddly it also
makes them faster by breaking inter-loop data dependencies.
Disable tests for _I32... they're actually still broken
because of a much more systemic flaw in how I've evaluated
programs. The _F32 and _I32_SWAR JIT code and all interpreted
code is just getting lucky. o_O
While here, update the I32_SWAR code to use the same math as I32,
(x*y+x)/256 for unorm8 mul. This just helps keep me sane.
Change-Id: I1acc09adb84c426fca4b2be5ca8c2d46d9678dd8
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220577
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
Convert extract(x,bits,z) to be (x >> bits) & z,
now a more explicit parallel to pack().
This lets us eliminate the funky bit counting required from the old
instruction, but more saliently it makes it more likely that the masks
we AND with will be the same value.
Ultimately down at the x86 or ARM ISA level, the AND instructions don't
really benefit from having an immediate argument (while the shifts do).
We might as well treat the mask as a normal value, letting it get
commoned with identical values, loop hoisted, etc.
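
A Python sketch of the new form, pulling four 8-bit channels out of a pixel. The channel order and the pixel value are arbitrary choices for the example.

```python
def extract(x, bits, z):
    # Shift first, then AND with an ordinary value z.
    return (x >> bits) & z

px = 0xAABBCCDD
MASK = 0xFF  # one value shared by every channel extraction

r, g, b, a = (extract(px, shift, MASK) for shift in (0, 8, 16, 24))
```

All four extractions AND with the same 0xFF, where masking before shifting would have needed four different masks.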
Change-Id: I48a38468b46f2c730574c025f412262296472447
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/219597
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
At some point adding more and more complex instructions reduces
to the absurdity of SolveTheWholeProblem-The-Instruction, but
I think this one will come up often enough to still make sense.
mad() makes sense for unorm8 just about everywhere mad() makes
sense for f32.
This instruction won't matter to a JIT, but helps the interpreter.
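
A scalar sketch in Python, assuming a unorm8 multiply of the form (x*y+x)/256. That form is an assumption here, and the rounding used by the instruction at this commit may differ.

```python
def mul_unorm8(x, y):
    # Assumed approximation of x*y/255.
    return (x * y + x) // 256

def mad_unorm8(x, y, z):
    # One fused instruction instead of a mul_unorm8 followed by an add.
    return mul_unorm8(x, y) + z
```

For an interpreter the win is one dispatch instead of two; a JIT would emit the same machine code either way.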
Change-Id: Iace92296cffbb6fbc3acd1f853cb01c51792f796
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218716
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
Kind of the flip side of pack.
Made slightly awkward by instructions having only one immediate...
calling _BitScanForward / __builtin_ctz() at runtime seems to work
fine, but it really could have been done at compile time.
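
A guess at the shape of this version in Python: the mask is the one immediate, and the shift is recovered from it by counting trailing zeros. That the instruction masks and then shifts is my assumption.

```python
def ctz(mask):
    # Count trailing zero bits; stands in for __builtin_ctz / _BitScanForward.
    assert mask != 0
    return (mask & -mask).bit_length() - 1

def extract(x, mask):
    # Mask out the field, then shift it down to bit 0.
    return (x & mask) >> ctz(mask)
```

Since the mask is a constant, ctz(mask) is a constant too, which is why it could have been computed once at build time.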
Change-Id: Ic83fe8e0a1603fb9189598dcc26c842cc797bf45
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218241
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
This instruction can lower to some useful SSE/NEON
instructions, and even if not, is a handy way to
express the frequent pairing of << and |.
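
A Python sketch of the idea, assuming pack(x, y, bits) means x | (y << bits). The signature is inferred from the description, not quoted from the code.

```python
def pack(x, y, bits):
    # Combine two values whose bits do not overlap once y is shifted up.
    return x | (y << bits)

r, g, b, a = 0xDD, 0xCC, 0xBB, 0xAA
rg = pack(r, g, 8)
ba = pack(b, a, 8)
px = pack(rg, ba, 16)
```

Three packs assemble a pixel from four channels, where spelling it out would take three shifts and three ors.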
I32_SWAR: 2.3 -> 1.9 ns/px
I32: 2.6 -> 2.4 ns/px
F32: 5.1 -> 4.7 ns/px
Change-Id: Ia169ad40f0aaef32417e05d9bf91c2d2542e7b5f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218238
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
Another way for an interpreter to go faster
is to provide better instructions.
mul_unorm8 is one we use all the time.
Drops _I32 bench from ~3.6ns/px to ~2.6ns/px.
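
For a sense of what such an instruction computes, a Python sketch comparing an approximate unorm8 multiply against exact rounding. The (x*y+x)/256 form is an assumption here, and the instruction at this commit may round differently.

```python
def mul_unorm8(x, y):
    # Assumed approximation of x*y/255 using a shift-friendly divide by 256.
    return (x * y + x) // 256

def mul_unorm8_exact(x, y):
    # x*y/255 rounded to nearest, in integer math.
    return (x * y + 127) // 255

max_error = max(abs(mul_unorm8(x, y) - mul_unorm8_exact(x, y))
                for x in range(256) for y in range(256))
```

This form is never off by more than one from exact rounding, and multiplying by 255 on either side is an exact identity.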
Change-Id: I9d08914c114048b79075796af9ec802236b35706
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218236
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
Eliminate the duplicate functionality,
and add better testing for the bench builders.
Change-Id: If20e52107738903f854aec431416e573d7a7d640
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218041
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>