Commit Graph

47 Commits

Author SHA1 Message Date
Mike Klein
7c0332cd35 re-enable fnma
- hook up fmls.4s as fnma_f32
 - add fneg.4s
 - use fneg.4s + fmls.4s to impl fms_f32
 - more tests to exercise these

Change-Id: I60173a5e4618ab968a9361e15334a1d63c001372
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/275412
Commit-Queue: Herb Derby <herb@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2020-03-05 21:58:07 +00:00
Mike Klein
f8db8a5516 disable fnma peephole
Change-Id: I66b251826211653586c64e8495cdfde7d7d8f1a2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/275410
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Mike Klein <mtklein@google.com>
2020-03-05 20:28:27 +00:00
Herb Derby
d4c4f0c004 Add Multiply-Subtract (fms) to SkVM
Add fms op and instruction generation. Do fms and fnma
instruction selection.

TODO: Add the ops to Arm

Change-Id: I7e53abd7f4752eb99c31dcbff1f2ea7cf28af6c9
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/275197
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-03-05 18:44:18 +00:00
Mike Klein
823d319f3b do our own fma discovery
Peephole add(F32,F32) for an argument that is a mul().

As a flourish, only generate Op::fma_f32 on machines we know support
real fused mul-adds.  This removes the ambiguity of whether Op::mad_f32
is an FMA or not; the new Op::fma_f32 is always an FMA, and otherwise
you'll just see ordinary mul-add.  No more Op::mad_f32.

Change-Id: I38016a2430774583116d8d6a8ada677012c1a8fc
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/275138
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Herb Derby <herb@google.com>
2020-03-04 22:29:30 +00:00
Mike Klein
cb50b117e3 get rid of troublesome Op::round
We really only need to_unorm(),
and that's fine with trunc(mad(x, scale, 0.5)).

Change-Id: I1561c678501963a9ae53c22994fc906159fc7199
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/275075
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Mike Klein <mtklein@google.com>
2020-03-04 22:26:01 +00:00
Mike Klein
5cdeb390d0 only emit _imm ops when JITing for x86
There are probably ways to make this more efficient by only optimizing
what's necessary (e.g. try JIT first, then interpreter only if it fails)
and some other performance improvements to make, but for now I want to
focus mostly on keeping things simple and correct.

The line between Builder::done() and Program::Program() is particularly
fuzzy and becoming fuzzier here, and I think that'll be something
that'll change eventually.

This makes SkVMTest debug dumps more portable, though perhaps less
useful.  Might kill that feature soon now that SkVM is tested more
thoroughly in unit tests and GMs and bots and such.

Change-Id: Id9ce8daaf8570e5bea8b10f1a80b97f5b33d45dc
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/269941
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-02-10 19:26:05 +00:00
Mike Klein
4bb619554e move instruction specialization later
This adds a specialization pass to Builder::optimize() and moves the
x86-specific _imm ops there, rewriting with the Builder API itself.  I'm
only using the private Builder::push() call for the moment, but that's
enough to make me feel confident that this is a good way forward:  it's
still all going through CSE that way.

We're still doing this any time we're on x86, not when targeting the
JIT, but that'll come next, see the new TODOs.  It's mildly better for
the interpreter to not use the _imm ops, but this is really all still
warmup for optimizations with less mild opinions.

I'm not proud of the switch/goto impl but it's the clearest I found.

Change-Id: I30594b403832343528b95967724fd50324cd79d1
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/269232
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-02-07 16:21:08 +00:00
Mike Klein
ed9b1f1c1e refactor out a middle representation
Kind of brewing a big refactor here, to give me some room between
skvm::Builder and skvm::Program to do optimizations, bakend
specializations and analysis.

As a warmup, I'm trying to split up today's Builder::Instruction into
two forms, first just what the user requested in Builder (this stays
Builder::Instruction) then a new type representing any transformation or
analysis we've done to it (OptimizedInstruction).

Roughly six important optimizations happen in SkVM today, in this order:
   1) constant folding
   2) backend-specific instruction specialization
   3) common sub-expression elimination
   4) reordering + dead code elimination
   5) loop invariant and lifetime analysis
   6) register assignment

At head 1-5 all happen in Builder, and 2 is particularly
awkward to have there (e.g. mul_f32 -> mul_f32_imm).
6 happens in Program per-backend, and that seems healthy.

As of this CL, 1-3 happen in Builder, 4-5 now on this middle
OptimizedInstruction format, and 6 still in Program.

I'd like to get to the point where 1 stays in Builder, 2-5 all happen on
this middle IR, and 6 stays in Program.  That ought to let me do things
like turn mul_f32 -> mul_f32_imm when it's good to and still benefit
from things like common sub-expression elimination and code reordering
happening after that trnasformation.

And then, I hope that's also a good spot to do more complicated
transformations, like lowering gather8 into gather32 plus some fix up
when targeting an x86 JIT but not anywhere else.  Today's Builder is too
early to know whether we should do this or not, and in Program it's
actually kind of awkward to do this sort of thing while also doing
having to do register assignment.  Some middle might be right.

Change-Id: I9c00268a084f07fbab88d05eb441f1957a0d7c67
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/269181
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-02-06 20:09:53 +00:00
Mike Klein
57bdb24d0e skip no-op masks
extract() can generate silly instruction patterns like

     v0 = ...
     v1 = shr v0 24
     v2 = bit_and v1 FF
     v3 = whatever v2 ...

This CL skips those pointless bit_ands when we see the
mask is an immediate and (0xFFFFFFFF>>shift) == mask.

Change-Id: I2bb3847fbb2efdf24d024870ac37b37bb8f9aa3c
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/263101
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-01-08 21:59:24 +00:00
Mike Klein
a6434a5ef5 refactor bit ops
- Remove extract... it's not going to have any special impl.
  I've left it on skvm::Builder as an inline compound method.
- Add no-op shift short circuits.
- Add immediate ops for bit_{and,or,xor,clear}.

This comes from me noticing that the masks for extract today are always
immediates, and then when I started converting it to be (I32, int shift,
int mask), I realized it might be even better to break it up into its
component pieces.  There's no backend that can do extract any better
than shift-then-mask, so might as well leave it that way so we can
dedup, reorder, and specialize those micro ops.

Will follow up soon to get this all JITing again,
and these can-we-JIT test changes will be reverted.

Change-Id: I0835bcd825e417104ccc7efc79e9a0f2f4897841
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/263217
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-01-08 21:20:54 +00:00
Mike Klein
b5c435579e upgrade debugging tools
- Add instruction numbers to program dumps.
- Dump the program when an assertion fails,
  and print the failing condition or an optional
  other value (e.g. if alpha outside [0,1], print alpha).

With all that and the new commented assert enabled, I'm seeing that
sometimes we get a bilerp alpha of 0x3f800001, just a little more than
1.0f.  Fix still tbd.

Change-Id: I2c20e41ae370d8cd2963e2dbf0fd91aa0fd50061
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/262808
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-01-07 18:26:22 +00:00
Mike Klein
1cb05993af all-constant peepholes
This adds a bunch of tests for ops that can all be evaluated directly in
skvm::Builder.  You can see the sort of effect this has by looking at
the diffs for SkVMTest.expected... lots of `v3 = sub_f32 v2 v2`
transformed to `v3 = splat 0 (0)` and that sort of thing.

My favorite part is handling many assert_true() calls at compile time!

While the old inter-Op code parallels aren't as clear now, these new
early-out tests kind of work like comments explaining each op.  I find
that nice.  I found it hard to parse so many uses of the word "splat" so
I did go back to isImm() from isSplat(), and added allImm() to test for
and read several immediates all at once.

Some of this is less C++17 than I'd like. :/

Change-Id: Ie8187d5d184195e3c0c92d613508fb708c28302f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/255814
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-11-21 22:00:11 +00:00
Mike Klein
8c1e0effbb sketch out structure for ops with immediates
Lots of x86 instructions can take their right hand side argument from
memory directly rather than a register.  We can use this to avoid the
need to allocate a register for many constants.

The strategy in this CL is one of several I've been stewing over, the
simplest of those strategies I think.  There are some trade offs
particularly on ARM; this naive ARM implementation means we'll load&op
every time, even though the load part of the operation can logically be
hoisted.  From here on I'm going to just briefly enumerate a few other
approaches that allow the optimization on x86 and still allow the
immediate splats to hoist on ARM.

1) don't do it on ARM
A very simple approach is to simply not perform this optimization on
ARM.  ARM has more vector registers than x86, and so register pressure
is lower there.  We're going to end up with splatted constants in
registers anyway, so maybe just let that happen the normal way instead
of some roundabout complicated hack like I'll talk about in 2).  The
only downside in my mind is that this approach would make high-level
program descriptions platform dependent, which isn't so bad, but it's
been nice to be able to compare and diff debug dumps.

2) split Op::splat up
The next less-simple approach to this problem could fix this by
splitting splats into two Ops internally, one inner Op::immediate that
guantees at least the constant is in memory and is compatible with
immediate-aware Ops like mul_f32_imm, and an outer Op::constant that
depends on that Op::immediate and further guarantees that constant has
been broadcast into a register to be compatible with non-immediate-aware
ops like div_f32.  When building a program, immediate-aware ops would
peek for Op::constants as they do today for Op::splats, but instead of
embedding the immediate themselves, they'd replace their dependency with
the inner Op::immediate.

On x86 these new Ops would work just as advertised, with Op::immediate a
runtime no-op, Op::constant the usual vbroadcastss.  On ARM
Op::immediate needs to go all the way and splat out a register to make
the constant compatible with immediate-aware ops, and the Op::constant
becomes a noop now instead.  All this comes together to let the
Op::immediate splat hoist up out of the loop while still feeding
Op::mul_f32_imm and co.  It's a rather complicated approach to solving
this issue, but I might want to explore it just to see how bad it is.

3) do it inside the x86 JIT
The conceptually best approach is to find a way to do this peepholing
only inside the JIT only on x86, avoiding the need for new
Op::mul_f32_imm and co.  ARM and the interpreter don't benefit from this
peephole, so the x86 JIT is the logical owner of this optimization.
Finding a clean way to do this without too much disruption is the least
baked idea I've got here, though I think the most desirable long-term.

Cq-Include-Trybots: skia.primary:Test-Debian9-Clang-GCE-CPU-AVX2-x86_64-Debug-All-SK_USE_SKVM_BLITTER,Test-Debian9-Clang-GCE-CPU-AVX2-x86_64-Release-All-SK_USE_SKVM_BLITTER
Change-Id: Ie9c6336ed08b6fbeb89acf920a48a319f74f3643
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/254217
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-11-12 20:17:55 +00:00
Mike Klein
4135cf0b57 use round() instead of trunc() to f32->unorm
This does open us up to a little bit of possible inconsistency of
rounding when right on a x.5 (sometimes we'll +0.5 and trunc, sometimes
round to nearest, sometimes round according to the default mode which is
usually round to nearest) but I think that inconsistency may be worth
the free register not needing a splat(0.5f) buys us.

A few invisible diffs.

Change-Id: I9af092c937ccf7c5891c2ab3cb298d217e4a9e9f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/253725
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Mike Reed <reed@google.com>
2019-11-08 21:28:07 +00:00
Mike Klein
6e4aad91c3 rename to_i32 -> trunc, and add round
This plumbs through round but doesn't use it.  I want that change to be
its own CL.  It's nice to have assembler support and the name changes
even if I revert using round.

Change-Id: I6d67ec5c63546069eb7cc1c91599b599bafcda66
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/253724
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-11-08 21:00:51 +00:00
Mike Klein
e8356ad35d indent loop so it stands out
Change-Id: Iea0f804b1b2fed9e663e45c33fb54a91b10fd07b
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/252652
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-11-05 11:33:54 +00:00
Mike Klein
d48488b5ea reorder to minimize register pressure
Rewrite program instructions so that each value becomes available as
late as possible, just before it's used by another instruction.  This
reorders blocks of instructions to reduce them number of temporary
registers in flight.

Take this example of the sort of program that we naturally write,
noting the registers needed as we progress down the right:

    src = load32 ...          (1)
    sr = extract src ...      (2)
    sg = extract src ...      (3)
    sb = extract src ...      (4)
    sa = extract src ...      (4, src dies)

    dst = load32 ...          (5)
    dr = extract dst ...      (6)
    dg = extract dst ...      (7)
    db = extract dst ...      (8)
    da = extract dst ...      (8, dst dies)

    r = add sr dr             (7, sr and dr die)
    g = add sg dg             (6, sg and dg die)
    b = add sb db             (5, sb and db die)
    a = add sa da             (4, sa and da die)

    rg   = pack r g ...       (3, r and g die)
    ba   = pack b a ...       (2, b and a die)
    rgba = pack rg ba ...     (1, rg and ba die)
    store32 rgba ...          (0, rgba dies)

That original ordering of the code needs 8 registers (perhaps with a
temporary 9th, but we'll ignore that here).  This CL will rewrite the
program to something more like this by recursively issuing inputs only
once needed:

    src = load32 ...       (1)
    sr  = extract src ...  (2)
    dst = load32 ...       (3)
    dr  = extract dst ...  (4)
     r  = add sr dr        (3, sr and dr die)

    sg  = extract src ...  (4)
    dg  = extract dst ...  (5)
     g  = add sg dg        (4, sg and dg die)

    rg  = pack r g         (3, r and g die)

    sb  = extract src ...  (4)
    db  = extract dst ...  (5)
     b  = add sb db        (4, sb and db die)

    sa  = extract src ...  (4, src dies)
    da  = extract dst ...  (4, dst dies)
     a  = add sa da        (3, sa and da die)

    ba  = pack b a         (2, b and a die)

    rgba = pack rg ba ...  (1, rg and ba die)
    store32 rgba  ...      (0)

That trims 3 registers off the example, just by reordering!
I've added the real version of this example to SkVMTest.cpp.
(Its 6th register comes from holding the 0xff byte mask used
by extract, in case you're curious).

I'll admit it's not exactly easy to work out how this reordering works
without a pen and paper or trial and error.  I've tried to make the
implementation preserve the original program's order as much as makes
sense (i.e. when order is an otherwise arbitrary choice) to keep it
somewhat sane to follow.

This reordering naturally skips dead code, so pour one out for ☠️ .
We lose our cute dead code emoji marker, but on the other hand all code
downstream of Builder::done() can assume every instruction is live.

Change-Id: Iceffcd10fd7465eae51a39ef8eec7a7189766ba2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/249999
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-10-22 21:49:05 +00:00
Mike Klein
0f61c12737 add used_in_loop bit to skvm::Builder::Instruction
Most hoisted values are used in the loop body (and that's really the
whole point of hoisting) but some are just temporaries to help produce
other hoisted values.  This used_in_loop bit helps us distinguish the
two, and lets us recycle registers holding temporary hoisted values not
used in the loop.

The can-we-recycle logic now becomes:
   - is this a real value?
   - is it time for it to die?
   - is it either not hoisted or a hoisted temporary?

The set-death-to-infinity approach for hoisted values is now gone.  That
worked great for hoisted values used inside the loop, but was too
conservative for hoisted temporaries.  This lifetime extension was
preventing us from recycling those registers, pinning enough registers
that we run out and fail to JIT.

Small amounts of refactoring to make this clearer:
   - move the Instruction hash function definition near its operator==
   - rename the two "hoist" variables to "can_hoist" for Instructions
     and "try_hoisting" for the JIT approach
   - add ↟ to mark hoisted temporaries, _really_ hoisted values.

There's some redundancy here between tracking the can_hoist bit, the
used_in_loop bit, and lifetime tracking.  I think it should be true, for
instance, that !can_hoist && !used_in_loop implies an instruction is
dead code.  I plan to continue refactoring lifetime analysis (in
particular reordering instructions to decrease register pressure) so
hopefully by the time I'm done that metadata will shake out a little
crisper.

Change-Id: I6460ca96d1cbec0315bed3c9a0774cd88ab5be26
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/248986
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-10-16 18:29:06 +00:00
Mike Klein
f996311003 extend lifetimes for hoisted used in loop
This makes the register recycling checks a bit more
precise.  At head we never recycle a register that's
holding a hoisted value, which is overly conservative.

We really should never recycle a register that's still
needed.  By extending the lifetime of any hoisted value
that's used in the loop, we prevent that, while still
allowing hoisted values that are only used in hoisted
computation to be reused.

This takes just a small tweak in the JIT code (removing
the !hoisted({x,y,z}) checks), and a somewhat larger
refactoring in the interpreter, making both hoisted and
non-hoisted code go through the same recycling register
assignment flow.

There's one diff in the existing cases where we now
reuse a hoisted register, and I've added a second test
just to make sure it's covered explicitly.

Change-Id: I25b37ab1f1fea3042d7fd167529abc8fed1dddff
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/233239
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-08-13 02:08:16 +00:00
Mike Klein
8ac9f4e5b2 flesh out SkVM ops a bit more
Add missing comparison and selection ops, bit casts, 16-bit memory
operations, gathers, uniform loads, and fill in math holes where
reasonable.  Update some names to be a bit more regular.

I think all instructions are implemented in the interpreter,
and many tested.  More testing and JITs to follow.

Change-Id: I8cf377e8b72a86ac950e020892ce82b39e9d7277
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/229893
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-07-29 20:43:10 +00:00
Mike Klein
5e533c9e1f move hoist analysis back into Builder
Even if a JIT ultimately doesn't end up hoisting any values, it's going
to want this information while it decides.  Writing it in one place also
ensures we only get it wrong in one place...

I'm no_ extending the lifetime of hoisted instructions here in Builder.
That's something to leave to the backend so they have the flexibility of
which of these values to hoist, if any.  If they don't hoist, they'll
need to know when the value dies.

Moving this information back here lets the test expectation goldens
reflect the hoist bit again too.  Kind of nice.

Change-Id: Ib165ca898a97c1d822cb28fe24f15bae4d570a17
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/229024
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-07-22 19:34:06 +00:00
Mike Klein
c2fb3b4b72 split deaths() out of other analysis
I'm slowly refactoring my way to where hoisting and register assignment
are done in backend-specific ways, but this liveness analysis is always
going to be useful for each backend.

Use deaths() to restore friendly ☠️  dead code markers in test dumps.

Change-Id: I3ab94665bbbbf0788b0b27e00d644eba927dff47
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/228113
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Mike Klein <mtklein@google.com>
2019-07-17 18:11:10 +00:00
Mike Klein
aab45b5638 add misc. value programs to SkVMTest.expected
Noticed we were only dumping the final register
programs for the integer code.  Might as well also
track the value programs.

Change-Id: I417c5c655b632691557bbbb136dcbd3f3167af9a
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/225324
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-07-02 23:13:06 +00:00
Mike Klein
2b7b2a2331 add bit_clear
I was just reading the ARM docs and realized that their BIC ("BIt
Clear") is the same as SSE's ANDN ("AND Not") instruction.  It's kind of
a neat little tool to have laying around... comes up more than you'd
think, and it's sometimes the clearest way to express what you're doing,
as in the changed program here where the comment is "mask away the low
bits".  That's a bit_clear with a mask for what you want to clear away!

And the real reason to write this up is that I want to have a CL to
point to that shows how to add an instruction top to bottom.

Change-Id: I99690ed9c1009427b3986955e7ae6264de4d215c
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/223120
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Mike Reed <reed@google.com>
2019-06-24 16:31:15 +00:00
Mike Klein
a1167abcae split out Analysis struct from Instruction
Instruction is the fundamental data, and Analysis derived from it.

The fields in Analysis are only* needed in Builder::done(), and this
split seems to help clarify what done() can tweak (Analysis) and what
it cannot (fProgram, Instructions).  done() is now const.

No speed change as far as I can tell.

* As you may notice looking at the test expectations, making analysis
ephemeral means that dump() can no longer print the skull for dead code
or the arrow for hoisted.  The register program that's also in the
expectation file still reflects both of these optimizations, so we're
not really losing any information.  Just maybe less demo-friendly.

Change-Id: I79feb57558525591baf3faadeb59c418c12793f3
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/223119
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-06-24 16:00:41 +00:00
Mike Klein
0c3346643a refactor to remove the need for death schedule
This cuts the overhead bench from about 19µs to about 15µs.

The key insight here is that the only registers that might become
available after any given instruction are the ones that hold that
instruction's inputs.  We can check when they become available
directly from the original Builder::Program, without needing a
side death schedule data structure.

Marking hoisted instructions as having life == program size
helps make this logic a little simpler to reason through.

Change-Id: Ifb9957f2d0e323e0e5d07996a2cc988f7c8b4c3f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/223117
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-06-24 15:44:10 +00:00
Mike Klein
397fc88fc0 first VEX ymm vector ops
- 32x8 i32 add,sub,mul
   - add I32_Naive bench/test builder to get better i32 mul coverage
   - minor refactoring all over

Change-Id: I13cc19ff37a2da0bcff289ba51baac08f456d6c5
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/222485
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-06-20 18:20:00 +00:00
Mike Klein
342b1b2753 proposed: add bytes() op
I'm staring at this assembly,

	vmovups	(%rsi), %ymm3
	vpsrld	$24, %ymm3, %ymm4
	vpslld	$16, %ymm4, %ymm15
	vorps	%ymm4, %ymm15, %ymm4
	vpsubw	%ymm4, %ymm0, %ymm4

Just knowing that could be

	vmovups	(%rsi), %ymm3
	vpshufb	 0x??(%rip), %ymm3, %ymm4
	vpsubw	%ymm4, %ymm0, %ymm4

That is, instead of shifting, shifting, and bit-oring
to create the 0a0a scale factor from ymm3, we could just
byte shuffle directly using some pre-baked control pattern
(stored at the end of the program like other constants)

pshufb lets you arbitrarily remix bytes from its argument and
zero bytes, and NEON has a similar family of vtbl instructions,
even including that same feature of injecting zeroes.

I think I've got this working, and the speedup is great,
from 0.19 to 0.16 ns/px for I32_SWAR, and
from 0.43 to 0.38 ns/px for I32.

Change-Id: Iab850275e826b4187f0efc9495a4b9eab4402c38
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220871
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-06-17 15:29:34 +00:00
Mike Klein
4c4945a252 trim another instruction of I32_SWAR
Now that we've got shr_16x2, extract(..., 8, splat(0x00ff00ff)) is
better done as shr_16x2(..., 8).  This swaps a 16-bit shift in for
the 32-bit shift, a wash, but lets us drop the bit_and at the end,
saving one whole instruction.

This places I32_SWAR a tiny little bit faster than the code in Opts,
like .19 ns/px vs .20 ns/px for Opts.

Change-Id: I4160dc03ecc8b855c0773a927f1510ad5cbb4b87
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220856
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-06-13 21:58:05 +00:00
Mike Klein
7f061fb53b streamline srcover math in I32_SWAR
This is the final bunny I've got in my hat, I think...

Remembering that none of the s += d*invA adds can overflow,
we can use a single 32-bit add to add them all at once.

This means we don't have to unpack the src pixel into rb/ga
halves.  We need only extract the alpha for invA.

This brings I32_SWAR even with the Opts code!

curr/maxrss	loops	min	median	mean	max	stddev	samples   	config	bench
  36/36  MB	133	0.206ns	0.211ns	0.208ns	0.211ns	1%	▁▇▁█▁▇▁▇▁▇	nonrendering	SkVM_4096_I32_SWAR
  37/37  MB	152	0.432ns	0.432ns	0.434ns	0.444ns	1%	▃▁▁▁▁▃▁▁█▁	nonrendering	SkVM_4096_I32
  37/37  MB	50	0.781ns	0.794ns	0.815ns	0.895ns	5%	▆▂█▃▅▂▂▁▂▁	nonrendering	SkVM_4096_F32
  37/37  MB	76	0.773ns	0.78ns	0.804ns	0.907ns	6%	▄█▅▁▁▁▁▂▁▁	nonrendering	SkVM_4096_RP
  37/37  MB	268	0.201ns	0.203ns	0.203ns	0.204ns	0%	█▇▆▆▆▆▁▆▆▆	nonrendering	SkVM_4096_Opts

Change-Id: Ibf0a9c5d90b35f1e9cf7265868bd18b7e0a76c43
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220805
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-06-13 21:32:45 +00:00
Mike Klein
57cb5ba122 i16x2 sub/shr
More i16x2 ops, as seemed immediately useful in I32_SWAR.

   I32_SWAR: 0.27 ns/px  --> 0.25 ns/px
   I32:      0.43 ns/px
   F32:      0.76 ns/px
   RP:       0.8  ns/px
   Opts:     0.2  ns/px

Change-Id: I04fed0d1ed1c4218d0cafb45fd0ee6d68880de80
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220801
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-06-13 19:17:34 +00:00
Mike Klein
3538908983 baby steps into 16-bit ops
I figure the easiest way to expose 16-bit operations
is to expose 16x2 pair operations... this means we
can continue to always work with the same size vector.

Switching from 32-bit multiplies to 16-bit multiplies
is going to deliver the most oomph... they cost roughly
half what 32-bit multiplies do on x86.

Speed now:

   I32_SWAR: 0.27 ns/px
   I32:      0.43 ns/px
   F32:      0.76 ns/px
   RP:       0.8  ns/px
   Opts:     0.2  ns/px

Change-Id: I8350c71722a9bde714ba18f97b8687fe35cc749f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220709
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-06-13 18:44:44 +00:00
Mike Klein
821f5e8dfe remove mul_unorm8/mad_unorm8
I just kind of remembered that if we're doing (xy+x)/256
and x is a destination channel and y is 255-sa, then you
can get the +x for free by multiplying by 256-sa instead.

  (d * (255-sa) + d)
  (d * (255-sa + 1))
  (d * (256-sa)    )

Duh.  This is a trick we play in a lot of legacy code and
I've just now realized it's exactly equivalent to the trick
I want to play here... sigh.

Folding this math in kind of makes mul/mad_unorm8 moot.

Speed's getting good:

  I32_SWAR: 0.3  ns/px
  I32     : 0.55 ns/px
  F32     : 0.8  ns/px
  RP      : 0.8  ns/px
  Opts    : 0.2  ns/px

Change-Id: I4d10db51ea80a3258c36e97b6b334ad253804613
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220708
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-06-13 18:21:44 +00:00
Mike Klein
3f593799da expand unit tests, fix extract
The mask-only special case for extract is wrong...
it never looked it its input!

This not only makes things correct-er, but oddly it also
makes them faster by breaking inter-loop data dependencies.

Disable tests for _I32... they're actually still broken
because of a much more systemic flaw in how I've evaluated
programs.  The _F32 and _I32_SWAR JIT code and all interpreted
code is just getting lucky.  o_O

While here, update the I32_SWAR code to use the same math as I32,
(x*y+x)/256 for unorm8 mul.  This just helps keep me sane.

Change-Id: I1acc09adb84c426fca4b2be5ca8c2d46d9678dd8
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/220577
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-06-12 18:58:56 +00:00
Mike Klein
22ea7e994b add Builder::dump()
I used to have a dump of the value program before it was
translated to registers, but it went away a while ago.
This restores it.

Change-Id: I9b8bfcb124843cad4b0dc44bdf0a03e95a0c83d8
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/219757
Commit-Queue: Mike Klein <mtklein@google.com>
Commit-Queue: Brian Osman <brianosman@google.com>
Auto-Submit: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
2019-06-10 17:43:58 +00:00
Mike Klein
a630732c36 revise extract instruction
Convert extract(x,bits,z) to be (x >> bits) & z,
now a more explicit parallel to pack().

This lets us eliminate the funky bit counting required from the old
instruction, but more saliently it makes it more likely that the masks
we AND with will be the same value.

Ultimately down at the x86 or ARM ISA level, the AND instructions don't
really benefit from having an immediate argument (while the shifts do).
We might as well treat the mask as a normal value, letting it get
commoned with identical values, loop hoisted, etc.

Change-Id: I48a38468b46f2c730574c025f412262296472447
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/219597
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
2019-06-10 15:08:19 +00:00
Mike Klein
6b43c30cd7 add mad_unorm8 instruction
At some point adding more and more complex instructions reduces
to the absurdity of SolveTheWholeProblem-The-Instruction, but
I think this one will come up often enough to still make sense.
mad() makes sense for unorm8 just about everywhere mad() makes
sense for f32.

This instruction won't matter to a JIT, but helps the interpreter.

Change-Id: Iace92296cffbb6fbc3acd1f853cb01c51792f796
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218716
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
2019-06-05 20:46:02 +00:00
Mike Klein
754bad3f38 hoist loop-invariant code out of the loop
I'm of two minds about this... it adds register pressure and really only
tends to hoist few instructions that are fairly cheap anway.  On the
other hand, it's neat, it's easy to turn off (just set the initial
hoist value to false in Builder::push()) and it does deliver a
noticeable though slight performance improvement in the interpreter.

I think the final decision will probably come down to what we think
about maintainability?

Change-Id: Idd6346f70f03188917918406731154246a7c6fcb
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218584
Reviewed-by: Brian Osman <brianosman@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-06-05 16:31:41 +00:00
Mike Klein
c2d9a3145b Revert "print 1/K floats as fractions"
This reverts commit 346f82c1c3.

Reason for revert: *SAN bots

Original change's description:
> print 1/K floats as fractions
> 
> Change-Id: Id00cbd0950e77debb5ab5d45541dc0f8d13a3c42
> Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218338
> Reviewed-by: Brian Osman <brianosman@google.com>
> Commit-Queue: Mike Klein <mtklein@google.com>

TBR=mtklein@google.com,brianosman@google.com

Change-Id: Ic35cec97d2dc2c1e19dbdf8ea7b505ad75072da1
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218529
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-06-05 14:12:26 +00:00
Mike Klein
346f82c1c3 print 1/K floats as fractions
Change-Id: Id00cbd0950e77debb5ab5d45541dc0f8d13a3c42
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218338
Reviewed-by: Brian Osman <brianosman@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-06-04 21:11:22 +00:00
Mike Klein
3c00d3d379 dump register and instruction count
Change-Id: Ib0d4f354787e413749fdda8b59ccc2f94472b0ce
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218243
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-06-04 17:27:29 +00:00
Mike Klein
9656dcedc2 add extract instruction
Kind of the flip side of pack.

Made slightly awkward by instructions having only one immediate...
calling _BitScanForward / __builtin_ctz() at runtime seems to work
fine, but it really could have been done at compile time.

Change-Id: Ic83fe8e0a1603fb9189598dcc26c842cc797bf45
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218241
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
2019-06-04 17:23:49 +00:00
Mike Klein
1665aaa5fd add pack instruction
This instruction can lower to some useful SSE/NEON
instructions, and even if not, is a handy way to
express the frequent paring of << and |.

I32_SWAR:  2.3 -> 1.9
I32:       2.6 -> 2.4
F32:       5.1 -> 4.7

Change-Id: Ia169ad40f0aaef32417e05d9bf91c2d2542e7b5f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218238
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
2019-06-04 17:14:49 +00:00
Mike Klein
d7d04dc5eb add mul_unorm8 instruction
Another way for an interpreter to go faster
is to provide better instructions.

mul_unorm8 is one we use all the time.

Drops _I32 bench from ~3.6ns/px to ~2.6ns/px.

Change-Id: I9d08914c114048b79075796af9ec802236b35706
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218236
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
2019-06-04 16:59:49 +00:00
Mike Klein
7b7077cc36 centralize test/bench SkVM builders
Eliminate the duplicate functionality,
and better testing for the bench builders.

Change-Id: If20e52107738903f854aec431416e573d7a7d640
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218041
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-06-04 16:55:59 +00:00
Mike Klein
12818515b5 add a x*y+0 -> x*y peephole
This is mostly to test how easy rebaselining SkVMTest is.

Change-Id: I27ab6f6bb8b7e126327374009783afd86d416f55
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218039
Reviewed-by: Brian Osman <brianosman@google.com>
2019-06-03 22:25:27 +00:00
Mike Klein
267f50773c streamline SkVM test rebaselining
- keep expectations in resources/
  - overwrite automatically if needed
    so we can see the diff in Git

Change-Id: I2486b127ebcc7f40332fd0462e38b1af04d3e32b
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218038
Reviewed-by: Brian Osman <brianosman@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2019-06-03 22:24:50 +00:00