sketch out structure for ops with immediates

Lots of x86 instructions can take their right-hand-side argument directly
from memory rather than from a register. We can use this to avoid the need
to allocate a register for many constants.

The strategy in this CL is one of several I've been stewing over, and I
think it's the simplest of them. There are some trade-offs, particularly on
ARM: this naive ARM implementation means we'll load-and-op every time, even
though the load part of the operation can logically be hoisted. From here
on I'll briefly enumerate a few other approaches that allow the
optimization on x86 while still letting the immediate splats hoist on ARM.

1) don't do it on ARM

The simplest approach is to not perform this optimization on ARM at all.
ARM has more vector registers than x86, so register pressure is lower
there. We're going to end up with splatted constants in registers anyway,
so maybe just let that happen the normal way rather than through a
roundabout hack like the one described in 2). The only downside in my mind
is that this approach makes high-level program descriptions
platform-dependent, which isn't so bad, but it's been nice to be able to
compare and diff debug dumps.

2) split Op::splat up

A less simple approach would fix this by splitting each splat into two Ops
internally: an inner Op::immediate that guarantees at least that the
constant is in memory and is compatible with immediate-aware Ops like
mul_f32_imm, and an outer Op::constant that depends on that Op::immediate
and further guarantees the constant has been broadcast into a register, so
it's compatible with non-immediate-aware ops like div_f32. When building a
program, immediate-aware ops would peek for Op::constants as they do today
for Op::splats, but instead of embedding the immediate themselves, they'd
replace their dependency with the inner Op::immediate.

On x86 these new Ops would work just as advertised, with Op::immediate a
runtime no-op and Op::constant the usual vbroadcastss. On ARM,
Op::immediate needs to go all the way and splat out a register to make the
constant compatible with immediate-aware ops, and Op::constant becomes the
no-op instead. All this comes together to let the Op::immediate splat hoist
up out of the loop while still feeding Op::mul_f32_imm and co. It's a
rather complicated approach to solving this issue, but I might want to
explore it just to see how bad it is.

3) do it inside the x86 JIT

The conceptually best approach is to find a way to do this peepholing only
inside the x86 JIT, avoiding the need for new Op::mul_f32_imm and co. ARM
and the interpreter don't benefit from this peephole, so the x86 JIT is the
logical owner of this optimization. Finding a clean way to do that without
too much disruption is the least-baked idea I've got here, though I think
it's the most desirable long-term.

Cq-Include-Trybots: skia.primary:Test-Debian9-Clang-GCE-CPU-AVX2-x86_64-Debug-All-SK_USE_SKVM_BLITTER,Test-Debian9-Clang-GCE-CPU-AVX2-x86_64-Release-All-SK_USE_SKVM_BLITTER
Change-Id: Ie9c6336ed08b6fbeb89acf920a48a319f74f3643
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/254217
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
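
As a rough sketch of the peek-and-fuse idea described above (a minimal
illustration, not the actual skvm code; the Op/Inst/Val/Builder types here
are invented for the example), an immediate-aware builder op can check
whether its right-hand operand is a splat and, if so, fold the literal into
a *_imm instruction so the constant never needs a register:

    #include <cstdint>
    #include <vector>

    // Stand-in types for the sketch; the real skvm::Builder differs.
    enum class Op { splat, mul_f32, mul_f32_imm };
    struct Inst { Op op; int x, y; uint32_t imm; };
    struct Val  { int id; };

    struct Builder {
        std::vector<Inst> program;

        Val push(Op op, int x, int y, uint32_t imm = 0) {
            program.push_back({op, x, y, imm});
            return Val{(int)program.size() - 1};
        }

        Val splat(uint32_t bits) { return push(Op::splat, -1, -1, bits); }

        // Peek: if y is a splat, emit mul_f32_imm with the literal embedded,
        // letting the x86 JIT read the constant straight from memory.
        Val mul_f32(Val x, Val y) {
            const Inst& yi = program[y.id];
            if (yi.op == Op::splat) {
                return push(Op::mul_f32_imm, x.id, -1, yi.imm);
            }
            return push(Op::mul_f32, x.id, y.id);
        }
    };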

A8 over A8
12 values:
v0 = load8 arg(0)
v1 = to_f32 v0
v2 = mul_f32 v1 3B808081 (0.0039215689)
v3 = load8 arg(1)
v4 = to_f32 v3
v5 = mul_f32 v4 3B808081 (0.0039215689)
↑ v6 = splat 3F800000 (1)
v7 = sub_f32 v6 v2
v8 = mad_f32 v5 v7 v2
v9 = mul_f32 v8 437F0000 (255)
v10 = round v9
store8 arg(1) v10
4 registers, 12 instructions:
0 r0 = splat 3F800000 (1)
loop:
1 r1 = load8 arg(0)
2 r1 = to_f32 r1
3 r1 = mul_f32 r1 3B808081 (0.0039215689)
4 r2 = load8 arg(1)
5 r2 = to_f32 r2
6 r2 = mul_f32 r2 3B808081 (0.0039215689)
7 r3 = sub_f32 r0 r1
8 r1 = mad_f32 r2 r3 r1
9 r1 = mul_f32 r1 437F0000 (255)
10 r1 = round r1
11 store8 arg(1) r1
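
For reference, a scalar C++ sketch of the per-pixel math the A8 over A8
program above encodes (illustrative only; it assumes arg(0) is the source
and arg(1) the destination, as the loads and store suggest):

    #include <cmath>
    #include <cstdint>

    // s + d*(1-s), matching v2, v5, v7, v8 and the final scale/round/store.
    uint8_t a8_over_a8(uint8_t src, uint8_t dst) {
        float s = src * (1 / 255.0f);         // v2
        float d = dst * (1 / 255.0f);         // v5
        float r = d * (1.0f - s) + s;         // v7 = 1 - s, v8 = mad(d, v7, s)
        return (uint8_t)lroundf(r * 255.0f);  // v9, v10, store8
    }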
A8 over G8
17 values:
v0 = load8 arg(1)
v1 = to_f32 v0
v2 = mul_f32 v1 3B808081 (0.0039215689)
v3 = load8 arg(0)
v4 = to_f32 v3
v5 = mul_f32 v4 3B808081 (0.0039215689)
↑ v6 = splat 3F800000 (1)
v7 = sub_f32 v6 v5
v8 = mul_f32 v2 v7
↑ v9 = splat 3E59B3D0 (0.21259999)
↑ v10 = splat 3F371759 (0.71520001)
v11 = mul_f32 v8 3D93DD98 (0.0722)
v12 = mad_f32 v8 v10 v11
v13 = mad_f32 v8 v9 v12
v14 = mul_f32 v13 437F0000 (255)
v15 = round v14
store8 arg(1) v15
5 registers, 17 instructions:
0 r0 = splat 3F800000 (1)
1 r1 = splat 3E59B3D0 (0.21259999)
2 r2 = splat 3F371759 (0.71520001)
loop:
3 r3 = load8 arg(1)
4 r3 = to_f32 r3
5 r3 = mul_f32 r3 3B808081 (0.0039215689)
6 r4 = load8 arg(0)
7 r4 = to_f32 r4
8 r4 = mul_f32 r4 3B808081 (0.0039215689)
9 r4 = sub_f32 r0 r4
10 r4 = mul_f32 r3 r4
11 r3 = mul_f32 r4 3D93DD98 (0.0722)
12 r3 = mad_f32 r4 r2 r3
13 r3 = mad_f32 r4 r1 r3
14 r3 = mul_f32 r3 437F0000 (255)
15 r3 = round r3
16 store8 arg(1) r3
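
A scalar sketch of the A8 over G8 math above (illustrative; an A8 source
contributes no color, so every channel of the composite is dst*(1-sa),
which is then folded back to gray with the luminance weights seen in the
splats):

    #include <cmath>
    #include <cstdint>

    uint8_t a8_over_g8(uint8_t src_a8, uint8_t dst_g8) {
        float sa = src_a8 * (1 / 255.0f);                  // v5
        float dg = dst_g8 * (1 / 255.0f);                  // v2
        float c  = dg * (1.0f - sa);                       // v8: r = g = b
        float y  = 0.2126f*c + 0.7152f*c + 0.0722f*c;      // v11..v13
        return (uint8_t)lroundf(y * 255.0f);               // v14, v15, store8
    }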

refactor bit ops

- Remove extract... it's not going to have any special implementation.
  I've left it on skvm::Builder as an inline compound method.
- Add no-op shift short circuits.
- Add immediate ops for bit_{and,or,xor,clear}.

This comes from noticing that the masks for extract today are always
immediates. When I started converting it to take (I32, int shift, int
mask), I realized it might be even better to break it up into its component
pieces: no backend can do extract any better than shift-then-mask, so we
might as well leave it that way so we can dedup, reorder, and specialize
those micro-ops.

Will follow up soon to get this all JITing again, and these can-we-JIT test
changes will be reverted.

Change-Id: I0835bcd825e417104ccc7efc79e9a0f2f4897841
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/263217
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
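
A minimal scalar sketch of what the compound extract() boils down to once
it's built from its component pieces (this mirrors the shr_i32/bit_and
pairs in the dumps below, not the exact skvm::Builder signature):

    #include <cstdint>

    // extract(x, shift, mask) == (x >> shift) & mask, written as two
    // micro-ops so they can be deduped, reordered, and specialized.
    static inline uint32_t extract(uint32_t x, int shift, uint32_t mask) {
        uint32_t shifted = (shift == 0) ? x : (x >> shift);  // no-op shift short circuit
        return shifted & mask;                               // bit_and with an immediate mask
    }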
A8 over RGBA_8888
36 values:
v0 = load32 arg(1)
v1 = bit_and v0 FF
v2 = to_f32 v1
v3 = mul_f32 v2 3B808081 (0.0039215689)
v4 = load8 arg(0)
v5 = to_f32 v4
v6 = mul_f32 v5 3B808081 (0.0039215689)
↑ v7 = splat 3F800000 (1)
v8 = sub_f32 v7 v6
v9 = mul_f32 v3 v8
v10 = mul_f32 v9 437F0000 (255)
v11 = round v10
v12 = shr_i32 v0 8
v13 = bit_and v12 FF
v14 = to_f32 v13
v15 = mul_f32 v14 3B808081 (0.0039215689)
v16 = mul_f32 v15 v8
v17 = mul_f32 v16 437F0000 (255)
v18 = round v17
v19 = pack v11 v18 8
v20 = shr_i32 v0 16
v21 = bit_and v20 FF
v22 = to_f32 v21
v23 = mul_f32 v22 3B808081 (0.0039215689)
v24 = mul_f32 v23 v8
v25 = mul_f32 v24 437F0000 (255)
v26 = round v25
v27 = shr_i32 v0 24
v28 = to_f32 v27
v29 = mul_f32 v28 3B808081 (0.0039215689)
v30 = mad_f32 v29 v8 v6
v31 = mul_f32 v30 437F0000 (255)
v32 = round v31
v33 = pack v26 v32 8
v34 = pack v19 v33 16
store32 arg(1) v34
6 registers, 36 instructions:
0 r0 = splat 3F800000 (1)
loop:
1 r1 = load32 arg(1)
2 r2 = bit_and r1 FF
3 r2 = to_f32 r2
4 r2 = mul_f32 r2 3B808081 (0.0039215689)
5 r3 = load8 arg(0)
6 r3 = to_f32 r3
7 r3 = mul_f32 r3 3B808081 (0.0039215689)
8 r4 = sub_f32 r0 r3
9 r2 = mul_f32 r2 r4
10 r2 = mul_f32 r2 437F0000 (255)
11 r2 = round r2
12 r5 = shr_i32 r1 8
13 r5 = bit_and r5 FF
14 r5 = to_f32 r5
15 r5 = mul_f32 r5 3B808081 (0.0039215689)
16 r5 = mul_f32 r5 r4
17 r5 = mul_f32 r5 437F0000 (255)
18 r5 = round r5
19 r5 = pack r2 r5 8
20 r2 = shr_i32 r1 16
21 r2 = bit_and r2 FF
22 r2 = to_f32 r2
23 r2 = mul_f32 r2 3B808081 (0.0039215689)
24 r2 = mul_f32 r2 r4
25 r2 = mul_f32 r2 437F0000 (255)
26 r2 = round r2
27 r1 = shr_i32 r1 24
28 r1 = to_f32 r1
29 r1 = mul_f32 r1 3B808081 (0.0039215689)
30 r3 = mad_f32 r1 r4 r3
31 r3 = mul_f32 r3 437F0000 (255)
32 r3 = round r3
33 r3 = pack r2 r3 8
34 r3 = pack r5 r3 16
35 store32 arg(1) r3
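
A scalar sketch of the A8 over RGBA_8888 math above, showing the
unpack/composite/repack structure (illustrative; it assumes the destination
is packed R in the low byte through A in the high byte, matching the shifts
and packs in the dump):

    #include <cmath>
    #include <cstdint>

    uint32_t a8_over_rgba8888(uint8_t src_a8, uint32_t dst) {
        float sa  = src_a8 * (1 / 255.0f);                   // v6
        float inv = 1.0f - sa;                               // v8

        auto chan = [&](int shift) {                         // shr_i32 + bit_and + to_f32 + mul
            return ((dst >> shift) & 0xFF) * (1 / 255.0f);
        };
        float r = chan(0)  * inv;                            // v9
        float g = chan(8)  * inv;                            // v16
        float b = chan(16) * inv;                            // v24
        float a = (dst >> 24) * (1 / 255.0f) * inv + sa;     // v30 = mad(da, 1-sa, sa)

        auto byte = [](float x) { return (uint32_t)lroundf(x * 255.0f); };
        return byte(r) | (byte(g) << 8) | (byte(b) << 16) | (byte(a) << 24);  // the packs
    }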
G8 over A8
9 values:
↑ v0 = splat 3F800000 (1)
↑ v1 = splat 0 (0)
v2 = load8 arg(1)
v3 = to_f32 v2
v4 = mul_f32 v3 3B808081 (0.0039215689)
v5 = mad_f32 v4 v1 v0
v6 = mul_f32 v5 437F0000 (255)
v7 = round v6
store8 arg(1) v7
3 registers, 9 instructions:
0 r0 = splat 3F800000 (1)
1 r1 = splat 0 (0)
loop:
2 r2 = load8 arg(1)
3 r2 = to_f32 r2
4 r2 = mul_f32 r2 3B808081 (0.0039215689)
5 r2 = mad_f32 r2 r1 r0
6 r2 = mul_f32 r2 437F0000 (255)
7 r2 = round r2
8 store8 arg(1) r2
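
The G8 over A8 program reduces to a constant: the G8 source is treated as
opaque, so the destination alpha becomes da*(1-1) + 1 = 1, i.e. 255 for
every pixel. A scalar sketch (illustrative; the multiply by the splatted 0
is kept only to mirror the dump):

    #include <cmath>
    #include <cstdint>

    uint8_t g8_over_a8(uint8_t /*src_g8*/, uint8_t dst_a8) {
        float da = dst_a8 * (1 / 255.0f);      // v4
        float a  = da * 0.0f + 1.0f;           // v5 = mad(da, splat 0, splat 1)
        return (uint8_t)lroundf(a * 255.0f);   // always 255
    }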
G8 over G8
16 values:
v0 = load8 arg(0)
v1 = to_f32 v0
v2 = mul_f32 v1 3B808081 (0.0039215689)
v3 = load8 arg(1)
v4 = to_f32 v3
v5 = mul_f32 v4 3B808081 (0.0039215689)
↑ v6 = splat 0 (0)
v7 = mad_f32 v5 v6 v2
↑ v8 = splat 3E59B3D0 (0.21259999)
↑ v9 = splat 3F371759 (0.71520001)
v10 = mul_f32 v7 3D93DD98 (0.0722)
v11 = mad_f32 v7 v9 v10
v12 = mad_f32 v7 v8 v11
v13 = mul_f32 v12 437F0000 (255)
v14 = round v13
store8 arg(1) v14
5 registers, 16 instructions:
0 r0 = splat 0 (0)
1 r1 = splat 3E59B3D0 (0.21259999)
2 r2 = splat 3F371759 (0.71520001)
loop:
3 r3 = load8 arg(0)
4 r3 = to_f32 r3
5 r3 = mul_f32 r3 3B808081 (0.0039215689)
6 r4 = load8 arg(1)
7 r4 = to_f32 r4
8 r4 = mul_f32 r4 3B808081 (0.0039215689)
9 r3 = mad_f32 r4 r0 r3
10 r4 = mul_f32 r3 3D93DD98 (0.0722)
11 r4 = mad_f32 r3 r2 r4
12 r4 = mad_f32 r3 r1 r4
13 r4 = mul_f32 r4 437F0000 (255)
14 r4 = round r4
15 store8 arg(1) r4
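
Likewise G8 over G8: the opaque source replaces the destination, and the
result is just the source gray pushed back through the luminance weights.
A scalar sketch of the math above:

    #include <cmath>
    #include <cstdint>

    uint8_t g8_over_g8(uint8_t src_g8, uint8_t dst_g8) {
        float sg = src_g8 * (1 / 255.0f);                  // v2
        float dg = dst_g8 * (1 / 255.0f);                  // v5
        float c  = dg * 0.0f + sg;                         // v7 = mad(dg, splat 0, sg)
        float y  = 0.2126f*c + 0.7152f*c + 0.0722f*c;      // v10..v12
        return (uint8_t)lroundf(y * 255.0f);               // v13, v14, store8
    }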
G8 over RGBA_8888
|
2020-01-08 21:25:07 +00:00
|
|
|
36 values:
|
sketch out structure for ops with immediates
Lots of x86 instructions can take their right hand side argument from
memory directly rather than a register. We can use this to avoid the
need to allocate a register for many constants.
The strategy in this CL is one of several I've been stewing over, the
simplest of those strategies I think. There are some trade offs
particularly on ARM; this naive ARM implementation means we'll load&op
every time, even though the load part of the operation can logically be
hoisted. From here on I'm going to just briefly enumerate a few other
approaches that allow the optimization on x86 and still allow the
immediate splats to hoist on ARM.
1) don't do it on ARM
A very simple approach is to simply not perform this optimization on
ARM. ARM has more vector registers than x86, and so register pressure
is lower there. We're going to end up with splatted constants in
registers anyway, so maybe just let that happen the normal way instead
of some roundabout complicated hack like I'll talk about in 2). The
only downside in my mind is that this approach would make high-level
program descriptions platform dependent, which isn't so bad, but it's
been nice to be able to compare and diff debug dumps.
2) split Op::splat up
The next less-simple approach to this problem could fix this by
splitting splats into two Ops internally, one inner Op::immediate that
guantees at least the constant is in memory and is compatible with
immediate-aware Ops like mul_f32_imm, and an outer Op::constant that
depends on that Op::immediate and further guarantees that constant has
been broadcast into a register to be compatible with non-immediate-aware
ops like div_f32. When building a program, immediate-aware ops would
peek for Op::constants as they do today for Op::splats, but instead of
embedding the immediate themselves, they'd replace their dependency with
the inner Op::immediate.
On x86 these new Ops would work just as advertised, with Op::immediate a
runtime no-op, Op::constant the usual vbroadcastss. On ARM
Op::immediate needs to go all the way and splat out a register to make
the constant compatible with immediate-aware ops, and the Op::constant
becomes a noop now instead. All this comes together to let the
Op::immediate splat hoist up out of the loop while still feeding
Op::mul_f32_imm and co. It's a rather complicated approach to solving
this issue, but I might want to explore it just to see how bad it is.
3) do it inside the x86 JIT
The conceptually best approach is to find a way to do this peepholing
only inside the JIT only on x86, avoiding the need for new
Op::mul_f32_imm and co. ARM and the interpreter don't benefit from this
peephole, so the x86 JIT is the logical owner of this optimization.
Finding a clean way to do this without too much disruption is the least
baked idea I've got here, though I think the most desirable long-term.
Cq-Include-Trybots: skia.primary:Test-Debian9-Clang-GCE-CPU-AVX2-x86_64-Debug-All-SK_USE_SKVM_BLITTER,Test-Debian9-Clang-GCE-CPU-AVX2-x86_64-Release-All-SK_USE_SKVM_BLITTER
Change-Id: Ie9c6336ed08b6fbeb89acf920a48a319f74f3643
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/254217
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-11-12 15:07:23 +00:00
|
|
|
v0 = load8 arg(0)
|
|
|
|
v1 = to_f32 v0
|
|
|
|
v2 = mul_f32 v1 3B808081 (0.0039215689)
|
|
|
|
v3 = load32 arg(1)
|
refactor bit ops
- Remove extract... it's not going to have any special impl.
I've left it on skvm::Builder as an inline compound method.
- Add no-op shift short circuits.
- Add immediate ops for bit_{and,or,xor,clear}.
This comes from me noticing that the masks for extract today are always
immediates, and then when I started converting it to be (I32, int shift,
int mask), I realized it might be even better to break it up into its
component pieces. There's no backend that can do extract any better
than shift-then-mask, so might as well leave it that way so we can
dedup, reorder, and specialize those micro ops.
Will follow up soon to get this all JITing again,
and these can-we-JIT test changes will be reverted.
Change-Id: I0835bcd825e417104ccc7efc79e9a0f2f4897841
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/263217
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-01-08 20:06:52 +00:00
|
|
|
v4 = bit_and v3 FF
|
|
|
|
v5 = to_f32 v4
|
|
|
|
v6 = mul_f32 v5 3B808081 (0.0039215689)
|
|
|
|
↑ v7 = splat 0 (0)
|
|
|
|
v8 = mad_f32 v6 v7 v2
|
|
|
|
v9 = mul_f32 v8 437F0000 (255)
|
|
|
|
v10 = round v9
|
|
|
|
v11 = shr_i32 v3 8
|
|
|
|
v12 = bit_and v11 FF
|
2019-11-21 17:46:58 +00:00
|
|
|
v13 = to_f32 v12
|
|
|
|
v14 = mul_f32 v13 3B808081 (0.0039215689)
|
refactor bit ops
- Remove extract... it's not going to have any special impl.
I've left it on skvm::Builder as an inline compound method.
- Add no-op shift short circuits.
- Add immediate ops for bit_{and,or,xor,clear}.
This comes from me noticing that the masks for extract today are always
immediates, and then when I started converting it to be (I32, int shift,
int mask), I realized it might be even better to break it up into its
component pieces. There's no backend that can do extract any better
than shift-then-mask, so might as well leave it that way so we can
dedup, reorder, and specialize those micro ops.
Will follow up soon to get this all JITing again,
and these can-we-JIT test changes will be reverted.
Change-Id: I0835bcd825e417104ccc7efc79e9a0f2f4897841
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/263217
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-01-08 20:06:52 +00:00
|
|
|
v15 = mad_f32 v14 v7 v2
|
2019-11-21 17:46:58 +00:00
|
|
|
v16 = mul_f32 v15 437F0000 (255)
|
|
|
|
v17 = round v16
|
refactor bit ops
- Remove extract... it's not going to have any special impl.
I've left it on skvm::Builder as an inline compound method.
- Add no-op shift short circuits.
- Add immediate ops for bit_{and,or,xor,clear}.
This comes from me noticing that the masks for extract today are always
immediates, and then when I started converting it to be (I32, int shift,
int mask), I realized it might be even better to break it up into its
component pieces. There's no backend that can do extract any better
than shift-then-mask, so might as well leave it that way so we can
dedup, reorder, and specialize those micro ops.
Will follow up soon to get this all JITing again,
and these can-we-JIT test changes will be reverted.
Change-Id: I0835bcd825e417104ccc7efc79e9a0f2f4897841
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/263217
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-01-08 20:06:52 +00:00
|
|
|
v18 = pack v10 v17 8
|
|
|
|
v19 = shr_i32 v3 16
|
|
|
|
v20 = bit_and v19 FF
|
|
|
|
v21 = to_f32 v20
|
|
|
|
v22 = mul_f32 v21 3B808081 (0.0039215689)
|
|
|
|
v23 = mad_f32 v22 v7 v2
|
|
|
|
v24 = mul_f32 v23 437F0000 (255)
|
|
|
|
v25 = round v24
|
|
|
|
↑ v26 = splat 3F800000 (1)
|
|
|
|
v27 = shr_i32 v3 24
|
2020-01-08 21:25:07 +00:00
|
|
|
v28 = to_f32 v27
|
|
|
|
v29 = mul_f32 v28 3B808081 (0.0039215689)
|
|
|
|
v30 = mad_f32 v29 v7 v26
|
|
|
|
v31 = mul_f32 v30 437F0000 (255)
|
|
|
|
v32 = round v31
|
|
|
|
v33 = pack v25 v32 8
|
|
|
|
v34 = pack v18 v33 16
|
|
|
|
store32 arg(1) v34
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2020-01-08 21:25:07 +00:00
|
|
|
6 registers, 36 instructions:
|
refactor bit ops
- Remove extract... it's not going to have any special impl.
I've left it on skvm::Builder as an inline compound method.
- Add no-op shift short circuits.
- Add immediate ops for bit_{and,or,xor,clear}.
This comes from me noticing that the masks for extract today are always
immediates, and then when I started converting it to be (I32, int shift,
int mask), I realized it might be even better to break it up into its
component pieces. There's no backend that can do extract any better
than shift-then-mask, so might as well leave it that way so we can
dedup, reorder, and specialize those micro ops.
Will follow up soon to get this all JITing again,
and these can-we-JIT test changes will be reverted.
Change-Id: I0835bcd825e417104ccc7efc79e9a0f2f4897841
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/263217
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-01-08 20:06:52 +00:00
|
|
|
0 r0 = splat 0 (0)
|
|
|
|
1 r1 = splat 3F800000 (1)
|
2019-06-05 15:47:46 +00:00
|
|
|
loop:
|
refactor bit ops
- Remove extract... it's not going to have any special impl.
I've left it on skvm::Builder as an inline compound method.
- Add no-op shift short circuits.
- Add immediate ops for bit_{and,or,xor,clear}.
This comes from me noticing that the masks for extract today are always
immediates, and then when I started converting it to be (I32, int shift,
int mask), I realized it might be even better to break it up into its
component pieces. There's no backend that can do extract any better
than shift-then-mask, so might as well leave it that way so we can
dedup, reorder, and specialize those micro ops.
Will follow up soon to get this all JITing again,
and these can-we-JIT test changes will be reverted.
Change-Id: I0835bcd825e417104ccc7efc79e9a0f2f4897841
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/263217
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-01-08 20:06:52 +00:00
|
|
|
2 r2 = load8 arg(0)
|
|
|
|
3 r2 = to_f32 r2
|
|
|
|
4 r2 = mul_f32 r2 3B808081 (0.0039215689)
|
|
|
|
5 r3 = load32 arg(1)
|
|
|
|
6 r4 = bit_and r3 FF
|
|
|
|
7 r4 = to_f32 r4
|
|
|
|
8 r4 = mul_f32 r4 3B808081 (0.0039215689)
|
|
|
|
9 r4 = mad_f32 r4 r0 r2
|
|
|
|
10 r4 = mul_f32 r4 437F0000 (255)
|
|
|
|
11 r4 = round r4
|
|
|
|
12 r5 = shr_i32 r3 8
|
|
|
|
13 r5 = bit_and r5 FF
|
|
|
|
14 r5 = to_f32 r5
|
|
|
|
15 r5 = mul_f32 r5 3B808081 (0.0039215689)
|
|
|
|
16 r5 = mad_f32 r5 r0 r2
|
|
|
|
17 r5 = mul_f32 r5 437F0000 (255)
|
|
|
|
18 r5 = round r5
|
|
|
|
19 r5 = pack r4 r5 8
|
|
|
|
20 r4 = shr_i32 r3 16
|
|
|
|
21 r4 = bit_and r4 FF
|
|
|
|
22 r4 = to_f32 r4
|
|
|
|
23 r4 = mul_f32 r4 3B808081 (0.0039215689)
|
|
|
|
24 r2 = mad_f32 r4 r0 r2
|
|
|
|
25 r2 = mul_f32 r2 437F0000 (255)
|
|
|
|
26 r2 = round r2
|
|
|
|
27 r3 = shr_i32 r3 24
|
2020-01-08 21:25:07 +00:00
|
|
|
28 r3 = to_f32 r3
|
|
|
|
29 r3 = mul_f32 r3 3B808081 (0.0039215689)
|
|
|
|
30 r3 = mad_f32 r3 r0 r1
|
|
|
|
31 r3 = mul_f32 r3 437F0000 (255)
|
|
|
|
32 r3 = round r3
|
|
|
|
33 r3 = pack r2 r3 8
|
|
|
|
34 r3 = pack r5 r3 16
|
|
|
|
35 store32 arg(1) r3
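
The pack tree at the end reassembles the RGBA_8888 pixel from the four rounded channels. A small sketch of how pack is read here; this is an assumption about the op, not something the dump itself states.

    // Sketch: pack a b n is taken to mean a | (b << n), with a assumed to fit
    // in the low n bits.
    #include <cstdint>

    uint32_t pack(uint32_t a, uint32_t b, int bits) { return a | (b << bits); }

    // v18 = pack(R, G, 8)       -> R | G<<8
    // v33 = pack(B, A, 8)       -> B | A<<8
    // v34 = pack(v18, v33, 16)  -> R | G<<8 | B<<16 | A<<24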

RGBA_8888 over A8
13 values:
v0 = load32 arg(0)
v1 = shr_i32 v0 24
v2 = to_f32 v1
v3 = mul_f32 v2 3B808081 (0.0039215689)
v4 = load8 arg(1)
v5 = to_f32 v4
v6 = mul_f32 v5 3B808081 (0.0039215689)
↑ v7 = splat 3F800000 (1)
v8 = sub_f32 v7 v3
v9 = mad_f32 v6 v8 v3
v10 = mul_f32 v9 437F0000 (255)
v11 = round v10
store8 arg(1) v11

4 registers, 13 instructions:
0 r0 = splat 3F800000 (1)
loop:
1 r1 = load32 arg(0)
2 r1 = shr_i32 r1 24
3 r1 = to_f32 r1
4 r1 = mul_f32 r1 3B808081 (0.0039215689)
5 r2 = load8 arg(1)
6 r2 = to_f32 r2
7 r2 = mul_f32 r2 3B808081 (0.0039215689)
8 r3 = sub_f32 r0 r1
9 r1 = mad_f32 r2 r3 r1
10 r1 = mul_f32 r1 437F0000 (255)
11 r1 = round r1
12 store8 arg(1) r1
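
The arithmetic here is the usual src-over formula applied to alpha only: v8 = 1 - sa, then v9 = da*(1 - sa) + sa. A scalar sketch, not Skia code:

    // Plain src-over on alpha, everything normalized to [0,1].
    float over_alpha(float sa, float da) {
        return da * (1.0f - sa) + sa;   // sub_f32 then mad_f32 v6 v8 v3
    }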

RGBA_8888 over G8
31 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = to_f32 v1
v3 = mul_f32 v2 3B808081 (0.0039215689)
v4 = load8 arg(1)
v5 = to_f32 v4
v6 = mul_f32 v5 3B808081 (0.0039215689)
v7 = shr_i32 v0 24
v8 = to_f32 v7
v9 = mul_f32 v8 3B808081 (0.0039215689)
↑ v10 = splat 3F800000 (1)
v11 = sub_f32 v10 v9
v12 = mad_f32 v6 v11 v3
↑ v13 = splat 3E59B3D0 (0.21259999)
v14 = shr_i32 v0 8
v15 = bit_and v14 FF
v16 = to_f32 v15
v17 = mul_f32 v16 3B808081 (0.0039215689)
v18 = mad_f32 v6 v11 v17
↑ v19 = splat 3F371759 (0.71520001)
v20 = shr_i32 v0 16
v21 = bit_and v20 FF
v22 = to_f32 v21
v23 = mul_f32 v22 3B808081 (0.0039215689)
v24 = mad_f32 v6 v11 v23
v25 = mul_f32 v24 3D93DD98 (0.0722)
v26 = mad_f32 v18 v19 v25
v27 = mad_f32 v12 v13 v26
v28 = mul_f32 v27 437F0000 (255)
v29 = round v28
store8 arg(1) v29

8 registers, 31 instructions:
0 r0 = splat 3F800000 (1)
1 r1 = splat 3E59B3D0 (0.21259999)
2 r2 = splat 3F371759 (0.71520001)
loop:
3 r3 = load32 arg(0)
4 r4 = bit_and r3 FF
5 r4 = to_f32 r4
6 r4 = mul_f32 r4 3B808081 (0.0039215689)
7 r5 = load8 arg(1)
8 r5 = to_f32 r5
9 r5 = mul_f32 r5 3B808081 (0.0039215689)
10 r6 = shr_i32 r3 24
11 r6 = to_f32 r6
12 r6 = mul_f32 r6 3B808081 (0.0039215689)
13 r6 = sub_f32 r0 r6
14 r4 = mad_f32 r5 r6 r4
15 r7 = shr_i32 r3 8
16 r7 = bit_and r7 FF
17 r7 = to_f32 r7
18 r7 = mul_f32 r7 3B808081 (0.0039215689)
19 r7 = mad_f32 r5 r6 r7
20 r3 = shr_i32 r3 16
21 r3 = bit_and r3 FF
22 r3 = to_f32 r3
23 r3 = mul_f32 r3 3B808081 (0.0039215689)
24 r3 = mad_f32 r5 r6 r3
25 r3 = mul_f32 r3 3D93DD98 (0.0722)
26 r3 = mad_f32 r7 r2 r3
27 r3 = mad_f32 r4 r1 r3
28 r3 = mul_f32 r3 437F0000 (255)
29 r3 = round r3
30 store8 arg(1) r3
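
After blending each of r, g, b against the G8 destination, the tail of this program collapses them to a single gray value using the three hoisted constants, which are BT.709-style luma weights. A scalar sketch, not Skia code:

    // y = 0.2126*r + 0.7152*g + 0.0722*b, matching the mul_f32/mad_f32 chain
    // that feeds the final round and store8.
    float luma709(float r, float g, float b) {
        return 0.2126f * r + 0.7152f * g + 0.0722f * b;
    }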

RGBA_8888 over RGBA_8888
48 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = to_f32 v1
v3 = mul_f32 v2 3B808081 (0.0039215689)
v4 = load32 arg(1)
v5 = bit_and v4 FF
v6 = to_f32 v5
v7 = mul_f32 v6 3B808081 (0.0039215689)
v8 = shr_i32 v0 24
v9 = to_f32 v8
v10 = mul_f32 v9 3B808081 (0.0039215689)
↑ v11 = splat 3F800000 (1)
v12 = sub_f32 v11 v10
v13 = mad_f32 v7 v12 v3
v14 = mul_f32 v13 437F0000 (255)
v15 = round v14
v16 = shr_i32 v0 8
v17 = bit_and v16 FF
v18 = to_f32 v17
v19 = mul_f32 v18 3B808081 (0.0039215689)
v20 = shr_i32 v4 8
v21 = bit_and v20 FF
v22 = to_f32 v21
v23 = mul_f32 v22 3B808081 (0.0039215689)
v24 = mad_f32 v23 v12 v19
v25 = mul_f32 v24 437F0000 (255)
v26 = round v25
v27 = pack v15 v26 8
v28 = shr_i32 v0 16
v29 = bit_and v28 FF
v30 = to_f32 v29
v31 = mul_f32 v30 3B808081 (0.0039215689)
v32 = shr_i32 v4 16
v33 = bit_and v32 FF
v34 = to_f32 v33
v35 = mul_f32 v34 3B808081 (0.0039215689)
v36 = mad_f32 v35 v12 v31
v37 = mul_f32 v36 437F0000 (255)
v38 = round v37
v39 = shr_i32 v4 24
v40 = to_f32 v39
v41 = mul_f32 v40 3B808081 (0.0039215689)
v42 = mad_f32 v41 v12 v10
v43 = mul_f32 v42 437F0000 (255)
v44 = round v43
v45 = pack v38 v44 8
v46 = pack v27 v45 16
store32 arg(1) v46

8 registers, 48 instructions:
0 r0 = splat 3F800000 (1)
loop:
1 r1 = load32 arg(0)
2 r2 = bit_and r1 FF
3 r2 = to_f32 r2
4 r2 = mul_f32 r2 3B808081 (0.0039215689)
5 r3 = load32 arg(1)
6 r4 = bit_and r3 FF
7 r4 = to_f32 r4
8 r4 = mul_f32 r4 3B808081 (0.0039215689)
9 r5 = shr_i32 r1 24
10 r5 = to_f32 r5
11 r5 = mul_f32 r5 3B808081 (0.0039215689)
12 r6 = sub_f32 r0 r5
13 r2 = mad_f32 r4 r6 r2
14 r2 = mul_f32 r2 437F0000 (255)
15 r2 = round r2
16 r4 = shr_i32 r1 8
17 r4 = bit_and r4 FF
18 r4 = to_f32 r4
19 r4 = mul_f32 r4 3B808081 (0.0039215689)
20 r7 = shr_i32 r3 8
21 r7 = bit_and r7 FF
22 r7 = to_f32 r7
23 r7 = mul_f32 r7 3B808081 (0.0039215689)
24 r4 = mad_f32 r7 r6 r4
25 r4 = mul_f32 r4 437F0000 (255)
26 r4 = round r4
27 r4 = pack r2 r4 8
28 r1 = shr_i32 r1 16
29 r1 = bit_and r1 FF
30 r1 = to_f32 r1
31 r1 = mul_f32 r1 3B808081 (0.0039215689)
32 r2 = shr_i32 r3 16
33 r2 = bit_and r2 FF
34 r2 = to_f32 r2
35 r2 = mul_f32 r2 3B808081 (0.0039215689)
36 r1 = mad_f32 r2 r6 r1
37 r1 = mul_f32 r1 437F0000 (255)
38 r1 = round r1
39 r3 = shr_i32 r3 24
40 r3 = to_f32 r3
41 r3 = mul_f32 r3 3B808081 (0.0039215689)
42 r5 = mad_f32 r3 r6 r5
43 r5 = mul_f32 r5 437F0000 (255)
44 r5 = round r5
45 r5 = pack r1 r5 8
46 r5 = pack r4 r5 16
47 store32 arg(1) r5
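
Per channel this is classic premultiplied src-over: sub_f32 computes 1 - sa once, then each mad_f32 computes d*(1 - sa) + s for r, g, b, and a. A scalar sketch, not Skia code:

    // out = s + d*(1 - sa) on every channel, all values normalized to [0,1].
    struct RGBAf { float r, g, b, a; };

    RGBAf src_over(RGBAf s, RGBAf d) {
        float inv_sa = 1.0f - s.a;            // sub_f32
        return { d.r * inv_sa + s.r,          // mad_f32, one per channel
                 d.g * inv_sa + s.g,
                 d.b * inv_sa + s.b,
                 d.a * inv_sa + s.a };
    }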

I32 (Naive) 8888 over 8888
32 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = load32 arg(1)
v3 = bit_and v2 FF
v4 = shr_i32 v0 24
↑ v5 = splat 100 (3.5873241e-43)
v6 = sub_i32 v5 v4
v7 = mul_i32 v3 v6
v8 = shr_i32 v7 8
v9 = add_i32 v1 v8
v10 = shr_i32 v0 8
v11 = bit_and v10 FF
v12 = shr_i32 v2 8
v13 = bit_and v12 FF
v14 = mul_i32 v13 v6
v15 = shr_i32 v14 8
v16 = add_i32 v11 v15
v17 = pack v9 v16 8
v18 = shr_i32 v0 16
v19 = bit_and v18 FF
v20 = shr_i32 v2 16
v21 = bit_and v20 FF
v22 = mul_i32 v21 v6
v23 = shr_i32 v22 8
v24 = add_i32 v19 v23
v25 = shr_i32 v2 24
v26 = mul_i32 v25 v6
v27 = shr_i32 v26 8
v28 = add_i32 v4 v27
v29 = pack v24 v28 8
v30 = pack v17 v29 16
store32 arg(1) v30

8 registers, 32 instructions:
0 r0 = splat 100 (3.5873241e-43)
loop:
1 r1 = load32 arg(0)
2 r2 = bit_and r1 FF
3 r3 = load32 arg(1)
4 r4 = bit_and r3 FF
5 r5 = shr_i32 r1 24
6 r6 = sub_i32 r0 r5
7 r4 = mul_i32 r4 r6
8 r4 = shr_i32 r4 8
9 r4 = add_i32 r2 r4
10 r2 = shr_i32 r1 8
11 r2 = bit_and r2 FF
12 r7 = shr_i32 r3 8
13 r7 = bit_and r7 FF
14 r7 = mul_i32 r7 r6
15 r7 = shr_i32 r7 8
16 r7 = add_i32 r2 r7
17 r7 = pack r4 r7 8
18 r1 = shr_i32 r1 16
19 r1 = bit_and r1 FF
20 r4 = shr_i32 r3 16
21 r4 = bit_and r4 FF
22 r4 = mul_i32 r4 r6
23 r4 = shr_i32 r4 8
24 r4 = add_i32 r1 r4
25 r3 = shr_i32 r3 24
26 r6 = mul_i32 r3 r6
27 r6 = shr_i32 r6 8
28 r6 = add_i32 r5 r6
29 r6 = pack r4 r6 8
30 r6 = pack r7 r6 16
31 store32 arg(1) r6
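
The I32 paths stay in 8-bit fixed point instead of converting to float. The hoisted splat 100 is hex, i.e. 256 (its float reinterpretation is the tiny 3.5873241e-43), so each channel computes out = s + ((d * (256 - sa)) >> 8). A scalar sketch, not Skia code:

    // Fixed-point src-over on one 8-bit channel, matching the
    // sub_i32 / mul_i32 / shr_i32 8 / add_i32 sequence above.
    #include <cstdint>

    uint8_t over_u8(uint8_t s, uint8_t d, uint8_t sa) {
        return (uint8_t)(s + ((d * (256 - sa)) >> 8));
    }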
|
2019-06-20 16:37:10 +00:00
|
|
|
|
2019-06-05 15:47:46 +00:00
|
|
|
I32 8888 over 8888
|
refactor bit ops
- Remove extract... it's not going to have any special impl.
I've left it on skvm::Builder as an inline compound method.
- Add no-op shift short circuits.
- Add immediate ops for bit_{and,or,xor,clear}.
This comes from me noticing that the masks for extract today are always
immediates, and then when I started converting it to be (I32, int shift,
int mask), I realized it might be even better to break it up into its
component pieces. There's no backend that can do extract any better
than shift-then-mask, so might as well leave it that way so we can
dedup, reorder, and specialize those micro ops.
Will follow up soon to get this all JITing again,
and these can-we-JIT test changes will be reverted.
Change-Id: I0835bcd825e417104ccc7efc79e9a0f2f4897841
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/263217
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-01-08 20:06:52 +00:00
|
|
|
28 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = load32 arg(1)
v3 = bit_and v2 FF
v4 = shr_i32 v0 24
↑ v5 = splat 100 (3.5873241e-43)
v6 = sub_i32 v5 v4
v7 = mul_i16x2 v3 v6
v8 = shr_i32 v7 8
v9 = add_i32 v1 v8
v10 = bytes v0 2
v11 = bytes v2 2
v12 = mul_i16x2 v11 v6
v13 = shr_i32 v12 8
v14 = add_i32 v10 v13
v15 = pack v9 v14 8
v16 = bytes v0 3
v17 = bytes v2 3
v18 = mul_i16x2 v17 v6
v19 = shr_i32 v18 8
v20 = add_i32 v16 v19
v21 = shr_i32 v2 24
v22 = mul_i16x2 v21 v6
v23 = shr_i32 v22 8
v24 = add_i32 v4 v23
v25 = pack v20 v24 8
v26 = pack v15 v25 16
store32 arg(1) v26
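The bytes op above (v10 = bytes v0 2, v16 = bytes v0 3, and bytes v0 404 in the SWAR program further down) reads as a byte shuffle. As a hedged guess at its semantics, inferred only from how these dumps use it rather than from the skvm source, each hex digit of the control picks a source byte, counted 1..4 from the low end, and 0 yields zero. A scalar emulation under that assumption:

#include <cstdint>

// Hypothetical scalar emulation of the 'bytes' shuffle (an assumption, not
// taken from skvm): control digit n copies source byte n (1 = lowest byte,
// 4 = highest) into that destination byte, and digit 0 writes zero.
static inline uint32_t bytes(uint32_t x, uint32_t control) {
    uint32_t result = 0;
    for (int i = 0; i < 4; i++) {
        int digit = (control >> (4*i)) & 0xF;
        if (digit != 0) {
            result |= ((x >> (8*(digit-1))) & 0xFF) << (8*i);
        }
    }
    return result;
}

// bytes(px, 0x2)   ~ (px >> 8) & 0xFF, the green byte, as v10/v11 use it
// bytes(px, 0x404) broadcasts the alpha byte into bytes 0 and 2 (0x00AA00AA)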
8 registers, 28 instructions:
0 r0 = splat 100 (3.5873241e-43)
loop:
1 r1 = load32 arg(0)
2 r2 = bit_and r1 FF
3 r3 = load32 arg(1)
4 r4 = bit_and r3 FF
5 r5 = shr_i32 r1 24
6 r6 = sub_i32 r0 r5
7 r4 = mul_i16x2 r4 r6
8 r4 = shr_i32 r4 8
9 r4 = add_i32 r2 r4
10 r2 = bytes r1 2
11 r7 = bytes r3 2
12 r7 = mul_i16x2 r7 r6
13 r7 = shr_i32 r7 8
14 r7 = add_i32 r2 r7
15 r7 = pack r4 r7 8
16 r1 = bytes r1 3
17 r4 = bytes r3 3
18 r4 = mul_i16x2 r4 r6
19 r4 = shr_i32 r4 8
20 r4 = add_i32 r1 r4
21 r3 = shr_i32 r3 24
22 r6 = mul_i16x2 r3 r6
23 r6 = shr_i32 r6 8
24 r6 = add_i32 r5 r6
25 r6 = pack r4 r6 8
26 r6 = pack r7 r6 16
27 store32 arg(1) r6
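Both dumps above encode the usual premultiplied src-over-dst approximation, one channel at a time: extract a channel from src and dst, scale the dst channel by (256 - src alpha), shift down by 8, and add. A scalar picture of that math, just an illustration of what the program computes rather than Skia's code:

#include <cstdint>

// One channel of the over blend as the dumps above compute it:
//   result = s + (d * (256 - sa)) / 256
static inline uint32_t over_channel(uint32_t s, uint32_t d, uint32_t sa) {
    return s + ((d * (256 - sa)) >> 8);
}

static inline uint32_t over_8888(uint32_t src, uint32_t dst) {
    uint32_t sa = src >> 24;
    uint32_t r = over_channel( src        & 0xFF,  dst        & 0xFF, sa);
    uint32_t g = over_channel((src >>  8) & 0xFF, (dst >>  8) & 0xFF, sa);
    uint32_t b = over_channel((src >> 16) & 0xFF, (dst >> 16) & 0xFF, sa);
    uint32_t a = over_channel( src >> 24,          dst >> 24,         sa);
    return r | (g << 8) | (b << 16) | (a << 24);   // the pack ops reassemble the pixel
}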
I32 (SWAR) 8888 over 8888
14 values:
v0 = load32 arg(0)
v1 = bytes v0 404
↑ v2 = splat 1000100 (2.3510604e-38)
v3 = sub_i16x2 v2 v1
v4 = load32 arg(1)
v5 = bit_and v4 FF00FF
v6 = mul_i16x2 v5 v3
v7 = shr_i16x2 v6 8
v8 = shr_i16x2 v4 8
v9 = mul_i16x2 v8 v3
v10 = bit_and v9 FF00FF00
v11 = bit_or v7 v10
v12 = add_i32 v0 v11
store32 arg(1) v12
5 registers, 14 instructions:
0 r0 = splat 1000100 (2.3510604e-38)
loop:
1 r1 = load32 arg(0)
2 r2 = bytes r1 404
3 r2 = sub_i16x2 r0 r2
4 r3 = load32 arg(1)
5 r4 = bit_and r3 FF00FF
6 r4 = mul_i16x2 r4 r2
7 r4 = shr_i16x2 r4 8
8 r3 = shr_i16x2 r3 8
9 r2 = mul_i16x2 r3 r2
10 r2 = bit_and r2 FF00FF00
11 r2 = bit_or r4 r2
12 r2 = add_i32 r1 r2
13 store32 arg(1) r2
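The SWAR variant does the same blend two channels at a time: red and blue ride in the low bytes of each 16-bit half, green and alpha in the high bytes, and mul_i16x2 multiplies both halves at once. A scalar sketch of the same arithmetic, again only an illustration of what the dump computes:

#include <cstdint>

// Multiply the two 16-bit halves of x and y independently, like mul_i16x2.
static inline uint32_t mul_i16x2(uint32_t x, uint32_t y) {
    uint32_t lo = (x & 0xFFFF) * (y & 0xFFFF);
    uint32_t hi = (x >> 16)    * (y >> 16);
    return (hi << 16) | (lo & 0xFFFF);
}

static inline uint32_t over_8888_swar(uint32_t src, uint32_t dst) {
    uint32_t inv = 0x01000100 - ((src >> 24) * 0x00010001);              // 256-sa in both halves
    uint32_t rb  = (mul_i16x2( dst       & 0x00FF00FF, inv) >> 8) & 0x00FF00FF; // red, blue
    uint32_t ag  =  mul_i16x2((dst >> 8) & 0x00FF00FF, inv)       & 0xFF00FF00; // green, alpha
    return src + (rb | ag);
}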
6 values:
↟ v0 = splat 1 (1.4012985e-45)
↟ v1 = splat 2 (2.8025969e-45)
↑ v2 = add_i32 v0 v1
v3 = load32 arg(0)
v4 = mul_i32 v3 v2
store32 arg(0) v4
2 registers, 6 instructions:
0 r0 = splat 1 (1.4012985e-45)
1 r1 = splat 2 (2.8025969e-45)
2 r1 = add_i32 r0 r1
loop:
3 r0 = load32 arg(0)
4 r0 = mul_i32 r0 r1
5 store32 arg(0) r0
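The marked splats and the add that consumes them are loop-invariant, and the register dump above reflects that: instructions 0-2 sit before loop: and run once. In scalar terms the whole program amounts to something like this (a rough picture, not Skia's code):

#include <cstdint>

// What the 6-instruction program above boils down to once the invariant work
// is hoisted: the splats and their add run once; only load/mul/store loop.
void run(int32_t* arg0, int n) {
    const int32_t k = 1 + 2;          // splat 1, splat 2, add_i32 -- hoisted above loop:
    for (int i = 0; i < n; i++) {     // loop:
        arg0[i] = arg0[i] * k;        // load32, mul_i32, store32
    }
}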
22 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = load32 arg(1)
v3 = bit_and v2 FF
v4 = add_i32 v1 v3
v5 = shr_i32 v0 8
v6 = bit_and v5 FF
v7 = shr_i32 v2 8
v8 = bit_and v7 FF
v9 = add_i32 v6 v8
v10 = pack v4 v9 8
v11 = shr_i32 v0 16
v12 = bit_and v11 FF
v13 = shr_i32 v2 16
v14 = bit_and v13 FF
v15 = add_i32 v12 v14
v16 = shr_i32 v0 24
v17 = shr_i32 v2 24
v18 = add_i32 v16 v17
v19 = pack v15 v18 8
v20 = pack v10 v19 16
store32 arg(1) v20
reorder to minimize register pressure
Rewrite program instructions so that each value becomes available as
late as possible, just before it's used by another instruction. This
reorders blocks of instructions to reduce the number of temporary
registers in flight.
Take this example of the sort of program that we naturally write,
noting the registers needed as we progress down the right:
src = load32 ... (1)
sr = extract src ... (2)
sg = extract src ... (3)
sb = extract src ... (4)
sa = extract src ... (4, src dies)
dst = load32 ... (5)
dr = extract dst ... (6)
dg = extract dst ... (7)
db = extract dst ... (8)
da = extract dst ... (8, dst dies)
r = add sr dr (7, sr and dr die)
g = add sg dg (6, sg and dg die)
b = add sb db (5, sb and db die)
a = add sa da (4, sa and da die)
rg = pack r g ... (3, r and g die)
ba = pack b a ... (2, b and a die)
rgba = pack rg ba ... (1, rg and ba die)
store32 rgba ... (0, rgba dies)
That original ordering of the code needs 8 registers (perhaps with a
temporary 9th, but we'll ignore that here). This CL will rewrite the
program to something more like this by recursively issuing inputs only
once needed:
src = load32 ... (1)
sr = extract src ... (2)
dst = load32 ... (3)
dr = extract dst ... (4)
r = add sr dr (3, sr and dr die)
sg = extract src ... (4)
dg = extract dst ... (5)
g = add sg dg (4, sg and dg die)
rg = pack r g (3, r and g die)
sb = extract src ... (4)
db = extract dst ... (5)
b = add sb db (4, sb and db die)
sa = extract src ... (4, src dies)
da = extract dst ... (4, dst dies)
a = add sa da (3, sa and da die)
ba = pack b a (2, b and a die)
rgba = pack rg ba ... (1, rg and ba die)
store32 rgba ... (0)
That trims 3 registers off the example, just by reordering!
I've added the real version of this example to SkVMTest.cpp.
(Its 6th register comes from holding the 0xff byte mask used
by extract, in case you're curious).
I'll admit it's not exactly easy to work out how this reordering works
without a pen and paper or trial and error. I've tried to make the
implementation preserve the original program's order as much as makes
sense (i.e. when order is an otherwise arbitrary choice) to keep it
somewhat sane to follow.
This reordering naturally skips dead code, so pour one out for ☠️ .
We lose our cute dead code emoji marker, but on the other hand all code
downstream of Builder::done() can assume every instruction is live.
Change-Id: Iceffcd10fd7465eae51a39ef8eec7a7189766ba2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/249999
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-10-22 17:27:58 +00:00
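A minimal sketch of the scheduling idea described above, assuming a generic SSA-ish instruction list; this is not Skia's actual implementation, just the recursive issue-inputs-on-demand shape the message describes:

#include <vector>

// Each value records the ids of the values it consumes.
struct Inst { std::vector<int> inputs; /* op, immediates, etc. elided */ };

// Issue a value as late as possible: recursively issue its inputs first,
// so each input lands immediately before its first user in the new order.
static void issue(int id, const std::vector<Inst>& program,
                  std::vector<bool>& issued, std::vector<int>& order) {
    if (issued[id]) { return; }          // already scheduled by an earlier user
    issued[id] = true;
    for (int input : program[id].inputs) {
        issue(input, program, issued, order);
    }
    order.push_back(id);
}

// Walk only the roots (stores and other side effects); anything unreachable
// is dead code and simply never gets issued.
static std::vector<int> reorder(const std::vector<Inst>& program,
                                const std::vector<int>& roots) {
    std::vector<bool> issued(program.size(), false);
    std::vector<int> order;
    for (int root : roots) {
        issue(root, program, issued, order);
    }
    return order;
}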
5 registers, 22 instructions:
loop:
0 r0 = load32 arg(0)
1 r1 = bit_and r0 FF
2 r2 = load32 arg(1)
3 r3 = bit_and r2 FF
4 r3 = add_i32 r1 r3
5 r1 = shr_i32 r0 8
6 r1 = bit_and r1 FF
7 r4 = shr_i32 r2 8
8 r4 = bit_and r4 FF
9 r4 = add_i32 r1 r4
10 r4 = pack r3 r4 8
11 r3 = shr_i32 r0 16
12 r3 = bit_and r3 FF
13 r1 = shr_i32 r2 16
14 r1 = bit_and r1 FF
15 r1 = add_i32 r3 r1
16 r0 = shr_i32 r0 24
17 r2 = shr_i32 r2 24
18 r2 = add_i32 r0 r2
19 r2 = pack r1 r2 8
20 r2 = pack r4 r2 16
21 store32 arg(1) r2
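This last pair of dumps looks like the per-channel add example referenced in the reordering message above: extract each channel of src and dst, add, and pack everything back together, now down to 5 registers with the masks folded into bit_and immediates. The scalar math, for reference:

#include <cstdint>

// Per-channel add, mirroring the extract/add/pack structure of the dump above.
// Like the dump, this does no clamping; sums past 255 would spill upward.
static inline uint32_t add_8888(uint32_t src, uint32_t dst) {
    uint32_t r = ( src        & 0xFF) + ( dst        & 0xFF);
    uint32_t g = ((src >>  8) & 0xFF) + ((dst >>  8) & 0xFF);
    uint32_t b = ((src >> 16) & 0xFF) + ((dst >> 16) & 0xFF);
    uint32_t a = ( src >> 24        ) + ( dst >> 24        );
    uint32_t rg = r | (g << 8);          // pack v4 v9 8
    uint32_t ba = b | (a << 8);          // pack v15 v18 8
    return rg | (ba << 16);              // pack v10 v19 16
}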