skia2/resources/SkVMTest.expected

A8 over A8
12 values:
v0 = load8 arg(0)
v1 = to_f32 v0
v2 = mul_f32 v1 3B808081 (0.0039215689)
v3 = load8 arg(1)
v4 = to_f32 v3
v5 = mul_f32 v4 3B808081 (0.0039215689)
↑ v6 = splat 3F800000 (1)
v7 = sub_f32 v6 v2
v8 = mad_f32 v5 v7 v2
v9 = mul_f32 v8 437F0000 (255)
v10 = round v9
store8 arg(1) v10
4 registers, 12 instructions:
0 r0 = splat 3F800000 (1)
loop:
1 r1 = load8 arg(0)
2 r1 = to_f32 r1
3 r1 = mul_f32 r1 3B808081 (0.0039215689)
4 r2 = load8 arg(1)
5 r2 = to_f32 r2
6 r2 = mul_f32 r2 3B808081 (0.0039215689)
7 r3 = sub_f32 r0 r1
8 r1 = mad_f32 r2 r3 r1
9 r1 = mul_f32 r1 437F0000 (255)
10 r1 = round r1
11 store8 arg(1) r1
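
For reference, a minimal scalar C++ sketch of what this program computes per pixel; the function name and signature are illustrative, not Skia API:

    #include <cstdint>

    // A8 over A8: dst' = sa + da*(1 - sa), carried out in unpremul [0,1] floats.
    uint8_t a8_over_a8(uint8_t src, uint8_t dst) {
        float s   = src * (1 / 255.0f);        // mul_f32 by 3B808081 (1/255)
        float d   = dst * (1 / 255.0f);
        float out = d * (1.0f - s) + s;        // v8 = mad_f32 v5 v7 v2
        return (uint8_t)(out * 255.0f + 0.5f); // mul_f32 by 255, then round
    }

Only the splat of 1.0 is loop-invariant (marked ↑ above); the 1/255 and 255 constants ride along as immediates on their mul_f32 ops.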
A8 over G8
17 values:
v0 = load8 arg(1)
v1 = to_f32 v0
v2 = mul_f32 v1 3B808081 (0.0039215689)
v3 = load8 arg(0)
v4 = to_f32 v3
v5 = mul_f32 v4 3B808081 (0.0039215689)
↑ v6 = splat 3F800000 (1)
v7 = sub_f32 v6 v5
v8 = mul_f32 v2 v7
↑ v9 = splat 3E59B3D0 (0.21259999)
↑ v10 = splat 3F371759 (0.71520001)
v11 = mul_f32 v8 3D93DD98 (0.0722)
v12 = mad_f32 v8 v10 v11
v13 = mad_f32 v8 v9 v12
v14 = mul_f32 v13 437F0000 (255)
v15 = round v14
store8 arg(1) v15
5 registers, 17 instructions:
0 r0 = splat 3F800000 (1)
1 r1 = splat 3E59B3D0 (0.21259999)
2 r2 = splat 3F371759 (0.71520001)
loop:
3 r3 = load8 arg(1)
4 r3 = to_f32 r3
5 r3 = mul_f32 r3 3B808081 (0.0039215689)
6 r4 = load8 arg(0)
7 r4 = to_f32 r4
8 r4 = mul_f32 r4 3B808081 (0.0039215689)
9 r4 = sub_f32 r0 r4
10 r4 = mul_f32 r3 r4
11 r3 = mul_f32 r4 3D93DD98 (0.0722)
12 r3 = mad_f32 r4 r2 r3
13 r3 = mad_f32 r4 r1 r3
14 r3 = mul_f32 r3 437F0000 (255)
15 r3 = round r3
16 store8 arg(1) r3
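
The three constants here are the BT.709 luma weights: the hoisted splats 3E59B3D0 and 3F371759 are 0.2126 and 0.7152, and 3D93DD98 (0.0722) stays inline as an immediate. A hedged scalar equivalent, with illustrative names:

    #include <cstdint>

    // A8 over G8: blend the gray channel, then take the BT.709 luma of the
    // (gray) result. With r = g = b the weighted sum just reproduces the
    // value to within float rounding, since 0.2126 + 0.7152 + 0.0722 ~= 1.
    uint8_t a8_over_g8(uint8_t srcA, uint8_t dstG) {
        float sa      = srcA * (1 / 255.0f);
        float g       = dstG * (1 / 255.0f);
        float blended = g * (1.0f - sa);       // v8 = mul_f32 v2 v7
        float luma    = 0.2126f * blended      // v13 = mad_f32 v8 v9 v12
                      + 0.7152f * blended      // v12 = mad_f32 v8 v10 v11
                      + 0.0722f * blended;     // v11 = mul_f32 v8 3D93DD98
        return (uint8_t)(luma * 255.0f + 0.5f);
    }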
A8 over RGBA_8888
37 values:
v0 = load32 arg(1)
v1 = bit_and v0 FF
v2 = to_f32 v1
v3 = mul_f32 v2 3B808081 (0.0039215689)
v4 = load8 arg(0)
v5 = to_f32 v4
v6 = mul_f32 v5 3B808081 (0.0039215689)
↑ v7 = splat 3F800000 (1)
v8 = sub_f32 v7 v6
v9 = mul_f32 v3 v8
v10 = mul_f32 v9 437F0000 (255)
v11 = round v10
v12 = shr_i32 v0 8
v13 = bit_and v12 FF
v14 = to_f32 v13
v15 = mul_f32 v14 3B808081 (0.0039215689)
v16 = mul_f32 v15 v8
v17 = mul_f32 v16 437F0000 (255)
v18 = round v17
v19 = pack v11 v18 8
v20 = shr_i32 v0 16
v21 = bit_and v20 FF
v22 = to_f32 v21
v23 = mul_f32 v22 3B808081 (0.0039215689)
v24 = mul_f32 v23 v8
v25 = mul_f32 v24 437F0000 (255)
v26 = round v25
v27 = shr_i32 v0 24
v28 = bit_and v27 FF
v29 = to_f32 v28
v30 = mul_f32 v29 3B808081 (0.0039215689)
v31 = mad_f32 v30 v8 v6
v32 = mul_f32 v31 437F0000 (255)
v33 = round v32
v34 = pack v26 v33 8
v35 = pack v19 v34 16
store32 arg(1) v35
6 registers, 37 instructions:
0 r0 = splat 3F800000 (1)
loop:
1 r1 = load32 arg(1)
2 r2 = bit_and r1 FF
3 r2 = to_f32 r2
4 r2 = mul_f32 r2 3B808081 (0.0039215689)
5 r3 = load8 arg(0)
6 r3 = to_f32 r3
7 r3 = mul_f32 r3 3B808081 (0.0039215689)
8 r4 = sub_f32 r0 r3
9 r2 = mul_f32 r2 r4
10 r2 = mul_f32 r2 437F0000 (255)
11 r2 = round r2
12 r5 = shr_i32 r1 8
13 r5 = bit_and r5 FF
14 r5 = to_f32 r5
15 r5 = mul_f32 r5 3B808081 (0.0039215689)
16 r5 = mul_f32 r5 r4
17 r5 = mul_f32 r5 437F0000 (255)
18 r5 = round r5
19 r5 = pack r2 r5 8
20 r2 = shr_i32 r1 16
21 r2 = bit_and r2 FF
22 r2 = to_f32 r2
23 r2 = mul_f32 r2 3B808081 (0.0039215689)
24 r2 = mul_f32 r2 r4
25 r2 = mul_f32 r2 437F0000 (255)
26 r2 = round r2
27 r1 = shr_i32 r1 24
28 r1 = bit_and r1 FF
29 r1 = to_f32 r1
30 r1 = mul_f32 r1 3B808081 (0.0039215689)
31 r3 = mad_f32 r1 r4 r3
32 r3 = mul_f32 r3 437F0000 (255)
33 r3 = round r3
34 r3 = pack r2 r3 8
35 r3 = pack r5 r3 16
36 store32 arg(1) r3
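
A scalar sketch of the lane handling, under the assumption that the shifts below stand in for the dump's shr_i32/bit_and unpacking and pack repacking; names are illustrative:

    #include <cstdint>

    // A8 over RGBA_8888: unpack each byte lane, blend in float, repack.
    // Color lanes get d*(1-sa); the alpha lane gets sa + da*(1-sa).
    uint32_t a8_over_rgba_8888(uint8_t srcA, uint32_t dst) {
        float sa  = srcA * (1 / 255.0f);
        float inv = 1.0f - sa;                                   // v8 = sub_f32 v7 v6
        auto lane = [&](int sh) { return ((dst >> sh) & 0xFF) * (1 / 255.0f); };
        auto to8  = [](float x) { return (uint32_t)(x * 255.0f + 0.5f); };
        uint32_t r = to8(lane( 0) * inv);
        uint32_t g = to8(lane( 8) * inv);
        uint32_t b = to8(lane(16) * inv);
        uint32_t a = to8(lane(24) * inv + sa);                   // v31 = mad_f32 v30 v8 v6
        return (a << 24) | (b << 16) | (g << 8) | r;
    }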
G8 over A8
9 values:
↑ v0 = splat 3F800000 (1)
↑ v1 = splat 0 (0)
v2 = load8 arg(1)
v3 = to_f32 v2
v4 = mul_f32 v3 3B808081 (0.0039215689)
v5 = mad_f32 v4 v1 v0
v6 = mul_f32 v5 437F0000 (255)
v7 = round v6
store8 arg(1) v7
3 registers, 9 instructions:
0 r0 = splat 3F800000 (1)
1 r1 = splat 0 (0)
loop:
2 r2 = load8 arg(1)
3 r2 = to_f32 r2
4 r2 = mul_f32 r2 3B808081 (0.0039215689)
5 r2 = mad_f32 r2 r1 r0
6 r2 = mul_f32 r2 437F0000 (255)
7 r2 = round r2
8 store8 arg(1) r2
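
A G8 source is opaque, so sa = 1 and the blended alpha is da*(1-1) + 1 = 1. The dump spells this out as mad_f32(da, splat 0, splat 1) rather than folding it away, so the stored byte is always 255. A minimal sketch, with an illustrative name:

    #include <cstdint>

    // G8 over A8: opaque source forces the result alpha to 1.
    uint8_t g8_over_a8(uint8_t dstA) {
        float da  = dstA * (1 / 255.0f);
        float out = da * 0.0f + 1.0f;          // v5 = mad_f32 v4 v1 v0
        return (uint8_t)(out * 255.0f + 0.5f); // 255 for every input
    }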
G8 over G8
16 values:
v0 = load8 arg(0)
v1 = to_f32 v0
v2 = mul_f32 v1 3B808081 (0.0039215689)
v3 = load8 arg(1)
v4 = to_f32 v3
v5 = mul_f32 v4 3B808081 (0.0039215689)
↑ v6 = splat 0 (0)
v7 = mad_f32 v5 v6 v2
↑ v8 = splat 3E59B3D0 (0.21259999)
↑ v9 = splat 3F371759 (0.71520001)
v10 = mul_f32 v7 3D93DD98 (0.0722)
v11 = mad_f32 v7 v9 v10
v12 = mad_f32 v7 v8 v11
v13 = mul_f32 v12 437F0000 (255)
v14 = round v13
store8 arg(1) v14
5 registers, 16 instructions:
0 r0 = splat 0 (0)
1 r1 = splat 3E59B3D0 (0.21259999)
2 r2 = splat 3F371759 (0.71520001)
loop:
3 r3 = load8 arg(0)
4 r3 = to_f32 r3
5 r3 = mul_f32 r3 3B808081 (0.0039215689)
6 r4 = load8 arg(1)
7 r4 = to_f32 r4
8 r4 = mul_f32 r4 3B808081 (0.0039215689)
9 r3 = mad_f32 r4 r0 r3
10 r4 = mul_f32 r3 3D93DD98 (0.0722)
11 r4 = mad_f32 r3 r2 r4
12 r4 = mad_f32 r3 r1 r4
13 r4 = mul_f32 r4 437F0000 (255)
14 r4 = round r4
15 store8 arg(1) r4
G8 over RGBA_8888
37 values:
v0 = load8 arg(0)
v1 = to_f32 v0
v2 = mul_f32 v1 3B808081 (0.0039215689)
v3 = load32 arg(1)
v4 = bit_and v3 FF
v5 = to_f32 v4
v6 = mul_f32 v5 3B808081 (0.0039215689)
↑ v7 = splat 0 (0)
v8 = mad_f32 v6 v7 v2
v9 = mul_f32 v8 437F0000 (255)
v10 = round v9
v11 = shr_i32 v3 8
v12 = bit_and v11 FF
v13 = to_f32 v12
v14 = mul_f32 v13 3B808081 (0.0039215689)
v15 = mad_f32 v14 v7 v2
v16 = mul_f32 v15 437F0000 (255)
v17 = round v16
v18 = pack v10 v17 8
v19 = shr_i32 v3 16
v20 = bit_and v19 FF
v21 = to_f32 v20
v22 = mul_f32 v21 3B808081 (0.0039215689)
v23 = mad_f32 v22 v7 v2
v24 = mul_f32 v23 437F0000 (255)
v25 = round v24
↑ v26 = splat 3F800000 (1)
v27 = shr_i32 v3 24
v28 = bit_and v27 FF
v29 = to_f32 v28
v30 = mul_f32 v29 3B808081 (0.0039215689)
v31 = mad_f32 v30 v7 v26
v32 = mul_f32 v31 437F0000 (255)
v33 = round v32
v34 = pack v25 v33 8
v35 = pack v18 v34 16
store32 arg(1) v35
6 registers, 37 instructions:
0 r0 = splat 0 (0)
1 r1 = splat 3F800000 (1)
loop:
2 r2 = load8 arg(0)
3 r2 = to_f32 r2
4 r2 = mul_f32 r2 3B808081 (0.0039215689)
5 r3 = load32 arg(1)
6 r4 = bit_and r3 FF
7 r4 = to_f32 r4
8 r4 = mul_f32 r4 3B808081 (0.0039215689)
9 r4 = mad_f32 r4 r0 r2
10 r4 = mul_f32 r4 437F0000 (255)
11 r4 = round r4
12 r5 = shr_i32 r3 8
13 r5 = bit_and r5 FF
14 r5 = to_f32 r5
15 r5 = mul_f32 r5 3B808081 (0.0039215689)
16 r5 = mad_f32 r5 r0 r2
17 r5 = mul_f32 r5 437F0000 (255)
18 r5 = round r5
19 r5 = pack r4 r5 8
20 r4 = shr_i32 r3 16
21 r4 = bit_and r4 FF
22 r4 = to_f32 r4
23 r4 = mul_f32 r4 3B808081 (0.0039215689)
24 r2 = mad_f32 r4 r0 r2
25 r2 = mul_f32 r2 437F0000 (255)
26 r2 = round r2
27 r3 = shr_i32 r3 24
28 r3 = bit_and r3 FF
29 r3 = to_f32 r3
30 r3 = mul_f32 r3 3B808081 (0.0039215689)
31 r3 = mad_f32 r3 r0 r1
32 r3 = mul_f32 r3 437F0000 (255)
33 r3 = round r3
34 r3 = pack r2 r3 8
35 r3 = pack r5 r3 16
36 store32 arg(1) r3
RGBA_8888 over A8
14 values:
v0 = load32 arg(0)
v1 = shr_i32 v0 24
v2 = bit_and v1 FF
v3 = to_f32 v2
v4 = mul_f32 v3 3B808081 (0.0039215689)
v5 = load8 arg(1)
v6 = to_f32 v5
v7 = mul_f32 v6 3B808081 (0.0039215689)
↑ v8 = splat 3F800000 (1)
v9 = sub_f32 v8 v4
v10 = mad_f32 v7 v9 v4
v11 = mul_f32 v10 437F0000 (255)
v12 = round v11
store8 arg(1) v12
4 registers, 14 instructions:
0 r0 = splat 3F800000 (1)
loop:
1 r1 = load32 arg(0)
2 r1 = shr_i32 r1 24
3 r1 = bit_and r1 FF
4 r1 = to_f32 r1
5 r1 = mul_f32 r1 3B808081 (0.0039215689)
6 r2 = load8 arg(1)
7 r2 = to_f32 r2
8 r2 = mul_f32 r2 3B808081 (0.0039215689)
9 r3 = sub_f32 r0 r1
10 r1 = mad_f32 r2 r3 r1
11 r1 = mul_f32 r1 437F0000 (255)
12 r1 = round r1
13 store8 arg(1) r1
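The program above is source-over reduced to its alpha channel: a' = sa + da*(1 - sa), with bytes scaled into [0,1] by 1/255 (3B808081) and back by 255 (437F0000) before the rounded store. A C sketch of the same arithmetic, approximating round with +0.5 (names illustrative, not Skia API):

    #include <stdint.h>

    uint8_t over_a8(uint8_t sa8, uint8_t da8) {
        float sa = sa8 * (1 / 255.0f);
        float da = da8 * (1 / 255.0f);
        float a  = da * (1.0f - sa) + sa;      // mad_f32(da, 1 - sa, sa)
        return (uint8_t)(a * 255.0f + 0.5f);   // mul_f32 by 255, then round
    }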
RGBA_8888 over G8
32 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = to_f32 v1
v3 = mul_f32 v2 3B808081 (0.0039215689)
v4 = load8 arg(1)
v5 = to_f32 v4
v6 = mul_f32 v5 3B808081 (0.0039215689)
v7 = shr_i32 v0 24
v8 = bit_and v7 FF
v9 = to_f32 v8
v10 = mul_f32 v9 3B808081 (0.0039215689)
↑ v11 = splat 3F800000 (1)
v12 = sub_f32 v11 v10
v13 = mad_f32 v6 v12 v3
↑ v14 = splat 3E59B3D0 (0.21259999)
v15 = shr_i32 v0 8
v16 = bit_and v15 FF
v17 = to_f32 v16
v18 = mul_f32 v17 3B808081 (0.0039215689)
v19 = mad_f32 v6 v12 v18
↑ v20 = splat 3F371759 (0.71520001)
v21 = shr_i32 v0 16
v22 = bit_and v21 FF
v23 = to_f32 v22
v24 = mul_f32 v23 3B808081 (0.0039215689)
v25 = mad_f32 v6 v12 v24
v26 = mul_f32 v25 3D93DD98 (0.0722)
v27 = mad_f32 v19 v20 v26
v28 = mad_f32 v13 v14 v27
v29 = mul_f32 v28 437F0000 (255)
v30 = round v29
store8 arg(1) v30
8 registers, 32 instructions:
0 r0 = splat 3F800000 (1)
1 r1 = splat 3E59B3D0 (0.21259999)
2 r2 = splat 3F371759 (0.71520001)
loop:
3 r3 = load32 arg(0)
4 r4 = bit_and r3 FF
5 r4 = to_f32 r4
6 r4 = mul_f32 r4 3B808081 (0.0039215689)
7 r5 = load8 arg(1)
8 r5 = to_f32 r5
9 r5 = mul_f32 r5 3B808081 (0.0039215689)
10 r6 = shr_i32 r3 24
11 r6 = bit_and r6 FF
12 r6 = to_f32 r6
13 r6 = mul_f32 r6 3B808081 (0.0039215689)
14 r6 = sub_f32 r0 r6
15 r4 = mad_f32 r5 r6 r4
16 r7 = shr_i32 r3 8
17 r7 = bit_and r7 FF
18 r7 = to_f32 r7
19 r7 = mul_f32 r7 3B808081 (0.0039215689)
20 r7 = mad_f32 r5 r6 r7
21 r3 = shr_i32 r3 16
22 r3 = bit_and r3 FF
23 r3 = to_f32 r3
24 r3 = mul_f32 r3 3B808081 (0.0039215689)
25 r3 = mad_f32 r5 r6 r3
26 r3 = mul_f32 r3 3D93DD98 (0.0722)
27 r3 = mad_f32 r7 r2 r3
28 r3 = mad_f32 r4 r1 r3
29 r3 = mul_f32 r3 437F0000 (255)
30 r3 = round r3
31 store8 arg(1) r3
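Here each RGB channel of the source is blended over the gray destination, and the blended color collapses back to gray with the BT.709 luma weights seen above (0.2126, 0.7152, 0.0722). In C terms, with r, g, b the blended channels in [0,1]:

    float luma = 0.2126f * r + 0.7152f * g + 0.0722f * b;  // BT.709
    // then round(luma * 255) is stored as the new gray byte.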
RGBA_8888 over RGBA_8888
50 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = to_f32 v1
v3 = mul_f32 v2 3B808081 (0.0039215689)
v4 = load32 arg(1)
v5 = bit_and v4 FF
v6 = to_f32 v5
v7 = mul_f32 v6 3B808081 (0.0039215689)
v8 = shr_i32 v0 24
v9 = bit_and v8 FF
v10 = to_f32 v9
v11 = mul_f32 v10 3B808081 (0.0039215689)
↑ v12 = splat 3F800000 (1)
v13 = sub_f32 v12 v11
v14 = mad_f32 v7 v13 v3
v15 = mul_f32 v14 437F0000 (255)
v16 = round v15
v17 = shr_i32 v0 8
v18 = bit_and v17 FF
v19 = to_f32 v18
v20 = mul_f32 v19 3B808081 (0.0039215689)
v21 = shr_i32 v4 8
v22 = bit_and v21 FF
v23 = to_f32 v22
v24 = mul_f32 v23 3B808081 (0.0039215689)
v25 = mad_f32 v24 v13 v20
v26 = mul_f32 v25 437F0000 (255)
v27 = round v26
v28 = pack v16 v27 8
v29 = shr_i32 v0 16
v30 = bit_and v29 FF
v31 = to_f32 v30
v32 = mul_f32 v31 3B808081 (0.0039215689)
v33 = shr_i32 v4 16
v34 = bit_and v33 FF
v35 = to_f32 v34
v36 = mul_f32 v35 3B808081 (0.0039215689)
v37 = mad_f32 v36 v13 v32
v38 = mul_f32 v37 437F0000 (255)
v39 = round v38
v40 = shr_i32 v4 24
v41 = bit_and v40 FF
v42 = to_f32 v41
v43 = mul_f32 v42 3B808081 (0.0039215689)
v44 = mad_f32 v43 v13 v11
v45 = mul_f32 v44 437F0000 (255)
v46 = round v45
v47 = pack v39 v46 8
v48 = pack v28 v47 16
store32 arg(1) v48
8 registers, 50 instructions:
0 r0 = splat 3F800000 (1)
loop:
1 r1 = load32 arg(0)
2 r2 = bit_and r1 FF
3 r2 = to_f32 r2
4 r2 = mul_f32 r2 3B808081 (0.0039215689)
5 r3 = load32 arg(1)
6 r4 = bit_and r3 FF
7 r4 = to_f32 r4
8 r4 = mul_f32 r4 3B808081 (0.0039215689)
9 r5 = shr_i32 r1 24
10 r5 = bit_and r5 FF
11 r5 = to_f32 r5
12 r5 = mul_f32 r5 3B808081 (0.0039215689)
13 r6 = sub_f32 r0 r5
14 r2 = mad_f32 r4 r6 r2
15 r2 = mul_f32 r2 437F0000 (255)
16 r2 = round r2
17 r4 = shr_i32 r1 8
18 r4 = bit_and r4 FF
19 r4 = to_f32 r4
20 r4 = mul_f32 r4 3B808081 (0.0039215689)
21 r7 = shr_i32 r3 8
22 r7 = bit_and r7 FF
23 r7 = to_f32 r7
24 r7 = mul_f32 r7 3B808081 (0.0039215689)
25 r4 = mad_f32 r7 r6 r4
26 r4 = mul_f32 r4 437F0000 (255)
27 r4 = round r4
28 r4 = pack r2 r4 8
29 r1 = shr_i32 r1 16
30 r1 = bit_and r1 FF
31 r1 = to_f32 r1
32 r1 = mul_f32 r1 3B808081 (0.0039215689)
33 r2 = shr_i32 r3 16
34 r2 = bit_and r2 FF
35 r2 = to_f32 r2
36 r2 = mul_f32 r2 3B808081 (0.0039215689)
37 r1 = mad_f32 r2 r6 r1
38 r1 = mul_f32 r1 437F0000 (255)
39 r1 = round r1
40 r3 = shr_i32 r3 24
41 r3 = bit_and r3 FF
42 r3 = to_f32 r3
43 r3 = mul_f32 r3 3B808081 (0.0039215689)
44 r5 = mad_f32 r3 r6 r5
45 r5 = mul_f32 r5 437F0000 (255)
46 r5 = round r5
47 r5 = pack r1 r5 8
48 r5 = pack r4 r5 16
49 store32 arg(1) r5
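The full float path applies d*(1 - sa) + s to every channel before repacking. A compact C equivalent (function name illustrative, round approximated with +0.5):

    #include <stdint.h>

    uint32_t over_8888_float(uint32_t src, uint32_t dst) {
        const float inv = 1 / 255.0f;
        float q = 1.0f - ((src >> 24) & 0xFF) * inv;       // sub_f32(1, sa)
        uint32_t out = 0;
        for (int c = 0; c < 32; c += 8) {
            float s = ((src >> c) & 0xFF) * inv;
            float d = ((dst >> c) & 0xFF) * inv;
            float o = d * q + s;                           // mad_f32(d, q, s)
            out |= (uint32_t)(o * 255.0f + 0.5f) << c;     // round, then pack
        }
        return out;
    }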
I32 (Naive) 8888 over 8888
34 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = load32 arg(1)
v3 = bit_and v2 FF
v4 = shr_i32 v0 24
v5 = bit_and v4 FF
↑ v6 = splat 100 (3.5873241e-43)
v7 = sub_i32 v6 v5
v8 = mul_i32 v3 v7
v9 = shr_i32 v8 8
v10 = add_i32 v1 v9
v11 = shr_i32 v0 8
v12 = bit_and v11 FF
v13 = shr_i32 v2 8
v14 = bit_and v13 FF
v15 = mul_i32 v14 v7
v16 = shr_i32 v15 8
v17 = add_i32 v12 v16
v18 = pack v10 v17 8
v19 = shr_i32 v0 16
v20 = bit_and v19 FF
v21 = shr_i32 v2 16
v22 = bit_and v21 FF
v23 = mul_i32 v22 v7
v24 = shr_i32 v23 8
v25 = add_i32 v20 v24
v26 = shr_i32 v2 24
v27 = bit_and v26 FF
v28 = mul_i32 v27 v7
v29 = shr_i32 v28 8
v30 = add_i32 v5 v29
v31 = pack v25 v30 8
v32 = pack v18 v31 16
store32 arg(1) v32
8 registers, 34 instructions:
0 r0 = splat 100 (3.5873241e-43)
loop:
1 r1 = load32 arg(0)
2 r2 = bit_and r1 FF
3 r3 = load32 arg(1)
4 r4 = bit_and r3 FF
5 r5 = shr_i32 r1 24
6 r5 = bit_and r5 FF
7 r6 = sub_i32 r0 r5
8 r4 = mul_i32 r4 r6
9 r4 = shr_i32 r4 8
10 r4 = add_i32 r2 r4
11 r2 = shr_i32 r1 8
12 r2 = bit_and r2 FF
13 r7 = shr_i32 r3 8
14 r7 = bit_and r7 FF
15 r7 = mul_i32 r7 r6
16 r7 = shr_i32 r7 8
17 r7 = add_i32 r2 r7
18 r7 = pack r4 r7 8
19 r1 = shr_i32 r1 16
20 r1 = bit_and r1 FF
21 r4 = shr_i32 r3 16
22 r4 = bit_and r4 FF
23 r4 = mul_i32 r4 r6
24 r4 = shr_i32 r4 8
25 r4 = add_i32 r1 r4
26 r3 = shr_i32 r3 24
27 r3 = bit_and r3 FF
28 r6 = mul_i32 r3 r6
29 r6 = shr_i32 r6 8
30 r6 = add_i32 r5 r6
31 r6 = pack r4 r6 8
32 r6 = pack r7 r6 16
33 store32 arg(1) r6
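The naive integer version swaps the float math for a 256-based fixed-point scale: each destination channel is multiplied by (256 - sa) and shifted down 8 bits before the source channel is added, with no clamping, mirroring the add_i32/pack sequence above. In C (name illustrative):

    #include <stdint.h>

    uint32_t over_8888_i32(uint32_t src, uint32_t dst) {
        uint32_t q = 256 - (src >> 24);                    // sub_i32(0x100, sa)
        uint32_t out = 0;
        for (int c = 0; c < 32; c += 8) {
            uint32_t s = (src >> c) & 0xFF;
            uint32_t d = (dst >> c) & 0xFF;
            out |= (s + ((d * q) >> 8)) << c;              // mul, shr, add, pack
        }
        return out;
    }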
I32 8888 over 8888
28 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = load32 arg(1)
v3 = bit_and v2 FF
v4 = shr_i32 v0 24
↑ v5 = splat 100 (3.5873241e-43)
v6 = sub_i32 v5 v4
v7 = mul_i16x2 v3 v6
v8 = shr_i32 v7 8
v9 = add_i32 v1 v8
v10 = bytes v0 2
v11 = bytes v2 2
v12 = mul_i16x2 v11 v6
v13 = shr_i32 v12 8
v14 = add_i32 v10 v13
v15 = pack v9 v14 8
v16 = bytes v0 3
v17 = bytes v2 3
v18 = mul_i16x2 v17 v6
v19 = shr_i32 v18 8
v20 = add_i32 v16 v19
v21 = shr_i32 v2 24
v22 = mul_i16x2 v21 v6
v23 = shr_i32 v22 8
v24 = add_i32 v4 v23
v25 = pack v20 v24 8
v26 = pack v15 v25 16
store32 arg(1) v26
8 registers, 28 instructions:
0 r0 = splat 100 (3.5873241e-43)
loop:
1 r1 = load32 arg(0)
2 r2 = bit_and r1 FF
3 r3 = load32 arg(1)
4 r4 = bit_and r3 FF
5 r5 = shr_i32 r1 24
6 r6 = sub_i32 r0 r5
7 r4 = mul_i16x2 r4 r6
8 r4 = shr_i32 r4 8
9 r4 = add_i32 r2 r4
10 r2 = bytes r1 2
11 r7 = bytes r3 2
12 r7 = mul_i16x2 r7 r6
13 r7 = shr_i32 r7 8
14 r7 = add_i32 r2 r7
15 r7 = pack r4 r7 8
16 r1 = bytes r1 3
17 r4 = bytes r3 3
18 r4 = mul_i16x2 r4 r6
19 r4 = shr_i32 r4 8
20 r4 = add_i32 r1 r4
21 r3 = shr_i32 r3 24
22 r6 = mul_i16x2 r3 r6
23 r6 = shr_i32 r6 8
24 r6 = add_i32 r5 r6
25 r6 = pack r4 r6 8
26 r6 = pack r7 r6 16
27 store32 arg(1) r6
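This variant computes the same channel sums but extracts channels with the bytes shuffle and multiplies in paired 16-bit lanes (mul_i16x2). Assuming the nibble encoding bytes appears to use here, where each hex nibble of the immediate drives one output byte, values 1-4 selecting a source byte from low to high and 0 producing zero, a scalar sketch:

    #include <stdint.h>

    // bytes(x, 0x2) yields green in the low byte; bytes(x, 0x3) yields blue.
    uint32_t bytes_shuffle(uint32_t x, uint32_t control) {
        uint32_t out = 0;
        for (int i = 0; i < 4; i++) {
            uint32_t sel = (control >> (4 * i)) & 0xF;     // nibble i -> output byte i
            uint32_t b   = sel ? (x >> (8 * (sel - 1))) & 0xFF : 0;
            out |= b << (8 * i);
        }
        return out;
    }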
I32 (SWAR) 8888 over 8888
14 values:
v0 = load32 arg(0)
v1 = bytes v0 404
↑ v2 = splat 1000100 (2.3510604e-38)
v3 = sub_i16x2 v2 v1
v4 = load32 arg(1)
v5 = bit_and v4 FF00FF
v6 = mul_i16x2 v5 v3
v7 = shr_i16x2 v6 8
v8 = shr_i16x2 v4 8
v9 = mul_i16x2 v8 v3
v10 = bit_and v9 FF00FF00
v11 = bit_or v7 v10
v12 = add_i32 v0 v11
store32 arg(1) v12
5 registers, 14 instructions:
0 r0 = splat 1000100 (2.3510604e-38)
loop:
1 r1 = load32 arg(0)
2 r2 = bytes r1 404
3 r2 = sub_i16x2 r0 r2
4 r3 = load32 arg(1)
5 r4 = bit_and r3 FF00FF
6 r4 = mul_i16x2 r4 r2
7 r4 = shr_i16x2 r4 8
8 r3 = shr_i16x2 r3 8
9 r2 = mul_i16x2 r3 r2
10 r2 = bit_and r2 FF00FF00
11 r2 = bit_or r4 r2
12 r2 = add_i32 r1 r2
13 store32 arg(1) r2
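The SWAR version blends two channels per multiply: 1000100 is 256 in each 16-bit lane, bytes src 404 replicates source alpha into both lanes, and the r|b and g|a pairs are scaled separately, recombined, and added to the untouched source pixel. A scalar C sketch emulating the paired ops (names illustrative):

    #include <stdint.h>

    static inline uint32_t mul_i16x2(uint32_t x, uint32_t y) {
        // Two independent 16-bit multiplies; the products here fit in 16 bits (<= 255*256).
        uint32_t lo = ((x & 0xFFFFu) * (y & 0xFFFFu)) & 0xFFFFu;
        uint32_t hi = ((x >> 16) * (y >> 16)) & 0xFFFFu;
        return (hi << 16) | lo;
    }

    uint32_t over_8888_swar(uint32_t src, uint32_t dst) {
        uint32_t a2 = (src >> 24) * 0x00010001u;           // bytes(src, 0x404): alpha per lane
        uint32_t q  = 0x01000100u - a2;                    // sub_i16x2: (256 - sa) per lane
        uint32_t rb = (mul_i16x2(dst & 0x00FF00FFu, q) >> 8) & 0x00FF00FFu;
        uint32_t ga =  mul_i16x2((dst >> 8) & 0x00FF00FFu, q) & 0xFF00FF00u;
        return src + (rb | ga);                            // add_i32
    }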
6 values:
↟ v0 = splat 1 (1.4012985e-45)
↟ v1 = splat 2 (2.8025969e-45)
↑ v2 = add_i32 v0 v1
v3 = load32 arg(0)
v4 = mul_i32 v3 v2
store32 arg(0) v4
2 registers, 6 instructions:
0 r0 = splat 1 (1.4012985e-45)
1 r1 = splat 2 (2.8025969e-45)
2 r1 = add_i32 r0 r1
loop:
3 r0 = load32 arg(0)
4 r0 = mul_i32 r0 r1
5 store32 arg(0) r0
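
In the six-value program above, ↟ marks a hoisted value that is only a temporary (used to build other hoisted values), while ↑ marks a hoisted value the loop body actually reads; because the temporaries die before the loop starts, their registers are recycled and the whole program fits in two registers. A C++ sketch of the equivalent control flow (the element count n and the pointer type are assumptions):

    #include <cstdint>

    // Model of the hoisting example above: loop-invariant work moves out of the loop.
    void mul_by_hoisted_sum(int32_t* arg0, int n) {
        int32_t v0 = 1;          // ↟ splat 1, hoisted temporary
        int32_t v1 = 2;          // ↟ splat 2, hoisted temporary
        int32_t v2 = v0 + v1;    // ↑ add_i32, hoisted and used in the loop
        for (int i = 0; i < n; i++) {
            arg0[i] *= v2;       // load32, mul_i32, store32
        }
    }
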
24 values:
v0 = load32 arg(0)
v1 = bit_and v0 FF
v2 = load32 arg(1)
v3 = bit_and v2 FF
v4 = add_i32 v1 v3
v5 = shr_i32 v0 8
v6 = bit_and v5 FF
v7 = shr_i32 v2 8
v8 = bit_and v7 FF
v9 = add_i32 v6 v8
v10 = pack v4 v9 8
v11 = shr_i32 v0 16
v12 = bit_and v11 FF
v13 = shr_i32 v2 16
v14 = bit_and v13 FF
v15 = add_i32 v12 v14
v16 = shr_i32 v0 24
v17 = bit_and v16 FF
v18 = shr_i32 v2 24
v19 = bit_and v18 FF
v20 = add_i32 v17 v19
v21 = pack v15 v20 8
v22 = pack v10 v21 16
store32 arg(1) v22
5 registers, 24 instructions:
loop:
0 r0 = load32 arg(0)
1 r1 = bit_and r0 FF
2 r2 = load32 arg(1)
3 r3 = bit_and r2 FF
4 r3 = add_i32 r1 r3
5 r1 = shr_i32 r0 8
6 r1 = bit_and r1 FF
7 r4 = shr_i32 r2 8
8 r4 = bit_and r4 FF
9 r4 = add_i32 r1 r4
10 r4 = pack r3 r4 8
11 r3 = shr_i32 r0 16
12 r3 = bit_and r3 FF
13 r1 = shr_i32 r2 16
14 r1 = bit_and r1 FF
15 r1 = add_i32 r3 r1
16 r0 = shr_i32 r0 24
17 r0 = bit_and r0 FF
18 r2 = shr_i32 r2 24
19 r2 = bit_and r2 FF
20 r2 = add_i32 r0 r2
21 r2 = pack r1 r2 8
22 r2 = pack r4 r2 16
23 store32 arg(1) r2
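
This final pair is the register-pressure example: a plain per-channel add of two 8888 pixels. Written channel by channel, each source/destination byte is extracted just before its add and each sub-result is packed as soon as both halves exist, so at most five values are ever live at once, and the register program above indeed uses only r0..r4. A scalar C++ model (a sketch; like the program it does not clamp, so a channel sum above 255 carries into the next packed byte):

    #include <cstdint>

    // Scalar model of the 24-instruction per-channel add above.
    uint32_t add_8888(uint32_t src, uint32_t dst) {
        uint32_t r  = ( src        & 0xFF) + ( dst        & 0xFF);
        uint32_t g  = ((src >>  8) & 0xFF) + ((dst >>  8) & 0xFF);
        uint32_t rg = r | (g << 8);         // pack v4 v9 8
        uint32_t b  = ((src >> 16) & 0xFF) + ((dst >> 16) & 0xFF);
        uint32_t a  = ((src >> 24) & 0xFF) + ((dst >> 24) & 0xFF);
        uint32_t ba = b | (a << 8);         // pack v15 v20 8
        return rg | (ba << 16);             // pack v10 v21 16
    }
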