skia2/resources/SkVMTest.expected

A8 over A8
14 values (originally 16):
↑ v0 = splat 3B808081 (0.0039215689)
v1 = load8 arg(0)
v2 = to_f32 v1
v3 = mul_f32 v0 v2
v4 = load8 arg(1)
v5 = to_f32 v4
v6 = mul_f32 v0 v5
↑ v7 = splat 3F800000 (1)
v8 = sub_f32 v7 v3
v9 = mad_f32 v6 v8 v3
↑ v10 = splat 437F0000 (255)
v11 = mul_f32 v9 v10
v12 = round v11
store8 arg(1) v12
6 registers, 14 instructions:
0	r0 = splat 3B808081 (0.0039215689)
1	r1 = splat 3F800000 (1)
2	r2 = splat 437F0000 (255)
loop:
3	r3 = load8 arg(0)
4	r3 = to_f32 r3
5	r3 = mul_f32 r0 r3
6	r4 = load8 arg(1)
7	r4 = to_f32 r4
8	r4 = mul_f32 r0 r4
9	r5 = sub_f32 r1 r3
10	r3 = mad_f32 r4 r5 r3
11	r3 = mul_f32 r3 r2
12	r3 = round r3
13	store8 arg(1) r3

sketch out structure for ops with immediates

Lots of x86 instructions can take their right-hand-side argument directly from memory rather than from a register. We can use this to avoid allocating a register for many constants.

The strategy in this CL is one of several I've been stewing over, and I think the simplest of them. There are some trade-offs, particularly on ARM: this naive ARM implementation means we'll load&op every time, even though the load part of the operation can logically be hoisted. From here on I'll briefly enumerate a few other approaches that allow the optimization on x86 and still let the immediate splats hoist on ARM.

1) Don't do it on ARM.

A very simple approach is to simply not perform this optimization on ARM. ARM has more vector registers than x86, so register pressure is lower there. We're going to end up with splatted constants in registers anyway, so maybe just let that happen the normal way instead of some roundabout complicated hack like I'll talk about in 2). The only downside in my mind is that this approach would make high-level program descriptions platform dependent, which isn't so bad, but it's been nice to be able to compare and diff debug dumps.

2) Split Op::splat up.

The next less-simple approach could fix this by splitting splats into two Ops internally: an inner Op::immediate that guarantees at least that the constant is in memory and is compatible with immediate-aware ops like mul_f32_imm, and an outer Op::constant that depends on that Op::immediate and further guarantees the constant has been broadcast into a register, making it compatible with non-immediate-aware ops like div_f32.

When building a program, immediate-aware ops would peek for Op::constants as they do today for Op::splats, but instead of embedding the immediate themselves, they'd replace their dependency with the inner Op::immediate. On x86 these new Ops would work just as advertised, with Op::immediate a runtime no-op and Op::constant the usual vbroadcastss. On ARM, Op::immediate needs to go all the way and splat out a register to make the constant compatible with immediate-aware ops, and Op::constant becomes the no-op instead. All this comes together to let the Op::immediate splat hoist up out of the loop while still feeding Op::mul_f32_imm and co. It's a rather complicated approach to solving this issue, but I might want to explore it just to see how bad it is.

3) Do it inside the x86 JIT.

The conceptually best approach is to find a way to do this peepholing only inside the JIT, only on x86, avoiding the need for new Op::mul_f32_imm and co. ARM and the interpreter don't benefit from this peephole, so the x86 JIT is the logical owner of this optimization. Finding a clean way to do this without too much disruption is the least-baked idea I've got here, though I think the most desirable long-term.

Cq-Include-Trybots: skia.primary:Test-Debian9-Clang-GCE-CPU-AVX2-x86_64-Debug-All-SK_USE_SKVM_BLITTER,Test-Debian9-Clang-GCE-CPU-AVX2-x86_64-Release-All-SK_USE_SKVM_BLITTER
Change-Id: Ie9c6336ed08b6fbeb89acf920a48a319f74f3643
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/254217
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-11-12 15:07:23 +00:00
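The note above names mul_f32_imm and Op::splat but doesn't show the peephole itself. Below is a minimal self-contained sketch of the idea against a toy IR; Builder, push, and isSplat are illustrative stand-ins of my own, not Skia's actual skvm API.

    #include <cstdint>
    #include <vector>

    // Toy IR: a value is an index into the instruction list.
    enum class Op { splat, mul_f32, mul_f32_imm };
    struct Inst { Op op; int x = -1, y = -1; uint32_t imm = 0; };

    struct Builder {
        std::vector<Inst> program;

        int push(Inst i) { program.push_back(i); return (int)program.size() - 1; }

        // True (and fills bits) iff value id was built as a splatted constant.
        bool isSplat(int id, uint32_t* bits) const {
            if (program[id].op != Op::splat) { return false; }
            *bits = program[id].imm;
            return true;
        }

        // The peephole: when either operand is a splat, fold its bits into an
        // immediate-aware mul_f32_imm instead of keeping the splat in a register.
        int mul(int x, int y) {
            uint32_t bits;
            if (isSplat(y, &bits)) { return push({Op::mul_f32_imm, x, -1, bits}); }
            if (isSplat(x, &bits)) { return push({Op::mul_f32_imm, y, -1, bits}); }  // mul commutes
            return push({Op::mul_f32, x, y});
        }
    };

On x86 a backend could then encode the immediate as a broadcast memory operand; on ARM it would still want a hoisted register, which is exactly the trade-off the note weighs.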
A8 over G8
20 values (originally 22):
↑ v0 = splat 3B808081 (0.0039215689)
v1 = load8 arg(1)
v2 = to_f32 v1
v3 = mul_f32 v0 v2
v4 = load8 arg(0)
v5 = to_f32 v4
v6 = mul_f32 v0 v5
↑ v7 = splat 3F800000 (1)
v8 = sub_f32 v7 v6
v9 = mul_f32 v3 v8
↑ v10 = splat 3E59B3D0 (0.21259999)
↑ v11 = splat 3F371759 (0.71520001)
↑ v12 = splat 3D93DD98 (0.0722)
v13 = mul_f32 v9 v12
v14 = mad_f32 v9 v11 v13
v15 = mad_f32 v9 v10 v14
↑ v16 = splat 437F0000 (255)
v17 = mul_f32 v15 v16
v18 = round v17
store8 arg(1) v18
8 registers, 20 instructions:
0 r0 = splat 3B808081 (0.0039215689)
1 r1 = splat 3F800000 (1)
2 r2 = splat 3E59B3D0 (0.21259999)
3 r3 = splat 3F371759 (0.71520001)
4 r4 = splat 3D93DD98 (0.0722)
5 r5 = splat 437F0000 (255)
loop:
6 r6 = load8 arg(1)
7 r6 = to_f32 r6
8 r6 = mul_f32 r0 r6
9 r7 = load8 arg(0)
10 r7 = to_f32 r7
11 r7 = mul_f32 r0 r7
12 r7 = sub_f32 r1 r7
13 r7 = mul_f32 r6 r7
14 r6 = mul_f32 r7 r4
15 r6 = mad_f32 r7 r3 r6
16 r6 = mad_f32 r7 r2 r6
17 r6 = mul_f32 r6 r5
18 r6 = round r6
19 store8 arg(1) r6
A8 over RGBA_8888
39 values (originally 40):
↑ v0 = splat 3B808081 (0.0039215689)
v1 = load32 arg(1)
↑ v2 = splat FF (3.5733111e-43)
v3 = bit_and v2 v1
v4 = to_f32 v3
v5 = mul_f32 v0 v4
v6 = load8 arg(0)
v7 = to_f32 v6
v8 = mul_f32 v0 v7
↑ v9 = splat 3F800000 (1)
v10 = sub_f32 v9 v8
v11 = mul_f32 v5 v10
↑ v12 = splat 437F0000 (255)
v13 = mul_f32 v11 v12
v14 = round v13
v15 = shr_i32 v1 8
v16 = bit_and v2 v15
v17 = to_f32 v16
v18 = mul_f32 v0 v17
v19 = mul_f32 v18 v10
v20 = mul_f32 v19 v12
v21 = round v20
v22 = pack v14 v21 8
v23 = shr_i32 v1 16
v24 = bit_and v2 v23
v25 = to_f32 v24
v26 = mul_f32 v0 v25
v27 = mul_f32 v26 v10
v28 = mul_f32 v27 v12
v29 = round v28
v30 = shr_i32 v1 24
v31 = to_f32 v30
v32 = mul_f32 v0 v31
v33 = mad_f32 v32 v10 v8
v34 = mul_f32 v33 v12
v35 = round v34
v36 = pack v29 v35 8
v37 = pack v22 v36 16
store32 arg(1) v37
9 registers, 39 instructions:
0 r0 = splat 3B808081 (0.0039215689)
1 r1 = splat FF (3.5733111e-43)
2 r2 = splat 3F800000 (1)
3 r3 = splat 437F0000 (255)
loop:
4 r4 = load32 arg(1)
5 r5 = bit_and r1 r4
6 r5 = to_f32 r5
7 r5 = mul_f32 r0 r5
8 r6 = load8 arg(0)
9 r6 = to_f32 r6
10 r6 = mul_f32 r0 r6
11 r7 = sub_f32 r2 r6
12 r5 = mul_f32 r5 r7
13 r5 = mul_f32 r5 r3
14 r5 = round r5
15 r8 = shr_i32 r4 8
16 r8 = bit_and r1 r8
17 r8 = to_f32 r8
18 r8 = mul_f32 r0 r8
19 r8 = mul_f32 r8 r7
20 r8 = mul_f32 r8 r3
21 r8 = round r8
22 r8 = pack r5 r8 8
23 r5 = shr_i32 r4 16
24 r5 = bit_and r1 r5
25 r5 = to_f32 r5
26 r5 = mul_f32 r0 r5
27 r5 = mul_f32 r5 r7
28 r5 = mul_f32 r5 r3
29 r5 = round r5
30 r4 = shr_i32 r4 24
31 r4 = to_f32 r4
32 r4 = mul_f32 r0 r4
33 r6 = mad_f32 r4 r7 r6
34 r6 = mul_f32 r6 r3
35 r6 = round r6
36 r6 = pack r5 r6 8
37 r6 = pack r8 r6 16
38 store32 arg(1) r6
G8 over A8
11 values (originally 15):
↑ v0 = splat 3F800000 (1)
↑ v1 = splat 0 (0)
↑ v2 = splat 3B808081 (0.0039215689)
v3 = load8 arg(1)
v4 = to_f32 v3
v5 = mul_f32 v2 v4
v6 = mad_f32 v5 v1 v0
↑ v7 = splat 437F0000 (255)
v8 = mul_f32 v6 v7
v9 = round v8
store8 arg(1) v9
5 registers, 11 instructions:
0 r0 = splat 3F800000 (1)
1 r1 = splat 0 (0)
2 r2 = splat 3B808081 (0.0039215689)
3 r3 = splat 437F0000 (255)
loop:
4 r4 = load8 arg(1)
5 r4 = to_f32 r4
6 r4 = mul_f32 r2 r4
7 r4 = mad_f32 r4 r1 r0
8 r4 = mul_f32 r4 r3
9 r4 = round r4
10 store8 arg(1) r4
G8 over G8
19 values (originally 20):
↑ v0 = splat 3B808081 (0.0039215689)
v1 = load8 arg(0)
v2 = to_f32 v1
v3 = mul_f32 v0 v2
v4 = load8 arg(1)
v5 = to_f32 v4
v6 = mul_f32 v0 v5
↑ v7 = splat 0 (0)
v8 = mad_f32 v6 v7 v3
↑ v9 = splat 3E59B3D0 (0.21259999)
↑ v10 = splat 3F371759 (0.71520001)
↑ v11 = splat 3D93DD98 (0.0722)
v12 = mul_f32 v8 v11
v13 = mad_f32 v8 v10 v12
v14 = mad_f32 v8 v9 v13
↑ v15 = splat 437F0000 (255)
v16 = mul_f32 v14 v15
v17 = round v16
store8 arg(1) v17
8 registers, 19 instructions:
0 r0 = splat 3B808081 (0.0039215689)
1 r1 = splat 0 (0)
2 r2 = splat 3E59B3D0 (0.21259999)
3 r3 = splat 3F371759 (0.71520001)
4 r4 = splat 3D93DD98 (0.0722)
5 r5 = splat 437F0000 (255)
loop:
6 r6 = load8 arg(0)
7 r6 = to_f32 r6
8 r6 = mul_f32 r0 r6
9 r7 = load8 arg(1)
10 r7 = to_f32 r7
11 r7 = mul_f32 r0 r7
12 r6 = mad_f32 r7 r1 r6
13 r7 = mul_f32 r6 r4
14 r7 = mad_f32 r6 r3 r7
15 r7 = mad_f32 r6 r2 r7
16 r7 = mul_f32 r7 r5
17 r7 = round r7
18 store8 arg(1) r7
G8 over RGBA_8888
39 values (originally 39):
↑ v0 = splat 3B808081 (0.0039215689)
v1 = load8 arg(0)
v2 = to_f32 v1
v3 = mul_f32 v0 v2
v4 = load32 arg(1)
↑ v5 = splat FF (3.5733111e-43)
v6 = bit_and v5 v4
v7 = to_f32 v6
v8 = mul_f32 v0 v7
↑ v9 = splat 0 (0)
v10 = mad_f32 v8 v9 v3
↑ v11 = splat 437F0000 (255)
v12 = mul_f32 v10 v11
v13 = round v12
v14 = shr_i32 v4 8
v15 = bit_and v5 v14
v16 = to_f32 v15
v17 = mul_f32 v0 v16
v18 = mad_f32 v17 v9 v3
v19 = mul_f32 v18 v11
v20 = round v19
v21 = pack v13 v20 8
v22 = shr_i32 v4 16
v23 = bit_and v5 v22
v24 = to_f32 v23
v25 = mul_f32 v0 v24
v26 = mad_f32 v25 v9 v3
v27 = mul_f32 v26 v11
v28 = round v27
↑ v29 = splat 3F800000 (1)
v30 = shr_i32 v4 24
v31 = to_f32 v30
v32 = mul_f32 v0 v31
v33 = mad_f32 v32 v9 v29
v34 = mul_f32 v33 v11
v35 = round v34
v36 = pack v28 v35 8
v37 = pack v21 v36 16
store32 arg(1) v37
9 registers, 39 instructions:
0 r0 = splat 3B808081 (0.0039215689)
1 r1 = splat FF (3.5733111e-43)
2 r2 = splat 0 (0)
3 r3 = splat 437F0000 (255)
4 r4 = splat 3F800000 (1)
loop:
5 r5 = load8 arg(0)
6 r5 = to_f32 r5
7 r5 = mul_f32 r0 r5
8 r6 = load32 arg(1)
9 r7 = bit_and r1 r6
10 r7 = to_f32 r7
11 r7 = mul_f32 r0 r7
12 r7 = mad_f32 r7 r2 r5
13 r7 = mul_f32 r7 r3
14 r7 = round r7
15 r8 = shr_i32 r6 8
16 r8 = bit_and r1 r8
17 r8 = to_f32 r8
18 r8 = mul_f32 r0 r8
19 r8 = mad_f32 r8 r2 r5
20 r8 = mul_f32 r8 r3
21 r8 = round r8
22 r8 = pack r7 r8 8
23 r7 = shr_i32 r6 16
24 r7 = bit_and r1 r7
25 r7 = to_f32 r7
26 r7 = mul_f32 r0 r7
27 r5 = mad_f32 r7 r2 r5
28 r5 = mul_f32 r5 r3
29 r5 = round r5
30 r6 = shr_i32 r6 24
31 r6 = to_f32 r6
32 r6 = mul_f32 r0 r6
33 r6 = mad_f32 r6 r2 r4
34 r6 = mul_f32 r6 r3
35 r6 = round r6
36 r6 = pack r5 r6 8
37 r6 = pack r8 r6 16
38 store32 arg(1) r6
RGBA_8888 over A8
15 values (originally 31):
↑ v0 = splat 3B808081 (0.0039215689)
v1 = load32 arg(0)
v2 = shr_i32 v1 24
v3 = to_f32 v2
v4 = mul_f32 v0 v3
v5 = load8 arg(1)
v6 = to_f32 v5
v7 = mul_f32 v0 v6
↑ v8 = splat 3F800000 (1)
v9 = sub_f32 v8 v4
v10 = mad_f32 v7 v9 v4
↑ v11 = splat 437F0000 (255)
v12 = mul_f32 v10 v11
v13 = round v12
store8 arg(1) v13
6 registers, 15 instructions:
0 r0 = splat 3B808081 (0.0039215689)
1 r1 = splat 3F800000 (1)
2 r2 = splat 437F0000 (255)
loop:
3 r3 = load32 arg(0)
4 r3 = shr_i32 r3 24
5 r3 = to_f32 r3
6 r3 = mul_f32 r0 r3
7 r4 = load8 arg(1)
8 r4 = to_f32 r4
9 r4 = mul_f32 r0 r4
10 r5 = sub_f32 r1 r3
11 r3 = mad_f32 r4 r5 r3
12 r3 = mul_f32 r3 r2
13 r3 = round r3
14 store8 arg(1) r3
RGBA_8888 over G8
35 values (originally 36):
↑ v0 = splat 3B808081 (0.0039215689)
v1 = load32 arg(0)
↑ v2 = splat FF (3.5733111e-43)
v3 = bit_and v2 v1
v4 = to_f32 v3
v5 = mul_f32 v0 v4
v6 = load8 arg(1)
v7 = to_f32 v6
v8 = mul_f32 v0 v7
v9 = shr_i32 v1 24
v10 = to_f32 v9
v11 = mul_f32 v0 v10
↑ v12 = splat 3F800000 (1)
v13 = sub_f32 v12 v11
v14 = mad_f32 v8 v13 v5
↑ v15 = splat 3E59B3D0 (0.21259999)
v16 = shr_i32 v1 8
v17 = bit_and v2 v16
v18 = to_f32 v17
v19 = mul_f32 v0 v18
v20 = mad_f32 v8 v13 v19
↑ v21 = splat 3F371759 (0.71520001)
v22 = shr_i32 v1 16
v23 = bit_and v2 v22
v24 = to_f32 v23
v25 = mul_f32 v0 v24
v26 = mad_f32 v8 v13 v25
↑ v27 = splat 3D93DD98 (0.0722)
v28 = mul_f32 v26 v27
v29 = mad_f32 v20 v21 v28
v30 = mad_f32 v14 v15 v29
↑ v31 = splat 437F0000 (255)
v32 = mul_f32 v30 v31
v33 = round v32
store8 arg(1) v33
12 registers, 35 instructions:
0 r0 = splat 3B808081 (0.0039215689)
1 r1 = splat FF (3.5733111e-43)
2 r2 = splat 3F800000 (1)
3 r3 = splat 3E59B3D0 (0.21259999)
4 r4 = splat 3F371759 (0.71520001)
5 r5 = splat 3D93DD98 (0.0722)
6 r6 = splat 437F0000 (255)
loop:
7 r7 = load32 arg(0)
8 r8 = bit_and r1 r7
9 r8 = to_f32 r8
10 r8 = mul_f32 r0 r8
11 r9 = load8 arg(1)
12 r9 = to_f32 r9
13 r9 = mul_f32 r0 r9
14 r10 = shr_i32 r7 24
15 r10 = to_f32 r10
16 r10 = mul_f32 r0 r10
17 r10 = sub_f32 r2 r10
18 r8 = mad_f32 r9 r10 r8
19 r11 = shr_i32 r7 8
20 r11 = bit_and r1 r11
21 r11 = to_f32 r11
22 r11 = mul_f32 r0 r11
23 r11 = mad_f32 r9 r10 r11
24 r7 = shr_i32 r7 16
25 r7 = bit_and r1 r7
26 r7 = to_f32 r7
27 r7 = mul_f32 r0 r7
28 r7 = mad_f32 r9 r10 r7
29 r7 = mul_f32 r7 r5
30 r7 = mad_f32 r11 r4 r7
31 r7 = mad_f32 r8 r3 r7
32 r7 = mul_f32 r7 r6
33 r7 = round r7
34 store8 arg(1) r7
RGBA_8888 over RGBA_8888
51 values (originally 51):
↑ v0 = splat 3B808081 (0.0039215689)
v1 = load32 arg(0)
↑ v2 = splat FF (3.5733111e-43)
v3 = bit_and v2 v1
v4 = to_f32 v3
v5 = mul_f32 v0 v4
v6 = load32 arg(1)
v7 = bit_and v2 v6
v8 = to_f32 v7
v9 = mul_f32 v0 v8
v10 = shr_i32 v1 24
v11 = to_f32 v10
v12 = mul_f32 v0 v11
↑ v13 = splat 3F800000 (1)
v14 = sub_f32 v13 v12
v15 = mad_f32 v9 v14 v5
↑ v16 = splat 437F0000 (255)
v17 = mul_f32 v15 v16
v18 = round v17
v19 = shr_i32 v1 8
v20 = bit_and v2 v19
v21 = to_f32 v20
v22 = mul_f32 v0 v21
v23 = shr_i32 v6 8
v24 = bit_and v2 v23
v25 = to_f32 v24
v26 = mul_f32 v0 v25
v27 = mad_f32 v26 v14 v22
v28 = mul_f32 v27 v16
v29 = round v28
v30 = pack v18 v29 8
v31 = shr_i32 v1 16
v32 = bit_and v2 v31
v33 = to_f32 v32
v34 = mul_f32 v0 v33
v35 = shr_i32 v6 16
v36 = bit_and v2 v35
v37 = to_f32 v36
v38 = mul_f32 v0 v37
v39 = mad_f32 v38 v14 v34
v40 = mul_f32 v39 v16
v41 = round v40
v42 = shr_i32 v6 24
v43 = to_f32 v42
v44 = mul_f32 v0 v43
v45 = mad_f32 v44 v14 v12
v46 = mul_f32 v45 v16
v47 = round v46
v48 = pack v41 v47 8
v49 = pack v30 v48 16
store32 arg(1) v49
11 registers, 51 instructions:
0 r0 = splat 3B808081 (0.0039215689)
1 r1 = splat FF (3.5733111e-43)
2 r2 = splat 3F800000 (1)
3 r3 = splat 437F0000 (255)
loop:
4 r4 = load32 arg(0)
5 r5 = bit_and r1 r4
6 r5 = to_f32 r5
7 r5 = mul_f32 r0 r5
8 r6 = load32 arg(1)
9 r7 = bit_and r1 r6
10 r7 = to_f32 r7
11 r7 = mul_f32 r0 r7
12 r8 = shr_i32 r4 24
13 r8 = to_f32 r8
14 r8 = mul_f32 r0 r8
15 r9 = sub_f32 r2 r8
16 r5 = mad_f32 r7 r9 r5
17 r5 = mul_f32 r5 r3
18 r5 = round r5
19 r7 = shr_i32 r4 8
20 r7 = bit_and r1 r7
21 r7 = to_f32 r7
22 r7 = mul_f32 r0 r7
23 r10 = shr_i32 r6 8
24 r10 = bit_and r1 r10
25 r10 = to_f32 r10
26 r10 = mul_f32 r0 r10
27 r7 = mad_f32 r10 r9 r7
28 r7 = mul_f32 r7 r3
29 r7 = round r7
30 r7 = pack r5 r7 8
31 r4 = shr_i32 r4 16
32 r4 = bit_and r1 r4
33 r4 = to_f32 r4
34 r4 = mul_f32 r0 r4
35 r5 = shr_i32 r6 16
36 r5 = bit_and r1 r5
37 r5 = to_f32 r5
38 r5 = mul_f32 r0 r5
39 r4 = mad_f32 r5 r9 r4
40 r4 = mul_f32 r4 r3
41 r4 = round r4
42 r6 = shr_i32 r6 24
43 r6 = to_f32 r6
44 r6 = mul_f32 r0 r6
45 r8 = mad_f32 r6 r9 r8
46 r8 = mul_f32 r8 r3
47 r8 = round r8
48 r8 = pack r4 r8 8
49 r8 = pack r7 r8 16
50 store32 arg(1) r8
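As a reading aid, here is a hedged scalar model of the float src-over math the RGBA_8888 over RGBA_8888 program above encodes, one byte channel at a time; the function name and the loop are mine, not part of the expectation.

    #include <cmath>
    #include <cstdint>

    // Per channel: normalize byte to [0,1], compute d' = d*(1-sa) + s
    // (the mad_f32 in the dump), then scale by 255 and round back.
    uint32_t over_8888(uint32_t src, uint32_t dst) {
        const float k = 1 / 255.0f;               // splat 3B808081
        float invA = 1.0f - k * (src >> 24);      // sub_f32: 1 - src alpha
        uint32_t out = 0;
        for (int shift = 0; shift <= 24; shift += 8) {
            float s = k * ((src >> shift) & 0xFF);
            float d = k * ((dst >> shift) & 0xFF);
            float o = d * invA + s;                              // mad_f32
            out |= (uint32_t)std::lroundf(o * 255.0f) << shift;  // round + pack
        }
        return out;
    }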
I32 (Naive) 8888 over 8888
33 values (originally 33):
v0 = load32 arg(0)
↑ v1 = splat FF (3.5733111e-43)
v2 = bit_and v1 v0
v3 = load32 arg(1)
v4 = bit_and v1 v3
v5 = shr_i32 v0 24
↑ v6 = splat 100 (3.5873241e-43)
v7 = sub_i32 v6 v5
v8 = mul_i32 v4 v7
v9 = shr_i32 v8 8
v10 = add_i32 v2 v9
v11 = shr_i32 v0 8
v12 = bit_and v1 v11
v13 = shr_i32 v3 8
v14 = bit_and v1 v13
v15 = mul_i32 v14 v7
v16 = shr_i32 v15 8
v17 = add_i32 v12 v16
v18 = pack v10 v17 8
v19 = shr_i32 v0 16
v20 = bit_and v1 v19
v21 = shr_i32 v3 16
v22 = bit_and v1 v21
v23 = mul_i32 v22 v7
v24 = shr_i32 v23 8
v25 = add_i32 v20 v24
v26 = shr_i32 v3 24
v27 = mul_i32 v26 v7
v28 = shr_i32 v27 8
v29 = add_i32 v5 v28
v30 = pack v25 v29 8
v31 = pack v18 v30 16
store32 arg(1) v31
9 registers, 33 instructions:
0 r0 = splat FF (3.5733111e-43)
1 r1 = splat 100 (3.5873241e-43)
loop:
2 r2 = load32 arg(0)
3 r3 = bit_and r0 r2
4 r4 = load32 arg(1)
5 r5 = bit_and r0 r4
6 r6 = shr_i32 r2 24
7 r7 = sub_i32 r1 r6
8 r5 = mul_i32 r5 r7
9 r5 = shr_i32 r5 8
10 r5 = add_i32 r3 r5
11 r3 = shr_i32 r2 8
12 r3 = bit_and r0 r3
13 r8 = shr_i32 r4 8
14 r8 = bit_and r0 r8
15 r8 = mul_i32 r8 r7
16 r8 = shr_i32 r8 8
17 r8 = add_i32 r3 r8
18 r8 = pack r5 r8 8
19 r2 = shr_i32 r2 16
20 r2 = bit_and r0 r2
21 r5 = shr_i32 r4 16
22 r5 = bit_and r0 r5
23 r5 = mul_i32 r5 r7
24 r5 = shr_i32 r5 8
25 r5 = add_i32 r2 r5
26 r4 = shr_i32 r4 24
27 r7 = mul_i32 r4 r7
28 r7 = shr_i32 r7 8
29 r7 = add_i32 r6 r7
30 r7 = pack r5 r7 8
31 r7 = pack r8 r7 16
32 store32 arg(1) r7
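For comparison with the float version, a hedged scalar model of the naive integer blend above; it approximates d*(1-sa) as d*(256-sa) >> 8 in fixed point (hence the splat 100 constant). Names are mine.

    #include <cstdint>

    uint32_t naive_8888_over(uint32_t src, uint32_t dst) {
        uint32_t invA = 256 - (src >> 24);   // sub_i32 of splat 100 and src alpha
        uint32_t out  = 0;
        for (int shift = 0; shift <= 24; shift += 8) {
            uint32_t s = (src >> shift) & 0xFF;
            uint32_t d = (dst >> shift) & 0xFF;
            // Premultiplied inputs keep the sum within a byte; mask defensively.
            out |= ((s + ((d * invA) >> 8)) & 0xFF) << shift;
        }
        return out;
    }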
I32 8888 over 8888
29 values (originally 29):
v0 = load32 arg(0)
↑ v1 = splat FF (3.5733111e-43)
v2 = bit_and v0 v1
v3 = load32 arg(1)
v4 = bit_and v3 v1
v5 = shr_i32 v0 24
↑ v6 = splat 100 (3.5873241e-43)
v7 = sub_i32 v6 v5
v8 = mul_i16x2 v4 v7
v9 = shr_i32 v8 8
v10 = add_i32 v2 v9
v11 = bytes v0 2
v12 = bytes v3 2
v13 = mul_i16x2 v12 v7
v14 = shr_i32 v13 8
v15 = add_i32 v11 v14
v16 = pack v10 v15 8
v17 = bytes v0 3
v18 = bytes v3 3
v19 = mul_i16x2 v18 v7
v20 = shr_i32 v19 8
v21 = add_i32 v17 v20
v22 = shr_i32 v3 24
v23 = mul_i16x2 v22 v7
v24 = shr_i32 v23 8
v25 = add_i32 v5 v24
v26 = pack v21 v25 8
v27 = pack v16 v26 16
store32 arg(1) v27
9 registers, 29 instructions:
0 r0 = splat FF (3.5733111e-43)
1 r1 = splat 100 (3.5873241e-43)
loop:
2 r2 = load32 arg(0)
3 r3 = bit_and r2 r0
4 r4 = load32 arg(1)
5 r5 = bit_and r4 r0
6 r6 = shr_i32 r2 24
7 r7 = sub_i32 r1 r6
8 r5 = mul_i16x2 r5 r7
9 r5 = shr_i32 r5 8
10 r5 = add_i32 r3 r5
11 r3 = bytes r2 2
12 r8 = bytes r4 2
13 r8 = mul_i16x2 r8 r7
14 r8 = shr_i32 r8 8
15 r8 = add_i32 r3 r8
16 r8 = pack r5 r8 8
17 r2 = bytes r2 3
18 r5 = bytes r4 3
19 r5 = mul_i16x2 r5 r7
20 r5 = shr_i32 r5 8
21 r5 = add_i32 r2 r5
22 r4 = shr_i32 r4 24
23 r7 = mul_i16x2 r4 r7
24 r7 = shr_i32 r7 8
25 r7 = add_i32 r6 r7
26 r7 = pack r5 r7 8
27 r7 = pack r8 r7 16
28 store32 arg(1) r7
I32 (SWAR) 8888 over 8888
15 values (originally 15):
v0 = load32 arg(0)
v1 = bytes v0 404
↑ v2 = splat 1000100 (2.3510604e-38)
v3 = sub_i16x2 v2 v1
v4 = load32 arg(1)
↑ v5 = splat FF00FF (2.3418409e-38)
v6 = bit_and v4 v5
v7 = mul_i16x2 v6 v3
v8 = shr_i16x2 v7 8
v9 = shr_i16x2 v4 8
v10 = mul_i16x2 v9 v3
v11 = bit_clear v10 v5
v12 = bit_or v8 v11
v13 = add_i32 v0 v12
store32 arg(1) v13
6 registers, 15 instructions:
0 r0 = splat 1000100 (2.3510604e-38)
1 r1 = splat FF00FF (2.3418409e-38)
loop:
2 r2 = load32 arg(0)
3 r3 = bytes r2 404
4 r3 = sub_i16x2 r0 r3
5 r4 = load32 arg(1)
6 r5 = bit_and r4 r1
7 r5 = mul_i16x2 r5 r3
8 r5 = shr_i16x2 r5 8
9 r4 = shr_i16x2 r4 8
10 r3 = mul_i16x2 r4 r3
11 r3 = bit_clear r3 r1
12 r3 = bit_or r5 r3
13 r3 = add_i32 r2 r3
14 store32 arg(1) r3
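A hedged scalar model of the SWAR variant above, reading "bytes v0 404" as broadcasting the alpha byte into the low byte of each 16-bit lane (my interpretation of the shuffle code); mul_i16x2 below mimics the lane-wise multiply, and all names are mine.

    #include <cstdint>

    // Multiply each 16-bit lane of x by the matching lane of y, keeping the
    // low 16 bits of each product — a scalar stand-in for mul_i16x2.
    static uint32_t mul_i16x2(uint32_t x, uint32_t y) {
        uint32_t lo = ((x & 0xFFFF) * (y & 0xFFFF)) & 0xFFFF;
        uint32_t hi = ((x >> 16)    * (y >> 16))    & 0xFFFF;
        return lo | (hi << 16);
    }

    // Red/blue and alpha/green are each a pair of 16-bit lanes, so one
    // multiply covers two channels at once — hence only 15 instructions.
    uint32_t swar_8888_over(uint32_t src, uint32_t dst) {
        uint32_t a    = src >> 24;
        uint32_t invA = 0x01000100 - (a | (a << 16));   // (256-a) in both lanes
        uint32_t rb   = (mul_i16x2(dst & 0x00FF00FF, invA) >> 8) & 0x00FF00FF;
        uint32_t ag   =  mul_i16x2((dst >> 8) & 0x00FF00FF, invA) & 0xFF00FF00;
        return src + (rb | ag);                         // the final add_i32
    }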
refactor out a middle representation

Kind of brewing a big refactor here, to give me some room between skvm::Builder and skvm::Program to do optimizations, backend specializations, and analysis.

As a warmup, I'm trying to split up today's Builder::Instruction into two forms: first just what the user requested in Builder (this stays Builder::Instruction), then a new type representing any transformation or analysis we've done to it (OptimizedInstruction).

Roughly six important optimizations happen in SkVM today, in this order:
  1) constant folding
  2) backend-specific instruction specialization
  3) common sub-expression elimination
  4) reordering + dead code elimination
  5) loop invariant and lifetime analysis
  6) register assignment

At head, 1-5 all happen in Builder, and 2 is particularly awkward to have there (e.g. mul_f32 -> mul_f32_imm). 6 happens in Program per-backend, and that seems healthy. As of this CL, 1-3 happen in Builder, 4-5 now on this middle OptimizedInstruction format, and 6 still in Program. I'd like to get to the point where 1 stays in Builder, 2-5 all happen on this middle IR, and 6 stays in Program.

That ought to let me do things like turn mul_f32 -> mul_f32_imm when it's good to, and still benefit from things like common sub-expression elimination and code reordering happening after that transformation. And then, I hope that's also a good spot to do more complicated transformations, like lowering gather8 into gather32 plus some fixup when targeting an x86 JIT but not anywhere else. Today's Builder is too early to know whether we should do this or not, and in Program it's actually kind of awkward to do this sort of thing while also having to do register assignment. Some middle might be right.

Change-Id: I9c00268a084f07fbab88d05eb441f1957a0d7c67
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/269181
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
2020-02-06 19:02:32 +00:00
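As one concrete illustration of step 3 above (common sub-expression elimination), here is a hedged sketch of hash-consing at instruction-push time; the types below are stand-ins of mine, not Skia's real Builder::Instruction.

    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    // Identical (op, args, imm) tuples collapse to one value id, which is
    // CSE for pure, side-effect-free instructions.
    struct Inst {
        int op, x, y, z, imm;
        bool operator==(const Inst& o) const {
            return op == o.op && x == o.x && y == o.y && z == o.z && imm == o.imm;
        }
    };
    struct InstHash {
        size_t operator()(const Inst& i) const {
            size_t h = 0;
            for (int v : {i.op, i.x, i.y, i.z, i.imm}) { h = h*31 + (size_t)v; }
            return h;
        }
    };

    struct Builder {
        std::vector<Inst>                       program;
        std::unordered_map<Inst, int, InstHash> dedup;

        int push(Inst inst) {
            if (auto it = dedup.find(inst); it != dedup.end()) { return it->second; }
            int id = (int)program.size();
            program.push_back(inst);
            dedup[inst] = id;       // duplicates never re-enter the program
            return id;
        }
    };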
add used_in_loop bit to skvm::Builder::Instruction

Most hoisted values are used in the loop body (and that's really the whole point of hoisting), but some are just temporaries that help produce other hoisted values. This used_in_loop bit helps us distinguish the two, and lets us recycle registers holding temporary hoisted values not used in the loop.

The can-we-recycle logic now becomes:
  - is this a real value?
  - is it time for it to die?
  - is it either not hoisted or a hoisted temporary?

The set-death-to-infinity approach for hoisted values is now gone. That worked great for hoisted values used inside the loop, but was too conservative for hoisted temporaries. This lifetime extension was preventing us from recycling those registers, pinning enough registers that we run out and fail to JIT.

Small amounts of refactoring to make this clearer:
  - move the Instruction hash function definition near its operator==
  - rename the two "hoist" variables to "can_hoist" for Instructions and "try_hoisting" for the JIT approach
  - add ↟ to mark hoisted temporaries, _really_ hoisted values

There's some redundancy here between tracking the can_hoist bit, the used_in_loop bit, and lifetime tracking. I think it should be true, for instance, that !can_hoist && !used_in_loop implies an instruction is dead code. I plan to continue refactoring lifetime analysis (in particular, reordering instructions to decrease register pressure), so hopefully by the time I'm done that metadata will shake out a little crisper.

Change-Id: I6460ca96d1cbec0315bed3c9a0774cd88ab5be26
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/248986
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-10-16 15:46:01 +00:00

6 values (originally 6):
↟ v0 = splat 1 (1.4012985e-45)
↟ v1 = splat 2 (2.8025969e-45)
↑ v2 = add_i32 v0 v1
v3 = load32 arg(0)
v4 = mul_i32 v3 v2
store32 arg(0) v4
2 registers, 6 instructions:
0 r0 = splat 1 (1.4012985e-45)
1 r1 = splat 2 (2.8025969e-45)
2 r1 = add_i32 r0 r1
loop:
3 r0 = load32 arg(0)
4 r0 = mul_i32 r0 r1
5 store32 arg(0) r0
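The ↑/↟ markers in the example above pair with the can-we-recycle rule from the note: v0 and v1 are hoisted temporaries (↟), so r0 and r1 are free to be reused inside the loop. A hedged restatement in code, with field names of my own invention:

    // A register can be reused once its value is real, dead, and either
    // lives in the loop body or was only a hoisted temporary (↟).
    struct Value {
        bool real;          // actually produced, not a placeholder
        int  death;         // last instruction index that reads it
        bool can_hoist;     // loop-invariant; may be computed before the loop
        bool used_in_loop;  // some loop-body instruction reads it
    };

    bool can_recycle(const Value& v, int now) {
        return v.real
            && v.death <= now
            && (!v.can_hoist || !v.used_in_loop);  // not hoisted, or a hoisted temporary
    }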
23 values (originally 23):
↑ v0 = splat FF (3.5733111e-43)
v1 = load32 arg(0)
v2 = bit_and v0 v1
v3 = load32 arg(1)
v4 = bit_and v0 v3
v5 = add_i32 v2 v4
v6 = shr_i32 v1 8
v7 = bit_and v0 v6
v8 = shr_i32 v3 8
v9 = bit_and v0 v8
v10 = add_i32 v7 v9
v11 = pack v5 v10 8
v12 = shr_i32 v1 16
v13 = bit_and v0 v12
v14 = shr_i32 v3 16
v15 = bit_and v0 v14
v16 = add_i32 v13 v15
v17 = shr_i32 v1 24
v18 = shr_i32 v3 24
v19 = add_i32 v17 v18
v20 = pack v16 v19 8
v21 = pack v11 v20 16
store32 arg(1) v21
reorder to minimize register pressure

Rewrite program instructions so that each value becomes available as late as possible, just before it's used by another instruction. This reorders blocks of instructions to reduce the number of temporary registers in flight.

Take this example of the sort of program that we naturally write, noting the registers needed as we progress down the right:

    src  = load32 ...      (1)
    sr   = extract src ... (2)
    sg   = extract src ... (3)
    sb   = extract src ... (4)
    sa   = extract src ... (4, src dies)
    dst  = load32 ...      (5)
    dr   = extract dst ... (6)
    dg   = extract dst ... (7)
    db   = extract dst ... (8)
    da   = extract dst ... (8, dst dies)
    r    = add sr dr       (7, sr and dr die)
    g    = add sg dg       (6, sg and dg die)
    b    = add sb db       (5, sb and db die)
    a    = add sa da       (4, sa and da die)
    rg   = pack r g ...    (3, r and g die)
    ba   = pack b a ...    (2, b and a die)
    rgba = pack rg ba ...  (1, rg and ba die)
    store32 rgba ...       (0, rgba dies)

That original ordering of the code needs 8 registers (perhaps with a temporary 9th, but we'll ignore that here). This CL will rewrite the program to something more like this by recursively issuing inputs only once needed:

    src  = load32 ...      (1)
    sr   = extract src ... (2)
    dst  = load32 ...      (3)
    dr   = extract dst ... (4)
    r    = add sr dr       (3, sr and dr die)
    sg   = extract src ... (4)
    dg   = extract dst ... (5)
    g    = add sg dg       (4, sg and dg die)
    rg   = pack r g        (3, r and g die)
    sb   = extract src ... (4)
    db   = extract dst ... (5)
    b    = add sb db       (4, sb and db die)
    sa   = extract src ... (4, src dies)
    da   = extract dst ... (4, dst dies)
    a    = add sa da       (3, sa and da die)
    ba   = pack b a        (2, b and a die)
    rgba = pack rg ba ...  (1, rg and ba die)
    store32 rgba ...       (0)

That trims 3 registers off the example, just by reordering! I've added the real version of this example to SkVMTest.cpp. (Its 6th register comes from holding the 0xff byte mask used by extract, in case you're curious.)

I'll admit it's not exactly easy to work out how this reordering works without a pen and paper or trial and error. I've tried to make the implementation preserve the original program's order as much as makes sense (i.e. when order is an otherwise arbitrary choice) to keep it somewhat sane to follow.

This reordering naturally skips dead code, so pour one out for ☠️. We lose our cute dead code emoji marker, but on the other hand all code downstream of Builder::done() can assume every instruction is live.

Change-Id: Iceffcd10fd7465eae51a39ef8eec7a7189766ba2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/249999
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
2019-10-22 17:27:58 +00:00
6 registers, 23 instructions:
0 r0 = splat FF (3.5733111e-43)
loop:
1 r1 = load32 arg(0)
2 r2 = bit_and r0 r1
3 r3 = load32 arg(1)
4 r4 = bit_and r0 r3
5 r4 = add_i32 r2 r4
6 r2 = shr_i32 r1 8
7 r2 = bit_and r0 r2
8 r5 = shr_i32 r3 8
9 r5 = bit_and r0 r5
10 r5 = add_i32 r2 r5
11 r5 = pack r4 r5 8
12 r4 = shr_i32 r1 16
13 r4 = bit_and r0 r4
14 r2 = shr_i32 r3 16
15 r2 = bit_and r0 r2
16 r2 = add_i32 r4 r2
17 r1 = shr_i32 r1 24
18 r3 = shr_i32 r3 24
19 r3 = add_i32 r1 r3
20 r3 = pack r2 r3 8
21 r3 = pack r5 r3 16
22 store32 arg(1) r3
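A hedged sketch of the as-late-as-possible reordering the note above describes: starting from each side-effecting root (the stores), recursively issue an instruction's inputs just before the instruction itself, memoizing so every value is issued exactly once. All names are illustrative, not Skia's.

    #include <vector>

    struct Inst { int x = -1, y = -1, z = -1; bool has_side_effect = false; };

    struct Scheduler {
        const std::vector<Inst>& in;
        std::vector<int>         order;   // new program order, as old value ids
        std::vector<bool>        issued;

        explicit Scheduler(const std::vector<Inst>& p) : in(p), issued(p.size(), false) {}

        // Depth-first from a root: inputs land immediately before their consumer,
        // so each value becomes available as late as possible.
        void issue(int id) {
            if (id < 0 || issued[id]) { return; }
            issued[id] = true;            // SSA inputs can't cycle back, so this is safe
            issue(in[id].x); issue(in[id].y); issue(in[id].z);
            order.push_back(id);
        }

        std::vector<int> run() {
            for (int id = 0; id < (int)in.size(); id++) {
                if (in[id].has_side_effect) { issue(id); }   // stores drive everything
            }
            return order;                 // dead code is never visited at all
        }
    };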