2019-06-03 21:27:46 +00:00
|
|
|
A8 over A8
|
2019-11-08 20:18:06 +00:00
|
|
|
14 values:
|
reorder to minimize register pressure
Rewrite program instructions so that each value becomes available as
late as possible, just before it's used by another instruction. This
reorders blocks of instructions to reduce the number of temporary
registers in flight.
Take this example of the sort of program that we naturally write,
noting the registers needed as we progress down the right:
    src  = load32 ...        (1)
    sr   = extract src ...   (2)
    sg   = extract src ...   (3)
    sb   = extract src ...   (4)
    sa   = extract src ...   (4, src dies)
    dst  = load32 ...        (5)
    dr   = extract dst ...   (6)
    dg   = extract dst ...   (7)
    db   = extract dst ...   (8)
    da   = extract dst ...   (8, dst dies)
    r    = add sr dr         (7, sr and dr die)
    g    = add sg dg         (6, sg and dg die)
    b    = add sb db         (5, sb and db die)
    a    = add sa da         (4, sa and da die)
    rg   = pack r g ...      (3, r and g die)
    ba   = pack b a ...      (2, b and a die)
    rgba = pack rg ba ...    (1, rg and ba die)
    store32 rgba ...         (0, rgba dies)
That original ordering of the code needs 8 registers (perhaps with a
temporary 9th, but we'll ignore that here). This CL will rewrite the
program to something more like this by recursively issuing inputs only
once needed:
    src  = load32 ...        (1)
    sr   = extract src ...   (2)
    dst  = load32 ...        (3)
    dr   = extract dst ...   (4)
    r    = add sr dr         (3, sr and dr die)
    sg   = extract src ...   (4)
    dg   = extract dst ...   (5)
    g    = add sg dg         (4, sg and dg die)
    rg   = pack r g          (3, r and g die)
    sb   = extract src ...   (4)
    db   = extract dst ...   (5)
    b    = add sb db         (4, sb and db die)
    sa   = extract src ...   (4, src dies)
    da   = extract dst ...   (4, dst dies)
    a    = add sa da         (3, sa and da die)
    ba   = pack b a          (2, b and a die)
    rgba = pack rg ba ...    (1, rg and ba die)
    store32 rgba ...         (0)
That trims 3 registers off the example, just by reordering!
I've added the real version of this example to SkVMTest.cpp.
(Its 6th register comes from holding the 0xff byte mask used
by extract, in case you're curious).
I'll admit it's not exactly easy to work out how this reordering works
without a pen and paper or trial and error. I've tried to make the
implementation preserve the original program's order as much as makes
sense (i.e. when order is an otherwise arbitrary choice) to keep it
somewhat sane to follow.
This reordering naturally skips dead code, so pour one out for ☠️ .
We lose our cute dead code emoji marker, but on the other hand all code
downstream of Builder::done() can assume every instruction is live.
Change-Id: Iceffcd10fd7465eae51a39ef8eec7a7189766ba2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/249999
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
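
The reordering described in the commit message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the CL: the tuple encoding `(name, op, inputs)`, the helper names `reorder` and `peak_registers`, and the rule "a value occupies a register from its definition until its last use" are all hypothetical choices made here to reproduce the counts quoted in the message (8 before, 5 after). The temporary extra register the message mentions is ignored here too.

```python
# Hypothetical encoding of the example program: (name, op, inputs).
# A name of None marks a side-effecting instruction (the store).
ORIGINAL = [
    ("src", "load32", ()),
    ("sr", "extract", ("src",)),
    ("sg", "extract", ("src",)),
    ("sb", "extract", ("src",)),
    ("sa", "extract", ("src",)),
    ("dst", "load32", ()),
    ("dr", "extract", ("dst",)),
    ("dg", "extract", ("dst",)),
    ("db", "extract", ("dst",)),
    ("da", "extract", ("dst",)),
    ("r", "add", ("sr", "dr")),
    ("g", "add", ("sg", "dg")),
    ("b", "add", ("sb", "db")),
    ("a", "add", ("sa", "da")),
    ("rg", "pack", ("r", "g")),
    ("ba", "pack", ("b", "a")),
    ("rgba", "pack", ("rg", "ba")),
    (None, "store32", ("rgba",)),
]


def reorder(program):
    """Issue each instruction's inputs recursively, only once they are needed."""
    by_name = {inst[0]: inst for inst in program if inst[0] is not None}
    issued = set()
    out = []

    def issue(inst):
        name, _, inputs = inst
        if name is not None and name in issued:
            return
        for arg in inputs:          # inputs first, in their written order
            issue(by_name[arg])
        if name is not None:
            issued.add(name)
        out.append(inst)

    # Start only from side-effecting instructions, in original order.
    # Anything not reachable from a store is dead and never issued.
    for inst in program:
        if inst[0] is None:
            issue(inst)
    return out


def peak_registers(program):
    """Largest number of values alive after any single instruction."""
    last_use = {}
    for i, (_, _, inputs) in enumerate(program):
        for arg in inputs:
            last_use[arg] = i
    live, peak = set(), 0
    for i, (name, _, inputs) in enumerate(program):
        if name is not None:
            live.add(name)
        for arg in inputs:
            if last_use[arg] == i:  # this instruction is the final reader
                live.discard(arg)
        peak = max(peak, len(live))
    return peak


print(peak_registers(ORIGINAL), peak_registers(reorder(ORIGINAL)))  # 8 5
```

Walking from the store backwards is what makes dead-code elimination fall out for free in this sketch: an instruction that no store depends on is simply never visited.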
2019-10-22 17:27:58 +00:00
|
|
|
↑ v0 = splat 3B808081 (0.0039215689)
|
2019-07-17 17:09:09 +00:00
|
|
|
v1 = load8 arg(0)
|
2019-10-22 17:27:58 +00:00
|
|
|
v2 = to_f32 v1
|
|
|
|
v3 = mul_f32 v0 v2
|
|
|
|
v4 = load8 arg(1)
|
|
|
|
v5 = to_f32 v4
|
|
|
|
v6 = mul_f32 v0 v5
|
|
|
|
↑ v7 = splat 3F800000 (1)
|
|
|
|
v8 = sub_f32 v7 v3
|
|
|
|
v9 = mad_f32 v6 v8 v3
|
|
|
|
↑ v10 = splat 437F0000 (255)
|
2019-11-08 20:18:06 +00:00
|
|
|
v11 = mul_f32 v9 v10
|
|
|
|
v12 = round v11
|
|
|
|
store8 arg(1) v12
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2019-11-08 20:18:06 +00:00
|
|
|
6 registers, 14 instructions:
|
2019-06-05 15:47:46 +00:00
|
|
|
r0 = splat 3B808081 (0.0039215689)
|
2019-06-03 21:27:46 +00:00
|
|
|
r1 = splat 3F800000 (1)
|
|
|
|
r2 = splat 437F0000 (255)
|
2019-06-05 15:47:46 +00:00
|
|
|
loop:
|
2019-11-08 20:18:06 +00:00
|
|
|
r3 = load8 arg(0)
|
|
|
|
r3 = to_f32 r3
|
|
|
|
r3 = mul_f32 r0 r3
|
|
|
|
r4 = load8 arg(1)
|
2019-11-05 02:14:24 +00:00
|
|
|
r4 = to_f32 r4
|
|
|
|
r4 = mul_f32 r0 r4
|
2019-11-08 20:18:06 +00:00
|
|
|
r5 = sub_f32 r1 r3
|
|
|
|
r3 = mad_f32 r4 r5 r3
|
|
|
|
r3 = mul_f32 r3 r2
|
|
|
|
r3 = round r3
|
|
|
|
store8 arg(1) r3
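
For orientation, the A8 over A8 value listing above reads as a source-over blend of two alpha bytes. The scalar Python below is a sketch of that reading, not Skia code: the function name `a8_over_a8` is hypothetical, it assumes `mad_f32 x y z` means `x*y + z`, it treats arg(0) as the source and arg(1) as the destination, and it uses Python doubles and Python's `round` in place of the float32 lanes and hardware rounding the real program runs on.

```python
def a8_over_a8(src: int, dst: int) -> int:
    """Scalar reading of the 'A8 over A8' program for one pixel."""
    inv255 = 1.0 / 255.0           # v0 = splat 3B808081 (0.0039215689)
    s = inv255 * float(src)        # v1..v3: load8 arg(0), to_f32, mul_f32
    d = inv255 * float(dst)        # v4..v6: load8 arg(1), to_f32, mul_f32
    inv_s = 1.0 - s                # v8 = sub_f32 v7 v3
    out = d * inv_s + s            # v9 = mad_f32 v6 v8 v3
    return round(out * 255.0)      # v11 = mul_f32 v9 v10, v12 = round v11


print(a8_over_a8(128, 128))  # 192
```

A fully opaque source replaces the destination and a fully transparent source leaves it unchanged, which is a quick sanity check on the reading.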
|
2019-06-03 21:27:46 +00:00
|
|
|
|
|
|
|
A8 over G8
|
2019-11-08 20:18:06 +00:00
|
|
|
20 values:
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v0 = splat 3B808081 (0.0039215689)
|
|
|
|
v1 = load8 arg(1)
|
|
|
|
v2 = to_f32 v1
|
|
|
|
v3 = mul_f32 v0 v2
|
|
|
|
v4 = load8 arg(0)
|
|
|
|
v5 = to_f32 v4
|
|
|
|
v6 = mul_f32 v0 v5
|
|
|
|
↑ v7 = splat 3F800000 (1)
|
|
|
|
v8 = sub_f32 v7 v6
|
|
|
|
v9 = mul_f32 v3 v8
|
|
|
|
↑ v10 = splat 3E59B3D0 (0.21259999)
|
|
|
|
↑ v11 = splat 3F371759 (0.71520001)
|
|
|
|
↑ v12 = splat 3D93DD98 (0.0722)
|
|
|
|
v13 = mul_f32 v9 v12
|
|
|
|
v14 = mad_f32 v9 v11 v13
|
|
|
|
v15 = mad_f32 v9 v10 v14
|
|
|
|
↑ v16 = splat 437F0000 (255)
|
2019-11-08 20:18:06 +00:00
|
|
|
v17 = mul_f32 v15 v16
|
|
|
|
v18 = round v17
|
|
|
|
store8 arg(1) v18
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2019-11-08 20:18:06 +00:00
|
|
|
8 registers, 20 instructions:
|
2019-06-05 15:47:46 +00:00
|
|
|
r0 = splat 3B808081 (0.0039215689)
|
2019-06-03 21:32:49 +00:00
|
|
|
r1 = splat 3F800000 (1)
|
|
|
|
r2 = splat 3E59B3D0 (0.21259999)
|
2019-06-05 15:47:46 +00:00
|
|
|
r3 = splat 3F371759 (0.71520001)
|
|
|
|
r4 = splat 3D93DD98 (0.0722)
|
|
|
|
r5 = splat 437F0000 (255)
|
|
|
|
loop:
|
2019-11-08 20:18:06 +00:00
|
|
|
r6 = load8 arg(1)
|
|
|
|
r6 = to_f32 r6
|
|
|
|
r6 = mul_f32 r0 r6
|
|
|
|
r7 = load8 arg(0)
|
2019-11-05 02:14:24 +00:00
|
|
|
r7 = to_f32 r7
|
|
|
|
r7 = mul_f32 r0 r7
|
2019-11-08 20:18:06 +00:00
|
|
|
r7 = sub_f32 r1 r7
|
|
|
|
r7 = mul_f32 r6 r7
|
|
|
|
r6 = mul_f32 r7 r4
|
|
|
|
r6 = mad_f32 r7 r3 r6
|
|
|
|
r6 = mad_f32 r7 r2 r6
|
|
|
|
r6 = mul_f32 r6 r5
|
|
|
|
r6 = round r6
|
|
|
|
store8 arg(1) r6
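
The three extra splats in the A8 over G8 listing (0.2126, 0.7152, 0.0722) look like luma weights applied when the blended color is written back to a gray destination. The sketch below is one reading of the listing, with assumptions flagged: the name `a8_over_g8` is hypothetical, the source is taken to contribute alpha only, so all three color channels equal `dst * (1 - alpha)`, and Python doubles stand in for float32.

```python
LUMA_R, LUMA_G, LUMA_B = 0.2126, 0.7152, 0.0722  # v10, v11, v12


def a8_over_g8(src_alpha: int, dst_gray: int) -> int:
    """Scalar reading of the 'A8 over G8' program for one pixel."""
    inv255 = 1.0 / 255.0
    g = inv255 * float(dst_gray)       # v1..v3: load8 arg(1)
    a = inv255 * float(src_alpha)      # v4..v6: load8 arg(0)
    c = g * (1.0 - a)                  # v8, v9: r = g = b = c in this reading
    luma = c * LUMA_B                  # v13 = mul_f32 v9 v12
    luma = c * LUMA_G + luma           # v14 = mad_f32 v9 v11 v13
    luma = c * LUMA_R + luma           # v15 = mad_f32 v9 v10 v14
    return round(luma * 255.0)         # v17, v18


print(a8_over_g8(0, 200))  # 200
```

Because the weights sum to 1, a gray input comes back as (almost exactly) the same gray, which is why the value program can reuse `v9` for all three channels.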
|
2019-06-03 21:27:46 +00:00
|
|
|
|
|
|
|
A8 over RGBA_8888
|
2019-11-08 20:18:06 +00:00
|
|
|
37 values:
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v0 = splat 3B808081 (0.0039215689)
|
|
|
|
v1 = load32 arg(1)
|
|
|
|
↑ v2 = splat FF (3.5733111e-43)
|
|
|
|
v3 = extract v1 0 v2
|
|
|
|
v4 = to_f32 v3
|
|
|
|
v5 = mul_f32 v0 v4
|
|
|
|
v6 = load8 arg(0)
|
|
|
|
v7 = to_f32 v6
|
|
|
|
v8 = mul_f32 v0 v7
|
|
|
|
↑ v9 = splat 3F800000 (1)
|
|
|
|
v10 = sub_f32 v9 v8
|
|
|
|
v11 = mul_f32 v5 v10
|
|
|
|
↑ v12 = splat 437F0000 (255)
|
2019-11-08 20:18:06 +00:00
|
|
|
v13 = mul_f32 v11 v12
|
|
|
|
v14 = round v13
|
|
|
|
v15 = extract v1 8 v2
|
|
|
|
v16 = to_f32 v15
|
|
|
|
v17 = mul_f32 v0 v16
|
|
|
|
v18 = mul_f32 v17 v10
|
|
|
|
v19 = mul_f32 v18 v12
|
|
|
|
v20 = round v19
|
|
|
|
v21 = pack v14 v20 8
|
|
|
|
v22 = extract v1 16 v2
|
|
|
|
v23 = to_f32 v22
|
|
|
|
v24 = mul_f32 v0 v23
|
|
|
|
v25 = mul_f32 v24 v10
|
|
|
|
v26 = mul_f32 v25 v12
|
|
|
|
v27 = round v26
|
|
|
|
v28 = extract v1 24 v2
|
|
|
|
v29 = to_f32 v28
|
|
|
|
v30 = mul_f32 v0 v29
|
|
|
|
v31 = mad_f32 v30 v10 v8
|
|
|
|
v32 = mul_f32 v31 v12
|
|
|
|
v33 = round v32
|
|
|
|
v34 = pack v27 v33 8
|
|
|
|
v35 = pack v21 v34 16
|
|
|
|
store32 arg(1) v35
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2019-11-08 20:18:06 +00:00
|
|
|
9 registers, 37 instructions:
|
2019-06-05 15:47:46 +00:00
|
|
|
r0 = splat 3B808081 (0.0039215689)
|
2019-06-07 20:44:26 +00:00
|
|
|
r1 = splat FF (3.5733111e-43)
|
|
|
|
r2 = splat 3F800000 (1)
|
|
|
|
r3 = splat 437F0000 (255)
|
2019-06-05 15:47:46 +00:00
|
|
|
loop:
|
2019-11-08 20:18:06 +00:00
|
|
|
r4 = load32 arg(1)
|
|
|
|
r5 = extract r4 0 r1
|
|
|
|
r5 = to_f32 r5
|
|
|
|
r5 = mul_f32 r0 r5
|
|
|
|
r6 = load8 arg(0)
|
2019-11-05 02:14:24 +00:00
|
|
|
r6 = to_f32 r6
|
|
|
|
r6 = mul_f32 r0 r6
|
2019-11-08 20:18:06 +00:00
|
|
|
r7 = sub_f32 r2 r6
|
|
|
|
r5 = mul_f32 r5 r7
|
|
|
|
r5 = mul_f32 r5 r3
|
|
|
|
r5 = round r5
|
|
|
|
r8 = extract r4 8 r1
|
|
|
|
r8 = to_f32 r8
|
|
|
|
r8 = mul_f32 r0 r8
|
|
|
|
r8 = mul_f32 r8 r7
|
|
|
|
r8 = mul_f32 r8 r3
|
|
|
|
r8 = round r8
|
|
|
|
r8 = pack r5 r8 8
|
|
|
|
r5 = extract r4 16 r1
|
2019-11-05 02:14:24 +00:00
|
|
|
r5 = to_f32 r5
|
|
|
|
r5 = mul_f32 r0 r5
|
2019-11-08 20:18:06 +00:00
|
|
|
r5 = mul_f32 r5 r7
|
|
|
|
r5 = mul_f32 r5 r3
|
|
|
|
r5 = round r5
|
|
|
|
r4 = extract r4 24 r1
|
|
|
|
r4 = to_f32 r4
|
|
|
|
r4 = mul_f32 r0 r4
|
|
|
|
r6 = mad_f32 r4 r7 r6
|
|
|
|
r6 = mul_f32 r6 r3
|
|
|
|
r6 = round r6
|
|
|
|
r6 = pack r5 r6 8
|
|
|
|
r6 = pack r8 r6 16
|
|
|
|
store32 arg(1) r6
|
2019-06-03 21:27:46 +00:00
|
|
|
|
|
|
|
G8 over A8
|
2019-11-08 20:18:06 +00:00
|
|
|
11 values:
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v0 = splat 3F800000 (1)
|
2019-07-22 18:44:54 +00:00
|
|
|
↑ v1 = splat 3B808081 (0.0039215689)
|
2019-10-22 17:27:58 +00:00
|
|
|
v2 = load8 arg(1)
|
|
|
|
v3 = to_f32 v2
|
|
|
|
v4 = mul_f32 v1 v3
|
|
|
|
↑ v5 = sub_f32 v0 v0
|
|
|
|
v6 = mad_f32 v4 v5 v0
|
|
|
|
↑ v7 = splat 437F0000 (255)
|
2019-11-08 20:18:06 +00:00
|
|
|
v8 = mul_f32 v6 v7
|
|
|
|
v9 = round v8
|
|
|
|
store8 arg(1) v9
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2019-11-08 20:18:06 +00:00
|
|
|
5 registers, 11 instructions:
|
2019-10-22 17:27:58 +00:00
|
|
|
r0 = splat 3F800000 (1)
|
|
|
|
r1 = splat 3B808081 (0.0039215689)
|
|
|
|
r2 = sub_f32 r0 r0
|
2019-06-05 15:47:46 +00:00
|
|
|
r3 = splat 437F0000 (255)
|
|
|
|
loop:
|
2019-11-08 20:18:06 +00:00
|
|
|
r4 = load8 arg(1)
|
|
|
|
r4 = to_f32 r4
|
|
|
|
r4 = mul_f32 r1 r4
|
|
|
|
r4 = mad_f32 r4 r2 r0
|
|
|
|
r4 = mul_f32 r4 r3
|
|
|
|
r4 = round r4
|
|
|
|
store8 arg(1) r4
|
2019-06-03 21:27:46 +00:00
|
|
|
|
|
|
|
G8 over G8
|
2019-11-08 20:18:06 +00:00
|
|
|
20 values:
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v0 = splat 3B808081 (0.0039215689)
|
|
|
|
v1 = load8 arg(0)
|
|
|
|
v2 = to_f32 v1
|
|
|
|
v3 = mul_f32 v0 v2
|
|
|
|
v4 = load8 arg(1)
|
|
|
|
v5 = to_f32 v4
|
|
|
|
v6 = mul_f32 v0 v5
|
|
|
|
↟ v7 = splat 3F800000 (1)
|
|
|
|
↑ v8 = sub_f32 v7 v7
|
|
|
|
v9 = mad_f32 v6 v8 v3
|
|
|
|
↑ v10 = splat 3E59B3D0 (0.21259999)
|
|
|
|
↑ v11 = splat 3F371759 (0.71520001)
|
|
|
|
↑ v12 = splat 3D93DD98 (0.0722)
|
|
|
|
v13 = mul_f32 v9 v12
|
|
|
|
v14 = mad_f32 v9 v11 v13
|
|
|
|
v15 = mad_f32 v9 v10 v14
|
|
|
|
↑ v16 = splat 437F0000 (255)
|
2019-11-08 20:18:06 +00:00
|
|
|
v17 = mul_f32 v15 v16
|
|
|
|
v18 = round v17
|
|
|
|
store8 arg(1) v18
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2019-11-08 20:18:06 +00:00
|
|
|
8 registers, 20 instructions:
|
2019-06-05 15:47:46 +00:00
|
|
|
r0 = splat 3B808081 (0.0039215689)
|
|
|
|
r1 = splat 3F800000 (1)
|
2019-08-08 19:13:25 +00:00
|
|
|
r1 = sub_f32 r1 r1
|
|
|
|
r2 = splat 3E59B3D0 (0.21259999)
|
|
|
|
r3 = splat 3F371759 (0.71520001)
|
|
|
|
r4 = splat 3D93DD98 (0.0722)
|
|
|
|
r5 = splat 437F0000 (255)
|
2019-06-05 15:47:46 +00:00
|
|
|
loop:
|
2019-11-08 20:18:06 +00:00
|
|
|
r6 = load8 arg(0)
|
|
|
|
r6 = to_f32 r6
|
|
|
|
r6 = mul_f32 r0 r6
|
|
|
|
r7 = load8 arg(1)
|
2019-11-05 02:14:24 +00:00
|
|
|
r7 = to_f32 r7
|
|
|
|
r7 = mul_f32 r0 r7
|
2019-11-08 20:18:06 +00:00
|
|
|
r6 = mad_f32 r7 r1 r6
|
|
|
|
r7 = mul_f32 r6 r4
|
|
|
|
r7 = mad_f32 r6 r3 r7
|
|
|
|
r7 = mad_f32 r6 r2 r7
|
|
|
|
r7 = mul_f32 r7 r5
|
|
|
|
r7 = round r7
|
|
|
|
store8 arg(1) r7
|
2019-06-03 21:27:46 +00:00
|
|
|
|
|
|
|
G8 over RGBA_8888
|
2019-11-08 20:18:06 +00:00
|
|
|
37 values:
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v0 = splat 3B808081 (0.0039215689)
|
|
|
|
v1 = load8 arg(0)
|
|
|
|
v2 = to_f32 v1
|
|
|
|
v3 = mul_f32 v0 v2
|
|
|
|
v4 = load32 arg(1)
|
|
|
|
↑ v5 = splat FF (3.5733111e-43)
|
|
|
|
v6 = extract v4 0 v5
|
|
|
|
v7 = to_f32 v6
|
|
|
|
v8 = mul_f32 v0 v7
|
|
|
|
↑ v9 = splat 3F800000 (1)
|
|
|
|
↑ v10 = sub_f32 v9 v9
|
|
|
|
v11 = mad_f32 v8 v10 v3
|
|
|
|
↑ v12 = splat 437F0000 (255)
|
2019-11-08 20:18:06 +00:00
|
|
|
v13 = mul_f32 v11 v12
|
|
|
|
v14 = round v13
|
|
|
|
v15 = extract v4 8 v5
|
|
|
|
v16 = to_f32 v15
|
|
|
|
v17 = mul_f32 v0 v16
|
|
|
|
v18 = mad_f32 v17 v10 v3
|
|
|
|
v19 = mul_f32 v18 v12
|
|
|
|
v20 = round v19
|
|
|
|
v21 = pack v14 v20 8
|
|
|
|
v22 = extract v4 16 v5
|
|
|
|
v23 = to_f32 v22
|
|
|
|
v24 = mul_f32 v0 v23
|
|
|
|
v25 = mad_f32 v24 v10 v3
|
|
|
|
v26 = mul_f32 v25 v12
|
|
|
|
v27 = round v26
|
|
|
|
v28 = extract v4 24 v5
|
|
|
|
v29 = to_f32 v28
|
|
|
|
v30 = mul_f32 v0 v29
|
|
|
|
v31 = mad_f32 v30 v10 v9
|
|
|
|
v32 = mul_f32 v31 v12
|
|
|
|
v33 = round v32
|
|
|
|
v34 = pack v27 v33 8
|
|
|
|
v35 = pack v21 v34 16
|
|
|
|
store32 arg(1) v35
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2019-11-08 20:18:06 +00:00
|
|
|
9 registers, 37 instructions:
|
2019-06-05 15:47:46 +00:00
|
|
|
r0 = splat 3B808081 (0.0039215689)
|
2019-10-22 17:27:58 +00:00
|
|
|
r1 = splat FF (3.5733111e-43)
|
|
|
|
r2 = splat 3F800000 (1)
|
|
|
|
r3 = sub_f32 r2 r2
|
2019-06-07 20:44:26 +00:00
|
|
|
r4 = splat 437F0000 (255)
|
2019-06-05 15:47:46 +00:00
|
|
|
loop:
|
2019-11-08 20:18:06 +00:00
|
|
|
r5 = load8 arg(0)
|
|
|
|
r5 = to_f32 r5
|
|
|
|
r5 = mul_f32 r0 r5
|
|
|
|
r6 = load32 arg(1)
|
|
|
|
r7 = extract r6 0 r1
|
|
|
|
r7 = to_f32 r7
|
|
|
|
r7 = mul_f32 r0 r7
|
|
|
|
r7 = mad_f32 r7 r3 r5
|
|
|
|
r7 = mul_f32 r7 r4
|
|
|
|
r7 = round r7
|
|
|
|
r8 = extract r6 8 r1
|
2019-11-05 02:14:24 +00:00
|
|
|
r8 = to_f32 r8
|
|
|
|
r8 = mul_f32 r0 r8
|
2019-11-08 20:18:06 +00:00
|
|
|
r8 = mad_f32 r8 r3 r5
|
|
|
|
r8 = mul_f32 r8 r4
|
|
|
|
r8 = round r8
|
|
|
|
r8 = pack r7 r8 8
|
|
|
|
r7 = extract r6 16 r1
|
2019-11-05 02:14:24 +00:00
|
|
|
r7 = to_f32 r7
|
|
|
|
r7 = mul_f32 r0 r7
|
2019-11-08 20:18:06 +00:00
|
|
|
r5 = mad_f32 r7 r3 r5
|
|
|
|
r5 = mul_f32 r5 r4
|
|
|
|
r5 = round r5
|
|
|
|
r6 = extract r6 24 r1
|
|
|
|
r6 = to_f32 r6
|
|
|
|
r6 = mul_f32 r0 r6
|
|
|
|
r6 = mad_f32 r6 r3 r2
|
|
|
|
r6 = mul_f32 r6 r4
|
|
|
|
r6 = round r6
|
|
|
|
r6 = pack r5 r6 8
|
|
|
|
r6 = pack r8 r6 16
|
|
|
|
store32 arg(1) r6
|
2019-06-03 21:27:46 +00:00
|
|
|
|
|
|
|
RGBA_8888 over A8
|
2019-11-08 20:18:06 +00:00
|
|
|
16 values:
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v0 = splat 3B808081 (0.0039215689)
|
|
|
|
v1 = load32 arg(0)
|
|
|
|
↑ v2 = splat FF (3.5733111e-43)
|
|
|
|
v3 = extract v1 24 v2
|
|
|
|
v4 = to_f32 v3
|
|
|
|
v5 = mul_f32 v0 v4
|
|
|
|
v6 = load8 arg(1)
|
|
|
|
v7 = to_f32 v6
|
|
|
|
v8 = mul_f32 v0 v7
|
|
|
|
↑ v9 = splat 3F800000 (1)
|
|
|
|
v10 = sub_f32 v9 v5
|
|
|
|
v11 = mad_f32 v8 v10 v5
|
|
|
|
↑ v12 = splat 437F0000 (255)
|
2019-11-08 20:18:06 +00:00
|
|
|
v13 = mul_f32 v11 v12
|
|
|
|
v14 = round v13
|
|
|
|
store8 arg(1) v14
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2019-11-08 20:18:06 +00:00
|
|
|
7 registers, 16 instructions:
|
2019-10-22 17:27:58 +00:00
|
|
|
r0 = splat 3B808081 (0.0039215689)
|
|
|
|
r1 = splat FF (3.5733111e-43)
|
2019-06-07 20:44:26 +00:00
|
|
|
r2 = splat 3F800000 (1)
|
|
|
|
r3 = splat 437F0000 (255)
|
2019-06-05 15:47:46 +00:00
|
|
|
loop:
|
2019-11-08 20:18:06 +00:00
|
|
|
r4 = load32 arg(0)
|
|
|
|
r4 = extract r4 24 r1
|
|
|
|
r4 = to_f32 r4
|
|
|
|
r4 = mul_f32 r0 r4
|
|
|
|
r5 = load8 arg(1)
|
2019-11-05 02:14:24 +00:00
|
|
|
r5 = to_f32 r5
|
|
|
|
r5 = mul_f32 r0 r5
|
2019-11-08 20:18:06 +00:00
|
|
|
r6 = sub_f32 r2 r4
|
|
|
|
r4 = mad_f32 r5 r6 r4
|
|
|
|
r4 = mul_f32 r4 r3
|
|
|
|
r4 = round r4
|
|
|
|
store8 arg(1) r4
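The step from the `v` listing to the `r` listing for this program can be sketched as a small register assignment pass. This is an illustration under assumptions, not the Skia allocator: hoisted values (marked ↑) are given permanent registers first, each loop value takes the most recently freed register if one exists, and an input's register is freed at its last use. The name `assign_registers` and the input encoding are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def assign_registers(values: List[List[int]],
                     hoisted: Set[int]) -> Tuple[Dict[int, int], int]:
    """values[i] lists the value ids read by instruction i, in program order.
    Returns (value id -> register, number of registers used)."""
    last_use: Dict[int, int] = {}
    for i, args in enumerate(values):
        for a in args:
            last_use[a] = i

    reg: Dict[int, int] = {}
    next_reg = 0
    for v in sorted(hoisted):               # loop-invariant values keep their
        reg[v] = next_reg                   # register for the whole loop
        next_reg += 1

    free: List[int] = []                    # stack of registers free to reuse
    for i, args in enumerate(values):
        if i in hoisted:
            continue
        for a in args:
            if a not in hoisted and last_use[a] == i and reg[a] not in free:
                free.append(reg[a])         # input dies here
        if i in last_use:                   # instruction produces a used value
            if free:
                reg[i] = free.pop()
            else:
                reg[i] = next_reg
                next_reg += 1
    return reg, next_reg

# "RGBA_8888 over A8": inputs of v0..v14, followed by the store8.
OVER_A8 = [
    [],          # v0  = splat 1/255   (hoisted)
    [],          # v1  = load32 arg(0)
    [],          # v2  = splat FF      (hoisted)
    [1, 2],      # v3  = extract v1 24 v2
    [3],         # v4  = to_f32 v3
    [0, 4],      # v5  = mul_f32 v0 v4
    [],          # v6  = load8 arg(1)
    [6],         # v7  = to_f32 v6
    [0, 7],      # v8  = mul_f32 v0 v7
    [],          # v9  = splat 1       (hoisted)
    [9, 5],      # v10 = sub_f32 v9 v5
    [8, 10, 5],  # v11 = mad_f32 v8 v10 v5
    [],          # v12 = splat 255     (hoisted)
    [11, 12],    # v13 = mul_f32 v11 v12
    [13],        # v14 = round v13
    [14],        # store8 arg(1) v14
]

if __name__ == "__main__":
    regs, count = assign_registers(OVER_A8, hoisted={0, 2, 9, 12})
    print(count, "registers")
    for v in sorted(regs):
        print(f"v{v} -> r{regs[v]}")
```

For this particular program the sketch lands on the same register numbers as the listing above (for example `v10` in `r6` and `v11` back in `r4`); that agreement is a property of this input, not a claim about how the real allocator breaks ties in general.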
|
2019-06-05 15:47:46 +00:00
|
|
|
|
|
|
|
RGBA_8888 over G8
|
2019-11-08 20:18:06 +00:00
|
|
|
33 values:
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v0 = splat 3B808081 (0.0039215689)
|
|
|
|
v1 = load32 arg(0)
|
|
|
|
↑ v2 = splat FF (3.5733111e-43)
|
|
|
|
v3 = extract v1 0 v2
|
|
|
|
v4 = to_f32 v3
|
|
|
|
v5 = mul_f32 v0 v4
|
|
|
|
v6 = load8 arg(1)
|
2019-07-17 17:09:09 +00:00
|
|
|
v7 = to_f32 v6
|
2019-10-22 17:27:58 +00:00
|
|
|
v8 = mul_f32 v0 v7
|
|
|
|
v9 = extract v1 24 v2
|
2019-07-17 17:09:09 +00:00
|
|
|
v10 = to_f32 v9
|
2019-10-22 17:27:58 +00:00
|
|
|
v11 = mul_f32 v0 v10
|
|
|
|
↑ v12 = splat 3F800000 (1)
|
|
|
|
v13 = sub_f32 v12 v11
|
|
|
|
v14 = mad_f32 v8 v13 v5
|
|
|
|
↑ v15 = splat 3E59B3D0 (0.21259999)
|
|
|
|
v16 = extract v1 8 v2
|
|
|
|
v17 = to_f32 v16
|
|
|
|
v18 = mul_f32 v0 v17
|
|
|
|
v19 = mad_f32 v8 v13 v18
|
|
|
|
↑ v20 = splat 3F371759 (0.71520001)
|
|
|
|
v21 = extract v1 16 v2
|
|
|
|
v22 = to_f32 v21
|
|
|
|
v23 = mul_f32 v0 v22
|
|
|
|
v24 = mad_f32 v8 v13 v23
|
|
|
|
↑ v25 = splat 3D93DD98 (0.0722)
|
|
|
|
v26 = mul_f32 v24 v25
|
|
|
|
v27 = mad_f32 v19 v20 v26
|
|
|
|
v28 = mad_f32 v14 v15 v27
|
|
|
|
↑ v29 = splat 437F0000 (255)
|
2019-11-08 20:18:06 +00:00
|
|
|
v30 = mul_f32 v28 v29
|
|
|
|
v31 = round v30
|
|
|
|
store8 arg(1) v31
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2019-11-08 20:18:06 +00:00
|
|
|
12 registers, 33 instructions:
|
2019-10-22 17:27:58 +00:00
|
|
|
r0 = splat 3B808081 (0.0039215689)
|
|
|
|
r1 = splat FF (3.5733111e-43)
|
2019-06-07 20:44:26 +00:00
|
|
|
r2 = splat 3F800000 (1)
|
|
|
|
r3 = splat 3E59B3D0 (0.21259999)
|
|
|
|
r4 = splat 3F371759 (0.71520001)
|
|
|
|
r5 = splat 3D93DD98 (0.0722)
|
|
|
|
r6 = splat 437F0000 (255)
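Each `splat` prints a 32-bit pattern in hex followed by the same bits read as a float. A short sketch of that reading, using only the standard `struct` module (the helper name `bits_to_f32` is hypothetical):

```python
import struct

def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit integer pattern as an IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Constants that appear in the listings, with the role each seems to play.
SPLATS = {
    0x3B808081: "1/255, scales a byte to [0, 1]",
    0x000000FF: "byte mask for extract (tiny subnormal when read as float)",
    0x3F800000: "1.0",
    0x3E59B3D0: "luma weight for red",
    0x3F371759: "luma weight for green",
    0x3D93DD98: "luma weight for blue",
    0x437F0000: "255.0, scales [0, 1] back to a byte",
}

if __name__ == "__main__":
    for bits, role in SPLATS.items():
        print(f"{bits:X} ({bits_to_f32(bits):.8g})  # {role}")
```

The roles in the comments are inferred from how the values are used in the programs; the dump itself only gives the bits and the float.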
|
2019-06-05 15:47:46 +00:00
|
|
|
loop:
|
2019-11-08 20:18:06 +00:00
|
|
|
r7 = load32 arg(0)
|
|
|
|
r8 = extract r7 0 r1
|
|
|
|
r8 = to_f32 r8
|
|
|
|
r8 = mul_f32 r0 r8
|
|
|
|
r9 = load8 arg(1)
|
2019-11-05 02:14:24 +00:00
|
|
|
r9 = to_f32 r9
|
|
|
|
r9 = mul_f32 r0 r9
|
2019-11-08 20:18:06 +00:00
|
|
|
r10 = extract r7 24 r1
|
2019-11-05 02:14:24 +00:00
|
|
|
r10 = to_f32 r10
|
|
|
|
r10 = mul_f32 r0 r10
|
2019-11-08 20:18:06 +00:00
|
|
|
r10 = sub_f32 r2 r10
|
|
|
|
r8 = mad_f32 r9 r10 r8
|
|
|
|
r11 = extract r7 8 r1
|
2019-11-05 02:14:24 +00:00
|
|
|
r11 = to_f32 r11
|
|
|
|
r11 = mul_f32 r0 r11
|
2019-11-08 20:18:06 +00:00
|
|
|
r11 = mad_f32 r9 r10 r11
|
|
|
|
r7 = extract r7 16 r1
|
|
|
|
r7 = to_f32 r7
|
|
|
|
r7 = mul_f32 r0 r7
|
|
|
|
r7 = mad_f32 r9 r10 r7
|
|
|
|
r7 = mul_f32 r7 r5
|
|
|
|
r7 = mad_f32 r11 r4 r7
|
|
|
|
r7 = mad_f32 r8 r3 r7
|
|
|
|
r7 = mul_f32 r7 r6
|
|
|
|
r7 = round r7
|
|
|
|
store8 arg(1) r7
|
2019-06-03 21:27:46 +00:00
|
|
|
|
|
|
|
RGBA_8888 over RGBA_8888
|
2019-11-08 20:18:06 +00:00
|
|
|
47 values:
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v0 = splat 3B808081 (0.0039215689)
|
|
|
|
v1 = load32 arg(0)
|
|
|
|
↑ v2 = splat FF (3.5733111e-43)
|
|
|
|
v3 = extract v1 0 v2
|
|
|
|
v4 = to_f32 v3
|
|
|
|
v5 = mul_f32 v0 v4
|
|
|
|
v6 = load32 arg(1)
|
|
|
|
v7 = extract v6 0 v2
|
|
|
|
v8 = to_f32 v7
|
|
|
|
v9 = mul_f32 v0 v8
|
|
|
|
v10 = extract v1 24 v2
|
|
|
|
v11 = to_f32 v10
|
|
|
|
v12 = mul_f32 v0 v11
|
|
|
|
↑ v13 = splat 3F800000 (1)
|
|
|
|
v14 = sub_f32 v13 v12
|
|
|
|
v15 = mad_f32 v9 v14 v5
|
|
|
|
↑ v16 = splat 437F0000 (255)
|
2019-11-08 20:18:06 +00:00
|
|
|
v17 = mul_f32 v15 v16
|
|
|
|
v18 = round v17
|
|
|
|
v19 = extract v1 8 v2
|
|
|
|
v20 = to_f32 v19
|
|
|
|
v21 = mul_f32 v0 v20
|
|
|
|
v22 = extract v6 8 v2
|
|
|
|
v23 = to_f32 v22
|
|
|
|
v24 = mul_f32 v0 v23
|
|
|
|
v25 = mad_f32 v24 v14 v21
|
|
|
|
v26 = mul_f32 v25 v16
|
|
|
|
v27 = round v26
|
|
|
|
v28 = pack v18 v27 8
|
|
|
|
v29 = extract v1 16 v2
|
|
|
|
v30 = to_f32 v29
|
|
|
|
v31 = mul_f32 v0 v30
|
|
|
|
v32 = extract v6 16 v2
|
|
|
|
v33 = to_f32 v32
|
|
|
|
v34 = mul_f32 v0 v33
|
|
|
|
v35 = mad_f32 v34 v14 v31
|
|
|
|
v36 = mul_f32 v35 v16
|
|
|
|
v37 = round v36
|
|
|
|
v38 = extract v6 24 v2
|
|
|
|
v39 = to_f32 v38
|
|
|
|
v40 = mul_f32 v0 v39
|
|
|
|
v41 = mad_f32 v40 v14 v12
|
|
|
|
v42 = mul_f32 v41 v16
|
|
|
|
v43 = round v42
|
|
|
|
v44 = pack v37 v43 8
|
|
|
|
v45 = pack v28 v44 16
|
|
|
|
store32 arg(1) v45
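Read per pixel, the value listing above computes a source-over blend on each byte channel: unpack to a float in [0, 1], evaluate `dst * (1 - src_alpha) + src` (the `mad_f32`), scale by 255, round, and repack. A scalar sketch under assumptions: the function name `over_8888` is hypothetical, Python floats are doubles rather than `f32`, and Python's `round` stands in for the program's `round` op.

```python
def over_8888(src: int, dst: int) -> int:
    """Blend one premultiplied 8888 pixel over another, channel by channel."""
    inv255 = 1.0 / 255.0                         # splat 3B808081
    sa = ((src >> 24) & 0xFF) * inv255           # extract src 24, to_f32, mul
    inv_sa = 1.0 - sa                            # sub_f32 1 sa
    out = 0
    for shift in (0, 8, 16, 24):
        s = ((src >> shift) & 0xFF) * inv255
        d = ((dst >> shift) & 0xFF) * inv255
        blended = d * inv_sa + s                 # mad_f32 d (1 - sa) s
        out |= round(blended * 255.0) << shift   # mul by 255, round, pack
    return out

if __name__ == "__main__":
    print(f"{over_8888(0x80402010, 0xFF808080):08X}")
```

Like the listing, the sketch does not clamp, so it assumes premultiplied inputs whose blended channels stay within a byte.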
|
2019-06-10 17:05:48 +00:00
|
|
|
|
2019-11-08 20:18:06 +00:00
|
|
|
11 registers, 47 instructions:
|
2019-10-22 17:27:58 +00:00
|
|
|
r0 = splat 3B808081 (0.0039215689)
|
|
|
|
r1 = splat FF (3.5733111e-43)
|
2019-06-07 20:44:26 +00:00
|
|
|
r2 = splat 3F800000 (1)
|
|
|
|
r3 = splat 437F0000 (255)
|
2019-06-05 15:47:46 +00:00
|
|
|
loop:
|
2019-11-08 20:18:06 +00:00
|
|
|
r4 = load32 arg(0)
|
|
|
|
r5 = extract r4 0 r1
|
|
|
|
r5 = to_f32 r5
|
|
|
|
r5 = mul_f32 r0 r5
|
|
|
|
r6 = load32 arg(1)
|
|
|
|
r7 = extract r6 0 r1
|
|
|
|
r7 = to_f32 r7
|
|
|
|
r7 = mul_f32 r0 r7
|
|
|
|
r8 = extract r4 24 r1
|
2019-11-05 02:14:24 +00:00
|
|
|
r8 = to_f32 r8
|
|
|
|
r8 = mul_f32 r0 r8
|
2019-11-08 20:18:06 +00:00
|
|
|
r9 = sub_f32 r2 r8
|
|
|
|
r5 = mad_f32 r7 r9 r5
|
|
|
|
r5 = mul_f32 r5 r3
|
|
|
|
r5 = round r5
|
|
|
|
r7 = extract r4 8 r1
|
|
|
|
r7 = to_f32 r7
|
|
|
|
r7 = mul_f32 r0 r7
|
|
|
|
r10 = extract r6 8 r1
|
|
|
|
r10 = to_f32 r10
|
|
|
|
r10 = mul_f32 r0 r10
|
|
|
|
r7 = mad_f32 r10 r9 r7
|
|
|
|
r7 = mul_f32 r7 r3
|
|
|
|
r7 = round r7
|
|
|
|
r7 = pack r5 r7 8
|
|
|
|
r4 = extract r4 16 r1
|
|
|
|
r4 = to_f32 r4
|
|
|
|
r4 = mul_f32 r0 r4
|
|
|
|
r5 = extract r6 16 r1
|
2019-11-05 02:14:24 +00:00
|
|
|
r5 = to_f32 r5
|
|
|
|
r5 = mul_f32 r0 r5
|
2019-11-08 20:18:06 +00:00
|
|
|
r4 = mad_f32 r5 r9 r4
|
|
|
|
r4 = mul_f32 r4 r3
|
|
|
|
r4 = round r4
|
|
|
|
r6 = extract r6 24 r1
|
2019-11-05 02:14:24 +00:00
|
|
|
r6 = to_f32 r6
|
|
|
|
r6 = mul_f32 r0 r6
|
2019-11-08 20:18:06 +00:00
|
|
|
r8 = mad_f32 r6 r9 r8
|
|
|
|
r8 = mul_f32 r8 r3
|
|
|
|
r8 = round r8
|
|
|
|
r8 = pack r4 r8 8
|
|
|
|
r8 = pack r7 r8 16
|
|
|
|
store32 arg(1) r8
|
2019-06-05 15:47:46 +00:00
|
|
|
|
2019-06-20 16:37:10 +00:00
|
|
|
I32 (Naive) 8888 over 8888
|
2019-07-02 20:39:23 +00:00
|
|
|
29 values:
|
2019-07-17 17:09:09 +00:00
|
|
|
v0 = load32 arg(0)
|
2019-07-22 18:44:54 +00:00
|
|
|
↑ v1 = splat FF (3.5733111e-43)
|
2019-07-17 17:09:09 +00:00
|
|
|
v2 = extract v0 0 v1
|
2019-10-22 17:27:58 +00:00
|
|
|
v3 = load32 arg(1)
|
|
|
|
v4 = extract v3 0 v1
|
2019-07-17 17:09:09 +00:00
|
|
|
v5 = extract v0 24 v1
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v6 = splat 100 (3.5873241e-43)
|
|
|
|
v7 = sub_i32 v6 v5
|
|
|
|
v8 = mul_i32 v4 v7
|
|
|
|
v9 = shr_i32 v8 8
|
|
|
|
v10 = add_i32 v2 v9
|
|
|
|
v11 = extract v0 8 v1
|
|
|
|
v12 = extract v3 8 v1
|
|
|
|
v13 = mul_i32 v12 v7
|
2019-07-25 19:32:19 +00:00
|
|
|
v14 = shr_i32 v13 8
|
2019-10-22 17:27:58 +00:00
|
|
|
v15 = add_i32 v11 v14
|
|
|
|
v16 = pack v10 v15 8
|
|
|
|
v17 = extract v0 16 v1
|
|
|
|
v18 = extract v3 16 v1
|
|
|
|
v19 = mul_i32 v18 v7
|
2019-07-25 19:32:19 +00:00
|
|
|
v20 = shr_i32 v19 8
|
2019-10-22 17:27:58 +00:00
|
|
|
v21 = add_i32 v17 v20
|
|
|
|
v22 = extract v3 24 v1
|
|
|
|
v23 = mul_i32 v22 v7
|
|
|
|
v24 = shr_i32 v23 8
|
|
|
|
v25 = add_i32 v5 v24
|
|
|
|
v26 = pack v21 v25 8
|
|
|
|
v27 = pack v16 v26 16
|
2019-07-17 17:09:09 +00:00
|
|
|
store32 arg(1) v27
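Read as scalar arithmetic, the 29-value listing above computes a per-channel "src + dst * (256 - src_alpha) >> 8" blend on packed 8888 pixels. The sketch below is one reading of that listing, assuming `extract x bits mask` means `(x >> bits) & mask` and `pack x y bits` means `x | (y << bits)`; the function names are illustrative and the code processes a single pixel rather than a vector.

```python
def extract(x, bits, mask):
    # Assumed meaning of `extract x bits mask` in the listing.
    return (x >> bits) & mask


def pack(x, y, bits):
    # Assumed meaning of `pack x y bits` in the listing.
    return x | (y << bits)


def over_8888(src, dst):
    """One pixel of the naive I32 program: v0 is src, v3 is dst."""
    mask = 0xFF                               # v1 = splat FF
    inv_a = 0x100 - extract(src, 24, mask)    # v7 = sub_i32 v6 v5

    def channel(bits):
        # e.g. v10 = add_i32 v2 (shr_i32 (mul_i32 v4 v7) 8)
        return extract(src, bits, mask) + ((extract(dst, bits, mask) * inv_a) >> 8)

    rg = pack(channel(0), channel(8), 8)      # v16
    ba = pack(channel(16), channel(24), 8)    # v26
    return pack(rg, ba, 16)                   # v27, stored to arg(1)


if __name__ == "__main__":
    print(hex(over_8888(0x00000000, 0x12345678)))
```

With a fully transparent source the destination passes through unchanged, and with source alpha 255 the scaled destination term shifts down to zero, so the source replaces it.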
|
2019-07-02 20:39:23 +00:00
|
|
|
|
2019-10-22 17:27:58 +00:00
|
|
|
9 registers, 29 instructions:
|
2019-06-20 16:37:10 +00:00
|
|
|
r0 = splat FF (3.5733111e-43)
|
|
|
|
r1 = splat 100 (3.5873241e-43)
|
|
|
|
loop:
|
2019-11-05 02:14:24 +00:00
|
|
|
r2 = load32 arg(0)
|
|
|
|
r3 = extract r2 0 r0
|
|
|
|
r4 = load32 arg(1)
|
|
|
|
r5 = extract r4 0 r0
|
|
|
|
r6 = extract r2 24 r0
|
|
|
|
r7 = sub_i32 r1 r6
|
|
|
|
r5 = mul_i32 r5 r7
|
|
|
|
r5 = shr_i32 r5 8
|
|
|
|
r5 = add_i32 r3 r5
|
|
|
|
r3 = extract r2 8 r0
|
|
|
|
r8 = extract r4 8 r0
|
|
|
|
r8 = mul_i32 r8 r7
|
|
|
|
r8 = shr_i32 r8 8
|
|
|
|
r8 = add_i32 r3 r8
|
|
|
|
r8 = pack r5 r8 8
|
|
|
|
r2 = extract r2 16 r0
|
|
|
|
r5 = extract r4 16 r0
|
|
|
|
r5 = mul_i32 r5 r7
|
|
|
|
r5 = shr_i32 r5 8
|
|
|
|
r5 = add_i32 r2 r5
|
|
|
|
r4 = extract r4 24 r0
|
|
|
|
r7 = mul_i32 r4 r7
|
|
|
|
r7 = shr_i32 r7 8
|
|
|
|
r7 = add_i32 r6 r7
|
|
|
|
r7 = pack r5 r7 8
|
|
|
|
r7 = pack r8 r7 16
|
|
|
|
store32 arg(1) r7
|
2019-06-20 16:37:10 +00:00
|
|
|
|
2019-06-05 15:47:46 +00:00
|
|
|
I32 8888 over 8888
|
2019-07-02 20:39:23 +00:00
|
|
|
29 values:
|
2019-07-17 17:09:09 +00:00
|
|
|
v0 = load32 arg(0)
|
2019-07-22 18:44:54 +00:00
|
|
|
↑ v1 = splat FF (3.5733111e-43)
|
2019-07-17 17:09:09 +00:00
|
|
|
v2 = bit_and v0 v1
|
2019-10-22 17:27:58 +00:00
|
|
|
v3 = load32 arg(1)
|
|
|
|
v4 = bit_and v3 v1
|
2019-07-25 19:32:19 +00:00
|
|
|
v5 = shr_i32 v0 24
|
2019-10-22 17:27:58 +00:00
|
|
|
↑ v6 = splat 100 (3.5873241e-43)
|
|
|
|
v7 = sub_i32 v6 v5
|
|
|
|
v8 = mul_i16x2 v4 v7
|
|
|
|
v9 = shr_i32 v8 8
|
|
|
|
v10 = add_i32 v2 v9
|
|
|
|
v11 = bytes v0 2
|
|
|
|
v12 = bytes v3 2
|
|
|
|
v13 = mul_i16x2 v12 v7
|
2019-07-25 19:32:19 +00:00
|
|
|
v14 = shr_i32 v13 8
|
2019-10-22 17:27:58 +00:00
|
|
|
v15 = add_i32 v11 v14
|
|
|
|
v16 = pack v10 v15 8
|
|
|
|
v17 = bytes v0 3
|
|
|
|
v18 = bytes v3 3
|
|
|
|
v19 = mul_i16x2 v18 v7
|
2019-07-25 19:32:19 +00:00
|
|
|
v20 = shr_i32 v19 8
|
2019-10-22 17:27:58 +00:00
|
|
|
v21 = add_i32 v17 v20
|
|
|
|
v22 = shr_i32 v3 24
|
|
|
|
v23 = mul_i16x2 v22 v7
|
|
|
|
v24 = shr_i32 v23 8
|
|
|
|
v25 = add_i32 v5 v24
|
|
|
|
v26 = pack v21 v25 8
|
|
|
|
v27 = pack v16 v26 16
|
2019-07-17 17:09:09 +00:00
|
|
|
store32 arg(1) v27
|
2019-07-02 20:39:23 +00:00
|
|
|
|
2019-10-22 17:27:58 +00:00
|
|
|
9 registers, 29 instructions:
|
2019-06-05 15:47:46 +00:00
|
|
|
r0 = splat FF (3.5733111e-43)
|
2019-06-13 15:56:51 +00:00
|
|
|
r1 = splat 100 (3.5873241e-43)
|
2019-06-05 15:47:46 +00:00
|
|
|
loop:
|
2019-11-05 02:14:24 +00:00
|
|
|
r2 = load32 arg(0)
|
|
|
|
r3 = bit_and r2 r0
|
|
|
|
r4 = load32 arg(1)
|
|
|
|
r5 = bit_and r4 r0
|
|
|
|
r6 = shr_i32 r2 24
|
|
|
|
r7 = sub_i32 r1 r6
|
|
|
|
r5 = mul_i16x2 r5 r7
|
|
|
|
r5 = shr_i32 r5 8
|
|
|
|
r5 = add_i32 r3 r5
|
|
|
|
r3 = bytes r2 2
|
|
|
|
r8 = bytes r4 2
|
|
|
|
r8 = mul_i16x2 r8 r7
|
|
|
|
r8 = shr_i32 r8 8
|
|
|
|
r8 = add_i32 r3 r8
|
|
|
|
r8 = pack r5 r8 8
|
|
|
|
r2 = bytes r2 3
|
|
|
|
r5 = bytes r4 3
|
|
|
|
r5 = mul_i16x2 r5 r7
|
|
|
|
r5 = shr_i32 r5 8
|
|
|
|
r5 = add_i32 r2 r5
|
|
|
|
r4 = shr_i32 r4 24
|
|
|
|
r7 = mul_i16x2 r4 r7
|
|
|
|
r7 = shr_i32 r7 8
|
|
|
|
r7 = add_i32 r6 r7
|
|
|
|
r7 = pack r5 r7 8
|
|
|
|
r7 = pack r8 r7 16
|
|
|
|
store32 arg(1) r7
|
2019-06-03 22:10:59 +00:00
|
|
|
|
|
|
|
I32 (SWAR) 8888 over 8888
|
2019-07-02 20:39:23 +00:00
|
|
|
15 values:
|
2019-07-17 17:09:09 +00:00
|
|
|
v0 = load32 arg(0)
|
|
|
|
v1 = bytes v0 404
|
2019-07-22 18:44:54 +00:00
|
|
|
↑ v2 = splat 1000100 (2.3510604e-38)
|
2019-07-17 17:09:09 +00:00
|
|
|
v3 = sub_i16x2 v2 v1
|
|
|
|
v4 = load32 arg(1)
|
2019-07-22 18:44:54 +00:00
|
|
|
↑ v5 = splat FF00FF (2.3418409e-38)
|
2019-07-17 17:09:09 +00:00
|
|
|
v6 = bit_and v4 v5
|
v7 = mul_i16x2 v6 v3
v8 = shr_i16x2 v7 8
v9 = shr_i16x2 v4 8
v10 = mul_i16x2 v9 v3
v11 = bit_clear v10 v5
v12 = bit_or v8 v11
v13 = add_i32 v0 v12
store32 arg(1) v13

6 registers, 15 instructions:
r0 = splat 1000100 (2.3510604e-38)
r1 = splat FF00FF (2.3418409e-38)
loop:
r2 = load32 arg(0)
r3 = bytes r2 404
r3 = sub_i16x2 r0 r3
r4 = load32 arg(1)
r5 = bit_and r4 r1
r5 = mul_i16x2 r5 r3
r5 = shr_i16x2 r5 8
r4 = shr_i16x2 r4 8
r3 = mul_i16x2 r4 r3
r3 = bit_clear r3 r1
r3 = bit_or r5 r3
r3 = add_i32 r2 r3
store32 arg(1) r3
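
To make the SWAR program above easier to follow, here is a minimal Python sketch that evaluates the same 15 values for a single pixel. The op semantics are assumptions, not taken from this file: `bytes` is read as a byte shuffle driven by one control nibble per output byte (0 selects zero, 1 to 4 select input bytes 0 to 3), the `_i16x2` ops act independently on the two 16-bit halves, and `bit_clear x y` is `x & ~y`. The helper names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def lanes(x):
    """Split a 32-bit word into its (low, high) 16-bit lanes."""
    return x & 0xFFFF, (x >> 16) & 0xFFFF

def join(lo, hi):
    """Rebuild a 32-bit word from two 16-bit lanes, wrapping each lane."""
    return (lo & 0xFFFF) | ((hi & 0xFFFF) << 16)

def sub_i16x2(x, y):
    (xl, xh), (yl, yh) = lanes(x), lanes(y)
    return join(xl - yl, xh - yh)

def mul_i16x2(x, y):
    (xl, xh), (yl, yh) = lanes(x), lanes(y)
    return join(xl * yl, xh * yh)

def shr_i16x2(x, bits):
    lo, hi = lanes(x)
    return join(lo >> bits, hi >> bits)

def bytes_op(x, control):
    """Assumed semantics of `bytes`: nibble i of control picks output byte i."""
    table = [0] + [(x >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        out |= table[(control >> (4 * i)) & 0xF] << (8 * i)
    return out

def srcover_swar(src, dst):
    v1 = bytes_op(src, 0x404)            # alpha copied into both lanes
    v3 = sub_i16x2(0x01000100, v1)       # 256 - alpha in both lanes
    v6 = dst & 0x00FF00FF                # red and blue, one per lane
    v8 = shr_i16x2(mul_i16x2(v6, v3), 8)
    v9 = shr_i16x2(dst, 8)               # green and alpha, one per lane
    v11 = mul_i16x2(v9, v3) & ~0x00FF00FF & MASK32   # bit_clear
    v12 = v8 | v11
    return (src + v12) & MASK32          # add_i32 wraps at 32 bits
```

Under these assumptions an opaque source replaces the destination and a fully transparent source leaves it unchanged, which is what a source-over blend should do.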

6 values:
↟ v0 = splat 1 (1.4012985e-45)
↟ v1 = splat 2 (2.8025969e-45)
↑ v2 = add_i32 v0 v1
v3 = load32 arg(0)
v4 = mul_i32 v3 v2
store32 arg(0) v4

2 registers, 6 instructions:
r0 = splat 1 (1.4012985e-45)
r1 = splat 2 (2.8025969e-45)
r1 = add_i32 r0 r1
loop:
r0 = load32 arg(0)
r0 = mul_i32 r0 r1
store32 arg(0) r0
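
The arrows in the 6-value listing above appear to mark loop-invariant values that end up before `loop:` in the register program. The following Python sketch reproduces those marks under two assumptions that this file does not state: a value can be hoisted when it does not touch memory and all of its inputs can be hoisted, and ↟ is used instead of ↑ when no instruction inside the loop reads the value directly. The tuple layout and the name `out` for the store are made up for illustration.

```python
# Each instruction is (result name, op, input names).
PROGRAM = [
    ("v0", "splat", ()),
    ("v1", "splat", ()),
    ("v2", "add_i32", ("v0", "v1")),
    ("v3", "load32", ()),
    ("v4", "mul_i32", ("v3", "v2")),
    ("out", "store32", ("v4",)),
]

MEMORY_OPS = {"load32", "store32"}   # assumed to pin an instruction in the loop

HOISTED_USED_IN_LOOP = "\u2191"      # the single arrow
HOISTED_ONLY = "\u219f"              # the two-headed arrow
IN_LOOP = " "

def hoist_marks(program):
    """Return {name: mark} for every instruction in program order."""
    hoisted = {}
    for name, op, args in program:
        # Inputs always precede their users, so hoisted[a] is already known.
        hoisted[name] = op not in MEMORY_OPS and all(hoisted[a] for a in args)

    # Values read directly by an instruction that stays inside the loop.
    used_in_loop = set()
    for name, op, args in program:
        if not hoisted[name]:
            used_in_loop.update(args)

    marks = {}
    for name, op, args in program:
        if not hoisted[name]:
            marks[name] = IN_LOOP
        elif name in used_in_loop:
            marks[name] = HOISTED_USED_IN_LOOP
        else:
            marks[name] = HOISTED_ONLY
    return marks
```

With this reading, `v0` and `v1` are hoisted but only feed `v2`, while `v2` is hoisted and read by `v4` inside the loop, matching the three instructions that precede `loop:` in the 2-register program.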

19 values:
↑ v0 = splat FF (3.5733111e-43)
v1 = load32 arg(0)
v2 = extract v1 0 v0
v3 = load32 arg(1)
v4 = extract v3 0 v0
v5 = add_i32 v2 v4
v6 = extract v1 8 v0
v7 = extract v3 8 v0
v8 = add_i32 v6 v7
v9 = pack v5 v8 8
v10 = extract v1 16 v0
v11 = extract v3 16 v0
v12 = add_i32 v10 v11
v13 = extract v1 24 v0
v14 = extract v3 24 v0
v15 = add_i32 v13 v14
v16 = pack v12 v15 8
v17 = pack v9 v16 16
store32 arg(1) v17

6 registers, 19 instructions:
r0 = splat FF (3.5733111e-43)
loop:
r1 = load32 arg(0)
r2 = extract r1 0 r0
r3 = load32 arg(1)
r4 = extract r3 0 r0
r4 = add_i32 r2 r4
r2 = extract r1 8 r0
r5 = extract r3 8 r0
r5 = add_i32 r2 r5
r5 = pack r4 r5 8
r4 = extract r1 16 r0
r2 = extract r3 16 r0
r2 = add_i32 r4 r2
r1 = extract r1 24 r0
r3 = extract r3 24 r0
r3 = add_i32 r1 r3
r3 = pack r2 r3 8
r3 = pack r5 r3 16
store32 arg(1) r3
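
The listing above is the real version of the example in the "reorder to minimize register pressure" commit message. As a rough illustration of that message's idea of recursively issuing inputs only once needed, here is a minimal Python sketch, not Skia's implementation. It assumes a program is a list of (name, op, inputs) tuples, walks depth-first from each store, and counts live values the way the commit message does; the name `out` for the store and both function names are made up.

```python
# The "naturally written" order from the commit message's example.
NATURAL = [
    ("src", "load32", ()),
    ("sr", "extract", ("src",)),
    ("sg", "extract", ("src",)),
    ("sb", "extract", ("src",)),
    ("sa", "extract", ("src",)),
    ("dst", "load32", ()),
    ("dr", "extract", ("dst",)),
    ("dg", "extract", ("dst",)),
    ("db", "extract", ("dst",)),
    ("da", "extract", ("dst",)),
    ("r", "add", ("sr", "dr")),
    ("g", "add", ("sg", "dg")),
    ("b", "add", ("sb", "db")),
    ("a", "add", ("sa", "da")),
    ("rg", "pack", ("r", "g")),
    ("ba", "pack", ("b", "a")),
    ("rgba", "pack", ("rg", "ba")),
    ("out", "store32", ("rgba",)),
]

SIDE_EFFECTS = {"store32"}

def reorder(program):
    """Emit each instruction just before its first user, starting from stores."""
    by_name = {inst[0]: inst for inst in program}
    emitted = set()
    out = []

    def emit(inst):
        if inst[0] in emitted:
            return
        for arg in inst[2]:          # issue inputs first, in argument order
            emit(by_name[arg])
        emitted.add(inst[0])
        out.append(inst)

    for inst in program:             # anything no store depends on is never emitted
        if inst[1] in SIDE_EFFECTS:
            emit(inst)
    return out

def peak_registers(program):
    """Largest number of values that are live at the same time."""
    last_use = {}
    for i, (_, _, args) in enumerate(program):
        for arg in args:
            last_use[arg] = i
    live, peak = set(), 0
    for i, (name, _, args) in enumerate(program):
        for arg in args:
            if last_use[arg] == i:   # this input dies here and frees its register
                live.discard(arg)
        if name in last_use:         # only values with a later use need a register
            live.add(name)
        peak = max(peak, len(live))
    return peak
```

On this example `peak_registers(NATURAL)` is 8 and `peak_registers(reorder(NATURAL))` is 5, the same counts the commit message gives. The real program needs a sixth register for the hoisted `splat FF` mask, which the sketch leaves out.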