q14 rethink

I've been thinking and rethinking and rethinking how best to use 16-bit values like Q14 fixed-point in SkVM. Here are some ways:

   A) don't... just use 32-bit values instead
   B) use 16x2-bit pairs to match the narrower 32-bit lane count
   C) double-pump 32-bit values to match the wider 16-bit lane count
   D) use native 16- and 32-bit values and let the backends sort it out

A) is how things work today, and C) is how SkRasterPipeline's lowp mode works. Having given both B) and C) a good fair shake, I found them both already awkward to work with after writing just a few functions. I would not give up on them entirely, but they're no longer my favorites.

D) is subtle and my new favorite. It's easiest to program with SkVM when the values we're holding represent single values and the backend handles any parallelism for us. That suggests we add a simple 16-bit Q14 type to the existing 32-bit I32 and F32 types, which can be explicitly converted to and from them as normal, but not freely bit-punned as a no-op. D) says we shouldn't have to choose between A)-C) up front... each backend can handle it itself.

Under strategy D), it's entirely the backend's job to decide how to represent each value, and how to vectorize them. We don't need to know as a user, and the backends can use the program itself to inform how they vectorize. 16-bit values could live in xmm registers and 32-bit values in ymm, or the 16-bit values could go in the low half of a ymm, or the even lanes of a ymm, or a full ymm with two used for 32-bit values, etc. etc. This is all a backend choice, not something we should have to know about when writing a program using Q14/I32/F32.

My next steps are to get Q14 operations tested and plumbed through the JIT again, and to build out a blitter and a few effects using Q14 color channels. Then, independently, we can look at each backend and how to vectorize them. Some ideas:

   1) keep running at current vectorization, with half rate 16-bit ops
   2) pump up to 2x wider vectorization unconditionally to favor 16-bit
   3) pump up to 2x wider vectorization only when any 16-bit op is used

These choices can be made independently for each backend (JIT, LLVM, interp), and I wouldn't be surprised to find that we'll want to do them differently. For instance, the interpreter is already running at 32x vectorization... it might be that pumping it higher won't help anything.

Change-Id: Ib8ad2b1bf790e8c4e3acfb4818d4032f7628e8f8
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/319321
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Mike Reed <reed@google.com>
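
The message distinguishes converting a value to Q14 from reinterpreting its bits. The following is a minimal standalone sketch of that distinction for a single scalar lane, not SkVM code: the names q14_t, q14_from_float, q14_to_float, q14_from_bits, q14_to_bits and q14_self_check are hypothetical, and the sign extension in q14_to_bits is an assumption, since the diff only documents the 0x4000 <-> 0x00004000 case.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of one Q14 lane: a signed 16-bit integer with
// 14 fractional bits, so 1.0 is stored as 0x4000.
using q14_t = int16_t;

// Value conversion, float -> Q14: scale by 2^14 and truncate toward zero,
// mirroring to_q14(trunc(x * 16384.0f)) in the diff.
// Assumes the scaled value fits in 16 bits; out-of-range input is not handled.
inline q14_t q14_from_float(float x) {
    return static_cast<q14_t>(static_cast<int32_t>(x * 16384.0f));
}

// Value conversion, Q14 -> float: scale by 2^-14.
inline float q14_to_float(q14_t q) {
    return static_cast<float>(q) * (1 / 16384.0f);
}

// Bit-preserving conversions between Q14 and a 32-bit integer.
// Assumption: widening sign-extends; the source only shows a positive example.
inline q14_t   q14_from_bits(int32_t bits) { return static_cast<q14_t>(bits); }
inline int32_t q14_to_bits(q14_t q)        { return q; }

// A few spot checks of the model.
inline void q14_self_check() {
    assert(q14_from_float(1.0f)  ==  0x4000);  // value conversion
    assert(q14_from_float(-1.0f) == -0x4000);
    assert(q14_from_bits(0x4000) ==  0x4000);  // bits preserved, same lane value
    assert(q14_to_bits(0x4000)   ==  0x00004000);
    assert(q14_to_float(0x2000)  ==  0.5f);
    assert(q14_from_float(0.00005f) == 0);     // truncation, not rounding
}
```

In this model, 1.0f and the integer 0x4000 produce the same Q14 lane through two different routes, which is the difference between the F32 and I32 overloads of to_q14 in the diff.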
2020-09-24 11:17:22 -05:00
commit 6b72d3a03b
parent 18c043ee75
5 changed files with 376 additions and 319 deletions
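
In the diff, the JIT lowers mul_q14 to vpmulhrsw followed by vpaddw of the result with itself, and uavg_q14 to vpavgw. Here is a scalar sketch of what one 16-bit lane computes under those instructions. The helper names (mulhrs, mul_q14_model, uavg_q14_model, q14_math_self_check) are hypothetical, and the code assumes that right-shifting a negative integer is arithmetic and that narrowing to int16_t wraps, which holds on mainstream compilers and is guaranteed from C++20.

```cpp
#include <cassert>
#include <cstdint>

// One lane of x86 vpmulhrsw: 32-bit product, shift right by 14, add 1,
// shift right by 1, keep the low 16 bits.
inline int16_t mulhrs(int16_t a, int16_t b) {
    int32_t p = static_cast<int32_t>(a) * static_cast<int32_t>(b);
    return static_cast<int16_t>(((p >> 14) + 1) >> 1);
}

// Model of the JIT's mul_q14: vpmulhrsw yields a rounded (a*b) >> 15,
// and the following vpaddw doubles it (the "<< 1" in the diff), giving a
// Q14 product whose low bit is always zero.
inline int16_t mul_q14_model(int16_t a, int16_t b) {
    uint16_t r = static_cast<uint16_t>(mulhrs(a, b));
    return static_cast<int16_t>(static_cast<uint16_t>(r << 1));  // wrapping add
}

// Model of uavg_q14 (vpavgw): unsigned rounding average, (x + y + 1) >> 1.
inline uint16_t uavg_q14_model(uint16_t x, uint16_t y) {
    return static_cast<uint16_t>((static_cast<uint32_t>(x) + y + 1) >> 1);
}

// A few spot checks of the model.
inline void q14_math_self_check() {
    assert(mul_q14_model(0x4000, 0x4000)  ==  0x4000);  //  1.0 * 1.0 =  1.0
    assert(mul_q14_model(0x2000, 0x2000)  ==  0x1000);  //  0.5 * 0.5 =  0.25
    assert(mul_q14_model(-0x4000, 0x2000) == -0x2000);  // -1.0 * 0.5 = -0.5
    assert(uavg_q14_model(0x4000, 0x2000) ==  0x3000);  // avg(1.0, 0.5) = 0.75
    assert(uavg_q14_model(1, 2) == 2);                  // rounds up
}
```

Because the final step doubles the vpmulhrsw result, this lowering keeps 13 fractional bits of precision in the product rather than 14.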
--- a/src/core/SkVM.cpp
+++ b/src/core/SkVM.cpp
@@ -286,7 +286,8 @@ namespace skvm {
            case Op::uniform16: write(o, V{id}, "=", op, Arg{immy}, Hex{immz}, fs(id)...); break;
            case Op::uniform32: write(o, V{id}, "=", op, Arg{immy}, Hex{immz}, fs(id)...); break;

-            case Op::splat:  write(o, V{id}, "=", op, Splat{immy}, fs(id)...); break;
+            case Op::splat:     write(o, V{id}, "=", op, Splat{immy}, fs(id)...); break;
+            case Op::splat_q14: write(o, V{id}, "=", op, Splat{immy}, fs(id)...); break;

            case Op::add_f32: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...      ); break;
            case Op::sub_f32: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...      ); break;
@@ -319,30 +320,37 @@ namespace skvm {
            case Op::gt_i32: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;


-            case Op::add_q14x2: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
-            case Op::sub_q14x2: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
-            case Op::mul_q14x2: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::add_q14: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::sub_q14: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::mul_q14: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;

-            case Op::shl_q14x2: write(o, V{id}, "=", op, V{x}, Shift{immy}, fs(id)...); break;
-            case Op::shr_q14x2: write(o, V{id}, "=", op, V{x}, Shift{immy}, fs(id)...); break;
-            case Op::sra_q14x2: write(o, V{id}, "=", op, V{x}, Shift{immy}, fs(id)...); break;
+            case Op::shl_q14: write(o, V{id}, "=", op, V{x}, Shift{immy}, fs(id)...); break;
+            case Op::shr_q14: write(o, V{id}, "=", op, V{x}, Shift{immy}, fs(id)...); break;
+            case Op::sra_q14: write(o, V{id}, "=", op, V{x}, Shift{immy}, fs(id)...); break;

-            case Op:: min_q14x2: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
-            case Op:: max_q14x2: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
-            case Op::umin_q14x2: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
-            case Op::uavg_q14x2: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op:: min_q14: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op:: max_q14: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::uavg_q14: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;

-            case Op::eq_q14x2: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
-            case Op::gt_q14x2: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::eq_q14: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::gt_q14: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;

+            case Op::bit_and_q14  : write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::bit_or_q14   : write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::bit_xor_q14  : write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::bit_clear_q14: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;

-            case Op::bit_and  : write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...      ); break;
-            case Op::bit_or   : write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...      ); break;
-            case Op::bit_xor  : write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...      ); break;
-            case Op::bit_clear: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...      ); break;
+            case Op::from_q14: write(o, V{id}, "=", op, V{x}, fs(id)...); break;
+            case Op::  to_q14: write(o, V{id}, "=", op, V{x}, fs(id)...); break;

-            case Op::select:  write(o, V{id}, "=", op, V{x}, V{y}, V{z}, fs(id)...); break;
-            case Op::pack:    write(o, V{id}, "=", op, V{x}, V{y}, Shift{immz}, fs(id)...); break;
+            case Op::bit_and  : write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::bit_or   : write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::bit_xor  : write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+            case Op::bit_clear: write(o, V{id}, "=", op, V{x}, V{y}, fs(id)...); break;
+
+            case Op::select:     write(o, V{id}, "=", op, V{x}, V{y}, V{z}, fs(id)...); break;
+            case Op::select_q14: write(o, V{id}, "=", op, V{x}, V{y}, V{z}, fs(id)...); break;
+            case Op::pack:  write(o, V{id}, "=", op, V{x}, V{y}, Shift{immz}, fs(id)...); break;

            case Op::ceil:      write(o, V{id}, "=", op, V{x}, fs(id)...); break;
            case Op::floor:     write(o, V{id}, "=", op, V{x}, fs(id)...); break;
@@ -430,7 +438,8 @@ namespace skvm {
                case Op::uniform16: write(o, R{d}, "=", op, Arg{immy}, Hex{immz}); break;
                case Op::uniform32: write(o, R{d}, "=", op, Arg{immy}, Hex{immz}); break;

-                case Op::splat:  write(o, R{d}, "=", op, Splat{immy}); break;
+                case Op::splat:     write(o, R{d}, "=", op, Splat{immy}); break;
+                case Op::splat_q14: write(o, R{d}, "=", op, Splat{immy}); break;


                case Op::add_f32: write(o, R{d}, "=", op, R{x}, R{y}      ); break;
@@ -463,30 +472,37 @@ namespace skvm {
                case Op::gt_i32: write(o, R{d}, "=", op, R{x}, R{y}); break;


-                case Op::add_q14x2: write(o, R{d}, "=", op, R{x}, R{y}); break;
-                case Op::sub_q14x2: write(o, R{d}, "=", op, R{x}, R{y}); break;
-                case Op::mul_q14x2: write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::add_q14: write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::sub_q14: write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::mul_q14: write(o, R{d}, "=", op, R{x}, R{y}); break;

-                case Op::shl_q14x2: write(o, R{d}, "=", op, R{x}, Shift{immy}); break;
-                case Op::shr_q14x2: write(o, R{d}, "=", op, R{x}, Shift{immy}); break;
-                case Op::sra_q14x2: write(o, R{d}, "=", op, R{x}, Shift{immy}); break;
+                case Op::shl_q14: write(o, R{d}, "=", op, R{x}, Shift{immy}); break;
+                case Op::shr_q14: write(o, R{d}, "=", op, R{x}, Shift{immy}); break;
+                case Op::sra_q14: write(o, R{d}, "=", op, R{x}, Shift{immy}); break;

-                case Op:: min_q14x2: write(o, R{d}, "=", op, R{x}, R{y}); break;
-                case Op:: max_q14x2: write(o, R{d}, "=", op, R{x}, R{y}); break;
-                case Op::umin_q14x2: write(o, R{d}, "=", op, R{x}, R{y}); break;
-                case Op::uavg_q14x2: write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op:: min_q14: write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op:: max_q14: write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::uavg_q14: write(o, R{d}, "=", op, R{x}, R{y}); break;

-                case Op::eq_q14x2: write(o, R{d}, "=", op, R{x}, R{y}); break;
-                case Op::gt_q14x2: write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::eq_q14: write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::gt_q14: write(o, R{d}, "=", op, R{x}, R{y}); break;

+                case Op::bit_and_q14  : write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::bit_or_q14   : write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::bit_xor_q14  : write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::bit_clear_q14: write(o, R{d}, "=", op, R{x}, R{y}); break;

-                case Op::bit_and  : write(o, R{d}, "=", op, R{x}, R{y}      ); break;
-                case Op::bit_or   : write(o, R{d}, "=", op, R{x}, R{y}      ); break;
-                case Op::bit_xor  : write(o, R{d}, "=", op, R{x}, R{y}      ); break;
-                case Op::bit_clear: write(o, R{d}, "=", op, R{x}, R{y}      ); break;
+                case Op::from_q14: write(o, R{d}, "=", op, R{x}); break;
+                case Op::  to_q14: write(o, R{d}, "=", op, R{x}); break;

-                case Op::select:  write(o, R{d}, "=", op, R{x}, R{y}, R{z}); break;
-                case Op::pack:    write(o, R{d}, "=", op,   R{x}, R{y}, Shift{immz}); break;
+                case Op::bit_and  : write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::bit_or   : write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::bit_xor  : write(o, R{d}, "=", op, R{x}, R{y}); break;
+                case Op::bit_clear: write(o, R{d}, "=", op, R{x}, R{y}); break;
+
+                case Op::select:     write(o, R{d}, "=", op, R{x}, R{y}, R{z}); break;
+                case Op::select_q14: write(o, R{d}, "=", op, R{x}, R{y}, R{z}); break;
+                case Op::pack: write(o, R{d}, "=", op,   R{x}, R{y}, Shift{immz}); break;

                case Op::ceil:      write(o, R{d}, "=", op, R{x}); break;
                case Op::floor:     write(o, R{d}, "=", op, R{x}); break;
@@ -764,13 +780,8 @@ namespace skvm {
        return {this, push(Op::uniform32, NA,NA,NA, ptr.ix, offset)};
    }

-    // The two splat() functions are just syntax sugar over splatting a 4-byte bit pattern.
-    I32 Builder::splat(int   n) { return {this, push(Op::splat, NA,NA,NA, n) }; }
-    F32 Builder::splat(float f) {
-        int bits;
-        memcpy(&bits, &f, 4);
-        return {this, push(Op::splat, NA,NA,NA, bits)};
-    }
+    I32 Builder::splat    (int n) { return {this, push(Op::splat    , NA,NA,NA, n) }; }
+    Q14 Builder::splat_q14(int n) { return {this, push(Op::splat_q14, NA,NA,NA, n) }; }

    bool fma_supported() {
        static const bool supported =
@@ -1014,30 +1025,43 @@ namespace skvm {
        return {this, this->push(Op::max_f32, x.id, y.id)};
    }

-    // TODO: constant propagation and strength reduction for all these Q14x2 ops
-    Q14x2 Builder::add(Q14x2 x, Q14x2 y) { return {this, this->push(Op::add_q14x2, x.id, y.id)}; }
-    Q14x2 Builder::sub(Q14x2 x, Q14x2 y) { return {this, this->push(Op::sub_q14x2, x.id, y.id)}; }
-    Q14x2 Builder::mul(Q14x2 x, Q14x2 y) { return {this, this->push(Op::mul_q14x2, x.id, y.id)}; }
+    // TODO: constant propagation and strength reduction for all these Q14 ops
+    Q14 Builder::add(Q14 x, Q14 y) { return {this, this->push(Op::add_q14, x.id, y.id)}; }
+    Q14 Builder::sub(Q14 x, Q14 y) { return {this, this->push(Op::sub_q14, x.id, y.id)}; }
+    Q14 Builder::mul(Q14 x, Q14 y) { return {this, this->push(Op::mul_q14, x.id, y.id)}; }

-    Q14x2 Builder::shl(Q14x2 x, int k) { return {this, this->push(Op::shl_q14x2, x.id,NA,NA,k)}; }
-    Q14x2 Builder::shr(Q14x2 x, int k) { return {this, this->push(Op::shr_q14x2, x.id,NA,NA,k)}; }
-    Q14x2 Builder::sra(Q14x2 x, int k) { return {this, this->push(Op::sra_q14x2, x.id,NA,NA,k)}; }
+    Q14 Builder::shl(Q14 x, int k) { return {this, this->push(Op::shl_q14, x.id,NA,NA,k)}; }
+    Q14 Builder::shr(Q14 x, int k) { return {this, this->push(Op::shr_q14, x.id,NA,NA,k)}; }
+    Q14 Builder::sra(Q14 x, int k) { return {this, this->push(Op::sra_q14, x.id,NA,NA,k)}; }

-    I32 Builder:: eq(Q14x2 x, Q14x2 y) { return {this, this->push(Op::eq_q14x2, x.id, y.id)}; }
-    I32 Builder::gt (Q14x2 x, Q14x2 y) { return {this, this->push(Op::gt_q14x2, x.id, y.id)}; }
-    I32 Builder::lt (Q14x2 x, Q14x2 y) { return  gt(y,x); }
-    I32 Builder::neq(Q14x2 x, Q14x2 y) { return ~eq(x,y); }
-    I32 Builder::gte(Q14x2 x, Q14x2 y) { return ~lt(x,y); }
-    I32 Builder::lte(Q14x2 x, Q14x2 y) { return ~gt(x,y); }
+    Q14 Builder:: eq(Q14 x, Q14 y) { return {this, this->push(Op::eq_q14, x.id, y.id)}; }
+    Q14 Builder::gt (Q14 x, Q14 y) { return {this, this->push(Op::gt_q14, x.id, y.id)}; }
+    Q14 Builder::lt (Q14 x, Q14 y) { return  gt(y,x); }
+    Q14 Builder::neq(Q14 x, Q14 y) { return ~eq(x,y); }
+    Q14 Builder::gte(Q14 x, Q14 y) { return ~lt(x,y); }
+    Q14 Builder::lte(Q14 x, Q14 y) { return ~gt(x,y); }

-    Q14x2 Builder::min(Q14x2 x, Q14x2 y) { return {this, this->push(Op::min_q14x2, x.id, y.id)}; }
-    Q14x2 Builder::max(Q14x2 x, Q14x2 y) { return {this, this->push(Op::max_q14x2, x.id, y.id)}; }
+    Q14 Builder::min(Q14 x, Q14 y) { return {this, this->push(Op::min_q14, x.id, y.id)}; }
+    Q14 Builder::max(Q14 x, Q14 y) { return {this, this->push(Op::max_q14, x.id, y.id)}; }

-    Q14x2 Builder::unsigned_avg(Q14x2 x, Q14x2 y) {
-        return {this, this->push(Op::uavg_q14x2, x.id, y.id)};
+    Q14 Builder::bit_and  (Q14 x, Q14 y) { return {this, this->push(Op::bit_and_q14  ,x.id,y.id)}; }
+    Q14 Builder::bit_or   (Q14 x, Q14 y) { return {this, this->push(Op::bit_or_q14   ,x.id,y.id)}; }
+    Q14 Builder::bit_xor  (Q14 x, Q14 y) { return {this, this->push(Op::bit_xor_q14  ,x.id,y.id)}; }
+    Q14 Builder::bit_clear(Q14 x, Q14 y) { return {this, this->push(Op::bit_clear_q14,x.id,y.id)}; }
+
+    Q14 Builder::select(Q14 cond, Q14 t, Q14 f) {
+        return {this, this->push(Op::select_q14, cond.id, t.id, f.id)};
    }
-    Q14x2 Builder::unsigned_min(Q14x2 x, Q14x2 y) {
-        return {this, this->push(Op::umin_q14x2, x.id, y.id)};
+
+    Q14 Builder::to_q14(I32 x) { return {this, this->push(Op::  to_q14, x.id) }; }
+    I32 Builder::to_i32(Q14 x) { return {this, this->push(Op::from_q14, x.id) }; }
+
+    // TODO: open question in general whether float -> q14 should round() or trunc().
+    Q14 Builder::to_q14(F32 x) { return to_q14(trunc(x * 16384.0f)); }
+    F32 Builder::to_f32(Q14 x) { return to_f32(to_i32(x)) * (1/16384.0f); }
+
+    Q14 Builder::unsigned_avg(Q14 x, Q14 y) {
+        return {this, this->push(Op::uavg_q14, x.id, y.id)};
    }

    I32 Builder::add(I32 x, I32 y) {
@@ -3521,6 +3545,16 @@ namespace skvm {
                    (void)constants[immy];
                    break;

+                case Op::splat_q14:
+                case Op::select_q14:
+                case Op::bit_and_q14:
+                case Op::bit_or_q14:
+                case Op::bit_xor_q14:
+                case Op::bit_clear_q14:
+                case Op::from_q14:
+                case Op::  to_q14:
+                    return false;  // TODO
+
            #if defined(__x86_64__) || defined(_M_X64)
                case Op::assert_true: {
                    a->vptest (r(x), &constants[0xffffffff]);
@@ -3782,7 +3816,7 @@ namespace skvm {
                    else           { a->vpaddd(dst(y), r(y), any(x)); }
                                     break;

-                case Op::add_q14x2:
+                case Op::add_q14:
                    if (in_reg(x)) { a->vpaddw(dst(x), r(x), any(y)); }
                    else           { a->vpaddw(dst(y), r(y), any(x)); }
                                     break;
@@ -3792,14 +3826,14 @@ namespace skvm {
                    else           { a->vpmulld(dst(y), r(y), any(x)); }
                                     break;

-                case Op::mul_q14x2:
+                case Op::mul_q14:
                    if (in_reg(x)) { a->vpmulhrsw(dst(x), r(x), any(y)); }
                    else           { a->vpmulhrsw(dst(y), r(y), any(x)); }
                                     a->vpaddw(dst(), dst(), dst());  // << 1
                                     break;

-                case Op::sub_i32:   a->vpsubd(dst(x), r(x), any(y)); break;
-                case Op::sub_q14x2: a->vpsubw(dst(x), r(x), any(y)); break;
+                case Op::sub_i32: a->vpsubd(dst(x), r(x), any(y)); break;
+                case Op::sub_q14: a->vpsubw(dst(x), r(x), any(y)); break;

                case Op::bit_and:
                    if (in_reg(x)) { a->vpand(dst(x), r(x), any(y)); }
@@ -3821,22 +3855,17 @@ namespace skvm {
                    else              { a->vpblendvb(dst(x), r(z), any(y), r(x)); }
                                        break;

-                case Op::min_q14x2:
+                case Op::min_q14:
                    if (in_reg(x)) { a->vpminsw(dst(x), r(x), any(y)); }
                    else           { a->vpminsw(dst(y), r(y), any(x)); }
                                     break;

-                case Op::max_q14x2:
+                case Op::max_q14:
                    if (in_reg(x)) { a->vpmaxsw(dst(x), r(x), any(y)); }
                    else           { a->vpmaxsw(dst(y), r(y), any(x)); }
                                     break;

-                case Op::umin_q14x2:
-                    if (in_reg(x)) { a->vpminuw(dst(x), r(x), any(y)); }
-                    else           { a->vpminuw(dst(y), r(y), any(x)); }
-                                     break;
-
-                case Op::uavg_q14x2:
+                case Op::uavg_q14:
                    if (in_reg(x)) { a->vpavgw(dst(x), r(x), any(y)); }
                    else           { a->vpavgw(dst(y), r(y), any(x)); }
                                     break;
@@ -3845,22 +3874,22 @@ namespace skvm {
                case Op::shr_i32: a->vpsrld(dst(x), r(x), immy); break;
                case Op::sra_i32: a->vpsrad(dst(x), r(x), immy); break;

-                case Op::shl_q14x2: a->vpsllw(dst(x), r(x), immy); break;
-                case Op::shr_q14x2: a->vpsrlw(dst(x), r(x), immy); break;
-                case Op::sra_q14x2: a->vpsraw(dst(x), r(x), immy); break;
+                case Op::shl_q14: a->vpsllw(dst(x), r(x), immy); break;
+                case Op::shr_q14: a->vpsrlw(dst(x), r(x), immy); break;
+                case Op::sra_q14: a->vpsraw(dst(x), r(x), immy); break;

                case Op::eq_i32:
                    if (in_reg(x)) { a->vpcmpeqd(dst(x), r(x), any(y)); }
                    else           { a->vpcmpeqd(dst(y), r(y), any(x)); }
                                     break;

-                case Op::eq_q14x2:
+                case Op::eq_q14:
                    if (in_reg(x)) { a->vpcmpeqw(dst(x), r(x), any(y)); }
                    else           { a->vpcmpeqw(dst(y), r(y), any(x)); }
                                     break;

-                case Op::gt_i32:   a->vpcmpgtd(dst(), r(x), any(y)); break;
-                case Op::gt_q14x2: a->vpcmpgtw(dst(), r(x), any(y)); break;
+                case Op::gt_i32: a->vpcmpgtd(dst(), r(x), any(y)); break;
+                case Op::gt_q14: a->vpcmpgtw(dst(), r(x), any(y)); break;

                case Op::eq_f32:
                    if (in_reg(x)) { a->vcmpeqps(dst(x), r(x), any(y)); }
--- a/src/core/SkVM.h
+++ b/src/core/SkVM.h
@@ -401,33 +401,31 @@ namespace skvm {
    };

    // Order matters a little: Ops <=store128 are treated as having side effects.
-    #define SKVM_OPS(M)                                            \
-        M(assert_true)                                             \
-        M(store8)   M(store16)   M(store32) M(store64) M(store128) \
-        M(index)                                                   \
-        M(load8)    M(load16)    M(load32)  M(load64) M(load128)   \
-        M(gather8)  M(gather16)  M(gather32)                       \
-        M(uniform8) M(uniform16) M(uniform32)                      \
-        M(splat)                                                   \
-        M(add_f32) M(add_i32) M(add_q14x2)                         \
-        M(sub_f32) M(sub_i32) M(sub_q14x2)                         \
-        M(mul_f32) M(mul_i32) M(mul_q14x2)                         \
-        M(div_f32)                                                 \
-        M(min_f32)   M(max_f32)                                    \
-        M(min_q14x2) M(max_q14x2) M(uavg_q14x2) M(umin_q14x2)      \
-        M(fma_f32) M(fms_f32) M(fnma_f32)                          \
-        M(sqrt_f32)                                                \
-        M(shl_i32)   M(shr_i32)   M(sra_i32)                       \
-        M(shl_q14x2) M(shr_q14x2) M(sra_q14x2)                     \
-        M(ceil) M(floor) M(trunc) M(round) M(to_half) M(from_half) \
-        M(to_f32)                                                  \
-        M(neq_f32) M(eq_f32) M(eq_i32) M(eq_q14x2)                 \
-        M(gte_f32) M(gt_f32) M(gt_i32) M(gt_q14x2)                 \
-        M(bit_and)                                                 \
-        M(bit_or)                                                  \
-        M(bit_xor)                                                 \
-        M(bit_clear)                                               \
-        M(select) M(pack)                                          \
+    #define SKVM_OPS(M)                                              \
+        M(assert_true)                                               \
+        M(store8)   M(store16)   M(store32) M(store64) M(store128)   \
+        M(index)                                                     \
+        M(load8)    M(load16)    M(load32)  M(load64) M(load128)     \
+        M(gather8)  M(gather16)  M(gather32)                         \
+        M(uniform8) M(uniform16) M(uniform32)                        \
+        M(splat) M(splat_q14)                                        \
+        M(add_f32) M(add_i32) M(add_q14)                             \
+        M(sub_f32) M(sub_i32) M(sub_q14)                             \
+        M(mul_f32) M(mul_i32) M(mul_q14)                             \
+        M(div_f32)                                                   \
+        M(min_f32) M(max_f32)                                        \
+        M(min_q14) M(max_q14) M(uavg_q14)                            \
+        M(fma_f32) M(fms_f32) M(fnma_f32)                            \
+        M(sqrt_f32)                                                  \
+        M(shl_i32) M(shr_i32) M(sra_i32)                             \
+        M(shl_q14) M(shr_q14) M(sra_q14)                             \
+        M(ceil) M(floor) M(trunc) M(round) M(to_half) M(from_half)   \
+        M(to_f32) M(to_q14) M(from_q14)                              \
+        M(neq_f32) M(eq_f32) M(eq_i32) M(eq_q14)                     \
+        M(gte_f32) M(gt_f32) M(gt_i32) M(gt_q14)                     \
+        M(bit_and)     M(bit_or)     M(bit_xor)     M(bit_clear)     \
+        M(bit_and_q14) M(bit_or_q14) M(bit_xor_q14) M(bit_clear_q14) \
+        M(select) M(select_q14) M(pack)                              \
    // End of SKVM_OPS

    enum class Op : int {
@@ -464,7 +462,7 @@ namespace skvm {
        Builder* operator->()    const { return builder; }
    };

-    struct Q14x2 {
+    struct Q14 {
        Builder* builder = nullptr;
        Val      id      = NA;
        explicit operator bool() const { return id != NA; }
@@ -501,18 +499,14 @@ namespace skvm {
        float imm = 0;
    };

-    struct Q14x2a {
-        Q14x2a(Q14x2 v) : SkDEBUGCODE(builder(v.builder),) id(v.id) {}
-        Q14x2a(float f) {
-            SkASSERT(-1.0f <= f && f <= 1.0f);  // TODO: allow full [-2,+2)?
-            int q14 = (int)(f * 16384.0f)
-                    & 0xffff;
-            imm = q14 | (q14<<16);
-        }
+    struct Q14a {
+        Q14a(Q14 v) : SkDEBUGCODE(builder(v.builder),) id(v.id) {}
+        Q14a(int bits) : imm{SkTo<int16_t>(bits)} {}   // 0x0000'4000 -> 0x4000
+        Q14a(float f) : Q14a{(int)(f * 16384.0f)} {}   //        1.0f -> 0x4000

        SkDEBUGCODE(Builder* builder = nullptr;)
-        Val   id  = NA;
-        int   imm = 0;
+        Val     id  = NA;
+        int16_t imm = 0;
    };

    struct Color {
@@ -527,10 +521,10 @@ namespace skvm {
        Builder* operator->()    const { return a.operator->(); }
    };

-    struct ColorQ14 {
-        Q14x2 rb, ga;  // TODO: simpler to start with r,g,b,a?
-        explicit operator bool() const { return rb && ga; }
-        Builder* operator->()    const { return ga.operator->(); }
+    struct Color_Q14 {
+        Q14 r,g,b,a;
+        explicit operator bool() const { return r && g && b && a; }
+        Builder* operator->()    const { return a.operator->(); }
    };

    struct Coord {
@@ -677,7 +671,16 @@ namespace skvm {
        // Load an immediate constant.
        I32 splat(int      n);
        I32 splat(unsigned u) { return splat((int)u); }
-        F32 splat(float    f);
+        F32 splat(float    f) {
+            int bits;
+            memcpy(&bits, &f, 4);
+            return bit_cast(splat(bits));
+        }
+
+        // Load an immediate Q14, expressed as either integer (16384, 0x4000) or float (1.0f).
+        Q14 splat_q14(int      n);
+        Q14 splat_q14(unsigned u) { return splat_q14((int)u); }
+        Q14 splat_q14(float    f) { return splat_q14(Q14a{f}.imm); }

        // float math, comparisons, etc.
        F32 add(F32, F32);  F32 add(F32a x, F32a y) { return add(_(x), _(y)); }
@@ -724,7 +727,7 @@ namespace skvm {

        I32 trunc(F32 x);
        I32 round(F32 x);  // Round to int using current rounding mode (as if lrintf()).
-        I32 bit_cast(F32 x) { return {x.builder, x.id}; }  // TODO: rename to as_I32()?
+        I32 bit_cast(F32 x) { return {x.builder, x.id}; }

        I32   to_half(F32 x);
        F32 from_half(I32 x);
@@ -759,7 +762,7 @@ namespace skvm {
        I32 gte(I32 x, I32 y);  I32 gte(I32a x, I32a y) { return gte(_(x), _(y)); }

        F32 to_f32(I32 x);
-        F32 bit_cast(I32 x) { return {x.builder, x.id}; }  // TODO: rename to as_F32()?
+        F32 bit_cast(I32 x) { return {x.builder, x.id}; }

        // Bitwise operations.
        I32 bit_and  (I32, I32);  I32 bit_and  (I32a x, I32a y) { return bit_and  (_(x), _(y)); }
@@ -778,14 +781,11 @@ namespace skvm {
            return bit_cast(select(cond, bit_cast(t)
                                       , bit_cast(f)));
        }
-        Q14x2 select(I32 cond, Q14x2 t, Q14x2 f) {
-            return as_Q14x2(select(cond, as_I32(t)
-                                       , as_I32(f)));
-        }
+        Q14 select(Q14 cond, Q14 t, Q14 f);

-        I32   select(I32a cond, I32a   t, I32a   f) { return select(_(cond), _(t), _(f)); }
-        F32   select(I32a cond, F32a   t, F32a   f) { return select(_(cond), _(t), _(f)); }
-        Q14x2 select(I32a cond, Q14x2a t, Q14x2a f) { return select(_(cond), _(t), _(f)); }
+        I32 select(I32a cond, I32a t, I32a f) { return select(_(cond), _(t), _(f)); }
+        F32 select(I32a cond, F32a t, F32a f) { return select(_(cond), _(t), _(f)); }
+        Q14 select(Q14a cond, Q14a t, Q14a f) { return select(_(cond), _(t), _(f)); }

        I32 extract(I32 x, int bits, I32 z);   // (x>>bits) & z
        I32 pack   (I32 x, I32 y, int bits);   // x | (y << bits), assuming (x & (y << bits)) == 0
@@ -793,32 +793,34 @@ namespace skvm {
        I32 extract(I32a x, int bits, I32a z) { return extract(_(x), bits, _(z)); }
        I32 pack   (I32a x, I32a y, int bits) { return pack   (_(x), _(y), bits); }

-        I32   as_I32  (Q14x2 x) { return {x.builder, x.id}; }
-        Q14x2 as_Q14x2(I32   x) { return {x.builder, x.id}; }
+        Q14 add(Q14, Q14);  Q14 add(Q14a x, Q14a y) { return add(_(x), _(y)); }
+        Q14 sub(Q14, Q14);  Q14 sub(Q14a x, Q14a y) { return sub(_(x), _(y)); }
+        Q14 mul(Q14, Q14);  Q14 mul(Q14a x, Q14a y) { return mul(_(x), _(y)); }

-        Q14x2 add(Q14x2, Q14x2);  Q14x2 add(Q14x2a x, Q14x2a y) { return add(_(x), _(y)); }
-        Q14x2 sub(Q14x2, Q14x2);  Q14x2 sub(Q14x2a x, Q14x2a y) { return sub(_(x), _(y)); }
-        Q14x2 mul(Q14x2, Q14x2);  Q14x2 mul(Q14x2a x, Q14x2a y) { return mul(_(x), _(y)); }
+        Q14 min(Q14, Q14);  Q14 min(Q14a x, Q14a y) { return min(_(x), _(y)); }
+        Q14 max(Q14, Q14);  Q14 max(Q14a x, Q14a y) { return max(_(x), _(y)); }

-        Q14x2 min(Q14x2, Q14x2);  Q14x2 min(Q14x2a x, Q14x2a y) { return min(_(x), _(y)); }
-        Q14x2 max(Q14x2, Q14x2);  Q14x2 max(Q14x2a x, Q14x2a y) { return max(_(x), _(y)); }
+        Q14 shl(Q14, int bits);
+        Q14 shr(Q14, int bits);
+        Q14 sra(Q14, int bits);

-        Q14x2 shl(Q14x2, int bits);
-        Q14x2 shr(Q14x2, int bits);
-        Q14x2 sra(Q14x2, int bits);
+        Q14 eq (Q14, Q14);  Q14  eq(Q14a x, Q14a y) { return  eq(_(x), _(y)); }
+        Q14 neq(Q14, Q14);  Q14 neq(Q14a x, Q14a y) { return neq(_(x), _(y)); }
+        Q14 lt (Q14, Q14);  Q14 lt (Q14a x, Q14a y) { return lt (_(x), _(y)); }
+        Q14 lte(Q14, Q14);  Q14 lte(Q14a x, Q14a y) { return lte(_(x), _(y)); }
+        Q14 gt (Q14, Q14);  Q14 gt (Q14a x, Q14a y) { return gt (_(x), _(y)); }
+        Q14 gte(Q14, Q14);  Q14 gte(Q14a x, Q14a y) { return gte(_(x), _(y)); }

-        I32 eq (Q14x2, Q14x2);  I32  eq(Q14x2a x, Q14x2a y) { return  eq(_(x), _(y)); }
-        I32 neq(Q14x2, Q14x2);  I32 neq(Q14x2a x, Q14x2a y) { return neq(_(x), _(y)); }
-        I32 lt (Q14x2, Q14x2);  I32 lt (Q14x2a x, Q14x2a y) { return lt (_(x), _(y)); }
-        I32 lte(Q14x2, Q14x2);  I32 lte(Q14x2a x, Q14x2a y) { return lte(_(x), _(y)); }
-        I32 gt (Q14x2, Q14x2);  I32 gt (Q14x2a x, Q14x2a y) { return gt (_(x), _(y)); }
-        I32 gte(Q14x2, Q14x2);  I32 gte(Q14x2a x, Q14x2a y) { return gte(_(x), _(y)); }
+        Q14 bit_and  (Q14, Q14);  Q14 bit_and  (Q14a x, Q14a y) { return bit_and  (_(x), _(y)); }
+        Q14 bit_or   (Q14, Q14);  Q14 bit_or   (Q14a x, Q14a y) { return bit_or   (_(x), _(y)); }
+        Q14 bit_xor  (Q14, Q14);  Q14 bit_xor  (Q14a x, Q14a y) { return bit_xor  (_(x), _(y)); }
+        Q14 bit_clear(Q14, Q14);  Q14 bit_clear(Q14a x, Q14a y) { return bit_clear(_(x), _(y)); }

-        Q14x2 unsigned_avg(Q14x2  x, Q14x2  y);  // (x+y+1)>>1
-        Q14x2 unsigned_avg(Q14x2a x, Q14x2a y) { return unsigned_avg(_(x), _(y)); }
+        Q14 unsigned_avg(Q14  x, Q14  y);  // (x+y+1)>>1
+        Q14 unsigned_avg(Q14a x, Q14a y) { return unsigned_avg(_(x), _(y)); }

-        Q14x2 unsigned_min(Q14x2  x, Q14x2  y);
-        Q14x2 unsigned_min(Q14x2a x, Q14x2a y) { return unsigned_min(_(x), _(y)); }
+        Q14 to_q14(F32); F32 to_f32(Q14);   // Converts values, e.g. 0x4000 <-> 1.0f
+        Q14 to_q14(I32); I32 to_i32(Q14);   // Preserves bits, e.g. 0x4000 <-> 0x00004000

        // Common idioms used in several places, worth centralizing for consistency.
        F32 from_unorm(int bits, I32);   // E.g. from_unorm(8, x) -> x * (1/255.0f)
@@ -874,12 +876,12 @@ namespace skvm {
            return splat(x.imm);
        }

-        Q14x2 _(Q14x2a x) {
+        Q14 _(Q14a x) {
            if (x.id != NA) {
                SkASSERT(x.builder == this);
                return {this, x.id};
            }
-            return as_Q14x2(splat(x.imm));
+            return splat_q14(x.imm);
        }

        bool allImm() const;
@@ -985,46 +987,43 @@ namespace skvm {
    // TODO: control flow
    // TODO: 64-bit values?

-    static inline Q14x2 operator+(Q14x2 x, Q14x2a y) { return x->add(x,y); }
-    static inline Q14x2 operator+(float x, Q14x2  y) { return y->add(x,y); }
+    static inline Q14 operator+(Q14   x, Q14a y) { return x->add(x,y); }
+    static inline Q14 operator+(float x, Q14  y) { return y->add(x,y); }

-    static inline Q14x2 operator-(Q14x2 x, Q14x2a y) { return x->sub(x,y); }
-    static inline Q14x2 operator-(float x, Q14x2  y) { return y->sub(x,y); }
+    static inline Q14 operator-(Q14   x, Q14a y) { return x->sub(x,y); }
+    static inline Q14 operator-(float x, Q14  y) { return y->sub(x,y); }

-    static inline Q14x2 operator*(Q14x2 x, Q14x2a y) { return x->mul(x,y); }
-    static inline Q14x2 operator*(float x, Q14x2  y) { return y->mul(x,y); }
+    static inline Q14 operator*(Q14   x, Q14a y) { return x->mul(x,y); }
+    static inline Q14 operator*(float x, Q14  y) { return y->mul(x,y); }

-    static inline Q14x2 min(Q14x2 x, Q14x2a y) { return x->min(x,y); }
-    static inline Q14x2 min(float x, Q14x2  y) { return y->min(x,y); }
+    static inline Q14 min(Q14   x, Q14a y) { return x->min(x,y); }
+    static inline Q14 min(float x, Q14  y) { return y->min(x,y); }

-    static inline Q14x2 max(Q14x2 x, Q14x2a y) { return x->max(x,y); }
-    static inline Q14x2 max(float x, Q14x2  y) { return y->max(x,y); }
+    static inline Q14 max(Q14   x, Q14a y) { return x->max(x,y); }
+    static inline Q14 max(float x, Q14  y) { return y->max(x,y); }

-    static inline Q14x2 unsigned_min(Q14x2 x, Q14x2a y) { return x->unsigned_min(x,y); }
-    static inline Q14x2 unsigned_min(float x, Q14x2  y) { return y->unsigned_min(x,y); }
+    static inline Q14 unsigned_avg(Q14   x, Q14a y) { return x->unsigned_avg(x,y); }
+    static inline Q14 unsigned_avg(float x, Q14  y) { return y->unsigned_avg(x,y); }

-    static inline Q14x2 unsigned_avg(Q14x2 x, Q14x2a y) { return x->unsigned_avg(x,y); }
-    static inline Q14x2 unsigned_avg(float x, Q14x2  y) { return y->unsigned_avg(x,y); }
+    static inline Q14 operator==(Q14   x, Q14   y) { return x->eq(x,y); }
+    static inline Q14 operator==(Q14   x, float y) { return x->eq(x,y); }
+    static inline Q14 operator==(float x, Q14   y) { return y->eq(x,y); }

-    static inline I32 operator==(Q14x2 x, Q14x2 y) { return x->eq(x,y); }
-    static inline I32 operator==(Q14x2 x, float y) { return x->eq(x,y); }
-    static inline I32 operator==(float x, Q14x2 y) { return y->eq(x,y); }
+    static inline Q14 operator!=(Q14   x, Q14   y) { return x->neq(x,y); }
+    static inline Q14 operator!=(Q14   x, float y) { return x->neq(x,y); }
+    static inline Q14 operator!=(float x, Q14   y) { return y->neq(x,y); }

-    static inline I32 operator!=(Q14x2 x, Q14x2 y) { return x->neq(x,y); }
-    static inline I32 operator!=(Q14x2 x, float y) { return x->neq(x,y); }
-    static inline I32 operator!=(float x, Q14x2 y) { return y->neq(x,y); }
+    static inline Q14 operator< (Q14   x, Q14a y) { return x->lt(x,y); }
+    static inline Q14 operator< (float x, Q14  y) { return y->lt(x,y); }

-    static inline I32 operator< (Q14x2 x, Q14x2a y) { return x->lt(x,y); }
-    static inline I32 operator< (float x, Q14x2  y) { return y->lt(x,y); }
+    static inline Q14 operator<=(Q14   x, Q14a y) { return x->lte(x,y); }
+    static inline Q14 operator<=(float x, Q14  y) { return y->lte(x,y); }

-    static inline I32 operator<=(Q14x2 x, Q14x2a y) { return x->lte(x,y); }
-    static inline I32 operator<=(float x, Q14x2  y) { return y->lte(x,y); }
+    static inline Q14 operator> (Q14   x, Q14a y) { return x->gt(x,y); }
+    static inline Q14 operator> (float x, Q14  y) { return y->gt(x,y); }

-    static inline I32 operator> (Q14x2 x, Q14x2a y) { return x->gt(x,y); }
-    static inline I32 operator> (float x, Q14x2  y) { return y->gt(x,y); }
-
-    static inline I32 operator>=(Q14x2 x, Q14x2a y) { return x->gte(x,y); }
-    static inline I32 operator>=(float x, Q14x2  y) { return y->gte(x,y); }
+    static inline Q14 operator>=(Q14   x, Q14a y) { return x->gte(x,y); }
+    static inline Q14 operator>=(float x, Q14  y) { return y->gte(x,y); }


    static inline I32 operator+(I32 x, I32a y) { return x->add(x,y); }
@@ -1101,9 +1100,9 @@ namespace skvm {
    static inline I32 operator>=(F32   x, F32a y) { return x->gte(x,y); }
    static inline I32 operator>=(float x, F32  y) { return y->gte(x,y); }

-    static inline Q14x2& operator+=(Q14x2& x, Q14x2a y) { return (x = x + y); }
-    static inline Q14x2& operator-=(Q14x2& x, Q14x2a y) { return (x = x - y); }
-    static inline Q14x2& operator*=(Q14x2& x, Q14x2a y) { return (x = x * y); }
+    static inline Q14& operator+=(Q14& x, Q14a y) { return (x = x + y); }
+    static inline Q14& operator-=(Q14& x, Q14a y) { return (x = x - y); }
+    static inline Q14& operator*=(Q14& x, Q14a y) { return (x = x * y); }

    static inline I32& operator+=(I32& x, I32a y) { return (x = x + y); }
    static inline I32& operator-=(I32& x, I32a y) { return (x = x - y); }
@@ -1168,8 +1167,10 @@ namespace skvm {
    static inline I32   to_half(F32 x) { return x->  to_half(x); }
    static inline F32 from_half(I32 x) { return x->from_half(x); }

-    static inline Q14x2 as_Q14x2(I32   x) { return x->as_Q14x2(x); }
-    static inline I32   as_I32  (Q14x2 x) { return x->as_I32  (x); }
+    static inline F32 to_f32(Q14 x) { return x->to_f32(x); }
+    static inline I32 to_i32(Q14 x) { return x->to_i32(x); }
+    static inline Q14 to_q14(F32 x) { return x->to_q14(x); }
+    static inline Q14 to_q14(I32 x) { return x->to_q14(x); }

    static inline F32 lerp(F32   lo, F32a  hi, F32a t) { return lo->lerp(lo,hi,t); }
    static inline F32 lerp(float lo, F32   hi, F32a t) { return hi->lerp(lo,hi,t); }
@@ -1187,11 +1188,11 @@ namespace skvm {
    static inline I32        shr(I32 x, int bits) { return x->shr(x, bits); }
    static inline I32        sra(I32 x, int bits) { return x->sra(x, bits); }

-    static inline Q14x2 operator<<(Q14x2 x, int bits) { return x->shl(x, bits); }
-    static inline Q14x2        shl(Q14x2 x, int bits) { return x->shl(x, bits); }
-    static inline Q14x2        shr(Q14x2 x, int bits) { return x->shr(x, bits); }
-    static inline Q14x2        sra(Q14x2 x, int bits) { return x->sra(x, bits); }
-    static inline Q14x2 operator>>(Q14x2 x, int bits) { return x->sra(x, bits); }
+    static inline Q14 operator<<(Q14 x, int bits) { return x->shl(x, bits); }
+    static inline Q14        shl(Q14 x, int bits) { return x->shl(x, bits); }
+    static inline Q14        shr(Q14 x, int bits) { return x->shr(x, bits); }
+    static inline Q14        sra(Q14 x, int bits) { return x->sra(x, bits); }
+    static inline Q14 operator>>(Q14 x, int bits) { return x->sra(x, bits); }

    static inline I32 operator&(I32 x, I32a y) { return x->bit_and(x,y); }
    static inline I32 operator&(int x, I32  y) { return y->bit_and(x,y); }
@@ -1206,19 +1207,33 @@ namespace skvm {
    static inline I32& operator|=(I32& x, I32a y) { return (x = x | y); }
    static inline I32& operator^=(I32& x, I32a y) { return (x = x ^ y); }

-    static inline I32   select(I32 cond, I32a   t, I32a   f) { return cond->select(cond,t,f); }
-    static inline F32   select(I32 cond, F32a   t, F32a   f) { return cond->select(cond,t,f); }
-    static inline Q14x2 select(I32 cond, Q14x2a t, Q14x2a f) { return cond->select(cond,t,f); }
+    static inline Q14 operator&(Q14 x, Q14a y) { return x->bit_and(x,y); }
+    static inline Q14 operator&(int x, Q14  y) { return y->bit_and(x,y); }
+
+    static inline Q14 operator|(Q14 x, Q14a y) { return x->bit_or(x,y); }
+    static inline Q14 operator|(int x, Q14  y) { return y->bit_or(x,y); }
+
+    static inline Q14 operator^(Q14 x, Q14a y) { return x->bit_xor(x,y); }
+    static inline Q14 operator^(int x, Q14  y) { return y->bit_xor(x,y); }
+
+    static inline Q14& operator&=(Q14& x, Q14a y) { return (x = x & y); }
+    static inline Q14& operator|=(Q14& x, Q14a y) { return (x = x | y); }
+    static inline Q14& operator^=(Q14& x, Q14a y) { return (x = x ^ y); }
+
+    static inline I32 select(I32 cond, I32a t, I32a f) { return cond->select(cond,t,f); }
+    static inline F32 select(I32 cond, F32a t, F32a f) { return cond->select(cond,t,f); }
+    static inline Q14 select(Q14 cond, Q14a t, Q14a f) { return cond->select(cond,t,f); }

    static inline I32 extract(I32 x, int bits, I32a z) { return x->extract(x,bits,z); }
    static inline I32 extract(int x, int bits, I32  z) { return z->extract(x,bits,z); }
    static inline I32 pack   (I32 x, I32a y, int bits) { return x->pack   (x,y,bits); }
    static inline I32 pack   (int x, I32  y, int bits) { return y->pack   (x,y,bits); }

-    static inline I32   operator~(I32   x) { return ~0^x; }
-    static inline I32   operator-(I32   x) { return  0-x; }
-    static inline F32   operator-(F32   x) { return 0.0f-x; }
-    static inline Q14x2 operator-(Q14x2 x) { return 0.0f-x; }
+    static inline I32 operator~(I32 x) { return ~0 ^ x; }
+    static inline Q14 operator~(Q14 x) { return ~0 ^ x; }
+    static inline I32 operator-(I32 x) { return  0 - x; }
+    static inline Q14 operator-(Q14 x) { return  0 - x; }
+    static inline F32 operator-(F32 x) { return 0.0f - x; }

    static inline F32 from_unorm(int bits, I32 x) { return x->from_unorm(bits,x); }
    static inline I32   to_unorm(int bits, F32 x) { return x->  to_unorm(bits,x); }
--- a/src/core/SkVM_fwd.h
+++ b/src/core/SkVM_fwd.h
@@ -16,6 +16,7 @@ namespace skvm {
    struct I32;
    struct F32;
    struct Color;
+    struct Color_Q14;
    struct Coord;
    struct Uniforms;
 }  // namespace skvm
--- a/src/opts/SkVM_opts.h
+++ b/src/opts/SkVM_opts.h
@@ -68,21 +68,18 @@ namespace SK_OPTS_NS {
        constexpr int K = 8;   // 256-bit: 2 xmm, 2 v-registers, etc.
    #endif
        using I32 = skvx::Vec<K, int>;
+        using I16 = skvx::Vec<K, int16_t>;
        using F32 = skvx::Vec<K, float>;
        using U64 = skvx::Vec<K, uint64_t>;
        using U32 = skvx::Vec<K, uint32_t>;
        using U16 = skvx::Vec<K, uint16_t>;
        using  U8 = skvx::Vec<K, uint8_t>;
-
-        using I16x2 = skvx::Vec<2*K,  int16_t>;
-        using U16x2 = skvx::Vec<2*K, uint16_t>;
-
        union Slot {
            F32   f32;
            I32   i32;
            U32   u32;
-            I16x2 i16x2;
-            U16x2 u16x2;
+            I16   i16;
+            U16   u16;
        };

        Slot                     few_regs[16];
@@ -307,27 +304,41 @@ namespace SK_OPTS_NS {
                        r[d].f32 = skvx::from_half(skvx::cast<uint16_t>(r[x].i32));
                        break;

-                    CASE(Op::add_q14x2): r[d].i16x2 = r[x].i16x2 + r[y].i16x2; break;
-                    CASE(Op::sub_q14x2): r[d].i16x2 = r[x].i16x2 - r[y].i16x2; break;
-                    CASE(Op::mul_q14x2): r[d].i16x2 = mul_q14(r[x].i16x2, r[y].i16x2); break;
+                    CASE(Op::splat_q14): r[d].i16 = immy; break;

-                    CASE(Op::shl_q14x2): r[d].i16x2 = r[x].i16x2 << immy; break;
-                    CASE(Op::sra_q14x2): r[d].i16x2 = r[x].i16x2 >> immy; break;
-                    CASE(Op::shr_q14x2): r[d].u16x2 = r[x].u16x2 >> immy; break;
+                    CASE(Op::add_q14): r[d].i16 = r[x].i16 + r[y].i16; break;
+                    CASE(Op::sub_q14): r[d].i16 = r[x].i16 - r[y].i16; break;
+                    CASE(Op::mul_q14): r[d].i16 = mul_q14(r[x].i16, r[y].i16); break;

-                    CASE(Op::eq_q14x2): r[d].i16x2 = r[x].i16x2 == r[y].i16x2; break;
-                    CASE(Op::gt_q14x2): r[d].i16x2 = r[x].i16x2 >  r[y].i16x2; break;
+                    CASE(Op::shl_q14): r[d].i16 = r[x].i16 << immy; break;
+                    CASE(Op::sra_q14): r[d].i16 = r[x].i16 >> immy; break;
+                    CASE(Op::shr_q14): r[d].u16 = r[x].u16 >> immy; break;

-                    CASE(Op:: min_q14x2): r[d].i16x2 = min(r[x].i16x2, r[y].i16x2); break;
-                    CASE(Op:: max_q14x2): r[d].i16x2 = max(r[x].i16x2, r[y].i16x2); break;
-                    CASE(Op::umin_q14x2): r[d].u16x2 = min(r[x].u16x2, r[y].u16x2); break;
+                    CASE(Op::eq_q14): r[d].i16 = r[x].i16 == r[y].i16; break;
+                    CASE(Op::gt_q14): r[d].i16 = r[x].i16 >  r[y].i16; break;
+
+                    CASE(Op::min_q14): r[d].i16 = min(r[x].i16, r[y].i16); break;
+                    CASE(Op::max_q14): r[d].i16 = max(r[x].i16, r[y].i16); break;
+
+                    CASE(Op::bit_and_q14):   r[d].i16 = r[x].i16 &  r[y].i16; break;
+                    CASE(Op::bit_or_q14 ):   r[d].i16 = r[x].i16 |  r[y].i16; break;
+                    CASE(Op::bit_xor_q14):   r[d].i16 = r[x].i16 ^  r[y].i16; break;
+                    CASE(Op::bit_clear_q14): r[d].i16 = r[x].i16 & ~r[y].i16; break;
+
+                    CASE(Op::select_q14):
+                        r[d].i16 = skvx::if_then_else(r[x].i16, r[y].i16, r[z].i16);
+                        break;

                    // Happily, Clang can see through this one and generates perfect code
                    // using vpavgw without any help from us!
-                    CASE(Op::uavg_q14x2):
-                        r[d].u16x2 = skvx::cast<uint16_t>( (skvx::cast<int>(r[x].u16x2) +
-                                                            skvx::cast<int>(r[y].u16x2) + 1)>>1 );
+                    CASE(Op::uavg_q14):
+                        r[d].u16 = skvx::cast<uint16_t>( (skvx::cast<int>(r[x].u16) +
+                                                          skvx::cast<int>(r[y].u16) + 1)>>1 );
                        break;
+
+                    CASE(Op::to_q14):   r[d].i16 = skvx::cast<int16_t>(r[x].i32); break;
+                    CASE(Op::from_q14): r[d].i32 = skvx::cast<int32_t>(r[x].i16); break;
+
                #undef CASE
                }
            }
--- a/tests/SkVMTest.cpp
+++ b/tests/SkVMTest.cpp
@@ -2372,135 +2372,136 @@ DEF_TEST(SkVM_args, r) {
    });
 }

-DEF_TEST(SkVM_Q14x2, r) {
-    // Some nice round Q14 test values, from 0.0 out to ±1.0 (0x4000, 0xc000) by 16ths (0x0400).
-    const uint32_t src[] = {
-        0x0000'0000, 0xfc00'0400, 0xf800'0800, 0xf400'0c00,
-        0xf000'1000, 0xec00'1400, 0xe800'1800, 0xe400'1c00,
-        0xe000'2000, 0xdc00'2400, 0xd800'2800, 0xd400'2c00,
-        0xd000'3000, 0xcc00'3400, 0xc800'3800, 0xc400'3c00, 0xc000'4000
+DEF_TEST(SkVM_Q14, r) {
+    // Some nice round Q14 test values from -1.0 (0xc000) to +1.0 (0x4000) by 16ths (0x0400).
+    const uint16_t src[] = {
+        0xc000, 0xc400, 0xc800, 0xcc00,  0xd000, 0xd400, 0xd800, 0xdc00,
+        0xe000, 0xe400, 0xe800, 0xec00,  0xf000, 0xf400, 0xf800, 0xfc00,
+        0x0000,
+        0x0400, 0x0800, 0x0c00, 0x1000,  0x1400, 0x1800, 0x1c00, 0x2000,
+        0x2400, 0x2800, 0x2c00, 0x3000,  0x3400, 0x3800, 0x3c00, 0x4000,
    };
-    for (int i = 0; false && i < 17; i++) {
-        // Just showing our work how we got those values.
-        int16_t x = i * (+1/16.0f) * 0x4000;
-        REPORTER_ASSERT(r, src[i] == (uint32_t)(x|-x<<16));
-    }

    // These test cases are essentially mechanically generated to get coverage...
    // I've spot checked here and there and things seem correct, but I wouldn't
    // be surprised to find that there were bugs.  Using nice round numbers to
    // avoid having to think about low-bit precision for now.
    struct {
-        skvm::Q14x2 (*fn)(skvm::Q14x2);
-        uint32_t expected[17];
+        skvm::Q14 (*fn)(skvm::Q14);
+        uint16_t expected[33];
    } cases[] = {
-        {[](skvm::Q14x2 x) { return x; },   // Just double checking the test harness works.
-         {0x00000000, 0xfc000400, 0xf8000800, 0xf4000c00,
-          0xf0001000, 0xec001400, 0xe8001800, 0xe4001c00,
-          0xe0002000, 0xdc002400, 0xd8002800, 0xd4002c00,
-          0xd0003000, 0xcc003400, 0xc8003800, 0xc4003c00, 0xc0004000}},
+        {[](skvm::Q14 x) { return x; },   // Just double checking the test harness works.
+         {0xc000, 0xc400, 0xc800, 0xcc00,  0xd000, 0xd400, 0xd800, 0xdc00,
+          0xe000, 0xe400, 0xe800, 0xec00,  0xf000, 0xf400, 0xf800, 0xfc00,
+          0x0000,
+          0x0400, 0x0800, 0x0c00, 0x1000,  0x1400, 0x1800, 0x1c00, 0x2000,
+          0x2400, 0x2800, 0x2c00, 0x3000,  0x3400, 0x3800, 0x3c00, 0x4000}},

-        {[](skvm::Q14x2 x) { return x*x; }, // square ±1/16 (0x0400) -> 1/256 (0x0040), etc.
-         {0x00000000, 0x00400040, 0x01000100, 0x02400240,
-          0x04000400, 0x06400640, 0x09000900, 0x0c400c40,
-          0x10001000, 0x14401440, 0x19001900, 0x1e401e40,
-          0x24002400, 0x2a402a40, 0x31003100, 0x38403840, 0x40004000}},
+        {[](skvm::Q14 x) { return x*x; }, // square ±1/16 (0x0400) -> 1/256 (0x0040), etc.
+         {0x4000, 0x3840, 0x3100, 0x2a40, 0x2400, 0x1e40, 0x1900, 0x1440,
+          0x1000, 0x0c40, 0x0900, 0x0640, 0x0400, 0x0240, 0x0100, 0x0040,
+          0x0000,
+          0x0040, 0x0100, 0x0240, 0x0400, 0x0640, 0x0900, 0x0c40, 0x1000,
+          0x1440, 0x1900, 0x1e40, 0x2400, 0x2a40, 0x3100, 0x3840, 0x4000}},

-        {[](skvm::Q14x2 x) { return -(x*-x); }, // square, version B
-         {0x00000000, 0x00400040, 0x01000100, 0x02400240,
-          0x04000400, 0x06400640, 0x09000900, 0x0c400c40,
-          0x10001000, 0x14401440, 0x19001900, 0x1e401e40,
-          0x24002400, 0x2a402a40, 0x31003100, 0x38403840, 0x40004000}},
+        {[](skvm::Q14 x) { return -(x*-x); }, // square, version B
+         {0x4000, 0x3840, 0x3100, 0x2a40, 0x2400, 0x1e40, 0x1900, 0x1440,
+          0x1000, 0x0c40, 0x0900, 0x0640, 0x0400, 0x0240, 0x0100, 0x0040,
+          0x0000,
+          0x0040, 0x0100, 0x0240, 0x0400, 0x0640, 0x0900, 0x0c40, 0x1000,
+          0x1440, 0x1900, 0x1e40, 0x2400, 0x2a40, 0x3100, 0x3840, 0x4000}},

-        {[](skvm::Q14x2 x) { return x>>1; },  // divide by 2
-         {0x00000000, 0xfe000200, 0xfc000400, 0xfa000600,
-          0xf8000800, 0xf6000a00, 0xf4000c00, 0xf2000e00,
-          0xf0001000, 0xee001200, 0xec001400, 0xea001600,
-          0xe8001800, 0xe6001a00, 0xe4001c00, 0xe2001e00, 0xe0002000}},
+        {[](skvm::Q14 x) { return x>>1; },  // divide by 2
+         {0xe000, 0xe200, 0xe400, 0xe600, 0xe800, 0xea00, 0xec00, 0xee00,
+          0xf000, 0xf200, 0xf400, 0xf600, 0xf800, 0xfa00, 0xfc00, 0xfe00,
+          0x0000,
+          0x0200, 0x0400, 0x0600, 0x0800, 0x0a00, 0x0c00, 0x0e00, 0x1000,
+          0x1200, 0x1400, 0x1600, 0x1800, 0x1a00, 0x1c00, 0x1e00, 0x2000}},

-        {[](skvm::Q14x2 x) { return shr(x,1); },  // logical shift by 1
-         {0x00000000, 0x7e000200, 0x7c000400, 0x7a000600,
-          0x78000800, 0x76000a00, 0x74000c00, 0x72000e00,
-          0x70001000, 0x6e001200, 0x6c001400, 0x6a001600,
-          0x68001800, 0x66001a00, 0x64001c00, 0x62001e00, 0x60002000}},
+        {[](skvm::Q14 x) { return shr(x,1); },  // logical shift by 1
+         {0x6000, 0x6200, 0x6400, 0x6600, 0x6800, 0x6a00, 0x6c00, 0x6e00,
+          0x7000, 0x7200, 0x7400, 0x7600, 0x7800, 0x7a00, 0x7c00, 0x7e00,
+          0x0000,
+          0x0200, 0x0400, 0x0600, 0x0800, 0x0a00, 0x0c00, 0x0e00, 0x1000,
+          0x1200, 0x1400, 0x1600, 0x1800, 0x1a00, 0x1c00, 0x1e00, 0x2000}},

-        {[](skvm::Q14x2 x) { return x - (x>>2); },  // 3/4 x, version A
-         {0x00000000, 0xfd000300, 0xfa000600, 0xf7000900,
-          0xf4000c00, 0xf1000f00, 0xee001200, 0xeb001500,
-          0xe8001800, 0xe5001b00, 0xe2001e00, 0xdf002100,
-          0xdc002400, 0xd9002700, 0xd6002a00, 0xd3002d00, 0xd0003000}},
+        {[](skvm::Q14 x) { return x - (x>>2); },  // 3/4 x, version A
+         {0xd000, 0xd300, 0xd600, 0xd900, 0xdc00, 0xdf00, 0xe200, 0xe500,
+          0xe800, 0xeb00, 0xee00, 0xf100, 0xf400, 0xf700, 0xfa00, 0xfd00,
+          0x0000,
+          0x0300, 0x0600, 0x0900, 0x0c00, 0x0f00, 0x1200, 0x1500, 0x1800,
+          0x1b00, 0x1e00, 0x2100, 0x2400, 0x2700, 0x2a00, 0x2d00, 0x3000}},

-        {[](skvm::Q14x2 x) { return (x>>1) + (x>>2); },  // 3/4 x, version B
-         {0x00000000, 0xfd000300, 0xfa000600, 0xf7000900,
-          0xf4000c00, 0xf1000f00, 0xee001200, 0xeb001500,
-          0xe8001800, 0xe5001b00, 0xe2001e00, 0xdf002100,
-          0xdc002400, 0xd9002700, 0xd6002a00, 0xd3002d00, 0xd0003000}},
+        {[](skvm::Q14 x) { return (x>>1) + (x>>2); },  // 3/4 x, version B
+         {0xd000, 0xd300, 0xd600, 0xd900, 0xdc00, 0xdf00, 0xe200, 0xe500,
+          0xe800, 0xeb00, 0xee00, 0xf100, 0xf400, 0xf700, 0xfa00, 0xfd00,
+          0x0000,
+          0x0300, 0x0600, 0x0900, 0x0c00, 0x0f00, 0x1200, 0x1500, 0x1800,
+          0x1b00, 0x1e00, 0x2100, 0x2400, 0x2700, 0x2a00, 0x2d00, 0x3000}},

-        {[](skvm::Q14x2 x) { return ((x>>2) + (x>>3))<<1; },  // 3/4 x, version C
-         {0x00000000, 0xfd000300, 0xfa000600, 0xf7000900,
-          0xf4000c00, 0xf1000f00, 0xee001200, 0xeb001500,
-          0xe8001800, 0xe5001b00, 0xe2001e00, 0xdf002100,
-          0xdc002400, 0xd9002700, 0xd6002a00, 0xd3002d00, 0xd0003000}},
+        {[](skvm::Q14 x) { return ((x>>2) + (x>>3))<<1; },  // 3/4 x, version C
+         {0xd000, 0xd300, 0xd600, 0xd900, 0xdc00, 0xdf00, 0xe200, 0xe500,
+          0xe800, 0xeb00, 0xee00, 0xf100, 0xf400, 0xf700, 0xfa00, 0xfd00,
+          0x0000,
+          0x0300, 0x0600, 0x0900, 0x0c00, 0x0f00, 0x1200, 0x1500, 0x1800,
+          0x1b00, 0x1e00, 0x2100, 0x2400, 0x2700, 0x2a00, 0x2d00, 0x3000}},

        // TODO: I'm not sure if this one is working correctly or not.  Should only work for >=0?
-        {[](skvm::Q14x2 x) { return unsigned_avg(x, x>>1); },  // 3/4 x, version D
-         {0x00000000, 0xfd000300, 0xfa000600, 0xf7000900,
-          0xf4000c00, 0xf1000f00, 0xee001200, 0xeb001500,
-          0xe8001800, 0xe5001b00, 0xe2001e00, 0xdf002100,
-          0xdc002400, 0xd9002700, 0xd6002a00, 0xd3002d00, 0xd0003000}},
+        {[](skvm::Q14 x) { return unsigned_avg(x, x>>1); },  // 3/4 x, version D
+         {0xd000, 0xd300, 0xd600, 0xd900, 0xdc00, 0xdf00, 0xe200, 0xe500,
+          0xe800, 0xeb00, 0xee00, 0xf100, 0xf400, 0xf700, 0xfa00, 0xfd00,
+          0x0000,
+          0x0300, 0x0600, 0x0900, 0x0c00, 0x0f00, 0x1200, 0x1500, 0x1800,
+          0x1b00, 0x1e00, 0x2100, 0x2400, 0x2700, 0x2a00, 0x2d00, 0x3000}},

-        {[](skvm::Q14x2 x) { return min(x, +0.5f); },  // clamp down to 0x2000, version A
-         {0x00000000, 0xfc000400, 0xf8000800, 0xf4000c00,
-          0xf0001000, 0xec001400, 0xe8001800, 0xe4001c00,
-          0xe0002000, 0xdc002000, 0xd8002000, 0xd4002000,
-          0xd0002000, 0xcc002000, 0xc8002000, 0xc4002000, 0xc0002000}},
+        {[](skvm::Q14 x) { return min(x, +0.5f); },  // clamp down to 0x2000, version A
+         {0xc000, 0xc400, 0xc800, 0xcc00, 0xd000, 0xd400, 0xd800, 0xdc00,
+          0xe000, 0xe400, 0xe800, 0xec00, 0xf000, 0xf400, 0xf800, 0xfc00,
+          0x0000,
+          0x0400, 0x0800, 0x0c00, 0x1000, 0x1400, 0x1800, 0x1c00, 0x2000,
+          0x2000, 0x2000, 0x2000, 0x2000, 0x2000, 0x2000, 0x2000, 0x2000}},

-        {[](skvm::Q14x2 x) { return select(x < +0.5f, x, +0.5f); },  // clamp down to 0x2000, vB
-         {0x00000000, 0xfc000400, 0xf8000800, 0xf4000c00,
-          0xf0001000, 0xec001400, 0xe8001800, 0xe4001c00,
-          0xe0002000, 0xdc002000, 0xd8002000, 0xd4002000,
-          0xd0002000, 0xcc002000, 0xc8002000, 0xc4002000, 0xc0002000}},
+        {[](skvm::Q14 x) { return select(x < +0.5f, x, +0.5f); },  // clamp down to 0x2000, vB
+         {0xc000, 0xc400, 0xc800, 0xcc00, 0xd000, 0xd400, 0xd800, 0xdc00,
+          0xe000, 0xe400, 0xe800, 0xec00, 0xf000, 0xf400, 0xf800, 0xfc00,
+          0x0000,
+          0x0400, 0x0800, 0x0c00, 0x1000, 0x1400, 0x1800, 0x1c00, 0x2000,
+          0x2000, 0x2000, 0x2000, 0x2000, 0x2000, 0x2000, 0x2000, 0x2000}},

-        {[](skvm::Q14x2 x) { return select(x == 1.0f, 0.5f, x); },
-         {0x00000000, 0xfc000400, 0xf8000800, 0xf4000c00,
-          0xf0001000, 0xec001400, 0xe8001800, 0xe4001c00,
-          0xe0002000, 0xdc002400, 0xd8002800, 0xd4002c00,
-          0xd0003000, 0xcc003400, 0xc8003800, 0xc4003c00, 0xc0002000}},
+        {[](skvm::Q14 x) { return select(x == 1.0f, 0.5f, x); },
+         {0xc000, 0xc400, 0xc800, 0xcc00,  0xd000, 0xd400, 0xd800, 0xdc00,
+          0xe000, 0xe400, 0xe800, 0xec00,  0xf000, 0xf400, 0xf800, 0xfc00,
+          0x0000,
+          0x0400, 0x0800, 0x0c00, 0x1000,  0x1400, 0x1800, 0x1c00, 0x2000,
+          0x2400, 0x2800, 0x2c00, 0x3000,  0x3400, 0x3800, 0x3c00, 0x2000}},

-        {[](skvm::Q14x2 x) { return max(x, -0.5f); },  // clamp up to 0xe000
-         {0x00000000, 0xfc000400, 0xf8000800, 0xf4000c00,
-          0xf0001000, 0xec001400, 0xe8001800, 0xe4001c00,
-          0xe0002000, 0xe0002400, 0xe0002800, 0xe0002c00,
-          0xe0003000, 0xe0003400, 0xe0003800, 0xe0003c00, 0xe0004000}},
-
-        // TODO: I had higher hopes for this op until I realized it clamps negative values
-        // to the upper limit, not zero.  Duh.  Might end up removing this.
-        {[](skvm::Q14x2 x) { return unsigned_min(x, 0.5f); },  // clamp around to [0,0x2000]
-         {0x00000000, 0x20000400, 0x20000800, 0x20000c00,
-          0x20001000, 0x20001400, 0x20001800, 0x20001c00,
-          0x20002000, 0x20002000, 0x20002000, 0x20002000,
-          0x20002000, 0x20002000, 0x20002000, 0x20002000, 0x20002000}},
+        {[](skvm::Q14 x) { return max(x, -0.5f); },  // clamp up to 0xe000
+         {0xe000, 0xe000, 0xe000, 0xe000,  0xe000, 0xe000, 0xe000, 0xe000,
+          0xe000, 0xe400, 0xe800, 0xec00,  0xf000, 0xf400, 0xf800, 0xfc00,
+          0x0000,
+          0x0400, 0x0800, 0x0c00, 0x1000,  0x1400, 0x1800, 0x1c00, 0x2000,
+          0x2400, 0x2800, 0x2c00, 0x3000,  0x3400, 0x3800, 0x3c00, 0x4000}},
    };

    for (const auto& test : cases) {
        skvm::Builder b;
        {
-            skvm::Arg dst = b.varying<uint32_t>(),
-                      src = b.varying<uint32_t>();
+            skvm::Arg dst = b.varying<uint16_t>(),
+                      src = b.varying<uint16_t>();

-            skvm::Q14x2 x = as_Q14x2(b.load32(src));
-            store32(dst, as_I32(test.fn(x)));
+            skvm::Q14 x = to_q14(b.load16(src));
+            store16(dst, to_i32(test.fn(x)));
        }

        test_jit_and_interpreter(b.done(), [&](const skvm::Program& program){
-            uint32_t dst[17];
-            program.eval(17, dst,src);
-            for (int i = 0; i < 17; i++) {
-                if (test.expected[16]) {
+            uint16_t dst[33];
+            program.eval(33, dst,src);
+            for (int i = 0; i < 33; i++) {
+                if (test.expected[32]) {
                    REPORTER_ASSERT(r, test.expected[i] == dst[i]);
                } else {
-                    if (i == 0 || i == 4 || i == 8 || i == 12) SkDebugf("\n");
-                    SkDebugf("0x%08x, ", dst[i]);
+                    if (i == 0 || i == 8 || i == 16 || i == 17 || i == 25) SkDebugf("\n");
+                    SkDebugf("0x%04x, ", dst[i]);
                }
            }
        });