Metal doesn't support broadcasting or shuffling boolean values, but we
can work around that by casting them to `ushort`, then casting the
result back to `bool`. I used `ushort` instead of `uint` because 16-bit
values give better throughput on Apple GPUs.
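
As a rough host-side illustration of the cast-through-`ushort` idea (not
the shader code itself), here is a small C++ model. `model_broadcast` is
a hypothetical stand-in for a broadcast that rejects `bool`, and the
array stands in for the per-invocation values of a subgroup.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical model of a broadcast that, like the Metal one described
// above, does not accept bool operands. One array slot per invocation.
template <typename T, std::size_t N>
T model_broadcast(const std::array<T, N>& values, std::size_t lane) {
    static_assert(!std::is_same<T, bool>::value, "bool is not supported");
    assert(lane < N);
    return values[lane];
}

// Workaround: widen each bool to a 16-bit integer, broadcast that, then
// narrow the result back to bool.
template <std::size_t N>
bool broadcast_bool(const std::array<bool, N>& values, std::size_t lane) {
    std::array<std::uint16_t, N> widened{};
    for (std::size_t i = 0; i < N; ++i) {
        widened[i] = static_cast<std::uint16_t>(values[i]);
    }
    return static_cast<bool>(model_broadcast(widened, lane));
}
```
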
Only the *n* least significant bits are meaningful, where *n* is the
subgroup size.
The Vulkan CTS actually checks this.
The `FindLSB` tests weren't actually failing, but I masked that anyway,
in case there's some corner case the CTS is missing.
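
To make the masking concrete, here is a minimal C++ sketch that models a
ballot as four 32-bit words and clears every bit at or above the
subgroup size. The names `Ballot`, `low_bits`, `mask_ballot`, and
`find_lsb` are made up for this illustration and are not taken from the
actual change.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Assumed model: a 128-bit ballot stored as four 32-bit words.
using Ballot = std::array<std::uint32_t, 4>;

// Mask with the low `count` bits set, for count in [0, 32].
constexpr std::uint32_t low_bits(unsigned count) {
    return count >= 32 ? 0xFFFFFFFFu : ((1u << count) - 1u);
}

// Keep only the bits that belong to invocations 0 .. subgroup_size - 1.
Ballot mask_ballot(Ballot ballot, unsigned subgroup_size) {
    assert(subgroup_size <= 128);
    for (unsigned word = 0; word < 4; ++word) {
        const unsigned first_bit = word * 32;
        const unsigned live = subgroup_size > first_bit
                                  ? std::min(subgroup_size - first_bit, 32u)
                                  : 0u;
        ballot[word] &= low_bits(live);
    }
    return ballot;
}

// Index of the lowest set bit, or -1 if the ballot is empty.
int find_lsb(const Ballot& ballot) {
    for (int bit = 0; bit < 128; ++bit) {
        if ((ballot[bit / 32] >> (bit % 32)) & 1u) return bit;
    }
    return -1;
}
```

With a subgroup size of 32, a stray bit 33 is dropped by `mask_ballot`,
so `find_lsb` reports an empty ballot instead of index 33.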
`SubgroupEqMask` had a fencepost error that gave wrong values for
invocation ID 32.
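
For reference, the expected behaviour at the word boundary can be
sketched in C++ as follows. This models the mask as four 32-bit words
and is only an illustration of the intended result; `Ballot` and
`eq_mask` are hypothetical names, and this is not the code from the fix.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed model: a 128-bit mask stored as four 32-bit words.
using Ballot = std::array<std::uint32_t, 4>;

// Mask with exactly one bit set: the bit for invocation `id`.
// IDs 0..31 land in word 0, IDs 32..63 in word 1, and so on, so ID 32
// must produce bit 0 of word 1 rather than anything in word 0.
Ballot eq_mask(unsigned id) {
    assert(id < 128);
    Ballot mask{};  // all words start at zero
    mask[id / 32] = 1u << (id % 32);
    return mask;
}
```
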
For `SubgroupGeMask` and `SubgroupGtMask`, I forgot to shift the result
of `extract_bits()` up so that the mask sits in the correct position.
Using `insert_bits()` instead should fold the extraction and the shift
into one operation.
`SubgroupLtMask` and `SubgroupLeMask` were already correct.
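
The difference between the unshifted and the corrected `Ge`/`Gt` masks
can be sketched with host-side C++ models of `extract_bits()` and
`insert_bits()`. The bitfield semantics below are my understanding of
the Metal built-ins and should be treated as an assumption. The sketch
is also simplified to a single 32-bit word, so it only covers subgroup
sizes up to 32, and `ge_mask`, `gt_mask`, and `ge_mask_unshifted` are
illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low `count` bits set, for count in [0, 32].
constexpr std::uint32_t low_bits(unsigned count) {
    return count >= 32 ? 0xFFFFFFFFu : ((1u << count) - 1u);
}

// Model: return `bits` bits of x starting at `offset`, moved down to bit 0.
std::uint32_t extract_bits(std::uint32_t x, unsigned offset, unsigned bits) {
    assert(offset + bits <= 32);
    if (bits == 0) return 0u;
    return (x >> offset) & low_bits(bits);
}

// Model: replace `bits` bits of base at `offset` with the low bits of insert.
std::uint32_t insert_bits(std::uint32_t base, std::uint32_t insert,
                          unsigned offset, unsigned bits) {
    assert(offset + bits <= 32);
    if (bits == 0) return base;
    const std::uint32_t field = low_bits(bits) << offset;
    return (base & ~field) | ((insert << offset) & field);
}

// The kind of mistake described above: right width, but left at bit 0.
std::uint32_t ge_mask_unshifted(unsigned id, unsigned size) {
    return extract_bits(0xFFFFFFFFu, id, size - id);
}

// Bits id .. size-1 set, built in one step.
std::uint32_t ge_mask(unsigned id, unsigned size) {
    assert(id < size && size <= 32);
    return insert_bits(0u, 0xFFFFFFFFu, id, size - id);
}

// Bits id+1 .. size-1 set; empty for the last invocation.
std::uint32_t gt_mask(unsigned id, unsigned size) {
    assert(id < size && size <= 32);
    return insert_bits(0u, 0xFFFFFFFFu, id + 1, size - id - 1);
}
```

For invocation 4 in a 32-wide subgroup, the unshifted variant yields
`0x0FFFFFFF`, while `ge_mask` yields `0xFFFFFFF0`.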