fae63e477c
This adds a warmup phase that lets each instruction do any setup it needs: adding lookup entries for splat and bytes and, on aarch64, hoisting the mask to a register when we can.

Oddly, this measures as a ~3x slowdown on the phone I'm testing with, an international Galaxy S9 with a Samsung Mongoose 3 processor. I have to imagine this somehow makes the processor think there's a loop-carried dependency when there is none. Anyway, we already know that's a pretty crazy CPU (it reports FP16 compute support but can't actually do it), and this does deliver a speedup on the Pixel 2's Kryo 280 / Cortex-A73, so I think I'll just swap back to testing with the Pixel 2 and forget about the S9.

Here's a before/after code listing with a hoisted tbl mask. In the before case the mask is loaded inside the loop with `ldr q3, #152`; in the after case that load becomes `ldr q0, #168` outside the loop.

llvm-mca says this should cut one cycle per loop iteration, and with optimal out-of-order execution the loop cost would drop from ~8.7 cycles to ~8.3. In practice, it looks like about a 15% speedup.
before:

    ldr     q0, #188
    ldr     q1, #200
    cmp     x0, #4                  // =4
    b.lt    #76
    ldr     q2, [x1]
    ldr     q3, #152
    tbl     v3.16b, { v2.16b }, v3.16b
    sub     v3.8h, v0.8h, v3.8h
    ldr     q4, [x2]
    and     v5.16b, v4.16b, v1.16b
    ushr    v4.8h, v4.8h, #8
    mul     v5.8h, v5.8h, v3.8h
    ushr    v5.8h, v5.8h, #8
    mul     v3.8h, v4.8h, v3.8h
    bic     v3.16b, v3.16b, v1.16b
    orr     v3.16b, v5.16b, v3.16b
    add     v2.4s, v2.4s, v3.4s
    str     q2, [x2]
    add     x1, x1, #16             // =16
    add     x2, x2, #16             // =16
    sub     x0, x0, #4              // =4
    b.al    #-76
    cmp     x0, #1                  // =1
    b.lt    #76
    ldr     s2, [x1]
    ldr     q3, #72
    tbl     v3.16b, { v2.16b }, v3.16b
    sub     v3.8h, v0.8h, v3.8h
    ldr     s4, [x2]
    and     v5.16b, v4.16b, v1.16b
    ushr    v4.8h, v4.8h, #8
    mul     v5.8h, v5.8h, v3.8h
    ushr    v5.8h, v5.8h, #8
    mul     v3.8h, v4.8h, v3.8h
    bic     v3.16b, v3.16b, v1.16b
    orr     v3.16b, v5.16b, v3.16b
    add     v2.4s, v2.4s, v3.4s
    str     s2, [x2]
    add     x1, x1, #4              // =4
    add     x2, x2, #4              // =4
    sub     x0, x0, #1              // =1
    b.al    #-76
    ret

after:

    ldr     q0, #168
    ldr     q1, #180
    ldr     q2, #192
    cmp     x0, #4                  // =4
    b.lt    #72
    ldr     q3, [x1]
    tbl     v4.16b, { v3.16b }, v0.16b
    sub     v4.8h, v1.8h, v4.8h
    ldr     q5, [x2]
    and     v6.16b, v5.16b, v2.16b
    ushr    v5.8h, v5.8h, #8
    mul     v6.8h, v6.8h, v4.8h
    ushr    v6.8h, v6.8h, #8
    mul     v4.8h, v5.8h, v4.8h
    bic     v4.16b, v4.16b, v2.16b
    orr     v4.16b, v6.16b, v4.16b
    add     v3.4s, v3.4s, v4.4s
    str     q3, [x2]
    add     x1, x1, #16             // =16
    add     x2, x2, #16             // =16
    sub     x0, x0, #4              // =4
    b.al    #-72
    cmp     x0, #1                  // =1
    b.lt    #72
    ldr     s3, [x1]
    tbl     v4.16b, { v3.16b }, v0.16b
    sub     v4.8h, v1.8h, v4.8h
    ldr     s5, [x2]
    and     v6.16b, v5.16b, v2.16b
    ushr    v5.8h, v5.8h, #8
    mul     v6.8h, v6.8h, v4.8h
    ushr    v6.8h, v6.8h, #8
    mul     v4.8h, v5.8h, v4.8h
    bic     v4.16b, v4.16b, v2.16b
    orr     v4.16b, v6.16b, v4.16b
    add     v3.4s, v3.4s, v4.4s
    str     s3, [x2]
    add     x1, x1, #4              // =4
    add     x2, x2, #4              // =4
    sub     x0, x0, #1              // =1
    b.al    #-72
    ret

Change-Id: I352a98d3ac2ad84c338330ef4cfae0292a0b32da
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/229064
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
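For readers who don't want to decode the NEON by hand, here is a rough scalar C++ reading of what the loop body in the listings above appears to compute per 32-bit pixel. It is only a sketch under assumptions: the literal-pool constants are not shown in the listings, so the values 256 and 0x00ff00ff, and the guess that the tbl mask selects the top byte of each pixel, are inferred from how the registers are used. The names `blend_pixel` and `blend_row` are hypothetical and are not taken from the Skia source.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of one pixel of the vector loop body.
// Assumes the selected byte is the top byte of the pixel and that the
// two constants are 256 and 0x00ff00ff (neither is visible in the listing).
inline uint32_t blend_pixel(uint32_t src, uint32_t dst,
                            uint32_t k256, uint32_t rb_mask) {
    uint32_t a   = src >> 24;                // tbl: pick out one byte (assumed top)
    uint32_t inv = k256 - a;                 // sub: 256 - a
    uint32_t rb  = dst & rb_mask;            // and: even bytes of dst
    uint32_t ga  = (dst >> 8) & rb_mask;     // ushr #8: odd bytes of dst
    rb = ((rb * inv) >> 8) & rb_mask;        // mul, ushr #8
    ga = (ga * inv) & ~rb_mask;              // mul, bic: keep the high bytes
    return src + (rb | ga);                  // orr, add
}

void blend_row(const uint32_t* src, uint32_t* dst, size_t n) {
    // Loop-invariant constants are set up once before the loop,
    // mirroring the shape of the "after" listing.
    const uint32_t k256 = 256, rb_mask = 0x00ff00ffu;
    for (size_t i = 0; i < n; i++) {
        dst[i] = blend_pixel(src[i], dst[i], k256, rb_mask);
    }
}
```

An optimizing C++ compiler would normally hoist constants like these on its own, so the scalar version only shows the arithmetic and the before/after loop shape, not the performance effect. In the listings the loads are emitted exactly as written, which is why moving the tbl mask load out of the loop takes an explicit step.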
animations
bench
bin
build_overrides
dm
docker
docs/examples
example
experimental
fuzz
gm
gn
include
infra
modules
platform_tools
resources
samplecode
site
specs
src
tests
third_party
tools
.clang-format
.clang-tidy
.gitignore
.gn
AUTHORS
BUILD.gn
codereview.settings
CONTRIBUTING
CQ_COMMITTERS
DEPS
go.mod
go.sum
LICENSE
OWNERS
PRESUBMIT.py
public.bzl
README
README.chromium
whitespace.txt
Skia is a complete 2D graphics library for drawing text, geometries, and images. See full details and build instructions at https://skia.org.