These are neat but mostly just a distraction for now.
I've left all the assembly in place and unit tested
to make putting these back easy when we want to.
Change-Id: Id2bd05eca363baf9c4e31125ee79e722ded54cb7
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/283307
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
N=15 and N=63 make for nice, even-looking profiles
on ARM and x86 respectively, with N=15 running 3
body loops and 3 tail loops on ARM, N=63 running
7 body loops and 7 tail loops on x86.
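
The counts above work out if a body loop handles 4 values on ARM and 8 on
x86, with the tail loop handling one value at a time. Those lane widths are
an assumption inferred from 15 = 3*4 + 3 and 63 = 7*8 + 7, and the names
below are hypothetical, so treat this as a sketch of the arithmetic only:

```cpp
#include <cassert>

// Hypothetical helper: how many full-width body iterations and how many
// one-at-a-time tail iterations a loop over n values would run.
struct LoopCounts {
    int body;  // iterations that process `lanes` values each
    int tail;  // leftover iterations that process one value each
};

inline LoopCounts count_loops(int n, int lanes) {
    assert(n >= 0 && lanes > 0);
    return { n / lanes, n % lanes };
}
```

With these assumed widths, count_loops(15, 4) gives 3 body and 3 tail loops,
and count_loops(63, 8) gives 7 body and 7 tail loops.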
Change-Id: Ie7616bd99c949328bbb7d7048fc6f468ff1e3ad2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/227220
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Looks like ~50ns overhead for RP vs ~14,000ns for SkVM.
Change-Id: I85ef73d3387657b14615fcfa5cfd9df5c2325343
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/223302
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
This new bench lets us measure the overhead of program building,
optimization, and JITting. Surprisingly, at head the optimization in
Builder::done() takes longer than the JIT.
The new bench clocks in around 40µs on my laptop at head,
then 32µs after switching val_to_reg to be an std::vector,
then 27µs after switching deaths to be an std::vector too,
then 22µs after switching fIndex to be an SkTHashMap,
then 20µs after calling program.reserve(fProgram.size()),
then 19µs after switching JIT data maps to SkTHashMap too.
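
As a rough illustration of why a std::vector can beat a map for something
like deaths: if values are identified by dense instruction indices (an
assumption here), a flat vector indexed by value ID replaces every hash or
tree lookup with a direct index. The Instruction type and compute_deaths()
below are hypothetical stand-ins, not the real SkVM code:

```cpp
#include <initializer_list>
#include <vector>

// Hypothetical, simplified instruction: up to two argument value IDs,
// where -1 means "no argument". A value's ID is its instruction index.
struct Instruction {
    int x;
    int y;
};

// For each value, record the index of the last instruction that uses it
// (-1 if it is never used). One allocation up front, then O(1) indexing.
inline std::vector<int> compute_deaths(const std::vector<Instruction>& program) {
    std::vector<int> deaths(program.size(), -1);
    for (int i = 0; i < static_cast<int>(program.size()); i++) {
        for (int arg : {program[i].x, program[i].y}) {
            if (arg >= 0) {
                deaths[arg] = i;  // later uses overwrite earlier ones
            }
        }
    }
    return deaths;
}
```

Sizing the vector once from program.size() plays the same role as the
reserve() call mentioned above: no reallocation while filling it in.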
I tried swapping some std::vector for SkTDArray, which gave no benefit
and was actually slightly slower. So I think this is roughly all the
low-hanging fruit, with time now split about equally among
Builder::done(), JITting in Program::eval(), and the original calls to
Builder themselves.
Also disable perf dumps on Mac. No real value there until I can dump a
dylib, and it's just one more thing I have to remember to disable before
running this sort of benchmark.
Change-Id: I1c6e58ed00ac94ad622c7d740712634f60787102
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/222984
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
This moves the responsibility for allocating executable code out of
Assembler. The pages Xbyak uses are obviously executable, so this is
redundant right now, but it'll let us switch to something simple like
std::vector<uint8_t> as we continue to cut out Xbyak.
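
One way the split could look once the assembler only writes bytes: assemble
into a std::vector<uint8_t>, then have the caller copy those bytes into
memory it maps and marks executable. This is a POSIX-only sketch using
mmap() and mprotect(); ExecutableBuffer, make_executable() and release() are
hypothetical names, not the actual Program or Assembler API, and some
platforms (for example macOS on Apple silicon) need extra flags not shown:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include <sys/mman.h>

// Hypothetical owner of a block of executable memory.
struct ExecutableBuffer {
    void*  ptr  = nullptr;
    size_t size = 0;
};

// Copy already-assembled machine code into fresh pages, then flip them
// from writable to executable. Returns an empty buffer on any failure.
inline ExecutableBuffer make_executable(const std::vector<uint8_t>& code) {
    ExecutableBuffer buf;
    if (code.empty()) {
        return buf;
    }
    void* p = mmap(nullptr, code.size(), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return buf;
    }
    std::memcpy(p, code.data(), code.size());
    if (mprotect(p, code.size(), PROT_READ | PROT_EXEC) != 0) {
        munmap(p, code.size());
        return buf;
    }
    buf.ptr  = p;
    buf.size = code.size();
    return buf;
}

// Unmap the pages and reset the buffer.
inline void release(ExecutableBuffer& buf) {
    if (buf.ptr) {
        munmap(buf.ptr, buf.size);
    }
    buf = ExecutableBuffer{};
}
```

Keeping the pages writable only until the copy is done means they are never
writable and executable at the same time.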
Also make the way Program holds its cached JIT program slightly less of a mess.
Change-Id: I38d6f01006da1da60f4aed675e9ddf97de9aec52
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/222575
Auto-Submit: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
- 32x8 i32 add,sub,mul
- add I32_Naive bench/test builder to get better i32 mul coverage
- minor refactoring all over
Change-Id: I13cc19ff37a2da0bcff289ba51baac08f456d6c5
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/222485
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Eliminate the duplicate functionality,
and add better testing for the bench builders.
Change-Id: If20e52107738903f854aec431416e573d7a7d640
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/218041
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Things were running suspiciously well...
_I32 had a typo that cut out 3/4 of its multiplies...
_I32_SWAR was missing a mask operation needed to drop
the junk low byte of the high half after the multiply.
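
To show what that mask does, here is a minimal SWAR sketch, assuming two
8-bit channels packed as 0x00RR00BB and scaled by an 8-bit factor with a
plain (c*a)>>8. The function names are hypothetical and the real bench
builder's math may differ (rounding, for instance):

```cpp
#include <cstdint>

// Scale both packed channels at once. Each 16-bit half holds a full
// 16-bit product after the multiply; 255*255 < 65536, so the low half
// never carries into the high half.
inline uint32_t scale_two_channels_swar(uint32_t rb, uint32_t a) {
    uint32_t prod = rb * a;
    // After >>8 the low byte of the high half's product lands in bits
    // 8..15, on top of the low half. The mask drops that junk byte.
    return (prod >> 8) & 0x00ff00ffu;
}

// One channel at a time, for comparison.
inline uint32_t scale_two_channels_reference(uint32_t rb, uint32_t a) {
    uint32_t hi = (((rb >> 16) & 0xffu) * a) >> 8;
    uint32_t lo = ((rb & 0xffu) * a) >> 8;
    return (hi << 16) | lo;
}
```

For rb = 0x00ff00ff and a = 0xff the product is 0xfe01fe01; shifting gives
0x00fe01fe, and only after masking does it become the expected 0x00fe00fe.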
The bench times now make a bit more sense and are in line
with how much work we're actually doing: F32's the slowest,
I32 a little faster, and I32_SWAR fastest:
curr/maxrss  loops  min      median   mean     max      stddev  samples     config        bench
35/36 MB     58     2.03ns   2.04ns   2.04ns   2.04ns   0%      ▂▂▂▂▁▁█▁▂▁  nonrendering  SkVM_4096_I32_SWAR
35/36 MB     42     3.44ns   3.48ns   3.49ns   3.59ns   1%      ▂▆▅█▃▃▁▂▂▄  nonrendering  SkVM_4096_I32
35/36 MB     30     4.9ns    5.21ns   5.11ns   5.33ns   3%      ▆▇█▆▆▁▂▁▁▅  nonrendering  SkVM_4096_F32
35/36 MB     203    0.696ns  0.697ns  0.705ns  0.758ns  3%      █▂▂▁▁▁▁▁▁▂  nonrendering  SkVM_4096_RP
35/36 MB     942    0.188ns  0.188ns  0.188ns  0.189ns  0%      ▂▁▂▁▃█▂▁▁▁  nonrendering  SkVM_4096_Opts
Change-Id: I2850dc3f9df1828f03499eb278b8231f48eaae63
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/217982
Commit-Queue: Mike Klein <mtklein@google.com>
Commit-Queue: Brian Osman <brianosman@google.com>
Auto-Submit: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
With all the thinking around a stack-based interpreter,
I figured I'd sketch out some ideas for a register VM too.
I kind of have the hunch that this is the direction that
will actually let us replace large amounts of Skia's CPU
backend with an efficient interpreter or JIT.
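
For a feel of the register-VM shape, here is a toy sketch in which every
instruction names its inputs by register index and writes its result to its
own register, so there is no operand stack to push and pop. The Op set,
Inst layout and run() are hypothetical illustrations, not the design this
change sketches:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op { Splat, Add, Mul };

// Hypothetical instruction: x and y are the indices of earlier
// instructions whose results are the inputs; imm is used by Splat only.
struct Inst {
    Op      op;
    int     x;
    int     y;
    int32_t imm;
};

// Interpret the program, giving instruction i the register reg[i].
// Returns the last instruction's result, or 0 for an empty program.
inline int32_t run(const std::vector<Inst>& program) {
    std::vector<int32_t> reg(program.size());
    for (size_t i = 0; i < program.size(); i++) {
        const Inst& inst = program[i];
        switch (inst.op) {
            case Op::Splat: reg[i] = inst.imm;                  break;
            case Op::Add:   reg[i] = reg[inst.x] + reg[inst.y]; break;
            case Op::Mul:   reg[i] = reg[inst.x] * reg[inst.y]; break;
        }
    }
    return program.empty() ? 0 : reg.back();
}
```

A program like {Splat 2, Splat 3, Add r0 r1, Mul r2 r2} computes (2+3)*(2+3)
with each operand read directly from a named register.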
Change-Id: Ia2b5ba4a3fc27556f5b6ba95cd1ace46d3217403
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/216665
Reviewed-by: Brian Osman <brianosman@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>