If we track how many pending batches a kickoff()
has in flight, we can endStep() it properly when
that number hits zero.
This double sync.WaitGroup trick is pretty neat.
Now we're thinking with portals...
Added some comments to prevent myself falling in
the trap of assuming we'll have runtime.NumCPU()
batches... rounding the batch size up means we'll
sometimes have fewer.
Change-Id: If50615c204485862462c240b9bbdfd4ddbad43b2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366142
Reviewed-by: Eric Boren <borenet@google.com>
It's nice to see it in the task log, and to be able to see
it's not there when we're not working with Gold (*SAN) bots.
(One trybot of each kind here.)
Change-Id: Ibb4aa20badf95ef603f3890e1c8248cad675507f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366143
Reviewed-by: Eric Boren <borenet@google.com>
Group batches from a single kickoff() into another mid-level step:
Top-Level
kickoff --some flags
batch sources...
batch (exec)
batch other sources ...
batch (exec)
rerun (exec)
rerun (exec)
batch yet other sources ...
batch (exec)
rerun (exec)
kickoff --some other flags
...
Big question: is it okay for the kickoff steps to td.EndStep() while its
kids are still running (or haven't even started) on other goroutines?
Change-Id: I77ad2274e35cea0151be0cca6c690eafc4f8983e
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366140
Reviewed-by: Eric Boren <borenet@google.com>
There are bots (*SAN) that won't ever be uploading to Gold,
so *bot != "" doesn't really describe the right condition.
We could do this logic inside fm_driver.go based on *bot,
but I kind of want the flexibility to do things like upload
local ad-hoc runs or sanitizer runs if we want using --gold.
Change-Id: Id972d8b0097616c5b2802bc99c2718fdd1568fe3
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366139
Reviewed-by: Mike Klein <mtklein@google.com>
Why have other bots when we can do it all here?
Change-Id: I6a3f3c2ed5d19a3b8ecf59f44cc0d2f6076bba7f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366138
Reviewed-by: Mike Klein <mtklein@google.com>
There's no need to tick wg up and down when running reruns, and as
written it's possible for the overall fm_driver program to exit before
one last call to endStep() has happened. Simply calling wg.Done() once
per item dequeued outside worker() fixes both.
Change-Id: I0fb0acc5a3f2c624dfc14f875fa094db6dd40838
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366137
Reviewed-by: Mike Klein <mtklein@google.com>
If a batch fails, we've got to rerun everything (or at least from the
failure on), but when it's merely unknown hashes, no need to rerun
what's produced hashes we know already.
Small tweak to FM to keep all the printed source names exactly what's
passed in, keeping the whole path for skp/svg/image files. This means
zero bookkeeping needed to know what to rerun when parsing that output.
Change-Id: I1e7ed3ee51158b68a6bdd3152560f3a282109576
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/365818
Reviewed-by: Mike Klein <mtklein@google.com>
Now with ctx scoping fixed,
and steps nested just how I like them.
Change-Id: Ifa43a432faddbafaae118ab0b16f710b695b5377
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/365504
Reviewed-by: Eric Boren <borenet@google.com>
This also marks the glorious return of td.FailStep() as the answer to my
question "now how do I find my failures in this giant list?"
Change-Id: I15f98862d77942f2e289dc626da8643789a91d48
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364838
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Eric Boren <borenet@google.com>
Calling td.FailStep() as written here doesn't really do anything except
hide the more useful summary error, e.g. "484 runs of build/fm failed
after retries." Maybe it'll become useful again if I add step nesting?
Change-Id: I23eb59afce8559f4b0e549f31873577939fc7ca7
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364497
Reviewed-by: Mike Klein <mtklein@google.com>
td.FailStep() isn't enough to fail the bot,
so go back to a call to td.Fatal() when failures>0.
Change-Id: Ib2be7b15200376ab8a16e4a1b69d98fde0630673
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364471
Reviewed-by: Mike Klein <mtklein@google.com>
Even with all the workarounds (deleted here), calling td methods still
costs a fair chunk of CPU work. Instead of sneakily working around it,
just never call it when run locally.
Change-Id: I2e421a5d585c86a6315d56867a29bdcdc9d45479
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364461
Reviewed-by: Mike Klein <mtklein@google.com>
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: I319f2b80aec95f51ff9fe3db341bb7bf0d82d971
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364015
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
No need for this extra parallelism, and it's extra contention.
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: I5c0d52def5043555f313e99713335aa66b269e22
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364014
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
I've pulled most of this from the BonusConfigs smorgasbord,
skipping a few redundant ones (do we really need all combos
of {8888,f16}x{srgb,narrow,p3,rec2020}?).
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: I56f684eb593f4e54d74f592e08508662bd7daa35
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363998
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Kicking off goroutines willy-nilly wasn't a good idea,
but some of the other work was nice and can be kept even
with the safer pool-of-goroutines strategy.
- use exec.Silent to skip some burned formatting work
if we're just going to send it all to /dev/null
- rearrange to not need both a todo list and a queue
of work makes sense... just get the workers going and
have kickoff() feed the queue directly
- straighten out worker logic flow to make it understandable
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: I4b27db4b9d41cf05a1c9dee9409ebd664f566567
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364011
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
This reverts commits 8ef3c539a2
and 4b09de3c90.
It turns out controlling the scheduling is a good idea;
I keep running into exec failures and process limits.
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: Ia72f446965e5093fbf996e78d9513c15dedae3d9
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364006
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Instead of making a list of work to do and then
later kicking off goroutines to do it, just start
the work as soon as it's ready to go.
Change-Id: I6bd8a031958ae440ba7f72609d9dfb867ebb2490
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363436
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Instead of standing goroutines pulling from a work queue, just kick off
individual goroutines. No need to write a scheduler on top of another
scheduler.
Local runs put this at the same wall time as before while saving a
little user/sys CPU time. Bot runs typically took 25-50s before this
change, and now 40s, 28s, and 29s, so the same there too.
We can choose whether to handle re-runs on the same goroutine or kick
off new ones. I've chosen here to run them on the same goroutine (see
the commented /*go*/), mostly because the bots quickly exhaust their
user process limits when the reruns are all spawning FM processes in
parallel. I think that means we don't need the extra parallelism. As
far as I have seen, whether we kick off a goroutine or not has had no
impact on wall/user/sys at all, so might as well not for now.
Change-Id: If2990e07a402dee8c5706f537f503421013a5586
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363376
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
When running as a bot (even locally if you want), grab the known hashes
from Gold and scan through FM's stdout looking for unknown hashes.
If we do find unknown hashes, requeue the batch for individual reruns
like we do on failure, print the command and new hash if those singleton
reruns also (re)produce an unknown hash.
Eventually, I'll have singleton runs write out .pngs and upload them to
Gold when the hash is unknown.
Change-Id: I835881e6e6260e4dbe84de8d03d16921881aae1c
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363039
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Building job strings and then parsing them into Work structs
seems a bit roundabout when there's no job string to start with.
Instead reorient so that we're building a list of Work, and create
those Work units directly when possible instead of via job strings.
Should be no practical change here.
Change-Id: I48f1eec8ab7ccbe2c46fc62174cd3625c51d3732
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363038
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
We can now mimic bots locally by running fm_driver.go like this,
go run **/fm_driver.go -bot $BOT out/fm
where $BOT is like FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All.
As a demo, skip aarectmodes and GoodHash on Debian10.
Change-Id: Iec215182dce9f05b8aa6807e837daa0618e2669f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/362316
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
This lets us pass a job on the command line,
go run infra/bots/task_drivers/fm_driver/fm_driver.go out/fm tests b=cpu
or use -script to pass a file or stdin,
cat << EOF | go run infra/bots/task_drivers/fm_driver/fm_driver.go -script - out/fm
b=cpu tests
gms skvm=false b=cpu w=$out/vanilla
gms skvm=true b=cpu w=$out/skvm
#gms skvm=true b=cpu w=$out/dp3 gamut=p3 tf=srgb
#gms skvm=true b=cpu w=$out/linear gamut=srgb tf=linear
#gms skvm=true b=cpu w=$out/rec2020 gamut=rec2020 tf=rec2020
EOF
(This CL will make the one FM bot temporarily do nothing,
but the next CL should fix it.)
Change-Id: I1f3badac78a0f61698179c1afec37b3020539fff
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/362216
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Update comments and small tweaks as I remember how this works.
Change-Id: I4a279781e512fc707b96226e62a2831a1d0683e5
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/362196
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
The default Task Driver logging gets in the way on the console, so I've
sent it to /dev/null for local runs. We control the horizontal and the
vertical.
Instead, print out each isolated failure and how to reproduce it:
out/fm -i resources -b cpu -s ducky_yuv_blend #failed:
Resource "resources/images/ducky.jpg" not found.
../tools/fm/fm.cpp:573: fatal error: "Image(s) failed to load."
Signal 5:
_sigtramp (+0x1d)
sk_abort_no_print() (+0x5)
std::__1::__function::__func<...
_GLOBAL__sub_I_fm.cpp (+0x0)
main (+0x12d5)
out/fm -i resources -b cpu --skvm true -s ducky_yuv_blend #failed:
Resource "resources/images/ducky.jpg" not found.
../tools/fm/fm.cpp:573: fatal error: "Image(s) failed to load."
Signal 5:
_sigtramp (+0x1d)
sk_abort_no_print() (+0x5)
std::__1::__function::__func<...
_GLOBAL__sub_I_fm.cpp (+0x0)
main (+0x12d5)
2 runs of out/fm failed after retries.
exit status 1
Bot runs still look ok,
https://task-driver.skia.org/td/TEaSLB6jtmRq5XDBUIwS?ifNotFound=https%3A%2F%2Fchromium-swarm.appspot.com%2Ftask%3Fid%3D4bea6af25506c010
Change-Id: I56adacdaeed5545785a3096a4e495eb524db442f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/287017
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
PS 6 trying StartStep/EndStep/FailStep()
PS 7 better usage?
PS 8 goes back to td.Fatal* for top-level failures
Failures seem to be working ok as of PS 8,
but I am puzzled why PS 7 wasn't correct... much prefer it.
Also set max_attempts to 1... this driver will handle flakiness itself.
Change-Id: I7de6809920bfaf1d878d654c9cf5b7861a64d23f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/286118
Reviewed-by: Mike Klein <mtklein@google.com>
Reviewed-by: Eric Boren <borenet@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>