On batch failure we're rerunning every source in the batch, while we
really only need to rerun sources that we don't know succeeded.
If for example we run sources "foo", "bar", and "baz", and foo produces
a known hash, then bar crashes, we only need to rerun bar and baz. The
batch run was enough to demonstrate that foo is good.
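
A minimal sketch of that selection, assuming we already have the set of
sources that printed a known hash before the crash. The helper name
sourcesToRerun and the map-based bookkeeping are hypothetical, not the
actual fm_driver.go code:

```go
package main

import "fmt"

// sourcesToRerun returns the sources from batch that were NOT observed to
// produce a known hash, preserving batch order. (Hypothetical helper.)
func sourcesToRerun(batch []string, knownGood map[string]bool) []string {
	var rerun []string
	for _, src := range batch {
		if !knownGood[src] {
			rerun = append(rerun, src)
		}
	}
	return rerun
}

func main() {
	batch := []string{"foo", "bar", "baz"}
	// foo printed a known hash before bar crashed the batch.
	knownGood := map[string]bool{"foo": true}
	fmt.Println(sourcesToRerun(batch, knownGood)) // prints [bar baz]
}
```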
Change-Id: I17634a6095906bcc2ad0bd33bb78eba000654b5e
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/369456
Reviewed-by: Eric Boren <borenet@google.com>
Move the definition of the Work struct to just before it's used,
and show one of the sources as an example at the kickoff-level step.
These are just cosmetic changes/refactors.
Change-Id: Ib23b9379683b9867e097c8d68ef8736013719cee
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/369356
Reviewed-by: Eric Boren <borenet@google.com>
Disabled on Adreno 5xx/6xx as the tests do not pass on those GPUs:
http://screen/3Dkgs9syj37cjBV
Change-Id: Ib935d01e8f06dbfe7decd5cc4e52e0688b48be08
Bug: skia:11306, skia:11308
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/368805
Commit-Queue: Brian Osman <brianosman@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
Auto-Submit: John Stiles <johnstiles@google.com>
If we track how many pending batches a kickoff()
has in flight, we can endStep() it properly when
that number hits zero.
This double sync.WaitGroup trick is pretty neat.
Now we're thinking with portals...
Added some comments to prevent myself falling in
the trap of assuming we'll have runtime.NumCPU()
batches... rounding the batch size up means we'll
sometimes have fewer.
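
One possible reading of the above, as a sketch only. splitBatches and the
signature of kickoff here are hypothetical stand-ins: "all" plays the role
of the program-wide WaitGroup, "pending" counts this kickoff's batches, and
endStep is whatever closes the kickoff-level step. With 9 sources and 4
workers the batch size rounds up to 3, so there are only 3 batches.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// splitBatches divides sources into at most n batches, rounding the batch
// size up. Rounding up means there can be fewer than n batches.
func splitBatches(sources []string, n int) [][]string {
	if n <= 0 || len(sources) == 0 {
		return nil
	}
	size := (len(sources) + n - 1) / n // ceiling division
	var batches [][]string
	for start := 0; start < len(sources); start += size {
		end := start + size
		if end > len(sources) {
			end = len(sources)
		}
		batches = append(batches, sources[start:end])
	}
	return batches
}

// kickoff runs each batch on its own goroutine and calls endStep once this
// kickoff's pending count hits zero. all tracks every goroutine started.
func kickoff(sources []string, n int, all *sync.WaitGroup, run func([]string), endStep func()) {
	batches := splitBatches(sources, n)
	pending := &sync.WaitGroup{}
	pending.Add(len(batches))
	all.Add(len(batches) + 1) // +1 for the goroutine that ends the step
	for _, batch := range batches {
		go func(batch []string) {
			defer all.Done()
			defer pending.Done()
			run(batch)
		}(batch)
	}
	go func() {
		defer all.Done()
		pending.Wait() // every batch of this kickoff has finished
		endStep()
	}()
}

func main() {
	sources := []string{"a", "b", "c", "d", "e", "f", "g", "h", "i"}
	fmt.Println(len(splitBatches(sources, 4))) // prints 3, not 4

	var all sync.WaitGroup
	var ran int32
	kickoff(sources, 4, &all,
		func(batch []string) { atomic.AddInt32(&ran, int32(len(batch))) },
		func() { fmt.Println("endStep after", atomic.LoadInt32(&ran), "sources") })
	all.Wait() // main can't exit before endStep has run
}
```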
Change-Id: If50615c204485862462c240b9bbdfd4ddbad43b2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366142
Reviewed-by: Eric Boren <borenet@google.com>
It's nice to see it in the task log, and to be able to see
it's not there when we're not working with Gold (*SAN) bots.
(One trybot of each kind here.)
Change-Id: Ibb4aa20badf95ef603f3890e1c8248cad675507f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366143
Reviewed-by: Eric Boren <borenet@google.com>
Group batches from a single kickoff() into another mid-level step:
    Top-Level
        kickoff --some flags
            batch sources...
                batch (exec)
            batch other sources ...
                batch (exec)
                rerun (exec)
                rerun (exec)
            batch yet other sources ...
                batch (exec)
                rerun (exec)
        kickoff --some other flags
            ...
Big question: is it okay for a kickoff step to td.EndStep() while its
kids are still running (or haven't even started) on other goroutines?
Change-Id: I77ad2274e35cea0151be0cca6c690eafc4f8983e
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366140
Reviewed-by: Eric Boren <borenet@google.com>
There are bots (*SAN) that won't ever be uploading to Gold,
so *bot != "" doesn't really describe the right condition.
We could do this logic inside fm_driver.go based on *bot,
but I kind of want the flexibility to do things like upload
local ad-hoc runs or sanitizer runs if we want using --gold.
Change-Id: Id972d8b0097616c5b2802bc99c2718fdd1568fe3
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366139
Reviewed-by: Mike Klein <mtklein@google.com>
Why have other bots when we can do it all here?
Change-Id: I6a3f3c2ed5d19a3b8ecf59f44cc0d2f6076bba7f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366138
Reviewed-by: Mike Klein <mtklein@google.com>
There's no need to tick wg up and down when running reruns, and as
written it's possible for the overall fm_driver program to exit before
one last call to endStep() has happened. Simply calling wg.Done() once
per item dequeued outside worker() fixes both.
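
Roughly the shape described above, as a sketch under assumptions: the Work
type, runAll, and the endStep callback are hypothetical, and n is assumed
to be at least 1. worker() never touches the WaitGroup; the loop that
dequeues calls wg.Done() exactly once per item, after worker() (and so its
endStep) has returned.

```go
package main

import (
	"fmt"
	"sync"
)

// Work is a hypothetical unit of queued work: a batch of sources.
type Work struct {
	Sources []string
}

// worker processes one dequeued item, including any reruns it decides to
// do inline on this goroutine. It never calls wg.Add() or wg.Done().
func worker(w Work, endStep func(Work)) {
	// ... run the batch, rerun failures here on the same goroutine ...
	endStep(w)
}

// runAll starts n workers (n >= 1), queues every item, and returns only
// after each item's worker() call has completed.
func runAll(items []Work, n int, endStep func(Work)) {
	queue := make(chan Work, len(items))
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		go func() {
			for w := range queue {
				worker(w, endStep)
				wg.Done() // once per dequeued item, outside worker()
			}
		}()
	}
	for _, w := range items {
		wg.Add(1) // once per queued item, before it can be dequeued
		queue <- w
	}
	wg.Wait()
	close(queue) // let the workers exit
}

func main() {
	items := []Work{
		{Sources: []string{"foo", "bar"}},
		{Sources: []string{"baz"}},
		{Sources: []string{"qux"}},
	}
	var mu sync.Mutex
	ended := 0
	runAll(items, 2, func(Work) {
		mu.Lock()
		ended++
		mu.Unlock()
	})
	fmt.Println("steps ended:", ended) // prints steps ended: 3
}
```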
Change-Id: I0fb0acc5a3f2c624dfc14f875fa094db6dd40838
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/366137
Reviewed-by: Mike Klein <mtklein@google.com>
If a batch fails, we've got to rerun everything (or at least from the
failure on), but when it's merely unknown hashes, no need to rerun
what's produced hashes we know already.
Small tweak to FM to keep all the printed source names exactly what's
passed in, keeping the whole path for skp/svg/image files. This means
zero bookkeeping needed to know what to rerun when parsing that output.
Change-Id: I1e7ed3ee51158b68a6bdd3152560f3a282109576
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/365818
Reviewed-by: Mike Klein <mtklein@google.com>
Now with ctx scoping fixed,
and steps nested just how I like them.
Change-Id: Ifa43a432faddbafaae118ab0b16f710b695b5377
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/365504
Reviewed-by: Eric Boren <borenet@google.com>
These *SAN bots should be able to replace some of the Test- bots on our
tree, and the MSVC one is just another we'll need eventually.
(Similarly we'll want GCC on Linux, but I don't know how to Docker.)
Change-Id: Ied4519626f1e13bb31fcb30f37cbd1b24133aa71
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/365597
Reviewed-by: Eric Boren <borenet@google.com>
I've been waiting to replace bots until they were uploading to Gold, but
these *SAN bots don't upload anything, so we're at parity there already.
I've just remembered the Mac ASAN/TSAN bots and the Windows ASAN bots,
so I'll be following up to replace them too.
Change-Id: Ia3edfda64091e4407a1131073829b74f22b32b71
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/365596
Reviewed-by: Eric Boren <borenet@google.com>
I swear yesterday we had ~10 unused 10.14 bots...
Anyway, as with the last move, I don't really care where these bots are
running, so long as they have enough machines to schedule reasonably
quickly. This switches their spec to the default "Mac", which is
currently 19 VMs running 10.15.7 on trashcan Mac Pros.
Similarly, Win2019 -> Win. This is a change in bot name only, just to
capture the "I'll take whatever's default" spirit. I'd use Linux or
Debian too if there were one.
Change-Id: Ifa7615735c660018a5f3f46f4d8035e0b5bf8141
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/365518
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Eric Boren <borenet@google.com>
Reviewed-by: Weston Tracey <westont@google.com>
This also marks the glorious return of td.FailStep() as the answer to my
question "now how do I find my failures in this giant list?"
Change-Id: I15f98862d77942f2e289dc626da8643789a91d48
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364838
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Eric Boren <borenet@google.com>
I don't really care what machine class it's running on right now, and
status.skia.org/capacity says these 10.14.6 bots are the least loaded
of any bot on the tree.
Change-Id: Ie49b2659e99d60e450235207bbbc018e565636b4
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364716
Reviewed-by: Eric Boren <borenet@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Calling td.FailStep() as written here doesn't really do anything except
hide the more useful summary error, e.g. "484 runs of build/fm failed
after retries." Maybe it'll become useful again if I add step nesting?
Change-Id: I23eb59afce8559f4b0e549f31873577939fc7ca7
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364497
Reviewed-by: Mike Klein <mtklein@google.com>
Don't expect much out of TSAN given the process-based isolation,
but I'm curious to see how it goes. MSAN should work sensibly.
Change-Id: If0b794805461b0ecd7092900f4412d73cd80d0d2
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364466
Reviewed-by: Eric Boren <borenet@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
td.FailStep() isn't enough to fail the bot,
so go back to a call to td.Fatal() when failures>0.
Change-Id: Ib2be7b15200376ab8a16e4a1b69d98fde0630673
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364471
Reviewed-by: Mike Klein <mtklein@google.com>
Even with all the workarounds (deleted here), calling td methods still
costs a fair chunk of CPU work. Instead of sneakily working around it,
just never call it when run locally.
Change-Id: I2e421a5d585c86a6315d56867a29bdcdc9d45479
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364461
Reviewed-by: Mike Klein <mtklein@google.com>
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: I319f2b80aec95f51ff9fe3db341bb7bf0d82d971
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364015
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
No need for this extra parallelism, and it's extra contention.
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: I5c0d52def5043555f313e99713335aa66b269e22
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364014
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
I've pulled most of this from the BonusConfigs smorgasbord,
skipping a few redundant ones (do we really need all combos
of {8888,f16}x{srgb,narrow,p3,rec2020}?).
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: I56f684eb593f4e54d74f592e08508662bd7daa35
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363998
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Kicking off goroutines willy-nilly wasn't a good idea,
but some of the other work was nice and can be kept even
with the safer pool-of-goroutines strategy.
- use exec.Silent to skip some burned formatting work
if we're just going to send it all to /dev/null
- rearrange so we don't need both a todo list and a queue
of work... just get the workers going and
have kickoff() feed the queue directly
- straighten out worker logic flow to make it understandable
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: I4b27db4b9d41cf05a1c9dee9409ebd664f566567
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364011
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
This reverts commits 8ef3c539a2
and 4b09de3c90.
It turns out controlling the scheduling is a good idea;
I keep running into exec failures and process limits.
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All
Change-Id: Ia72f446965e5093fbf996e78d9513c15dedae3d9
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/364006
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Change-Id: I6f9221f06a8b7ed80fc2653cea3aa454a3ddc819
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363961
Auto-Submit: Weston Tracey <westont@google.com>
Reviewed-by: Mike Klein <mtklein@google.com>
Reviewed-by: Robert Phillips <robertphillips@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Commit-Queue: Robert Phillips <robertphillips@google.com>
In the short term this gives us iphone11 coverage again, but I still
need to sort out why xcode 12.3 builds won't work, since this adds
complexity and xcode version churn on builders.
Bug: skia:11129
Change-Id: Ic477b26e1cffc1d3124832cf26ec391969a617cf
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/358516
Commit-Queue: Weston Tracey <westont@google.com>
Reviewed-by: Eric Boren <borenet@google.com>
- Updated asset version to 13, which includes the image resources and
modified directory hierarchy.
- Updated dm --svgs arg in infra/bots/recipes/test.py to add "svg"
subdirectory (relative to corpus asset root).
- Ran 'make train' in infra/bots/
Bug: skia:11229
Change-Id: I4e3c5da5945ee7ee4034ec453fdeb84c5fa08394
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/361560
Commit-Queue: Tyler Denniston <tdenniston@google.com>
Reviewed-by: Kevin Lubick <kjlubick@google.com>
and build task drivers per platform so we can.
Cq-Include-Trybots: luci.skia.skia.primary:FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Win2019-Clang-GCE-CPU-AVX2-x86_64-Debug-All,FM-Mac10.13-Clang-MacBookPro11.5-CPU-AVX2-x86_64-Debug-All
Change-Id: Ie076abc6ba4692eaac4b44c2ecdc0c07e3246044
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363737
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Eric Boren <borenet@google.com>
Instead of making a list of work to do and then
later kicking off goroutines to do it, just start
the work as soon as it's ready to go.
Change-Id: I6bd8a031958ae440ba7f72609d9dfb867ebb2490
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363436
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Instead of standing goroutines pulling from a work queue, just kick off
individual goroutines. No need to write a scheduler on top of another
scheduler.
Local runs put this at the same wall time as before while saving a
little user/sys CPU time. Bot runs typically took 25-50s before this
change, and now 40s, 28s, and 29s, so the same there too.
We can choose whether to handle re-runs on the same goroutine or kick
off new ones. I've chosen here to run them on the same goroutine (see
the commented /*go*/), mostly because the bots quickly exhaust their
user process limits when the reruns are all spawning FM processes in
parallel. I think that means we don't need the extra parallelism. As
far as I have seen, whether we kick off a goroutine or not has had no
impact on wall/user/sys at all, so might as well not for now.
Change-Id: If2990e07a402dee8c5706f537f503421013a5586
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363376
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
When running as a bot (even locally if you want), grab the known hashes
from Gold and scan through FM's stdout looking for unknown hashes.
If we do find unknown hashes, requeue the batch for individual reruns
like we do on failure, print the command and new hash if those singleton
reruns also (re)produce an unknown hash.
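
A sketch of the scanning step. The stdout format here is an assumption
made purely for illustration: one "<source> <hash>" pair per line, with no
spaces in the source name. FM's real output may well differ, and
unknownSources is a hypothetical helper.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// unknownSources scans stdout for lines of the ASSUMED form
// "<source> <hash>" and returns the sources whose hash is not in known,
// in order of appearance. Lines of any other shape are ignored.
func unknownSources(stdout string, known map[string]bool) []string {
	var unknown []string
	scanner := bufio.NewScanner(strings.NewReader(stdout))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue // not a "<source> <hash>" line
		}
		if source, hash := fields[0], fields[1]; !known[hash] {
			unknown = append(unknown, source)
		}
	}
	return unknown
}

func main() {
	// Made-up hashes standing in for the known set fetched from Gold.
	known := map[string]bool{"aaaa": true, "bbbb": true}
	stdout := "foo aaaa\nbar cccc\nbaz bbbb\n"
	fmt.Println(unknownSources(stdout, known)) // prints [bar]
}
```

Only the sources returned here would need singleton reruns.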
Eventually, I'll have singleton runs write out .pngs and upload them to
Gold when the hash is unknown.
Change-Id: I835881e6e6260e4dbe84de8d03d16921881aae1c
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363039
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Building job strings and then parsing them into Work structs
seems a bit roundabout when there's no job string to start with.
Instead reorient so that we're building a list of Work, and create
those Work units directly when possible instead of via job strings.
Should be no practical change here.
Change-Id: I48f1eec8ab7ccbe2c46fc62174cd3625c51d3732
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/363038
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
We can now mimic bots locally by running fm_driver.go like this,
go run **/fm_driver.go -bot $BOT out/fm
where $BOT is like FM-Debian10-Clang-GCE-CPU-AVX2-x86_64-Debug-All.
As a demo, skip aarectmodes and GoodHash on Debian10.
Change-Id: Iec215182dce9f05b8aa6807e837daa0618e2669f
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/362316
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
This lets us pass a job on the command line,
go run infra/bots/task_drivers/fm_driver/fm_driver.go out/fm tests b=cpu
or use -script to pass a file or stdin,
cat << EOF | go run infra/bots/task_drivers/fm_driver/fm_driver.go -script - out/fm
b=cpu tests
gms skvm=false b=cpu w=$out/vanilla
gms skvm=true b=cpu w=$out/skvm
#gms skvm=true b=cpu w=$out/dp3 gamut=p3 tf=srgb
#gms skvm=true b=cpu w=$out/linear gamut=srgb tf=linear
#gms skvm=true b=cpu w=$out/rec2020 gamut=rec2020 tf=rec2020
EOF
(This CL will make the one FM bot temporarily do nothing,
but the next CL should fix it.)
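
For illustration, a job line like the ones above could be split into
sources and key=value flags along these lines. This is a sketch, not the
parser in fm_driver.go: the Job type and parseJob are hypothetical, and
treating a leading '#' as "skip this line" is an assumption based on the
commented-out lines in the example script.

```go
package main

import (
	"fmt"
	"strings"
)

// Job is a hypothetical parsed form of one job line.
type Job struct {
	Sources []string          // e.g. "tests", "gms"
	Flags   map[string]string // e.g. "b" -> "cpu"
}

// parseJob splits a job line into sources and key=value flags. ok is false
// for blank lines and for lines starting with '#'.
func parseJob(line string) (job Job, ok bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return Job{}, false
	}
	job.Flags = map[string]string{}
	for _, token := range strings.Fields(line) {
		if i := strings.Index(token, "="); i >= 0 {
			job.Flags[token[:i]] = token[i+1:] // key=value flag
		} else {
			job.Sources = append(job.Sources, token) // bare token: a source
		}
	}
	return job, true
}

func main() {
	script := "b=cpu tests\n" +
		"gms skvm=false b=cpu w=out/vanilla\n" +
		"#gms skvm=true b=cpu w=out/dp3 gamut=p3 tf=srgb\n"
	for _, line := range strings.Split(script, "\n") {
		if job, ok := parseJob(line); ok {
			fmt.Println(job.Sources, job.Flags)
		}
	}
}
```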
Change-Id: I1f3badac78a0f61698179c1afec37b3020539fff
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/362216
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
Update comments and small tweaks as I remember how this works.
Change-Id: I4a279781e512fc707b96226e62a2831a1d0683e5
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/362196
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
This has led to >9hr runtimes and timeouts,
making our GPU MSAN bots essentially useless black holes.
Cq-Include-Trybots: luci.skia.skia.primary:Build-Debian10-Clang-x86_64-Debug-SwiftShader_MSAN
Change-Id: I0d2b06e89ee672f1b181d140a9355e39597da49b
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/362136
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
This CL updates the infra scripts used to create the CIPD asset for
SVG's corpus on gold to download and include image resources.
Summary of changes:
- Change svg_downloader.py input argument to more generic name
- Add --keep_common_prefix arg to svg_downloader.py to preserve the
directory hierarchy for images, needed for the W3C test suite.
- Update infra SVG create.py script to download images
- Add svg_images.txt file with a list of the images we need for the W3C
test suite already in gold.
Actually updating the corpus will happen in a separate CL.
Bug: skia:11229
Change-Id: I5fe9be35db247f577bda6040ca3694a428314d0e
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/361516
Reviewed-by: Florin Malita <fmalita@chromium.org>
Reviewed-by: Kevin Lubick <kjlubick@google.com>
Commit-Queue: Tyler Denniston <tdenniston@google.com>
The perf bot on Pixel1 isn't worth much. Moved to pixel4xl since
it's one of the ones that still has some capacity.
Added a test bot that uses ASAN.
Bug: skia:10877
Change-Id: I04ac5d5c2359e38b88519e037f86911807daf32e
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/360417
Commit-Queue: Adlai Holler <adlai@google.com>
Reviewed-by: Robert Phillips <robertphillips@google.com>
Reviewed-by: Eric Boren <borenet@google.com>