Add some useful information to trooper doc.

BUG=skia:
NOTRY=true
DOCS_PREVIEW= https://skia.org/?cl=1308253005

Review URL: https://codereview.chromium.org/1308253005
This commit is contained in:
benjaminwagner 2015-09-03 09:06:30 -07:00 committed by Commit bot
parent d705958375
commit fbf908c857

View File

@ -37,4 +37,78 @@ If you need to swap shifts with someone (because you are out sick or on vacation
<a name="tips"></a>
Tips for troopers
-----------------
Add your tips here!
- Make sure you are a member of
[MDB group chrome-skia-ninja](https://ganpati.corp.google.com/#Group_Info?name=chrome-skia-ninja@prod.google.com).
Valentine passwords and Chrome Golo access are based on membership in this
group.
- These alerts generally auto-dismiss once the criteria for the alert is no
longer met:
- Monitoring alerts, including prober, collectd, and others
- Disconnected build slaves
- These alerts generally do not auto-dismiss ([issue here](https://code.google.com/p/skia/issues/detail?id=4292)):
- Build slaves that failed a step
- Disconnected devices (these are detected as the "wait for device" step failing)
- "Failed to execute query" may show a different query than the failing one;
dismiss the alert to get a new alert showing the query that is actually
failing. (All "failed to execute query" alerts are lumped into a single alert,
which is why the failed query which initially triggered the alert may not be
failing any more but the alert is still active because another query is
failing.)
- Where machines are located:
- Machine name like "skia-vm-NNN" -> GCE
- Machine name ends with "a3", "a4", "m3" -> Chrome Golo
- Machine name starts with "skiabot-" -> Chapel Hill lab
- Machine name starts with "win8" -> Chapel Hill lab (Windows machine
names can't be very long, so the "skiabot-shuttle-" prefix is dropped.)
- slave11-c3 is a Chrome infra GCE machine (not to be confused with the Skia
Buildbots GCE, which we refer to as simply "GCE")
- The [chrome-infra IRC channel](https://comlink.googleplex.com/chrome-infra) is
useful for questions regarding bots managed by the Chrome Infra team and to
get visibility into upstream failures that cause problems for us.
- To log in to a Linux buildbot in GCE, use `gcloud compute default@<machine
name>`. Choose the zone listed for the
[GCE VM](https://pantheon.corp.google.com/project/31977622648/compute/instances)
(or specify it using the `--zone` command-line flag).
- To log in to a Windows buildbot in GCE, use
[Chrome RDP Extension](https://chrome.google.com/webstore/detail/chrome-rdp/cbkkbcmdlboombapidmoeolnmdacpkch?hl=en-US)
with the
[IP address of the GCE VM](https://pantheon.corp.google.com/project/31977622648/compute/instances)
shown on the [host info page](https://status.skia.org/hosts) for that bot. The
username is chrome-bot and the password can be found on
[Valentine](https://valentine.corp.google.com/) as "chrome-bot (Win GCE)".
- If there is a problem with a bot in the Chrome Golo or Chrome infra GCE, the
best course of action is to
[file a bug](https://code.google.com/p/chromium/issues/entry?template=Build%20Infrastructure)
with the Chrome infra team. But if you know what you're doing:
- To access bots in the Chrome Golo,
[follow these instructions](https://chrome-internal.googlesource.com/infra/infra_internal/+/master/doc/ssh.md).
- Machine name ends with "a3" or "a4" -> ssh command looks like `ssh
build3-a3.chrome`
- Machine name ends with "m3" -> ssh command looks like `ssh build5-m3.golo`
- For MacOS and Windows bots, you will be prompted for a password, which is
stored on [Valentine](https://valentine.corp.google.com/) as "Chrome Golo,
Perf, GPU bots - chrome-bot".
- To access bots in the Chrome infra GCE -> command looks like `gcutil
--project=google.com:chromecompute ssh --ssh_user=default slave11-c3` (or
use the ccompute ssh script from the infra_internal repo).
- Read over the [SkiaLab documentation](../testing/skialab) for more detail on
dealing with device alerts.
- To stop a buildslave for a device, log in to the host for that device, `cd
~/buildbot/<slave name>/build/slave; make stop`. To start it again,
`TESTING_SLAVENAME=<slave name> make start`.
- Buildslaves can be slow to come up after reboot, but if the buildslave remains
disconnected, you may need to start it manually. On Mac and Linux, check using
`ps aux | grep python` that neither buildbot nor gclient are running, then run
`~/skiabot-slave-start-on-boot.sh`.