...

  • afs hiccup (can't read from/write to a resource located on an afs disk)
  • nfs hiccup (can't read from/write to a resource located on an nfs disk)
  • LSF hiccup (an lsf batch machine crashes or is rebooted)
  • lost database connection (can happen on MOOT or, more rarely, on Oracle)
  • recon segfaults for no apparent reason (nothing in logfile)

How to recognize transient failures: they usually affect only a single job (disk or DB failures), or a small number of jobs all on the same machine (LSF failure). If a process complains about a missing file but the file exists, or gets a read error after opening a file, it's probably a disk server hiccup. If the log file ends suddenly, without the usual LSF postlog, the batch host probably crashed; in that case several jobs will probably fail the same way at the same time.
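The "log file ends without the usual LSF postlog" check can be sketched in a few lines of Python. This is only an illustration: the marker string and the function names are assumptions, not part of the pipeline, and the marker should be replaced with whatever text the LSF postlog actually writes in these logs.

```python
from pathlib import Path

# Assumption: hypothetical marker text. Replace it with a string that
# really appears in the LSF postlog of a job that finished normally.
POSTLOG_MARKER = "Resource usage summary"


def tail_has_postlog(log_text, marker=POSTLOG_MARKER, tail_lines=50):
    """Return True if the marker appears in the last `tail_lines` lines."""
    tail = log_text.splitlines()[-tail_lines:]
    return any(marker in line for line in tail)


def probably_host_crash(log_path, marker=POSTLOG_MARKER):
    """A log that ends without the postlog suggests the batch host crashed."""
    text = Path(log_path).read_text(errors="replace")
    return not tail_has_postlog(text, marker)
```

A True result from probably_host_crash is only a hint; check whether other jobs on the same host failed at the same time before deciding.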

What to do in case of transient failures: roll back the affected process(es) when possible (see below for the rollback procedure). Look for the dontCleanUp file and check the Log Watcher (see below). If recon segfaults for no apparent reason, email Heather and Anders before the rollback and include a link to the log file, which will tell them where the core file is.

...

  • single batch host is killing several jobs ('rogue' LSF host)
  • group of batch hosts crashed or went offline
  • afs crashed on one of the host machines
  • scratch disk is full on one or more of the host machines
  • staging disk is full
  • u52 or u15 are full

How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.
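One way to see whether failures cluster on a host is to count failed jobs per LSF host. The sketch below assumes you have already collected (job id, host name) pairs for the failed jobs by some other means; the function name and the threshold of 3 are illustrative assumptions, not values defined by this guide.

```python
from collections import Counter


def summarize_failures(failed_jobs, rogue_threshold=3):
    """Count failures per host and list hosts at or above the threshold.

    failed_jobs: iterable of (job_id, host) pairs (hypothetical input format).
    rogue_threshold: assumed cutoff for flagging a host as suspect.
    """
    per_host = Counter(host for _job_id, host in failed_jobs)
    suspects = sorted(h for h, n in per_host.items() if n >= rogue_threshold)
    return per_host, suspects
```

Many failures concentrated on one host point to a 'rogue' LSF host, while many failures spread over many hosts point to a shared resource such as a full disk.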

What to do in case of infrastructure failures: handling these failures involves a large number of people (the infrastructure expert on call and often also the SCCS). For the time being, page Warren and/or Maria Elena (see L1 shift schedule) if you think that one of these failures might be happening during the night. If in doubt, page anyway.

...

  • G4 propagator error. If recon dies complaining about g4propagator, we can't fix it. If this happens, email Heather and Anders (and possibly Warren and Maria Elena). Include a link to the log file, which will tell them where the core file is.
  • Overlapping chunks. If findChunks complains about overlapping chunks, tell Bryson; if trending merges complain, tell Bryson and datamonlist@glast2.Stanford.EDU. Rollback won't help.

How to recognize permanent failures: besides these two cases, any failure that is not fixed by a rollback is by definition a permanent failure.

What to do in case of permanent failures: contact the appropriate people above, if you are sure you know what happened. Otherwise, page Warren and/or Maria Elena (see L1 shift schedule). If there is another part of the run waiting, the run lock (see below) will have to be removed by hand; page unless you're really sure of what you're doing.
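The rule for recognizing permanent failures can be written down as a small triage function. This is a sketch under assumptions: the substrings matched here are taken from the descriptions above and may not match the exact wording in real log files, and the function name and return strings are made up for illustration.

```python
def triage(log_text, failed_after_rollback):
    """Classify a failure following the rules described above.

    log_text: contents of the job's log file.
    failed_after_rollback: True if the job already failed again after a rollback.
    """
    text = log_text.lower()
    # Assumed substrings; adjust to the real error messages.
    if "g4propagator" in text:
        return "permanent: G4 propagator error"
    if "overlapping chunks" in text:
        return "permanent: overlapping chunks"
    if failed_after_rollback:
        return "permanent: still failing after rollback"
    return "possibly transient: try a rollback"
```

The function only labels the failure. Whom to contact, and whether to page, still follows the instructions above.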

...