...

What to do in case of transient failures: roll back the affected process(es) when possible (see below for the rollback procedure). Look for the dontCleanUp file and check the Log Watcher (see below). If recon segfaults for no apparent reason, email Heather and Anders before rolling back, and include a link to the log file, which will tell them where the core file is. For pipeline deadlocks, email Dan and include a link to the process instance.

Transient failures have been rare lately. For the last couple of months, most failed processes have been automatically retried once. The retry usually clears transient issues, so a failure that remains after it usually indicates an actual problem.

Infrastructure failures

  • single batch host is killing several jobs ('rogue' LSF host)
  • group of batch hosts crashed or went offline
  • AFS crashed on one of the host machines
  • scratch disk is full on one or more of the host machines - see what to do here
  • staging disk is full
  • u52 or u15 is full
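
Two of the checks in the list above (a 'rogue' host and a full disk) can be sketched in a few lines of Python. This is only an illustration, not part of the pipeline tooling: the function names, the thresholds, and the (job id, host) input format are assumptions, and the list of failed jobs would have to come from whatever the batch system or Log Watcher actually reports.

```python
import shutil
from collections import Counter


def find_rogue_hosts(failed_jobs, min_failures=3, share=0.5):
    """Return hosts that account for a large share of recent failures.

    failed_jobs: iterable of (job_id, host) pairs for failed jobs
    (hypothetical format). A host is flagged when it has at least
    min_failures failures and at least `share` of all failures.
    Both thresholds are illustrative defaults, not tuned values.
    """
    counts = Counter(host for _, host in failed_jobs)
    total = sum(counts.values())
    return sorted(
        host
        for host, n in counts.items()
        if n >= min_failures and n / total >= share
    )


def disk_fill_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


if __name__ == "__main__":
    # Hypothetical host names; replace with real failure records.
    failures = [(101, "hostA"), (102, "hostA"), (103, "hostA"), (104, "hostB")]
    print("possible rogue hosts:", find_rogue_hosts(failures))
    print("fill fraction of current directory:", disk_fill_fraction("."))
```

If one host dominates the failure list, that points at a rogue host and not at a problem with the jobs themselves. The same fill-fraction check could be pointed at the scratch, staging, or user disk paths, whatever those are on the machine in question.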

...