You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 2 Next »

Task of the Data Processing on-call expert

L1Proc and halfPipe. Every time a red cross appears on the Data Processing page, next to the L1Proc or halfPipe processing status bar, the Level 1 on-call expert will be paged. We are not on-call for ASP/GRB search (Jim Chiang is on call for these failures) and we are definitely NOT on-call for infrastructure problems (can't see monitoring plots, etc.).

Different types of failures

Transient failures

  • afs hiccup (can't read from/write to a resource located on an afs disk)
  • nfs hiccup (can't read from/write to a resource located on an afs disk)
  • LSF hiccup (an lsf batch machine crashes or is rebooted)
  • lost Data Base connection (can happen on MOOT or more rarely on Oracle)

How to recognize transient failures: they usually affect only one single job (disk or DB failures), or a small number of jobs all on the same machine (LFS failure).

What to do in case of transient failures: rollback the affected process(es) when possible (see below). Look for the dontCleanUp file and check the Log Watcher (see below).

Infrastructure failures

  • single batch host is killing several jobs ('rogue' LSF host)
  • group of batch hosts crashed or went offline
  • afs crashed on one of the host machines
  • scratch disk is full on one or more of the host machines
  • staging disk is full
  • u52 or u15 are full

How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.

What to do in case of infrastructure failures: these failures involve a large number of people to be taken care of (the infrastructure expert on-call and often also the SCCS), so for the time being still page Warren and/or Maria Elena (see L1 shift schedule) if you think that one of those failures might be happening during the night.

  • No labels