Things to know while on-call for Data Processing

Task of the Data Processing on-call expert

L1Proc and halfPipe. Every time a red cross appears on the Data Processing page next to the L1Proc or halfPipe processing status bar, the Level 1 on-call expert is paged. We are not on call for ASP/GRB search (Jim Chiang is on call for these failures), and we are definitely NOT on call for infrastructure problems (for example, monitoring plots that cannot be viewed).

Different types of failures

Transient failures

afs hiccup (can't read from/write to a resource located on an afs disk)
nfs hiccup (can't read from/write to a resource located on an nfs disk)
LSF hiccup (an LSF batch machine crashes or is rebooted)
lost database connection (can happen on MOOT or, more rarely, on Oracle)

How to recognize transient failures: they usually affect only a single job (disk or DB failures), or a small number of jobs all on the same machine (LSF failure).

What to do in case of transient failures: roll back the affected process(es) when possible (see below). Look for the dontCleanUp file and check the Log Watcher (see below).

Infrastructure failures

single batch host is killing several jobs ('rogue' LSF host)
group of batch hosts crashed or went offline
afs crashed on one of the host machines
scratch disk is full on one or more of the host machines
staging disk is full
u52 or u15 are full

How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.

What to do in case of infrastructure failures: resolving these failures involves a large number of people (the on-call infrastructure expert and often also the SCCS), so for the time being still page Warren and/or Maria Elena (see the L1 shift schedule) if you think one of these failures might be happening during the night.
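
The recognition rules for transient and infrastructure failures can be sketched as a small triage helper. This is only an illustration, not an existing pipeline tool: the function name classify_failures, the (job_id, host) input format, and the SMALL_BATCH threshold are all hypothetical, since this page only speaks of "a small number" and "a large number" of jobs.

```python
from collections import Counter

# Hypothetical threshold: the page only says "a small number of jobs".
SMALL_BATCH = 3


def classify_failures(failed_jobs):
    """Give a first guess at the failure type.

    failed_jobs is a list of (job_id, host) tuples for the jobs that
    are currently failed. Returns "none", "transient", or "infrastructure".
    """
    if not failed_jobs:
        return "none"

    # Count how many failed jobs ran on each batch host.
    per_host = Counter(host for _, host in failed_jobs)
    total = len(failed_jobs)

    # A single failed job, or a few jobs all on the same machine,
    # matches the description of a transient failure.
    if total == 1:
        return "transient"
    if len(per_host) == 1 and total <= SMALL_BATCH:
        return "transient"

    # Many failed jobs, on one host or spread over several hosts,
    # matches the description of an infrastructure failure.
    return "infrastructure"
```

The result is only a starting point: a "transient" guess suggests a rollback, while an "infrastructure" guess suggests paging as described above. The on-call expert still has to confirm the cause from the logs.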
