Task of the Data Processing on-call expert

L1Proc and halfPipe. Every time a red cross appears on the Data Processing page, next to the L1Proc or halfPipe processing status bar, the Level 1 on-call expert will be paged. We are not on-call for ASP/GRB search (Jim Chiang is on call for these failures) and we are definitely NOT on-call for infrastructure problems (can't see monitoring plots, etc.).

Different types of failures

Transient failures

How to recognize transient failures: they usually affect only one single job (disk or DB failures), or a small number of jobs all on the same machine (LFS failure).

What to do in case of transient failures: rollback the affected process(es) when possible (see below). Look for the dontCleanUp file and check the Log Watcher (see below).

Infrastructure failures

How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.

What to do in case of infrastructure failures: these failures involve a large number of people to be taken care of (the infrastructure expert on-call and often also the SCCS), so for the time being still page Warren and/or Maria Elena (see L1 shift schedule) if you think that one of those failures might be happening during the night.