
Task of the Data Processing on-call expert

L1Proc and halfPipe. Every time a red cross appears on the Data Processing page, next to the L1Proc or halfPipe processing status bar, the Level 1 on-call expert will be paged. We are not on-call for ASP/GRB search (Jim Chiang is on call for these failures) and we are definitely NOT on-call for infrastructure problems (can't see monitoring plots, etc.). If you get paged for something that is not under your responsibility, don't try to fix it: forward the message to the appropriate people and report everything in the Ops Log.

Different types of failures

There are three main types of failures, and they should be handled differently.

Transient failures

  • afs hiccup (can't read from/write to a resource located on an afs disk)
  • nfs hiccup (can't read from/write to a resource located on an nfs disk)
  • LSF hiccup (an lsf batch machine crashes or is rebooted)
  • lost Data Base connection (can happen on MOOT or more rarely on Oracle)
  • recon segfaults for no apparent reason (nothing in logfile)

How to recognize transient failures: they usually affect only a single job (disk or DB failures), or a small number of jobs all on the same machine (LSF failure). If a process complains about a missing file but the file exists, or gets a read error after opening a file, it's probably a disk server hiccup. If the log file ends suddenly, without the usual LSF postlog, the batch host probably crashed. There will probably be several jobs failing the same way at the same time.
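When a job reports a missing file, rechecking the path from an interactive node helps distinguish a disk-server hiccup from a genuine loss. A minimal sketch (the helper name and example path are ours, not part of the pipeline):

```shell
# check_path: re-test readability of a file a failed job complained about.
# If the file is readable now, the failure was probably a transient
# AFS/NFS hiccup and the affected process can simply be rolled back.
check_path() {
    path="$1"
    if [ -r "$path" ]; then
        echo "READABLE: $path (likely a transient disk hiccup; roll back)"
    else
        echo "MISSING:  $path (investigate before rolling back)"
    fi
}

# Example (hypothetical path):
# check_path /afs/slac/g/glast/some/input/file.root
```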

What to do in case of transient failures: rollback the affected process(es) when possible (see below for the rollback procedure). Look for the dontCleanUp file and check the Log Watcher (see below). If recon segfaults for no apparent reason, email Heather and Anders before the rollback, including a link to the log file, which will tell them where the core file is.

Infrastructure failures

  • single batch host is killing several jobs ('rogue' LSF host)
  • group of batch hosts crashed or went offline
  • afs crashed on one of the host machines
  • scratch disk is full on one or more of the host machines
  • staging disk is full
  • u52 or u15 are full

How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.

What to do in case of infrastructure failures: these failures require a number of people to take care of (the infrastructure on-call expert, and often the SCCS as well), so for the time being still page Warren and/or Maria Elena (see the L1 shift schedule) if you think that one of those failures might be happening during the night (if in doubt, page anyway).
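Several of the infrastructure failures above are full disks, which can be spotted quickly from a login node. A sketch, where the mount points in the loop are placeholders for the actual SLAC scratch/staging/u52/u15 paths:

```shell
# disk_usage_pct: print a filesystem's use percentage as a bare number,
# using the POSIX-portable output of df -P.
disk_usage_pct() {
    df -P "$1" 2>/dev/null | awk 'NR==2 {gsub("%","",$5); print $5}'
}

# Placeholder mount points; substitute the real ones.
for mnt in /scratch /staging /u52 /u15; do
    usage=$(disk_usage_pct "$mnt")
    if [ -n "$usage" ] && [ "$usage" -ge 95 ]; then
        echo "WARNING: $mnt is ${usage}% full"
    fi
done
```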

Permanent failures

  • G4 propagator error
  • Overlapping chunks

If recon dies complaining about G4Propagator, we can't fix it. In this case, email Heather and Anders, including a link to the log file, which will tell them where the core file is.

If trending merges complain about overlapping chunks, tell Bryson and datamonlist@glast2.Stanford.EDU. Rollback won't help.

Everything to know about the Rollback

How to rollback

When to rollback

The dontCleanUp file

For the first time, I just caught a merge not finding all of its input files because of a read failure. Remember to check the Log Watcher for merge errors. In a future L1 version, I'll add the run ID to the message target so it can be filtered on.

1. From the glast-ground page, click "Logging" (about 1/3 of the way down the left side)
2. Click "Selection" (upper right)
3. Click "More Selections" (lower right)
4. In the "Select Program" field, enter "mergeStuff.py"
5. Click "Table"

Now you should have a big list of merges complaining about not being able to find input files. You can't filter on run number yet, sorry. If any of the complaining merges are downstream of a failed process, you can ignore them; this is usually the case. But if they are not downstream from a failure, it means there was a disk read error while setting up the merge. Look at the logfile for that merge: it will say which files are missing. If they are really there, the merge can be rolled back and will probably work the second time. If they are not there, there's a bigger problem; I don't even know what.

Any time one of these messages is generated, cleanup for the run is disabled by a file called dontCleanUp in the run directory on u52/L1. All cleanup jobs will fail if that file is present. If everything is OK, that file can be removed and the jobs rolled back.
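The check above can be sketched as a small shell helper (the run-directory layout is taken from the text; the function name is ours):

```shell
# cleanup_blocked: succeed (and say so) if the run's dontCleanUp marker
# is present, meaning all cleanup jobs for that run will fail until it
# is removed.
cleanup_blocked() {
    run_dir="$1"
    if [ -e "$run_dir/dontCleanUp" ]; then
        echo "cleanup disabled: $run_dir/dontCleanUp present"
        return 0
    fi
    return 1
}

# Once everything has been verified OK, remove the marker and roll the
# cleanup jobs back:
#   rm "$run_dir/dontCleanUp"
```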
