...

When to rollback

The dontCleanUp file

Any time one of the merge processes can't find all of its input files because of a read failure, a message is generated in the Log Watcher and cleanup for the run is disabled by a file called dontCleanUp in the run directory on u52/L1. All cleanup jobs will fail while the dontCleanUp file is present. If everything is OK (see the instructions below), that file can be removed and the jobs rolled back. Remember to check the Log Watcher for merge errors; for now the messages can't be filtered on run number, but a future L1 version will add the run ID to the message target so that they can be.
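
If you need to check or clear the flag from a shell, something like the sketch below works. The run directory path is hypothetical; substitute the actual run directory for the run under u52/L1.

# Sketch only: the run directory path below is hypothetical.
RUNDIR=/path/to/u52/L1/run/directory      # substitute the real run directory
ls -l "$RUNDIR/dontCleanUp"               # if this file exists, all cleanup jobs will fail
# Only after verifying the run as described below:
rm "$RUNDIR/dontCleanUp"                  # cleanup jobs can then be rolled back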

To check that everything is OK, follow these steps:

  • From the glast-ground page, click "Logging" (about 1/3 of the way down the left side)
  • Click "Selection" (upper right)
  • Click "More Selections" (lower right)
  • In the "Select Program" field, enter "mergeStuff.py"
  • Click "Table"
  • Now you should have a big list of messages complaining about not being able to find input files. You can't filter on run number yet, sorry, but the run number will be in the message. If any of the complaining merges are downstream of a failed process, you can ignore them; this is usually the case. If they are not downstream of a failure, it means there was a disk read error while setting up the merge. Look at the logfile for that merge; it will say which files are missing (see the sketch after this list). If they are really there, the merge can be rolled back and will probably work the second time. If they are not there, there's a bigger problem,

...

  • and this should be considered a "permanent failure".
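
If you want to double-check the missing files by hand, a shell sketch like the one below does the job. The logfile path and the wording of the error message are assumptions; adjust them to what the merge actually logged.

# Sketch only: logfile path and error wording are hypothetical.
grep -i "can't find\|missing" /path/to/merge/logfile.log
# For each file the merge complained about, see whether it exists now:
ls -l /path/to/complained/about/input.file
# If the files are really there, roll the merge back; it will probably
# succeed the second time. If they are not there, it's a bigger problem.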

As noted above, whenever one of these messages is generated, cleanup for the run is disabled by the dontCleanUp file in the run directory on u52/L1, and all cleanup jobs will fail while it is present. If everything is OK, remove that file and roll the jobs back (roll back the chunk cleanups first and then the run cleanup; you'll get an error otherwise).

Useful resources

  • List of failed runs. Report failed runs and their status here. When a run is fixed (meaning no red cross appears anywhere any more), you can delete it from the list. The runs listed as 'ignore error (it is fixed)' are the ones that were fixed with a fake delivery, so they still appear as failed on the Data Processing page.
  • AFS hiccups. We keep track here of the AFS hiccups (or at least we try to). You can add date/time, LSF job ID and the 'missing' file.
  • LSF hiccups (https://confluence.slac.stanford.edu/display/ds/LSF+hiccups). We keep track here of the LSF hiccups (or at least we try to). You can add date/time, LSF batch host(s) and the reason, if known.
  • List of runs incorrectly marked as InProgress. These are runs where acq status is Complete, but L1 status is InProgress.

Other things you may want to know

SetL1Status task

Tired of being paged because L1Proc status still says Failed after a rollback?

/afs/slac/g/glast/ground/bin/pipeline --mode PROD createStream --define "runNumber=240831665,l1RunStatus=Running" setL1Status
OR:
/afs/slac/g/glast/ground/bin/pipeline --mode PROD createStream --define "runNumber=240837713" setL1Status

l1RunStatus defaults to Running, but you can set it to any of the allowed values.
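
For example, to set the status to Complete instead (Complete is assumed here to be one of the allowed l1RunStatus values; check before using it):

/afs/slac/g/glast/ground/bin/pipeline --mode PROD createStream --define "runNumber=240831665,l1RunStatus=Complete" setL1Status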