...

  • single batch host is killing several jobs ('rogue' LSF host)
  • group of batch hosts crashed or went offline
  • AFS crashed on one of the host machines
  • scratch disk is full on one or more of the host machines - see what to do here
  • A staging disk is full (these are accessed from /afs/slac.stanford.edu/g/glast/ground/PipelineStaging[1-7])
  • /nfs/farm/g/glast/u41 or /nfs/farm/g/glast/u15 is full

...

Only one delivery can process a run at a time. This is enforced by a lock file in the run directory on u41/L1. If findChunks fails, or there are permanent failures in the run, and another part of the run is waiting, the lock file has to be removed by hand. It should never be removed unless the only failures in the run are findChunks or permanent ones, or there's a deadlock. Even then you have to wear a helmet and sign a waiver.
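Before touching a run lock it helps to confirm that it is actually there. The following is a minimal read-only sketch, not part of L1Proc: the function name run_lock_status is hypothetical, and the naming pattern <runId>.lock inside the run directory is an assumption inferred from lock-file paths quoted elsewhere on this page. The run directory is passed in as an argument rather than hard-coded.

```shell
# Report whether a run lock is present. Read-only: nothing is removed.
# Usage: run_lock_status RUN_DIR RUN_ID
# Assumption: the lock is named ${RUN_ID}.lock and lives in RUN_DIR.
run_lock_status() {
    run_dir=$1
    run_id=$2
    lock="${run_dir}/${run_id}.lock"
    if [ -e "$lock" ]; then
        echo "locked: $lock"
    else
        echo "unlocked: $lock"
    fi
}
```

Call it with the run directory and run ID, for example run_lock_status "$RUN_DIR" r0248039911. Removal itself is deliberately left as a manual step, given the conditions above.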

...

This process is now automatically retried like most of the others. When it fails, it attempts to remove the run lock (see above) and the throttle lock (next section) itself (and usually succeeds). But if it fails harder than usual, you might still have to do that by hand. Also, you'll probably have to:
mv /nfs/farm/g/glast/u41/L1/${runId}/${runId}${deliveryid}_chunkList.txt /nfs/farm/g/glast/u41/L1/${runId}/${runId}${deliveryid}_chunkList.txt.tmp

...

There's a throttle that limits the number of RDLs that can be in process (or in the hard part, at least) at once. It works by making files with names like /nfs/farm/g/glast/u41/L1/throttle/1.lock at the same time as it makes the run lock. config.throttleLimit is normally set to 2; 3 is usually safe, but not always. Leave it at 2 unless we're way behind and you're up for watching it closely.
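To see how many throttle slots are currently taken, the lock files in the throttle directory can simply be counted. This is a rough sketch under the assumption that every in-process RDL holds exactly one file matching *.lock in that directory; the function name count_throttle_locks is hypothetical, and the directory is passed as an argument.

```shell
# Count throttle lock files in a directory. Read-only.
# Usage: count_throttle_locks THROTTLE_DIR
# Assumption: one *.lock file per RDL currently holding a throttle slot.
count_throttle_locks() {
    throttle_dir=$1
    count=0
    for f in "$throttle_dir"/*.lock; do
        # When nothing matches, the glob stays literal; skip it.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Comparing the printed number with config.throttleLimit shows whether the throttle is saturated.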

...

Message text: Can't open lockfile /nfs/farm/g/glast/u41/L1/r0248039911/r0248039911.lock.

...

You will then need to manually re-enter the run into L1Proc. To do this, bkill any findChunks processes associated with the run ID, remove the run lock from /nfs/farm/g/glast/u41/rXXX (where XXX is the run number), and move all *chunkList*.txt files (leave the .txt.tmp ones alone) to something else (just suffixing them with ".ignore" should work).
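The renaming step can be sketched as a small shell function. This is an illustration only: the name ignore_chunk_lists is hypothetical, the run directory is passed as an argument, and the bkill and lock-removal steps are not covered. The glob *chunkList*.txt only matches names ending in .txt, so the .txt.tmp files are left alone.

```shell
# Set aside chunkList files by appending ".ignore" to their names.
# Usage: ignore_chunk_lists RUN_DIR
# Files ending in .txt.tmp do not match the glob and are not touched.
ignore_chunk_lists() {
    run_dir=$1
    for f in "$run_dir"/*chunkList*.txt; do
        # When nothing matches, the glob stays literal; skip it.
        if [ -e "$f" ]; then
            mv -- "$f" "$f.ignore"
        fi
    done
}
```

Running ls on the run directory afterwards should show only .ignore and .txt.tmp variants of the chunkList files.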

...

NOTE: When repiping a process that has completed, it may be necessary to rename the chunkList files for that run before issuing the L1Proc command (in /nfs/farm/g/glast/u41/L1/runs/YYY/r0XXX ... where YYY is the first three digits of XXX).

...