Note: the xrd.pl in the above command is found at /afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrd.pl

 

Job fails to start (stays in PEND status)

The event that prompted this entry was the pair of deliveries for run 543796948.  The first delivery (180315009) had several recon jobs that landed on a dead host and had to be killed.  While they were hung, the rest of the events in the run arrived in delivery 180315010, and the findChunks task sat in a pending state waiting for its pre-exec command (/afs/slac/g/glast/isoc/flightOps/rhel6_gcc44/ISOC_PROD/bin/isoc run /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/5.6/lockFile.py) to give it the go-ahead.  After the hung recon tasks were killed, the reaper restarted them properly, and the run completed and cleaned up after itself.  The findChunks task, however, never started.  LSF normally retries the pre-exec command at regular intervals (every 2.5 minutes in this case; how do we check?), but it had not run it in over 2 hours (again, how do we check? Is there a maximum number of retries?).
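One possible way to approach the open "how do we check?" questions is sketched below. It only builds the standard LSF query commands (bjobs -l, bhist -l, bparams -a) and filters the bparams output for parameter names containing PREEXEC. Whether these show the pre-exec retry interval or a retry limit on this cluster is an assumption that has not been verified here, and the job ID is a placeholder.

```python
def inspection_commands(job_id):
    """Return LSF commands that may help explain why a job stays in PEND."""
    jid = str(job_id)
    return [
        ["bjobs", "-l", jid],  # long listing, includes the pending reasons
        ["bhist", "-l", jid],  # event history recorded for the job
        ["bparams", "-a"],     # all cluster parameters (filter for PREEXEC)
    ]


def filter_preexec(bparams_output):
    """Keep only the lines of `bparams -a` output that mention PREEXEC."""
    return [
        line.strip()
        for line in bparams_output.splitlines()
        if "PREEXEC" in line
    ]


if __name__ == "__main__":
    # 123456 is a placeholder job ID, not one taken from this incident.
    for cmd in inspection_commands(123456):
        print(" ".join(cmd))
```

Running the script prints the three commands so they can be pasted into a shell on a host with LSF available.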

When this happens:

1) First try stopping and restarting the job (bstop and bresume).  This will often kickstart things and get them going again.

2) If that doesn't work, just bkill the job and let the reaper resubmit it (which is what we had to do in this case).
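The two steps above could be scripted roughly as follows. This is a minimal sketch, not an existing pipeline tool: the function names, the 300-second wait (two of the 2.5-minute retry intervals), and the assumption that the default bjobs output has the job ID in the first column and the status in the third are all illustrative choices.

```python
import subprocess
import time


def parse_status(bjobs_output, job_id):
    """Return the STAT field for job_id from default bjobs output, or None."""
    for line in bjobs_output.splitlines():
        fields = line.split()
        # Assumed default column order: JOBID USER STAT QUEUE ...
        if len(fields) >= 3 and fields[0] == str(job_id):
            return fields[2]
    return None


def unstick(job_id, wait_seconds=300, run=subprocess.run, sleep=time.sleep):
    """Try bstop/bresume first; bkill the job if it is still pending."""
    jid = str(job_id)
    run(["bstop", jid], check=True)    # step 1: suspend the job ...
    run(["bresume", jid], check=True)  # ... and resume it
    sleep(wait_seconds)                # assumed grace period for a retry
    result = run(["bjobs", jid], capture_output=True, text=True)
    if parse_status(result.stdout, jid) == "PEND":
        run(["bkill", jid], check=True)  # step 2: let the reaper resubmit
        return "killed"
    return "resumed"


# Example (placeholder job ID): unstick(123456)
```

The run and sleep parameters exist only so the logic can be exercised without LSF; in normal use the defaults call the real commands.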