...

2) If that doesn't work, just bkill the process and let the reaper resubmit it (which is what we had to do in this case).

 

Handling a host with lots of hung jobs

This is only a partial description. The situation comes up when a large number of jobs are in the UNKWN state, all on the same host. Often, checking the status of the host (bhosts <hostname>) will tell you it is unavailable, in which case you're probably safe. If it is still open and accepting jobs, however, you want to keep it from getting new batch jobs, as they will just hang as well. According to Warren, your options are:

Close the host.

sudo badmin hclose -C "Witty comment" <hostname>

This requires sudo permission, which you probably don't have yet. The admins will want to limit that permission to a machine we own, so we should find a box other than my workstation to use for it.

 

Fill it up with dummy jobs.

for ii in $(seq 8) ; do bsub -q glastdataq -sp 90 -m <hostname> sleep 86400 ; done

# in bash, as glastraw
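The loop above can be wrapped in a small helper so the host name and job count are not hard-coded. This is only a sketch: fill_host and the DRY_RUN switch are hypothetical names, not existing tools, and the default of 8 jobs is copied from the loop above rather than from any knowledge of the host's slot count. The queue, priority, and sleep time are the ones used above.

```shell
# Hypothetical helper: submit (or just print) dummy sleep jobs pinned to one host.
# Usage: fill_host <hostname> [number_of_jobs]
# With DRY_RUN unset or set to 1, the bsub commands are only printed.
fill_host() (
    host=$1
    slots=${2:-8}   # assumption: 8 jobs, as in the one-liner above
    i=0
    while [ "$i" -lt "$slots" ]; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "bsub -q glastdataq -sp 90 -m $host sleep 86400"
        else
            bsub -q glastdataq -sp 90 -m "$host" sleep 86400
        fi
        i=$((i + 1))
    done
)
```

Running fill_host with the host name prints the commands for review; running it with DRY_RUN=0 set (as glastraw) actually submits them.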

 

So at the moment there is probably only one realistic option: filling the host with dummy jobs.

 

You should probably also notify unix-admin.
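For the initial check described at the top of this section (many UNKWN jobs on one host, and whether that host is still accepting work), a rough sketch like the following may help. The function names are made up for illustration, and the column positions assume the default layouts of bhosts (HOST_NAME STATUS ...) and bjobs -w (JOBID USER STAT QUEUE FROM_HOST EXEC_HOST ...); adjust them if your site's output differs.

```shell
# Sketch only: both functions read command output on stdin, so they can be
# tried on saved text as well as on live LSF output.

# Print the STATUS column for one host from `bhosts`-style text.
host_status() {
    awk -v h="$1" '$1 == h { print $2 }'
}

# Count UNKWN jobs per execution host from `bjobs -w`-style text,
# printing "<count> <host>" with the worst host first.
count_unkwn_by_host() {
    awk '$3 == "UNKWN" { n[$6]++ } END { for (h in n) print n[h], h }' | sort -rn
}

# Typical use on a machine with LSF installed (hostA is a placeholder):
#   bjobs -u all -w | count_unkwn_by_host
#   bhosts hostA | host_status hostA
```

A status of ok suggests the host is still taking jobs and needs one of the options above; a status such as unavail suggests it is already out of the pool.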