This seems to be an NFS error. I can't write to that directory as myself or as the glast account. Need to check on the proper account/permissions.

 

Handling Stuck Jobs

A stuck job usually shows up in the bjobs list in the UNKWN state, having consumed no CPU or memory, and having sat there for a long time (how long depends on the job; more data is needed).

The procedure is to log on as glastraw and bkill them. As Warren explained:

You can't roll back a running job. You have to bkill them and wait for the reaper to notice that they're dead. The pipeline will automatically retry them once for L1 (0 is the default, you can ask for up to 3). If the jobs you killed were retries, then you'll have to roll them back. I've never seen a job in unknown state recover by itself.
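
As a rough sketch of the procedure above, the helper below picks the job IDs in UNKWN state out of a bjobs listing. It assumes the default bjobs column layout (JOBID USER STAT ...); the function name unknown_job_ids is made up for illustration, and xargs -r assumes GNU xargs.

```shell
# Print the job IDs of jobs whose STAT column (3rd field) reads UNKWN.
# Reads default-format bjobs output on stdin: JOBID USER STAT QUEUE ...
# (unknown_job_ids is a hypothetical helper name, not an LSF command.)
unknown_job_ids() {
    awk '$3 == "UNKWN" { print $1 }'
}

# Usage, logged in as glastraw on a batch-enabled machine:
#   bjobs | unknown_job_ids                    # review the list first
#   bjobs | unknown_job_ids | xargs -r bkill   # then kill the stuck jobs
```

Reviewing the list before piping it to bkill is worthwhile, since a job can pass through UNKWN briefly without being truly stuck for long.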

 

Checking Host Status

Use bhosts to check the host status. The full list is long, so you will probably want to grep for the host name.
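
A minimal sketch of that filtering, which keeps the bhosts header line so the columns stay readable. The function name host_status is hypothetical, and it assumes the first line of bhosts output is the header.

```shell
# Keep the header line plus every line that mentions the given host name.
# Reads bhosts output on stdin. (host_status is a made-up helper name.)
host_status() {
    awk -v host="$1" 'NR == 1 || index($0, host) > 0'
}

# Usage:
#   bhosts | host_status <host name>
# bhosts also accepts host names as arguments, which avoids the long list:
#   bhosts <host name>
```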

 

Running a command directly on a Host

From Warren:

You can just immediately run a command on any batch host with lsrun. Obviously this should be used sparingly. uptime is a good cheap thing that will show you if the host is responsive and what the load is. Or check to see if a particular filesystem is causing trouble, or if some command works.
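
For illustration, the two checks Warren mentions might be wrapped as below. This is a sketch that assumes lsrun's -m option for selecting the host; the function names are hypothetical, and the path argument is whatever filesystem you suspect.

```shell
# Run uptime on one batch host to see whether it responds and what its load is.
# (check_host is a made-up helper name; lsrun -m selects the target host.)
check_host() {
    lsrun -m "$1" uptime
}

# List a directory on one batch host to see whether a filesystem is reachable there.
# (check_path_on_host is likewise hypothetical.)
check_path_on_host() {
    lsrun -m "$1" ls -ld "$2"
}

# Usage:
#   check_host <host name>
#   check_path_on_host <host name> <directory>
```

Both commands are cheap, in keeping with the advice to use lsrun sparingly.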