...

If you remove the above "#BSUB -q atlas-t3" directive, your job will be submitted to one of the general "fair share" queues (short, medium, long, and xlong); LSF chooses the queue for you based on the wall time you specify. The fair-share queues form a much larger pool of resources, are available to everyone at SLAC, and are heavily loaded most of the time.
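
For example, a minimal job script targeting the general queues might look like the following sketch; the wall-time limit, log-file name, and task command are placeholders for your own values:

Code Block
#!/bin/bash
#BSUB -W 2:00          # wall-time limit (hours:minutes); LSF picks a matching fair-share queue
#BSUB -o myjob.%J.log  # write job output to a log file named after the job ID
./run_analysis.sh      # placeholder for your actual task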

Tips:

Kill all your jobs

If you made a mistake and want to kill all your batch jobs, use the following command:

Code Block
# List your jobs, drop the header line, and run bkill on each job ID.
bjobs | grep -v JOBID | awk '{print "bkill", $1}' | sh
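
Alternatively, "bkill 0" kills all of your own jobs in a single step.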
Check the load of the cluster

To check how many jobs are currently running and pending in the atlas-t3 queue:

Code Block
bqueues atlas-t3
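
The NJOBS, PEND, and RUN columns in the output show the total, pending, and running job counts for the queue. To list the individual jobs of all users in the queue, you can use "bjobs -u all -q atlas-t3".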

You can also check the Ganglia monitoring page to see the load on the cluster.

Use the batch slots efficiently

If you find that your batch job is not resource efficient (the CPU/wall-time ratio is low AND memory (RAM) usage is less than 1.5GB; use "bjobs -l job_id" to find out), try putting two tasks in one batch job (in the above script) and letting them run in parallel, as sketched below. But be careful: don't run too many tasks in parallel in one batch job, because it is possible to overwhelm some components of the tightly coupled computing environment.
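
A minimal sketch of such a script (the two task commands are placeholders): start each task in the background with "&" and use "wait" so the batch job does not exit until both have finished.

Code Block
#!/bin/bash
#BSUB -q atlas-t3
#BSUB -o twotasks.%J.log
./task_one.sh &   # first task, run in the background
./task_two.sh &   # second task, run in the background
wait              # block until both background tasks have finished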

Use the following command to find out whether your jobs are reasonably efficient:

Code Block
$ export LSB_BJOBS_FORMAT="id:7 user:11 stat:5 exec_host:10 submit_time:13 start_time:13 max_mem:12 cpu_used run_time"
$ bjobs | grep DONE
JOBID   USER        STAT  EXEC_HOST  SUBMIT_TIME   START_TIME    MAX_MEM      CPU_USED RUN_TIME
867402  bpn7        DONE  atlprf11   Nov 19 16:18  Nov 20 02:09  654 Mbytes   462.3 second(s) 956 second(s)
...
 

The job used 462 CPU seconds during its 956-second run time (about 48% CPU efficiency). The maximum RAM it used was 654MB. The batch nodes are configured to accommodate 2GB of RAM per CPU (on average). The CPU time and RAM that are not used by the job are wasted. So in the above job script, one can put two such tasks in the script and let them run in parallel.
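
As a rough check, the CPU/wall-time ratio can be computed directly from that output. The awk field positions below assume the LSB_BJOBS_FORMAT string and date layout shown above, so treat this as an illustrative sketch:

Code Block
# With the format string above, $13 is CPU seconds and $15 is run-time seconds.
bjobs | grep DONE | awk '{printf "%s  %.0f%% CPU efficiency\n", $1, 100*$13/$15}'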