...
If you remove the "#BSUB -q atlas-t3" directive above, your job will be submitted to one of the general "fair share" queues (short, medium, long and xlong); LSF will choose the queue for you based on the wall time you specify. The general "fair share" queues form a much larger pool of resources. They are available to everyone at SLAC and are heavily loaded most of the time.
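As a minimal sketch of such a fair-share submission (the script name and the `my_analysis.sh` payload are hypothetical, not from this page):

```shell
# Job script with no "#BSUB -q" line: LSF picks short/medium/long/xlong
# based on the requested wall time. my_analysis.sh is a hypothetical payload.
cat > fairshare_job.lsf <<'EOF'
#!/bin/bash
#BSUB -W 2:00
#BSUB -o job_%J.out
./my_analysis.sh
EOF
# submit with: bsub < fairshare_job.lsf
```

Here `-W 2:00` requests two hours of wall time (which drives the queue choice) and `-o job_%J.out` writes output to a file named after the job ID.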
Tips:
Kill all your jobs
If you made a mistake and want to kill all your batch jobs, use the following command
```shell
bjobs | grep -v JOBID | awk '{print "bkill", $1}' | sh
```
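Before piping the generated commands into `sh`, it can be worth previewing what the pipeline will run; the snippet below feeds it simulated `bjobs` output (the job IDs are made up for illustration). LSF also accepts `bkill 0` as a shortcut to kill all of your own jobs.

```shell
# Preview the commands by dropping the final "| sh" from the pipeline.
# The bjobs output below is simulated; the job IDs are hypothetical.
printf 'JOBID USER STAT\n867402 bpn7 RUN\n867403 bpn7 PEND\n' |
  grep -v JOBID | awk '{print "bkill", $1}'
# prints:
#   bkill 867402
#   bkill 867403
```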
Check the load of the cluster
To check how many jobs are currently running and pending in the atlas-t3 queue:

```shell
bqueues atlas-t3
```
You can also check the Ganglia monitoring page to see the load on the cluster.
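If you want the totals programmatically, the NJOBS, PEND and RUN columns of `bqueues` (fields 8-10 in its standard layout) can be pulled out with awk. The sample output line below is fabricated for illustration; against the cluster you would pipe `bqueues atlas-t3` in instead.

```shell
# Parse the NJOBS/PEND/RUN columns from bqueues-style output.
# The two lines below are a made-up sample of that output.
printf '%s\n%s\n' \
  'QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP' \
  'atlas-t3 50 Open:Active - - - - 120 40 80 0' |
  awk 'NR==2 {print "njobs=" $8, "pend=" $9, "run=" $10}'
# prints: njobs=120 pend=40 run=80
```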
Use the batch slots efficiently
If you find that your batch job is not resource efficient (the CPU/wall-time ratio is low and the memory (RAM) usage is low, less than 1-1.5GB; use "bjobs -l job_id" to find out), try putting two tasks in one batch job (in the above script) and letting them run in parallel. But be careful: don't run too many parallel tasks in one batch job, because it is possible to overwhelm some components of the tightly coupled computing environment.
Use the following command to find out whether your jobs are reasonably efficient:
```shell
$ export LSB_BJOBS_FORMAT="id:7 user:11 stat:5 exec_host:10 submit_time:13 start_time:13 max_mem:12 cpu_used run_time"
$ bjobs | grep DONE
JOBID   USER        STAT  EXEC_HOST  SUBMIT_TIME   START_TIME    MAX_MEM     CPU_USED         RUN_TIME
867402  bpn7        DONE  atlprf11   Nov 19 16:18  Nov 20 02:09  654 Mbytes  462.3 second(s)  956 second(s)
...
```
The job used 462 CPU seconds during its 956-second run time, and the maximum RAM it used was 654MB. The batch nodes are configured to provide 2GB of RAM per CPU (on average), and any CPU time and RAM the job does not use is wasted. So in the above job script, one can put two such tasks in the script and let them run in parallel.
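A minimal sketch of such a two-task script body, assuming the tasks are independent (here `sleep` and the task names stand in for real work, which are hypothetical):

```shell
# Run two independent tasks in parallel inside one batch job, then
# wait for both before the job exits. "sleep 1" stands in for real work.
task() { sleep 1; echo "task $1 done"; }

task A &   # first task in the background
task B &   # second task in the background
wait       # block until both background tasks have finished
echo "all tasks finished"
```

The trailing `wait` matters: without it the job script would exit while the background tasks are still running, and LSF would consider the job finished.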