SLAC uses the IBM LSF (Load Sharing Facility) batch system. LSF replicates your current environment when you submit a job. This includes your current working directory and any Unix environment variables you have set. The following are examples of using LSF:

  • Submit a job.
$ cat myjob.sh
#!/bin/sh
#BSUB -W180
pwd
echo "hello world"

$ bsub < myjob.sh
Job <96917> is submitted to default queue <medium>.

This will submit a job to LSF. The "pwd" command prints the job's working directory, which should be the directory from which the job was submitted. The #BSUB -W180 directive tells LSF that the job's maximum run time (wall clock time) is 180 minutes; after that the job will be killed. If #BSUB -Wnnn isn't specified, your job gets the default, which is 30 minutes.

  • Check job status.
$ bjobs
$ bjobs -l <job ID>
  • Kill a job.
$ bkill <job ID>

Please refer to the LSF documentation to get familiar with the basic usage of LSF.

...

SLAC ATLAS users have their own dedicated LSF queue and resources. They can also use the "general fairshare" queues, which are available to everyone at SLAC.

Dedicated LSF resource for ATLAS users

SLAC ATLAS users can run jobs in a dedicated LSF queue, "atlas-t3". The following command shows who can use the dedicated LSF resources, and who can add users to or remove users from them.

...

With the two "&" at the end of the task lines (task1 and task2), the two tasks run simultaneously. If you want them to run sequentially, remove the two "&".
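
The effect of the trailing "&" can be sketched with a small stand-alone script. This is only an illustration under stated assumptions: task1 and task2 here are hypothetical stand-ins (a sleep and an echo), and the log file names and the run_parallel helper are made up for the example. The "wait" line is there so the script does not reach its end while the background tasks are still running.

```shell
#!/bin/sh
# Hypothetical stand-ins for the real workloads.
task1() { sleep 1; echo "task1 done"; }
task2() { sleep 1; echo "task2 done"; }

# Scratch directory for the per-task logs (assumed name, for illustration).
logdir=$(mktemp -d)

run_parallel() {
    task1 > "$logdir/task1.log" 2>&1 &   # "&" puts task1 in the background
    task2 > "$logdir/task2.log" 2>&1 &   # task2 starts without waiting for task1
    wait                                 # block until both background tasks finish
}

run_parallel
cat "$logdir/task1.log" "$logdir/task2.log"
```

Removing the two "&" (the "wait" then has nothing to wait for) makes task2 start only after task1 has finished.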

Batch resource for general users

If you remove the above "#BSUB -q atlas-t3" directive, your job will be submitted to one of the general "fair share" queues (short, medium, long and xlong). LSF chooses the queue for you based on the wall time you specify. The general "fair share" queues draw on a much larger pool of resources and are available to everyone at SLAC, but they are heavily loaded most of the time.

Tips:

Kill all your jobs

If you made a mistake and want to kill all your batch jobs, use the following command:

bjobs | grep -v JOBID | awk '{print "bkill", $1}' | sh 
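
Before piping anything to "sh", it can help to preview what the pipeline would run. The sketch below applies the same grep and awk filter as the command above to a made-up sample of bjobs output (the user name, host names and the second job ID are invented for illustration, and sample_bjobs and list_kill_commands are hypothetical helper names). It leaves out the final "| sh", so no job is actually killed.

```shell
#!/bin/sh
# Made-up sample of bjobs output; real output comes from running bjobs.
sample_bjobs() {
cat <<'EOF'
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
96917   alice   RUN   medium     host01      host11      myjob.sh   Jan  1 10:00
96918   alice   PEND  medium     host01                  myjob.sh   Jan  1 10:01
EOF
}

# Drop the header line, then turn the first column (the job ID)
# into a "bkill <job ID>" command. Nothing is executed here.
list_kill_commands() {
    grep -v JOBID | awk '{print "bkill", $1}'
}

sample_bjobs | list_kill_commands
```

Replacing sample_bjobs with the real bjobs command gives a dry run of the kill-all pipeline against your own jobs.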
Check the load of the cluster

To check how many jobs are currently running and pending in the atlas-t3 queue:

...

You can also check the Ganglia monitoring page to see the load on the cluster.

Use the batch slots efficiently

If you find that your batch job does not use its resources efficiently (the CPU/wall time ratio is low AND memory (RAM) usage is low, i.e. less than 1.5GB; use "bjobs -l job_id" to find out), try putting two tasks in one batch job (as in the above script) and letting them run in parallel. But be careful: don't run too many tasks in parallel in one batch job, because this can overwhelm some components of the tightly coupled computing environment.
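
The CPU/wall time ratio mentioned above is a simple division. The helper below is a minimal sketch, assuming you have already read the CPU time and the elapsed wall time (both in whole seconds) by hand from the "bjobs -l job_id" output; cpu_efficiency is a hypothetical name, and what counts as a "low" ratio is left to your own judgment.

```shell
#!/bin/sh
# cpu_efficiency CPU_SECONDS WALL_SECONDS
# Prints the CPU/wall time ratio rounded to two decimals.
cpu_efficiency() {
    if [ "$2" -le 0 ]; then
        echo "wall time must be positive" >&2
        return 1
    fi
    awk -v cpu="$1" -v wall="$2" 'BEGIN { printf "%.2f\n", cpu / wall }'
}

# Example: a job that used 900 CPU seconds over one hour of wall time.
cpu_efficiency 900 3600
```

A ratio close to 1 means the job kept its slot busy; a much smaller value, together with memory usage below 1.5GB, suggests the job is a candidate for running two tasks side by side.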

...