Skip to end of metadata
Go to start of metadata

SLAC uses the LSF (Load Sharing Facility) batch system. LSF replica your current environment setup when submitting jobs. This includes your current working directory and any Unix environment variable setups. The following are examples of using LSF: 

  • Submit a job.
$ cat myjob.sh
#!/bin/sh
#BSUB -W180
pwd
echo "hello world"

$ bsub < myjob.sh
Job <96917> is submitted to default queue <medium>.

This will submit a job to LSF. The "pwd" command will print out the job's working directory, which should be the same directory where this job is submitted. The #BSUB -W180 directive tells LSF that the job's maximum run time limit (wall clock time) is 180 minutes. After that the job will be killed. If #BSUB -Wnnn isn't specified, your job get the default, which is 30 minutes.

  • Check job status.
$ bjobs
$ bjobs -l <job ID>
  • Kill a job.
$ bkill <job ID>

Please refer to the LSF document to get familiar with the basic usage of LSF.


LSF resource available to SLAC ATLAS users: (this looks old.)

SLAC ATLAS users have their own dedicate LSF queue and resource. They can also use the "general fairshare" queues. The latter are available to everyone at SLAC.

Dedicated LSF resource for ATLAS users

SLAC ATLAS users can run jobs in a dedicated LSF queue "atlas-t3". The following command show who can use the dedicate LSF resource, and who can add/remove users to the dedicated resource.

The above shows the UNIX group "atlas-t3". People in the "Owners" line can add/remove members of this group. People in the "Member" line can run jobs in the dedicate queue. (Owners are not members).

The following is an example job script for users to submit jobs to the atlas-t3 queue:

In the above script, #BSUB... lines are more than just comments in this unix Bourne Shell script. They are also directives for the LSF "bsub" command. The first two #BSUB directives tell LSF that the batch queue is "atlas-t3" and the wall time limit is 120 minutes. Please always specify a wall time. Otherwise, your jobs will be killed after 30 minutes (wall time). The third #BSUB directive is optional. It tells LSF that the job wants to run on RHEL6 platform ("rhel60") with cvmfs ("cvmfs") and outbound internet connection ("inet"), and that the job needs up to 5GB of space under /scratch, 1000MB of RAM (these are advises to the LSF scheduler, not caps or limits).

With the two "&" at the end of the task lines (task1 and task2), the two tasks run simultaneously. If you want them to run sequentially, remove the two "&".

Batch resource for general users

If you remove the above "#BSUB -q atlas-t3" directive, your job will be submitted to a general "fair share" queue. This is one of a group of batch queues (short, medium, long and xlong). LSF will choose the queue for you based on the wall time specified. The general "fair share" queues are a much larger pool of resource. They are available to everyone at SLAC. They are heavily loaded most of the time.

Tips:

Kill all your jobs

If you made a mistake and want to kill all your batch jobs, use the following command

Check the load of the cluster

To check how many jobs are currently running and pending in atlas-t3 queue:

You can also check the Ganglia monitoring page to see the load on the cluster.

Use the batch slots efficiently

If you find that your batch job is not resource efficient (CPU/wall time ratio is low AND memory (RA(M) usage is lower (less than 1.5GB, use "bjobs -l job_id" to find out)), try putting two tasks in one batch job (in the above script) and let them run in parallel. But be careful. Don't run too many tasks in a batch job in parallel because it is possible to overwhelm some components in the tightly coupled computing environment. 

Use the following command to find out if your jobs are reasonable efficient:

The job used 462 CPU seconds during its 956 seconds run time. The maximum RAM it used was 654MB. The batch nodes are configured to accommodate 2GB RAM per CPU (on average). The CPU time and RAM that are not used by the job is wasted. So in the above job script, one can put two of such tasks in the script and let them run in parallel.

  • No labels