under construction ...

SLAC uses the IBM LSF (Load Sharing Facility) batch system. Please refer to the LSF documentation to get familiar with the basic usage of LSF.

LSF resources available to SLAC ATLAS users:

SLAC ATLAS users have their own dedicated LSF queue and resources. They can also use the "general fairshare" queues available to everyone at SLAC.

Dedicated LSF resources for ATLAS users

SLAC ATLAS users can run jobs in a dedicated LSF queue, "atlas-t3". The following command shows who can use the dedicated LSF resources, and who can add users to or remove users from them.

$ ypgroup exam -group atlas-t3
Group 'atlas-t3':
	GID:     3104
	Comment: 
	Last modified at Oct 14 00:22:52 2015 by yangw
	Owners:  sch, sudong, young, zengq 
	Members: acukierm, bpn7, laurenat, makagan, osgatlas01, rubbo, zengq, zihaoj

	This is a secondary group.

The above shows the UNIX group "atlas-t3". People in the "Owners" line can add and remove members of this group. People in the "Members" line can run jobs in the dedicated queue. (Owners are not automatically members; an owner who wants to run jobs must also appear in the "Members" line.)
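
To check quickly whether your own account is in the group, something like the following sketch may help. It assumes that the host you are logged in to resolves the same UNIX groups that ypgroup reports, so that "id" lists atlas-t3 for members. The helper name in_group is made up for this example.

```shell
#!/bin/sh
# in_group GROUP: succeed if the current user belongs to the UNIX group GROUP.
# (in_group is a hypothetical helper name, not an LSF or SLAC command.)
in_group() {
    for g in $(id -Gn); do
        [ "$g" = "$1" ] && return 0
    done
    return 1
}

if in_group atlas-t3; then
    echo "you are a member of atlas-t3"
else
    echo "you are not a member of atlas-t3; ask one of the group owners to add you"
fi
```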

The following is an example job script for users to submit jobs to the atlas-t3 queue:

$ cat job-script.sh 
#!/bin/sh
# run in LSF queue atlas-t3 and run up to 120 minutes (wall time)
#BSUB -q atlas-t3
#BSUB -W 120
#BSUB -R "select[rhel60 && cvmfs && inet] rusage[scratch=5.0, mem=1000:decay=0]"

# create a unique working directory on batch node's /scratch space
myworkdir=/scratch/`uname -n`$$
mkdir $myworkdir
cd $myworkdir

# run payload
task1 < input_of_task1 > output_of_task1 2>&1 &
task2 < input_of_task2 > output_of_task2 2>&1 &
wait  # wait for the tasks to finish 

# save the output to storage, use either "cp" to copy to NFS spaces, or "xrdcp" to copy to the xrootd spaces
cp myoutput_file /nfs/slac/g/atlas/u02/myoutput_file  
xrdcp myoutput_file root://atlprf01:11094//atlas/local/myoutput_file

# clean up
cd ..
rm -rf $myworkdir

$ bsub < job-script.sh  # submit the job

In the above script, the first two #BSUB directives tell LSF that the batch queue is "atlas-t3" and that the wall time limit is 120 minutes. Please always specify a wall time; otherwise your jobs will be killed after 30 minutes of wall time. The third #BSUB directive is optional. It tells LSF that the job wants to run on the RHEL6 platform ("rhel60") with cvmfs ("cvmfs") and an outbound internet connection ("inet"), and that the job needs up to 5GB of space under /scratch and 1000MB of RAM. These resource figures are hints to the LSF scheduler, not caps or limits.
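
The script above removes its working directory only if it reaches the final "rm -rf" line. The following is a minimal sketch of one way to clean up even when the job exits early, using a shell "trap". It is a variation for illustration, not part of the original script. The fallback to $TMPDIR or /tmp is an assumption added so that the sketch can be tried on a machine without /scratch; on a batch node the directory would be under /scratch as before.

```shell
#!/bin/sh
# Use /scratch as in the job script; fall back to a temporary directory
# (an assumption for this sketch) when /scratch is missing or not writable.
scratch_base=/scratch
if [ ! -d "$scratch_base" ] || [ ! -w "$scratch_base" ]; then
    scratch_base=${TMPDIR:-/tmp}
fi

# unique working directory: host name followed by this shell's process ID
myworkdir=$scratch_base/$(uname -n)$$
mkdir -p "$myworkdir" || exit 1

# remove the working directory whenever the script exits, even after a failure
cleanup() {
    cd / && rm -rf "$myworkdir"
}
trap cleanup EXIT

cd "$myworkdir" || exit 1
# ... run the payload and copy the output to storage here ...
```

With the trap in place, the explicit "cd .." and "rm -rf" at the end of the script are no longer needed.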

With the two "&" at the end of the task lines (task1 and task2), the two tasks run simultaneously. If you want them to run sequentially, remove the two "&".
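
A plain "wait" does not tell you whether each background task succeeded. The sketch below shows one possible way to collect the exit status of each task by waiting on its process ID. task1 and task2 here are stand-in shell functions invented for the example (the second one fails on purpose), and the input/output redirection of the real script is left out to keep the example short.

```shell
#!/bin/sh
# Stand-ins for the real payload (hypothetical, for illustration only)
task1() { sleep 1; echo "task1 finished"; }
task2() { sleep 1; echo "task2 failed"; return 3; }

# start both tasks in the background and remember their process IDs
task1 &
pid1=$!
task2 &
pid2=$!

# "wait PID" returns the exit status of that particular task
wait "$pid1"; status1=$?
wait "$pid2"; status2=$?

echo "task1 exit status: $status1"
echo "task2 exit status: $status2"
```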

Batch resources for general users

If you remove the above "#BSUB -q atlas-t3" directive, your job will be submitted to a general "fair share" queue. This is one of a set of batch queues (short, medium, long and xlong). LSF will choose the queue for you based on the wall time specified. The general "fair share" queues are a much larger pool of resources. They are available to everyone at SLAC, and are therefore heavily loaded most of the time.

Tips:

If you made a mistake and want to kill all your batch jobs, use the following command:

bjobs | grep -v JOBID | awk '{print "bkill", $1}' | sh 
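
To see what this pipeline would do before running it, drop the final "| sh" so that the bkill commands are printed instead of executed. The sketch below does this on a made-up sample of bjobs output (the job IDs, user name and host names are invented), so it can be tried without any jobs in the queue.

```shell
#!/bin/sh
# Made-up sample of "bjobs" output, for illustration only
sample_bjobs_output='JOBID   USER     STAT  QUEUE     FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
123456  someuser RUN   atlas-t3  host1      host2      job1      Oct 14 10:00
123457  someuser PEND  atlas-t3  host1                 job2      Oct 14 10:01'

# Same filter as in the tip, without the final "| sh": drop the header line,
# then turn the first column (the job ID) of each line into a bkill command.
kill_commands=$(printf '%s\n' "$sample_bjobs_output" | grep -v JOBID | awk '{print "bkill", $1}')
printf '%s\n' "$kill_commands"
```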

To check how many jobs are currently running and pending in the atlas-t3 queue:

bqueues atlas-t3

If you find that your batch job is not resource efficient (the CPU/wall time ratio is low, AND memory usage is low (less than 1GB, see "bjobs -l job_id"), AND the job doesn't require a lot of network IO), try putting two tasks in one batch job, as in the above script. But be careful. Don't put too many tasks in a batch job, because it is quite possible that the low efficiency is caused by an unknown limiting factor.
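
The CPU/wall time ratio mentioned above can be worked out by hand from the CPU time and the run time of a finished job. A small sketch of the arithmetic follows; the helper name cpu_efficiency and the numbers are hypothetical, and the values are assumed to be in seconds.

```shell
#!/bin/sh
# cpu_efficiency CPU_SECONDS WALL_SECONDS
# Print CPU time divided by wall time as a whole-number percentage.
# (cpu_efficiency is a hypothetical helper, not an LSF command.)
cpu_efficiency() {
    awk -v cpu="$1" -v wall="$2" 'BEGIN { printf "%.0f\n", 100 * cpu / wall }'
}

# hypothetical job: 1800 seconds of CPU time over 3600 seconds of wall time
efficiency=$(cpu_efficiency 1800 3600)
echo "CPU efficiency: ${efficiency}%"
```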

 
