Batch resources available to the SLAC Analysis Computing Facility users

SLAC uses the LSF (Load Sharing Facility) batch system. LSF replicates your current environment setup when submitting jobs. This includes your current working directory and any Unix environment variables you have set. All batch nodes available to the ATLAS users have

  • CVMFS (/cvmfs/atlas.cern.ch, /cvmfs/sft.cern.ch, etc.)

...

  • Access to SLAC networked storage

...

  • (/nfs, /gpfs, etc.)
  • Large /scratch space for temporary use (please clean up your files from there after your job finishes).
  • Outbound network connectivity.
  • Singularity containers on all batch nodes that run the CentOS 7 operating system.
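
Because LSF carries your working directory and environment into the job, a submission can be as simple as the minimal sketch below (MY_OPTION, /path/to/your/work/area and run_analysis.sh are placeholders, not site-specific names):

  • export MY_OPTION=foo ; cd /path/to/your/work/area
  • bsub -o myjob.%J.log ./run_analysis.sh   (the job starts in that directory with MY_OPTION already set; %J is replaced by the job ID)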

The following are examples of using LSF: 

...
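
A few everyday LSF commands for monitoring and controlling jobs, shown as a minimal sketch (the queue name, job ID and script name are placeholders):

  • bsub -q <queue> -o myjob.%J.log ./run_analysis.sh   (submit a job and write its output to a log file)
  • bjobs   (list your pending and running jobs)
  • bpeek <jobID>   (peek at the stdout of a running job)
  • bkill <jobID>   (kill a job)
  • bqueues   (list the available queues)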

SLAC batch resources consist of several generations of hardware. They are listed at the stakeholder's priority page. Some of the batch nodes run the RHEL 6 operating system, while others run the CentOS 7 operating system. Singularity container technology is available on the CentOS 7 batch nodes.

  • To run your job on a RHEL 6 batch node only, use:  bsub -R "select[rhel6]" ...

  • To run your job on a CentOS 7 batch node only, use: bsub -R "select[centos7]" ...

...

  • bsub -n 4 -R 'span[hosts=1]' ... will submit a job requesting 4 cores and (4 x 4GB =) 16GB of RAM, and allocate all 4 cores on one machine (this is what "span[hosts=1]" is for).

Of course, the more resources you request, the harder the job is to schedule, and hence the longer the pending time will be.
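
Putting the two together, a sketch of a 4-core job pinned to a CentOS 7 node (run_analysis.sh is a placeholder, and whether you invoke a Singularity image this way depends on your workflow):

  • bsub -n 4 -R "select[centos7] span[hosts=1]" ./run_analysis.sh
  • bsub -R "select[centos7]" singularity exec my_image.sif ./run_analysis.sh   (my_image.sif is a hypothetical image file; Singularity is only available on the CentOS 7 nodes, as noted above)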

Below is a longer example of resource selection in LSF, for you to pick and choose from:

  •  -R "select[ ! preempt && rhel60 && cvmfs && inet && bullet && hname != bullet0030] rusage[scratch=5.0:duration=1440:decay=1, mem=2000:decay=0] span[hosts=1]" 

It requests the job be dispatched to a machine where

  1. Your job won't be preempted by someone else's higher-priority job.
  2. The machine runs the RHEL 6 operating system (we have "rhel60" and "centos7").
  3. The machine should have CVMFS,
  4. and an outbound internet connection (the "inet" keyword above).
  5. The machine should be part of the "bullet" cluster (other clusters we have: fell, hequ, dole, kiso, deft and bubble; all run "rhel60" except the last two, which run "centos7"), and should not be host bullet0030.
  6. Reserve 5GB of free space under /scratch; your job will reserve it for 1440 minutes, and the reserved amount will decay linearly from 100% to 0 during this period.
  7. Reserve 2000MB of RAM, with no decay (the default).
  8. span[hosts=1] means that if you request more than one batch slot (the -n option above), all of them will be scheduled on one machine.

Note:

  • a. For 6 or 7 to work, the machine must have that amount of resource available at the time the job is dispatched.
  • b. The CVMFS cache is usually stored under /scratch/cvmfs2_cache. (Reserving /scratch space is a way to make sure there is free space for the CVMFS cache, so your job won't get errors when accessing CVMFS.)
  • c. Most SLAC batch users don't use 6 or 7, and even when they do, they can use more than the specified amounts (because the amounts in 6 and 7 are "reserved", not maximums). So after your job starts on a machine, something bad can still happen (running out of memory, /scratch space, etc.) due to other activities on that machine.
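
Putting the long resource string to use, a complete submission might look like the sketch below (the core count and job script are placeholders; the -R string is the one explained above):

  • bsub -n 2 -R "select[ ! preempt && rhel60 && cvmfs && inet && bullet && hname != bullet0030] rusage[scratch=5.0:duration=1440:decay=1, mem=2000:decay=0] span[hosts=1]" ./run_analysis.sh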

 

Please refer to the LSF documentation to get familiar with the basic usage of LSF.

Batch resources available to the SLAC ATLAS Department users

...

The above information describes running batch jobs on the "general fair share" queues, which are available to everyone who has a SLAC Unix account. ATLAS users have a relatively higher priority on those resources according to the stakeholder's priority page. The following info is for SLAC ATLAS Department users only.

Private batch resource owned by the SLAC ATLAS Department

In addition to the "general fair share" resource, SLAC ATLAS Department

...

SLAC ATLAS users have their own dedicated LSF queue and resources. They can also use the "general fairshare" queues. The latter are available to everyone at SLAC.

Dedicated LSF resource for ATLAS users

SLAC ATLAS users can run jobs in a dedicated LSF queue, "atlas-t3". The corresponding batch cluster runs RHEL 6. The following commands show who can use the dedicated LSF resources, and who can add/remove users to the dedicated resources.

...
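
To send a job to the dedicated ATLAS queue, the submission is the same as above but with the queue named explicitly, as in this sketch (run_analysis.sh is a placeholder):

  • bsub -q atlas-t3 -o myjob.%J.log ./run_analysis.sh
  • bqueues -l atlas-t3   (shows the queue's configuration, limits, and the users/groups allowed to use it)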