In 2021, LCLS switched to the SLURM batch system.

Information on submitting jobs to the SLURM system at LCLS can be found on this page: Submitting SLURM Batch Jobs

Information on the Automatic Run Processing system (ARP) can be found on this page: Automatic Run Processing (ARP).  This is also usable at sites like NERSC and SDF.

A "cheat sheet" showing similar commands on LSF and SLURM can be found here: https://slurm.schedmd.com/rosetta.pdf

Refer to the table below for the batch resources available in psana. Submit your job from an interactive node (where you land after doing ssh psana). All nodes in the queues listed below run RHEL7. By submitting from an interactive node, also running RHEL7, you will ensure that your job inherits a RHEL7 environment.
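
For example (a minimal check; the exact release string will vary):

No Format

ssh psana                  # lands on an interactive RHEL7 node
cat /etc/redhat-release    # confirm the node is running RHEL7 before submitting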

Note 1: Jobs for the current experiment can be submitted to fast feedback (FFB) queues, which allocate resources for the most recent experiments. The FFB queues in the tables below are for LCLS-II experiments (TMO, RIX and UED). The FEH experiments (LCLS-I, including XPP) can submit FFB jobs to the new Fast Feedback System.

Warning

As of February 2023, the offline compute resources have been consolidated into the psanaq. The priority queues have been removed.

Queue name | Node names on SLURM queues | Number of Nodes | Comments            | Throughput [Gbit/s] | Cores/Node | RAM [GB/node] | Time limit
psanaq     | psana15xx, psana16xx       | 34              | Primary psana queue | 40                  | 16         | 128           | 48 hrs
psanagpuq  | psanagpu113-psanagpu118    | 6               | GPU nodes           | 10                  | 16         | 128           | 48 hrs
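
As an illustration, jobs could be submitted to these SLURM queues along the lines of the sketch below (my_job.sh and my_gpu_job.sh are placeholder scripts; depending on site configuration a GPU request such as --gres may also be needed):

No Format

# primary psana queue: 16 tasks, 4-hour limit (within the 48 hr cap), output file named after the job ID
sbatch -p psanaq --ntasks=16 --time=04:00:00 --output=%j.out my_job.sh

# GPU nodes
sbatch -p psanagpuq --output=%j.out my_gpu_job.sh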

Batch Nodes

Depending on your data access, you may need to submit jobs to a specific farm. This is accomplished by submitting to the appropriate LSF batch queue; refer to the table below. Jobs for the running experiment should be submitted to the high priority queues psnehq and psfehq. Multi-core OpenMPI jobs should be run in either the psnehmpiq or psfehmpiq batch queue; see the following section on "Submitting OpenMPI Batch Jobs". Simulation jobs should be submitted to the low priority queues psnehidle and psfehidle. CPU-intensive jobs that do not demand high data throughput should be submitted to the psanacsq queue. Example submissions are shown after the table.

Experimental Hall | Queue     | Nodes                | Data                       | Comments
NEH               | psnehq    | psana11xx            | ana01, ana02               | Jobs <= 6 cores
NEH               | psnehmpiq | psana11xx, psana12xx | ana01, ana02               | OpenMPI jobs > 6 cores, preemptable
NEH               | psnehidle | psana12xx            |                            | Simulations, preemptable, low priority
FEH               | psfehq    | psana13xx            | ana11, ana12               | Jobs <= 6 cores
FEH               | psfehmpiq | psana13xx, psana14xx | ana11, ana12               | OpenMPI jobs > 6 cores, preemptable
FEH               | psfehidle | psana14xx            |                            | Simulations, preemptable, low priority
NEH/FEH           | psanacsq  | psanacs0xx           | ana01, ana02, ana11, ana12 | CPU intensive, limited data throughput
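
To make the queue choice concrete, a couple of hedged examples (my_sim and my_analysis are placeholder executables):

No Format

# low priority simulation job on the NEH idle queue
bsub -q psnehidle -o ~/output/sim.out my_sim

# CPU-intensive job with modest data throughput needs
bsub -q psanacsq -o ~/output/analysis.out my_analysis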

Submitting Batch Jobs

LSF (Load Sharing Facility) is the job scheduler used at SLAC to execute user batch jobs on the various batch farms. LSF commands can be run from a number of SLAC servers, but it is best to use psexport or pslogin. Log in first to pslogin (from SLAC) or to psexport (from anywhere). From there you can submit a job with the following command:

No Format

bsub -q psnehq -o <output file name> <job_script_command>

For example:

No Format

bsub -q psnehq -o ~/output/job.out my_program

This will submit a job (my_program) to the queue psnehq and write its output to a file named ~/output/job.out. You may check on the status of your jobs using the bjobs command.

Resource requirements can be specified using the "-R" option. For example, to make sure that a job is run on a node with 1 GB (or more) of available memory, use the following:

No Format

bsub -q psnehq -R "rusage[mem=1024]" my_program

Submitting OpenMPI Batch Jobs

The RedHat-supplied OpenMPI packages are installed on pslogin, psexport and all of the psana batch servers.

The system default has been set to the current version as supplied by RedHat.

No Format

$ mpi-selector --query
default:openmpi-1.4-gcc-x86_64
level:system

Your environment should be set up to use this version (unless you have used RedHat's mpi-selector script, or your login scripts, to override the default). You can check that your PATH is correct by issuing the command which mpirun. Currently, this should return /usr/lib64/openmpi/1.4-gcc/bin/mpirun. Future updates to the MPI version may change the exact details of this path.

In addition, your LD_LIBRARY_PATH should include /usr/lib64/openmpi/1.4-gcc/lib (or something similar).
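
A quick way to verify both settings (the paths shown match those quoted above and may differ after future MPI updates):

No Format

which mpirun                                         # expect something like /usr/lib64/openmpi/1.4-gcc/bin/mpirun
echo $LD_LIBRARY_PATH | tr ':' '\n' | grep openmpi   # the matching lib directory should appear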

For notes on compiling examples, please see:

http://www.slac.stanford.edu/comp/unix/farm/mpi.html 
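
For instance, a simple MPI program can be built with the OpenMPI compiler wrapper before submission (hello.c is assumed to be your own MPI source file; the output path matches the ~/bin/hello used in the examples below):

No Format

mpicc -o ~/bin/hello hello.c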

The following are examples of how to submit OpenMPI jobs to the PCDS psnehmpiq batch queue:

No Format

bsub -q psnehmpiq -a mympi -n 32 -o ~/output/%J.out ~/bin/hello

Will submit an OpenMPI job (-a mympi) requesting 32 processors (-n 32) to the psnehmpiq batch queue (-q psnehmpiq).

No Format

bsub -q psfehmpiq -a mympi -n 16 -R "span[ptile=1]" -o ~/output/%J.out ~/bin/hello

Will submit an OpenMPI job (-a mympi) requesting 16 processors (-n 16) spanned as one processor per host (-R "span[ptile=1]") to the psfehmpiq batch queue (-q psfehmpiq).

No Format

bsub -q psfehmpiq -a mympi -n 12 -R "span[hosts=1]" -o ~/output/%J.out ~/bin/hello

Will submit an OpenMPI job (-a mympi) requesting 12 processors (-n 12) spanned all on one host (-R "span[hosts=1]") to the psfehmpiq batch queue (-q psfehmpiq).

Common LSF Commands

Report the status of all jobs (running, pending, finished, etc.) submitted by the current user:

Code Block

bjobs -w -a

Report only running or pending jobs submitted by user "radmer":

Code Block

bjobs -w -u radmer

Report running or pending jobs for all users in the psnehq queue:

Code Block

bjobs -w -u all -q psnehq

Kill a specific batch job based on its job ID number, where the "bjobs" command can be used to find the appropriate job ID (note that only batch administrators can kill jobs belonging to other users).

Code Block

bkill JOB_ID
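
If needed, all of your own jobs can be killed at once by passing job ID 0 (standard LSF behavior):

Code Block

bkill 0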

Report current node usage on the two NEH batch farms:

Code Block

bhosts -w ps11farm ps12farm

The following links give more detailed LSF usage information:

PowerPoint presentation describing LSF for LCLS users at SLAC

Batch system in a nutshell

...