
Slurm

  • Slurm is the new job scheduling system for the LCLS batch compute systems; it is replacing the current LSF system.
  • Currently only one queue exists, on the psanagpu nodes.
  • Existing batch nodes will be upgraded to Slurm, and new nodes will be installed with it.
  • The Slurm client is installed on the psana nodes (see the quick check below).
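
A quick way to confirm the client is available on the node you are logged into is to ask for its version; the commands below are only a minimal check, not part of any site-specific setup:

Code Block
languagebash
titleslurm client check
# Confirm the Slurm client commands are on the PATH
which sbatch srun squeue sinfo
# Print the installed Slurm client version
sinfo --version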

Partitions

Partition is the term Slurm uses for a queue. Depending on your job requirements, select the appropriate partition:

Partition | Nodes | Data | Cores per Node | Mem per Node (GB) | GPU Type | Time Limit (hr) | Priority | Comments | Location
anagpu (16 nodes) | psanagpu101 - psanagpu105 | ALL | 12 | 128 | NVIDIA GTX 1080Ti | 48 | | | B054 (SRCF)
anagpu (16 nodes) | psanagpu106 - psanagpu116 | ALL | 16 | 128 | NVIDIA GTX 1080Ti | 48 | | | B054 (SRCF)
*anabatch | | | | | | | | Default |

...

Code Block
languagebash
titlesinfo
psslurm ~ #> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
anagpu*      up   infinite      1  drain psanagpu107 
anagpu*      up   infinite      6   idle psanagpu[101-104,110,113] 

The * following the partition name indicates that anagpu is the default partition (queue).
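
If you need a partition other than the default, name it explicitly with --partition (or -p); the script name my_job.sh below is only a placeholder:

Code Block
languagebash
titleselecting a partition
# Submit a batch script to the anagpu partition explicitly (my_job.sh is a placeholder)
sbatch --partition=anagpu my_job.sh
# Show only the anagpu partition, one line per node
sinfo -p anagpu -N -l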

Job Submission

There are two ways to submit a job to the cluster. The main way is with the sbatch command, which takes full advantage of the computing power; the other is to submit an interactive job.
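As a sketch of the interactive route, srun can allocate resources and open a shell on the compute node; the resource options below mirror the sample batch script further down and are not required values:

Code Block
languagebash
titleinteractive job
# Allocate one GPU on the anagpu partition and start an interactive shell on the node
srun --partition=anagpu --gres=gpu:1080ti:1 --pty /bin/bash
# ...work on the node, then leave the shell to release the allocation
exit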

...

The following is a sample submission script (tst_script):

Code Block
titlesbatch
[omarq@psslurm conda 06:10:08]> cat tst_script 
#!/bin/bash
#
#SBATCH --job-name='name' # Job name for allocation
#SBATCH --output='filename' # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error='filename' # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=anagpu # Partition/Queue to submit job
#SBATCH --gres=gpu:1080ti:1 # Number of GPUs
#SBATCH --ntasks=8 # Total number of tasks
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --mail-user='username'@slac.stanford.edu # Receive e-mail from slurm
#SBATCH --mail-type=ALL # Type of e-mail from slurm; other options include BEGIN, END, FAIL, NONE
#
srun -l hostname
srun python ExampleMultipleChaperones.py


[omarq@psslurm conda 06:10:11]> sbatch tst_script 
Submitted batch job 187
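
After submission the job can be watched with squeue, and its output read from the file given by --output; the file name below assumes Slurm's default pattern slurm-<jobid>.out, which only applies if --output is not set:

Code Block
languagebash
titlechecking a submitted job
# Show the queue entry for job 187
squeue -j 187
# Follow the job's standard output (default name; use your --output file if you set one)
tail -f slurm-187.out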

...

Code Block
languagebash
titlesrun
[omarq@psslurm conda 07:29:39]> srun -N2 -n4 hello.mpi 
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu113 out of 1
Process 0 on psanagpu113 out of 1
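
Task placement can be controlled by combining a node count with a per-node task count; the command below is a harmless sketch that only prints hostnames, with -l labelling each output line with its task rank:

Code Block
languagebash
titletask placement
# Run 4 tasks as 2 tasks on each of 2 nodes, labelling output lines with the task rank
srun -N2 --ntasks-per-node=2 -l hostname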

...

Code Block
languagebash
titlesqueue
psslurm ~ #> squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               187    anagpu ExampleM    omarq  R       0:04      1 psanagpu110 

The ST (job state) field shows that jobid 187 is currently running (R).
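
On a shared cluster it is usually more useful to list only your own jobs:

Code Block
languagebash
titlesqueue for one user
# Show only jobs belonging to the current user
squeue -u $USER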

sacct

Get information about the status of finished jobs:

Code Block
languagebash
titlesacct
> sacct                  # show all finished jobs of the current user
> sacct -j <jobid>       # show info for specific job
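
The output can be tailored with --format and limited in time with --starttime; the field list below is one reasonable selection, not the only one:

Code Block
languagebash
titlesacct options
> sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,ExitCode   # selected fields, including job steps
> sacct --starttime=<YYYY-MM-DD>                                             # all of your jobs started since the given date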