
SLURM

...

SLURM is the new job scheduling system for the LCLS batch compute systems; it is replacing the current LSF system.

...

Generic documentation about SLURM can be found in this Quick Start User Guide. Even shorter documentation, some of it specific to psana, can be found on this page.

Partitions

Partition is the term used by SLURM to refer to queues. Depending on your job requirements, select the appropriate partition:

Partition  | Nodes | Data | Cores per Node                              | Mem per Node (GB) | GPU Type          | Time Limit (hr) | Priority | Comments | Location
anagpu     | 16    | ALL  | 12 (psanagpu101-105), 16 (psanagpu110-113)  | 128               | NVIDIA GTX 1080Ti | 48              |          |          | B054 (SRCF)
*anabatch  |       |      |                                             |                   |                   |                 |          | Default  |

anagpu: This 16-node partition is for individuals wishing to use GPU resources.

This information can be obtained with the sinfo command.

sinfo

From the psana nodes pool:

Code Block
languagebash
titlesinfo
> sinfo
PARTITION AVAIL  TIMELIMIT  NODES STATE  NODELIST 
anagpu*      up   infinite      9  down* psanagpu[101-109] 
anagpu*      up   infinite      4   idle psanagpu[110-113] 

The * following the name indicates the default partition (queue), which here is anagpu.
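
If needed, the partition can also be selected at submission time on the command line instead of inside the script; the following is a sketch (submit.sh stands for the submission script shown further down):

Code Block
languagebash
titlesbatch -p
> sbatch --partition=anagpu submit.sh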

Job Submission

There are two ways to submit a job on the cluster. The main way is to use the sbatch command, which queues a script for later execution and takes full advantage of the available computing power; the other is to submit an interactive job via srun.
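
For a quick test, a single command can also be wrapped into a batch job directly from the command line with sbatch's --wrap option; a minimal sketch (the jobid in the reply is a placeholder):

Code Block
languagebash
titlesbatch --wrap
> sbatch --wrap="hostname"
Submitted batch job <jobid>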

...

sbatch

The following is a simple submission script for a parallel psana batch job run with MPI. It can be submitted with the command "sbatch submit.sh". The commands specified in the script file will be run on the first available compute node(s) that fit the requested resources.

Code Block
psanagpu101:~$> cat submit.sh
#!/bin/bash

#SBATCH --partition=anagpu
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --output=%j.log

# -u flushes print statements which can otherwise be hidden if mpi hangs
mpirun python -u /reg/g/psdm/tutorials/examplePython/mpiDataSource.py

 

This script shows some additional features controllable via SLURM:

Code Block
titlesbatch
> cat tst_script 
#!/bin/bash
#
#SBATCH --job-name='name' # Job name for allocation
#SBATCH --output='filename' # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error='filename' # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=anagpu # Partition/Queue to submit job
#SBATCH --gres=gpu:1080ti:1 # Number of GPUs
#SBATCH --ntasks=8  # Total number of tasks
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --mail-user='username'@slac.stanford.edu # Receive e-mail from SLURM
#SBATCH --mail-type=ALL # Type of e-mail from SLURM; other options include BEGIN, END, FAIL
#
srun -l hostname
srun python ExampleMultipleChaperones.py


> sbatch tst_script 
Submitted batch job 187
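
Once the job has been submitted, its standard output can be followed in the file named by --output; with --output=%j.log as in the first script above, job 187 would write to 187.log. A sketch:

Code Block
languagebash
titlejob output
> tail -f 187.log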

srun

Unlike sbatch, the srun command does not return immediately; it waits for the job to complete. srun can also be used to get control of a node to run jobs interactively, which can be useful for data exploration and significant software development.

The following are a few examples:

Code Block
languagebash
titlesrun
>  srun -N2 -n4 hello.mpi 
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu113 out of 1
Process 0 on psanagpu113 out of 1
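
To get an interactive shell on a compute node, as mentioned above, something like the following sketch can be used (the partition and GPU type are taken from the table above; adjust to your needs):

Code Block
languagebash
titlesrun --pty
> srun --partition=anagpu --gres=gpu:1080ti:1 --pty /bin/bash
psanagpu110:~$ exit     # work interactively on the allocated node, then exit to release it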

Monitoring/Status

squeue

To check the jobs that exist on the system, use the squeue command:

...

The ST (job state) field shows that jobid 187 is currently running (R).
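
squeue can also filter the listing, for example by user or by jobid; a short sketch:

Code Block
languagebash
titlesqueue filters
> squeue -u $USER     # show only your own jobs
> squeue -j 187       # show a specific job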

sacct

Get information about the status of finished jobs:

Code Block
languagebash
titlesacct
> sacct     # show all finished jobs of the user
          JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
  ------------ ---------- ---------- ---------- ---------- ---------- -------- 
  141               sleep     anagpu                     1  COMPLETED      0:0 
  142                  ls     anagpu                     1  COMPLETED      0:0 
  143                exit     anagpu                     1     FAILED      2:0 

> sacct -j <jobid>       # show info for specific job  
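
The fields reported by sacct can be selected explicitly with the --format option; a sketch:

Code Block
languagebash
titlesacct --format
> sacct -j <jobid> --format=JobID,JobName,Partition,State,ExitCode,Elapsed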

 

...