
SLURM

...

SLURM is the new job scheduling system for the LCLS batch compute systems; it is replacing the current LSF system.

...

Generic documentation about SLURM can be found in this Quick Start User Guide. Even shorter documentation, some of it specific to psana, can be found on this page.

Partitions

Partition is the term used by SLURM to refer to queues. Depending on your job requirements, select the appropriate partition:

Partition  | Nodes | Data | Cores per Node                              | Mem per Node (GB) | GPU Type          | Time Limit (hr) | Priority | Comments | Location
anagpu     | 16    | ALL  | 12 (psanagpu101-105), 16 (psanagpu110-113)  | 128               | NVIDIA GTX 1080Ti | 48              |          |          | B054 (SRCF)
*anabatch  |       |      |                                             |                   |                   |                 |          | Default  |

anagpu: This 16-node partition is for individuals wishing to use GPU resources.

This information can be obtained with the sinfo command.

sinfo

From the psana nodes pool:

Code Block
languagebash
titlesinfo
> sinfo
PARTITION AVAIL  TIMELIMIT  NODES STATE  NODELIST 
anagpu*      up   infinite      9  down* psanagpu[101-109] 
anagpu*      up   infinite      4   idle psanagpu[110-113] 

The * following the name indicates the default partition (queue), which here is anagpu.
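
If needed, the partition can also be selected at submission time on the command line instead of inside the script; the following is a sketch (submit.sh stands for the submission script shown further down):

Code Block
languagebash
titlesbatch -p
> sbatch --partition=anagpu submit.sh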

Job Submission

There are two ways to submit a job on the cluster. The main way is to use the sbatch command, which queues a script for later execution and takes full advantage of the available computing power; the other is to submit an interactive job via srun.
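
For a quick test, a single command can also be wrapped into a batch job directly from the command line with sbatch's --wrap option; a minimal sketch (the jobid in the reply is a placeholder):

Code Block
languagebash
titlesbatch --wrap
> sbatch --wrap="hostname"
Submitted batch job <jobid>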

...

sbatch

The following is a simple submission script for a parallel psana batch job run with MPI. It can be submitted with the command "sbatch submit.sh". The commands specified in the script file will be run on the first available compute node(s) that fit the requested resources.

Code Block
psanagpu101:~$> cat submit.sh
#!/bin/bash

#SBATCH --partition=anagpu
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --output=%j.log

# -u flushes print statements which can otherwise be hidden if mpi hangs
mpirun python -u /reg/g/psdm/tutorials/examplePython/mpiDataSource.py

 

This script shows some additional features controllable via SLURM:

Code Block
titlesbatch
> cat tst_script 
#!/bin/bash
#
#SBATCH --job-name='name' # Job name for allocation
#SBATCH --output='filename' # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error='filename' # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=anagpu # Partition/Queue to submit job
#SBATCH --gres=gpu:1080ti:1 # Number of GPUs
#SBATCH --ntasks=8  # Total number of tasks
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --mail-user='username'@slac.stanford.edu # Receive e-mail from SLURM
#SBATCH --mail-type=ALL # Type of e-mail from SLURM; other options include BEGIN, END, FAIL
#
srun -l hostname
srun python ExampleMultipleChaperones.py


> sbatch tst_script 
Submitted batch job 187
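
Once the job has been submitted, its standard output can be followed in the file named by --output; with --output=%j.log as in the first script above, job 187 would write to 187.log. A sketch:

Code Block
languagebash
titlejob output
> tail -f 187.log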

srun

Unlike sbatch, the srun command does not return immediately; it waits for the job to complete. srun can also be used to get control of a node to run jobs interactively, which can be useful for data exploration and significant software development.

The following are a few examples:

Code Block
languagebash
titlesrun
>  srun -N2 -n4 hello.mpi 
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu113 out of 1
Process 0 on psanagpu113 out of 1
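
To get an interactive shell on a compute node, as mentioned above, something like the following sketch can be used (the partition and GPU type are taken from the table above; adjust to your needs):

Code Block
languagebash
titlesrun --pty
> srun --partition=anagpu --gres=gpu:1080ti:1 --pty /bin/bash
psanagpu110:~$ exit     # work interactively on the allocated node, then exit to release it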

Monitoring/Status

squeue

To check the jobs that exist on the system, use the squeue command:

...

The ST (job state) field shows that jobid 187 is currently running (R).
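
squeue can also filter the listing, for example by user or by jobid; a short sketch:

Code Block
languagebash
titlesqueue filters
> squeue -u $USER     # show only your own jobs
> squeue -j 187       # show a specific job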

sacct

Get information about the status of finished jobs:

Code Block
languagebash
titlesacct
> sacct     # show all finished jobs of the user
          JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
  ------------ ---------- ---------- ---------- ---------- ---------- -------- 
  141               sleep     anagpu                     1  COMPLETED      0:0 
  142                  ls     anagpu                     1  COMPLETED      0:0 
  143                exit     anagpu                     1     FAILED      2:0 

> sacct -j <jobid>       # show info for specific job  
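
The fields reported by sacct can be selected explicitly with the --format option; a sketch:

Code Block
languagebash
titlesacct --format
> sacct -j <jobid> --format=JobID,JobName,Partition,State,ExitCode,Elapsed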

 

...