
SLURM

SLURM is the new job scheduling system for the LCLS batch compute systems; it is replacing the current LSF system. Generic documentation about SLURM can be found in the Quick Start User Guide. Shorter documentation, some of it specific to psana, can be found on this page.

Partitions

Partition/queue information is provided by the sinfo command.

sinfo

From the psana pool:

> sinfo
PARTITION AVAIL  TIMELIMIT  NODES STATE  NODELIST 
anagpu*      up   infinite      9  down* psanagpu[101-109] 
anagpu*      up   infinite      4   idle psanagpu[110-113] 

The * following the partition name indicates that anagpu is the default partition (queue).
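For more detail, sinfo can be restricted to a single partition or report node-oriented information. These are standard SLURM options; anagpu is just the partition from the example above:

> sinfo -p anagpu     # show only the anagpu partition
> sinfo -N -l         # node-oriented, long listing (one line per node)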

Job Submission

There are two ways to submit a job on the cluster. The main way is to use the sbatch command for later execution; the other is to submit an interactive job via srun.

sbatch

The following is a simple submission script for a parallel psana batch job run with MPI. It can be submitted with the command "sbatch submit.sh". The commands specified in the script file will be run on the first available compute node(s) that fit the resources requested.

> cat submit.sh
#!/bin/bash

#SBATCH --partition=anagpu
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --output=%j.log

# -u flushes print statements which can otherwise be hidden if mpi hangs
`which mpirun` python -u /reg/g/psdm/tutorials/examplePython/mpiDataSource.py
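Submitting the script returns the job id immediately, and with --output=%j.log the job's output is written to a file named after that id. A minimal example (the job id shown is illustrative):

> sbatch submit.sh
Submitted batch job 188
> cat 188.log         # job output appears here once the job runs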

This script shows some additional features controllable via SLURM:

> cat tst_script 
#!/bin/bash
#
#SBATCH --job-name='name' # Job name for allocation
#SBATCH --output='filename' # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error='filename' # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=anagpu # Partition/Queue to submit job
#SBATCH --gres=gpu:1080ti:1 # Number (and type) of GPUs per node
#SBATCH --ntasks=8  # Total number of tasks
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --mail-user='username'@slac.stanford.edu # Receive e-mail from slurm
#SBATCH --mail-type=ALL # Type of e-mail from slurm; other options include BEGIN, END and FAIL
#
srun -l hostname
srun python ExampleMultipleChaperones.py


> sbatch tst_script 
Submitted batch job 187
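The #SBATCH directives only set defaults; standard SLURM behavior is that options given on the sbatch command line override the values in the script, for example:

> sbatch --partition=anagpu --ntasks=4 tst_script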
srun

Unlike sbatch, the srun command does not return immediately; it waits for the job to complete. srun can also be used to get control of a node to run interactively, which is useful for data exploration and software development.

The following is an example of running an MPI binary with four tasks spread over two nodes:

>  srun -N2 -n4 hello.mpi 
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu113 out of 1
Process 0 on psanagpu113 out of 1
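srun can also allocate a pseudo-terminal to give an interactive shell on a compute node, which is the usual pattern for data exploration; a minimal sketch, assuming the anagpu partition:

> srun --partition=anagpu --pty /bin/bash     # interactive shell on an allocated node
> exit                                        # releases the allocation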

Monitoring/Status

squeue

To check the jobs that exist on the system, use the squeue command:

> squeue 
       JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         187    anagpu ExampleM    omarq  R       0:04      1 psanagpu110 

The ST (job state) field shows that jobid 187 is currently running (R).
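squeue can also be filtered to a single user or a single job with standard options (replace the placeholders):

> squeue -u <username>    # show only this user's jobs
> squeue -j <jobid>       # show a specific job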

sacct

Get information about the status of finished jobs:

> sacct     # show all finished job of user
          JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
  ------------ ---------- ---------- ---------- ---------- ---------- -------- 
  141               sleep     anagpu                     1  COMPLETED      0:0 
  142                  ls     anagpu                     1  COMPLETED      0:0 
  143                exit     anagpu                     1     FAILED      2:0 

> sacct -j <jobid>       # show info for specific job  
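The columns reported by sacct can be customized with --format; the field names below are standard sacct fields:

> sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State    # pick which columns to show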


