...
...
SLURM is the new job scheduling system for the LCLS batch compute systems; it is replacing the current LSF system.
...
Generic documentation about SLURM can be found in the Quick Start User Guide. Even shorter documentation, some of it specific to psana, can be found on this page.
The partition/queue
"Partition" is the term SLURM uses for queues. Depending on your job requirements, select the appropriate partition:
Partition | Nodes | Data | Cores per Node | Mem per Node (GB) | GPU Type | Time Limit (hr) | Priority | Comments | Location
---|---|---|---|---|---|---|---|---|---
anagpu | 16 | ALL | | 128 | NVIDIA GTX 1080Ti | 48 | | | B054 (SRCF)
psanagpu(101-105) | | | 12 | | | | | |
psanagpu(110-113) | | | 16 | | | | | |
*anabatch | | | | | | | | Default |
anagpu: This 16-node partition is for individuals wishing to use GPU resources.
This information can be obtained with the sinfo command.
From the psana nodes pool:
```
> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
anagpu*   up     infinite   9      down*  psanagpu[101-109]
anagpu*   up     infinite   4      idle   psanagpu[110-113]
```

The * following the name means it is the default partition (queue); here the default is anagpu.
There are two ways to submit a job on the cluster. The main way is to use the sbatch command, which queues a batch script for later execution and takes full advantage of the computing power; the other is to submit an interactive job via srun.
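As a minimal sketch of the two submission modes (the script name submit.sh and the anagpu partition are taken from the examples below; adapt them to your own job):

```shell
# Batch mode: queue a script and return immediately.
# SLURM prints the assigned job id and the commands run later on a compute node.
sbatch --partition=anagpu submit.sh

# Interactive mode: block until the command has finished running
# on an allocated node, with output streamed back to the terminal.
srun --partition=anagpu -n 1 hostname
```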
...
The following is a simple submission script for a parallel psana batch job run with MPI. It can be submitted with the command "sbatch submit.sh". The commands specified in the script file will be run on the first available compute node that fits the requested resources.
```
psanagpu101:~$ cat submit.sh
#!/bin/bash
#SBATCH --partition=anagpu
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --output=%j.log

# -u flushes print statements which can otherwise be hidden if mpi hangs
mpirun python -u /reg/g/psdm/tutorials/examplePython/mpiDataSource.py
```
This script shows some additional features controllable via SLURM:
```
> cat tst_script
#!/bin/bash
#
#SBATCH --job-name="name"            # Job name for allocation
#SBATCH --output="filename"          # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error="filename"           # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=anagpu           # Partition/Queue to submit job
#SBATCH --gres=gpu:1080ti:1          # Number of GPUs
#SBATCH --ntasks=8                   # Total number of tasks
#SBATCH --ntasks-per-node=4          # Number of tasks per node
#SBATCH --mail-user='username'@slac.stanford.edu  # Receive e-mail from slurm
#SBATCH --mail-type=ALL              # Type of e-mail from slurm; other options include BEGIN, END, FAIL
#
srun -l hostname
srun python ExampleMultipleChaperones.py

> sbatch tst_script
Submitted batch job 187
```
Unlike sbatch, the srun command does not return immediately; it waits for the job to complete. The srun command can also be used to get control of a node and run jobs interactively, which is useful for data exploration and significant software development.
The following are a few examples:
```
> srun -N2 -n4 hello.mpi
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu113 out of 1
Process 0 on psanagpu113 out of 1
```
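For a fully interactive session, srun can also allocate a node and attach a shell to it (a minimal sketch; --pty is a standard SLURM option, and the anagpu partition is the one used above):

```shell
# Request one task on the anagpu partition and attach a pseudo-terminal
# running bash; the prompt that appears is on the allocated compute node.
# Exit the shell to release the allocation.
srun --partition=anagpu -n 1 --pty bash
```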
To check which jobs exist on the system, use the squeue command:
...
The ST (job state) field shows that jobid 187 is currently running (R).
Get information about the status of finished jobs
```
> sacct                # show all finished jobs of user
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
141               sleep     anagpu                     1  COMPLETED      0:0
142                  ls     anagpu                     1  COMPLETED      0:0
143                exit     anagpu                     1     FAILED      2:0

> sacct -j <jobid>     # show info for specific job
```
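The default sacct columns can be tailored with the standard --format option, which is handy when checking runtimes of finished jobs (a sketch; <jobid> is a placeholder for a real job id, and the listed fields are standard sacct field names):

```shell
# Show selected accounting fields for one finished job,
# including its elapsed wall-clock time and exit code.
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,ExitCode
```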
...