Partition is the term Slurm uses to refer to queues. Select the partition appropriate to your job requirements:
| Partition | Nodes | Data | Cores per Node | Mem per Node (GB) | GPU Type | Time Limit (hr) | Priority | Comments | Location |
|---|---|---|---|---|---|---|---|---|---|
| anagpu | 16 | ALL | | 128 | NVIDIA GTX 1080Ti | 48 | | | B054 (SRCF) |
| psanagpu(101-105) | | | 12 | | | | | | |
| psanagpu(106-116) | | | 16 | | | | | | |
| anabatch | | | | | | | | Default | |
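The partition can also be chosen at submission time rather than hard-coded in a batch script. A minimal sketch (my_job.sh is a hypothetical script name):

```
# Submit a batch script to a specific partition/queue;
# -p is the short form of --partition (my_job.sh is a placeholder name).
sbatch -p anagpu my_job.sh

# Show the state and limits of a single partition.
sinfo -p anagpu
```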
...
```
psslurm ~ #> sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
anagpu*    up     infinite       9  down*  psanagpu[101-109]
anagpu*    up     infinite       4  idle   psanagpu[110-113]
```

The * following the partition name marks the default partition (queue), which is anagpu.
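sinfo can also report node by node, which is useful when some nodes of a partition are down or drained; a brief sketch:

```
# Node-oriented long listing: one line per node with its state,
# CPU count, and memory.
sinfo -N -l

# Restrict the listing to the anagpu partition.
sinfo -N -l -p anagpu
```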
There are two ways to submit a job on the cluster. The main way is with the sbatch command, which submits a batch script and takes full advantage of the cluster's computing power; the other is to run an interactive job, as sketched below.
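For the interactive route, srun can allocate resources and open a shell on a compute node. A minimal sketch, reusing the anagpu partition and the GPU type from the table above:

```
# Request an interactive shell on an anagpu node with one GTX 1080Ti;
# the allocation is released when the shell exits.
srun --partition=anagpu --gres=gpu:1080ti:1 --pty /bin/bash
```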
...
The following is a sample submission script (tst_script):
```
[omarq@psslurm conda 06:10:08]> cat tst_script
#!/bin/bash
#
#SBATCH --job-name="name"                 # Job name for the allocation
#SBATCH --output="filename"               # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error="filename"                # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=anagpu                # Partition/queue to submit the job to
#SBATCH --gres=gpu:1080ti:1               # Type and number of GPUs
#SBATCH --ntasks=8                        # Total number of tasks
#SBATCH --ntasks-per-node=4               # Number of tasks per node
#SBATCH --mail-user='username'@slac.stanford.edu   # Receive e-mail from Slurm
#SBATCH --mail-type=ALL                   # Events to be notified about; other options include BEGIN, END, FAIL
#
srun -l hostname
srun python ExampleMultipleChaperones.py

[omarq@psslurm conda 06:10:11]> sbatch tst_script
Submitted batch job 187
```
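The job id that sbatch prints can be used to manage the job afterwards, for example:

```
# Show the full scheduler record of a pending or running job.
scontrol show job 187

# Cancel the job if it is no longer needed.
scancel 187
```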
...
```
psslurm conda 07:29:39> srun -N2 -n4 hello.mpi
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu113 out of 1
Process 0 on psanagpu113 out of 1
```
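Each task above reports being rank 0 out of 1, i.e. the four copies are not forming a single 4-rank MPI job; this is the usual symptom of an MPI library that is not bootstrapped through Slurm's process-management interface (PMI). A possible fix, assuming PMI2 support is built into both Slurm and the MPI library:

```
# List the MPI plugin types this Slurm installation supports.
srun --mpi=list

# Launch the tasks through PMI2 so they join one MPI communicator
# (assumes PMI2 support in Slurm and in the MPI library).
srun --mpi=pmi2 -N2 -n4 hello.mpi
```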
...
```
psslurm ~ #> squeue
JOBID  PARTITION  NAME      USER   ST  TIME  NODES  NODELIST(REASON)
  187  anagpu     ExampleM  omarq  R   0:04      1  psanagpu110
```
The ST (job state) field shows that jobid 187 is currently running (R).
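On a shared cluster the full queue listing is long, so squeue is usually filtered:

```
# Show only your own jobs.
squeue -u $USER

# Show one job by its id.
squeue -j 187
```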
Get information about the status of finished jobs:
```
> sacct              # show all finished jobs of the current user
> sacct -j <jobid>   # show info for a specific job
```
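By default sacct prints a fixed set of columns; the fields to display can be chosen explicitly. A short sketch using standard sacct field names:

```
# Selected accounting fields for one job, including final state,
# elapsed time, and peak memory of its steps.
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode

# Jobs since a given date rather than just today (date is illustrative).
sacct --starttime=2024-01-01
```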