...
The information can be provided by the sinfo command. From psana nodes:
Code Block | ||||
---|---|---|---|---|
| ||||
psslurm ~ # sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
anagpu*      up   infinite      1  drain psanagpu107
anagpu*      up   infinite      6   idle psanagpu[101-104,110,113]
|
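To check quickly whether any nodes are free, the sinfo listing can be summarized with a short filter. The following is a minimal sketch, not part of the cluster tooling: the function name `count_idle_nodes` is hypothetical, and it assumes the default six-column sinfo layout shown above (NODES in column 4, STATE in column 5).

```shell
#!/bin/sh
# Hypothetical helper: sum the NODES column for rows whose STATE is "idle".
# Assumes the default sinfo column order shown above.
count_idle_nodes() {
    awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }'
}

# Sample copied from the sinfo output above.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
anagpu* up infinite 1 drain psanagpu107
anagpu* up infinite 6 idle psanagpu[101-104,110,113]'

printf '%s\n' "$sample" | count_idle_nodes
```

On the cluster itself, the same function could be fed live data with `sinfo | count_idle_nodes`.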
...
There are two ways to submit a job on the cluster. The main way is the sbatch command, which takes full advantage of the cluster's computing power; the other is to submit an interactive job.
The commands specified in the script file will be run on the first available compute node that fits the requested resources.
...
Code Block | ||
---|---|---|
| ||
[omarq@psslurm conda 06:10:08]cat tst_script
#!/bin/bash
#
#SBATCH --job-name="name"                          # Job name for allocation
#SBATCH --output="filename"                        # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error="filename"                         # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=anagpu                         # Partition/Queue to submit job
#SBATCH --gres=gpu:1080ti:1                        # Number of GPUs
#SBATCH --nodes=1                                  # Number of nodes
#SBATCH --mail-user='username'@slac.stanford.edu   # Receive e-mail from slurm
#SBATCH --mail-type=ALL                            # Type of e-mail from slurm; other options include BEGIN, END, FAIL
#
srun -l hostname
srun python ExampleMultipleChaperones.py
[omarq@psslurm conda 06:10:11]sbatch tst_script
Submitted batch job 200187
|
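When a job is submitted from a wrapper script, it is often handy to keep the job ID for later use with squeue. The following is a minimal sketch, assuming the confirmation line has the form "Submitted batch job <id>" as in the session above; the function name `extract_job_id` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the job ID from sbatch's confirmation line.
# Assumes the line format "Submitted batch job <id>" shown above.
extract_job_id() {
    awk '/^Submitted batch job/ { print $NF }'
}

# Sample confirmation line copied from the session above.
echo "Submitted batch job 200187" | extract_job_id
```

On the cluster this could be used as `jobid=$(sbatch tst_script | extract_job_id)`. Slurm's `sbatch --parsable` option is an alternative that prints the job ID without the surrounding text.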
The srun command takes control of a node to run jobs interactively. Interactive jobs can be useful for data exploration and significant software development.
The following are a few examples:
Code Block | ||||
---|---|---|---|---|
| ||||
psslurm conda 07:29:39 srun -N2 -n4 hello.mpi
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu113 out of 1
Process 0 on psanagpu113 out of 1
|
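In the run above, `-N2` requests two nodes and `-n4` requests four tasks, so the tasks are spread across both nodes. The following is a minimal sketch that tallies tasks per host from that output; the function name `tally_hosts` is hypothetical, and it assumes each line has the form "Process <rank> on <host> out of <size>".

```shell
#!/bin/sh
# Hypothetical helper: count how many tasks reported from each host.
# Assumes the hostname is the fourth whitespace-separated field.
tally_hosts() {
    awk '{ count[$4]++ } END { for (h in count) print h, count[h] }' | sort
}

# Sample copied from the srun output above.
sample='Process 0 on psanagpu110 out of 1
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu113 out of 1
Process 0 on psanagpu113 out of 1'

printf '%s\n' "$sample" | tally_hosts
```

For the sample, each of the two hosts is listed with a count of 2. Against a live run, the function could be used as `srun -N2 -n4 hello.mpi | tally_hosts`.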
Before submitting a batch job, it is recommended to check which jobs already exist on the system by using the squeue command:
Code Block | ||||
---|---|---|---|---|
| ||||
psslurm ~ # squeue
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
  187    anagpu ExampleM    omarq  R  0:04      1 psanagpu110
|
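The squeue listing can be filtered to show only one user's running jobs. The following is a minimal sketch, assuming the default column order shown above (JOBID in column 1, USER in column 4, ST in column 5); the function name `running_jobs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the job IDs of running (ST == "R") jobs
# owned by the user given as the first argument.
running_jobs() {
    awk -v user="$1" 'NR > 1 && $4 == user && $5 == "R" { print $1 }'
}

# Sample copied from the squeue output above.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
187 anagpu ExampleM omarq R 0:04 1 psanagpu110'

printf '%s\n' "$sample" | running_jobs omarq
```

On the cluster, `squeue | running_jobs "$USER"` would apply the same filter to live data. squeue can also restrict the listing itself with its `-u` option.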
...