...
Code Block |
---|
|
>psanagpu104:~$ sinfo
PARTITION   AVAIL  TIMELIMIT   NODES  STATE  NODELIST
psanagpuq*     up 10-00:00:0       1 drain*  psanagpu118
psanagpuq*     up 10-00:00:0       2  down*  psanagpu[115-116]
psanagpuq*     up 10-00:00:0       1  drain  psanagpu117
psanagpuq*     up 10-00:00:0       2   idle  psanagpu[113-114]
psanaq         up 10-00:00:0       1 drain*  psana1509
psanaq         up 10-00:00:0       4  down*  psana[1503,1519,1604-1605]
psanaq         up 10-00:00:0       6    mix  psana[1502,1504-1506,1520,1602]
psanaq         up 10-00:00:0       1  alloc  psana1501
psanaq         up 10-00:00:0      27   idle  psana[1507-1508,1510-1518,1601,1606-1620]
psanaq         up 10-00:00:0       1   down  psana1603
The * following the name marks the default partition (queue), here psanagpuq. |
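To act on such a listing from a script, the default `sinfo` column layout can be split directly. A minimal sketch (the sample text is abbreviated from the listing above; a live script would capture the output with `subprocess` instead):

```python
# Parse default-format `sinfo` output and collect the idle nodelist per
# partition. SAMPLE is abbreviated from the listing above; live output
# would come from:
#   subprocess.run(["sinfo"], capture_output=True, text=True).stdout
SAMPLE = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
psanagpuq* up 10-00:00:0 2 idle psanagpu[113-114]
psanaq up 10-00:00:0 1 alloc psana1501
psanaq up 10-00:00:0 27 idle psana[1507-1508,1510-1518,1601,1606-1620]
"""

def idle_nodelists(sinfo_text):
    """Return {partition: nodelist} for rows whose STATE is 'idle'."""
    idle = {}
    for line in sinfo_text.splitlines()[1:]:          # skip header row
        part, avail, limit, nodes, state, nodelist = line.split()
        if state == "idle":
            idle[part.rstrip("*")] = nodelist         # '*' marks the default partition
    return idle

print(idle_nodelists(SAMPLE))
```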
Check available GPUs on a specific node
...
Code Block |
---|
> cat submit.sh
#!/bin/bash
#SBATCH --partition=psanaq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
#SBATCH --output=%j.log
# "-u" flushes print statements which can otherwise be hidden if mpi hangs
# "-m mpi4py.run" allows mpi to exit if one rank has an exception
mpirun python -u -m mpi4py.run /reg/g/psdm/tutorials/examplePython/mpiDataSource.py |
One can also run the same submission directly from the command line using the "--wrap" option of sbatch:
Code Block |
---|
sbatch -p psanaq --nodes 2 --ntasks-per-node 3 --wrap="mpirun python mpi_simpletest.py" |
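The same `--wrap` submission can also be assembled from a script. A minimal sketch using Python's `subprocess`-style argument list (the partition, node counts, and script name are just the ones from the example above); passing a list rather than a shell string means the wrapped command needs no extra quoting:

```python
def build_sbatch_wrap(partition, nodes, tasks_per_node, command):
    """Build the argv for an sbatch --wrap submission."""
    return [
        "sbatch",
        "-p", partition,
        "--nodes", str(nodes),
        "--ntasks-per-node", str(tasks_per_node),
        "--wrap=" + command,
    ]

argv = build_sbatch_wrap("psanaq", 2, 3, "mpirun python mpi_simpletest.py")
print(argv)
# On the cluster one would then submit it with:
#   subprocess.run(argv, check=True)
```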
...
Code Block |
---|
|
> cat tst_script
#!/bin/bash
#
#SBATCH --job-name=<name> # Job name for allocation
#SBATCH --output=%j.log # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error=%j.err # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=psanagpuq          # Partition/Queue to submit job
#SBATCH --gres=gpu:1080ti:1 # Number of GPUs
#SBATCH --ntasks=8                     # Total number of tasks
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --mail-user=<username>@slac.stanford.edu # Receive e-mail from slurm
#SBATCH --mail-type=ALL                # Type of e-mail from slurm; other options include BEGIN, END, FAIL
#
srun -l hostname
srun python ExampleMultipleChaperones.py
> sbatch tst_script
Submitted batch job 187
|
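When chaining jobs from a script it is handy to capture the job id that sbatch prints in its `Submitted batch job` confirmation line. A minimal sketch (sbatch's `--parsable` flag, which prints the bare id, is the alternative):

```python
import re

def parse_sbatch_jobid(sbatch_output):
    """Extract the numeric job id from sbatch's confirmation line."""
    m = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if m is None:
        raise ValueError("unexpected sbatch output: %r" % sbatch_output)
    return int(m.group(1))

print(parse_sbatch_jobid("Submitted batch job 187"))   # -> 187
```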
...
Code Block |
---|
|
>psanagpu104:~$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
466440    psanaq     wrap  khegazy  R 1-17:51:12      1 psana1502
466423    psanaq     wrap  khegazy  R 1-17:53:31      1 psana1506
466420    psanaq     wrap  khegazy  R 1-17:53:34      1 psana1602
466421    psanaq     wrap  khegazy  R 1-17:53:34      1 psana1504 |
The ST (job state) field shows that jobid 466440 is currently running (R).
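A script can watch for that state the same way, by splitting the default-format `squeue` output. A minimal sketch (sample rows abbreviated from the listing above):

```python
# Collect the job ids in a given state from default-format `squeue` output.
# SAMPLE is abbreviated from the listing above; live output would come from:
#   subprocess.run(["squeue", "-u", "khegazy"], capture_output=True, text=True).stdout
SAMPLE = """\
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
466440 psanaq wrap khegazy R 1-17:51:12 1 psana1502
466423 psanaq wrap khegazy R 1-17:53:31 1 psana1506
"""

def jobs_in_state(squeue_text, state="R"):
    """Return the job ids whose ST column matches `state`."""
    jobs = []
    for line in squeue_text.splitlines()[1:]:      # skip header row
        fields = line.split()
        if fields[4] == state:                     # ST is the fifth column
            jobs.append(fields[0])
    return jobs

print(jobs_in_state(SAMPLE))   # -> ['466440', '466423']
```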
...
Code Block |
---|
|
>psanagpu104:~$ sacct -u khegazy    # show all jobs of a user
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
466416             wrap     psanaq                    10    RUNNING      0:0
466416.batch      batch                               10    RUNNING      0:0
466418             wrap     psanaq                    10    RUNNING      0:0
466418.batch      batch                               10    RUNNING      0:0
466420             wrap     psanaq                    10    RUNNING      0:0
466420.batch      batch                               10    RUNNING      0:0

> sacct -j <jobid>    # show info for a specific job |
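For scripting, sacct's fixed-width columns are awkward to split on whitespace because some fields (e.g. Account) can be empty; sacct's `--parsable2` option, which emits pipe-delimited rows, avoids that. A minimal sketch assuming that flag (sample values taken from the listing above):

```python
# Summarize job states from `sacct --parsable2` output. Pipe-delimited
# rows mean empty columns survive splitting. SAMPLE values are taken
# from the listing above; live output would come from:
#   subprocess.run(["sacct", "-u", "khegazy", "--parsable2"],
#                  capture_output=True, text=True).stdout
SAMPLE = """\
JobID|JobName|Partition|Account|AllocCPUS|State|ExitCode
466416|wrap|psanaq||10|RUNNING|0:0
466416.batch|batch|||10|RUNNING|0:0
466418|wrap|psanaq||10|RUNNING|0:0
"""

def job_states(sacct_text):
    """Return {JobID: State} for top-level jobs (skipping .batch steps)."""
    lines = sacct_text.splitlines()
    header = lines[0].split("|")
    idx_id, idx_state = header.index("JobID"), header.index("State")
    states = {}
    for line in lines[1:]:
        fields = line.split("|")
        if "." not in fields[idx_id]:              # skip job steps like 466416.batch
            states[fields[idx_id]] = fields[idx_state]
    return states

print(job_states(SAMPLE))   # -> {'466416': 'RUNNING', '466418': 'RUNNING'}
```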