...

Code Block
language: bash
title: sinfo
psanagpu104:~$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
psanagpuq*    up 10-00:00:0      1 drain* psanagpu118
psanagpuq*    up 10-00:00:0      2  down* psanagpu[115-116]
psanagpuq*    up 10-00:00:0      1  drain psanagpu117
psanagpuq*    up 10-00:00:0      2   idle psanagpu[113-114]
psanaq        up 10-00:00:0      1 drain* psana1509
psanaq        up 10-00:00:0      4  down* psana[1503,1519,1604-1605]
psanaq        up 10-00:00:0      6    mix psana[1502,1504-1506,1520,1602]
psanaq        up 10-00:00:0      1  alloc psana1501
psanaq        up 10-00:00:0     27   idle psana[1507-1508,1510-1518,1601,1606-1620]
psanaq        up 10-00:00:0      1   down psana1603
The * following the partition name marks the default partition (queue), here psanagpuq.
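To look at a single queue, sinfo can be restricted with its standard -p flag (a minimal sketch; the partition name is taken from the listing above):

Code Block
# Show only the psanaq partition
sinfo -p psanaq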

Check available GPUs on a specific node
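One quick way to do this (a sketch using standard SLURM queries; the node name psanagpu113 is just an example from the sinfo listing above) is to inspect the node's generic resources (GRES):

Code Block
# Print hostname and configured GRES (GPUs); %G is sinfo's generic-resources field
sinfo -n psanagpu113 -o "%n %G"
# Alternatively, dump the full node record and pick out the Gres line
scontrol show node psanagpu113 | grep -i gres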

...

Code Block
> cat submit.sh
#!/bin/bash

#SBATCH --partition=psanaq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
#SBATCH --output=%j.log

# "-u" flushes print statements which can otherwise be hidden if mpi hangs
# "-m mpi4py.run" allows mpi to exit if one rank has an exception
mpirun python -u -m mpi4py.run /reg/g/psdm/tutorials/examplePython/mpiDataSource.py
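Assuming the script is saved as submit.sh (the name used in the listing above), it is submitted with sbatch, which prints the assigned job ID:

Code Block
# Submit the batch script to slurm
sbatch submit.sh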

One can also run the same command directly from the command line using the "--wrap" option of sbatch:

Code Block
sbatch -p psanaq --nodes 2 --ntasks-per-node 3 --wrap="mpirun python mpi_simpletest.py"

...

Code Block
title: sbatch
> cat tst_script 
#!/bin/bash
#
#SBATCH --job-name=<name> # Job name for allocation
#SBATCH --output=%j.log # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error=%j.err # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=psanagpuq # Partition/queue to submit the job to
#SBATCH --gres=gpu:1080ti:1 # Number (and type) of GPUs per node
#SBATCH --ntasks=8 # Total number of tasks
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --mail-user=<username>@slac.stanford.edu # Address to receive e-mail from slurm
#SBATCH --mail-type=ALL # Which events trigger e-mail; other options include BEGIN, END, FAIL
#
srun -l hostname
srun python ExampleMultipleChaperones.py


> sbatch tst_script 
Submitted batch job 187
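The GPU request can also be made without a script by combining the flags above with --wrap (a sketch; the wrapped command reuses the example script from tst_script):

Code Block
# One-line GPU job submission; flags mirror the tst_script header above
sbatch -p psanagpuq --gres=gpu:1080ti:1 --wrap="srun python ExampleMultipleChaperones.py"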

...

Code Block
language: bash
title: squeue
psanagpu104:~$ squeue 

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            466440    psanaq     wrap  khegazy  R 1-17:51:12      1 psana1502
            466423    psanaq     wrap  khegazy  R 1-17:53:31      1 psana1506
            466420    psanaq     wrap  khegazy  R 1-17:53:34      1 psana1602
            466421    psanaq     wrap  khegazy  R 1-17:53:34      1 psana1504

The ST (job state) field shows that jobid 466440 is currently running (R).
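To see only your own jobs rather than the whole queue, squeue takes a standard -u flag:

Code Block
# List only the current user's jobs
squeue -u $USER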

...

Code Block
language: bash
title: sacct
psanagpu104:~$ sacct -u khegazy     # show all jobs of a given user
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
466416             wrap     psanaq                    10    RUNNING      0:0 
466416.batch      batch                               10    RUNNING      0:0 
466418             wrap     psanaq                    10    RUNNING      0:0 
466418.batch      batch                               10    RUNNING      0:0 
466420             wrap     psanaq                    10    RUNNING      0:0 
466420.batch      batch                               10    RUNNING      0:0 

> sacct -j <jobid>    # show info for a specific job
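The columns sacct prints can be chosen with its standard --format option (a small sketch; the field names are standard sacct fields):

Code Block
# Select specific accounting fields for one job
sacct -j <jobid> --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed,ExitCode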