SLURM is the new batch job scheduling system for the LCLS SLAC batch compute systems; it is replacing the current LSF system. Generic documentation about SLURM can be found in this Quick Start User Guide. Even shorter documentation, some of it specific to psana, can be found in this page.
Some quick guides showing equivalent commands in LSF and SLURM:
LCLS specific information about the S3DF facility is here: Running at S3DF.
General S3DF documentation is here: https://
...
...
Some hardware details about S3DF can be found here: https://s3df.slac.stanford.edu/public/doc/#/batch-compute?id=clusters-amp-repos. Note that 8 cores on each milano cluster node are reserved for the filesystem, so only 120 (out of 128) can be used on each node for batch processing.
...
Partition/queue information is provided by the sinfo command.
LCLS users typically use the "milano" queue at S3DF:
Code Block | ||||
---|---|---|---|---|
| ||||
[cpo@sdfiana002 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
roma*        up 10-00:00:0      1   comp sdfrome004
roma*        up 10-00:00:0     16  drng@ sdfrome[006-018,041-043]
roma*        up 10-00:00:0      1  down$ sdfrome003
roma*        up 10-00:00:0      1 drain$ sdfrome037
roma*        up 10-00:00:0      1 drain* sdfrome005
roma*        up 10-00:00:0     21    mix sdfrome[019-036,038-040]
milano       up 10-00:00:0      1  inval sdfmilan221
milano       up 10-00:00:0     14  drng@ sdfmilan[036-038,120-121,126,129,204-205,212,229-232]
milano       up 10-00:00:0      4 drain$ sdfmilan[009,041,049,112]
milano       up 10-00:00:0      1  drain sdfmilan032
milano       up 10-00:00:0     12   resv sdfmilan[001-005,029-030,052,057,117-119]
milano       up 10-00:00:0    102    mix sdfmilan[006-008,010-019,021-028,031,033-035,039-040,042-048,050-051,053-056,058-072,101-111,113-116,122-125,127-128,130-131,201-203,206-211,213-220,222-228]
milano       up 10-00:00:0      1   idle sdfmilan020
ampere       up 10-00:00:0      1  drng@ sdfampere010
ampere       up 10-00:00:0      1   drng sdfampere011
ampere       up 10-00:00:0      3  drain sdfampere[005,008,023]
ampere       up 10-00:00:0     18    mix sdfampere[001-004,006-007,009,012-022]
[cpo@sdfiana002 ~]$ |
The "*" following the roma queue name indicates that it is the default queue for submission.
GPU resources on a node can be listed with scontrol, for example from the psana pool:
scontrol show node psanagpu116 -d | grep Gres=gpu
Gres=gpu:1080ti:1(S:0)
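When the sinfo listing is long, a per-partition tally of node states can be easier to read. The following is a minimal sketch, assuming the default six-column sinfo layout (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST); the function name summarize_sinfo is hypothetical and not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: tally node counts per partition and state.
# Assumes the default sinfo columns:
#   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
summarize_sinfo() {
    awk 'NR > 1 && NF >= 6 {
             part = $1
             sub(/\*$/, "", part)        # drop the default-partition marker
             count[part " " $5] += $4    # key is "partition state"
         }
         END { for (key in count) print key, count[key] }' | LC_ALL=C sort
}

# Usage on a Slurm login node:
#   sinfo | summarize_sinfo
```

State suffixes such as "@", "$" and "*" are kept as separate states, so "drain" and "drain$" are counted separately.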
The following is a simple submission script for a parallel psana batch job run with MPI. It can be submitted with the command "sbatch submit.sh". The commands specified in the script file will be run on the first available compute node that fits the resources requested. There are two concepts: "nodes" and "tasks per node". A "node" is a physical computer box (with a host name, for example), and each node typically has multiple CPU cores (see this page for specific numbers: Batch Nodes And Queues). Typically the tasks-per-node parameter is set to utilize all the cores on each node.
NOTE: when running interactively the "mpirun" command takes a "-n" argument with a number of cores. However, when running with slurm no "-n" is needed, as slurm infers it from the "--nodes" and "--ntasks-per-node" values.
Code Block |
---|
> cat submit.sh
#!/bin/bash
#SBATCH --partition=milano
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
# "-u" flushes print statements which can otherwise be hidden if mpi hangs
# "-m mpi4py.run" allows mpi to exit if one rank has an exception
mpirun python -u -m mpi4py.run my_psana_script.py |
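As a rough check of how many MPI ranks a submission requests, the product of "--nodes" and "--ntasks-per-node" can be computed directly. This is a minimal sketch; the function name total_ranks is hypothetical, and the example numbers assume two milano nodes with the 120 usable cores each.

```shell
#!/bin/bash
# Hypothetical helper: number of MPI ranks implied by
# --nodes and --ntasks-per-node.
total_ranks() {
    local nodes="$1" tasks_per_node="$2"
    echo $(( nodes * tasks_per_node ))
}

# Two milano nodes with 120 usable cores each:
total_ranks 2 120
```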
One can also do this same command from the command line using the "--wrap" option for sbatch:
Code Block |
---|
sbatch -p milano --nodes 2 --ntasks-per-node 3 --wrap="mpirun python -u -m mpi4py.run my_psana_script.py" |
This script shows some additional features controllable via SLURM:
Code Block | ||
---|---|---|
| ||
> cat tst_script
#!/bin/bash
#
#SBATCH --job-name=<name> # Job name for allocation
#SBATCH --output=%j.log # File to which STDOUT will be written, %j inserts jobid
#SBATCH --error=%j.err # File to which STDERR will be written, %j inserts jobid
#SBATCH --partition=psanagpuq # Partition/Queue to submit job
#SBATCH --gres=gpu:1080ti:1 # Number of GPUs
#SBATCH --ntasks=8 # Total number of tasks
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --mail-user='username'@slac.stanford.edu # Receive e-mail from slurm
#SBATCH --mail-type=ALL # Type of e-mail from slurm; other options include BEGIN, END, FAIL.
#
srun -l hostname
srun python ExampleMultipleChaperones.py
> sbatch tst_script
Submitted batch job 187
|
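Scripts that submit a job and then monitor it need the job ID from the "Submitted batch job" message. The following is a minimal sketch, assuming that message format; the function name parse_jobid is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job ID out of sbatch's
# "Submitted batch job <id>" message read from stdin.
parse_jobid() {
    awk '/^Submitted batch job [0-9]+/ { print $4; exit }'
}

# Usage on a Slurm login node:
#   jobid="$(sbatch tst_script | parse_jobid)"
#   squeue -j "$jobid"
```

Alternatively, "sbatch --parsable" prints the job ID without the surrounding text (followed by a semicolon and the cluster name when one applies).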
Unlike sbatch, the srun command does not return immediately; it waits for the job to complete. The srun command can also be used to get control of a node to run interactively, which can be useful for data exploration and software development.
The following are a few examples:
Code Block | ||||
---|---|---|---|---|
| ||||
> srun -N2 -n4 hello.mpi
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu110 out of 1
Process 0 on psanagpu113 out of 1
Process 0 on psanagpu113 out of 1
|
srun
In principle the slurm "srun" command can also be used to launch parallel jobs. However, the current S3DF "srun" version only supports the older "pmi2" protocol, which is incompatible with the MPI packages from conda that LCLS uses; these use the newer "pmix" protocol. srun should therefore be avoided for parallel jobs at S3DF (see the output of "srun --mpi=list").
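Whether a given srun build offers pmix can be checked by searching the plugin list. This is a minimal sketch; the function name supports_pmix is hypothetical, and because the exact layout of the "srun --mpi=list" output differs between Slurm versions it only looks for the substring "pmix".

```shell
#!/bin/bash
# Hypothetical helper: succeed if the plugin list on stdin mentions pmix.
# Only a substring search, since the listing format varies by Slurm version.
supports_pmix() {
    grep -qi 'pmix'
}

# Usage on a Slurm login node (2>&1 in case the list goes to stderr):
#   if srun --mpi=list 2>&1 | supports_pmix; then
#       echo "srun offers pmix"
#   else
#       echo "srun lacks pmix; use mpirun inside an sbatch script"
#   fi
```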
Monitoring/Status
To check which jobs exist on the system, use the squeue command:
Code Block | ||||
---|---|---|---|---|
| ||||
[cpo@sdfiana002 ~]$ squeue -u ytl
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
30703603 ampere,ro out ytl PD 0:00 1 (launch failed requeued held)
30703602 ampere,ro out ytl PD 0:00 1 (launch failed requeued held)
30701730 ampere,ro out ytl PD 0:00 1 (launch failed requeued held)
30700739 ampere,ro out ytl PD 0:00 1 (launch failed requeued held)
30700738 ampere,ro out ytl PD 0:00 1 (launch failed requeued held)
30699545 ampere,ro out ytl PD 0:00 1 (launch failed requeued held)
30704838 milano out ytl CG 4:07 1 sdfmilan221
[cpo@sdfiana002 ~]$ |
The ST (job state) field shows that most jobs are pending (PD) and one is completing (CG).
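With many jobs queued, a count per state is often enough. The following is a minimal sketch, assuming the default squeue columns with ST in the fifth position and job names without spaces; the function name count_job_states is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state (ST column) in squeue output.
# Assumes the default columns:
#   JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
# and that job names contain no whitespace.
count_job_states() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ }
         END { for (state in n) print state, n[state] }' | LC_ALL=C sort
}

# Usage on a Slurm login node:
#   squeue -u "$USER" | count_job_states
```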
Get information about status of finished jobs
Code Block | ||||
---|---|---|---|---|
| ||||
[cpo@sdfiana002 ~]$ sacct -u ytl
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
30682524            out     milano shared:de+        112  PREEMPTED      0:0
30682524.ba+      batch            shared:de+        112  CANCELLED     0:15
30682524.ex+     extern            shared:de+        112  COMPLETED      0:0
30682525            out     milano shared:de+        112  PREEMPTED      0:0
30682525.ba+      batch            shared:de+        112  CANCELLED     0:15
30682525.ex+     extern            shared:de+        112  COMPLETED      0:0 |
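The default sacct table truncates fields and leaves the partition column empty for job steps, which makes it awkward to parse. A minimal sketch of filtering for jobs and steps that did not complete, assuming sacct is run with "-n" (no header) and "-P" (pipe-delimited output) and the three fields JobID, State, ExitCode; the function name not_completed is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print jobs/steps whose state is not COMPLETED.
# Expects pipe-delimited "JobID|State|ExitCode" lines on stdin.
not_completed() {
    awk -F'|' 'NF >= 3 && $2 != "COMPLETED" { print $1, $2, $3 }'
}

# Usage on a Slurm login node:
#   sacct -u "$USER" -n -P --format=JobID,State,ExitCode | not_completed
```

Running jobs are also listed by this filter, since their state is not COMPLETED either.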
scontrol is used to view or modify Slurm configuration including: job, job step, node, partition, reservation, and overall system configuration. Most of the commands can only be executed by user root or an Administrator.
scontrol show jobid -dd <jobID>
scontrol show res
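The output of "scontrol show jobid" is a list of Key=Value pairs, so a single field can be picked out with standard text tools. This is a minimal sketch, assuming whitespace-separated Key=Value pairs and values without embedded spaces; the function name job_field is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one Key=Value field from
# "scontrol show job" output read from stdin.
# Values that contain spaces are cut at the first space.
job_field() {
    local key="$1"
    tr -s ' \t' '\n' | awk -v k="$key" '
        index($0, k "=") == 1 { print substr($0, length(k) + 2); exit }'
}

# Usage on a Slurm login node (replace <jobID> with a real job ID):
#   scontrol show jobid -dd <jobID> | job_field JobState
```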
sacctmgr is used to deal with accounts, associations and users.
Format can be modified at will or removed to see all (can be hard to read, especially on smaller windows):
sacctmgr show associations Users=<user_names> format=cluster,account,partition,QOS
sacctmgr list associations -p accounts=<accounts>
sacctmgr show qos format=name,priority,usagefactor
Others
sprio -j <jobID>