LCLS-specific information about the S3DF facility is here: Running at S3DF.
General S3DF documentation is here: https://s3df.slac.stanford.edu
Some hardware details about S3DF can be found at the link below. Note that 8 of the 128 cores on each milano cluster node are reserved for the filesystem, so only 120 can be used per node for batch processing: https://s3df.slac.stanford.edu/public/doc/#/batch-compute?id=clusters-amp-repos
The partition/queue information can be obtained with the sinfo command.
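For example (the format string below is just one reasonable choice, not the only one):

# list all partitions; the default partition is marked with "*"
sinfo
# show only the milano partition: name, availability, time limit, node count, state
sinfo -p milano -o "%P %a %l %D %T"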
LCLS users typically use the "milano" queue at S3DF. The "*" following the roma queue name indicates that it is the default queue for submission.
The following is a simple submission script for a parallel psana batch job run with MPI. It can be submitted with the command "sbatch submit.sh". The commands specified in the script file will be run on the first available compute nodes that fit the resources requested. There are two concepts: "nodes" and "tasks per node". A "node" is a physical computer box (with a host name, for example), but each box/node typically has multiple CPU cores. Typically the tasks-per-node parameter is set to utilize all the cores on each node.
NOTE: when running interactively, the "mpirun" command takes a "-n" argument with the number of cores. However, when running with slurm no "-n" is needed, since slurm infers it from the "--nodes" and "--ntasks-per-node" values.
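A minimal sketch of what such a submit.sh could look like (the job name, log file, node/task counts, and the psana script name are all placeholders to be adjusted for your job):

#!/bin/bash
#SBATCH --partition=milano        # queue/partition to submit to
#SBATCH --job-name=psana_mpi      # arbitrary job name
#SBATCH --output=%j.log           # log file; %j expands to the job ID
#SBATCH --nodes=2                 # number of physical nodes
#SBATCH --ntasks-per-node=120     # MPI ranks per node (120 usable cores per milano node)

# no "-n" argument here: slurm tells mpirun how many ranks to start
mpirun python my_psana_script.py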
One can also do this same command from the command line using the "--wrap" option for sbatch:
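For example, with the same placeholder script name as above:

sbatch --partition=milano --nodes=2 --ntasks-per-node=120 --wrap="mpirun python my_psana_script.py"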
In principle the slurm "srun" command can also be used to launch parallel jobs. However, the current S3DF "srun" only supports the older "pmi2" protocol, which is incompatible with the MPI packages from conda that LCLS uses, since those use the newer "pmix" protocol (see the output of "srun --mpi=list"). srun should therefore be avoided for parallel jobs at S3DF.
To check the jobs that exist on the system, use the squeue command:
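For example (the user name and job ID below are placeholders):

squeue                # all jobs currently on the system
squeue -u $USER       # only your own jobs
squeue -j <jobID>     # a specific job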
In the ST (job state) field, PD indicates a pending job and CG a job that is completing.
Get information about status of finished jobs
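One common way to do this is with the sacct command, which queries the Slurm accounting database (the job ID, start date, and field list below are only examples):

# summary of a finished job
sacct -j <jobID> --format=JobID,JobName,Partition,State,ExitCode,Elapsed
# all of your jobs since a given date
sacct -u $USER -S <start_date>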
scontrol is used to view or modify Slurm configuration including: job, job step, node, partition, reservation, and overall system configuration. Most of the commands can only be executed by user root or an Administrator.
scontrol show jobid -dd <jobID>
scontrol show res
sacctmgr is used to view and manage accounts, associations, and users.
The format option can be changed as desired, or removed to show all fields (the full output can be hard to read, especially in smaller windows):
sacctmgr show associations Users=<user_names> format=cluster,account,partition,QOS
sacctmgr list associations -p accounts=<accounts>
sacctmgr show qos format=name,priority,usagefactor
To see the factors contributing to a pending job's scheduling priority, use sprio:
sprio -j <jobID>