SLURM

SLURM is the new batch job scheduling system for the LCLS batch compute systems at SLAC; it is replacing the current LSF system. Generic documentation about SLURM can be found in this Quick Start User Guide. Even shorter documentation, some of it specific to psana, can be found on this page.
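As a quick orientation (not an exhaustive list), these are the most commonly used SLURM commands, with rough LSF equivalents noted for users migrating from the old system:

Code Block
# Common SLURM commands (rough LSF equivalents shown for orientation)
sbatch submit.sh      # submit a batch script           (LSF: bsub < submit.sh)
squeue -u $USER       # list your pending/running jobs  (LSF: bjobs)
scancel <jobid>       # cancel a job                    (LSF: bkill <jobid>)
sinfo                 # show partitions and node states (LSF: bqueues/bhosts)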

S3DF Computing Facility

LCLS-specific information about the S3DF facility is here: Running at S3DF.

...

Some hardware details about the S3DF can be found here, although note that 8 cores on each milano cluster node are reserved for the filesystem, so only 120 (out of 128) can be used on each node for batch processing:  https://s3df.slac.stanford.edu/public/doc/#/batch-compute?id=clusters-amp-repos
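If you want to check the node core counts and states yourself, a query along the following lines should work (the milano partition name is taken from the submission script below; the format fields print node name, CPUs, memory, and state):

Code Block
# Show hostname, CPU count, memory (MB) and state for nodes in the milano partition
sinfo -p milano -o "%n %c %m %T"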

...

The following is a simple submission script for a parallel psana batch job run with MPI.  It can be submitted with the command "sbatch submit.sh". The commands specified in the script file will be run on the first available compute nodes that fit the resources requested.  There are two concepts: "nodes" and "tasks per node".  A "node" is a physical computer box (with a host name, for example), but each box/node typically has multiple CPU cores.  Typically the tasks-per-node parameter is set to use all the available cores on each node.

NOTE: when running interactively, the "mpirun" command takes a "-n" argument with the number of cores.  However, when running with SLURM no "-n" is needed, as SLURM infers it from the "--nodes" and "--ntasks-per-node" values.
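For example, an interactive run equivalent to the 2-node script below would pass the total rank count explicitly (240 here is just an illustration, 2 nodes x 120 tasks per node; adjust it to the cores you actually have available):

Code Block
# Interactive example: the rank count must be given explicitly with -n
mpirun -n 240 python -u -m mpi4py.run my_psana_script.py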

Code Block
> cat submit.sh
#!/bin/bash

#SBATCH --partition=milano
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log

# -u flushes print statements which can otherwise be hidden if mpi hangs
mpirun python -u -m mpi4py.run my_psana_script.py
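
Once the script is saved, a typical submit-and-monitor sequence looks like the following (the job ID is illustrative; because of the "--output=%j.log" line above, the log file is named after the job ID):

Code Block
> sbatch submit.sh          # prints "Submitted batch job <jobid>"
> squeue -u $USER           # check whether the job is pending or running
> tail -f <jobid>.log       # follow the job's output as it runs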

...