The SLAC workflow engine has been extended to allow multiple workflow engine jobs to run inside a single "pilot" job. The main intent is to allow multiple (possibly single-threaded) jobs to run on hosts supporting many threads, for example on CORI at NERSC, although the implementation is not specific to any particular machine.

Implementation

The workflow engine works by having a "JobDaemon" for each type of batch system (LSF, Condor, SLURM, etc.). In the case of pilot jobs we use a special daemon called the PilotJobDaemon, which does not submit jobs to a batch system, but instead keeps them in an internal queue until a pilot job comes and pulls them out of the queue to execute them. For the most part, using a PilotJobDaemon instead of a traditional JobDaemon is transparent both to the workflow engine and to developers of specific workflows.

For DESC we have set up two JobDaemons at NERSC, both running under the shared "desc" account:

  • SlurmJobDaemon, which submits jobs to SLURM. The "jobsite" which must be used to submit jobs to this daemon is "CORI".
  • PilotJobDaemon, which queues up jobs until a pilot job fetches and runs them. The "jobsite" which must be used to submit jobs to this daemon is "CORIP".

To a first approximation, the only change to the workflow to switch from one to the other is to change the jobsite. Since the jobsite can be set dynamically it is possible to have one workflow which submits jobs to either site (or even both sites).
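For example, the choice might appear in the workflow XML along these lines (the element and attribute names here are schematic illustrations, not the exact workflow schema):

<process name="myProcess">
  <job executable="runTask.sh" jobsite="CORIP"/>  <!-- use jobsite="CORI" for direct SLURM submission -->
</process>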

Batch Options

To the extent possible the options supported by the PilotJobDaemon are the same as those supported by the SlurmJobDaemon. Details below:

| Workflow XML | Option | Alias | Default | Meaning | Comments |
|--------------|--------|-------|---------|---------|----------|
| maxCPU | | | 1 hour | Max CPU used by the job (in seconds). | This is used for scheduling the job but is not currently enforced by the pilot. |
| maxMemory | | | 1GB | Max memory used by the job (in kB). | This is used for scheduling the job but is not currently enforced by the pilot. |
| batchOptions | -N | --nodes | 1 | The number of nodes on which the job will run. | |
| batchOptions | -t | --time | 01:00:00 | The wallclock time allowed for the job. | This is used for scheduling jobs in the pilot, but is not currently enforced by the pilot. |
| batchOptions | -L | --license | none | The list of licenses required by the job, separated by commas, e.g. -L SCRATCH. | Pilots will only accept jobs if all licenses are available in the pilot job. |
| batchOptions | -C | --constraint | none | The list of constraints required by the job, separated by commas, e.g. -C haswell. | Pilots will only accept jobs if all constraints are satisfied by the pilot job. |
| batchOptions | -p | --partition | none | The partition in which the job will be run. | Allows a pilot to selectively run only jobs submitted for a particular partition. Partition names can be assigned by the user. |
| batchOptions | -c | --cpus-per-task | 1 | The number of CPUs (threads) which will be allocated to this job. | This is used for scheduling jobs in the pilot, but is not currently enforced by the pilot. |
| batchOptions | | --ntasks-per-node | 1 | | Only for compatibility with SLURM; the option is ignored. |
| batchOptions | -J | --job_name | | The name of the job. | Only for compatibility with SLURM; the option is ignored. |
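As an illustration, the (schematic) job definition below combines several of these options; per the table, maxCPU is in seconds and maxMemory in kB, and all names other than the option flags themselves are hypothetical:

<process name="coaddPatch">
  <job jobsite="CORIP"
       maxCPU="7200"
       maxMemory="4000000"
       batchOptions="-t 02:00:00 -c 8 -C haswell -L SCRATCH -p coadd"/>
</process>

A pilot would only accept this job if it satisfies the haswell constraint, provides the SCRATCH license, and was started for the user-defined "coadd" partition.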

Pilot Jobs

In the current implementation the pilot jobs are not submitted automatically, although this may change in the future. Currently, to submit the default pilot job, simply log in as user "desc" (separate instructions needed?) and run the following:

sbatch ~/jobcontrol/runPilot.sh

It is possible to customize the pilot for specific applications: for example, you might want to set up a specific environment which will be inherited by all jobs running in the pilot, or you might want to customize some of the options supported by the job daemon. In either case you can create your own modified copy of the runPilot.sh script, and/or specify options on the command line, as in the sketch below.
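As a sketch, a customized copy might look like the following; the #SBATCH directives, the environment setup, and in particular the final launch command are illustrative assumptions (copy the real launch line from ~/jobcontrol/runPilot.sh rather than the one shown here):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#SBATCH --time=24:00:00

# Environment that will be inherited by every job this pilot runs (hypothetical path)
source /global/common/software/myexperiment/setup.sh
export OMP_NUM_THREADS=1

# Launch the pilot, advertising this node's resources to the PilotJobDaemon
# (schematic invocation -- take the real one from runPilot.sh and append options)
java -jar ~/jobcontrol/JobControlPilot.jar -c 32 -m 64000000 -C haswell -L SCRATCH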

The options supported by the JobControlPilot are:

-C (--constraint) VAL : Constraints satisfied by this pilot
-L (--license) VAL : Licenses provided by this pilot
-P N : The port that the pilot will attempt to pull jobs from (default: 0)
-c N : The total number of cores to share among all running jobs (default: 32)
-h VAL : The host from which this pilot will attempt to pull jobs (default: corigrid.nersc.gov)
-i N : The time after which the pilot will die if no work is provided (seconds) (default: 300)
-m N : The total memory (in kB) of this machine to share among all running jobs (default: 64000000)
-o : True if OK to overwrite existing files (default: false)
-p (--partition) VAL : If specified, only jobs requesting this partition will be run by this pilot
-r N : The maximum runtime for the job (seconds) (default: 172800)
-s VAL : The service name of the pilot service (default: PilotJobProvider)
-u VAL : The user name under which the pilot service is running (default: desc)
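For example, assuming your copy of the script (here called myRunPilot.sh, a hypothetical name) passes its arguments through to the JobControlPilot, a pilot offering 64 cores and 128GB on a haswell node, and restricted to a user-defined partition, might be started with:

sbatch -C haswell ~/jobcontrol/myRunPilot.sh -c 64 -m 128000000 -C haswell -L SCRATCH -p myPartition

Note that -C appears twice: once for SLURM, to place the pilot on a suitable node, and once for the pilot itself, to advertise that constraint to the jobs it accepts.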

Any number of pilot jobs can be submitted.

Limitations and future plans

  • Currently, while jobs are running in the JobControlPilot, their memory and CPU usage are always reported as zero, although the CPU time used will be reported normally once the job completes. This will be fixed soon.
  • Currently, if the PilotJobDaemon is stopped, all information about running jobs is lost. This will be fixed soon.
  • There is currently no support for checkpointing jobs running in the JobControlPilot, although there are plans to develop such a feature in the future, and most of the required infrastructure is already in place.

 
