...

The workflow engine uses a separate "JobDaemon" for each type of batch system (LSF, Condor, SLURM, etc.). For pilot jobs we use a special daemon, the PilotJobDaemon, which does not submit jobs to a batch system but instead holds them in an internal queue until a pilot job pulls them out of the queue and executes them. For the most part, using a PilotJobDaemon instead of a traditional job daemon is transparent to both the workflow engine and the developers of specific workflows.
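The pull model described above can be illustrated with a minimal sketch. This is not the PilotJobDaemon's actual code: the names `Job`, `PilotQueue`, `submit` and `pull` are hypothetical, and first-in, first-out ordering is an assumption made only for the illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    command: str


class PilotQueue:
    """Hypothetical stand-in for the internal queue held by the PilotJobDaemon."""

    def __init__(self):
        self._pending = deque()

    def submit(self, job):
        # Called on behalf of the workflow engine: the job is only queued,
        # nothing is sent to a batch system.
        self._pending.append(job)

    def pull(self):
        # Called by a pilot job: hand out the oldest queued job (FIFO is an
        # assumption), or None when the queue is empty.
        return self._pending.popleft() if self._pending else None


queue = PilotQueue()
queue.submit(Job("step1", "echo hello"))
first = queue.pull()   # the pilot receives the queued job
second = queue.pull()  # nothing left to pull
```

From the workflow engine's side only `submit` is visible, which is why swapping in a pilot-based daemon can stay transparent to workflow developers.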

...

To the extent possible, the options supported by the PilotJobDaemon are the same as those supported by the SlurmJobDaemon. Details below:

| Workflow XML | Option | Alias | Default | Meaning | Comments |
|---|---|---|---|---|---|
| maxCPU | | | 1 hour | Max CPU time used by the job (in seconds). | Used for scheduling the job, but not currently enforced by the pilot. |
| maxMemory | | | 1GB | Max memory used by the job (in kB). | Used for scheduling the job, but not currently enforced by the pilot. |
| batchOptions | -N | --nodes | 1 | The number of nodes on which the job will run. | Only for compatibility with SLURM. Option is ignored. |
| batchOptions | -t | --time | 01:00:00 | The wallclock time allowed for the job. | Used for scheduling jobs in the pilot, but not currently enforced by the pilot. |
| batchOptions | -L | --license | none | The list of licenses required by the job, separated by commas, e.g. -L SCRATCH | PilotJobs will only accept a job if all licenses are available in the pilot job. |
| batchOptions | -C | --constraint | none | The list of constraints required by the job, separated by commas, e.g. -C haswell | PilotJobs will only accept a job if all constraints are satisfied by the pilot job. |
| batchOptions | -p | --partition | none | The partition in which the job will be run. | Allows a PilotJob to selectively run only jobs submitted for a particular partition. Partition names can be assigned by the user. |
| batchOptions | -c | --cpus-per-task | 1 | The number of CPUs (threads) which will be allocated to this job. | Used for scheduling jobs in the pilot, but not currently enforced by the pilot. |
| batchOptions | | --ntasks-per-node | 1 | | Only for compatibility with SLURM. Option is ignored. |
| batchOptions | -J | --job_name | | The name of the job. | Only for compatibility with SLURM. Option is ignored. |

In addition, the memory and maxcpu can be specified as part of the workflow job definition (in XML). These are used for scheduling the job in the pilot, but are not currently enforced by the pilot.
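The acceptance rules listed in the table can be summarized in a short sketch. It is an illustration only, not the pilot's real scheduler: `JobRequest`, `Pilot` and `accepts` are hypothetical names, the 1GB default is taken to mean 1,000,000 kB, and a pilot started without a partition is assumed to accept jobs for any partition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobRequest:
    licenses: set = field(default_factory=set)     # from -L / --license
    constraints: set = field(default_factory=set)  # from -C / --constraint
    partition: Optional[str] = None                # from -p / --partition
    cpus: int = 1                                  # from -c / --cpus-per-task
    max_memory_kb: int = 1_000_000                 # maxMemory; 1GB as kB is an assumption


@dataclass
class Pilot:
    licenses: set = field(default_factory=set)
    constraints: set = field(default_factory=set)
    partition: Optional[str] = None
    free_cores: int = 32
    free_memory_kb: int = 64_000_000


def accepts(pilot, job):
    """Return True if this (hypothetical) pilot would take the job."""
    # Every license and constraint requested by the job must be offered by the pilot.
    if not job.licenses <= pilot.licenses:
        return False
    if not job.constraints <= pilot.constraints:
        return False
    # A pilot bound to a partition only runs jobs requesting that partition.
    # Assumption: a pilot with no partition accepts jobs for any partition.
    if pilot.partition is not None and job.partition != pilot.partition:
        return False
    # Requested CPUs and memory are used for scheduling only, not enforced at run time.
    return job.cpus <= pilot.free_cores and job.max_memory_kb <= pilot.free_memory_kb


pilot = Pilot(licenses={"SCRATCH"}, constraints={"haswell"})
ok = accepts(pilot, JobRequest(licenses={"SCRATCH"}, constraints={"haswell"}))
missing_license = accepts(pilot, JobRequest(licenses={"PROJECT"}))
```

Options that exist only for SLURM compatibility (-N, --ntasks-per-node, -J) are left out of the sketch because they are ignored.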

Pilot Jobs

In the current implementation, pilot jobs are not submitted automatically, although this may change in future. For now, to submit the default pilot job, log in as user "desc" (separate instructions needed?) and run the following:

...

The options supported by the JobControlPilot are:

-C (--constraint) VAL : Constraints satisfied by this pilot
-L (--license) VAL : Licenses provided by this pilot
-P N : The port that the pilot will attempt to pull jobs from (default: 0)
-c N : The total number of cores to share among all running jobs (default: 32)
-h VAL : The host from which this pilot will attempt to pull jobs (default: corigrid.nersc.gov)
-i N : The time after which the pilot will die if no work is provided (seconds) (default: 300)
-m N : The total memory (in kB) of this machine to share among all running jobs (default: 64000000)
-o : True if OK to overwrite existing files (default: false)
-p (--partition) VAL : If specified, only jobs requesting this partition will be run by this pilot
-r N : The maximum runtime for the job (seconds) (default: 172800)
-s VAL : The service name of the pilot service (default: PilotJobProvider)
-u VAL : The user name under which the pilot service is running (default: desc)

Any number of pilot jobs can be submitted.
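The effect of the idle timeout (-i) can be sketched with a small simulation on a virtual clock. This is a simplified illustration, not the JobControlPilot's implementation: `simulate_pilot` and the `poll` interval are hypothetical, and jobs are run one at a time, whereas a real pilot shares its cores and memory among several running jobs.

```python
def simulate_pilot(jobs, idle_timeout=300, poll=10):
    """Simulate a pilot's lifetime in virtual seconds.

    jobs: list of (available_at, duration) tuples, both in seconds.
    Returns (number of jobs completed, time at which the pilot exits).
    """
    pending = sorted(jobs)
    now = 0          # virtual clock, in seconds
    idle_since = 0   # time at which the pilot last had work
    completed = 0
    # The pilot exits once it has gone idle_timeout seconds without work.
    while now - idle_since < idle_timeout:
        if pending and pending[0][0] <= now:
            # A job is available: run it to completion (sequential simplification).
            _, duration = pending.pop(0)
            now += duration
            completed += 1
            idle_since = now
        else:
            # Nothing to pull yet: wait one (hypothetical) poll interval.
            now += poll
    return completed, now


busy = simulate_pilot([(0, 100), (50, 20)])
starved = simulate_pilot([(400, 10)])
```

In the second call the only job becomes available after the 300 second idle limit, so the pilot exits without running it. A job submitted while no pilot is alive simply waits in the queue until another pilot is started.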

Limitations and future plans

  • While a job is running in the JobControlPilot, its memory and CPU time used are always reported as zero; once the job completes, the CPU time used is reported normally. This will be fixed soon.
  • If the PilotJobDaemon is stopped, all information about running jobs is currently lost. This will be fixed soon.
  • There is currently no support for checkpointing jobs running in the JobControlPilot. There are plans to develop this feature in future, and most of the required infrastructure is already in place.