Introduction

The automated workflow system allows batch jobs to be submitted at SLAC or at other sites. Tasks can be configured to run at multiple sites, allowing jobs to flow to whichever site has machines available to run them.

Even when jobs are submitted to remote sites, a single instance of the pipeline server runs at SLAC. Batch jobs are submitted (at SLAC or elsewhere) through a simple batch submission daemon, which hides the details of the batch submission process from the pipeline server.

The batch daemon provides only three "methods":

  • Submit a job
  • Get the status of a job
  • Kill a job
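
The three-method contract above can be sketched as a small Java interface. This is an illustrative stand-in, not the actual org.srs.jobcontrol API; the interface name, method signatures, and status strings are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the daemon's three-method contract.
// Names and signatures are illustrative, not the real org.srs.jobcontrol API.
interface JobControl {
    String submit(String command);   // returns a batch-system job id
    String status(String jobId);     // e.g. "RUNNING", "KILLED", "UNKNOWN"
    void kill(String jobId);
}

// Trivial in-memory stand-in, useful only to show the call pattern.
class DummyJobControl implements JobControl {
    private final Map<String, String> jobs = new HashMap<>();
    private int nextId = 1;

    public String submit(String command) {
        String id = "job-" + nextId++;
        jobs.put(id, "RUNNING");
        return id;
    }

    public String status(String jobId) {
        return jobs.getOrDefault(jobId, "UNKNOWN");
    }

    public void kill(String jobId) {
        if (jobs.containsKey(jobId)) jobs.put(jobId, "KILLED");
    }
}

public class JobControlDemo {
    public static void main(String[] args) {
        JobControl jc = new DummyJobControl();
        String id = jc.submit("echo hello");
        System.out.println(id + " " + jc.status(id));  // job-1 RUNNING
        jc.kill(id);
        System.out.println(id + " " + jc.status(id));  // job-1 KILLED
    }
}
```

A real implementation would translate each call into the corresponding batch-system command (e.g. bsub/bjobs/bkill for LSF).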

The most important of these is the submit method. The job submitted by the pipeline consists of a top-level "pipeline_wrapper" (a bash script) which sends an initial e-mail back to the pipeline server when the job starts, and a second e-mail when the job finishes (successfully or not). This mechanism could be generalized in future so that the job daemon actively monitors jobs and sends the start/stop e-mails itself, instead of the batch job sending them. A data catalog maintained at SLAC can contain information on data available at SLAC or elsewhere. Batch jobs do not communicate directly with the database, the data catalog, or the pipeline server; all communication from the batch job is sent via the termination e-mail.

The status method is used by the pipeline server to monitor the job, but only to detect whether the job has vanished without trace, in which case the pipeline sets the job's status accordingly. The status method can also be used by a user interacting with the pipeline web interface to get the detailed status of a running job.

The kill method is rarely used, and could be considered optional.

We currently have several implementations of the batch control daemon. Due to the simplicity of the interface, it is expected that porting the daemon to a new service will be relatively easy. Currently supported:

  • LSF (used at SLAC)
  • BQS (previously used at IN2P3 but no longer actively used or supported)
  • GridEngine (used at IN2P3/Lyon)
  • Condor (used at SMU)
  • PBS (developed for potential future use but not yet tested)
  • Dirac (under development by Fermi/CTA collaborators in Europe)
  • Newt (https://newt.nersc.gov/) (under development -- an experimental interface to NERSC based on its web services interface)

All but the last are currently kept together in the subversion repository described below.

Checking out and building the source code

The Job Submission daemon is a small Java/Maven project and lives in subversion. It can be browsed on the web here:

http://java.freehep.org/svn/repos/srs/list/org-srs-jobcontrol/trunk/?revision=HEAD&bypassEmpty=true

or checked out anonymously from subversion here:

svn co svn://svn.slac.stanford.edu/srs/org-srs-jobcontrol/trunk

or for read/write access using 

svn co svn+ssh://svn.slac.stanford.edu/nfs/slac/g/srs/subversion/org-srs-jobcontrol/trunk

(Write access currently requires a SLAC account and ssh+kerberos access.) "JavaDoc" documentation is built automatically by Jenkins whenever subversion is updated and can be accessed here:

http://srs.slac.stanford.edu/hudson/view/SRS/job/org-srs-jobcontrol/javadoc/

The commands to check out and build the project are:

svn co svn://svn.slac.stanford.edu/srs/org-srs-jobcontrol/trunk org-srs-jobcontrol
cd org-srs-jobcontrol
mvn install

We recommend using Java 1.7 or greater and Maven 3.0.3 or greater.

The pipeline server can work with anything that extends the JobControlClient class, which allows jobs to be submitted, queried for their status, and canceled. The job-control project provides a framework for implementing JobControlClient using Java RMI to talk to a remote daemon, and all existing implementations are built on top of that common framework. There may be some issues with firewalls at remote sites, but the server can be configured to run on any port, and it only requires incoming connections to be allowed from the small set of SLAC hosts where the pipeline server runs.
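
The RMI-based communication pattern can be illustrated with a short sketch. The remote interface name, binding name, host, and port below are placeholders; the real contract is defined by the org-srs-jobcontrol framework.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

// Hypothetical remote contract mirroring the daemon's three methods.
// This is an illustration, not the actual org.srs.jobcontrol interface.
interface JobControlRemote extends Remote {
    String submit(String command) throws RemoteException;
    String status(String jobId) throws RemoteException;
    void kill(String jobId) throws RemoteException;
}

public class RmiClientSketch {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: RmiClientSketch <host> <port>");
            return;
        }
        // Look up the daemon's stub in the RMI registry at the remote site.
        // The binding name "JobControl" is a placeholder.
        Registry registry = LocateRegistry.getRegistry(args[0], Integer.parseInt(args[1]));
        JobControlRemote daemon = (JobControlRemote) registry.lookup("JobControl");
        String id = daemon.submit("pipeline_wrapper.sh");
        System.out.println("submitted as " + id);
    }
}
```

Because the registry port is configurable, a site's firewall only needs to admit connections on that one port from the SLAC hosts running the pipeline server.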

The main package of the framework is:

The following packages contain implementations for specific batch systems:

Testing a new job control service

There are example "main" classes in the package org.srs.jobcontrol.demo which can be used to test a new implementation of the job control daemon in standalone mode. Once the demos are working, some configuration of the server at SLAC is needed before jobs can be submitted to the new site.
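
A standalone smoke test might look like the following. This is only a sketch in the spirit of the demo classes; the client class, status strings, and polling logic are invented for illustration and do not reflect the actual demo code.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical standalone check, modeled on the idea of the classes in
// org.srs.jobcontrol.demo; names and status strings here are invented.
public class StandaloneSmokeTest {
    // Stand-in for a real client: reports RUNNING twice, then DONE.
    static class FakeClient {
        private int polls = 0;
        String submit(String command) { return "job-1"; }
        String status(String jobId) { return ++polls < 3 ? "RUNNING" : "DONE"; }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeClient client = new FakeClient();
        String id = client.submit("echo smoke-test");
        String state;
        // Poll until the job leaves the RUNNING state, conceptually similar
        // to the pipeline server's liveness check for submitted jobs.
        while ((state = client.status(id)).equals("RUNNING")) {
            TimeUnit.MILLISECONDS.sleep(10);
        }
        System.out.println(id + " finished with state " + state);
    }
}
```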
