Introduction

Pipeline II has been designed to allow batch jobs to be submitted at SLAC or at other sites. Tasks can even be designed to run at multiple sites, allowing jobs to flow to whichever site has machines available to run them. We expect this distributed job submission to be used mainly for Monte Carlo (MC) processing.

Even when jobs are submitted to remote sites, a single instance of the Pipeline II server still runs at SLAC. Batch jobs are submitted (at SLAC or elsewhere) through a simple batch submission daemon which is designed to hide the details of the batch submission process from the Pipeline II server. When jobs start they send an initial e-mail back to the pipeline server, and when they complete they send a second e-mail stating whether they finished successfully. A data catalog is maintained at SLAC which can contain information on data available at SLAC or elsewhere. Batch jobs do not communicate directly with the database or with the Pipeline II server.
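
As an illustration of the notification scheme, a batch job wrapper at a remote site might send the two status e-mails as sketched below. This is a minimal sketch, assuming a Unix mail command on the batch host; the address, subject format, and wrapper class are illustrative assumptions, not the actual pipeline protocol:

    import java.io.IOException;

    // Hypothetical wrapper around a batch job: e-mails the pipeline server
    // when the job starts and again when it finishes.
    public class JobWrapper {

        // Illustrative address only; the real pipeline address will differ.
        private static final String PIPELINE_ADDRESS = "pipeline@example.org";

        static void notify(String subject) throws IOException, InterruptedException {
            // Send an empty-bodied message with the standard Unix mail command.
            Process p = new ProcessBuilder("mail", "-s", subject, PIPELINE_ADDRESS).start();
            p.getOutputStream().close();
            p.waitFor();
        }

        public static void main(String[] args) throws Exception {
            String jobId = System.getenv("LSB_JOBID"); // set by LSF for running jobs
            notify("STARTED " + jobId);
            int status = new ProcessBuilder(args).inheritIO().start().waitFor();
            notify((status == 0 ? "SUCCEEDED " : "FAILED ") + jobId);
            System.exit(status);
        }
    }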

Initially the batch submission daemon has been set up to work with LSF, the batch system in use at SLAC, but it makes few assumptions about the specifics of the underlying batch system, so it should be reasonably easy to port to other sites.

Porting the Job Submission system to a new site

The Job Submission daemon is a separate Java project in GLAST CVS, with its own documentation. Strictly speaking the pipeline server can work with anything that extends the JobControlClient class, which allows jobs to be submitted, queried for their status, and canceled. In practice only a couple of classes in the implementation of the job control client depend on LSF, so we expect remote sites to run a modified version of the standard JobControlClient. The classes which depend on LSF are:

The standard implementation of JobControlClient uses Java Remote Method Invocation (RMI) to communicate with the server, so it should be quite possible for the daemon to run at a remote site. There may be some issues with firewalls at remote sites, but the daemon can be configured to listen on any port, and only requires incoming connections to be allowed from the small set of hosts at SLAC where we run the Pipeline II server.
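
As a concrete sketch of this arrangement, the daemon could export a remote interface through an RMI registry on a configurable port. The interface and class names below are illustrative assumptions, not the actual JobControlClient API; only the three operations named above (submit, status query, cancel) are taken from it:

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    // Illustrative remote interface covering the three operations the
    // pipeline server needs: submit a job, query its status, cancel it.
    interface JobControl extends Remote {
        String submit(String script) throws RemoteException;
        String status(String jobId) throws RemoteException;
        void cancel(String jobId) throws RemoteException;
    }

    public class JobControlDaemon {
        public static void main(String[] args) throws Exception {
            // The port is configurable, so a remote site can pick one its
            // firewall allows in from the SLAC pipeline hosts.
            int port = Integer.parseInt(args[0]);
            JobControl impl = new JobControl() {
                // Placeholder bodies; a real daemon would talk to the
                // local batch system here.
                public String submit(String script) { return "0"; }
                public String status(String jobId) { return "PENDING"; }
                public void cancel(String jobId) { }
            };
            JobControl stub = (JobControl) UnicastRemoteObject.exportObject(impl, port);
            Registry registry = LocateRegistry.createRegistry(port);
            registry.rebind("JobControl", stub);
            System.out.println("JobControl daemon listening on port " + port);
        }
    }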

We expect a remote JobControlService to be implemented to talk directly to the batch system at each remote site. Another possibility would be to implement a JobControlService which talks to a Grid job control system; this would have the potential advantage of allowing a single implementation of JobControlService to work with any site running that Grid job control system. We have not yet given much thought to the potential pitfalls of such an approach, but it may be worth considering.
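
For the first option, the submission half of an LSF-backed service might look roughly like the sketch below. bsub is the standard LSF submission command and normally echoes the new job's id, but the class shape and output parsing here are assumptions, not the existing implementation:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of the LSF-specific submission step a JobControlService
    // implementation might perform; other batch systems would replace
    // the command and the parsing.
    public class LsfSubmitter {

        // bsub normally reports: Job <12345> is submitted to queue <short>.
        private static final Pattern JOB_ID = Pattern.compile("Job <(\\d+)>");

        public String submit(String queue, String script) throws IOException {
            Process p = new ProcessBuilder("bsub", "-q", queue, script)
                    .redirectErrorStream(true).start();
            BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = out.readLine()) != null) {
                Matcher m = JOB_ID.matcher(line);
                if (m.find()) return m.group(1); // the LSF job id
            }
            throw new IOException("could not parse job id from bsub output");
        }
    }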

Other issues

Additional server work

Once a JobControlService is available which allows submission to remote sites, we will need to set up the Pipeline II server to work with it. While the server was designed to allow this, it has not yet been tried, and going from one site to more than one site will certainly require some work to allow additional configuration of the server.

Installing GLAST software at remote sites

In addition to the JobControlService, any GLAST software actually required to run the (MC) tasks will need to be installed at the remote site. We will need to adopt a standard mechanism for locating and setting up GLAST software at remote sites (for example, defining GLAST_ROOT and perhaps some standard setup scripts).
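
For example, job code could resolve all software locations from that one agreed-upon variable rather than from a hard-wired SLAC path. A minimal sketch, assuming a per-site GLAST_ROOT and a hypothetical directory layout beneath it:

    import java.io.File;

    // Sketch: locate site-installed GLAST software via GLAST_ROOT.
    public class GlastRoot {
        public static File resolve(String relativePath) {
            String root = System.getenv("GLAST_ROOT");
            if (root == null) {
                throw new IllegalStateException("GLAST_ROOT is not set at this site");
            }
            return new File(root, relativePath);
        }

        public static void main(String[] args) {
            // Hypothetical layout: a standard setup script under the root.
            System.out.println(resolve("bin/setup.sh"));
        }
    }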

Setting up tasks to support remote sites

Pipeline II allows task developers a great deal of freedom in how they set up specific tasks to run in the server. Up to now all tasks have been designed to run at SLAC, and typically have hard-wired paths to known file locations at SLAC. Once the pipeline server and remote JobControlService are set up, we will need to work with task developers to set up tasks in a site-independent manner.
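
One way to remove the hard-wired paths would be to push them into per-site configuration that a task reads at run time. A sketch only; the property names and one-file-per-site layout are assumptions, not an agreed convention:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    // Sketch: a task resolves its file locations from a per-site
    // properties file instead of hard-wiring SLAC paths.
    public class SiteConfig {
        private final Properties props = new Properties();

        public SiteConfig(String site) throws IOException {
            // e.g. slac.properties, remotesite.properties, ...
            try (FileInputStream in = new FileInputStream(site + ".properties")) {
                props.load(in);
            }
        }

        public String dataDir()    { return props.getProperty("data.dir"); }
        public String scratchDir() { return props.getProperty("scratch.dir"); }
    }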
