Requirements

(Copy from talks given at workshop in January)

Shortcomings of current system

The current system is very closely coupled to its expected usage

  • Conditions for a task to run are specified only by the availability of datasets
  • Tightly coupled to "Runs", which makes parallelization of tasks harder
  • Access to the data catalog is only via the pipeline database
  • Insufficient generalization makes the code complex and makes modifications hard

Pipeline Next Generation

We should implement the system as a set of components isolated by well-defined interfaces, to allow independent development, testing, and future enhancements.

We should keep modules lightweight, and adopt a technology for a particular component only when that technology has a clear benefit for that module.

The core scheduling engine should not know about datasets or runs; these are special cases of a more generic system.
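One way to picture this (a hypothetical sketch, not an agreed design): the engine evaluates a generic condition interface, and "dataset is available" or "disk space is low" are just particular implementations it never needs to know about. All class and function names below are illustrative assumptions.

```python
from abc import ABC, abstractmethod


class Condition(ABC):
    """Anything the engine may evaluate before running a task."""

    @abstractmethod
    def is_satisfied(self) -> bool: ...


class DatasetAvailable(Condition):
    """Special case: satisfied when a named dataset exists in the catalog."""

    def __init__(self, name, catalog):
        self.name = name
        self.catalog = catalog  # here simply a set of dataset names

    def is_satisfied(self) -> bool:
        return self.name in self.catalog


class DiskSpaceBelow(Condition):
    """Another special case: a condition on disk usage, not on datasets."""

    def __init__(self, used_fraction, limit=0.9):
        self.used_fraction = used_fraction
        self.limit = limit

    def is_satisfied(self) -> bool:
        return self.used_fraction < self.limit


def runnable(conditions):
    """The engine only ever asks: are all of a task's conditions satisfied?"""
    return all(c.is_satisfied() for c in conditions)
```

With this shape, adding a new kind of trigger means adding a new Condition implementation; the scheduling core is untouched.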

Tasks run by the scheduler should not be tied to the batch system

  • Some tasks (e.g. a database update) may be run in-process
  • Others may be more time-consuming and require a batch job
    • Should batch job be a single "pipeline task" or several? e.g.
      • Assemble job information
      • Submit batch job
      • Wait for job to complete, fail, or timeout
      • Update history database to reflect outcome of job
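The "several pipeline tasks" option above could be modelled as a fixed sequence of small steps, each a task in its own right. This is only a sketch of that idea; the step names are assumptions taken from the bullets, not an agreed interface.

```python
# Each step of a batch job becomes its own pipeline task; the scheduler
# advances to the next step only when the previous one has completed.
BATCH_STEPS = [
    "assemble_job_info",
    "submit_batch_job",
    "wait_for_completion",
    "update_history_db",
]


def next_step(current):
    """Return the step following `current`, or None when the job is done."""
    i = BATCH_STEPS.index(current)
    return BATCH_STEPS[i + 1] if i + 1 < len(BATCH_STEPS) else None
```

The alternative (one monolithic "pipeline task") folds all four steps into a single unit, which is simpler but loses per-step retry and history.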

It should be possible to mark tasks as active or inactive.

  • Only active tasks are loaded into the Rules engine.
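A minimal sketch of what this might look like, assuming task records carry an `active` flag (the record layout here is purely illustrative):

```python
# Only active tasks are handed to the rules engine; inactive ones stay
# in the task database but are never scheduled.
tasks = [
    {"name": "recon", "active": True},
    {"name": "old_calibration", "active": False},
]


def load_active(task_records):
    """Filter to the tasks the rules engine should actually see."""
    return [t for t in task_records if t["active"]]
```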

The Rules engine should deal with "Streams" rather than runs

  • A task instance should run when its conditions are valid and when …

Should we use a formal "rules engine", or just code our own solution?
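To make the "code our own" option concrete, here is a hedged sketch of how small such an engine could be: a rule is just a predicate plus an action, and the engine fires every rule whose predicate holds. Names and structure are assumptions for illustration only.

```python
class Rule:
    """A named predicate/action pair."""

    def __init__(self, name, predicate, action):
        self.name = name
        self.predicate = predicate  # state -> bool
        self.action = action        # state -> None

def run_rules(rules, state):
    """Fire every rule whose predicate holds; return the names fired."""
    fired = []
    for rule in rules:
        if rule.predicate(state):
            rule.action(state)
            fired.append(rule.name)
    return fired
```

A formal rules engine adds conflict resolution, rule chaining, and a rule language on top of this; whether we need those features is exactly the question.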

It may be useful to think of the dataset catalog as a product of the pipeline rather than a part of it. The data in the catalog needs to be readily accessible to users even if the history of how it was created is no longer available, or was never available because the data was created outside the pipeline.

As part of our work on Data Analysis for the Grid we developed a "meta-data catalog". Some of the ideas from this may be relevant and reusable.

Components for Next Generation pipeline

Major components

  • Scheduler/Rules engine
  • Task file (XML) import/export
  • Task Database
  • Data Catalog database
  • Processing history database
  • Web interface
  • Batch submission interface
  • Conditions interface (disk space/usage monitoring etc)
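The "isolated by well-defined interfaces" goal could mean each major component above is described by a small interface that the others code against, so implementations can be swapped independently. A hypothetical sketch for two of the components (all names and method signatures are assumptions):

```python
from typing import Protocol


class BatchSubmitter(Protocol):
    """Interface the scheduler codes against, whatever batch system sits behind it."""

    def submit(self, job_spec: dict) -> str: ...   # returns a job id
    def status(self, job_id: str) -> str: ...      # e.g. "RUNNING", "DONE"


class DataCatalog(Protocol):
    """Interface for the data catalog database component."""

    def register(self, dataset: str, location: str) -> None: ...
    def lookup(self, dataset: str) -> str: ...


class InMemoryCatalog:
    """Trivial stand-in implementation, useful for testing other components."""

    def __init__(self):
        self._locations = {}

    def register(self, dataset, location):
        self._locations[dataset] = location

    def lookup(self, dataset):
        return self._locations[dataset]
```

Because callers depend only on the interface, the in-memory stand-in and a real database-backed catalog are interchangeable in tests.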

Things to discuss

  • Interface to batch submission system