Requirements

(Copy from talks given at workshop in January)

Shortcomings of current system

The current system is very closely coupled to its expected usage

  • Conditions for a task to run are specified only by the availability of datasets
  • Tightly coupled to "Runs", which makes parallelization of tasks harder
  • Access to the data catalog is only via the pipeline database
  • Insufficient generalization makes the code complex and makes modifications hard

Pipeline Next Generation

We should implement the system as a set of components isolated by well-defined interfaces, to allow independent development, testing, and future enhancements.

We should keep modules lightweight, and adopt a technology for a particular component only when that technology has a clear benefit for that module.

The core scheduling engine should not know about datasets or runs; these are special cases of a more generic system.
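One way to picture this (a hypothetical sketch, not an agreed design): the engine evaluates a generic condition interface, and "dataset is available" or "disk space is low" are just particular implementations it never needs to know about. All class and function names below are illustrative assumptions.

```python
from abc import ABC, abstractmethod


class Condition(ABC):
    """Anything the engine may evaluate before running a task."""

    @abstractmethod
    def is_satisfied(self) -> bool: ...


class DatasetAvailable(Condition):
    """Special case: satisfied when a named dataset exists in the catalog."""

    def __init__(self, name, catalog):
        self.name = name
        self.catalog = catalog  # here simply a set of dataset names

    def is_satisfied(self) -> bool:
        return self.name in self.catalog


class DiskSpaceBelow(Condition):
    """Another special case: a condition on disk usage, not on datasets."""

    def __init__(self, used_fraction, limit=0.9):
        self.used_fraction = used_fraction
        self.limit = limit

    def is_satisfied(self) -> bool:
        return self.used_fraction < self.limit


def runnable(conditions):
    """The engine only ever asks: are all of a task's conditions satisfied?"""
    return all(c.is_satisfied() for c in conditions)
```

With this shape, adding a new kind of trigger means adding a new Condition implementation; the scheduling core is untouched.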

Tasks run by the scheduler should not be tied to the batch system

  • Some tasks (e.g. a database update) may be run in-process
  • Others may be more time-consuming and require a batch job
    • Should batch job be a single "pipeline task" or several? e.g.
      • Assemble job information
      • Submit batch job
      • Wait for job to complete, fail, or timeout
      • Update history database to reflect outcome of job
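The "several pipeline tasks" option above could be modelled as a fixed sequence of small steps, each a task in its own right. This is only a sketch of that idea; the step names are assumptions taken from the bullets, not an agreed interface.

```python
# Each step of a batch job becomes its own pipeline task; the scheduler
# advances to the next step only when the previous one has completed.
BATCH_STEPS = [
    "assemble_job_info",
    "submit_batch_job",
    "wait_for_completion",
    "update_history_db",
]


def next_step(current):
    """Return the step following `current`, or None when the job is done."""
    i = BATCH_STEPS.index(current)
    return BATCH_STEPS[i + 1] if i + 1 < len(BATCH_STEPS) else None
```

The alternative (one monolithic "pipeline task") folds all four steps into a single unit, which is simpler but loses per-step retry and history.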

It should be possible to mark tasks as active or inactive.

  • Only active tasks are loaded into the Rules engine.
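A minimal sketch of what this might look like, assuming task records carry an `active` flag (the record layout here is purely illustrative):

```python
# Only active tasks are handed to the rules engine; inactive ones stay
# in the task database but are never scheduled.
tasks = [
    {"name": "recon", "active": True},
    {"name": "old_calibration", "active": False},
]


def load_active(task_records):
    """Filter to the tasks the rules engine should actually see."""
    return [t for t in task_records if t["active"]]
```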

The Rules engine should deal with "Streams" rather than runs

  • A task instance should run when its conditions are valid and when …

Should we use a formal "rules engine", or just code our own solution?
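To make the "code our own" option concrete, here is a hedged sketch of how small such an engine could be: a rule is just a predicate plus an action, and the engine fires every rule whose predicate holds. Names and structure are assumptions for illustration only.

```python
class Rule:
    """A named predicate/action pair."""

    def __init__(self, name, predicate, action):
        self.name = name
        self.predicate = predicate  # state -> bool
        self.action = action        # state -> None

def run_rules(rules, state):
    """Fire every rule whose predicate holds; return the names fired."""
    fired = []
    for rule in rules:
        if rule.predicate(state):
            rule.action(state)
            fired.append(rule.name)
    return fired
```

A formal rules engine adds conflict resolution, rule chaining, and a rule language on top of this; whether we need those features is exactly the question.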

It may be useful to think of the dataset catalog as a product of the pipeline rather than a part of it. The data in the catalog needs to be readily accessible to users even if the history of how it was created is no longer available, or was never available because the data was created outside the pipeline.

As part of our work on Data Analysis for the Grid we developed a "meta-data catalog". Some of the ideas from this may be relevant and reusable.

Components for Next Generation pipeline

Major components

  • Scheduler/Rules engine
  • Task file (XML) import/export
  • Task Database
  • Data Catalog database
  • Processing history database
  • Web interface
  • Batch submission interface
  • Conditions interface (disk space/usage monitoring etc)
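The "isolated by well-defined interfaces" goal could mean each major component above is described by a small interface that the others code against, so implementations can be swapped independently. A hypothetical sketch for two of the components (all names and method signatures are assumptions):

```python
from typing import Protocol


class BatchSubmitter(Protocol):
    """Interface the scheduler codes against, whatever batch system sits behind it."""

    def submit(self, job_spec: dict) -> str: ...   # returns a job id
    def status(self, job_id: str) -> str: ...      # e.g. "RUNNING", "DONE"


class DataCatalog(Protocol):
    """Interface for the data catalog database component."""

    def register(self, dataset: str, location: str) -> None: ...
    def lookup(self, dataset: str) -> str: ...


class InMemoryCatalog:
    """Trivial stand-in implementation, useful for testing other components."""

    def __init__(self):
        self._locations = {}

    def register(self, dataset, location):
        self._locations[dataset] = location

    def lookup(self, dataset):
        return self._locations[dataset]
```

Because callers depend only on the interface, the in-memory stand-in and a real database-backed catalog are interchangeable in tests.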

Things to discuss

  • Interface to batch submission system