(Copy from talks given at workshop in January)
The current system is tightly coupled to its expected usage.
We should implement the system as a set of components isolated by well-defined interfaces, to allow independent development, testing, and future enhancement.
We should keep modules lightweight. Use a technology for a particular component only when that technology has a clear benefit for that module.
The core scheduling engine should not know about datasets or runs; these are special cases of a more generic system.
Tasks run by the scheduler should not be tied to the batch system.
It should be possible to mark Tasks as active or inactive.
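The three points above (a generic engine, batch-system-agnostic Tasks, and an active/inactive flag) can be sketched as a minimal interface. This is only an illustration of the decoupling being proposed, not a design; all names here (Task, Scheduler, TaskState, LocalTask) are hypothetical.

```python
from abc import ABC, abstractmethod
from enum import Enum


class TaskState(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"


class Task(ABC):
    """A schedulable unit of work, independent of any batch system."""

    def __init__(self, name: str):
        self.name = name
        self.state = TaskState.ACTIVE

    def deactivate(self) -> None:
        self.state = TaskState.INACTIVE

    @abstractmethod
    def run(self) -> None:
        """Subclasses decide how to execute: in-process, batch, grid, etc."""


class LocalTask(Task):
    """Example concrete Task that runs in-process rather than via a batch system."""

    def run(self) -> None:
        pass  # real work would go here


class Scheduler:
    """Core engine: knows only about Tasks, never about datasets or runs."""

    def __init__(self):
        self._tasks: list[Task] = []

    def add(self, task: Task) -> None:
        self._tasks.append(task)

    def run_active(self) -> list[str]:
        """Run every active Task and return the names of those that ran."""
        ran = []
        for task in self._tasks:
            if task.state is TaskState.ACTIVE:
                task.run()
                ran.append(task.name)
        return ran
```

Because the Scheduler sees only the abstract Task interface, a batch-system backend could later be added as another Task subclass without touching the engine.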
The rules engine should deal with "Streams" rather than runs.
Should we use a formal "rules engine", or just code our own solution?
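For the "code our own" option, a hand-rolled rules engine over Streams can be very small: a list of condition/action pairs evaluated against each Stream. A minimal sketch, purely illustrative; the Stream fields and all names are assumptions, not part of any agreed design.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stream:
    """Illustrative Stream: a named flow of events entering the pipeline."""
    name: str
    n_events: int


class RulesEngine:
    """Hand-rolled rules engine: (condition, action) pairs applied to Streams."""

    def __init__(self):
        self._rules: list[tuple[Callable[[Stream], bool],
                                Callable[[Stream], str]]] = []

    def add_rule(self, condition, action) -> None:
        self._rules.append((condition, action))

    def apply(self, stream: Stream) -> list[str]:
        """Return the result of every action whose condition matches."""
        return [action(stream)
                for condition, action in self._rules
                if condition(stream)]
```

A formal rules engine would add rule priorities, conflict resolution, and external rule definitions; whether that complexity is needed is exactly the open question above.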
It may be useful to think of the dataset catalog as a product of the pipeline rather than a part of it. The data in the catalog needs to be readily accessible to users even if the history of how it was created is no longer available, or was never available because the data was created outside the pipeline.
As part of our work on Data Analysis for the Grid we developed a "metadata catalog". Some of the ideas from this may be relevant and reusable.
Major components