Pipeline II Critical Needs

From Tom

1. Communication between job steps. In general, I need the ability for any job step to produce a datum (int/float/string) that can be accessed by any other job step in the task. Uses include gathering statistics on job step performance, passing return codes for making processing decisions, and compiling lists of files produced in previous job steps. A mechanism that has been discussed involves adding {key,value} pairs to the pipeline_summary file and then providing a tool with which a subsequent job step can extract these values.
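
A minimal sketch of how this hand-off might look, assuming a plain "key=value" line format in the pipeline_summary file (the file name comes from the discussion above; the line format and function names are illustrative only, not an agreed convention):

```python
# Sketch of the proposed {key,value} hand-off via the pipeline_summary file.
# The "key=value" line format is an assumption for illustration.

def write_summary_value(key, value, path="pipeline_summary"):
    """Append one key=value pair for later job steps to pick up."""
    with open(path, "a") as f:
        f.write("%s=%s\n" % (key, value))

def read_summary_value(key, path="pipeline_summary"):
    """Return the last value recorded for key, or None if absent."""
    result = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition("=")
            if k == key:
                result = v
    return result

# Example: an upstream step records its return code and output file...
write_summary_value("recon_rc", 0)
write_summary_value("recon_output", "/scratch/run0042/recon.root")
# ...and a downstream step makes a processing decision from them.
if read_summary_value("recon_rc") == "0":
    print("recon succeeded; input is", read_summary_value("recon_output"))
```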

2. Command line access to Pipeline Database values. This is the need to assess the progress of a running task, as well as to summarize a completed task. These were, by far, the most complex scripts I wrote in support of Pipeline I Monte Carlo. They evolved significantly over time, both to handle a changing landscape and to provide new functionality (such as analyzing job failures as a function of time and/or batch host). Thus, this functionality needs a degree of flexibility built in. A mechanism discussed with Tony is what he calls "pfind", a tool for querying the Pipeline II database.
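
To make the flexibility requirement concrete, here is a sketch of the kind of query a "pfind"-style tool would need to support, e.g. counting job-step failures per batch host for one task. The table and column names are assumptions about the Pipeline II schema, and sqlite3 merely stands in for whatever database interface is actually provided:

```python
# Assumed schema: a process_instance table with task, status, batch_host.
import sqlite3  # stand-in for the real Pipeline II database connection

def failures_by_host(conn, task_name):
    """Count failed job steps per batch host for the given task."""
    cur = conn.execute(
        "SELECT batch_host, COUNT(*) FROM process_instance "
        "WHERE task = ? AND status = 'FAILED' "
        "GROUP BY batch_host ORDER BY COUNT(*) DESC",
        (task_name,),
    )
    return cur.fetchall()

# Intended usage: print a quick failure summary for a running or completed task.
#   conn = sqlite3.connect("pipeline.db")
#   for host, n in failures_by_host(conn, "DC2-backgrounds"):
#       print(host, n)
```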

Note: There has long been discussion of how "user variables" can be (will be) stored in the Pipeline DB. Variables with obvious general value (e.g., job step return codes, number of events processed, etc.) would be properly typed and given a permanent slot in the database. Other variables of more application-specific import (e.g., number of events passing the OBF, number of MySQL warnings, number of entries into the SAA, or whatever) might all be stored as strings in a general-purpose DB record. A value which, over time, becomes more widely interesting could be "promoted" to its own permanent slot in the DB. Dan tells me that a numeric value stored as a string poses special problems when querying.
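
A small, self-contained demonstration of the querying problem Dan describes, using an in-memory SQLite table as a stand-in for the general-purpose string record: string comparison is lexicographic, so numeric cuts go wrong unless the value is explicitly cast (or promoted to a typed column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_var (name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO user_var VALUES (?, ?)",
    [("n_events", "900"), ("n_events", "12000")],
)

# String comparison: '900' > '10000' lexicographically, so the cut misfires.
rows = conn.execute(
    "SELECT value FROM user_var WHERE name='n_events' AND value > '10000'"
).fetchall()
print(rows)  # both rows come back: '900' wrongly passes the cut

# An explicit CAST recovers numeric ordering, at the cost of clumsier queries.
rows = conn.execute(
    "SELECT value FROM user_var WHERE name='n_events' "
    "AND CAST(value AS INTEGER) > 10000"
).fetchall()
print(rows)  # only '12000' survives
```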

3. Command line access to (sub-)stream rollback and (sub-)stream delete. For Pipeline II, the web interface offers many new features, including the ability to roll back (but not delete) a (sub-)stream. However, I envision generating a list of streams that require rolling back or deleting based on some offline analysis (likely using "pfind"), and I would then wish to carry out that deed from the command line without continually requesting new features of the web interface.
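
A sketch of that workflow, assuming stream IDs arrive from an offline analysis (e.g., a "pfind" query written to a file or piped in) and that some command-line rollback entry point eventually exists; the "pipeline rollback-stream" command here is hypothetical:

```python
import subprocess
import sys

def rollback_streams(stream_ids, dry_run=True):
    """Roll back each stream in the list via a (hypothetical) CLI."""
    for sid in stream_ids:
        cmd = ["pipeline", "rollback-stream", sid]  # hypothetical command
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.check_call(cmd)

if __name__ == "__main__":
    # Stream IDs arrive one per line, e.g. from: pfind ... > bad_streams.txt
    ids = [line.strip() for line in sys.stdin if line.strip()]
    rollback_streams(ids, dry_run=True)
```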

4. Command line access to the DataCatalog, including queries, registering, editing an existing registration, and removing a registration. There is a need to develop lists of registered files for many reasons, such as concatenating or skimming data. It may also be desirable to use the DataCatalog to communicate the names of created datafiles between job steps. When new columns of information are added to the DataCatalog, it may be necessary to backfill old tasks. In certain failure modes, one can imagine the need to completely wipe out an entire set of registered datasets prior to starting over. In the case of a pending or in-progress task, one may wish to adjust a status flag on a per-dataset basis, either automatically within the task or manually. Pipeline failures may leave the DB in an inconsistent or incorrect state that will need manual intervention. We may wish to move blocks of datafiles and update the DataCatalog. Etc., etc.
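
The pattern I have in mind is roughly the following; the "datacat" command and the options shown are assumptions about a future CLI, not an existing interface:

```python
import subprocess

def datacat(*args):
    """Run the (hypothetical) datacat CLI and return its stdout lines."""
    out = subprocess.check_output(("datacat",) + args, text=True)
    return [line for line in out.splitlines() if line]

# Intended usage, once such a CLI exists:
#   # build a file list for a concatenation or skim step
#   files = datacat("find", "--group", "DC2-backgrounds", "--site", "SLAC")
#   # register a newly created datafile
#   datacat("register", "--group", "DC2-backgrounds", "/data/merged/run0042.root")
#   # wipe a bad set of registrations before starting over
#   for f in files:
#       datacat("unregister", f)
```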

5. Seamless way to logically combine data from different streams. This may be unique to Monte Carlo production, but consider this use case: producing a 55-day set of background events. For DC2, this represented about 100,000 batch jobs spanning three tasks. In Pipeline II, one might reasonably consider defining the task so that a single stream would create a sub-task which, in turn, would create 100,000 sub-streams. Ideally, this would work fine. But life is never so simple... I would probably submit a stream to create, say, 10 sub-streams to verify the code and configuration. Then, I might submit a second stream to create 10,000 runs, again to watch for unexpected scaling problems. Finally, I might submit the whole lot. Another use case is where there is a request for 1,000 runs of a certain type of event and, upon successful completion, the request is changed to 10,000 runs. Part of Pipeline II's appeal is the ability to gather together the data from many sub-streams and operate on the whole set (e.g., for concatenation). How would this be done with the current system? A "super task" to collect together streams? The ability to submit additional sub-streams for an existing stream?
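
One way to make the "gather across streams" step concrete: if every sub-stream registers its output under a common logical dataset name, a later concatenation step can collect the whole 55-day set without caring whether it came from one stream, three tasks, or several incremental submissions. The record layout below is purely illustrative:

```python
from collections import defaultdict

def group_by_dataset(records):
    """records: iterable of (stream_id, dataset_name, file_path) tuples."""
    datasets = defaultdict(list)
    for stream_id, dataset, path in records:
        datasets[dataset].append(path)
    return datasets

# Files from a 10-run test stream and a later 10,000-run stream land in the
# same logical dataset, so the concatenation step sees a single list.
records = [
    ("task1.stream1.sub3", "backgrounds-55day", "/data/bkg/r000003.root"),
    ("task2.stream1.sub7", "backgrounds-55day", "/data/bkg/r010007.root"),
]
print(group_by_dataset(records)["backgrounds-55day"])
```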
