
Introduction

The Pipeline facility has six major functions:

  • automatically process Level 0 data through reconstruction (Level 1)
  • provide near real-time feedback to IOC
  • facilitate the verification and generation of new calibration constants
  • re-process existing data
  • produce bulk Monte Carlo simulations
  • back up all data that passes through the system

The Pipeline consists of five components:

  1. database access layer
  2. execution layer
  3. scheduler
  4. user interface
  5. relational database (management system)

The scheduler is the main loop of the Pipeline. This long-running process polls the database for new tasks and dispatches processes to the execution layer.
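
A minimal sketch of this polling loop, assuming hypothetical fetch_ready_tasks, dispatch, and mark_submitted helpers on the database access and execution layers:

```python
import time

POLL_INTERVAL_SECONDS = 60  # hypothetical polling period

def scheduler_loop(db, executor):
    """Main loop: poll the database for ready tasks and hand them to the execution layer."""
    while True:
        # fetch_ready_tasks() stands in for a query provided by the database access layer
        for task in db.fetch_ready_tasks():
            job_id = executor.dispatch(task)    # launch through the execution layer
            db.mark_submitted(task.id, job_id)  # record the submission
        time.sleep(POLL_INTERVAL_SECONDS)
```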

The execution layer exists to abstract site-specific details about how computing resources are invoked. It handles launching jobs and collecting output. At SLAC, this will be a thin wrapper around the LSF batch system toolchain. Other implementations will support simple clusters of machines using SSH for remote invocation, and single-machine use where jobs are launched on the same machine as the scheduler.
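
One way to picture this abstraction is a common dispatch interface with per-site back ends; the class and method names below are illustrative, not the actual design:

```python
import subprocess
from abc import ABC, abstractmethod

class ExecutionLayer(ABC):
    """Hides site-specific details of how jobs are launched."""

    @abstractmethod
    def dispatch(self, command: list) -> str:
        """Launch a job and return a site-specific job identifier."""

class LSFExecutor(ExecutionLayer):
    """Thin wrapper around the LSF batch system (submission via bsub)."""
    def dispatch(self, command):
        result = subprocess.run(["bsub"] + command, capture_output=True, text=True)
        return result.stdout.strip()  # bsub reports the assigned job id

class SSHExecutor(ExecutionLayer):
    """Simple cluster of machines, using SSH for remote invocation."""
    def __init__(self, host):
        self.host = host
    def dispatch(self, command):
        proc = subprocess.Popen(["ssh", self.host] + command)
        return f"{self.host}:{proc.pid}"

class LocalExecutor(ExecutionLayer):
    """Single-machine use: jobs run on the same machine as the scheduler."""
    def dispatch(self, command):
        proc = subprocess.Popen(command)
        return str(proc.pid)
```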

The database access module contains all SQL queries and statements required by other parts of the system. By keeping the rest of the system from knowing anything about the database, we isolate it from changes to both the schema and the database engine.
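
As an illustration of the isolation this buys, callers see only plain method calls, while table and column names (invented here) stay inside the access layer:

```python
class PipelineDB:
    """All SQL lives in this module; callers never see the schema or the engine."""

    def __init__(self, connection):
        self.conn = connection  # any DB-API 2.0 connection

    def fetch_ready_tasks(self):
        # Table and column names are illustrative, not the actual schema
        cur = self.conn.cursor()
        cur.execute("SELECT task_id, run_id, task_type FROM process_element "
                    "WHERE status = 'READY'")
        return cur.fetchall()

    def mark_submitted(self, task_id, job_id):
        # Placeholder style ('?') depends on the underlying driver
        cur = self.conn.cursor()
        cur.execute("UPDATE process_element SET status = 'SUBMITTED', job_id = ? "
                    "WHERE task_id = ?", (job_id, task_id))
        self.conn.commit()
```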

Inferred Functionality

Automation

  • operate in an automated fashion. After initial configuration, all processing should be automatic, requiring no human intervention
  • maintain at least a linear chain of processes ("process elements") per run identifier that will be executed for a given task type. A limited tree of processes is desirable: processes should be able to run in parallel, but it is not yet clear whether we should mandate that downstream processes can depend on the output of two or more other processes (a possible chain representation is sketched after this list).
  • process chains must be able to run in parallel, as in a batch farm
  • should be aware of its resources, e.g. available disk space, to avoid losing jobs because a disk is full.
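
As a possible representation of a chain of process elements per run (names and applications below are invented for illustration), a limited tree differs from a linear chain only in that one element lists more than one upstream dependency:

```python
from dataclasses import dataclass, field

@dataclass
class ProcessElement:
    name: str
    application: str                  # script or application run by this element
    version: str
    depends_on: list = field(default_factory=list)  # names of upstream elements

# Linear chain: each element depends only on the previous one.
linear_chain = [
    ProcessElement("unpack", "unpackLevel0", "v1"),
    ProcessElement("recon", "reconstruct", "v3", depends_on=["unpack"]),
]

# Limited tree: "merge" depends on the output of two upstream elements.
tree_chain = linear_chain + [
    ProcessElement("monitor", "dataMonitor", "v2", depends_on=["unpack"]),
    ProcessElement("merge", "mergeOutputs", "v1", depends_on=["recon", "monitor"]),
]
```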

Bookkeeping

  • must be able to maintain multiple pipelines simultaneously
  • process elements should know what their input and output datasets are
  • maintain the state of processing: the status of completed and pending elements. This state should be recoverable from the database (a possible state model is sketched after this list).
  • be able to rerun process elements, failed or not
  • keep track of all datasets involved in the processing, including metadata on file properties and paths
  • be able to take input from a different task, for use in reprocessing
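
A possible shape for this bookkeeping, with an explicit processing state and a per-dataset record; both are illustrative rather than the actual schema:

```python
from enum import Enum
from dataclasses import dataclass

class ElementStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"        # both failed and completed elements can be rerun

@dataclass
class DatasetRecord:
    name: str
    dataset_type: str        # e.g. Level 0, Level 1, Monte Carlo
    path: str                # each dataset carries its own location
    size_bytes: int
    producing_element: str   # the process element that wrote it
```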

Configuration

  • the pipeline operator must have a convenient (web) interface to configure a task.
  • this configuration must include the elements of the desired process chain, the names and versions of the scripts and applications that run in the chain, and the input and output datasets (and their desired locations) for each step (an illustrative configuration is sketched after this list).
  • support some flexibility in the directory structure for output datasets; at a minimum, support organization by run id and by dataset type.
  • ensure that each dataset knows its own location, in case it has to be moved away from where related datasets are stored
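
For concreteness, a task configuration covering these points might look roughly like the following; all names, paths, and keys are hypothetical:

```python
task_config = {
    "task": "Level1Recon",
    "chain": [
        {
            "element": "recon",
            "application": "reconstruct",
            "version": "v3",
            "input": {"dataset": "level0", "location": "/data/level0/{run_id}"},
            "output": {"dataset": "level1", "location": "/data/level1/{run_id}"},
        },
    ],
    # Output directories organized by dataset type and run id
    "output_layout": "{dataset_type}/{run_id}",
}
```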

Display

  • provide a status display for the processing: per task, provide a summary of all jobs, plus tabular information on jobs as requested (e.g. by processing state, run range, etc.)
  • provide a web query tool for the available datasets from the processing

Open Issues

  • if a minor step in a process chain needs redoing, perhaps with an updated version of its application, do we really need to create a new task? We would need to be able to track a history of the application versions used and of the outputs. This would have to be a limited capability if other steps are to use the revised output.

References

Latest schema: []
Pipeline View & Config: http://www.slac.stanford.edu/www-glast-dev/cgi/Pipeline
