...

  • operate in an automated fashion: after initial configuration, all processing should be automatic, requiring no human intervention
  • maintain at least a linear chain of processes ("process elements") per run identifier, to be executed for a given task type. A limited tree of processes is desirable: processes should be able to run in parallel, but it is not clear that we should mandate that a downstream process can depend on the output of two or more other processes.
  • process chains must be able to run in parallel, as on a batch farm
  • should be aware of its resources, e.g. available disk space, to avoid flushing jobs because a disk is full
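The "limited tree" requirement above could be checked with a short sketch like the following (Python; the `ProcessElement` and `validate_chain` names are hypothetical, not part of any existing pipeline code). It enforces that each process element depends on at most one upstream element, so every branch stays a linear chain.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessElement:
    """One step in a per-run process chain (illustrative sketch)."""
    name: str
    depends_on: list = field(default_factory=list)  # upstream element names

def validate_chain(elements):
    """Enforce the 'limited tree' requirement: every element depends on
    at most one upstream element, and all dependencies must exist."""
    names = {e.name for e in elements}
    for e in elements:
        if len(e.depends_on) > 1:
            raise ValueError(f"{e.name} depends on more than one element")
        for dep in e.depends_on:
            if dep not in names:
                raise ValueError(f"{e.name} depends on unknown element {dep}")
    return True

# A valid linear chain for one run identifier:
chain = [
    ProcessElement("calibrate"),
    ProcessElement("reconstruct", depends_on=["calibrate"]),
    ProcessElement("summarize", depends_on=["reconstruct"]),
]
validate_chain(chain)
```

Whether multi-parent dependencies should instead raise a warning rather than an error is exactly the open question noted above.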

    Bookkeeping

  • must be able to maintain multiple pipelines simultaneously
  • process elements should know what their input and output datasets are
  • maintain the state of processing (the status of completed and pending elements); this state should be recoverable from the database
  • be able to rerun process elements, failed or not
  • keep track of all datasets involved in the processing, including metadata about file properties and paths
  • be able to take input from a different task, for use in reprocessing
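One minimal way to make the processing state recoverable from a database, as required above, is a single table keyed by run identifier and element name. This is only an illustrative sketch (the table layout and function names are assumptions), using SQLite for brevity.

```python
import sqlite3

def init_db(conn):
    # One row per process element per run; status drives recovery and reruns.
    conn.execute("""CREATE TABLE IF NOT EXISTS elements (
        run_id    TEXT,
        name      TEXT,
        status    TEXT,   -- e.g. 'pending', 'running', 'done', 'failed'
        input_ds  TEXT,
        output_ds TEXT,
        PRIMARY KEY (run_id, name))""")

def record(conn, run_id, name, status, input_ds, output_ds):
    # INSERT OR REPLACE lets the same call mark any element (failed or not)
    # as pending again, satisfying the rerun requirement.
    conn.execute("INSERT OR REPLACE INTO elements VALUES (?, ?, ?, ?, ?)",
                 (run_id, name, status, input_ds, output_ds))

def pending(conn, run_id):
    # Recover the processing state: everything not yet 'done' for this run.
    return [row[0] for row in conn.execute(
        "SELECT name FROM elements WHERE run_id = ? AND status != 'done'",
        (run_id,))]

conn = sqlite3.connect(":memory:")
init_db(conn)
record(conn, "run42", "calibrate", "done", "raw/run42", "calib/run42")
record(conn, "run42", "reconstruct", "pending", "calib/run42", "reco/run42")
```

Because each element row carries its input and output dataset names, the same table also answers "what are this element's inputs and outputs" without a separate lookup.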

    Configuration

  • the pipeline operator must have a convenient (web) interface to configure a task.
  • this configuration must include the elements of the desired process chain, the names and versions of the scripts and applications that run in the chain, and the input and output datasets (and their desired locations) for each step.
  • support some flexibility in the directory structure for the output datasets; at a minimum, support organization by run ID and by dataset type
  • ensure that each dataset records its own location, in case it has to be moved away from the datasets it was produced with
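As an illustration of what the web interface might produce, the sketch below shows one possible shape for a task configuration (all field names and the path template are hypothetical). The template organizes output by dataset type and run ID, as required above.

```python
# Hypothetical task configuration, as might be produced by the web interface.
task_config = {
    "task": "reco-pass2",
    "chain": [
        {"element": "calibrate",   "script": "calib.py", "version": "1.3",
         "input": "raw",   "output": "calib"},
        {"element": "reconstruct", "script": "reco.py",  "version": "2.0",
         "input": "calib", "output": "reco"},
    ],
    # Directory template: organization by dataset type and run ID.
    "output_template": "/data/{dataset_type}/run{run_id}/",
}

def output_path(cfg, dataset_type, run_id):
    """Resolve the output directory for one dataset of one run."""
    return cfg["output_template"].format(dataset_type=dataset_type,
                                         run_id=run_id)
```

A different template string (say, grouping by run ID first) would satisfy the same requirement, which is the point of making the structure configurable rather than fixed.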

    Display

  • provide a status display for the processing: per task, provide a summary of all jobs, and tabular information on jobs as requested (e.g. by processing state, run range, etc.)
  • provide a web query tool for the available datasets from the processing
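The status display could be backed by queries like the following sketch (the job records and field names are assumptions): a per-task summary of jobs by processing state, plus a tabular selection filtered by state or run range.

```python
from collections import Counter

# Hypothetical per-task job records, as the bookkeeping layer might return them.
jobs = [
    {"run": 40, "state": "done"},
    {"run": 41, "state": "done"},
    {"run": 42, "state": "failed"},
]

def summarize(jobs):
    """Per-task summary: count of jobs in each processing state."""
    return Counter(j["state"] for j in jobs)

def select(jobs, state=None, run_range=None):
    """Tabular view, filtered by processing state and/or inclusive run range."""
    out = jobs
    if state is not None:
        out = [j for j in out if j["state"] == state]
    if run_range is not None:
        lo, hi = run_range
        out = [j for j in out if lo <= j["run"] <= hi]
    return out
```

The same two operations (aggregate by state, filter by attributes) would also serve the web query tool for datasets, with dataset metadata in place of job records.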

...