This post was created to provide details on the Processing Detail View (PFE-40) and access to PDF docs on the subject.
A PDF file showing details of the feature request is available here: http://confluence.slac.stanford.edu/download/attachments/680/processing-view.pdf?version=1
Here are what I take to be the primary issues that need addressing in the next iteration of the pipeline:
- more easily debuggable code; a language with a debugger. We've decided the bulk of the code that performs logic will be in Java. Perl (or Python) would be used for those functions closely tied to the Unix OS.
- should consider providing a facility for general users (hence user roles needed to control access)
- segregate resources - disk, batch allocation, db table space
- should consider running the system tests via the pipeline (also segregate its resources?)
- be able to run on remote (to SLAC) farms, maintaining the bookkeeping of the output
- need interfaces to main resources - database and batch system. Hibernate, a Java library, provides the database interface.
- the graph of processes that can be supported needs to be much richer than provided now
- conditions for running a process need not be restricted to availability of datasets
- multiple processes should be able to work in parallel on a given dataset
- a process may depend on multiple datasets
- will presumably need a mechanism to split large datasets for faster parallel processing, then reassemble the products
- a task can take inputs from another task
- the graph should support versions of process runs and, if desired, follow up by rerunning subsequent processes whose inputs now have a newer version. This could allow reprocessing while maintaining the basic identity of the original task.
- the bookkeeping should keep track of the important applications being run (e.g. in the current MC, the version of the script is not so important; the version of GlastRelease is, but it is not recorded)
- allow code patches, but record that they are in use. Presumably on a process instance basis (rather than by some time stamp after which it is assumed all code uses this version?).
- all files produced by the pipeline should be archived, and for real data, the input files too. At least the tasks run for the public good. Maybe not for users.
- at present, the archive is complicated by parallel tasks operating on a single run for I&T. Not clear how this would evolve in the world of a richer graph structure.
- output files should be write-protected
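A few of the graph requirements above (a process depending on multiple datasets, several processes working in parallel on one dataset, one process feeding another) can be sketched as a small dependency graph. This is only an illustration under assumed names: the process and dataset names below are hypothetical, and nothing here reflects how the real pipeline stores its graph.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical process definitions: each process lists the datasets it
# consumes and the datasets it produces.
PROCESSES = {
    "recon": {"inputs": {"digi_ds"}, "outputs": {"recon_ds"}},
    "calib": {"inputs": {"digi_ds"}, "outputs": {"calib_ds"}},
    "merit": {"inputs": {"recon_ds", "calib_ds"}, "outputs": {"merit_ds"}},
}


def runnable(processes, available, done):
    """Return the processes that have not run yet and whose inputs all exist."""
    return sorted(
        name
        for name, spec in processes.items()
        if name not in done and spec["inputs"] <= available
    )


def execution_order(processes):
    """Order processes so each one comes after the producers of its inputs."""
    producer = {
        ds: name for name, spec in processes.items() for ds in spec["outputs"]
    }
    graph = {
        name: {producer[ds] for ds in spec["inputs"] if ds in producer}
        for name, spec in processes.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

With only `digi_ds` available, both `recon` and `calib` are runnable at once, while `merit` has to wait for both of their outputs.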
More to come
Warren sent out a thread on directions for the svac pipelines on Dec 17. This is archived as an entry in the pipeline mailing list. Note the pdf file attachment in that mail.
We discovered early on with RM that one needs to protect its output from the unwary. Those with group write privilege to the disk could modify builds. Alex then write-protected the output (presumably turned off the group write bit). We should do something similar to protect the files the pipeline creates.
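As a rough sketch of what such protection might look like, the helper below clears write permission bits on the files under a directory. It is an assumption on my part that clearing all write bits is what we want; passing only the group bit would mimic what was presumably done for RM. The function names are made up for illustration.

```python
import os
import stat

# All three write bits: owner, group, other.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def write_protect(path, bits=WRITE_BITS):
    """Clear the given write bits on a file, leaving other permissions alone."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~bits)


def write_protect_tree(root, bits=WRITE_BITS):
    """Write-protect every file under root (directories are left untouched)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            write_protect(os.path.join(dirpath, name), bits)
```

Calling `write_protect(path, stat.S_IWGRP)` removes only the group write bit.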
This is another item we will need very soon. Navid has made a perl interface to the archive system. But now the issue is how and when to archive.
Since the I&T pipelines are parallel, they have several different named tasks operating on the same run, writing to the same directory. Additionally, not all files are reported to the pipeline, but people still want them archived. So we have to archive the entire directory. Ideally we would prevail on everyone to identify every file they want archived, but we seem to be on the losing end of that one!
I think this means the archiving has to happen asynchronously to the pipeline. I'd be curious to see a comment to this blog item from Dan with his thoughts on the algorithm for figuring out what to archive and when.
Note that SCS asks that we keep the files larger than 500 MB to use the tapes efficiently. So we had been thinking to make tar files. Navid keeps track in his archiver db of the file content inside the tar file, so he can ask the archive system for the right tar file when someone asks for an individual file.
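The tar-file idea can be sketched as a planning step: group files into batches that reach the 500 MB threshold and keep an index from each file to the tar that holds it. This is not Navid's archiver; the function name, the tar naming scheme and the in-memory index are all assumptions made for illustration.

```python
MIN_TAR_BYTES = 500 * 1024 * 1024  # the 500 MB threshold mentioned above


def plan_tar_files(files, min_bytes=MIN_TAR_BYTES, prefix="archive"):
    """Group (path, size) pairs into tar batches of at least min_bytes each.

    Returns (batches, index). batches maps a tar name to its member paths;
    index maps each member path back to the tar that holds it. The last
    batch may fall short of min_bytes if the files run out.
    """
    batches = {}
    current, current_size = [], 0
    for path, size in files:
        current.append(path)
        current_size += size
        if current_size >= min_bytes:
            batches[f"{prefix}-{len(batches):04d}.tar"] = current
            current, current_size = [], 0
    if current:
        batches[f"{prefix}-{len(batches):04d}.tar"] = current
    index = {
        path: name for name, members in batches.items() for path in members
    }
    return batches, index
```

A request for a single file then becomes a lookup in `index` followed by a request to the archive system for that tar file.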
I'm at a bit of a loss at the moment to divine a way to know when one can archive in a general way. It would be nice not to need custom archiving per group of tasks.
It can be hard to find the actual code that does the work. In my recent allGamma-GR-v5r0p2 task, I have one task process configured as:
- template provided by Dan with a few lines of recommended code to access things like the run id, and input and output datasets.
- submitted by GleamWrapper.pl
- does a little setup and executes allGamma.sh which runs Gleam, using environment variables to customize it.
I realize that gleam.pl is not necessary; GleamWrapper.pl could have easily done the work. I had based my task on Warren's recon task, where his 'gleam.pl' builds the shell script, and slavishly kept his structure.
But I realized that nowhere do we record the version of the underlying code that is run: nowhere in the database do we actually record the version of GlastRelease. We do have a spot to record the version of GleamWrapper.pl (the only executable Gino knows about per task process), though the xml configurator does not allow setting this version.
For some executables - and GlastRelease is an important one - we could use the version number to access the code. The Release Manager builds the releases and maintains a database giving access to them.
It would be good to both record the important version number and allow it to be found automatically rather than (eg) hardwired into my shell script as I'm doing now.
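One way this could look, sketched with made-up structures: the process instance carries the versions of the applications it ran, and the install location is looked up from the version instead of being hardwired. The `RELEASES` table is a hypothetical stand-in for the Release Manager database and its path is invented; only the version string comes from my task name.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the Release Manager database: version -> path.
# The path is invented for illustration.
RELEASES = {"v5r0p2": "/hypothetical/releases/GlastRelease/v5r0p2"}


@dataclass
class ProcessInstance:
    """A single run of a task process, with the application versions it used."""
    task: str
    run_id: int
    app_versions: dict = field(default_factory=dict)


def resolve_release(version, releases=RELEASES):
    """Return the install path for a release version, or raise KeyError."""
    if version not in releases:
        raise KeyError(f"unknown GlastRelease version: {version}")
    return releases[version]


instance = ProcessInstance(
    task="allGamma-GR-v5r0p2",
    run_id=1,  # hypothetical run id
    app_versions={"GlastRelease": "v5r0p2"},
)
release_path = resolve_release(instance.app_versions["GlastRelease"])
```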
There are 2 more use cases I expect we will need to handle:
- splitting input files
- eventually we will get ~1.5 GB files from the MOC for each downlink, and will need to separate them into many files to feed to the estimated 75 CPUs needed to process the downlink in ~1 hour. It is possible we will receive the 1.5 GB in multiple files - unknown yet, but see next bullet anyway.
- by the time we get ~4 towers integrated this coming year, keeping the raw data files small enough so that the recon files don't get out of sight will become a problem. They will be taking runs of 10 minutes or less!
- so we need a mechanism to expand out a run's input file to go to multiple nodes and then regroup them all at the end, I assume allowing a run to be a logical entity whose datasets could be lists of files. Further complication would be in handling subsequent steps in the pipeline for that run, and whether external pipelines make use of a dataset.
- MC concatenation of files
- once an MC run is done, we gather up the myriad ntuple files, run a pruning mechanism on them (for the MC I just ran I weeded out events with no tracks) and then concatenate them to a more manageable set of files (by user request <~ 200 MB or so).
- it would be nice if these could both be done automatically in the pipeline and be recorded. I am doing both manually. A task would have to have something like summary properties where you could make outputs based on a selection of all the input datasets in the task, and record the outcome. And again, the output dataset could be a list of files.
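The split-and-regroup step could be sketched as below, assuming a run can be addressed by event index and that each chunk reports back an output file. Both assumptions, and the function names, are mine; the real mechanism would depend on how the input files can actually be divided.

```python
def split_events(n_events, n_chunks):
    """Split event indices [0, n_events) into n_chunks contiguous ranges.

    Returns a list of (start, stop) pairs with stop exclusive. Chunk sizes
    differ by at most one event.
    """
    base, extra = divmod(n_events, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def regroup(chunk_outputs):
    """Turn (chunk_index, path) pairs into a run-level file list in chunk order."""
    return [path for _index, path in sorted(chunk_outputs)]
```

The run's output dataset is then the ordered list returned by `regroup`, which fits the idea of a dataset being a list of files.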
At the moment Gino is run from cron. When it wakes up, it checks to see if an instance of itself is already running, and exits if it is, to not step on itself.
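I have not described how Gino actually makes this check. One common way to get the same behaviour on Unix is an exclusive, non-blocking file lock, sketched here; the function name is hypothetical and the lock file location is left to the caller.

```python
import fcntl
import os


def acquire_single_instance_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file descriptor on success (keep it open for the life
    of the process), or None if another instance already holds the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A cron-started script would call this first and exit quietly when it gets `None`. The lock disappears when the process exits, so a crashed instance cannot leave a stale lock behind.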
Gino is also fairly verbose in generating a log file (that blew the glast04 /pipeline/ partition last week and has been moved to u12/pipeline/). The log is rather hard to parse, since Gino spawns processes that write to it asynchronously, and it is huge.
If we want to check aliveness with the resource checker, the best option currently is to check the last touched date on the log file. The Gino process itself cannot respond to queries.
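The last-touched check is simple enough to sketch. The function name and the age threshold are assumptions; a sensible threshold would depend on how often cron starts Gino.

```python
import os
import time


def log_is_fresh(log_path, max_age_seconds, now=None):
    """Return True if log_path was modified within the last max_age_seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(log_path)
    except FileNotFoundError:
        # A missing log is treated as "not alive".
        return False
    return age <= max_age_seconds
```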
Matt has suggested we move towards a java server, initially wrapping the scheduler perl script.
He says the java wrapper can (out of the box, more or less, I think):
- handle log files, automatically breaking them up into nice-sized chunks
- provide network connections for querying, though this would only tell you the wrapper is running in that stage of evolution.
I imagine there are other features I am forgetting. It would be nice if Matt could elaborate, giving a fuller feature list, a pointer to further reading and perhaps a simple demonstration example that wraps a perl script issuing a print statement or two?
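Until Matt fills in the details, here is a rough illustration of the log-handling half of the idea. It is written in Python rather than Java and is in no way Matt's wrapper: it just runs a child command and copies its output into a log that rotates by size. The function name and the default sizes are assumptions, and the network query part is not covered.

```python
import logging
import logging.handlers
import subprocess


def run_wrapped(command, log_path, max_bytes=1_000_000, backups=5):
    """Run command and copy each line of its output into a size-rotated log.

    Returns the child's exit code.
    """
    logger = logging.getLogger(f"wrapper.{log_path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep child output out of any root handlers
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    try:
        proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        for line in proc.stdout:
            logger.info(line.rstrip("\n"))
        return proc.wait()
    finally:
        logger.removeHandler(handler)
        handler.close()
```

Wrapping a Perl script would be a call such as `run_wrapped(["perl", "scheduler.pl"], "scheduler.log")`, where both file names are placeholders.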
I notice there is no place in the task table for a description of the task. Would be nice.
Also, it could be useful to allow the user to add options to the bsub command. One that comes to mind immediately is the -R option. One might be willing to trade time waiting for a job to start for the 2-3x gain in CPU between the barb and noma batch workers.
Of course both would have to be configurable from the web front end.
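On the submission side, passing user options through could be as small as the sketch below. The function name is hypothetical, and the option string would come from whatever field the web front end ends up storing; nothing here validates the options.

```python
import shlex


def build_bsub_command(script, extra_options=None):
    """Return the bsub argument list for script plus user-supplied options.

    extra_options is a single string, split the way a shell would split it,
    so quoted arguments stay together.
    """
    command = ["bsub"]
    if extra_options:
        command += shlex.split(extra_options)
    command.append(script)
    return command
```

For example, an option string of `-R "<resource requirement>"` yields the list `["bsub", "-R", "<resource requirement>", script]`, ready to hand to the process launcher.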
Posted from Warren's pipelinelist entry
A new run status, OldFail, or AcknowledgedFailure or something, which I could manually set runs/processes in the current Fail state to after investigating the failure. This would simplify debugging as I wouldn't have to wade through piles of old failures to find the (hopefully) few new ones.
The ability to filter the run list on run id (a list of ranges would be good) or date.
The ability to filter the run list based on run status. Ideally I'd be able to OR and NOT them as well.
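These filter requests can be sketched as a single function over a run list. The `Run` record and the function names are hypothetical, the status names other than Fail and AcknowledgedFailure are invented, and date filtering is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical minimal run record."""
    run_id: int
    status: str


def in_ranges(run_id, ranges):
    """True if run_id falls inside any inclusive (low, high) range."""
    return any(low <= run_id <= high for low, high in ranges)


def filter_runs(runs, id_ranges=None, include=None, exclude=None):
    """Filter runs by id ranges and status.

    include is a set of statuses ORed together; exclude is a set of
    statuses to leave out (the NOT case). None means no restriction.
    """
    selected = []
    for run in runs:
        if id_ranges is not None and not in_ranges(run.run_id, id_ranges):
            continue
        if include is not None and run.status not in include:
            continue
        if exclude is not None and run.status in exclude:
            continue
        selected.append(run)
    return selected
```

With an AcknowledgedFailure status in place, the new failures are just `filter_runs(runs, include={"Fail"})`, and everything except acknowledged ones is `filter_runs(runs, exclude={"AcknowledgedFailure"})`.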