This is another item we will need very soon. Navid has written a Perl interface to the archive system, but the open question is how and when to archive.
Since the I&T pipelines run in parallel, several differently named tasks operate on the same run and write to the same directory. Additionally, not all files are reported to the pipeline, yet people still want them archived. So we have to archive the entire directory. Ideally we would prevail on everyone to identify every file they want archived, but we seem to be on the losing end of that one!
I think this means the archiving has to happen asynchronously to the pipeline. I'd be curious to see a comment to this blog item from Dan with his thoughts on the algorithm for figuring out what to archive and when.
Note that SCS asks that we keep the files larger than 500 MB to use the tapes efficiently, so we had been thinking of making tar files. Navid's archiver db keeps track of the files inside each tar file, so he can ask the archive system for the right tar file when someone requests an individual file.
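The bookkeeping described here amounts to a lookup from a file path to the tar file that contains it. A minimal sketch in Java, assuming an in-memory map in place of the real archiver db; the class name `TarIndex`, its methods, and the example paths are all hypothetical and are not Navid's actual schema or API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for the archiver db's file-to-tar index.
class TarIndex {
    private final Map<String, String> tarByFile = new HashMap<>();

    // Record that every listed file was packed into the given tar file.
    void recordTar(String tarName, List<String> files) {
        for (String file : files) {
            tarByFile.put(file, tarName);
        }
    }

    // Which tar file must be fetched to restore this one file?
    Optional<String> tarFor(String file) {
        return Optional.ofNullable(tarByFile.get(file));
    }

    public static void main(String[] args) {
        TarIndex index = new TarIndex();
        // Made-up names, for illustration only.
        index.recordTar("run0001_part0.tar",
                Arrays.asList("run0001/digi.root", "run0001/recon.root"));
        System.out.println(index.tarFor("run0001/recon.root").orElse("not archived"));
    }
}
```

A request for a single file then becomes a lookup followed by a request to the archive system for one tar file.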
I'm at a bit of a loss at the moment to divine a way to know when one can archive in a general way. It would be nice not to need custom archiving per group of tasks.
4 Comments
Richard Dubois AUTHOR
How about the following? Add a 'group' property to the task and set it to the same value for all related tasks (i.e. all those that share processing of the same runs). Then the asynchronous archiver could check that all related tasks have finished processing a run, and then archive all directories specified for all tpi's. If they are truly related, that is only one directory, but this also covers the possibility of them being spread out a little. This depends on the archive facility keeping tabs on what is in the tar files.
I'd humbly suggest we do this in Java.
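A rough sketch of that check, assuming each tpi can be reduced to a task name, a group, a finished flag and an output directory. The `TaskRun` type and `directoriesToArchive` method are hypothetical names for illustration, not part of the pipeline:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class GroupArchivePlanner {
    // Hypothetical summary of one task's processing of one run (a tpi).
    static final class TaskRun {
        final String task;
        final String group;
        final boolean finished;
        final String directory;

        TaskRun(String task, String group, boolean finished, String directory) {
            this.task = task;
            this.group = group;
            this.finished = finished;
            this.directory = directory;
        }
    }

    // Returns the distinct directories to archive for one run, or an empty
    // set if any task in the group has not finished (or none belong to it).
    static Set<String> directoriesToArchive(String group, List<TaskRun> tpisForRun) {
        Set<String> dirs = new TreeSet<>();
        for (TaskRun tpi : tpisForRun) {
            if (!tpi.group.equals(group)) {
                continue; // unrelated task, ignore
            }
            if (!tpi.finished) {
                return new TreeSet<>(); // group not complete yet, archive nothing
            }
            dirs.add(tpi.directory);
        }
        return dirs;
    }
}
```

The asynchronous archiver would call this periodically per run and act only when the result is non-empty.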
Richard Dubois AUTHOR
From Navid:
Yes, the archiver divides the files into 500MB chunks. It has always done that. You just need to give it a parent directory and a list of files/directories, and it tars them all up, splits them into 500MB pieces and stores them in /mstore.
The 500MB piece size is configurable, so if you want larger or smaller pieces you can specify that too.
You can see it in action here (including all the features it has). Of course there's a command line API to it too. https://www.slac.stanford.edu/www-glast-dev/cgi/DiskArchive
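For a feel of how the piece size plays out, here is a small Java sketch of the arithmetic only. It is not the archiver's code; `pieceSizes` is a made-up helper that splits a total tar size into pieces of a configurable size:

```java
import java.util.ArrayList;
import java.util.List;

class PieceSplitter {
    // Sizes of the pieces produced when totalBytes of tar output are cut
    // into pieces of at most pieceBytes each; the last piece may be smaller.
    static List<Long> pieceSizes(long totalBytes, long pieceBytes) {
        if (pieceBytes <= 0) {
            throw new IllegalArgumentException("pieceBytes must be positive");
        }
        List<Long> sizes = new ArrayList<>();
        for (long remaining = totalBytes; remaining > 0; remaining -= pieceBytes) {
            sizes.add(Math.min(remaining, pieceBytes));
        }
        return sizes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 1200 MB tar with the default 500 MB piece size.
        System.out.println(pieceSizes(1200 * mb, 500 * mb).size() + " pieces");
    }
}
```

Note that the last piece can fall well below the piece size, which matters if the tapes really want files above 500 MB.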
Richard Dubois AUTHOR
Hmm. An additional wrinkle. As I recall, svac has optional pipelines. So if particle=cosmic, their recon pipeline gets triggered; otherwise not. How will we know (in a generic sense)??
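One possible way to make this generic, offered only as a sketch: let each task carry a condition on the run's properties, and have the archiver wait only for the tasks whose condition holds for that run. Everything below (the `OptionalTasks` class, the property map, the task names) is assumed for illustration and does not exist in the pipeline today:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class OptionalTasks {
    // Given a condition per task, return the (sorted) names of the tasks
    // that are expected to run for a run with the given properties.
    static List<String> expectedTasks(
            Map<String, Predicate<Map<String, String>>> conditionByTask,
            Map<String, String> runProperties) {
        List<String> expected = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> e : conditionByTask.entrySet()) {
            if (e.getValue().test(runProperties)) {
                expected.add(e.getKey());
            }
        }
        Collections.sort(expected);
        return expected;
    }
}
```

With a condition such as `props -> "cosmic".equals(props.get("particle"))` on the recon task, a non-cosmic run would not be held up waiting for a recon that never starts.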
Richard Dubois AUTHOR
Yet a further wrinkle, perhaps more directed to our current implementation. Dependent tasks get their input via soft links to files from other tasks. This is because, at present, all datasets in a task must have a filename based on the task name.
When we archive the directories, will we also archive the soft links? If not, is the information on how to get to the initial input file lost? I guess it also means we would have to restore to the identical location.
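If the links themselves are not archived, one fallback would be to record each link and its target in a manifest stored alongside the tar file. A sketch using the standard `java.nio.file` API; the `SoftLinkManifest` class and the idea of a manifest are assumptions, not something the archiver does now:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

class SoftLinkManifest {
    // Maps each soft link under dir (as a path relative to dir) to the
    // target stored in the link. Files.walk does not follow links unless
    // asked to, so linked directories are recorded rather than descended.
    static Map<String, String> collectLinks(Path dir) throws IOException {
        Map<String, String> links = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(dir)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isSymbolicLink(p)) {
                    links.put(dir.relativize(p).toString(),
                              Files.readSymbolicLink(p).toString());
                }
            }
        }
        return links;
    }
}
```

Absolute targets recorded this way still only resolve if the upstream files are restored to their original location, so the manifest preserves the provenance but does not remove the restore-in-place constraint.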