Taking runs short enough to make only moderate demands on CPU time and file size adds significant data-taking overhead. It is therefore desirable to allow longer runs (and hence larger digi files) and to reconstruct pieces of them in parallel.

The pieces of the puzzle are:

1. read the input digi file and find the number of events to be processed.
2. create a list of event ranges, one for each job to munch on
3. set up the run conditions for Gleam (via jobOptions and environment variables)
4. fork off a separate job for each piece, waiting for them all to finish
5. concatenate the output recon and merit files
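Step 2 of the list above (splitting the total event count into per-job ranges) can be sketched as follows; the function name and signature are mine for illustration, not the prototype's:

```python
def event_ranges(n_events, n_jobs):
    """Split n_events into contiguous (first, last) index ranges, one per job.

    Early jobs absorb the remainder, so range sizes differ by at most one.
    """
    base, extra = divmod(n_events, n_jobs)
    ranges, first = [], 0
    for i in range(n_jobs):
        count = base + (1 if i < extra else 0)
        ranges.append((first, first + count - 1))
        first += count
    return ranges
```

For example, 10 events over 3 jobs gives the ranges (0, 3), (4, 6), (7, 9).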

State of Prototyping

The pieces of this have been prototyped in Python by Richard and Warren; the code can be found on SLAC AFS in /afs/slac/u/ey/richard/GLAST/Pipeline/parallel/. The main script opens a digi file, gets the number of events, and calculates the event ranges for the jobs to work on.

It makes up a fake set of batch jobs and runs them, waiting for the last to finish.

It then sets up to concatenate the output files.
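Since the recon and merit chunks are ROOT files, the concatenation step could lean on ROOT's stock hadd utility. A sketch, assuming hadd is on the PATH (file names are illustrative only):

```python
import subprocess

def hadd_command(output, chunk_files):
    """Build the hadd command line; -f overwrites any existing output file."""
    return ["hadd", "-f", output] + list(chunk_files)

def concat_chunks(output, chunk_files):
    """Merge per-chunk ROOT files into one; raise if hadd reports failure."""
    rc = subprocess.call(hadd_command(output, chunk_files))
    if rc != 0:
        raise RuntimeError("hadd failed with rc=%d" % rc)
    return output
```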

There are two output modules supplying tools, one of them supplied by Warren.

What it does not do

It doesn't create the shell/jobOptions files needed to run Gleam. At the moment it also does not collect the return codes from the batch jobs, so it cannot tell whether any of them failed.

Presumably it should also delete the intermediate files if all went well.

TODO list

  • fix the return-code (rc) handling of the threads
  • prepare the shell and jobOptions files for each job
  • report the rc at the end of the script, and clean up the intermediate files if all went well
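A minimal sketch of how the first and third TODO items might fit together, using threads that each record their chunk's return code (all names here are hypothetical, not the prototype's code):

```python
import os
import subprocess
import threading

def run_and_record(cmd, rcs, i):
    """Run one chunk's command and store its return code in slot i."""
    rcs[i] = subprocess.call(cmd, shell=True)

def run_all(commands):
    """Run all chunk commands in parallel threads; return their rcs in order."""
    rcs = [None] * len(commands)
    threads = [threading.Thread(target=run_and_record, args=(cmd, rcs, i))
               for i, cmd in enumerate(commands)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rcs

def cleanup_if_ok(rcs, intermediates):
    """Delete intermediate files only if every chunk returned 0."""
    if all(rc == 0 for rc in rcs):
        for path in intermediates:
            if os.path.exists(path):
                os.remove(path)
        return True
    return False
```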

As things stand, I'm hoping Warren will pick up these pieces and construct a parallel recon task that handles his recon needs.



  1. Unknown User (wbfocke)

    This is in CVS project svac/svacPipeline, current tag is parallelRecon-v0r1.

    It makes jobOption files, submits jobs, they run OK, return codes are handled.

    Merit chunks get combined OK, as far as I can tell. Recon files don't get combined, I think it's some problem with LD_LIBRARY_PATH, ROOTSYS, CMTCONFIG, or GLAST_EXT.

  2. Unknown User (wbfocke)

    It's mostly working now. RootIoSvc.EvtMax is not being honored, so the first chunk gets all the events, the second gets all but the ones that were supposed to be in the first, and so on; only the last chunk gets exactly what it should.
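    The symptom can be illustrated with a toy model of which events each chunk reads (pure illustration, not pipeline code):

```python
def intended(n_events, starts):
    """Each chunk should read from its start up to the next chunk's start."""
    ends = starts[1:] + [n_events]
    return [list(range(s, e)) for s, e in zip(starts, ends)]

def actual_with_bug(n_events, starts):
    """With EvtMax ignored, each chunk reads from its start to end of file."""
    return [list(range(s, n_events)) for s in starts]
```

    With 6 events split at starts 0, 2, 4, only the last chunk's actual output matches its intended events.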

    Current changes are not committed to CVS because I screwed up with branching, I'll try to fix it tomorrow.

  3. Unknown User (wbfocke)

    I think it works. But I'm trying to find a workaround for some frailty in AFS/amd/LSF/something that makes bsub fail sometimes.

  4. Unknown User (wbfocke)

    Whatever the problem with bsub was, I can't get it to do it now. I went ahead and put in the next workaround I was going to try: a few seconds' delay between submissions. I had been trying to run 50-odd bsubs not just in quick succession, but simultaneously.
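    A sketch of that workaround: a short pause between submissions instead of firing them all at once. The delay value and the injectable submit function are my own choices for illustration:

```python
import subprocess
import time

def submit_staggered(commands, delay=2.0, submit=subprocess.call):
    """Submit each command in turn, pausing between submissions."""
    rcs = []
    for i, cmd in enumerate(commands):
        if i:
            time.sleep(delay)  # pause between submissions, not before the first
        rcs.append(submit(cmd, shell=True))
    return rcs
```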

    Still trying to fix the CVS branch issue, so the CVS version is still not current. But the code is installed at /nfs/slac/g/svac/common/pipeline/Integration/svacPipeline/v3r2p0/.

  5. Unknown User (wbfocke)

    So of course, mere seconds after I posted the last comment, several chunks failed due to some other LSF/NFS/amd flakiness:
    "/nfs/farm/lsb_spool/1122338134.98066: line 8: /nfs/slac/g/svac/common/pipeline/Integration/svacPipeline/v3r2p0/recon/reconOne-v3r2p0.csh: Input/output error"

    With the current strategy of running the chunks with bsub, rather than making a chunk task, we can't roll back individual chunks, so they have to all succeed in the same GINO run. If there are 50 chunks in a run, it only takes a 1.4% failure rate on the individual chunks to produce a 50% failure rate for the whole job.
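    The arithmetic behind that figure, assuming chunk failures are independent:

```python
def job_failure_rate(p_chunk, n_chunks=50):
    """Probability that at least one of n_chunks fails, given per-chunk rate p_chunk."""
    return 1.0 - (1.0 - p_chunk) ** n_chunks
```

    A per-chunk failure rate of about 1.4% over 50 chunks indeed gives roughly a 50% failure rate for the whole job.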