...

doRecon controls the reconstruction of the chunks. It reads a list of jobs to run and, for each one, spawns a thread that submits a batch job, waits for it to complete, and returns its status. It then writes a file listing which chunks, if any, failed, and exits with an unsuccessful return code if there were failed chunks. When the TP starts, it checks for this list of failed chunks; if the list is present and nonempty, it is used instead of the original list of all chunks written by setupRecon. This way, if some chunks fail, the TP can be rolled back and only the failed chunks need to be redone. The failed-chunk list is not registered as a pipeline dataset, since it violates logical constraints within GINO for a TP to modify its input this way.
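
Schematically, the control job looks something like the Python sketch below. The real doRecon is a Perl script; the chunk-list file names and the runReconChunk submit command here are placeholders, not the production names.

#!/usr/bin/env python
# Minimal sketch of the doRecon control-job pattern. "runReconChunk" and the
# chunk-list file names are hypothetical; the real script differs in detail.
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

ALL_CHUNKS = "chunkList.txt"        # written by setupRecon (name assumed)
FAILED_CHUNKS = "failedChunks.txt"  # written here, read again on the next attempt

def run_chunk(chunk):
    """Submit one chunk as a batch job, wait for it, return (chunk, status)."""
    status = subprocess.call(["runReconChunk", chunk])  # hypothetical submit-and-wait command
    return chunk, status

def main():
    # If a previous attempt left a nonempty failed-chunk list, redo only those chunks.
    if os.path.exists(FAILED_CHUNKS) and os.path.getsize(FAILED_CHUNKS) > 0:
        source = FAILED_CHUNKS
    else:
        source = ALL_CHUNKS
    with open(source) as f:
        chunks = [line.strip() for line in f if line.strip()]

    # One worker thread per chunk job.
    with ThreadPoolExecutor(max_workers=max(len(chunks), 1)) as pool:
        results = list(pool.map(run_chunk, chunks))

    failed = [chunk for chunk, status in results if status != 0]
    with open(FAILED_CHUNKS, "w") as out:
        out.writelines(chunk + "\n" for chunk in failed)
    # A nonzero exit tells the pipeline this TP failed, so it can be rolled back.
    sys.exit(1 if failed else 0)

if __name__ == "__main__":
    main()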

In order to avoid problems with unreliable NFS service, each chunk job copies the input digi file to a local disk on the batch host (if it has one; I think they all do now) and writes its output files there as well. It then moves the output files to a staging directory on AFS and deletes the local copy of the input file. When chunks fail, these files are left behind and eventually fill up the local disk, so there is a script to seek out and delete these orphaned files. It must run as user glastdpf. I usually log into a noric as glastdpf and run it by hand every once in a while, but a better solution would probably be to wrap it in a task and run that task every night from my crontab.
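
A minimal sketch of such an orphan sweep is shown below; the /scratch/glastdpf work area and the two-day age cutoff are assumptions, not necessarily what the real script uses.

#!/usr/bin/env python
# Sketch of an orphaned-file sweep over a batch host's local scratch area.
# The scratch path and age cutoff are assumed; the real script must run as glastdpf.
import os
import time

SCRATCH = "/scratch/glastdpf"   # assumed local-disk work area for chunk jobs
MAX_AGE = 2 * 24 * 3600         # seconds; files older than this are treated as orphans

def clean_orphans(root=SCRATCH, max_age=MAX_AGE):
    now = time.time()
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > max_age:
                print("removing orphan:", path)
                os.remove(path)
        # Remove work directories that are now empty.
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)

if __name__ == "__main__":
    clean_orphans()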

Merging the chunks of the recon file uses 4 TPs. The first (mergeRecon) performs the actual merge, from chunk files on the staging disk to a recon file on the staging disk. The second deletes the chunk files from the staging disk. The third copies the merged recon file from the staging disk to its final destination on NFS. The fourth deletes the merged file from the staging area. This may seem unreasonably complicated, but it reduces the amount of work that must be redone on a rollback, and in some cases a rollback wouldn't work otherwise: we used to get situations where something would fail after the chunks had been deleted, and we'd have to redo the reconstruction.
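
For illustration, a chunk merge along these lines can be done with PyROOT's TChain. The tree name "Recon" and the file names below are assumptions, and the actual mergeRecon.py does more bookkeeping than this.

#!/usr/bin/env python
# Minimal PyROOT merge sketch; tree name and paths are illustrative only.
import glob
import ROOT

def merge_chunks(chunk_glob, out_file, tree_name="Recon"):
    chain = ROOT.TChain(tree_name)
    for chunk in sorted(glob.glob(chunk_glob)):
        chain.Add(chunk)
    # TChain.Merge writes every chained entry into a single output file.
    chain.Merge(out_file)

if __name__ == "__main__":
    merge_chunks("recon-chunk-*.root", "recon-merged.root")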

Once the recon file is merged and moved, the reconReport and svac or beamtest tuple tasks are launched so that they can run while later steps are happening.

Merit and CAL files are merged in 4 TPs each, similarly to recon, but these files have a sufficiently simple structure that they can be merged with hadd instead of the custom PyROOT script used for recon.
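
For example, a merit or CAL merge step could be little more than a call to ROOT's hadd utility; the file names here are illustrative, not the actual naming convention.

#!/usr/bin/env python
# Sketch of merging the simpler merit/CAL chunk files with hadd.
import glob
import subprocess
import sys

def hadd_merge(out_file, chunk_glob):
    chunks = sorted(glob.glob(chunk_glob))
    # "hadd -f" overwrites any existing output file, so a rerun starts clean.
    return subprocess.call(["hadd", "-f", out_file] + chunks)

if __name__ == "__main__":
    sys.exit(hadd_merge("merit-merged.root", "merit-chunk-*.root"))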

3 more TPs enter URLs for the recon, merit, and CAL files into the eLog DB.

The final step (cleanup) just removes directories that were created in the staging area by the setup step. If the directories are not empty, the attempt to remove them will fail. This indicates that there was a problem earlier in the task, which should be investigated before removing the offending files by hand and doing a rollback.
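
The non-destructive behavior falls naturally out of os.rmdir, which refuses to remove a non-empty directory. A sketch of such a cleanup step (with made-up directory names; the real cleanup.py may differ) is:

#!/usr/bin/env python
# Sketch of the cleanup behavior: leftover files make this step fail rather
# than be deleted silently. Directory names are illustrative.
import os
import sys

def remove_staging_dirs(dirs):
    ok = True
    for d in dirs:
        try:
            os.rmdir(d)                     # raises OSError if d is not empty
        except OSError as err:
            print("not removing %s: %s" % (d, err))
            ok = False                      # leftover files mean an earlier step had a problem
    return ok

if __name__ == "__main__":
    dirs = ["staging/chunks", "staging/merged"]   # illustrative staging directories
    sys.exit(0 if remove_staging_dirs(dirs) else 1)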

Purpose                                   | Associated Scripts     | Input | Output | Comments
------------------------------------------|------------------------|-------|--------|--------------
finish up a run                           | cleanup.py             |       |        | external side
finish up a run                           | cleanupWrapper.pl      |       |        | pipeline side
control job for reconstruction of chunks  | doRecon.pl             |       |        | external side
control job for reconstruction of chunks  | doReconWrapper.pl      |       |        | pipeline side
                                          | genRTRLaunchWrapper.pl |       |        | pipeline side
make XML config file for task             | genXml.pl              |       |        |
merge chunks of recon file                | mergeRecon.py          |       |        | external side
merge chunks of recon file                | mergeReconWrapper.pl   |       |        | pipeline side
                                          | recon.py               |       |        | obsolete
reconstruct one chunk                     | recon0ne.csh           |       |        |
                                          | reconWrapper.pl        |       |        | obsolete
                                          | reprocess-licos.csh    |       |        | delete
                                          | reprocess-v3r1p5.csh   |       |        | delete
                                          | reprocess-version.csh  |       |        | delete
launch recon report                       | RunRALaunchWrapper.pl  |       |        | pipeline side
prepare chunk jobs                        | setupRecon.py          |       |        | external side
prepare chunk jobs                        | setupReconWrapper.pl   |       |        | pipeline side

...