...

doRecon controls the reconstruction of the chunks. It reads a list of jobs to run and, for each one, spawns a thread that submits a batch job, waits for it to complete, and returns its status. It then writes a file listing which chunks, if any, failed, and exits with an unsuccessful return code if there were failed chunks. When the TP starts, it checks for this list of failed chunks; if the list is present and nonempty, it is used instead of the original list of all chunks written by setupRecon. This way, if some chunks fail, the TP can be rolled back and only the failed chunks need to be redone. The failed-chunk list is not registered as a pipeline dataset, since it violates logical constraints within GINO for a TP to modify its input this way.
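
Schematically, the control job looks something like the Python sketch below. The real doRecon is a Perl script; the chunk-list file names and the runReconChunk submit command here are placeholders, not the production names.

#!/usr/bin/env python
# Minimal sketch of the doRecon control-job pattern. "runReconChunk" and the
# chunk-list file names are hypothetical; the real script differs in detail.
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

ALL_CHUNKS = "chunkList.txt"        # written by setupRecon (name assumed)
FAILED_CHUNKS = "failedChunks.txt"  # written here, read again on the next attempt

def run_chunk(chunk):
    """Submit one chunk as a batch job, wait for it, return (chunk, status)."""
    status = subprocess.call(["runReconChunk", chunk])  # hypothetical submit-and-wait command
    return chunk, status

def main():
    # If a previous attempt left a nonempty failed-chunk list, redo only those chunks.
    if os.path.exists(FAILED_CHUNKS) and os.path.getsize(FAILED_CHUNKS) > 0:
        source = FAILED_CHUNKS
    else:
        source = ALL_CHUNKS
    with open(source) as f:
        chunks = [line.strip() for line in f if line.strip()]

    # One worker thread per chunk job.
    with ThreadPoolExecutor(max_workers=max(len(chunks), 1)) as pool:
        results = list(pool.map(run_chunk, chunks))

    failed = [chunk for chunk, status in results if status != 0]
    with open(FAILED_CHUNKS, "w") as out:
        out.writelines(chunk + "\n" for chunk in failed)
    # A nonzero exit tells the pipeline this TP failed, so it can be rolled back.
    sys.exit(1 if failed else 0)

if __name__ == "__main__":
    main()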

In order to avoid problems with unreliable NFS service, each chunk job copies the input digi file to a local disk on the batch host (if it has one; I think they all do now) and writes its output files there as well. It then moves the output files to a staging directory on AFS and deletes the local copy of the input file. When chunks fail, these files are left behind and eventually fill up the local disk, so there is a script to seek out and delete these orphaned files. It must run as user glastdpf. I usually log into a noric as glastdpf and run it by hand every once in a while, but a better solution would probably be to wrap it in a task and run that task every night from my crontab.
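
A minimal sketch of such an orphan sweep is shown below; the /scratch/glastdpf work area and the two-day age cutoff are assumptions, not necessarily what the real script uses.

#!/usr/bin/env python
# Sketch of an orphaned-file sweep over a batch host's local scratch area.
# The scratch path and age cutoff are assumed; the real script must run as glastdpf.
import os
import time

SCRATCH = "/scratch/glastdpf"   # assumed local-disk work area for chunk jobs
MAX_AGE = 2 * 24 * 3600         # seconds; files older than this are treated as orphans

def clean_orphans(root=SCRATCH, max_age=MAX_AGE):
    now = time.time()
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > max_age:
                print("removing orphan:", path)
                os.remove(path)
        # Remove work directories that are now empty.
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)

if __name__ == "__main__":
    clean_orphans()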

Merging the chunks of the recon file uses 4 TPs. The first (mergeRecon) performs the actual merge, from chunk files on the staging disk to a recon file on the staging disk. The second deletes the chunk files from the staging disk. The third copies the merged recon file from the staging disk to its final destination on NFS. The fourth deletes the merged file from the staging area. This may seem unreasonably complicated, but it reduces the amount of work that must be redone on a rollback, and in some cases a rollback wouldn't work otherwise: we used to get situations where something would fail after the chunks had been deleted, and we'd have to redo the reconstruction.
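
For illustration, a chunk merge along these lines can be done with PyROOT's TChain. The tree name "Recon" and the file names below are assumptions, and the actual mergeRecon.py does more bookkeeping than this.

#!/usr/bin/env python
# Minimal PyROOT merge sketch; tree name and paths are illustrative only.
import glob
import ROOT

def merge_chunks(chunk_glob, out_file, tree_name="Recon"):
    chain = ROOT.TChain(tree_name)
    for chunk in sorted(glob.glob(chunk_glob)):
        chain.Add(chunk)
    # TChain.Merge writes every chained entry into a single output file.
    chain.Merge(out_file)

if __name__ == "__main__":
    merge_chunks("recon-chunk-*.root", "recon-merged.root")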

Once the recon file is merged and moved, the reconReport and svac or beamtest tuple tasks are launched so that they can run while later steps are happening.

Merit and CAL files are merged in 4 TPs each, similarly to recon, but these files have a sufficiently simple structure that they can be merged with hadd instead of the custom PyROOT script used for recon.
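
For example, a merit or CAL merge step could be little more than a call to ROOT's hadd utility; the file names here are illustrative, not the actual naming convention.

#!/usr/bin/env python
# Sketch of merging the simpler merit/CAL chunk files with hadd.
import glob
import subprocess
import sys

def hadd_merge(out_file, chunk_glob):
    chunks = sorted(glob.glob(chunk_glob))
    # "hadd -f" overwrites any existing output file, so a rerun starts clean.
    return subprocess.call(["hadd", "-f", out_file] + chunks)

if __name__ == "__main__":
    sys.exit(hadd_merge("merit-merged.root", "merit-chunk-*.root"))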

3 more TPs enter URLs for the recon, merit, and CAL files into the eLog DB.

The final step (cleanup) just removes directories that were created in the staging area by the setup step. If the directories are not empty, the attempt to remove them will fail. This indicates that there was a problem earlier in the task, which should be investigated before removing the offending files by hand and doing a rollback.
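
The non-destructive behavior falls naturally out of os.rmdir, which refuses to remove a non-empty directory. A sketch of such a cleanup step (with made-up directory names; the real cleanup.py may differ) is:

#!/usr/bin/env python
# Sketch of the cleanup behavior: leftover files make this step fail rather
# than be deleted silently. Directory names are illustrative.
import os
import sys

def remove_staging_dirs(dirs):
    ok = True
    for d in dirs:
        try:
            os.rmdir(d)                     # raises OSError if d is not empty
        except OSError as err:
            print("not removing %s: %s" % (d, err))
            ok = False                      # leftover files mean an earlier step had a problem
    return ok

if __name__ == "__main__":
    dirs = ["staging/chunks", "staging/merged"]   # illustrative staging directories
    sys.exit(0 if remove_staging_dirs(dirs) else 1)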

Purpose                                   | Associated Scripts     | Input | Output | Comments
------------------------------------------|------------------------|-------|--------|--------------
finish up a run                           | cleanup.py             |       |        | external side
finish up a run                           | cleanupWrapper.pl      |       |        | pipeline side
control job for reconstruction of chunks  | doRecon.pl             |       |        | external side
control job for reconstruction of chunks  | doReconWrapper.pl      |       |        | pipeline side
                                          | genRTRLaunchWrapper.pl |       |        | pipeline side
make XML config file for task             | genXml.pl              |       |        |
merge chunks of recon file                | mergeRecon.py          |       |        | external side
merge chunks of recon file                | mergeReconWrapper.pl   |       |        | pipeline side
                                          | recon.py               |       |        | obsolete
reconstruct one chunk                     | recon0ne.csh           |       |        |
                                          | reconWrapper.pl        |       |        | obsolete
                                          | reprocess-licos.csh    |       |        | delete
                                          | reprocess-v3r1p5.csh   |       |        | delete
                                          | reprocess-version.csh  |       |        | delete
launch recon report                       | RunRALaunchWrapper.pl  |       |        | pipeline side
prepare chunk jobs                        | setupRecon.py          |       |        | external side
prepare chunk jobs                        | setupReconWrapper.pl   |       |        | pipeline side

...