Tom would like to operate on directories and files created by the pipeline server using standard Unix tools. He needs this to debug and fix problems whose nature cannot be known in advance, so the method must be flexible enough to be adapted quickly to arbitrary problems as they occur. It must also handle large numbers (>30000) of files and directories.
Below I present a few performance measurements, and then a proposal for a slightly different way of addressing the problem.
Notes:
No Format |
---|
[~tonyj:glastlnx07] /nfs/farm/g/glast/u26/MC-tasks/obssim-ST-v7r6p1/output > /usr/bin/time grep cob0343 */logFile.txt
0.16user 0.13system 0:00.73elapsed 39%CPU
|
Note: This does not scale to large numbers of directories, because */logFile.txt is expanded by the shell and the expanded command line eventually becomes too long.
No Format |
---|
[~tonyj:glastlnx07] /nfs/farm/g/glast/u26/MC-tasks/obssim-ST-v7r6p1/output > /usr/bin/time find */logFile.txt -exec grep cob0343 \{\} \;
0.44user 1.95system 0:03.41elapsed 69%CPU
|
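Note: This approach scales much better, because grep is run once per file and no grep command line grows with the number of files. grep is invoked 1500 times in this case, but that does not seem to introduce a huge overhead. As written, */logFile.txt is still expanded by the shell before find starts; letting find locate the files itself (for example with -name logFile.txt) would remove that remaining limit.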
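A minimal sketch of the same search with find locating the files itself, so that nothing depends on the shell expanding */logFile.txt. The temporary directory tree and the host names in it are made up for illustration; getconf ARG_MAX reports the command-line size limit that the glob form eventually runs into.

```shell
# Throwaway tree that mimics <run>/logFile.txt (made-up layout, for illustration only).
tmp=$(mktemp -d)
for i in 1 2 3; do
    mkdir -p "$tmp/run$i"
    echo "host=cob034$i" > "$tmp/run$i/logFile.txt"
done

# Upper bound, in bytes, on the argument list accepted for a single command.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX is $arg_max bytes"

# find walks the tree itself; "{} +" passes many file names to each grep,
# and grep -l prints only the names of the files that match.
matches=$(find "$tmp" -name logFile.txt -exec grep -l cob0343 {} +)
echo "$matches"
```

With "{} +" find starts as few grep processes as the argument limit allows, whereas "{} \;" starts one per file as in the measurement above.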
No Format |
---|
[~tonyj:glastlnx07] /nfs/farm/g/glast/u26/MC-tasks/obssim-ST-v7r6p1/output > ls -1 */logFile.txt > /tmp/file.list
cat /tmp/file.list | /usr/bin/time xargs -i grep cob0343 \{\}
0.50user 1.80system 0:03.58elapsed 64%CPU
|
Note: Performance is very similar to using find. As with find, grep is invoked 1500 times.
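The per-file invocation count comes from the replace option of xargs; without it, xargs packs many file names into each command. The sketch below counts the processes started in both modes on a made-up tree of 50 log files. It assumes a find and xargs that support -print0 and -0 (GNU and BSD versions do), and the directory names and search string are for illustration only.

```shell
# Throwaway tree with 50 log files (made-up layout, for illustration only).
tmp=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    mkdir -p "$tmp/run$i"
    echo "host=cob$i" > "$tmp/run$i/logFile.txt"
    i=$((i + 1))
done

# One process per file, as in the measurement above
# (-I {} is the current spelling of the older -i option).
per_file=$(find "$tmp" -name logFile.txt -print0 |
    xargs -0 -I {} sh -c 'echo started' sh {} | wc -l | tr -d ' ')

# Batched: without -I, xargs packs as many names as fit into each invocation.
batched=$(find "$tmp" -name logFile.txt -print0 |
    xargs -0 sh -c 'echo started' sh | wc -l | tr -d ' ')

echo "per-file: $per_file processes, batched: $batched processes"

# The actual search, batched; grep -l prints the names of matching files.
hits=$(find "$tmp" -name logFile.txt -print0 | xargs -0 grep -l 'cob7$')
echo "$hits"
```

Whether batching matters in practice depends on how expensive the command is; for grep the measurements above suggest the per-file overhead is tolerable.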
Define a new command, "pipeline find", which returns a list of files or directories. Its output can then be piped to xargs (see the third measurement above). For example, to search all log files we could use the command:
No Format |
---|
pipeline find obssim-ST-v7r6p1 obssim logFile | xargs -i grep cob0343 \{\}
|
Or, to delete all obsolete working directories, we could use:
No Format |
---|
pipeline find --obsolete obssim-ST-v7r6p1 obssim workingDir | xargs -i rm -rf \{\}
|
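Because deletion is hard to undo, it may be worth previewing the list before removing anything. The sketch below uses a stand-in shell function, pipeline_find_stub, in place of the proposed command (which does not exist yet), and assumes the command prints one working directory per line. All directory names are made up.

```shell
# Made-up working directories in a throwaway location.
tmp=$(mktemp -d)
mkdir -p "$tmp/old1" "$tmp/old2" "$tmp/keep"

# Hypothetical stand-in for "pipeline find --obsolete ... workingDir":
# prints one obsolete working directory per line.
pipeline_find_stub() {
    printf '%s\n' "$tmp/old1" "$tmp/old2"
}

# Dry run: print the commands that would be executed, without running them.
pipeline_find_stub | xargs -I {} echo rm -rf {}

# Real run, once the list looks right.
pipeline_find_stub | xargs -I {} rm -rf {}

remaining=$(ls "$tmp")
echo "remaining: $remaining"
```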
pipeline find <options> <task-name> <process-name> [<output> ,<output>...]
Argument | Description |
---|---|
<task-name> | The task on which to operate. Can include version and subtasks, e.g. parent(1.0)/child |
<process-name> | The process name. |
<output> | An item to output. Defaults to workingDir. See the supported output items below. |
--latest | Show only the "latest" items |
--all | Show all items (not only the latest) |
--obsolete | Show only obsolete items (all items minus the latest) |
--stream <run-range-list> | List of stream ranges (not yet implemented) |
--filter <filter-spec> | Filter the results (e.g. exitCode != 0). Filters can use any of the supported output items, including meta-data |
Supported output items
Item |
---|
workingDir |
exitCode |
stream |
createDate |
submitDate |
endDate |
cpuSecondsUsed |
host |
logFile |
jobId |
executionNumber |
isLatest |
streamPath |
or any meta-data item associated with the task.
No Format |
---|
pipeline find backgndSC-GR-v10r4 runMonteCarlo -s logFile exitCode stream evtsSim evtsOut --filter "evtsOut>200"
|
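When several output items are requested, the result can also be post-processed with ordinary Unix tools. The sketch below assumes one line per item, with the requested outputs as whitespace-separated columns in the order given; that output format is an assumption, not something defined above. The function sample_output and its rows are made-up stand-ins for real command output.

```shell
# Hypothetical output for the items: logFile exitCode stream evtsSim evtsOut
# (rows are invented for illustration).
sample_output() {
    cat <<'EOF'
/path/a/logFile.txt 0 1 1000 250
/path/b/logFile.txt 1 2 1000 0
/path/c/logFile.txt 0 3 1000 150
EOF
}

# Client-side equivalent of --filter "evtsOut>200": column 5 is evtsOut.
selected=$(sample_output | awk '$5 > 200 { print $1 }')
echo "$selected"

# Log files of failed jobs: column 2 is exitCode.
failed=$(sample_output | awk '$2 != 0 { print $1 }')
echo "$failed"
```

Doing the filtering in "pipeline find" itself avoids transferring rows that are then discarded, but the awk form is handy for ad hoc conditions.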