
This is how far I've gotten alone. Currently, L1Proc is unable to finish a stream. L1Proc was set up and run with the instructions from here. The reasons a job is unable to proceed seem to vary somewhat.

For example, jobs occasionally end with exit code 127 shortly after being submitted. When this occurs, the Execution CWD (you can examine it with bjobs -al [jobid]) isn't set correctly. I see this on jobs with pre-execute commands; I'm not sure whether I've seen it on other jobs. As a result, I believe this may occur in a few situations:

  • A pre-execute command fails for some reason.
    • In that case, it appears LSF loses the CWD, which then defaults to /tmp on the batch machine, so the job can't find the pipeline_wrapper script.
  • NFS/AFS can't be read correctly, which may be a reason the pre-execute command fails.
  • Bash may be set up differently on the clusters.
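
The wrong-CWD hypothesis can be checked outside of LSF. The following is a minimal sketch, assuming the job effectively runs `bash pipeline_wrapper` with a relative path; the `run_wrapper` helper and the temporary directories are hypothetical stand-ins, not part of L1Proc or the pipeline. It shows that bash itself exits with 127 when the script is missing from the directory it is started in.

```python
import subprocess
import tempfile
from pathlib import Path


def run_wrapper(cwd, script="pipeline_wrapper"):
    """Run `bash <script>` from the directory cwd; return (exit code, stderr)."""
    result = subprocess.run(
        ["bash", script], cwd=cwd, capture_output=True, text=True
    )
    return result.returncode, result.stderr.strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as good, tempfile.TemporaryDirectory() as wrong:
        # Stand-in for the real wrapper: a script that simply succeeds.
        (Path(good) / "pipeline_wrapper").write_text("exit 0\n")
        # Started in the directory that holds the script.
        print("expected CWD:", run_wrapper(good))
        # Started elsewhere, as if the CWD had fallen back to /tmp.
        print("wrong CWD:   ", run_wrapper(wrong))
```

If a job's Execution CWD turns out to be /tmp, this is the failure mode to expect, whatever caused the CWD to be lost in the first place.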

Take a look at this task for an example:
http://glast-ground.slac.stanford.edu/Pipeline-II/exp/Fermi/pi.jsp?pi=65850&dataSourceMode=Test
The first execution terminated with exit code 127. This is what logFile.txt looks like (note: since the job never starts, the pipeline never sets the actual logFile in the web interface, so you need to click on the Working Dir and then on logFile.txt to see it):

Code Block
PIPELINE_MODE not set.
Test mode: True
%DEBUG: 20130710:16:57:11 - GPLinit(init)/line-88 - Using GPLtools from user-specified $GPL2 /afs/slac/g/glast/ground/releases/volume13/L1Proc/4.3/GPLtools/python
drwxrwxr-x 4 monzani gl 2048 Feb 19 15:37 /afs/slac/g/glast/ground/releases/volume13/L1Proc/4.3/GPLtools
Using logging flavor DEVEL
Attempting to lock directory [/nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816] and to obtain a throttle lock at [Wed Jul 10 16:57:11 2013]
Trying to create /nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816/5294@kele0010@1373500631
Trying to link /nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816/r0376929816.lock
Trying to create /nfs/farm/g/glast/u41/L1Test/test/throttle/5294@kele0010@1373500631
Trying to link /nfs/farm/g/glast/u41/L1Test/test/throttle/0.lock
bash: pipeline_wrapper: No such file or directory

The second execution/retry (linked to above) doesn't get as far because the lock file is still left over from the first execution. When this happens, LSF puts the job back in pending and keeps trying to execute it, always failing because it never obtains the lock. I bkill'd that execution because it was stuck in pending forever.

After Renata updated the slactest cluster to have a glastdataq queue (although it's not preemptive) for L1Proc testing, I had a problem with a task stuck in LSF pending while it tried to acquire a lock file it could never get (I think a previous execution had it locked up). I removed the lock file, and now I'm back to being stuck on this execution of findChunks. The batch job is unable to acquire an Oracle connection. This is currently where we get stuck. I think it may be due to the pre-execute command, or possibly the test cluster setup, but I don't know.
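
Leftover locks like the one described above can be spotted before a retry goes back to pending. This is only a sketch under assumptions: the `find_stale_locks` helper is hypothetical, it treats any `*.lock` entry older than a chosen age as suspect, and it uses the file modification time as a proxy for when the lock was taken. It does not check whether the owning job is still running, so confirm with bjobs before removing anything.

```python
import time
from pathlib import Path


def find_stale_locks(directory, max_age_seconds, now=None):
    """Return (name, age in seconds) for each *.lock entry older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(directory).glob("*.lock")):
        # Modification time is used as a stand-in for the lock's creation time.
        age = now - path.stat().st_mtime
        if age > max_age_seconds:
            stale.append((path.name, age))
    return stale


if __name__ == "__main__":
    # Run directory taken from the log above; the one-hour threshold is arbitrary.
    run_dir = "/nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816"
    for name, age in find_stale_locks(run_dir, max_age_seconds=3600):
        print(f"{name}: {age / 3600:.1f} h old")
```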

Recommendations

  • Attempt to run other streams of L1Proc on TEST pipeline
  • Run other tasks on TEST pipeline
  • Run test tasks testing pre-execute commands on TEST pipeline server
