This is how far I've gotten on my own. Currently, L1Proc is unable to finish a stream. L1Proc was set up and run with the instructions from here
The reasons a job is unable to proceed seem to vary. For example, jobs occasionally exit with code 127 shortly after being submitted. When this occurs, the Execution CWD (which you can examine with bjobs -al [jobid]) isn't set correctly. I've seen this on jobs with pre-execute commands; I'm not sure whether I've seen it on other jobs. As a result, I believe this can occur in a few situations.
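One way to spot the bad Execution CWD quickly is to pull that field out of the long-form bjobs output. This is a hedged sketch: the sample dump below is illustrative (not real output for any of these jobs), and the exact field formatting may differ between LSF versions.

```shell
# Illustrative sample of a "bjobs -al <jobid>" dump; the real command would be
# run on the job control host and its output saved or piped instead.
cat > /tmp/bjobs_al_sample.txt <<'EOF'
Job <12345>, User <glast>, Status <EXIT>
Execution CWD </> ;
EOF

# Extract the Execution CWD field from the angle brackets.
cwd=$(sed -n 's/.*Execution CWD <\([^>]*\)>.*/\1/p' /tmp/bjobs_al_sample.txt)
echo "Execution CWD: $cwd"

# An Execution CWD of "/" (or empty) is the symptom described above:
# the working directory was never set, so the wrapper cannot be found.
if [ "$cwd" = "/" ] || [ -z "$cwd" ]; then
    echo "suspicious CWD - matches the exit-127 failure mode"
fi
```

In practice you would pipe `bjobs -al <jobid>` straight into the sed expression rather than saving a file first.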
Take a look at this task for an example:
http://glast-ground.slac.stanford.edu/Pipeline-II/exp/Fermi/pi.jsp?pi=65850&dataSourceMode=Test
The first execution terminated with exit code 127. This is what logFile.txt looks like (note: since the job never starts, the pipeline never sets the actual logFile in the web interface, so you need to click on the Working Dir, and then logFile.txt, to see it):
PIPELINE_MODE not set. Test mode: True
%DEBUG: 20130710:16:57:11 - GPLinit(init)/line-88 - Using GPLtools from user-specified $GPL2 /afs/slac/g/glast/ground/releases/volume13/L1Proc/4.3/GPLtools/python
drwxrwxr-x 4 monzani gl 2048 Feb 19 15:37 /afs/slac/g/glast/ground/releases/volume13/L1Proc/4.3/GPLtools
Using logging flavor DEVEL
Attempting to lock directory [/nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816] and to obtain a throttle lock at [Wed Jul 10 16:57:11 2013]
Trying to create /nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816/5294@kele0010.1373500631
Trying to link /nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816/r0376929816.lock
Trying to create /nfs/farm/g/glast/u41/L1Test/test/throttle/5294@kele0010.1373500631
Trying to link /nfs/farm/g/glast/u41/L1Test/test/throttle/0.lock
bash: pipeline_wrapper: No such file or directory
The second execution/retry (linked to above) doesn't get as far, because the lock file is still left over from the first execution. When this happens, LSF puts the job back into pending and it keeps trying to execute, always failing because it never obtains the lock. I bkill'd that execution because it was stuck in pending indefinitely.
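The "Trying to create ... / Trying to link ... .lock" lines in the log suggest a hard-link locking scheme: create a uniquely named file, then hard-link it to a shared .lock name, which only succeeds for one process. The sketch below demonstrates that mechanism and why a leftover .lock blocks every retry; the paths are temporary stand-ins, not the real run directories, and this is my reading of the log rather than confirmed GPLtools behavior.

```shell
# Simulate the hard-link lock in a throwaway directory.
dir=$(mktemp -d)

# First execution: create a unique file, then link it to the shared lock name.
touch "$dir/attempt1"
ln "$dir/attempt1" "$dir/run.lock" && echo "first execution got the lock"

# A retry cannot take the lock while the stale .lock link remains:
touch "$dir/attempt2"
ln "$dir/attempt2" "$dir/run.lock" 2>/dev/null || echo "retry blocked by stale lock"

# Clearing the leftover lock lets the retry proceed:
rm "$dir/run.lock"
ln "$dir/attempt2" "$dir/run.lock" && echo "retry got the lock after cleanup"

rm -rf "$dir"
```

This is why the second execution pends forever: the first execution died after linking the lock but before cleaning it up, so every retry's `ln` fails.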
Reminder: Any use of the datacatalog on the TEST pipeline will need to use datacatalogPROD if you want to read from the production datacatalog.
We are running the job control client on kele0001. All bsub/bjobs/bkill/b* and other LSF-related commands must be run from there in order to get the correct information.
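Since b* commands silently return wrong or missing information from other hosts, a small guard in your shell scripts can catch this mistake early. This is a hedged sketch of one convention, not an existing pipeline script: the host name kele0001 comes from the note above, and the ssh fallback is just a suggestion.

```shell
# Warn (or redirect via ssh) when LSF commands are about to run on the
# wrong host for the TEST pipeline's job control client.
required_host="kele0001"
current_host=$(hostname -s)

if [ "$current_host" = "$required_host" ]; then
    echo "On $required_host - safe to run bjobs/bkill/bsub directly"
else
    echo "Run LSF commands from $required_host (you are on $current_host)."
    echo "For example: ssh $required_host bjobs -al <jobid>"
fi
```

Sourcing something like this at the top of any helper script avoids chasing bogus `bjobs` output from the wrong machine.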
The Pipeline TEST environment currently uses kele0001 for the job control daemon. It is only used for tasks executed at the SLAC and SLACDATA job sites, under the glast and glastraw user accounts (not glastmc). So tasks with task types TEST/test/SKIM and DATA/Data will work appropriately.