This is how far I've gotten on my own. Currently, L1Proc is unable to finish a stream. L1Proc was set up and run with the instructions from here
The reasons a job is unable to proceed seem to vary. For example, jobs occasionally exit with code 127 shortly after being submitted. When this occurs, the Execution CWD (which you can examine with bjobs -al [jobid]) isn't set correctly. I've seen this on jobs with pre-execute commands; I'm not sure whether I've seen it on other jobs. As a result, I believe this can occur in a few situations.
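One way to spot the bad Execution CWD quickly is to pull that field out of the long-form bjobs output. This is a hedged sketch: the sample dump below is illustrative (not real output for any of these jobs), and the exact field formatting may differ between LSF versions.

```shell
# Illustrative sample of a "bjobs -al <jobid>" dump; the real command would be
# run on the job control host and its output saved or piped instead.
cat > /tmp/bjobs_al_sample.txt <<'EOF'
Job <12345>, User <glast>, Status <EXIT>
Execution CWD </> ;
EOF

# Extract the Execution CWD field from the angle brackets.
cwd=$(sed -n 's/.*Execution CWD <\([^>]*\)>.*/\1/p' /tmp/bjobs_al_sample.txt)
echo "Execution CWD: $cwd"

# An Execution CWD of "/" (or empty) is the symptom described above:
# the working directory was never set, so the wrapper cannot be found.
if [ "$cwd" = "/" ] || [ -z "$cwd" ]; then
    echo "suspicious CWD - matches the exit-127 failure mode"
fi
```

In practice you would pipe `bjobs -al <jobid>` straight into the sed expression rather than saving a file first.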
Take a look at this task for an example:
http://glast-ground.slac.stanford.edu/Pipeline-II/exp/Fermi/pi.jsp?pi=65850&dataSourceMode=Test
The first execution terminated with exit code 127. This is what logFile.txt looks like (note: since the job never starts, the pipeline never sets the actual logFile in the web interface, so you need to click on the Working Dir, and then logFile.txt, to see it):
PIPELINE_MODE not set. Test mode: True
%DEBUG: 20130710:16:57:11 - GPLinit(init)/line-88 - Using GPLtools from user-specified $GPL2 /afs/slac/g/glast/ground/releases/volume13/L1Proc/4.3/GPLtools/python
drwxrwxr-x 4 monzani gl 2048 Feb 19 15:37 /afs/slac/g/glast/ground/releases/volume13/L1Proc/4.3/GPLtools
Using logging flavor DEVEL
Attempting to lock directory [/nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816] and to obtain a throttle lock at [Wed Jul 10 16:57:11 2013]
Trying to create /nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816/5294@kele0010.1373500631
Trying to link /nfs/farm/g/glast/u41/L1Test/test/runs/376/r0376929816/r0376929816.lock
Trying to create /nfs/farm/g/glast/u41/L1Test/test/throttle/5294@kele0010.1373500631
Trying to link /nfs/farm/g/glast/u41/L1Test/test/throttle/0.lock
bash: pipeline_wrapper: No such file or directory
The second execution/retry (linked to above) doesn't get as far, because the lock file is still left over from the first execution. When this happens, LSF puts the job back into pending and it keeps trying to execute, always failing because it never obtains the lock. I bkill'd that execution because it was stuck in pending indefinitely.
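The "Trying to create ... / Trying to link ... .lock" lines in the log suggest a hard-link locking scheme: create a uniquely named file, then hard-link it to a shared .lock name, which only succeeds for one process. The sketch below demonstrates that mechanism and why a leftover .lock blocks every retry; the paths are temporary stand-ins, not the real run directories, and this is my reading of the log rather than confirmed GPLtools behavior.

```shell
# Simulate the hard-link lock in a throwaway directory.
dir=$(mktemp -d)

# First execution: create a unique file, then link it to the shared lock name.
touch "$dir/attempt1"
ln "$dir/attempt1" "$dir/run.lock" && echo "first execution got the lock"

# A retry cannot take the lock while the stale .lock link remains:
touch "$dir/attempt2"
ln "$dir/attempt2" "$dir/run.lock" 2>/dev/null || echo "retry blocked by stale lock"

# Clearing the leftover lock lets the retry proceed:
rm "$dir/run.lock"
ln "$dir/attempt2" "$dir/run.lock" && echo "retry got the lock after cleanup"

rm -rf "$dir"
```

This is why the second execution pends forever: the first execution died after linking the lock but before cleaning it up, so every retry's `ln` fails.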
Reminder: Any use of the datacatalog on the TEST pipeline will need to use datacatalogPROD if you want to read from the production datacatalog.
We are running the job control client on kele0001. All bsub/bjobs/bkill/b* and other LSF-related commands must be run from there in order to get the correct information.
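Since b* commands silently return wrong or missing information from other hosts, a small guard in your shell scripts can catch this mistake early. This is a hedged sketch of one convention, not an existing pipeline script: the host name kele0001 comes from the note above, and the ssh fallback is just a suggestion.

```shell
# Warn (or redirect via ssh) when LSF commands are about to run on the
# wrong host for the TEST pipeline's job control client.
required_host="kele0001"
current_host=$(hostname -s)

if [ "$current_host" = "$required_host" ]; then
    echo "On $required_host - safe to run bjobs/bkill/bsub directly"
else
    echo "Run LSF commands from $required_host (you are on $current_host)."
    echo "For example: ssh $required_host bjobs -al <jobid>"
fi
```

Sourcing something like this at the top of any helper script avoids chasing bogus `bjobs` output from the wrong machine.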
The Pipeline TEST environment currently uses kele0001 for the job control daemon. It is only used for tasks executed at the SLAC and SLACDATA job sites, under the glast and glastraw user accounts (not glastmc). So tasks with task types TEST/test/SKIM and DATA/Data will work appropriately.