...

org.srs.jobcontrol.JobControlException: Remote Exception performing operation
	at org.srs.jobcontrol.JobControlClient.checkException(JobControlClient.java:219)
	at org.srs.jobcontrol.JobControlClient.submit(JobControlClient.java:77)
	at org.srs.pipeline.server.process.batch.BatchProcessExecutor.run(BatchProcessExecutor.java:202)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:152)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
	at java.io.DataInputStream.readByte(DataInputStream.java:265)
	at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:214)
	at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
	at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
	at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
	at com.sun.proxy.$Proxy7.submit(Unknown Source)
	at org.srs.jobcontrol.JobControlClient.submit(JobControlClient.java:74)
	... 4 more

Solution

...

Seems to be an NFS error. I can't write to that directory as myself or as the glast account. Need to check on the proper account/permissions.

Handling Stuck Jobs

A stuck job usually shows up in a bjobs listing as having consumed no CPU or memory, sitting in the UNKWN state, and having been around for a long time (how long depends on the job; we need more data).
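A hedged way to spot such jobs (a sketch only; it assumes the default bjobs column layout, where STAT is the third column):

bjobs -u all -w | awk '$3 == "UNKWN"'    # list all jobs in wide format and keep only those in the UNKWN state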

...

Failure to execute java script

...

Sometimes a script gets terminated and viewing its messages shows the error "failure when executing java..." (need to get the exact text the next time it happens). This is typically an indication of a bad variable in the environment/database that the process is running in. The bad value is typically set by the upstream process having run in some strange manner and mangled the variable. In every case I've seen so far, the upstream process ran twice simultaneously on two different hosts, so the two processes were overwriting each other. This is fairly easy to identify, as the output log contains all the execution output (except possibly the environment output) twice and there are two different LSF summary blocks at the end of the log.
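One quick, hedged check for the duplicate run (the path placeholder and the exact wording of the LSF summary header are assumptions): count the LSF summary blocks in the output log; a normal run has exactly one.

grep -c 'Resource usage summary' <path to>/logFile.txt    # a count of 2 or more suggests the process ran twice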

...

The checkRun process failed and I thought at first it was a case of bad data, as the process ran simultaneously on two different hosts. However, a simple rollback did not solve the problem, as there was a lock left over from running the mergeCalChunks upstream process. That lock was left because of an infrastructure glitch preventing proper access to the xroot data disks. Rerunning that process removed the lock file, but checkRun still didn't work, complaining about the existence of a dontCleanUp file. That file was actually left over from the bad mergeCalChunks run. Before deleting it, we need to check that the files listed there actually exist. This command:

for ff in $(egrep '^root' <path to>/dontCleanUp) ; do xrd.pl --where stat $ff ; done

will poll the xroot server for information on the files listed in the dontCleanUp file.

...

The particular event that triggered this scenario was the two deliveries of run 543796948. The first delivery (180315009) had several recon jobs that ended up on a dead host and had to be killed. While they were hung, the rest of the events in the run arrived in delivery 180315010, and the findChunks task was in a pending state waiting for its pre-exec command (/afs/slac/g/glast/isoc/flightOps/rhel6_gcc44/ISOC_PROD/bin/isoc run /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/5.6/lockFile.py) to give it the go-ahead. After the hung recon tasks were killed, the reaper properly restarted them, and the run completed and cleaned up after itself. However, the findChunks task never started up. LSF normally retries the pre-exec command regularly (every 2.5 minutes in this case - how do we check?) but had not run it in over 2 hours (again, how do we check? Is there a maximum number of retries?).
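To dig into why such a job is still pending (a hedged sketch; <jobid> is the LSF job ID from bjobs, and how much pre-exec detail is shown may depend on the LSF version):

bjobs -l <jobid>    # full job detail, including PENDING REASONS and the submission line with its pre-exec command
bhist -l <jobid>    # chronological event history for the job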

When this happens:

  1. First try stopping and restarting the job (bstop and bresume). This will often kickstart things and get them going again (see the command sketch after this list).

...

  1. If that doesn't work, just bkill the process and let the reaper resubmit it (which is what we had to do in this case).
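A minimal sketch of the commands involved, assuming <jobid> is the LSF job ID reported by bjobs:

bstop <jobid>      # suspend the stuck job
bresume <jobid>    # resume it; this often gets it going again
bkill <jobid>      # if not, kill it and let the reaper resubmit it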

Handling a host with lots of hung jobs

...

Basically, several jobs were hung; when they were killed, they still didn't work, and rolling things back made it worse. In the end, we needed to copy the files back from xroot to NFS and restart the process. Some questions, comments, and answers from the thread (some of these still need answers):

  1. How do we know that? Is it just because the logs complain that they are not there? How do we check for their existence?

...

  1. Where do they live to copy them from?

...

  1. I'm assuming they need to get copied to the directory that it can't find them in (root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/542/r0542185591/e00000000000002729588/event/ in this case). Or do they go somewhere else?

...

  1. How do we perform the copy?

...

  1. See #10 below.

...

  1. I'm guessing we'll need to move/rename the existing chunkList file so a new one can be created at this point? Is this correct? (BTW the notes say we should have a section on chunkList files that no one has written yet)

...

  1. Where do we roll back to in order to get everything going again? Just a general rollback on the command line? Or is there a specific task that can be rolled back to kick everything off properly again?

...

  1. The .evt files referenced by the chunkList file (in /nfs/farm/g/glast/u28/stage/180308007/r0542185591/) don't exist.
    They've been moved to xroot. It really would be better if they were copied; I [Warren] don't remember why I get discouraged every time I try to make that change.

...

  1. The directory in xroot (root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/542/r0542185591/e00000000000002729588/event/) only contains a single evt file: r0542185591_e00000000000002729588_v246043240_event.evt.  The failed jobs are complaining about other files which are missing.
    Each was moved to its own directory.

...

  1. In the /nfs/farm/g/glast/u41/L1/deliveries/1803 directory, there is no entry for this delivery (180308007). The entries for 180308006 and 180308008 are there, though. Found it. It's in the /nfs/farm/g/glast/u28/stage directory. It looks like that gets moved over to deliveries once it's done.

...

  1. I still can't seem to find the original evt files.
    Each has a directory corresponding to the chunk number. The second-to-last component in the path you give above.

...

  1. I use a command like this to generate a script to move them back to NFS:

...

  1. awk '/^\/afs.*xrdcp/ {print  $1, $2, $3, $5, $4}' /nfs/farm/g/glast/u41/L1/logs/PROD/L1Proc/5.5/doRun/findChunks/160xxxxxx/515xxx/011/485xxxxxx/001xxx/694/archive/198092024/logFile.txt

...

Here's an example of the output for the issue in https://www-glast.stanford.edu/protected/mail/opsprob/15272.html

Example:
% awk '/^\/afs.*xrdcp/{print  $1, $2, $3, $5, $4}' /nfs/farm/g/glast/u41/L1/logs/PROD/L1Proc/5.9/doRun/findChunks/220xxxxxx/805xxx/013/681xxxxxx/413xxx/156/archive/342431233/logFile.txt | tee cp.sh
/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrdcp -np -f root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/681/r0681413156/e00000000000009129671/event/r0681413156_e00000000000009129671_v342431233_event.evt /nfs/farm/g/glast/u28/stage/220805013/r0681413156/r0681413156-e00000000000009129671.evt
/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrdcp -np -f root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/681/r0681413156/e00000000000008510966/event/r0681413156_e00000000000008510966_v342431233_event.evt /nfs/farm/g/glast/u28/stage/220805013/r0681413156/r0681413156-e00000000000008510966.evt
% cat cp.sh
/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrdcp -np -f root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/681/r0681413156/e00000000000009129671/event/r0681413156_e00000000000009129671_v342431233_event.evt /nfs/farm/g/glast/u28/stage/220805013/r0681413156/r0681413156-e00000000000009129671.evt
/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrdcp -np -f root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/681/r0681413156/e00000000000008510966/event/r0681413156_e00000000000008510966_v342431233_event.evt /nfs/farm/g/glast/u28/stage/220805013/r0681413156/r0681413156-e00000000000008510966.evt

From Michael:

The log file is of the findChunks instance that copied the evt files but didn't finish for whatever reason.

It's prudent to check that the script looks reasonable. I execute it as myself; it never was necessary to become glastraw. After that, remove the run lock and roll back findChunks.

...

Throttling the Pipeline

If you ever need to limit the amount of work being done on the pipeline (like we wanted to with the LAT restart in April 2018), you can manually create throttle locks to limit the number of simultaneous runs that can be worked on at a time.  Right now the pipeline is set to allow up to 6 runs to be worked on at once.  If you want to limit that, simply create lock files in the /nfs/farm/g/glast/u41/L1/throttle directory of the form 0.lock, 1.lock, ... up to 5.lock.  The contents can be anything you want.  It is just the presence of the file that stops things from running.  Each lock file created will reduce the number of simultaneous runs by one.  Creating all six will stop the pipeline from processing anything.
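For example (a minimal sketch; the directory and lock file names are those given above, and the loop is just one way to create them):

cd /nfs/farm/g/glast/u41/L1/throttle
touch 0.lock 1.lock                                 # reduce the limit from 6 simultaneous runs to 4
for i in 0 1 2 3 4 5 ; do touch $i.lock ; done      # create all six locks to stop processing entirely
rm 0.lock 1.lock                                    # remove lock files to restore capacity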

...

$> cd /nfs/farm/g/glast/u28 
$> mv RePipe/$dl/r0$run/*.evt stage/$dl/r0$run
$> cd /nfs/farm/g/glast/u41/L1/runs/546/r0546962321/
$> /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/5.6/newChunkList.py /nfs/farm/g/glast/u28/stage/$dl/r0$run
$> mv r0546962321_180502012_chunkList.txt.new r0546962321_180502012_chunkList.txt

...