
Can't open lockfile

This message was received:

...

org.srs.jobcontrol.JobControlException: Remote Exception performing operation
	at org.srs.jobcontrol.JobControlClient.checkException(JobControlClient.java:219)
	at org.srs.jobcontrol.JobControlClient.submit(JobControlClient.java:77)
	at org.srs.pipeline.server.process.batch.BatchProcessExecutor.run(BatchProcessExecutor.java:202)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:152)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
	at java.io.DataInputStream.readByte(DataInputStream.java:265)
	at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:214)
	at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
	at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
	at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
	at com.sun.proxy.$Proxy7.submit(Unknown Source)
	at org.srs.jobcontrol.JobControlClient.submit(JobControlClient.java:74)
	... 4 more

 


Solution:

Seems to be an NFS error.  I can't write to that directory as myself or as the glast account.  Need to check on the proper account/permissions.
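
A quick way to confirm the permission problem, as a sketch (the <lockfile-dir> placeholder stands in for the directory named in the error; try it both as yourself and as the glast account):

ls -ld <lockfile-dir>                                               # check the owner, group, and permissions
touch <lockfile-dir>/.write_test && rm <lockfile-dir>/.write_test   # verify the account can actually write there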


Handling Stuck Jobs

A stuck job usually shows up in a bjobs listing as having consumed no CPU or memory, sitting in an UNKWN state, and having been around for a long time (how long depends on the job; we need more data).
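
A quick way to spot them, as a sketch (assuming the jobs run under the glastraw production account used elsewhere on this page):

bjobs -u glastraw | grep UNKWN    # list unfinished jobs that LSF has lost track of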

...

You can't roll back a running job. You have to bkill the jobs and wait for the reaper to notice that they're dead. The pipeline will automatically retry them once for L1 (0 is the default; you can ask for up to 3). If the jobs you killed were already retries, then you'll have to roll them back. I've never seen a job in an unknown state recover by itself.
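
Killing a stuck job is just a bkill on its LSF job ID (a sketch; <jobid> is a placeholder):

bkill <jobid>
# if LSF itself has lost contact with the execution host, a forced removal may be needed:
# bkill -r <jobid>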

 


Checking Host Status

Use bhosts to check the host status (although you might want to grep for the host name, as the list is long).
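
For example (a sketch; <hostname> is a placeholder for the batch host in question):

bhosts | grep <hostname>
# or ask for that one host directly:
bhosts <hostname>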

 


Running a command directly on a Host

...

You can run a command immediately on any batch host with lsrun. Obviously this should be used sparingly. uptime is a good, cheap check that will show you whether the host is responsive and what the load is. You can also check whether a particular filesystem is causing trouble, or whether some command works.
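
For example (a sketch; <hostname> is a placeholder, and the NFS path is just the L1 area mentioned elsewhere on this page):

lsrun -m <hostname> uptime                      # is the host responsive, and what is the load?
lsrun -m <hostname> ls /nfs/farm/g/glast/u41    # does a suspect filesystem respond from that host?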

 


Failure to execute java script

...

The solution is to simply roll back the upstream process so it gets a clean execution. 


Failure in checkRun

checkRun failed, and I thought at first it was a case of bad data, as the process ran simultaneously on two different hosts.  However, a simple rollback did not solve the problem, because there was a lock left over from running the mergeCalChunks upstream process.  This lock was left because of an infrastructure glitch preventing proper access to the xroot data disks.  Rerunning that process removed the lock file, but checkRun still didn't work, complaining about the existence of a dontCleanUp file.  This file was actually left over from the bad mergeCalChunks run.  Before deleting it we need to check that the files listed there actually exist.  This command:

...

Note: the xrd.pl in the above command is found at /afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrd.pl 


Job fails to start (stays in PEND status)

...

2) If that doesn't work, just bkill the process and let the reaper resubmit it (which is what we had to do in this case).
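
A sketch of that step (<jobid> is a placeholder; bjobs -p, which lists pending jobs with their reasons, can be run first if you want to see why LSF is holding it):

bjobs -p <jobid>    # optional: check the pending reason
bkill <jobid>       # give up and let the reaper resubmit it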


Handling a host with lots of hung jobs

...

This requires sudo permission, which you probably don't have yet; they'll also want to limit it to some machine we own, and we should find a box that's not my workstation to use for that.

 


Fill it up with dummy jobs.

# in bash, as glastraw
for ii in $(seq 8) ; do bsub -q glastdataq -sp 90 -m <hostname> sleep 86400 ; done
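
When the host has been dealt with, clear the dummy jobs out again (a sketch using the same placeholders as above):

bjobs -u glastraw -q glastdataq -m <hostname>    # find the IDs of the sleep jobs
bkill <jobid> ...                                # then kill them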

 


So there is probably only one realistic option at the moment.

...


Also you should probably notify unix-admin. 


Files missing from NFS file server

...

Among the jobs that were launched after 2:30 and failed was findChunks, which had moved some of the evt files. The automatic retry moved the rest. Then some of the digi and fastMon jobs couldn't find their inputs. This was a little trickier than the usual "move them back" scenario, because their locations in xroot had to be gathered from two log files. So I did that and rolled back findChunks, and it all looks OK to me now.

...


Throttling the Pipeline

If you ever need to limit the amount of work being done on the pipeline (like we wanted to with the LAT restart in April 2018), you can manually create throttle locks to limit the number of simultaneous runs that can be worked on at a time.  Right now the pipeline is set to allow up to 6 runs to be worked on at once.  If you want to limit that, simply create lock files in the /nfs/farm/g/glast/u41/L1/throttle directory of the form 0.lock, 1.lock, ... up to 5.lock.  The contents can be anything you want.  It is just the presence of the file that stops things from running.  Each lock file created will reduce the number of simultaneous runs by one.  Creating all six will stop the pipeline from processing anything.
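
For example, to cut the limit from 6 simultaneous runs down to 4 (a sketch; the contents of the lock files don't matter):

cd /nfs/farm/g/glast/u41/L1/throttle
touch 0.lock 1.lock    # each lock file present reduces the limit by one
# remove them again later to restore full throughput:
# rm 0.lock 1.lock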

 


Changing the Reaper Settings

...

The other parameter is set on a per-submission-source basis.  This is the ReaperDelayMinutes parameter, found in each of the later sections on that page.  It controls how long a process has to be dead before the reaper kills it.  It is typically set to 60 or 120 minutes.  NOTE (from Warren): I'm not convinced that ReaperDelayMinutes actually does anything.  Restarting the pipeline set both of them back to the defaults.

 


HalfPipe marked as Failed but L1 started successfully

...

4) remove the lock file and the symbolic link.

 


Files missing from NFS and xrootd servers

...

You'll have to be glastraw for the last 2 steps.

But then I messed up. I rolled back findChunks 180502012.546962321 and it failed, complaining about overlapping data. The right thing is to roll back the whole doRun stream while defining deliveriesToIgnore=180502013 (How?). Which I've now done.

And now there are more errors, which I'll have to investigate later, but probably involve the magic7 file. 


Multiple Submissions of the same Job

...

In 95% of the cases, all that needs to be done is a simple rollback of the affected process.  In the other 5%, there is some other underlying problem that is also affecting the completion of the job, and you'll need to search through the log files for the error.


HalfPipe progress bar field has red question mark

...

  1. There really is no data.  Looking at the list of files in the delivery (click on the progress bar for FastCopy on that delivery) shows that there are no *.LSCI.* files in the delivery.  Nothing needs to be done.
  2. It's duplicate data.  An example of this is delivery 18729012.  In this case the same files (but with a .02 version number) were delivered that had come in a previous delivery (in this case delivery 18729011).  The system determined that there were no new events, so the half-pipe wasn't started.  Nothing needs to be done.
  3. The half-pipe failed to start.  For some reason, the FastCopy processing completed successfully but the half-pipe failed to launch.  When this happens and the delivery contains science data, follow the instructions on the HTF FC Incoming page for "Dispatching Runs" (item 2).  This will launch the half-pipe for that run.

...


Unknown LSE_Keys error

The symptom of this is that you get an error of the form:

...

To date, I've only ever seen this in the mergeEvt task.  It is symptomatic of a problem upstream in the doChunk streams (running the makeEvt task).  In every instance, I've found that one or more of those makeEvt tasks had a multiple submission.  Rolling back all the doChunk streams that had multiple submissions fixes the problem.

 

 



Set up a rollback when jobs still running

...

I haven't actually used this yet, so I'll update it (or remove this line) when I can verify that it works.


cleanupCompleteRun doesn't start after last run processes

...