
Can't open lockfile

This message was received:

Event type: L1Proc.unLockRun
Event timestamp: 2017-10-30 21:57:42.474089
Spacecraft ID: -1
Target: None
Trigger name: L1-processing
Cluster size: 1
Message text: Can't open lockfile cd /r0531060188.lock.
 

Status in data monitoring pages:

The process was listed as terminated and has this java error:

org.srs.jobcontrol.JobControlException: Remote Exception performing operation
	at org.srs.jobcontrol.JobControlClient.checkException(JobControlClient.java:219)
	at org.srs.jobcontrol.JobControlClient.submit(JobControlClient.java:77)
	at org.srs.pipeline.server.process.batch.BatchProcessExecutor.run(BatchProcessExecutor.java:202)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:152)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
	at java.io.DataInputStream.readByte(DataInputStream.java:265)
	at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:214)
	at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
	at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
	at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
	at com.sun.proxy.$Proxy7.submit(Unknown Source)
	at org.srs.jobcontrol.JobControlClient.submit(JobControlClient.java:74)
	... 4 more

 

Solution:

Seems to be an NFS error.  I can't write to that directory as myself or as the glast account.  Need to check on the proper account/permissions.
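A quick way to confirm the permissions problem (a sketch; <lock dir> stands for the directory from the message text above) is to check the directory mode and try a test write, both as yourself and as the glast account:

ls -ld <lock dir>                                            # who owns the directory, and what are its permissions?
touch <lock dir>/.write_test && rm <lock dir>/.write_test    # succeeds only if the directory is writable

If the touch fails for both accounts, the NFS export or the directory ownership needs fixing, not the pipeline.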

 

Handling Stuck Jobs

A stuck job usually shows up in a bjobs listing as having consumed no CPU or memory, sitting in an UNKWN state, and having been around for a long time (how long depends on the job; more data needed).

The procedure is to log on as glastraw and bkill them.  As Warren explained:

You can't roll back a running job. You have to bkill them and wait for the reaper to notice that they're dead. The pipeline will automatically retry them once for L1 (0 is the default, you can ask for up to 3). If the jobs you killed were retries, then you'll have to roll them back. I've never seen a job in unknown state recover by itself.
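A minimal sketch of that procedure (assuming the stuck jobs run under the glastraw account):

bjobs -u glastraw | grep UNKWN          # list jobs stuck in the UNKWN state
bkill <jobid>                           # kill one stuck job by its job ID
# or kill every UNKWN job in one pass (use with care):
for jid in $(bjobs -u glastraw | awk '$3 == "UNKWN" {print $1}') ; do bkill $jid ; done

After the reaper marks them dead, remember to roll back any of the killed jobs that were already retries.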

 

Checking Host Status

Use bhosts to check the host status (although you might want to grep for the host name, as the list is long).
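For example:

bhosts | grep <hostname>                # pick one status line out of the long listing
bhosts <hostname>                       # or query that host directly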

 

Running a command directly on a Host

From Warren:

You can just immediately run a command on any batch host with lsrun. Obviously this should be used sparingly. uptime is a good cheap thing that will show you if the host is responsive and what the load is. Or check to see if a particular filesystem is causing trouble, or if some command works.
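For example (the mount point is just an illustration; substitute whatever filesystem is suspect):

lsrun -m <hostname> uptime              # is the host responsive, and what is its load?
lsrun -m <hostname> ls <mount point>    # does a particular filesystem respond from that host?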

 

Failure to execute java script

(Need to update this title to exact phrase.  The system doesn't keep old messages and I forgot to write it down)

When a script gets terminated and the messages view includes an error like "failure when executing java...(Need to get exact text next time it happens)", this is typically an indication of a bad variable in the environment/database that the process is running in.  The bad value is typically set by the upstream process having run in some strange manner and mangled the variable.  In every case I've seen so far, the upstream process ran twice simultaneously on two different hosts, so the two executions were overwriting each other.  This is fairly easy to identify: the output log contains all the execution output (except possibly the environment output) twice, and there are two different LSF summary blocks at the end of the log.

The solution is to simply roll back the upstream process so it gets a clean execution.
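One quick way to spot the double run (a sketch, assuming each LSF execution appends its own "Resource usage summary" block to the output log):

grep -c 'Resource usage summary' <path to>/logfile    # a count of 2 means the job ran twice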

 

Failure in checkRun

checkRun failed, and at first I thought it was a case of bad data, since the process had run simultaneously on two different hosts.  However, a simple rollback did not solve the problem, because a lock had been left behind by the upstream mergeCalChunks process.  That lock was left because an infrastructure glitch prevented proper access to the xroot data disks.  Rerunning that process removed the lock file, but checkRun still didn't work, complaining about the existence of a dontCleanUp file.  That file was actually left over from the bad mergeCalChunks run.  Before deleting it, we need to check that the files listed there actually exist.  This command:

for ff in $(egrep '^root' <path to>/dontCleanUp) ; do xrd.pl --where stat $ff ; done

will poll the xroot server for information on the files listed in the dontCleanUp file.

If they are all there, we are good to go and can remove the dontCleanUp file and roll back the checkRun process.  If not, we need to look further upstream in the data processing chain.
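A variant of the loop above that prints only the missing files (assuming xrd.pl exits nonzero when the stat fails) saves scanning the output by eye:

for ff in $(egrep '^root' <path to>/dontCleanUp) ; do
    xrd.pl --where stat $ff > /dev/null 2>&1 || echo "MISSING: $ff"
done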

 
