...
org.srs.jobcontrol.JobControlException: Remote Exception performing operation
    at org.srs.jobcontrol.JobControlClient.checkException(JobControlClient.java:219)
    at org.srs.jobcontrol.JobControlClient.submit(JobControlClient.java:77)
    at org.srs.pipeline.server.process.batch.BatchProcessExecutor.run(BatchProcessExecutor.java:202)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    at java.io.DataInputStream.readByte(DataInputStream.java:265)
    at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:214)
    at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
    at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
    at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
    at com.sun.proxy.$Proxy7.submit(Unknown Source)
    at org.srs.jobcontrol.JobControlClient.submit(JobControlClient.java:74)
    ... 4 more
Solution:
Seems to be an NFS error. I can't write to that directory as myself or as the glast account. Need to check on the proper account/permissions.
A stuck job usually shows up in the bjobs list as having consumed no CPU or memory, being in an UNKWN state, and having been sitting around for a long time (how long depends on the job; more data is needed).
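A quick way to spot candidates is to list the jobs and filter on state. A minimal sketch, assuming the jobs run under the glast user (adjust the user and state check as appropriate):

# List jobs for the glast user in wide format and keep those in UNKWN state.
# In the default bjobs output the third column is the job STAT field.
bjobs -u glast -w | awk 'NR == 1 || $3 == "UNKWN"'
# Then inspect a suspect job in detail (CPU time, memory, submit/start times).
bjobs -l <jobid>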
...
(Need to update this title to exact phrase. The system doesn't keep old messages and I forgot to write it down)
When a script gets terminated and viewing its messages shows the error "failure when executing java...(Need to get exact text next time it happens)", this typically indicates a bad variable in the environment/database that the process is running in. This bad value is typically set by the upstream process having run in some strange manner and mangled the variable. In every case I've seen so far, the upstream process ran twice simultaneously on two different hosts, so the two processes were overwriting each other. This is fairly easy to identify: the output log contains all the execution output (except possibly the environment output) twice, and there are two different LSF summary blocks at the end of the log.
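A quick heuristic for confirming the doubled execution, as a minimal sketch (the log path is illustrative): count the LSF job-report blocks in the output log; a count of two strongly suggests the process ran twice.

# Each LSF execution normally appends one job report to the output log,
# so two "Resource usage summary" blocks point at the doubled-execution case.
grep -c 'Resource usage summary' /path/to/logFile.txt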
...
checkRun failed, and I thought at first it was a case of bad data because the process ran simultaneously on two different hosts. However, a simple rollback did not solve the problem, as there was a lock left over from the mergeCalChunks upstream process. This lock was left because of an infrastructure glitch that prevented proper access to the xroot data disks. Rerunning that process removed the lock file, but checkRun still didn't work, complaining about the existence of a dontCleanUp file. This file was actually left over from the bad mergeCalChunks run. Before deleting it we need to check that the files listed there actually exist. This command:
for ff in $(egrep '^root' <path to>/dontCleanUp) ; do xrd.pl --where stat $ff ; done
will poll the xroot server for information on the files listed in the dontCleanUp file.
...
The particular event that triggered this scenario was the two deliveries of run 543796948. The first delivery (180315009) had several recon jobs that ended up on a dead host and had to be killed. While they were hung, the rest of the events in the run arrived in delivery 180315010, and the findChunks task was in a pending state waiting for its pre-exec command (/afs/slac/g/glast/isoc/flightOps/rhel6_gcc44/ISOC_PROD/bin/isoc run /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/5.6/lockFile.py) to give it the go-ahead. After the hung recon tasks were killed, the reaper properly restarted them, and the run completed and properly cleaned up after itself. However, the findChunks task never started up. LSF normally retries the pre-exec command regularly (every 2.5 minutes in this case - how do we check?) but had not run it in over 2 hours (again, how do we check? Is there a max retries?).
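To answer the "how do we check?" questions above, the LSF side can at least be inspected. A minimal sketch, assuming you know the pending job's ID (the job ID is illustrative):

# Show the pending reasons for the job, including pre-exec related ones,
# plus the full job detail (pre-exec command, retry settings if any).
bjobs -p -l <jobid>
# Show the timestamped history of the job (dispatch attempts, state changes),
# which gives a rough idea of when LSF last tried to run it.
bhist -l <jobid>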
When this happens:
1)
...
...
You may need to do something with files on XrootD. See the Intro to using xrootd.
This one came up in the incident referenced by this thread in the opsprob mailing list: https://www-glast.stanford.edu/protected/mail/opsprob/11338.html
Basically, several jobs were hung; when they were killed they still didn't work, and rolling things back made it worse. In the end, we needed to copy the files back from xroot to NFS and restart the process. Some questions, comments, and answers from the thread (some of these still need answers):
1)
...
...
...
...
...
...
...
...
...
...
...
...
...
Here's an example of the output for the issue in https://www-glast.stanford.edu/protected/mail/opsprob/15272.html
Code Block
% awk '/^\/afs.*xrdcp/{print $1, $2, $3, $5, $4}' /nfs/farm/g/glast/u41/L1/logs/PROD/L1Proc/5.9/doRun/findChunks/220xxxxxx/805xxx/013/681xxxxxx/413xxx/156/archive/342431233/logFile.txt | tee cp.sh
/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrdcp -np -f root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/681/r0681413156/e00000000000009129671/event/r0681413156_e00000000000009129671_v342431233_event.evt /nfs/farm/g/glast/u28/stage/220805013/r0681413156/r0681413156-e00000000000009129671.evt
/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrdcp -np -f root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/681/r0681413156/e00000000000008510966/event/r0681413156_e00000000000008510966_v342431233_event.evt /nfs/farm/g/glast/u28/stage/220805013/r0681413156/r0681413156-e00000000000008510966.evt
% cat cp.sh
/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrdcp -np -f root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/681/r0681413156/e00000000000009129671/event/r0681413156_e00000000000009129671_v342431233_event.evt /nfs/farm/g/glast/u28/stage/220805013/r0681413156/r0681413156-e00000000000009129671.evt
/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD/bin/xrdcp -np -f root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/681/r0681413156/e00000000000008510966/event/r0681413156_e00000000000008510966_v342431233_event.evt /nfs/farm/g/glast/u28/stage/220805013/r0681413156/r0681413156-e00000000000008510966.evt
From Michael:
The log file is from the findChunks instance that copied the evt files but didn't finish for whatever reason.
It's prudent to check that the script looks reasonable. I execute it as myself; it has never been necessary to become glastraw. After that, remove the run lock and roll back findChunks.
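A minimal sketch of those steps for the example above (the lock-file path is an assumption based on the usual r0<run>.lock naming noted elsewhere on this page; verify it before deleting anything):

# Run the copy script assembled from the log to restore the evt files to NFS.
sh cp.sh
# Remove the run lock (path assumed from the standard lock-file pattern).
rm /nfs/farm/g/glast/u41/L1/runs/681/r0681413156/r0681413156.lock
# Then roll back the findChunks stream from the pipeline front end as usual.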
...
If you ever need to limit the amount of work being done on the pipeline (like we wanted to with the LAT restart in April 2018), you can manually create throttle locks to limit the number of simultaneous runs that can be worked on at a time. Right now the pipeline is set to allow up to 6 runs to be worked on at once. If you want to limit that, simply create lock files in the /nfs/farm/g/glast/u41/L1/throttle directory of the form 0.lock, 1.lock, ... up to 5.lock. The contents can be anything you want. It is just the presence of the file that stops things from running. Each lock file created will reduce the number of simultaneous runs by one. Creating all six will stop the pipeline from processing anything.
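For example, based on the description above, to reduce the concurrency from six runs to three (the file contents are irrelevant; only their presence matters):

# Create three throttle locks; each one reduces the number of simultaneous runs by one (6 -> 3 here).
cd /nfs/farm/g/glast/u41/L1/throttle
touch 0.lock 1.lock 2.lock
# Remove the lock files again when normal throughput should resume.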
...
This came up with delivery 180409011. In this case the launchL1 task was started simultaneously on two different hosts in the batch queue. The first one ran successfully, but since the second one failed with an error saying the L1 stream already existed, the process was marked as failed. To clean up the data display, do the following:
1)
...
...
...
...
$> cd /nfs/farm/g/glast/u28
$> mv RePipe/$dl/r0$run/*.evt stage/$dl/r0$run
$> cd /nfs/farm/g/glast/u41/L1/runs/546/r0546962321/
$> /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/5.6/newChunkList.py /nfs/farm/g/glast/u28/stage/$dl/r0$run
$> mv r0546962321_180502012_chunkList.txt.new r0546962321_180502012_chunkList.txt
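For reference, the shell variables used above would be set to something like the following, inferred from the literal values in the other commands (note that the prose above mentions delivery 180409011, so double-check which delivery number applies to your case):

# Assumed values, inferred from the chunkList filename and run directory used above.
dl=180502012     # delivery number
run=546962321    # run number (directories use the r0-prefixed form, e.g. r0546962321)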
...
It's overkill to roll back from checkChunk, and maybe even harmful, but I guess the chunk data are still on xrootd. This should be verified.
This has happened a few times. For instance,
https://www-glast.stanford.edu/protected/mail/opsprob/15247.html
https://www-glast.stanford.edu/protected/mail/opsprob/14337.html
https://www-glast.stanford.edu/protected/mail/opsprob/11489.html
A typical rollback probably won't work. It will likely need to be rolled back from an earlier process like scanDigi.
This MySQL server hosts the GlastCalibDB database where the L1 pipeline looks up the calibration file to use. If it's down or needs to be taken down (e.g., to install an update), Brian left instructions to halt job submission for L1Proc so that other tasks can run as normal.
Stopping a single task:
This should suspend job submission and leave processes in the Waiting state.
To clear it (and all other limits that were set for the jobsite):
When this happens, it will leave a lock file behind (e.g., /nfs/farm/g/glast/u41/L1/runs/697/r0697981861/r0697981861.lock) that you will need to delete before rolling the job back. Otherwise, the rollback will stay queued but not run. If you do the rollback before removing the lock file you should bkill the job, wait for the reaper to terminate it, then remove the lock file, and do the rollback again.
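A minimal sketch of that recovery order (the job ID is illustrative and the lock-file path matches the example above):

# Kill the rollback job that is stuck behind the stale lock.
bkill <jobid>
# Wait for the reaper to terminate it, then remove the stale run lock.
rm /nfs/farm/g/glast/u41/L1/runs/697/r0697981861/r0697981861.lock
# Finally, submit the rollback again from the pipeline front end.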
This will cause an error in the logs like:
MootSvc FATAL Hw key 8517 in data does not match hw key belonging to Moot Config #3063
This has happened in 2023, 2017, and 2013 due to a problem with an update to the hot strip mask on the satellite; e.g., the change was reverted but the pipeline was expecting the MOOT key for the new configuration. Following what was first done in 2013, the fix is to update the databases to change the MOOT key for the failed runs back to the old value and roll back the failures. Other steps mentioned in the original thread don't seem necessary.
1) Change the moot_key in glastops_acqsummary in the Oracle database.
2) Change the config_fk (mootkey) in MOOD's Acq_summary in the MOOD/FMX MySQL database.
To access the database, you need to log in to fermilnx02 (in 2023) and issue the command mysql -h glastMOOD -P8208 -u <username> -p. You will need to get a password from the secret/fpd/fermi/isoc vault. Richard may need to put in a ticket for you to get permission. In 2023, we used Steve Tether's account (tether) since the "magician" account is overpowered for what we need (it's basically the super-user account).
use mood;
3) Some recipes (e.g., in older emails) say that a repipe is necessary. Michael doesn't recall repipes for the 2017 key mismatch - at least no helpful ones. In 2023 no repipes were performed for sure. Don't repipe unless really necessary!
4) Rollback the affected deliveries
If they are in a Failed state, you can just do a standard rollback.
If findChunks is stuck, then bkill the process and wait for the reaper to get it. Most of them should then auto rollback without intervention.
History
For instance, this error in ft2Runs stream 240415006.734877891:
Code Block
> terminate called after throwing an instance of 'std::runtime_error'
>   what(): FATAL: the provided Magic 7 file does not cover the requested time interval. To cover the requested interval we would need to extrapolate position and attitude (forward) more than what permitted by the current configuration (see the parameter 'extrapolationLimit').
Michael fixed this by:
1) doRun.ft2Runs (case of 240415006.734877891) reads from the runs area:
stageIn for: /nfs/farm/g/glast/u41/L1/runs/734/r0734877891/r0734877891_v000_magic7L1.txt
The magic7 file in the staging area, which is also being copied into the
runs area, is complete. makeM7L1 reads from the staging area.
Thus, I just rolled back makeM7L1, which found all packets and created a
valid magic7L1 file in the runs area, to be read by ft2Runs.
2) doRun.doChunk.fakeFT2 (case of 240414007.734792595.6674757) stages from
the staging area:
stageIn for: /nfs/farm/g/glast/u28/stage/240414007/magic7_240414007.txt
This file was incomplete! I replaced it with the 240414008 magic7 file
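A minimal sketch of that replacement, assuming the 240414008 magic7 file follows the same naming pattern in its own staging directory (both paths should be verified before copying; the source path is an assumption, not taken from the original note):

# Keep the incomplete file around, then drop in the complete magic7 file
# from the neighbouring delivery under the name the pipeline stages in.
cd /nfs/farm/g/glast/u28/stage/240414007
mv magic7_240414007.txt magic7_240414007.txt.incomplete
cp ../240414008/magic7_240414008.txt magic7_240414007.txt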