...
...
Any time one of these messages is generated, cleanup for the run is disabled by a file called dontCleanUp in the run directory on u52/L1. All cleanup jobs, and checkRun, will fail if that file is present. If everything is OK, the file can be removed and the jobs rolled back. Remember to clean up the crumbs first and then the entire run; otherwise you'll get an error that's hard to fix. The cleanupCrumbs jobs shouldn't fail if they're rolled back after cleanupChunks has run (they used to, but don't anymore).
Only one delivery can process a run at a time. This is enforced by a lock file in the run directory on u52/L1. If findChunks fails, or there are permanent failures in the run while another part of the run is waiting, the lock file has to be removed by hand. It should never be removed unless the only failures in the run are findChunks or permanent ones, or there's a deadlock. Even then you have to wear a helmet and sign a waiver.
...
Go to the mission timeline viewer (glast-ground -> Mission Planning Web View -> Timeline), get the start times for the physics runs, then plug them into xTime. Make sure to uncheck "Apply Clock Offset Correction(s) for RXTE and Swift" at the bottom of the page.
By default the timeline doesn't go very far into the past; you may need to change that by clicking on "selections" in the top right corner of the page.
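As a rough cross-check on what xTime gives you: the digits in a run number (e.g. r0263753970) appear to be the run start in Mission Elapsed Time, i.e. seconds since the 2001-01-01 00:00:00 UTC epoch. Ignoring leap seconds (which xTime does handle), GNU date can do a quick-and-dirty conversion:

```shell
# Rough MET-to-UTC conversion, ignoring leap seconds (xTime handles those).
MET=263753970   # e.g. from run r0263753970
date -u -d "2001-01-01 00:00:00 UTC + ${MET} seconds" '+%Y-%m-%d %H:%M:%S UTC'
# → 2009-05-11 16:59:30 UTC (off from xTime by the accumulated leap seconds)
```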
They have names like /nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock
At the moment it doesn't matter what's in them; a half-sentence explaining why you made the lock is good. An empty file, or a rant about how much it sucks that you have to do this, works too. When you're ready to let the run go, just remove the file and the run should start up in 5-10 minutes.
...
There are three places to get the information...
When the servers are idle, idle threads should be 122. The SLAC IT people consider it a warning if the count goes below 110 and an error at 100. For us, things seem to work as long as it stays above 60 or so (although afs112 has been known to go as low as 40, since PipelineStaging6 is for the HalfPipe); I usually start thinking about taking action if it stays below 60 for more than a few minutes. This is likely to happen when there are more than ~300 chunk jobs running, usually right after recon finishes and the chunk-level jobs downstream of recon start up. I've written a script that will suspend jobs that are using a specified server, wait a bit, and then resume them with a little delay between resumes. Suspending jobs makes the server recover sooner, but some of the suspended jobs are likely to get timeout errors and fail because of the delay, so it may be best to wait until idle threads hits 0 before suspending.
...