...

  • Look at the dontCleanUp file. It should contain a list of all merge processes that missed files, and which files were missing. It is possible that it will get garbled by multiple jobs writing to it at once, so if it doesn't seem to make sense, you can still get the information by following the steps below.
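
For example, from a machine that can see u52 (the run number here is just a placeholder for whichever run you're chasing):

  • cat /nfs/farm/g/glast/u52/L1/r0263753970/dontCleanUp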

...

Any time one of these messages is generated, cleanup for the run is disabled by a file called dontCleanUp in the run directory on u52/L1. All cleanup jobs, and checkRun, will fail if that file is present. If everything is OK, that file can be removed and the jobs rolled back (remember to clean up the crumbs first and then the entire run... otherwise you'll get an error that's hard to fix). The cleanupCrumbs jobs shouldn't fail if they're rolled back after cleanupChunks has run (they used to, but don't anymore).
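
Once you've confirmed nothing is really missing, clearing the flag is just removing the file (substitute the real run number):

  • rm /nfs/farm/g/glast/u52/L1/r0263753970/dontCleanUp

Then roll back the cleanup jobs in the order described above (crumbs first, then the run).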

The run lock

Only one delivery can process a run at a time. This is enforced by a lock file in the run directory on u52/L1. If findChunks fails, or there are permanent failures in the run and another part of the run is waiting, the lock file has to be removed by hand. It should never be removed unless the only failures in the run are findChunks or permanent ones, or there's a deadlock. Even then you have to wear a helmet and sign a waiver.
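
Removal is just deleting the file, e.g. (hypothetical run number, following the naming pattern shown under "Making run locks" below):

  • rm /nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock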

...

Go to the mission timeline viewer (glast-ground -> Mission Planning Web View -> Timeline), get the start times for the physics runs, then plug them into xTime. Make sure to uncheck "Apply Clock Offset Correction(s) for RXTE and Swift" at the bottom of the page.
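
If you just want a rough sanity check on xTime's numbers (assuming GNU date on a Linux box, and ignoring the leap-second and clock-offset corrections that xTime knows about), mission elapsed time is just seconds counted from 2001-01-01 00:00:00 UTC:

  • date -u -d "2001-01-01 00:00:00 UTC + 263753970 seconds"     (MET -> UTC; the MET value is just an example)
  • expr `date -u -d "2017-05-11 00:00:00 UTC" +%s` - 978307200     (UTC -> MET; 978307200 is the Unix time of the MET epoch, example date)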

By default the timeline doesn't go very far into the past; you may need to change that by clicking on "selections" in the top right corner of the page.

Making run locks:

They have names like /nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock
At the moment it doesn't matter what's in them; a half-sentence explaining why you made the lock is good. An empty file, or a rant about how much it sucks that you have to do this, works also. When you're ready to let the run go, just remove the file and the run should start up in 5-10 minutes.
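
A minimal sketch of making and then releasing a lock (the run number and the comment are just examples):

  • echo "holding r0263753970 until the disk problem is sorted" > /nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock
  • rm /nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock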

...

There are 3 places to get the information:

  • new nagios (02): click "Hostgroup Overview" on the left side, then the host you want in the "AFS fileservers" table. This shows idle threads for that server on the "AFSfs" row.
  • old nagios (01): click "AFS FS Graphs" in the upper left. This shows most or all SLAC AFS servers. Not as reliable as nagios02, and it's going away eventually.
  • Warren's "old" 1-hour page
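
If you'd rather check from a command line, rxdebug against the fileserver should report the same number (this assumes the OpenAFS client tools are installed and the fileserver answers on the standard port 7000; the short host name is a guess, use whatever nagios calls it):

  • rxdebug afs112 7000 | grep idle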

What to do with the info

When a server is unloaded, its idle thread count should be 122. The SLAC IT people consider it a warning if it goes below 110 and an error at 100. For us, things usually seem to work as long as it stays above 60 or so (although afs112 has been known to go as low as 40, since PipelineStaging6 is for the HalfPipe), so I usually start thinking about taking action if it stays below 60 for more than a few minutes. That's likely to happen if there are more than ~300 chunk jobs running, usually after recon finishes and the chunk-level jobs downstream of recon start up. I've written a script that will suspend jobs that are using a specified server, wait a bit, and then resume them with a little delay between each (usage below). Suspending jobs will make the server recover sooner, but some of the jobs are likely to get timeout errors and fail due to the delay of being suspended, so it may be best to wait until idle threads hits 0 before suspending.

  • From a noric, as glastraw:
  • setenv L1ProcROOT /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/1.70 (any L1 version >= 1.70 should work)
  • $L1ProcROOT/tools/grepJobs.py PipelineStaging2 1200 60
  • first arg is the volume to be suspended
  • second arg is the time in seconds to wait between stopping and resuming
  • third arg is the time in seconds to wait between resuming jobs
  • If you want to do afs106, you have to say "PipelineStaging/"; if you say "PipelineStaging" it will suspend everything.
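
For example, to drain afs106 with the same timings (note the trailing slash, per the warning above):

  • $L1ProcROOT/tools/grepJobs.py PipelineStaging/ 1200 60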

...