...

Bad merges: If a process that is merging crumb-level files into chunks, or chunks into runs, can't find all of its input files, it won't fail. See the "dontCleanUp" section below. Processes downstream of such a merge may fail because they use two different types of input files (e.g., digi and recon) whose events no longer line up, since some events are missing from one file but not the other. Then you have to roll back the merge even though it "succeeded" the first time.
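For illustration only, here is a minimal completeness check along those lines. It assumes the expected number of input files is known from the pipeline's own bookkeeping; the directory layout, file pattern, and script name are hypothetical, not part of the real pipeline:

  #!/usr/bin/env python
  # Hypothetical sanity check: before trusting a crumb->chunk (or chunk->run)
  # merge, confirm that every expected input file is actually present.
  # The directory and pattern arguments are made up for illustration; the real
  # pipeline keeps its own record of what each merge should consume.
  import glob
  import sys

  def merge_inputs_complete(input_dir, pattern, expected_count):
      """Return True if the number of matching input files is what the merge expects."""
      found = glob.glob(input_dir + "/" + pattern)
      if len(found) != expected_count:
          print("expected %d inputs, found %d -- roll back this merge"
                % (expected_count, len(found)))
          return False
      return True

  if __name__ == "__main__":
      # e.g. merge_inputs_complete("/some/run/digi", "*_digi_*.root", 25)
      ok = merge_inputs_complete(sys.argv[1], sys.argv[2], int(sys.argv[3]))
      sys.exit(0 if ok else 1)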

Infrastructure failures

...

  • The nagios page: click "AFS FS Graphs" in the upper left. This shows most or all SLAC AFS servers.
  • Warren's custom disk monitoring pages show just the servers we're actually interested in, and show the mapping between server names and volume names.

When the servers are idle, the idle-thread count should be 122. The SLAC IT people consider it a warning if it drops below 110 and an error at 100. I usually start thinking about taking action if it stays below 60 for more than a few minutes. This is likely to occur when more than ~300 chunk jobs are running, usually just after recon finishes and the chunk-level jobs downstream of recon start up. I've written a script that will suspend the jobs using a specified server, wait a bit, and then resume them with a little delay between each (a sketch of the pattern follows the usage notes below):

  • From a noric, as glastraw:
  • setenv L1ProcROOT /nfs/slac/g/svac/focke/L1/dev/1.70 (can use the prod location once we put 1.70 in prod)
  • $L1ProcROOT/tools/grepJobs.py PipelineStaging2 300 30
  • first arg is the volume whose jobs should be suspended
  • second arg is the time to wait between stopping and resuming
  • third arg is the time to wait between resuming successive jobs
  • If you want to do afs106, you have to say "PipelineStaging/"; if you say "PipelineStaging" it will suspend everything.
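For reference, a rough sketch of the suspend/wait/resume pattern the script implements, assuming the jobs are LSF jobs (bstop/bresume); the part that figures out which jobs are touching the named volume is site-specific and stubbed out here. This is not the actual grepJobs.py:

  #!/usr/bin/env python
  # Sketch only -- not the real grepJobs.py.  Assumes LSF jobs (bstop/bresume)
  # and stubs out the site-specific lookup of which jobs are hitting the
  # named AFS volume.
  import subprocess
  import sys
  import time

  def jobs_using_volume(volume):
      """Placeholder: return the LSF job IDs whose files live on 'volume'."""
      raise NotImplementedError("site-specific job lookup goes here")

  def suspend_wait_resume(volume, pause_seconds, per_job_delay):
      jobs = jobs_using_volume(volume)
      for job in jobs:
          subprocess.call(["bstop", job])      # suspend every job on the volume
      time.sleep(pause_seconds)                # give the AFS server time to recover
      for job in jobs:
          subprocess.call(["bresume", job])    # resume one job at a time,
          time.sleep(per_job_delay)            # so the load ramps back up gradually

  if __name__ == "__main__":
      suspend_wait_resume(sys.argv[1], float(sys.argv[2]), float(sys.argv[3]))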

...