...

When the AFS servers where we keep temporary files hiccup, it's usually because they have run low on idle threads. You can monitor the idle-thread count and intervene to stave off disaster. It can be viewed with Ganglia or Nagios. Nagios only works inside SLAC's firewall, but is somewhat more reliable. Ganglia works from anywhere and offers finer time resolution, but it sometimes stops updating and just shows a flat line at some old value.

Ganglia

Nagios

  • Using ssh as a proxy: You can use ssh as a SOCKS proxy to relay requests through a SLAC computer.

...

  • Nagios, then click "AFS FS Graphs" in the upper left. This shows most or all SLAC AFS servers.
  • Warren's custom disk monitoring pages show just the servers we're actually interested in, along with the mapping between server and volume names. "old" 1-hour page

What to do with the info

When the servers are idle, the idle-thread count should be 122. The SLAC IT people consider it a warning if it goes below 110 and an error at 100. I usually start thinking about taking action if it stays below 60 for more than a few minutes. This is likely to occur if more than ~300 chunk jobs are running, usually after recon finishes and the chunk-level jobs downstream of recon start up. I've written a script that will suspend jobs that are using a specified server, wait a bit, and then resume them with a little delay between:

...
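
As a rough illustration of the thresholds described above (this is not the suspend/resume script itself), here is a minimal Python sketch that classifies an idle-thread reading and decides whether the count has stayed low long enough to consider acting. The function names, the "act" label, and the choice of 300 seconds for "a few minutes" are assumptions made for the example, and the samples are assumed to be (timestamp, idle threads) pairs that you have already read off Ganglia or Nagios by some means.

```python
# Thresholds taken from the text above.
IDLE_BASELINE = 122   # idle threads when the server is idle
WARN_BELOW = 110      # SLAC IT: warning if the count goes below this
ERROR_AT = 100        # SLAC IT: error at this value (or lower)
ACT_BELOW = 60        # consider taking action below this
ACT_AFTER_SECONDS = 300  # assumption: "a few minutes" taken as 5 minutes


def classify(idle_threads):
    """Map one idle-thread reading to a label (labels are hypothetical)."""
    if idle_threads < ACT_BELOW:
        return "act"
    if idle_threads <= ERROR_AT:
        return "error"
    if idle_threads < WARN_BELOW:
        return "warning"
    return "ok"


def should_intervene(samples, act_below=ACT_BELOW, hold=ACT_AFTER_SECONDS):
    """Return True if the count has stayed below act_below for more than
    hold seconds, up to and including the most recent sample.

    samples: iterable of (timestamp_seconds, idle_threads) in time order.
    """
    low_since = None  # timestamp at which the current low streak began
    latest = None     # timestamp of the most recent sample
    for timestamp, idle in samples:
        latest = timestamp
        if idle < act_below:
            if low_since is None:
                low_since = timestamp
        else:
            low_since = None  # streak broken by a healthy reading
    return low_since is not None and (latest - low_since) > hold


if __name__ == "__main__":
    readings = [(0, 50), (200, 55), (400, 40)]
    print(classify(readings[-1][1]), should_intervene(readings))
```

A brief recovery above 60 resets the timer, so a single dip does not trigger anything; only a sustained low streak does.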