...

The Ganglia pages display strip charts of CPU usage and I/O rates as a function of time.  Here is a historical image of the first few plots from the Ganglia page for wain025, the Fermi user/group space disk.  The third plot shows CPU time.  User jobs should avoid driving "System CPU" above about 50%.  Heavier loads begin to cause file-access delays, then timeouts, and eventually the machine will hang or crash.  Now take a look at wain025's performance at this very moment.  In addition to the top five plots (as shown in the historical image), you can scroll down to see activity on each user and group partition on the server.  Near the very bottom, keep an eye out for the "nfs_server_badcalls" plot; experience has shown that any entries at all in this plot are a very bad sign.

What to be alert for...
  • CPU utilization > 50%  (especially "System CPU")
  • NFS disk I/O > 30 MB/s
  • AFS disk I/O > 5-10 MB/s
  • xroot disk I/O >> 200 MB/s (wains only)
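
As a rough illustration, the thresholds above could be checked against a set of readings taken from the Ganglia plots.  This is only a sketch: the metric names (other than "nfs_server_badcalls"), the `check_load` function, and the sample values are hypothetical and not part of any Ganglia interface, and the AFS limit uses the lower end of the 5-10 MB/s range.

```python
# Thresholds copied from the list above. The dictionary keys are
# hypothetical labels, not real Ganglia metric names.
THRESHOLDS = {
    "cpu_percent": 50.0,      # CPU utilization, especially "System CPU"
    "nfs_mb_per_s": 30.0,     # NFS disk I/O
    "afs_mb_per_s": 5.0,      # AFS disk I/O, lower end of the 5-10 MB/s range
    "xroot_mb_per_s": 200.0,  # xroot disk I/O (wains only)
}


def check_load(sample):
    """Return a sorted list of the metrics in `sample` that look worrying."""
    alerts = [
        name
        for name, limit in THRESHOLDS.items()
        if sample.get(name, 0.0) > limit
    ]
    # Any entries at all in nfs_server_badcalls are treated as a warning.
    if sample.get("nfs_server_badcalls", 0) > 0:
        alerts.append("nfs_server_badcalls")
    return sorted(alerts)


# Made-up readings, for illustration only.
sample = {
    "cpu_percent": 62.0,
    "nfs_mb_per_s": 12.0,
    "afs_mb_per_s": 7.5,
    "nfs_server_badcalls": 0,
}
print(check_load(sample))  # ['afs_mb_per_s', 'cpu_percent']
```

Note that the xroot entry in the list reads ">> 200 MB/s", so the strict comparison used here is more conservative than the guideline itself.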

...