AMI is extensively instrumented with Prometheus metrics, which are pushed to Grafana so that users can monitor AMI's performance. In addition, user graphs are stored in Prometheus and can be pulled and executed offline to investigate performance issues.

Prometheus Metrics






  • Heartbeat Time: the time spent on the multiple events in a heartbeat interval, from receiving them from psana to sending to the local collector
  • Latency: the wall-clock time minus the heartbeat time (we think that of the first event in a heartbeat, but we are not certain), measured at successive stages along the path: localCollector, then globalCollector, then manager (the last one is for display to clients). We think latencies are never measured with respect to event times. We also think the latencies may not be shown in Grafana in a "synchronous" fashion: a single sample may combine results from different heartbeats
  • Partial Event Rate: measures the rate at which events are incomplete, where we think (but are not certain) that "complete" means that all parts of the graph have the data they need to execute
  • Idle Time: the time we spend waiting for psana to give us an event
  • Datagram Size: the size of the packet that is sent to the localCollector, globalCollector, and manager via ZMQ
  • Datagram Time: the average time it takes to process a single event in the heartbeat (only measured by workers)
  • Graph Version: increments when pushing a new version of the graph to AMI
  • Pruned Heartbeats: occurs when a heartbeat is missing a contribution (e.g. from a worker that received no events during that heartbeat interval). We do NOT time out heartbeats with a timer; a heartbeat is only marked partially complete when, for example, a later heartbeat fragment shows up from the missing worker.
  • Datagrams per second: the rate of datagrams seen by each worker.
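
The latency and pruned-heartbeat descriptions above can be illustrated with a small Python sketch. This is a toy model, not AMI's actual collector code: the names HeartbeatCollector, add, and latency are hypothetical, and it assumes each worker's contributions arrive in heartbeat order. It shows pruning without a timer: an older heartbeat is only given up on when a later contribution arrives from a worker it is still missing.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatCollector:
    """Toy collector (hypothetical): one contribution per worker per heartbeat."""
    num_workers: int
    pending: dict = field(default_factory=dict)    # heartbeat -> set of worker ids
    completed: list = field(default_factory=list)  # heartbeats with all contributions
    pruned: list = field(default_factory=list)     # heartbeats given up on

    def add(self, heartbeat: int, worker: int) -> None:
        """Record a contribution; prune older heartbeats this worker skipped."""
        # No timer is involved. An older heartbeat is pruned only because a
        # later fragment has now arrived from a worker it was still missing.
        for old in sorted(hb for hb in self.pending if hb < heartbeat):
            if worker not in self.pending[old]:
                del self.pending[old]
                self.pruned.append(old)
        self.pending.setdefault(heartbeat, set()).add(worker)
        if len(self.pending[heartbeat]) == self.num_workers:
            del self.pending[heartbeat]
            self.completed.append(heartbeat)


def latency(wall_clock_s: float, heartbeat_time_s: float) -> float:
    """Latency as described above: wall-clock time minus the heartbeat time."""
    return wall_clock_s - heartbeat_time_s


collector = HeartbeatCollector(num_workers=2)
collector.add(heartbeat=0, worker=0)  # worker 1 saw no events in heartbeat 0
collector.add(heartbeat=1, worker=0)  # heartbeat 0 stays pending, no timeout
collector.add(heartbeat=1, worker=1)  # worker 1's later fragment prunes heartbeat 0
print(collector.pruned, collector.completed)
```

In this model a heartbeat that is missing a contribution would stay pending indefinitely if the missing worker never sent anything again, which matches the "no timer" behavior described in the list.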