AMI is extensively instrumented with Prometheus metrics, which are displayed in Grafana so that users can monitor AMI's performance. In addition, user graphs are stored in Prometheus and can be pulled and executed offline to investigate performance issues.
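
The metrics can also be read directly from Prometheus rather than through Grafana. The following is a minimal sketch that only builds the URL for Prometheus' instant-query HTTP endpoint (`/api/v1/query`). The server address and the metric name `ami_idle_time` used in the usage note are hypothetical placeholders, not names confirmed by AMI; substitute the values used by your deployment.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, time=None):
    """Build a URL for Prometheus' instant-query endpoint (/api/v1/query).

    base_url: address of the Prometheus server (hypothetical in this sketch).
    promql:   a PromQL expression, e.g. a metric name.
    time:     optional evaluation timestamp (Unix seconds).
    """
    params = {"query": promql}
    if time is not None:
        params["time"] = time
    # urlencode percent-escapes characters such as braces and quotes in PromQL.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"
```

For example, `instant_query_url("http://localhost:9090", "ami_idle_time")` returns a URL that can be fetched with any HTTP client; the response is JSON.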

Prometheus Metrics

Heartbeat Time: times the multiple events in a heartbeat interval, from their reception from psana until they are sent to the local collector
Latency: measures the wall-clock time minus the heartbeat time (we think that of the first event in a heartbeat, but we are not certain) at successive stages in the path: localCollector, then globalCollector, then manager (the last one is for display to clients). We think latencies are never measured with respect to event times. We also think the latencies may not be shown in Grafana in a "synchronous" fashion: a single sample may show results from different heartbeats
Partial Event Rate: measures the rate at which events are incomplete, where we think (but are not certain) that "complete" means that all parts of the graph have the data they need to execute
Idle Time: the time we spend waiting for psana to give us an event
Datagram Size: the size of the packet that is sent via ZMQ to the localCollector, globalCollector, and manager
Datagram Time: the average time it takes to process a single event in the heartbeat (only measured by workers)
Graph Version: increments each time a new version of the graph is pushed to AMI
Pruned Heartbeats: counts heartbeats that are missing a contribution (e.g. from a worker that received no events during that heartbeat interval). Heartbeats are NOT timed out with a timer; a heartbeat is only marked partially completed when, for example, a later heartbeat fragment shows up from the missing worker.
Datagrams per second: the rate of datagrams seen by each worker.
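
The timer-free pruning described under Pruned Heartbeats can be illustrated with a toy model. This is a sketch under stated assumptions, not AMI's actual collector code: the class `HeartbeatCollector`, its attributes, and the rule "prune an older heartbeat when a worker it is still waiting for reports a later heartbeat" are hypothetical names and a simplified reading of the description above.

```python
from collections import defaultdict


class HeartbeatCollector:
    """Toy collector (hypothetical): tracks which workers contributed to each heartbeat."""

    def __init__(self, workers):
        self.workers = set(workers)
        self.pending = defaultdict(set)  # heartbeat number -> workers seen so far
        self.completed = []              # heartbeats that received every contribution
        self.pruned = []                 # heartbeats closed out while still incomplete

    def add(self, worker, heartbeat):
        """Record one worker's contribution to a heartbeat."""
        self.pending[heartbeat].add(worker)
        # No timer is involved: an older heartbeat is pruned only when a worker
        # it is still waiting for shows up with a later heartbeat.
        for old in sorted(hb for hb in self.pending if hb < heartbeat):
            if worker not in self.pending[old]:
                del self.pending[old]
                self.pruned.append(old)
        # A heartbeat is complete once every worker has contributed to it.
        if self.pending[heartbeat] == self.workers:
            del self.pending[heartbeat]
            self.completed.append(heartbeat)
```

With two workers `a` and `b`, feeding the contributions `(a, 1)`, `(b, 1)`, `(a, 2)`, `(a, 3)`, `(b, 3)` completes heartbeats 1 and 3 and prunes heartbeat 2, because `b` skipped heartbeat 2 and then reported heartbeat 3. If `b` never reported again, heartbeat 2 would stay pending indefinitely, which matches the absence of a timeout.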
