AMI is instrumented extensively with Prometheus metrics, which are pushed to Grafana so that users can monitor AMI's performance. In addition, user graphs are stored in Prometheus and can be pulled and executed offline to investigate performance issues.

Prometheus Metrics

Name                   | Type    | Labels                 | Description
ami_event_count        | Counter | hutch, type, process   | Counts occurrences of different types of events
ami_event_time_secs    | Gauge   | hutch, type, process   | Measures elapsed time of different types of events
ami_event_size_bytes   | Gauge   | hutch, process         | Measures size of data sent over ZMQ sockets
ami_event_latency_secs | Gauge   | hutch, sender, process | Measures time it takes to send data over ZMQ
ami_graph              | Info    | hutch, name            | JSON string of AMI client graphs
ami_graph_version      | Gauge   | hutch, name            | Version number of the graph
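
These series can also be queried programmatically with PromQL via the prometheus_api_client package used in the retrieval example at the end of this page. The short sketch below is illustrative only: the Prometheus URL is taken from that example, the metric names are taken from the table above, and the 5-minute rate window is an arbitrary choice.

from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://psmetric03:9090", disable_ssl=True)

# Current graph version for each hutch/name (instant query on the gauge above)
versions = prom.custom_query(query="ami_graph_version")

# Per-process event rate over the last 5 minutes, one series per
# hutch/type/process combination of the ami_event_count counter
rates = prom.custom_query(query="rate(ami_event_count[5m])")

for series in rates:
    labels = series["metric"]           # label set (hutch, type, process, ...)
    timestamp, value = series["value"]  # instant query returns [timestamp, value]
    print(labels, value)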

Grafana Monitoring

  • Datagram Time: the average time it takes to process a single event in the heartbeat (only measured by workers)
  • Datagrams per second: the rate of datagrams seen by each worker.

  • Heartbeat Time: the time spent processing the events in a heartbeat interval, from their reception from psana to the sending of results to the collectors
  • Heartbeats per second: rate of heartbeats processed by individual processes

  • Idle Time: the time we spend waiting for psana to give us an event
  • Partial Event Rate: the rate at which events are incomplete; we think (but are not certain) that "complete" means all parts of the graph have the data they need to execute

  • Datagram Size: the size of the packet sent to the localCollector, globalCollector, and manager via ZMQ
  • Pruned Heartbeats: occurs when a heartbeat is missing a contribution (e.g. from a worker that received no events during that heartbeat interval). Heartbeats are NOT timed out with a timer; a heartbeat is only marked partially complete when, for example, a later heartbeat fragment shows up from the missing worker.

  • Latency: measures wall-clock time minus the heartbeat time (we think this is the time of the first event in the heartbeat, but we are not certain) at various stages in the path: localCollector to globalCollector to manager (the last one is for display to clients). We think latencies are never measured with respect to event times. We also think the latencies may not be shown in Grafana in a "synchronous" fashion: for one sample we may be looking at results from different heartbeats. See the sketch after this list for pulling these latency samples offline.
  • Heartbeat: current heartbeat of each process

  • Graph Version: increments when pushing a new version of the graph to AMI
  • Transitions: number of psana transitions seen by each process
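
As a rough sketch of how one of these quantities can be examined outside Grafana, the example below pulls the ami_event_latency_secs gauge for the last hour and prints the worst latency per sender/process pair. It assumes the same Prometheus server and hutch label value ('local') as the graph-retrieval example below; adjust both for your deployment.

from prometheus_api_client import PrometheusConnect
import datetime as dt

prom = PrometheusConnect(url="http://psmetric03:9090", disable_ssl=True)

# Pull one hour of latency samples; the hutch value 'local' is an
# assumption borrowed from the retrieval example below.
data = prom.get_metric_range_data("ami_event_latency_secs",
                                  label_config={'hutch': 'local'},
                                  start_time=dt.datetime.now() - dt.timedelta(hours=1),
                                  end_time=dt.datetime.now())

# Each series carries its label set in 'metric' and [timestamp, value]
# pairs in 'values'; report the maximum latency per sender/process.
for series in data:
    labels = series['metric']
    worst = max(float(value) for _, value in series['values'])
    print(labels.get('sender'), labels.get('process'), worst)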

Retrieving Client Graphs from Prometheus

from prometheus_api_client import PrometheusConnect
import datetime as dt
import pandas as pd

# Connect to the Prometheus server holding the AMI metrics
prom = PrometheusConnect(url="http://psmetric03:9090", disable_ssl=True)
print(prom.all_metrics())  # list the metric names available on the server

# Pull the ami_graph_info series for the last 8 hours
label = {'hutch': 'local', 'name': 'graph'}
data = prom.get_metric_range_data("ami_graph_info",
                                  label_config=label,
                                  start_time=(dt.datetime.now() - dt.timedelta(hours=8)),
                                  end_time=dt.datetime.now())

# The graph and its version are carried as labels in each result's 'metric' dict
metrics = [item.get('metric') for item in data]
df = pd.DataFrame(metrics)

# Select the desired graph version and write its serialized graph to a file
row = df[df['version'] == '3']
with open('dump.fc', 'w') as f:
    f.writelines(row.graph)
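
The version filter above assumes you already know which graph revision you want. A small follow-up to the script, listing the versions captured in the queried window before choosing one to dump:

# Show which graph versions (per name) are present in the pulled data
print(df[['name', 'version']].drop_duplicates())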


