AMI is instrumented extensively with Prometheus metrics, which are pushed to Grafana so that users can monitor AMI's performance. In addition, user graphs are stored in Prometheus and can be pulled and executed offline to investigate performance issues.

Prometheus Metrics

Name                   | Type    | Labels                 | Description
ami_event_count        | Counter | hutch, type, process   | Counts occurrences of different types of events
ami_event_time_secs    | Gauge   | hutch, type, process   | Measures elapsed time of different types of events
ami_event_size_bytes   | Gauge   | hutch, process         | Measures size of data sent over ZMQ sockets
ami_event_latency_secs | Gauge   | hutch, sender, process | Measures time it takes to send data over ZMQ
ami_graph              | Info    | hutch, name            | JSON string of AMI client graphs
ami_graph_version      | Gauge   | hutch, name            | Version number of the graph
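
These series can also be queried programmatically with PromQL via the prometheus_api_client package used in the retrieval example at the end of this page. The short sketch below is illustrative only: the Prometheus URL is taken from that example, the metric names are taken from the table above, and the 5-minute rate window is an arbitrary choice.

from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://psmetric03:9090", disable_ssl=True)

# Current graph version for each hutch/name (instant query on the gauge above)
versions = prom.custom_query(query="ami_graph_version")

# Per-process event rate over the last 5 minutes, one series per
# hutch/type/process combination of the ami_event_count counter
rates = prom.custom_query(query="rate(ami_event_count[5m])")

for series in rates:
    labels = series["metric"]           # label set (hutch, type, process, ...)
    timestamp, value = series["value"]  # instant query returns [timestamp, value]
    print(labels, value)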

Grafana Monitoring

  • Datagram Time: the average time it takes to process a single event in the heartbeat (only measured by workers)
  • Datagrams per second: the rate of datagrams seen by each worker.

  • Heartbeat Time: the time spent processing the events in a heartbeat interval, from their reception from psana to the sending of results to the collectors
  • Heartbeats per second: rate of heartbeats processed by individual processes

  • Idle Time: the time we spend waiting for psana to give us an event
  • Partial Event Rate: the rate at which events are incomplete; we think (but are not certain) that "complete" means all parts of the graph have the data they need to execute

  • Datagram Size: the size of the packet sent to the localCollector, globalCollector, and manager via ZMQ
  • Pruned Heartbeats: occurs when a heartbeat is missing a contribution (e.g. from a worker that received no events during that heartbeat interval). Heartbeats are NOT timed out with a timer; a heartbeat is only marked partially complete when, for example, a later heartbeat fragment shows up from the missing worker.

  • Latency: measures wall-clock time minus the heartbeat time (we think this is the time of the first event in the heartbeat, but we are not certain) at various stages in the path: localCollector to globalCollector to manager (the last one is for display to clients). We think latencies are never measured with respect to event times. We also think the latencies may not be shown in Grafana in a "synchronous" fashion: for one sample we may be looking at results from different heartbeats. See the sketch after this list for pulling these latency samples offline.
  • Heartbeat: current heartbeat of each process

  • Graph Version: increments when pushing a new version of the graph to AMI
  • Transitions: number of psana transitions seen by each process
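
As a rough sketch of how one of these quantities can be examined outside Grafana, the example below pulls the ami_event_latency_secs gauge for the last hour and prints the worst latency per sender/process pair. It assumes the same Prometheus server and hutch label value ('local') as the graph-retrieval example below; adjust both for your deployment.

from prometheus_api_client import PrometheusConnect
import datetime as dt

prom = PrometheusConnect(url="http://psmetric03:9090", disable_ssl=True)

# Pull one hour of latency samples; the hutch value 'local' is an
# assumption borrowed from the retrieval example below.
data = prom.get_metric_range_data("ami_event_latency_secs",
                                  label_config={'hutch': 'local'},
                                  start_time=dt.datetime.now() - dt.timedelta(hours=1),
                                  end_time=dt.datetime.now())

# Each series carries its label set in 'metric' and [timestamp, value]
# pairs in 'values'; report the maximum latency per sender/process.
for series in data:
    labels = series['metric']
    worst = max(float(value) for _, value in series['values'])
    print(labels.get('sender'), labels.get('process'), worst)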

Retrieving Client Graphs from Prometheus

from prometheus_api_client import PrometheusConnect
import datetime as dt
import pandas as pd

# Connect to the Prometheus server holding the AMI metrics
prom = PrometheusConnect(url="http://psmetric03:9090", disable_ssl=True)
print(prom.all_metrics())  # list the metric names available on the server

# Pull the ami_graph_info series for the last 8 hours
label = {'hutch': 'local', 'name': 'graph'}
data = prom.get_metric_range_data("ami_graph_info",
                                  label_config=label,
                                  start_time=(dt.datetime.now() - dt.timedelta(hours=8)),
                                  end_time=dt.datetime.now())

# The graph and its version are carried as labels in each result's 'metric' dict
metrics = [item.get('metric') for item in data]
df = pd.DataFrame(metrics)

# Select the desired graph version and write its serialized graph to a file
row = df[df['version'] == '3']
with open('dump.fc', 'w') as f:
    f.writelines(row.graph)
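
The version filter above assumes you already know which graph revision you want. A small follow-up to the script, listing the versions captured in the queried window before choosing one to dump:

# Show which graph versions (per name) are present in the pulled data
print(df[['name', 'version']].drop_duplicates())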


