AMI is instrumented extensively with Prometheus metrics, which are visualized in Grafana so that users can monitor AMI's performance. Additionally, user graphs are stored in Prometheus and can be pulled and executed offline to investigate performance issues.
Prometheus Metrics
| Name | Type | Labels | Description |
|---|---|---|---|
| ami_event_count | Counter | hutch, type, process | Counts occurrences of different types of events |
| ami_event_time_secs | Gauge | hutch, type, process | Measures elapsed time of different types of events |
| ami_event_size_bytes | Gauge | hutch, process | Measures size of data sent over ZMQ sockets |
| ami_event_latency_secs | Gauge | hutch, sender, process | Measures time it takes to send data over ZMQ |
| ami_graph | Info | hutch, name | JSON string of AMI client graphs |
| ami_graph_version | Gauge | hutch, name | Version number of the graph |
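For reference, the table above can be illustrated with the `prometheus_client` library. The following is only a minimal sketch, not AMI's actual instrumentation code; all label values (e.g. `'tst'`, `'worker0'`) are placeholders. One detail worth noting: an Info metric named `ami_graph` is exposed to Prometheus under the name `ami_graph_info`, which is the name used in the query example later on this page.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Info, generate_latest

registry = CollectorRegistry()

# Counter: a monotonically increasing event tally (label values are placeholders).
event_count = Counter('ami_event_count',
                      'Counts occurrences of different types of events',
                      ['hutch', 'type', 'process'], registry=registry)
event_count.labels(hutch='tst', type='heartbeat', process='worker0').inc()

# Gauge: the last observed value, e.g. the elapsed time of an event.
event_time = Gauge('ami_event_time_secs',
                   'Measures elapsed time of different types of events',
                   ['hutch', 'type', 'process'], registry=registry)
event_time.labels(hutch='tst', type='heartbeat', process='worker0').set(0.05)

# Info: a key/value payload attached to a label set. An Info metric named
# "ami_graph" is exposed as "ami_graph_info".
graph_info = Info('ami_graph', 'JSON string of AMI client graphs',
                  ['hutch', 'name'], registry=registry)
graph_info.labels(hutch='tst', name='graph').info({'payload': '{}', 'version': '1'})

exposition = generate_latest(registry).decode()
```

Dumping `exposition` shows the text format Prometheus scrapes, including the `_total` suffix that the client library appends to counters.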
Grafana Monitoring
- Datagram Time: the average time it takes to process a single event in the heartbeat (only measured by workers)
- Datagrams per second: the rate of datagrams seen by each worker.
...
- Graph Version: increments when pushing a new version of the graph to AMI
- Transitions: number of psana transitions seen by each process
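Rate panels such as "Datagrams per second" are derived from a cumulative counter (in Grafana this is typically a PromQL `rate()` over `ami_event_count`). The equivalent offline computation from two counter samples can be sketched as follows; the sample values here are made up for illustration:

```python
# Sketch: per-second rate from two samples of a cumulative counter
# such as ami_event_count (sample values are made up).
def rate(samples):
    """samples: list of (unix_timestamp, counter_value) pairs, oldest first."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Two samples 10 s apart: the counter advanced from 1200 to 1500 datagrams.
dps = rate([(1700000000, 1200), (1700000010, 1500)])  # 30.0 datagrams/s
```

Note this simple difference does not handle counter resets (e.g. a process restart), which PromQL's `rate()` accounts for automatically.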
Retrieving Client Graphs from Prometheus
```python
from prometheus_api_client import PrometheusConnect
import datetime as dt
import pandas as pd

# Connect to the Prometheus server (no TLS on this host).
prom = PrometheusConnect(url="http://psmetric03:9090", disable_ssl=True)
prom.all_metrics()  # optional: list every metric name known to the server

# Pull the last 8 hours of graph info for the desired hutch and graph name.
label = {'hutch': 'local', 'name': 'graph'}
data = prom.get_metric_range_data("ami_graph_info",
                                  label_config=label,
                                  start_time=(dt.datetime.now() - dt.timedelta(hours=8)),
                                  end_time=dt.datetime.now())

# Each result's label set carries the serialized graph and its version.
metrics = [i.get('metric') for i in data]
df = pd.DataFrame(metrics)

# Select the desired version and dump its graph to a file.
row = df[df['version'] == '3']
with open('dump.fc', 'w') as f:
    f.writelines(row.graph)
```
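The example above hard-codes version `'3'`. If instead you want the newest graph that Prometheus has recorded, you can pick the maximum version from the label dicts directly. This sketch uses fake, hand-written label dicts that only mimic the structure of `metrics` from the query above; the `'graph'` payloads are placeholders, not real serialized graphs:

```python
# Sketch: selecting the newest graph version instead of hard-coding one.
# `metrics` mimics the label dicts returned by get_metric_range_data;
# the 'graph' payload strings here are fake placeholders.
metrics = [
    {'hutch': 'local', 'name': 'graph', 'version': '2', 'graph': 'payload-v2'},
    {'hutch': 'local', 'name': 'graph', 'version': '3', 'graph': 'payload-v3'},
]

# Version labels are strings, so compare them numerically.
latest = max(metrics, key=lambda m: int(m['version']))
graph_payload = latest['graph']  # what would be written to dump.fc
```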