- Wilko/Thorsten worry about the scalability of prometheus (can it handle millions of variables per second?); the relevant quantity is called "cardinality": the number of unique time series currently being populated (name + labels)
- the worry is more about storing data than retrieving it, since we retrieve less often?
- Wilko purges the prometheus database every 30 days to save space
- could make it longer if we needed to ... push to other databases like influxDB?
- Mona pushes to the prometheus gateway once every 5 seconds
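A periodic push like this can be sketched with only the standard library, since the Pushgateway accepts the plain-text exposition format via HTTP PUT to /metrics/job/&lt;job&gt;. The gateway address and metric names below are made-up placeholders, not the actual ones in use:

```python
import time
import urllib.request

GATEWAY = "http://psmetric03:9091"  # hypothetical Pushgateway address

def exposition_line(name, labels, value):
    """Format one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}\n"

def push(job, body):
    """PUT the metric body to the Pushgateway under the given job name."""
    req = urllib.request.Request(
        f"{GATEWAY}/metrics/job/{job}", data=body.encode(), method="PUT")
    urllib.request.urlopen(req)

# the every-5-seconds loop (sketch; not run here):
# while True:
#     push("psana", exposition_line("psana_events_total", {"hutch": "tmo"}, 1234))
#     time.sleep(5)
```

In practice the prometheus_client library wraps all of this; the sketch just makes the wire format visible.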
- DAQ metrics are "scraped" (polled) from psmetric03, using a list of nodes/ports
- list of daq nodes/ports is in /cds/group/psdm/psdatmgr/etc/config/prom/{hutch}
- that file has old entries in it, and it's not clear how to clean them up
- not clear we can use the scraping pattern with psana slurm jobs
- DAQ is OK since we just get a handful of numbers per second from each drp process (not each drp worker thread)
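One way to turn the per-hutch node/port list into scrape targets without hand-editing the server config is Prometheus file-based service discovery, which reads a JSON list of `{"targets": [...], "labels": {...}}` groups. A minimal sketch, assuming the hutch file holds one "host:port" per line with "#" comments (the actual file format is an assumption):

```python
import json

def to_file_sd(lines, hutch):
    """Convert 'host:port' lines into Prometheus file_sd_config JSON,
    skipping blanks and '#' comment lines."""
    targets = [ln.strip() for ln in lines
               if ln.strip() and not ln.strip().startswith("#")]
    return json.dumps([{"targets": targets, "labels": {"hutch": hutch}}])

# e.g. to_file_sd(["drp-tmo-01:9200", "# old entry"], "tmo")
```

Since Prometheus re-reads file_sd files automatically, regenerating the JSON from the hutch list would also be a path to cleaning up stale entries.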
- Ric was worried about the lifetime of variables in the gateway, but Wilko thinks these can be cleaned up.
- the lifetime of gateway variables is independent of their lifetime in the prometheus database
- Plan to run a separate prometheus gateway at nersc (and data may stay at nersc)
- psana's labelling of the data (it uses the rank-0 process id) feels not-ideal, since the id can collide with other batch jobs
- daq uses instrument/partition/detector_alias/segment
- could we use the daq labelling when running drp-python?
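If drp-python/psana adopted the daq scheme, the label set might look like the sketch below. The label names come from the daq convention above; the example values are placeholders:

```python
def daq_style_labels(instrument, partition, detector_alias, segment):
    """Build a daq-convention label set (instead of a rank-0 PID),
    so drp-python metrics line up with daq metrics."""
    return {
        "instrument": instrument,
        "partition": str(partition),
        "detector_alias": detector_alias,
        "segment": str(segment),
    }

# e.g. daq_style_labels("tmo", 4, "epixquad", 0)
```

Unlike a process id, every field here is stable across restarts and unique within a hutch, which is what makes the results easy to harvest.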
- the gateway has an additional argument called "grouping key" that is not present for the daq scraping
- might make it difficult to have the same model for daq/offline-psana
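The grouping key is visible in the Pushgateway URL itself: each key/value pair becomes extra path segments after the job name, and it determines which push replaces which. A small sketch of the path construction (job and key values are illustrative):

```python
def gateway_path(job, grouping_key):
    """Pushgateway URL path: the grouping key becomes extra
    /label/value segments after the job name."""
    path = f"/metrics/job/{job}"
    for k, v in sorted(grouping_key.items()):
        path += f"/{k}/{v}"
    return path

# gateway_path("psana", {"instance": "drp-tmo-01"})
#   -> "/metrics/job/psana/instance/drp-tmo-01"
```

This is the piece that has no analogue on the scraping side, which is why it complicates having one model for daq and offline psana.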
- Ric has code to measure det.calib and libpressio time. He is thinking of commenting that out for now because of the second-configure problem described below.
To-do List
- can we change drp-python labeling to make it like the daq so it's easier to harvest the results?
- a second configure in drp-python causes the metrics to be recreated which fails
- psana shouldn't use the rank-0 process id for labelling since it is not unique
- we will watch to see if prometheus scaling issues slow down psana (Mona says it's easy to turn off prometheus if this happens)
- automate clean up of the gateway variables
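On the cleanup to-do: the Pushgateway honours HTTP DELETE on the same job/grouping-key path used for pushing, so automation could be as small as the sketch below (the gateway address is a made-up placeholder):

```python
import urllib.request

GATEWAY = "http://psmetric03:9091"  # hypothetical Pushgateway address

def build_delete(job, grouping_key=None):
    """Build an HTTP DELETE request that removes all metrics for one
    push group; the path mirrors the one used when pushing."""
    path = f"/metrics/job/{job}"
    for k, v in sorted((grouping_key or {}).items()):
        path += f"/{k}/{v}"
    return urllib.request.Request(GATEWAY + path, method="DELETE")

# to actually delete a stale group (not run here):
# urllib.request.urlopen(build_delete("psana", {"instance": "drp-tmo-01"}))
```

A cron job iterating over known-finished jobs and issuing these deletes would keep the gateway's variable set bounded.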