- Wilko/Thorsten worry about the scalability of prometheus (can it handle millions of variables per second?); the relevant quantity is called "cardinality": the number of unique time series currently being populated (name + labels)
- the worry is more about storing data than retrieving it, since we retrieve less often?
- Wilko purges the prometheus database every 30 days to save space
- could make it longer if we needed to ... push to other databases like influxDB?
- Mona pushes to the prometheus gateway once every 5 seconds
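A periodic push like this can be sketched with only the standard library, since the Pushgateway accepts the plain-text exposition format via HTTP PUT to /metrics/job/&lt;job&gt;. The gateway address and metric names below are made-up placeholders, not the actual ones in use:

```python
import time
import urllib.request

GATEWAY = "http://psmetric03:9091"  # hypothetical Pushgateway address

def exposition_line(name, labels, value):
    """Format one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}\n"

def push(job, body):
    """PUT the metric body to the Pushgateway under the given job name."""
    req = urllib.request.Request(
        f"{GATEWAY}/metrics/job/{job}", data=body.encode(), method="PUT")
    urllib.request.urlopen(req)

# the every-5-seconds loop (sketch; not run here):
# while True:
#     push("psana", exposition_line("psana_events_total", {"hutch": "tmo"}, 1234))
#     time.sleep(5)
```

In practice the prometheus_client library wraps all of this; the sketch just makes the wire format visible.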
- DAQ metrics are "scraped" (polled) from psmetric03, using a list of nodes/ports
- list of daq nodes/ports is in /cds/group/psdm/psdatmgr/etc/config/prom/{hutch}
- that file has old entries in it, and it's not clear how to clean them up
- not clear we can use the scraping pattern with psana slurm jobs
- DAQ is OK since we just get a handful of numbers per second from each drp process (not each drp worker thread)
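One way to turn the per-hutch node/port list into scrape targets without hand-editing the server config is Prometheus file-based service discovery, which reads a JSON list of `{"targets": [...], "labels": {...}}` groups. A minimal sketch, assuming the hutch file holds one "host:port" per line with "#" comments (the actual file format is an assumption):

```python
import json

def to_file_sd(lines, hutch):
    """Convert 'host:port' lines into Prometheus file_sd_config JSON,
    skipping blanks and '#' comment lines."""
    targets = [ln.strip() for ln in lines
               if ln.strip() and not ln.strip().startswith("#")]
    return json.dumps([{"targets": targets, "labels": {"hutch": hutch}}])

# e.g. to_file_sd(["drp-tmo-01:9200", "# old entry"], "tmo")
```

Since Prometheus re-reads file_sd files automatically, regenerating the JSON from the hutch list would also be a path to cleaning up stale entries.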
- Ric was worried about the lifetime of variables in the gateway, but Wilko thinks these can be cleaned up.
- the lifetime of gateway variables is independent of their lifetime in the prometheus database
- Plan to run a separate prometheus gateway at nersc (and data may stay at nersc)
- psana's labelling of the data (it uses the rank-0 process id) feels not-ideal, since the id can collide with other batch jobs
- daq uses instrument/partition/detector_alias/segment
- could we use the daq labelling when running drp-python?
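If drp-python/psana adopted the daq scheme, the label set might look like the sketch below. The label names come from the daq convention above; the example values are placeholders:

```python
def daq_style_labels(instrument, partition, detector_alias, segment):
    """Build a daq-convention label set (instead of a rank-0 PID),
    so drp-python metrics line up with daq metrics."""
    return {
        "instrument": instrument,
        "partition": str(partition),
        "detector_alias": detector_alias,
        "segment": str(segment),
    }

# e.g. daq_style_labels("tmo", 4, "epixquad", 0)
```

Unlike a process id, every field here is stable across restarts and unique within a hutch, which is what makes the results easy to harvest.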
- the gateway has an additional argument called "grouping key" that is not present for the daq scraping
- might make it difficult to have the same model for daq/offline-psana
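The grouping key is visible in the Pushgateway URL itself: each key/value pair becomes extra path segments after the job name, and it determines which push replaces which. A small sketch of the path construction (job and key values are illustrative):

```python
def gateway_path(job, grouping_key):
    """Pushgateway URL path: the grouping key becomes extra
    /label/value segments after the job name."""
    path = f"/metrics/job/{job}"
    for k, v in sorted(grouping_key.items()):
        path += f"/{k}/{v}"
    return path

# gateway_path("psana", {"instance": "drp-tmo-01"})
#   -> "/metrics/job/psana/instance/drp-tmo-01"
```

This is the piece that has no analogue on the scraping side, which is why it complicates having one model for daq and offline psana.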
- Ric has code to measure det.calib and libpressio time. He is thinking of commenting that out for now because of the second-configure problem described below.
To-do List
- can we change drp-python labeling to make it like the daq so it's easier to harvest the results?
- a second configure in drp-python causes the metrics to be recreated which fails
- psana shouldn't use the rank-0 process id for labelling since it is not unique
- we will watch to see if prometheus scaling issues slow down psana (Mona says it's easy to turn off prometheus if this happens)
- automate clean up of the gateway variables
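On the cleanup to-do: the Pushgateway honours HTTP DELETE on the same job/grouping-key path used for pushing, so automation could be as small as the sketch below (the gateway address is a made-up placeholder):

```python
import urllib.request

GATEWAY = "http://psmetric03:9091"  # hypothetical Pushgateway address

def build_delete(job, grouping_key=None):
    """Build an HTTP DELETE request that removes all metrics for one
    push group; the path mirrors the one used when pushing."""
    path = f"/metrics/job/{job}"
    for k, v in sorted((grouping_key or {}).items()):
        path += f"/{k}/{v}"
    return urllib.request.Request(GATEWAY + path, method="DELETE")

# to actually delete a stale group (not run here):
# urllib.request.urlopen(build_delete("psana", {"instance": "drp-tmo-01"}))
```

A cron job iterating over known-finished jobs and issuing these deletes would keep the gateway's variable set bounded.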