• Wilko/Thorsten worry about the scalability of Prometheus (can it handle millions of variables per second?), in particular its "cardinality": the number of unique time series currently being populated (unique name+label combinations)
    • the worry is more about storing the data than retrieving it, since we retrieve much less often?
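The cardinality concern can be made concrete: every unique combination of metric name and label values is its own time series, so cardinality multiplies across labels. A small illustrative sketch (the label names and values below are hypothetical, not taken from our configs):

```python
from itertools import product

# Hypothetical label values for a single metric name; every combination
# is a distinct time series that Prometheus must track in memory.
label_values = {
    "instrument": ["tmo", "rix", "xpp"],        # 3 values
    "detector_alias": ["epix_1", "epix_2"],     # 2 values
    "segment": [str(s) for s in range(8)],      # 8 values
}

series = list(product(*label_values.values()))
print(len(series))  # 3 * 2 * 8 = 48 time series for one metric name
```

Adding one more 10-valued label would multiply this by 10, which is why high-cardinality labels (like process ids) are the usual scaling hazard.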
  • Wilko purges the Prometheus database every 30 days to save space
    • could make it longer if we needed to ... push to other databases like InfluxDB?
  • Mona pushes to the Prometheus gateway once every 5 seconds
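For context, a push like this amounts to an HTTP PUT of the text exposition format to the Pushgateway's /metrics/job/&lt;job&gt; path. A minimal stdlib sketch, assuming the standard Pushgateway HTTP API (gateway host and metric name are illustrative; in practice prometheus_client's push_to_gateway does this for you):

```python
import urllib.request

def build_push(gateway: str, job: str, metrics: dict) -> urllib.request.Request:
    """Build (but do not send) a Pushgateway PUT request."""
    url = f"http://{gateway}/metrics/job/{job}"
    # Text exposition format: one "name value" line per metric, newline-terminated.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(url, data=body.encode(), method="PUT")

req = build_push("psmetric-gw:9091", "psana", {"events_processed_total": 12345})
print(req.full_url)       # http://psmetric-gw:9091/metrics/job/psana
print(req.data.decode())  # events_processed_total 12345
# A real pusher would call urllib.request.urlopen(req) every ~5 seconds.
```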
  • DAQ is "scraped" (i.e. polled) from psmetric03, using a list of nodes/ports
    • list of daq nodes/ports is in /cds/group/psdm/psdatmgr/etc/config/prom/{hutch} 
      • it has stale entries in it and it's not clear how to clean them up
    • not clear we can use the scraping pattern with psana Slurm jobs
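One way the per-hutch node/port lists could be kept from going stale is to generate them instead of hand-editing: Prometheus's file_sd_configs mechanism re-reads a JSON target file on change. A sketch of that target-file format, with hypothetical hostnames (the actual format of the existing files under /cds/.../prom is not confirmed here):

```python
import json

# Hypothetical daq endpoints for one hutch; in a generated setup these
# would come from the partition/run-control database, not a hand-edited file.
targets = [
    {"targets": ["drp-srv001:9200", "drp-srv002:9200"],
     "labels": {"hutch": "tmo"}},
]

# file_sd_configs re-reads this JSON whenever it changes, so regenerating
# it automatically would keep stale nodes/ports from accumulating.
doc = json.dumps(targets, indent=2)
print(doc)
```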
  • DAQ is OK since we only get a handful of numbers per second from each drp process (not from each drp worker thread)
  • Ric was worried about lifetime of variables in the gateway but Wilko thinks these can be cleaned up.
  • Lifetime of gateway variables is independent of their lifetime in the Prometheus database
  • Plan to run a separate Prometheus gateway at NERSC (and the data may stay at NERSC)
  • psana's labelling of the data (it uses the rank-0 process id) feels non-ideal, since that id can collide with other batch jobs
    • daq uses instrument/partition/detector_alias/segment
    • could we use the daq labelling when running drp-python?
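The two labelling schemes can be contrasted in a small sketch (the function names here are ours, purely illustrative): the daq-style labels identify the data source stably, while a rank-0 pid is recycled by the OS and can be reused by an unrelated batch job:

```python
import os

def daq_labels(instrument, partition, detector_alias, segment):
    # Stable, semantically meaningful label set, mirroring the daq's
    # instrument/partition/detector_alias/segment convention.
    return {"instrument": instrument, "partition": str(partition),
            "alias": detector_alias, "segment": str(segment)}

def psana_labels():
    # Current psana scheme: the rank-0 process id. Pids are recycled,
    # so two batch jobs can end up publishing under the same label.
    return {"pid": str(os.getpid())}

print(daq_labels("tmo", 4, "epix_1", 0))
```

If drp-python adopted the daq-style set, the same dashboards/queries could harvest both online and offline results.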
  • the gateway takes an additional argument called a "grouping key" that has no counterpart in the daq scraping
    • might make it difficult to have the same model for daq/offline-psana
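For reference, the Pushgateway's grouping key is just extra label/value pairs appended to the push URL path, which is why it has no counterpart in the scrape model (there, equivalent labels come from the scrape config). A sketch of the URL shape, with illustrative job/label values:

```python
def pushgateway_path(job: str, grouping_key: dict) -> str:
    # Grouping-key pairs extend the /metrics/job/<job> path; metrics pushed
    # under different grouping keys replace each other independently.
    parts = [f"/metrics/job/{job}"]
    for label, value in grouping_key.items():
        parts.append(f"/{label}/{value}")
    return "".join(parts)

print(pushgateway_path("drp", {"instrument": "tmo", "segment": "0"}))
# /metrics/job/drp/instrument/tmo/segment/0
```

A unified daq/offline model would need the scrape-side labels and the push-side grouping key to produce the same final label set.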
  • Ric has code to measure det.calib and libpressio time. He is thinking of commenting it out for now because of the second-configure problem described below.

To-do List

  • can we change drp-python labeling to make it like the daq so it's easier to harvest the results?
  • a second configure in drp-python causes the metrics to be recreated which fails
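A common workaround for this failure mode (a metrics client typically raises on a duplicate registration of the same metric name) is a get-or-create pattern, so a second configure reuses the existing metric instead of recreating it. A stdlib sketch with a plain class standing in for the real client registry:

```python
class Registry:
    """Stand-in for a metrics registry that rejects duplicate names."""
    def __init__(self):
        self._metrics = {}

    def register(self, name, metric):
        if name in self._metrics:
            raise ValueError(f"duplicate metric: {name}")
        self._metrics[name] = metric

    def get_or_create(self, name, factory):
        # On a second configure, return the existing metric instead of
        # re-registering (which would raise above).
        if name not in self._metrics:
            self._metrics[name] = factory()
        return self._metrics[name]

reg = Registry()
first = reg.get_or_create("det_calib_seconds", dict)   # first configure
second = reg.get_or_create("det_calib_seconds", dict)  # second configure
print(first is second)  # True
```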
  • psana shouldn't use the rank-0 process id for labelling since it is not unique
  • we will watch to see if prometheus scaling issues slow down psana (Mona says it's easy to turn off prometheus if this happens)
  • automate clean up of the gateway variables
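The Pushgateway API also supports deleting a group's metrics with an HTTP DELETE on the same path used for pushing, which is one plausible building block for automating the cleanup (gateway host and job name below are illustrative):

```python
import urllib.request

def build_delete(gateway: str, job: str) -> urllib.request.Request:
    # DELETE /metrics/job/<job> removes all metrics pushed under that group
    # (grouping-key labels would extend the path the same way as for a push).
    url = f"http://{gateway}/metrics/job/{job}"
    return urllib.request.Request(url, method="DELETE")

req = build_delete("psmetric-gw:9091", "psana")
print(req.get_method(), req.full_url)
# A periodic cleanup job could urlopen(req) for each stale job/grouping key.
```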