...

  • I hacked psana to support adding application-level Prometheus metrics such as psana_compress_time and psana_calibration_time, shown above.  The hack currently prevents DrpPython from going through Configure a second time, so metric instantiation needs to be commented out in production code (currently only in the epixHrEmu.py DrpPython script).  I didn't quickly find a fix for this.  I believe the crash was a complaint about the metrics being created a second time with the same jobid label.
  • psana metrics are created with a jobid label to keep individual instances of them unique and identifiable.  The jobid is the process id of the process in which the metric was instantiated.  For DAQ processes, labels such as instrument, partition number and detector alias are more convenient for identifying metric instances.
  • psana uses a push gateway to publish metric values to Prometheus, whereas DAQ processes use an http exporter to expose metrics for Prometheus servers to scrape.  Apparently push gateways require metrics to be actively deleted or they hang around "forever" (perhaps they are cleared on reboot of the push-gateway host, or bounded by a maximum metric-instance count, etc.).  The code I added to psana to do this active deletion didn't work, and in any case that approach doesn't clean up after crashes.  Would it be possible to move psana to an http exporter, similarly to the DAQ?  Thorsten mentioned that the current approach isn't sustainable, but it wasn't clear to me which portion he was referring to.
  • Registering metrics with a push gateway requires labels to be supplied in a grouping_key argument.  I didn't come to an understanding of what is special about these labels.  Do we have the right mix of grouping-key labels vs. regular labels?  Does it need to be different for the DAQ environment?
  • psana/psexp/__init__.py has circular imports.
  • The MPI support in psana is brought in when psana is used in DrpPython.  This causes an 'orted' process to be started for every worker process.  Apparently it comes about from an import of mpi4py and is difficult to prevent.
  • Naively, one might expect a DRP process instance to consist of 3 threads (PGP reader, collector and EbReceiver) plus however many worker threads.  It turns out, however, that there are several other threads that consume CPU time on a per-event basis.  It would be good to understand why these threads require CPU time.
  • Would hyperthreading be beneficial to the compressor?