...

The following 2 screenshots show a run with the EpixHrEmu DRP running at 5 kHz with 26 workers.  This was taken in a slightly different situation from the plot above, which used 60 workers.  With a processing time of 5.5 ms and a trigger rate of 5 kHz, one might expect the DRP to keep up (no deadtime) with 5.5 * 5 = 27.5 workers.  When dialing the number of workers down, I found that the system runs without deadtime even at 26 workers, but other Grafana plots show the DRP struggling to keep buffers available and the rate plots become noisier.  On average, though, the DRP keeps up.  The first screenshot below is an htop display.  It shows that there are not enough unloaded cores left to run a second DRP instance; indeed, when tried, such a system produced high deadtime.  Perhaps if one DRP were able to handle 2 panels (2 lanes) of data (currently this crashes), it might fit into one 64-core node, but it would be tight.  We decided against going down this path because some processing headroom is needed for things like file writing.
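
As a quick check on that arithmetic, the short sketch below computes the minimum worker count for a given per-event processing time and trigger rate; the min_workers helper is just an illustration, not anything in the DRP code.

    import math

    def min_workers(processing_time_s: float, trigger_rate_hz: float) -> int:
        """Smallest worker count whose combined throughput matches the trigger rate.

        Each worker handles 1 / processing_time_s events per second, so keeping
        up requires workers >= trigger_rate_hz * processing_time_s.
        """
        return math.ceil(trigger_rate_hz * processing_time_s)

    # Figures from the run described above: 5.5 ms per event at a 5 kHz trigger rate.
    print(min_workers(5.5e-3, 5000))   # 28, i.e. the ideal 27.5 rounded up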

Concerns

  • I hacked psana to provide support for adding application-level Prometheus metrics such as psana_compress_time and psana_calibration_time, shown above.  The hack currently prevents DrpPython from going through Configure a second time, so it needs to be commented out in production code.  I didn't quickly find a fix for this.  I believe the crash was a complaint about the metrics being created a second time with the same jobid label (a possible guard is sketched in the first example after this list).
  • psana metrics come with a jobid label to keep individual instances of them unique and identifiable.  The jobid is the process id of the process in which the metric was instantiated.  For DAQ processes, labels such as instrument, partition number and detector alias are more convenient for identifying metric instances.
  • psana uses a push gateway to publish metric values to Prometheus, whereas DAQ processes use an HTTP exporter to expose metrics for Prometheus servers to scrape.  Apparently push gateways require metrics to be actively deleted or they hang around "forever"; perhaps they are cleared upon a reboot of the push gateway host, or limited by a maximum metric instance count, etc.  Code I added to psana to do this active deletion didn't work, and in any case that approach doesn't clean up after crashes.  Would it be possible to move psana to an HTTP exporter similar to the DAQ's (see the second example after this list)?  Thorsten mentioned that the current approach isn't sustainable, but it wasn't clear to me which part he was referring to.
  • Registering metrics with a push gateway requires labels to be supplied in a grouping_key argument.  I didn't get an understanding of what is special about these labels.  Do we have the right split between grouping_key labels and regular metric labels?  Does it need to be different for the DAQ environment?
  • psana/psexp/__init__.py has circular imports.
  • The MPI support in psana is brought in when psana is used in DrpPython.  This causes an 'orted' process to be started for every worker process.  Apparently it comes from the import of mpi4py and is difficult to prevent.
  • Naively, one might expect a DRP process instance to consist of 3 threads (PGP reader, collector and EbReceiver) plus however many worker threads.  It turns out, however, that there are several other threads that consume CPU time on a per-event basis.  It would be good to understand why these threads require CPU time.
  • Would hyperthreading be beneficial to the compressor?
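
To illustrate the first two concerns above, here is a minimal sketch of how the duplicate-registration crash might be avoided while identifying metrics with DAQ-style labels (instrument, partition, detector alias) rather than a jobid.  It uses the standard prometheus_client package; the get_or_create_gauge helper and the label names and values are illustrative assumptions, not current psana code.

    from prometheus_client import CollectorRegistry, Gauge

    registry = CollectorRegistry()
    _metric_cache = {}   # assumed helper: survives a second pass through Configure

    def get_or_create_gauge(name, documentation, label_names):
        """Return an existing gauge or register a new one.

        prometheus_client raises ValueError ("Duplicated timeseries ...") if the
        same metric name is registered twice in one registry, which is what a
        second pass through Configure would otherwise do.
        """
        if name not in _metric_cache:
            _metric_cache[name] = Gauge(name, documentation, label_names,
                                        registry=registry)
        return _metric_cache[name]

    # DAQ-style identity labels instead of a process-id 'jobid' label.
    compress_time = get_or_create_gauge(
        'psana_compress_time', 'Per-event compression time [s]',
        ['instrument', 'partition', 'alias'])
    compress_time.labels(instrument='tmo', partition='4', alias='epixhr_0').set(5.5e-3)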
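
A second sketch contrasts the two publishing models from the push-gateway and grouping_key concerns: pushing a metric group to a Pushgateway with a grouping_key (which then has to be deleted explicitly), versus exposing an HTTP endpoint for a Prometheus server to scrape, as the DAQ does.  The gateway host, ports, label values and example numbers are placeholders; this only shows the prometheus_client calls involved, not where they would live in psana.

    from prometheus_client import (CollectorRegistry, Gauge, delete_from_gateway,
                                   push_to_gateway, start_http_server)

    registry = CollectorRegistry()
    calib_time = Gauge('psana_calibration_time', 'Per-event calibration time [s]',
                       registry=registry)
    calib_time.set(1.2e-3)

    # Push model (current psana approach): the grouping_key identifies this
    # metric group on the Pushgateway, and pushed groups persist until deleted.
    grouping = {'instrument': 'tmo', 'partition': '4', 'alias': 'epixhr_0'}
    push_to_gateway('pushgateway-host:9091', job='psana', registry=registry,
                    grouping_key=grouping)
    # Needed for a clean shutdown, but never runs if the process crashes.
    delete_from_gateway('pushgateway-host:9091', job='psana', grouping_key=grouping)

    # Scrape model (DAQ approach): expose the registry on a port and let the
    # Prometheus server pull; a dead process simply stops being scraped.
    start_http_server(9200, registry=registry)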