Overview

Before an EpixHR camera becomes available in hardware form, an emulator in SRCF is to be set up to develop and test the operation of all the components involved in the DAQ for this device.  This set-up would consist of:

  • 10 or 20 DRP nodes configured with the DrpTDet KCU firmware
    • The goal is to have each DRP handle 1 or 2 tiles/panels of the camera
  • The tdetsim sim_length parameter configured to produce a data volume similar to that of the real camera
  • Substituting real EpixHR2x2 data for the tdetsim data in the DRP's event() method
  • A DrpPython script to compress event data into 'FEX' contributions, while retaining a prescaled amount of the original 'raw' data

Method

For the initial step, the DRP code was modified to support an epixhremu detector.  This detector requires the TDetSim firmware in the KCU.  The plan is for 1 or 2 camera tiles/panels to be supported on 1 or 2 lanes per segment-level DRP.  An EpixHRemu C++ class was created to dummy up the serial number and data for the detector.

Serial number

The serial number is of the form epixhremu_00cafe0000-0000000000-0000000000-0000000000-0000000000-0000000000-0000000000, where the 4 zeros after 'cafe' are replaced with the segment number, i.e. ...00cafe0000... through ...00cafe0013...
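
A minimal sketch of how such a string could be composed (illustrative only; the real identifier is generated by the EpixHRemu C++ class):

# Compose the emulated serial number for a given segment.  The segment number
# appears to be encoded in hex, since 20 segments end at ...00cafe0013.
def emu_serial(segment):
    return 'epixhremu_00cafe%04x' % segment + '-0000000000' * 6

print(emu_serial(0))    # epixhremu_00cafe0000-0000000000-...-0000000000
print(emu_serial(19))   # epixhremu_00cafe0013-0000000000-...-0000000000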

Data

To emulate a data bandwidth similar to that expected from the real detector, the TDetSim sim_length parameter was set to 55292, corresponding to about 221 kB of pixel data.  This matches the panel size: 144 x 192 pixels x 4 ASICs x 2 bytes per pixel = 221184 bytes.
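
As a quick check of those numbers (a sketch; the assumption that sim_length counts 32-bit words is mine and is not confirmed here):

# Per-panel payload the emulator is meant to produce
asics, rows, cols, bytes_per_pixel = 4, 144, 192, 2
panel_bytes = asics * rows * cols * bytes_per_pixel
print(panel_bytes)        # 221184 bytes, i.e. ~221 kB
print(panel_bytes // 4)   # 55296 32-bit words, in the ballpark of the 55292 used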

The data passed downstream is provided by an XTC file of existing data.  This file is specified on the DRP command line through an xtcfile keyword argument.  In case the file contains some junk events, an l1aOffset kwarg can be used to skip past the first N events in the file.  The substitute data is further indexed by the segment number, multiplied by the number of lanes in use, to avoid passing the same data for each panel.
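
A small sketch of that indexing, for illustration only (the variable names are hypothetical; the actual lookup lives in the EpixHRemu C++ code, and the per-lane term is my assumption so that the panels within one DRP also differ):

def source_event_index(l1aOffset, segment, num_lanes, lane):
    # Skip the junk events, then pick a distinct source event per segment/lane
    return l1aOffset + segment * num_lanes + lane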

The data files we currently (9/1/23) have available were recorded during the EpixHR2x2 testing.  That detector had only one panel more or less working, handled by ASIC 0.  A file with somewhat varying data in it is rixx1003721-r0106-s010-c000.xtc2.  For the emulated data (either 4 or 8 panels, depending on the lane mask), each panel is given ASIC 0 data from a different event of this file.  Since the same source events are reused for every trigger, all emulated events carry identical data.

Python detector interface

Data of the form described above was recorded (in April 2023) and used to develop the detector interface.  From it we see events like the following in the data files.  Both runs were taken with one panel per segment.  The first is from run 277 (psana://exp=tstx00417,run=277,dir=/cds/data/drpsrcf/tst/tstx00417/xtc) and contains the segment 3 panel; the second is from run 276 (psana://exp=tstx00417,run=276,dir=/cds/data/drpsrcf/tst/tstx00417/xtc) and contains all 20 segments.
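
For reference, a minimal psana sketch for browsing one of these runs offline (the detector alias 'epixhr_emu' is taken from the DrpPython script below; the rest is illustrative):

from psana import DataSource

ds = DataSource(exp='tstx00417', run=277,
                dir='/cds/data/drpsrcf/tst/tstx00417/xtc')
for myrun in ds.runs():
    epixhr = myrun.Detector('epixhr_emu')
    for nevt, evt in enumerate(myrun.events()):
        cal = epixhr.raw.calib(evt)   # calibrated panel data, or None if absent
        print(nevt, None if cal is None else (cal.shape, cal.dtype))
        if nevt >= 4:
            break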

DrpPython

In the next step, the new DrpPython functionality was used to compress the raw data with libpressio.  We first tried the SZ algorithm and found that its performance didn't scale well with rate.  SZ3 worked better, and we saw compression times of ~3.5 ms at a trigger rate of 5 kHz.  Calibrating the data to ready it for the compressor took 1.8 ms.  Sixty workers (60 Python processes) were used to distribute the load and achieve the rate.  The following listing shows the DrpPython script used:

epixHrEmu.py
from psana import DataSource
from psana.dgramedit import AlgDef, DetectorDef, DataType
import psana.psexp.TransitionId
import sys
import numpy as np
from libpressio import PressioCompressor
import json

# Define the libpressio compressor configuration.  The commented-out entries
# below are left over from the earlier SZ attempt; SZ3 performed better.
lpjson = {
    "compressor_id": "sz3", # the compression algorithm
    "compressor_config": {
        #"sz:data_type"           : lp.pressio_uint16_dtype,
        #"sz:data_type"           : np.dtype('uint16'),
        ###"sz:error_bound_mode_str" : "abs",
        ###"sz:abs_err_bound"        : 10, # max error
        "sz3:abs_error_bound"     : 10, # max error
        "sz3:metric"              : "size",
        #"pressio:nthreads"        : 4
    },
}

# drp_info is supplied to this script by the DRP when it runs under DrpPython
ds = DataSource(drp=drp_info, monitor=True)
thread_num = drp_info.worker_num  # index of this worker process

# Datagram definitions: a 'config' entry carrying the compressor configuration
# and a per-event 'fex' entry carrying the compressed data
cfgAlg = AlgDef("config", 0, 0, 1)
fexAlg = AlgDef("fex", 0, 0, 1)
detDef = DetectorDef(drp_info.det_name, drp_info.det_type, drp_info.det_id)
cfgDef = {
    "compressor_json" : (str,      1), # compressor configuration as a JSON string
}
fexDef = {
    "fex"             : (np.uint8, 1), # compressed byte stream, hence uint8 rather than float32
}
nodeId = None
namesId = None

cfg = ds.add_detector(detDef, cfgAlg, cfgDef, nodeId, namesId, drp_info.det_segment)
det = ds.add_detector(detDef, fexAlg, fexDef, nodeId, namesId, drp_info.det_segment)

cfg.config.compressor_json = json.dumps(lpjson)

ds.add_data(cfg.config)  # record the compressor configuration in the data stream

# Instantiate the compressor from the same configuration recorded above
compressor = PressioCompressor.from_config(lpjson)
#print(compressor.get_config())

for myrun in ds.runs():
    epixhr = myrun.Detector('epixhr_emu')
    for nevt,evt in enumerate(myrun.events()):
        cal = epixhr.raw.calib(evt)           # calibrate the raw panel data
        det.fex.fex = compressor.encode(cal)  # compress the calibrated data
        ds.add_data(det.fex)                  # add the 'fex' contribution to the event
        if nevt%1000!=0: ds.remove_data('epixhr_emu','raw')  # keep 'raw' only as a 1-in-1000 prescale
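
On the analysis side, the detector interface reverses this compression using the configuration recorded in the config datagram (as noted below, the AMI display shows data decompressed this way).  A rough sketch of that step, assuming libpressio's decode takes the compressed buffer plus an output template; cfg_json, fex_bytes and cal_shape are placeholders for values read back from the data:

from libpressio import PressioCompressor
import numpy as np
import json

compressor   = PressioCompressor.from_config(json.loads(cfg_json))
template     = np.empty(cal_shape, dtype=np.float32)  # shape/dtype of the original calibrated array
decompressed = compressor.decode(fex_bytes, template)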

The times were obtained by adding Prometheus metrics to the script (not shown here for clarity) and viewing the results in Grafana.  This performance was measured using one DRP segment and one panel.  A typical live AMI data image from such a run (segment 0 on drp-srcf-cmp035) is:

Note that this shows data that has been decompressed by the detector interface.  The following is a snapshot of the Grafana performance plot showing the calibration and compression times (1.94 ms and 3.55 ms) seen for a 5 kHz run.  The green trace, showing 5.52 ms, is the time spent in the Python script for each event, as seen from the C++ code.

The following two screenshots show a run with the EpixHrEmu DRP running at 5 kHz with 26 workers.  This was taken in a marginally different situation from the above plot, which was with 60 workers.  With a processing time of 5.5 ms and a trigger rate of 5 kHz, one might expect the DRP to keep up (no deadtime) with 5.5 * 5 = 27.5 workers.  When dialing down the number of workers, I found that the system runs without deadtime even at 26 workers, but other Grafana plots show the DRP struggling to keep buffers available, and the rate plots become noisier.  On average, though, the DRP keeps up.

The first screenshot below is a display from htop.  It shows that there are insufficient unloaded cores left to run a second DRP instance; indeed, when tried, such a system produced high deadtime.  Perhaps if one DRP were able to handle 2 panels (2 lanes) of data (currently this crashes), it might fit into one 64-core node, but it would be tight.  We decided against going down this path because some processing-power headroom is needed for things like file writing.

Concerns

  • I hacked psana to provide support for adding application-level Prometheus metrics, such as psana_compress_time and psana_calibration_time, shown above.  The hack currently prevents DrpPython from going through Configure a second time, so metric instantiation needs to be commented out in production code (currently only in the epixHrEmu.py DrpPython script).  I didn't quickly find a fix for this.  I believe the crash was a complaint about the metrics being created a second time with the same jobid label.
  • psana metrics are created with a jobid label to keep individual instances of them unique and identifiable.  The jobid is the process id of the process in which the metric was instantiated.  For DAQ processes, labels such as instrument, partition number and detector alias are more convenient for identifying metric instances.
  • psana uses a push gateway to publish metric values to Prometheus, whereas DAQ processes use an HTTP exporter to expose metrics for scraping by the Prometheus servers.  Apparently push gateways require metrics to be actively deleted or they hang around "forever"; perhaps they are cleared when the push gateway host is rebooted, or limited by a maximum metric instance count, etc.  Code I added to psana to do this active deletion didn't work, and in any case it wouldn't clean up after crashes.  Would it be possible to move psana to using an HTTP exporter, similarly to the DAQ?  Thorsten mentioned that the current approach isn't sustainable, but it wasn't clear to me which portion he was referring to.
  • Registering metrics with a push gateway requires labels to be supplied through a grouping_key argument.  I didn't get an understanding of what is special about these kinds of labels.  Do we have the right mix of grouping-key labels vs. regular labels?  Does it need to be different for the DAQ environment?  (The sketch after this list shows where each kind of label enters.)
  • psana/psexp/__init__.py has circular imports.
  • The MPI support in psana is brought in when using psana in DrpPython.  This causes an 'orted' process to be started for every worker process.  Apparently it comes from an import of mpi4py and is difficult to prevent.
  • Naively, one might expect a DRP process instance to consist of 3 threads (PGP reader, collector and EbReceiver) plus however many worker threads.  It turns out, however, that there are several other threads that consume CPU time on a per-event basis.  It would be good to understand why these threads require CPU time.
  • Would hyperthreading be beneficial to the compressor?
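
To make the push-gateway and grouping_key concerns above more concrete, here is a small prometheus_client sketch contrasting the two publication styles (the gateway address, job name and label values are placeholders):

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway, start_http_server

registry = CollectorRegistry()
compress_time = Gauge('psana_compress_time', 'Event compression time (s)',
                      ['detname'], registry=registry)    # 'detname' is a regular label
compress_time.labels(detname='epixhr_emu').set(0.0035)

# Push-gateway style (what psana does today): the grouping_key labels identify
# the pushed group, and the metrics persist until that group is explicitly deleted.
push_to_gateway('pushgateway-host:9091', job='psana', registry=registry,
                grouping_key={'jobid': '12345'})

# HTTP-exporter style (what the DAQ does): the metrics are scraped by Prometheus
# and simply disappear when the process exits.
start_http_server(9200, registry=registry)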