...
UNSOLVED (any rate): with both the fake cam and the hsd, Matt saw that if he ran at 1MHz, disabled triggers for a few seconds, and then re-enabled them, all the buffers were stuck in software and the system would hang. Reproducible in 1 out of 3 attempts. Learned that Pause/Resume does not show this problem: one must just disable triggers. When this happens, the monitoring shows that the DRP does not process all of the L1s. It's as if a batch gets stuck. After that, nothing gets processed (neither transitions nor more L1s).
UNSOLVED: when doing repeated configure transitions I would get core files from a subset of tmohsd processes (see ~cpo/problems/drp_segv for traceback). These files suggest that threads are being deleted in the ~PGPDetector destructor before being joined. Ric thinks this core file may be a red herring, created when we receive an infiniband disconnect that triggers a call to PGPDetector::shutdown() from the wrong thread. See for example /reg/neh/home/cpo/2020/06/30_17:39:15_drp-tst-dev020:tmohsd_2.log (or grep for "signal 11" in *tmohsd* in that directory). The error Ric sees that worries him is this one: "tst-drp[109274]: <C> Receiver thread lost connection". But we see that error message many times in that log file (every time we move up and down the state machine, I think). To do: see if the thread doing the join is different from the thread running the PGPDetector destructor (which would create an intermittent race condition).
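To check how often the "lost connection" message actually coincides with a crash, the grep above can be extended into a small per-file tally. This is only a sketch: the two search strings are taken from the note above, while the function name `count_markers` and the labels `segv` and `lost_connection` are made up here for illustration.

```python
import re
from collections import Counter

# Strings quoted in the note above; the labels are hypothetical.
PATTERNS = {
    "segv": re.compile(r"signal 11"),
    "lost_connection": re.compile(r"Receiver thread lost connection"),
}


def count_markers(lines):
    """Count how many lines match each pattern in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Passing an open log file (for example one of the *tmohsd* logs) to `count_markers` gives one tally per file. A file with many `lost_connection` hits and no `segv` would support the view that the disconnect message by itself is not the cause.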
UNSOLVED: hsd configure times out the first time, but works the second time. Matt says that he has seen the transition not received at all by the hsd. Phase 1 completes in Digitizer.cc but phase 2 maybe doesn't start, consistent with Matt's observation? Log file in /reg/neh/home/cpo/2020/06/12_17:41:55_drp-tst-dev010:tmohsd_0.log. What I observed: both tmohsd_0/1 completed phase 1 but failed to receive phase 2. How I think it's supposed to work: tmohsd_0/1 each write some kcu registers (resetting tx/rx/user/qpll). Both tmohsd_0/1 set config epics vars, each in its own pva server ("hsdpvs" process). Both of these servers communicate with one "hsdioc" (running the hsd134PVs c++ executable on dev07). hsdioc programs both channels of the hsd ("fmc" in the code) and sets READY (e.g. DAQ:LAB2:HSD:DEV07_1A:B:READY). tmohsd_0/1 both watch for their own A:READY and B:READY for each channel. Maybe the resets take a while to recover? Maybe a "reentrancy" problem in hsdioc? Could it be the clear readout? But we have a 1s delay in control.py after clear readout (sent on both configure/beginrun).
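To narrow down which READY flag fails to come up on the first configure, the wait could report the flags still pending at timeout instead of only failing. The following is a minimal sketch of such a polling loop, not the actual Digitizer.cc logic. `wait_for_ready` and the `read_pv` callable are hypothetical names, and `read_pv` stands in for whatever PV read call is really used.

```python
import time


def wait_for_ready(read_pv, pv_names, timeout_s, poll_s=0.1):
    """Poll READY flags until all are set or the timeout expires.

    read_pv is a hypothetical callable that returns a truthy value once
    the named PV reads as ready. Returns the list of PV names that are
    still not ready (an empty list means every flag came up in time).
    """
    deadline = time.monotonic() + timeout_s
    pending = list(pv_names)
    while True:
        # Keep only the flags that are still not set.
        pending = [name for name in pending if not read_pv(name)]
        if not pending or time.monotonic() >= deadline:
            return pending
        time.sleep(poll_s)
```

Logging the returned list on the first (failing) configure would show whether it is always the same channel or process that is late, which would help separate a slow reset recovery from a reentrancy problem in hsdioc.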
...
Murali proposes to increase the 3.05s timeout in psdaq/configdb/configdb.py or webconfigdb.py (not sure which).
UNSOLVED: Saw this error in /reg/neh/home/cpo/2020/06/15_14:39:26_daq-tst-dev07:hsdioc_89.log when starting up processes after a power-cycle (for new hsd firmware for the interrupt-holdoff):
...