UNSOLVED: When doing repeated configure transitions I would get core files from a subset of the tmohsd processes (see ~cpo/problems/drp_segv for the traceback).  These files suggest that threads are being deleted in the ~PGPDetector destructor before being joined.  Ric thinks this core file may be a red herring, created when we receive an infiniband disconnect that triggers a call to PGPDetector::shutdown() from the wrong thread.  See for example /reg/neh/home/cpo/2020/06/30_17:39:15_drp-tst-dev020:tmohsd_2.log (or grep for "signal 11" in *tmohsd* in that directory).  The error Ric sees that worries him is this one: "tst-drp[109274]: <C> Receiver thread lost connection".  But we see that error message many times in that log file (every time we move up and down the state machine, I think).  To do: see if the thread doing the join is different from the thread running the PGPDetector destructor (which would create an intermittent race condition).  Printed out the thread id, and it looks the same for the call to join and the destructor.  Also saw a different traceback in PGPReader from the psmetric stuff (also listed in the ~cpo/problems/drp_segv file).
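The thread-id comparison from the to-do can be sketched as below.  This is only an illustration in Python, not the actual C++ PGPDetector code: the class ThreadOwner and its methods shutdown(), teardown() and same_thread() are hypothetical names, with teardown() standing in for the destructor.  It records which thread performs the join and which thread performs the teardown so the two can be compared.

```python
import threading


class ThreadOwner:
    """Hypothetical stand-in for an object that owns worker threads."""

    def __init__(self, n_threads=2):
        self._stop = threading.Event()
        # Workers simply block until shutdown() sets the stop event.
        self._threads = [
            threading.Thread(target=self._stop.wait, daemon=True)
            for _ in range(n_threads)
        ]
        for t in self._threads:
            t.start()
        self.join_thread_id = None
        self.teardown_thread_id = None

    def shutdown(self):
        """Stop and join the workers, recording the id of the calling thread."""
        self.join_thread_id = threading.get_ident()
        self._stop.set()
        for t in self._threads:
            t.join()
        self._threads = []

    def teardown(self):
        """Destructor stand-in: record the caller and count unjoined workers."""
        self.teardown_thread_id = threading.get_ident()
        return len([t for t in self._threads if t.is_alive()])

    def same_thread(self):
        """True if the join and the teardown ran on the same thread."""
        return self.join_thread_id == self.teardown_thread_id
```

A nonzero return from teardown() would correspond to the "deleted before being joined" case, and same_thread() returning False would correspond to shutdown being called from a different thread than the one doing the teardown.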

 

UNSOLVED: hsd configure times out the first time, but works the second time.  Matt says that he has seen the transition not received at all by the hsd.  Phase 1 completes in Digitizer.cc but phase 2 maybe doesn't start, which would be consistent with Matt's observation?  Log file is in /reg/neh/home/cpo/2020/06/12_17:41:55_drp-tst-dev010:tmohsd_0.log.  What I observed: both tmohsd_0/1 completed phase 1 but failed to receive phase 2.  How I think it's supposed to work:  tmohsd_0/1 each write some kcu registers (resetting tx/rx/user/qpll).  Each of tmohsd_0/1 sets config epics vars in its own pva server ("hsdpvs" process).  Both of these servers communicate with one "hsdioc" (running the hsd134PVs c++ executable on dev07).  hsdioc programs both channels of the hsd ("fmc" in the code) and sets READY (e.g. DAQ:LAB2:HSD:DEV07_1A:B:READY).  tmohsd_0/1 both watch for their own A:READY and B:READY for each channel.  Maybe the resets take a while to recover?  Maybe a "reentrancy" problem in hsdioc?  Could it be the clear readout?  But we have a 1s delay in control.py after clear readout (sent on both configure/beginrun).
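The "watch for A:READY and B:READY" step can be sketched as a poll with a timeout that reports which flags never came up, which might help tell a slow reset from a missing one.  This is a minimal sketch, not the real tmohsd code: wait_for_ready() and its read_flag argument are hypothetical, and read_flag stands in for whatever actually reads the epics variable.

```python
import time


def wait_for_ready(read_flag, names, timeout=5.0, poll=0.05):
    """Poll until every named flag is truthy or the timeout expires.

    read_flag is a hypothetical callable taking a flag name and returning
    its current value.  Returns the names that are still not ready; an
    empty list means all flags were seen ready.
    """
    deadline = time.monotonic() + timeout
    pending = list(names)
    while True:
        # Drop the flags that have become ready since the last poll.
        pending = [name for name in pending if not read_flag(name)]
        if not pending or time.monotonic() >= deadline:
            return pending
        time.sleep(poll)
```

For example, with a plain dict standing in for the variables, wait_for_ready(flags.get, ["A:READY", "B:READY"], timeout=1.0) returns the names still missing when the timeout expires, so a log message could say which channel did not report READY.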
