...

UNSOLVED (any rate): with both the fake cam and the hsd, Matt saw that if he ran at 1 MHz, disabled triggers for a few seconds, and then re-enabled them, all the buffers were stuck in software and the system would hang.  Reproducible roughly 1 out of 3 attempts.  Learned that Pause/Resume does not show this problem - it has to be disabling triggers specifically.  When this happens, the monitoring shows that the DRP does not process all of the L1s.  It's like a batch gets stuck.  After that, nothing gets processed (neither transitions nor more L1s).
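One way to picture the "batch gets stuck" hypothesis: if the DRP accumulates L1s into a batch and only forwards the batch when it fills, with no flush on a trigger-disable or timeout, then stopping triggers mid-batch leaves a partial batch that never drains, and everything behind it (more L1s, transitions) stalls.  A minimal sketch of that pattern, purely illustrative and not the actual DRP batching code:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical batcher that only flushes when a batch is full.  If the
// trigger stream stops mid-batch, the partial batch never leaves, so
// downstream sees fewer L1s than were sent and then hangs.
struct Batcher {
    static constexpr size_t kBatchSize = 8;
    std::vector<int> batch;

    void l1Accept(int pulseId) {
        batch.push_back(pulseId);
        if (batch.size() == kBatchSize) flush();   // only exit path
        // Missing: a flush driven by trigger-disable or a timeout.
    }

    void flush() {
        std::printf("forwarding batch of %zu L1s\n", batch.size());
        batch.clear();
    }
};

int main() {
    Batcher b;
    for (int i = 0; i < 20; ++i) b.l1Accept(i);  // triggers disabled here
    // The last 20 % 8 = 4 L1s are stuck in the Batcher.
    std::printf("stuck L1s: %zu\n", b.batch.size());
}
```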

 

SOLVED: when doing repeated configure transitions I would get core files from a subset of tmohsd processes (see ~cpo/problems/drp_segv for the traceback).  These files suggest that threads are being deleted in the ~PGPDetector destructor before being joined.  Ric thinks this core file may be a red herring, created when we receive an InfiniBand disconnect that triggers a call to PGPDetector::shutdown() from the wrong thread.  See for example /reg/neh/home/cpo/2020/06/30_17:39:15_drp-tst-dev020:tmohsd_2.log (or grep for "signal 11" in *tmohsd* in that directory).  The error Ric sees that worries him is this one: "tst-drp[109274]: <C> Receiver thread lost connection".  But we see that error message many times in that log file (every time we move up and down the state machine, I think).  To do: see if the thread doing the join is different from the thread running the PGPDetector destructor (creating an intermittent race condition).  Printed out the thread id and it looks the same for the call to join and the destructor.  Saw a different traceback in PGPReader from the psmetric code (also listed in the ~cpo/problems/drp_segv file).  Update: Ric thinks he may have fixed it in a recent push (he said it was a problem with psmetric std::vectors doing a "move" when vectors of monitored variables were increased in size, creating zero pointers).  Since then I haven't been able to duplicate the error running up/down the state machine for about an hour, so calling it fixed.
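The psmetric failure Ric describes is a classic std::vector pitfall: something holds a raw pointer (or reference) to an element of a vector of monitored variables, a later registration grows the vector past its capacity, the vector reallocates and moves its elements, and the old pointer now points at freed memory.  A hedged sketch of the pattern, with made-up type names rather than the real psmetric code:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a monitored counter exported by psmetric;
// not the real types.
struct MetricVar {
    std::string name;
    uint64_t    value = 0;
};

int main() {
    std::vector<MetricVar> vars;
    vars.push_back({"nevents"});

    // Another component caches a raw pointer to the element so it can
    // bump the counter cheaply on every event.
    MetricVar* nevents = &vars[0];

    // Later, more monitored variables are registered.  Once a push_back
    // exceeds capacity, the vector reallocates and moves its elements:
    // 'nevents' now dangles, and writing through it is undefined behavior,
    // which can surface as a segfault in an unrelated thread (e.g. the
    // PGPReader traceback mentioned above).
    for (int i = 0; i < 100; ++i)
        vars.push_back({"var" + std::to_string(i)});

    nevents->value++;   // use-after-free after reallocation

    // Typical fixes: reserve() enough capacity up front, hand out indices
    // or stable handles instead of raw pointers, or use a container with
    // stable element addresses such as std::deque.
}
```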

 

UNSOLVED: hsd configure times out the first time, but works the second time.  Matt says he has seen the transition not received at all by the hsd.  Phase 1 completes in Digitizer.cc but phase 2 maybe doesn't start, consistent with Matt's observation?  Log file in /reg/neh/home/cpo/2020/06/12_17:41:55_drp-tst-dev010:tmohsd_0.log.  What I observed: both tmohsd_0/1 completed phase 1 but failed to receive phase 2.  How I think it's supposed to work: tmohsd_0/1 each write some kcu registers (resetting tx/rx/user/qpll).  Both tmohsd_0/1 set config epics vars in their own pva server ("hsdpvs" process).  Both of these servers communicate with one "hsdioc" (running the hsd134PVs c++ executable on dev07).  hsdioc programs both channels of the hsd ("fmc" in the code) and sets READY (e.g. DAQ:LAB2:HSD:DEV07_1A:B:READY).  tmohsd_0/1 each watch their own A:READY and B:READY for each channel.  Maybe the resets take a while to recover?  Maybe a "reentrancy" problem in hsdioc?  Could it be the clear readout?  But we have a 1s delay in control.py after clear readout (sent on both configure/beginrun).
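A rough sketch of the drp-side half of the handshake described above, to make the timeout path concrete.  The helper names and the polling loop are illustrative assumptions, not the real pva client code in Digitizer.cc / hsdpvs / hsd134PVs:

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>

// Hypothetical PV-read helper standing in for the real pva client call,
// stubbed so the sketch is self-contained.
static bool pvGetBool(const std::string& pv) { (void)pv; return false; }

// Configure sequence described above (names illustrative):
//   1. tmohsd writes kcu registers to reset tx/rx/user/qpll
//   2. tmohsd sets config PVs in its own hsdpvs server
//   3. hsdioc programs both fmc channels and raises A:READY / B:READY
//   4. tmohsd waits for its A:READY and B:READY with a timeout
bool waitForReady(const std::string& base, std::chrono::seconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (pvGetBool(base + ":A:READY") && pvGetBool(base + ":B:READY"))
            return true;
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    // If the qpll/user resets are still settling, or hsdioc is busy serving
    // the other tmohsd's request, the first configure lands here and times
    // out, while a second attempt finds the hsd already programmed.
    std::printf("timed out waiting for %s READY\n", base.c_str());
    return false;
}

int main() {
    // e.g. DAQ:LAB2:HSD:DEV07_1A, as in the PV name quoted above
    waitForReady("DAQ:LAB2:HSD:DEV07_1A", std::chrono::seconds(5));
}
```

If the first-attempt timeout is caused by slow reset recovery or by hsdioc serializing the two requests, lengthening this wait (or having hsdioc acknowledge receipt before programming) would be the places to look.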

...