...

SOLVED: When doing repeated configure transitions I would get core files from a subset of tmohsd processes (see ~cpo/problems/drp_segv for the traceback).  These files suggest that threads are being deleted in the ~PGPDetector destructor before being joined.  Ric thinks this core file may be a red herring, created when we receive an InfiniBand disconnect that triggers a call to PGPDetector::shutdown() from the wrong thread.  See for example /reg/neh/home/cpo/2020/06/30_17:39:15_drp-tst-dev020:tmohsd_2.log (or grep for "signal 11" in *tmohsd* in that directory).  The error that worries Ric is this one: "tst-drp[109274]: <C> Receiver thread lost connection".  But we see that error message many times in that log file (every time we move up and down the state machine, I think).  To do: see if the thread doing the join is different from the PGPDetector destructor thread (which would create an intermittent race condition).  Printed out the thread id and it looks the same for the call to join and the destructor.  Saw a different traceback in PGPReader from the psmetric code (also listed in the ~cpo/problems/drp_segv file).  Update: Ric thinks he may have fixed it in a recent push (he said it was a problem with psmetric std::vectors doing a "move" when the vectors of monitored variables grew in size, creating zero pointers).  Since then I haven't been able to reproduce the error while running up/down the state machine for about an hour, so I'm calling it fixed.


UNSOLVED: Saw this timeout in two hsds after several iterations up/down the state machine, but a pvget a little later on the command line returns the variable correctly.  See /reg/neh/home/cpo/2020/07/22_19:56:56_drp-tst-dev020:tmohsd_5.log.

 

Traceback (most recent call last):
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/configdb/hsd_config.py", line 15, in hsd_config
    partitionDelay = ctxt.get(epics_prefix+':MONTIMING').msgdelayset
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 274, in get
    raise TimeoutError()
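
Possible workaround sketch (not what hsd_config.py does today, and not a root-cause fix): since a pvget shortly afterwards succeeds, the timeout may be transient, so one option is to retry the read a few times before giving up. The helper name get_with_retry and its parameters are hypothetical. It assumes the getter raises the builtin TimeoutError, as the traceback above suggests for Python 3.

```python
import time


def get_with_retry(getter, pv_name, attempts=3, delay_s=1.0):
    """Call getter(pv_name), retrying when it raises TimeoutError.

    getter is any callable taking a PV name, for example ctxt.get
    (hypothetical usage; the helper itself does not depend on p4p).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return getter(pv_name)
        except TimeoutError as err:
            last_error = err
            # Log each failure so intermittent timeouts stay visible in the log file.
            print(f"timeout reading {pv_name} (attempt {attempt}/{attempts})")
            if attempt < attempts:
                time.sleep(delay_s)

    # Every attempt timed out: re-raise the last error so the caller still sees it.
    raise last_error


# Hypothetical use in place of the direct call shown in the traceback:
#   partitionDelay = get_with_retry(ctxt.get, epics_prefix + ':MONTIMING').msgdelayset
```

Retrying only hides the symptom; logging each failed attempt keeps a record of how often the PV read times out, which may still help in tracking down the underlying cause.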

 

...

 

UNSOLVED: Core dump from hsd134PVs during an EPICS channel "put".  See ~cpo/problems/hsd134PVs_segv for the traceback, and the log file /reg/neh/home/cpo/2020/07/15_13:55:31_daq-tst-dev07:hsdioc_89.log.

...