...

Code Block
(ps-3.1.11) daq-tst-dev03:cnf$ procmgr start hsd.cnf
/reg/neh/home/cpo/git/lcls2/install/bin/procmgr: using config file 'hsd.cnf' to start
*** ERR: no prompt at drp-tst-acc06 port 29400
*** ERR: no prompt at drp-tst-dev016 port 29400
*** ERR: no prompt at drp-tst-dev020 port 29400
*** ERR: no prompt at drp-tst-dev021 port 29400
*** ERR: no prompt at drp-tst-dev022 port 29400
(ps-3.1.11) daq-tst-dev03:cnf$ telnet drp-tst-acc06 29400
Trying 172.21.148.224...
Connected to drp-tst-acc06.
Escape character is '^]'.
@@@ Welcome to procServ (procServ Process Server 2.6.0-SLAC)
@@@ Use ^X to kill the child, auto restart is ON, use ^T to toggle auto restart
@@@ procServ server PID: 174485
@@@ Server startup directory: /
@@@ Child startup directory: /tmp
@@@ Child started as: /bin/tcsh
@@@ Child "/bin/tcsh" PID: 174541
@@@ procServ server started at: Thu Jun  4 17:23:15 2020
@@@ Child "/bin/tcsh" started at: Thu Jun  4 17:23:15 2020
@@@ 0 user(s) and 0 logger(s) connected (plus you)
> 
telnet> q
Connection closed.
(ps-3.1.11) daq-tst-dev03:cnf$ date
Thu Jun  4 17:41:18 PDT 2020
(ps-3.1.11) daq-tst-dev03:cnf$ 
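
The manual telnet check above can be repeated across several nodes with a small script. This is only a sketch: it assumes the procServ control port is 29400 as in the log, and it only looks for the procServ welcome banner. How procmgr itself detects the prompt is not shown here. The function names are hypothetical.

```python
import socket


def procserv_banner(host, port=29400, timeout=5.0):
    """Return the text sent by the port on connect, or None if unreachable.

    Port 29400 is taken from the procmgr error messages; adjust as needed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            chunks = []
            try:
                while True:
                    data = sock.recv(4096)
                    if not data:
                        break  # peer closed the connection
                    chunks.append(data)
                    # Last banner line seen in the telnet session above.
                    if b"connected (plus you)" in b"".join(chunks):
                        break
            except socket.timeout:
                pass  # keep whatever arrived before the timeout
            return b"".join(chunks).decode("ascii", errors="replace")
    except OSError:
        # Connection refused, host unreachable, connect timeout, ...
        return None


def looks_like_procserv(host, port=29400, timeout=5.0):
    """True if something answering with a procServ banner is listening."""
    banner = procserv_banner(host, port, timeout)
    return banner is not None and "procServ" in banner
```

For example, `looks_like_procserv("drp-tst-acc06")` should return True for the node checked by hand above, if the banner is unchanged.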

SOLVED: hsdpva crashes on startup.  Is the p4p get() timeout too short?  The default is 5 seconds; try increasing it to 10.  Real solution: make hsdpvs (the hsd PV server) a long-lived procmgr process.

Code Block
(ps-3.1.11) daq-tst-dev03:cnf$ cat /reg/neh/home/cpo/2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log
# ID:      hsdpva
# PLATFORM:4
# HOST:    daq-tst-dev03.pcdsn
# CMDLINE: hsdpva DAQ:LAB2:HSD:DEV07_1A:A DAQ:LAB2:HSD:DEV07_1A:B DAQ:LAB2:HSD:DEV07_88:A DAQ:LAB2:HSD:DEV07_88:B DAQ:LAB2:HSD:DEV07_89:A DAQ:LAB2:HSD:DEV07_89:B DAQ:LAB2:HSD:DEV07_B1:A DAQ:LAB2:HSD:DEV07_B1:B DAQ:LAB2:HSD:DEV07_B2:A DAQ:LAB2:HSD:DEV07_B2:B
# TESTRELDIR:/reg/neh/home/cpo/git/lcls2/install
# CONDA_REL:ps-3.1.11
# GIT_DESCRIBE:3.0.4-39-ga30d0131
@@@ Restarting child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log"
@@@    (as /bin/env)
@@@ The PID of new child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log" is: 23772
@@@ @@@ @@@ @@@ @@@
5.12.3
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 270, in get
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/reg/neh/home/cpo/git/lcls2/install/bin/hsdpva", line 11, in <module>
    load_entry_point('psdaq', 'console_scripts', 'hsdpva')()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 349, in main
    ui.setupUi(MainWindow,args.base)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 313, in setupUi
    maintab.addTab( HsdJesd         (title), 
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 245, in __init__
    pvTtl.get()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 78, in get
    self.__value__ = pvactx.get(self.pvname)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 274, in get
    raise TimeoutError()
TimeoutError
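
A possible stopgap for the get() timeout is a longer timeout plus a few retries. This is a sketch only, not what hsdpva or pvedit.py actually does: `get_with_retry` is a hypothetical helper, and it assumes the `get` callable accepts a `timeout` keyword and raises the built-in TimeoutError, as p4p's thread client does in the traceback above.

```python
import time


def get_with_retry(get, pvname, timeout=10.0, retries=3, delay=1.0):
    """Call get(pvname, timeout=timeout), retrying on TimeoutError.

    `get` is any callable with that shape, e.g. the get method of a
    p4p.client.thread.Context. Re-raises the last TimeoutError if every
    attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_exc = None
    for attempt in range(retries):
        try:
            return get(pvname, timeout=timeout)
        except TimeoutError as exc:
            last_exc = exc
            if attempt < retries - 1:
                time.sleep(delay)  # give the PV server time to come up
    raise last_exc


# Possible use with p4p (not executed here):
#   from p4p.client.thread import Context
#   ctx = Context('pva')
#   value = get_with_retry(ctx.get, 'DAQ:LAB2:XPM:2:PART:4:Master')
```

This only hides a slow server start; keeping hsdpvs running as a long-lived process removes the race.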

SOLVED: See 100% dead time from all 5 hsd's.  Ric suggests looking at the Read Buffers section in /proc/datadev_0 (and 1).  This suggests that all buffers are in the kcu (not stuck in software).  Disable times out.  Answer (after working with Matt): msgdelay was set to 91 in hsdpva, which is too short, so the front-end buffers were overwritten.  It needs to be 98 or 99.

...

groupca crashes with a p4p timeout.  Traced it back to pva get() working but put() failing:

 

Code Block
(ps-3.1.11) daq-tst-dev03:cnf$ pvget DAQ:LAB2:XPM:2:PART:4:Master
DAQ:LAB2:XPM:2:PART:4:Master 2020-06-05 17:09:25.695  1 
(ps-3.1.11) daq-tst-dev03:cnf$ pvput DAQ:LAB2:XPM:2:PART:4:Master 1
Old : 2020-06-05 17:09:25.695  1 
Put timeout
(ps-3.1.11)
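
The same get-works/put-fails check could be scripted. A minimal sketch, assuming `get` and `put` callables that take a `timeout` keyword and raise the built-in TimeoutError (as the p4p thread client's Context.get and Context.put do); `probe_pv` is a hypothetical name. It writes back the value it just read, mirroring the pvput above, so it should only be pointed at PVs where that is harmless.

```python
def probe_pv(get, put, pvname, timeout=5.0):
    """Report whether get and put on pvname complete or time out."""
    result = {"get": "not tried", "put": "not tried"}
    try:
        value = get(pvname, timeout=timeout)
        result["get"] = "ok"
    except TimeoutError:
        result["get"] = "timeout"
        return result  # nothing to write back
    try:
        # Write back the value just read, like `pvput ... 1` above.
        put(pvname, value, timeout=timeout)
        result["put"] = "ok"
    except TimeoutError:
        result["put"] = "timeout"
    return result


# Possible use with p4p (not executed here):
#   from p4p.client.thread import Context
#   ctx = Context('pva')
#   print(probe_pv(ctx.get, ctx.put, 'DAQ:LAB2:XPM:2:PART:4:Master'))
```

A result of get "ok" and put "timeout" points at the server side of the PV rather than at the client.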

This in turn traced down to a crash in pyxpm-2:

Code Block
Group: 4  Master: 1  RateSel: 0  DestSel: 8000  Ena: 1
Group: 3  Master: 1  RateSel: 3  DestSel: 8000  Ena: 0
Group: 7  Master: 1  RateSel: 4  DestSel: 8000  Ena: 0
Traceback (most recent call last):
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 457, in update
    self._links[i].update()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 136, in update
    updatePv(self._pv_remoteLinkId,self._app.remId.get())
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 114, in updatePv
    pv.post(value)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/server/raw.py", line 160, in post
    _SharedPV.post(self, self._wrap(value))
RuntimeError: bad_weak_ptr
Caught exception... retrying.

The groupca traceback, ending in the put() timeout:

Code Block
Traceback (most recent call last):
  File "/reg/neh/home/cpo/git/lcls2/install/bin/groupca", line 11, in <module>
    load_entry_point('psdaq', 'console_scripts', 'groupca')()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/groupca.py", line 313, in main
    ui.setupUi(MainWindow,args.pvbase,args.xpmroot,args.groups,args.prod)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/groupca.py", line 263, in setupUi
    addGroup(tw, pvbase, g, xpm)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/groupca.py", line 234, in addGroup
    pvXpm.put(1)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 86, in put
    ret =  pvactx.put(self.pvname, newval, wait=wait)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 369, in put
    raise TimeoutError()
TimeoutError


Running 5 hsd nodes at 1 MHz, saw this on dev010, and the node became unresponsive (disable timed out?):

...