...
Code Block |
---|
(ps-3.1.11) daq-tst-dev03:cnf$ procmgr start hsd.cnf
/reg/neh/home/cpo/git/lcls2/install/bin/procmgr: using config file 'hsd.cnf' to start
*** ERR: no prompt at drp-tst-acc06 port 29400
*** ERR: no prompt at drp-tst-dev016 port 29400
*** ERR: no prompt at drp-tst-dev020 port 29400
*** ERR: no prompt at drp-tst-dev021 port 29400
*** ERR: no prompt at drp-tst-dev022 port 29400
(ps-3.1.11) daq-tst-dev03:cnf$ telnet drp-tst-acc06 29400
Trying 172.21.148.224...
Connected to drp-tst-acc06.
Escape character is '^]'.
@@@ Welcome to procServ (procServ Process Server 2.6.0-SLAC)
@@@ Use ^X to kill the child, auto restart is ON, use ^T to toggle auto restart
@@@ procServ server PID: 174485
@@@ Server startup directory: /
@@@ Child startup directory: /tmp
@@@ Child started as: /bin/tcsh
@@@ Child "/bin/tcsh" PID: 174541
@@@ procServ server started at: Thu Jun 4 17:23:15 2020
@@@ Child "/bin/tcsh" started at: Thu Jun 4 17:23:15 2020
@@@ 0 user(s) and 0 logger(s) connected (plus you)
>
telnet> q
Connection closed.
(ps-3.1.11) daq-tst-dev03:cnf$ date
Thu Jun 4 17:41:18 PDT 2020
(ps-3.1.11) daq-tst-dev03:cnf$ |
SOLVED: hsdpva crashes on startup. Is the p4p get timeout too short? The default is 5 s; try increasing it to 10 s. Real solution: make hsdpvs (the hsd PV server) a long-lived procmgr process.
Code Block |
---|
(ps-3.1.11) daq-tst-dev03:cnf$ cat /reg/neh/home/cpo/2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log
# ID: hsdpva
# PLATFORM:4
# HOST: daq-tst-dev03.pcdsn
# CMDLINE: hsdpva DAQ:LAB2:HSD:DEV07_1A:A DAQ:LAB2:HSD:DEV07_1A:B DAQ:LAB2:HSD:DEV07_88:A DAQ:LAB2:HSD:DEV07_88:B DAQ:LAB2:HSD:DEV07_89:A DAQ:LAB2:HSD:DEV07_89:B DAQ:LAB2:HSD:DEV07_B1:A DAQ:LAB2:HSD:DEV07_B1:B DAQ:LAB2:HSD:DEV07_B2:A DAQ:LAB2:HSD:DEV07_B2:B
# TESTRELDIR:/reg/neh/home/cpo/git/lcls2/install
# CONDA_REL:ps-3.1.11
# GIT_DESCRIBE:3.0.4-39-ga30d0131
@@@ Restarting child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log"
@@@ (as /bin/env)
@@@ The PID of new child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log" is: 23772
@@@ @@@ @@@ @@@ @@@
5.12.3
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 270, in get
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/reg/neh/home/cpo/git/lcls2/install/bin/hsdpva", line 11, in <module>
    load_entry_point('psdaq', 'console_scripts', 'hsdpva')()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 349, in main
    ui.setupUi(MainWindow,args.base)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 313, in setupUi
    maintab.addTab( HsdJesd (title),
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 245, in __init__
    pvTtl.get()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 78, in get
    self.__value__ = pvactx.get(self.pvname)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 274, in get
    raise TimeoutError()
TimeoutError |
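The traceback above is a single blocking get that gave up after p4p's default 5 s. A minimal sketch of the stopgap mentioned earlier (a longer timeout, plus a few retries while the PV server is still coming up) might look like this. `get_with_retry` and its parameters are hypothetical and are not part of psdaq or p4p; the getter is passed in as a callable so the sketch does not need a live PV server.

```python
import time


def get_with_retry(getter, pvname, timeout=10.0, retries=3, pause=1.0):
    """Call getter(pvname, timeout=timeout), retrying on TimeoutError.

    getter is any callable of that shape. With p4p installed it could be
    the bound method p4p.client.thread.Context('pva').get (an assumption
    about how this would be wired in, not how hsdpva does it).
    """
    last_error = None
    for attempt in range(retries):
        try:
            return getter(pvname, timeout=timeout)
        except TimeoutError as err:
            last_error = err
            # Give the server a moment before the next attempt.
            if attempt < retries - 1:
                time.sleep(pause)
    # Every attempt timed out: surface the last error to the caller.
    raise last_error
```

This only papers over a slow-starting server; keeping hsdpvs alive as a long-lived process, as noted above, removes the race instead of waiting it out.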
SOLVED: 100% dead time seen from all 5 hsd's. Ric suggests looking at the Read Buffers section in /proc/datadev_0 (and /proc/datadev_1). This suggests that all buffers are in the KCU (not stuck in software). Disable times out. Answer (after working with Matt): msgdelay was set to 91 in hsdpva, which is too short, so the front-end buffers were overwritten. It needs to be 98 or 99.
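To make the Read Buffers check repeatable, the section can be pulled out of the proc dump with a few lines of Python. This is a sketch only: it assumes the section starts with a dashed header line containing the words "Read Buffers" and is followed by `name : value` lines until the next dashed header. The exact field names depend on the driver version, so the code does not hard-code any. `parse_read_buffers` is a hypothetical helper name.

```python
def parse_read_buffers(text):
    """Return {field name: int} for the 'Read Buffers' section of a proc dump.

    Assumed layout: a dashed header line naming each section, followed by
    'name : value' lines. Non-integer values are skipped.
    """
    counts = {}
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        is_header = (
            stripped.startswith("-")
            and stripped.endswith("-")
            and stripped.strip("- ") != ""
        )
        if is_header:
            # Each dashed header starts a new section; keep only Read Buffers.
            in_section = "Read Buffers" in stripped
            continue
        if in_section and ":" in stripped:
            name, _, value = stripped.partition(":")
            try:
                counts[name.strip()] = int(value.strip())
            except ValueError:
                pass  # not a plain integer field
    return counts
```

On a DRP node this could be fed with `open("/proc/datadev_0").read()`; comparing the counts held by hardware against those held by user space is then a dictionary lookup rather than reading the dump by eye.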
...
groupca crashes with a p4p timeout. Traced it back to pva get() working but put() failing:
Code Block |
---|
(ps-3.1.11) daq-tst-dev03:cnf$ pvget DAQ:LAB2:XPM:2:PART:4:Master
DAQ:LAB2:XPM:2:PART:4:Master 2020-06-05 17:09:25.695 1
(ps-3.1.11) daq-tst-dev03:cnf$ pvput DAQ:LAB2:XPM:2:PART:4:Master 1
Old : 2020-06-05 17:09:25.695 1
Put timeout
(ps-3.1.11) |
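The same get-works/put-fails check can be scripted so it is easy to repeat across PVs. The sketch below assumes a context object with `get(name, timeout=...)` and `put(name, value, timeout=...)` methods, which is the shape of `p4p.client.thread.Context`; `probe_pv` itself is a hypothetical helper and not part of psdaq.

```python
def probe_pv(ctx, pvname, timeout=5.0):
    """Report whether get and put each complete within timeout.

    The put writes back the value that was just read, mirroring the
    pvput of the unchanged value in the session above. A put may still
    trigger server-side handlers even when the value does not change.
    """
    result = {"get_ok": False, "put_ok": False, "value": None}
    try:
        result["value"] = ctx.get(pvname, timeout=timeout)
        result["get_ok"] = True
    except TimeoutError:
        return result  # nothing to write back
    try:
        ctx.put(pvname, result["value"], timeout=timeout)
        result["put_ok"] = True
    except TimeoutError:
        pass  # get succeeded but put timed out
    return result
```

A result with `get_ok` true and `put_ok` false matches the pvget/pvput session above and points at the process serving the PV rather than at the client.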
In turn, traced this down to a crash in pyxpm-2:
Code Block |
---|
Group: 4 Master: 1 RateSel: 0 DestSel: 8000 Ena: 1
Group: 3 Master: 1 RateSel: 3 DestSel: 8000 Ena: 0
Group: 7 Master: 1 RateSel: 4 DestSel: 8000 Ena: 0
Traceback (most recent call last):
  File "/reg/neh/home/cpo/git/lcls2/install/bin/groupca", line 11, in <module>
    load_entry_point('psdaq', 'console_scripts', 'groupca')()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/groupca.py", line 313, in main
    ui.setupUi(MainWindow,args.pvbase,args.xpmroot,args.groups,args.prod)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/groupca.py", line 263, in setupUi
    addGroup(tw, pvbase, g, xpm)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/groupca.py", line 234, in addGroup
    pvXpm.put(1)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 86, in put
    ret = pvactx.put(self.pvname, newval, wait=wait)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 369, in put
    raise TimeoutError()
TimeoutError

Traceback (most recent call last):
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 457, in update
    self._links[i].update()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 136, in update
    updatePv(self._pv_remoteLinkId,self._app.remId.get())
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 114, in updatePv
    pv.post(value)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/server/raw.py", line 160, in post
    _SharedPV.post(self, self._wrap(value))
RuntimeError: bad_weak_ptr
Caught exception... retrying. |
Running 5 hsd nodes at 1 MHz, saw this on dev010 and the node became unresponsive (disable timed out?):
...