Ric Issues
- PCIe issues in which various cards don't come up correctly; I think we've seen this with Ethernet, InfiniBand and KCU cards now
- ATCA cards not powering up correctly; maybe a sequencing issue?
- Some nodes occasionally take an extremely long time to boot
HSD
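procmgr telnet timeout too early? procmgr reports "no prompt" on five nodes, but a manual telnet to drp-tst-acc06 port 29400 reaches procServ and gets a prompt: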
    (ps-3.1.11) daq-tst-dev03:cnf$ procmgr start hsd.cnf
    /reg/neh/home/cpo/git/lcls2/install/bin/procmgr: using config file 'hsd.cnf' to start
    *** ERR: no prompt at drp-tst-acc06 port 29400
    *** ERR: no prompt at drp-tst-dev016 port 29400
    *** ERR: no prompt at drp-tst-dev020 port 29400
    *** ERR: no prompt at drp-tst-dev021 port 29400
    *** ERR: no prompt at drp-tst-dev022 port 29400
    (ps-3.1.11) daq-tst-dev03:cnf$ telnet drp-tst-acc06 29400
    Trying 172.21.148.224...
    Connected to drp-tst-acc06.
    Escape character is '^]'.
    @@@ Welcome to procServ (procServ Process Server 2.6.0-SLAC)
    @@@ Use ^X to kill the child, auto restart is ON, use ^T to toggle auto restart
    @@@ procServ server PID: 174485
    @@@ Server startup directory: /
    @@@ Child startup directory: /tmp
    @@@ Child started as: /bin/tcsh
    @@@ Child "/bin/tcsh" PID: 174541
    @@@ procServ server started at: Thu Jun 4 17:23:15 2020
    @@@ Child "/bin/tcsh" started at: Thu Jun 4 17:23:15 2020
    @@@ 0 user(s) and 0 logger(s) connected (plus you)
    >
    telnet> q
    Connection closed.
    (ps-3.1.11) daq-tst-dev03:cnf$ date
    Thu Jun 4 17:41:18 PDT 2020
    (ps-3.1.11) daq-tst-dev03:cnf$
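One way to test the "timeout too early" theory is to measure how long procServ actually takes to show a prompt. The sketch below is a hypothetical diagnostic, not part of procmgr: it opens a raw TCP connection (no telnet option negotiation), assumes the prompt procmgr waits for is `> `, and returns the elapsed seconds, or None if the prompt never arrives within `timeout`.

```python
import socket
import time


def time_to_prompt(host, port, prompt=b"> ", timeout=30.0):
    """Return seconds until `prompt` appears on host:port, or None.

    Hypothetical helper. Assumes `prompt` is what procmgr waits for;
    uses a plain socket, so any telnet IAC bytes just land in the buffer.
    """
    start = time.monotonic()
    deadline = start + timeout
    buf = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None  # overall budget used up
            sock.settimeout(remaining)
            try:
                chunk = sock.recv(4096)
            except socket.timeout:
                return None  # nothing more arrived in time
            if not chunk:
                return None  # server closed the connection
            buf += chunk
            if prompt in buf:
                return time.monotonic() - start
```

For example `time_to_prompt("drp-tst-acc06", 29400)`; a result well above procmgr's own wait would support raising that timeout.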
hsdpva crashes on startup with a p4p TimeoutError. Is the p4p get timeout too short? The default is 5 s; try increasing it to 10 s.
    (ps-3.1.11) daq-tst-dev03:cnf$ cat /reg/neh/home/cpo/2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log
    # ID: hsdpva
    # PLATFORM:4
    # HOST: daq-tst-dev03.pcdsn
    # CMDLINE: hsdpva DAQ:LAB2:HSD:DEV07_1A:A DAQ:LAB2:HSD:DEV07_1A:B DAQ:LAB2:HSD:DEV07_88:A DAQ:LAB2:HSD:DEV07_88:B DAQ:LAB2:HSD:DEV07_89:A DAQ:LAB2:HSD:DEV07_89:B DAQ:LAB2:HSD:DEV07_B1:A DAQ:LAB2:HSD:DEV07_B1:B DAQ:LAB2:HSD:DEV07_B2:A DAQ:LAB2:HSD:DEV07_B2:B
    # TESTRELDIR:/reg/neh/home/cpo/git/lcls2/install
    # CONDA_REL:ps-3.1.11
    # GIT_DESCRIBE:3.0.4-39-ga30d0131
    @@@ Restarting child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log"
    @@@    (as /bin/env)
    @@@ The PID of new child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log" is: 23772
    @@@ @@@ @@@ @@@ @@@
    5.12.3
    Traceback (most recent call last):
      File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 270, in get
        value, i = done.get(timeout=timeout)
      File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/queue.py", line 178, in get
        raise Empty
    _queue.Empty

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/reg/neh/home/cpo/git/lcls2/install/bin/hsdpva", line 11, in <module>
        load_entry_point('psdaq', 'console_scripts', 'hsdpva')()
      File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 349, in main
        ui.setupUi(MainWindow,args.base)
      File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 313, in setupUi
        maintab.addTab( HsdJesd (title),
      File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 245, in __init__
        pvTtl.get()
      File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 78, in get
        self.__value__ = pvactx.get(self.pvname)
      File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 274, in get
        raise TimeoutError()
    TimeoutError
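A minimal sketch of the timeout change. In pvedit.py the call is `pvactx.get(self.pvname)`, which falls back to p4p's default 5 s timeout; the proposed fix is to pass `timeout=10.0` explicitly. The wrapper name `get_pv` and the optional retry are hypothetical; `ctx` is assumed to be a `p4p.client.thread.Context` (or anything with the same `get(name, timeout=...)` call), which raises `TimeoutError` on expiry as in the traceback above.

```python
DEFAULT_PV_TIMEOUT = 10.0  # seconds; assumption: doubled from p4p's default of 5.0


def get_pv(ctx, pvname, timeout=DEFAULT_PV_TIMEOUT, retries=0):
    """Fetch one PV through a p4p-style context, retrying on timeout.

    Hypothetical helper: `ctx` only needs a get(name, timeout=...) method.
    Makes at most retries + 1 attempts and re-raises the last TimeoutError.
    """
    for attempt in range(retries + 1):
        try:
            return ctx.get(pvname, timeout=timeout)
        except TimeoutError:
            if attempt == retries:
                raise  # out of attempts; let the caller see the timeout
```

Usage would look like `from p4p.client.thread import Context` followed by `get_pv(Context('pva'), name)`, where `name` is whatever PV pvedit was reading.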
We see 100% dead time from all 5 HSDs. Ric suggests looking at the Read Buffers section in /proc/datadev_0 (and /proc/datadev_1). The counters below suggest that all buffers are in the KCU (not stuck in software). Disable also times out.
    Buffers In User : 0          (number of buffers in software)
    Buffers In Hw : 4095         (number of buffers in the KCU)
    Buffers In Pre-Hw Q : 61334  (number of buffers the driver has available to hand to the KCU)
    Buffers In Rx Queue : 0      (buffers transferred from the KCU to the driver, ready to be received by dmaReadBulkIndex)
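A small helper to pull the Read Buffers counters out of /proc/datadev_0 (or _1) and apply the reading above: buffers sitting in hardware with none in user space or in the Rx queue. The section-header format (a name surrounded by dashes) and the `Label : value` layout are assumptions about the driver's /proc output, and the function names are hypothetical.

```python
import re

# Assumed layout: "---- Section Name ----" headers, then "Label : value" lines.
_HEADER = re.compile(r"^\s*-+\s*(.*?)\s*-+\s*$")
_FIELD = re.compile(r"^\s*([^:]+?)\s*:\s*(\d+)\b")


def read_buffer_counts(text):
    """Return the 'Buffers In ...' counters from the Read Buffers section."""
    counts = {}
    section = None
    for line in text.splitlines():
        header = _HEADER.match(line)
        if header:
            section = header.group(1)
            continue
        if section != "Read Buffers":
            continue  # ignore Write Buffers and other sections
        field = _FIELD.match(line)
        if field and field.group(1).startswith("Buffers In"):
            counts[field.group(1)] = int(field.group(2))
    return counts


def buffers_held_by_kcu(counts):
    """Heuristic from these notes: buffers in hardware, none in software."""
    return (counts.get("Buffers In Hw", 0) > 0
            and counts.get("Buffers In User", 0) == 0
            and counts.get("Buffers In Rx Queue", 0) == 0)
```

On a node this would be fed with `open("/proc/datadev_0").read()`; a True result matches the state recorded above.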