Ric Issues

PCIe issues in which various cards don't come up right, I think we’ve seen this with Enet, IB and KCUs now
ATCA cards not powering up correctly; maybe a sequencing issue?
The incredibly long boot times of some nodes sometimes

HSD

UNSOLVED: early procmgr telnet timeout?

(ps-3.1.11) daq-tst-dev03:cnf$ procmgr start hsd.cnf
/reg/neh/home/cpo/git/lcls2/install/bin/procmgr: using config file 'hsd.cnf' to start
*** ERR: no prompt at drp-tst-acc06 port 29400
*** ERR: no prompt at drp-tst-dev016 port 29400
*** ERR: no prompt at drp-tst-dev020 port 29400
*** ERR: no prompt at drp-tst-dev021 port 29400
*** ERR: no prompt at drp-tst-dev022 port 29400
(ps-3.1.11) daq-tst-dev03:cnf$ telnet drp-tst-acc06 29400
Trying 172.21.148.224...
Connected to drp-tst-acc06.
Escape character is '^]'.
@@@ Welcome to procServ (procServ Process Server 2.6.0-SLAC)
@@@ Use ^X to kill the child, auto restart is ON, use ^T to toggle auto restart
@@@ procServ server PID: 174485
@@@ Server startup directory: /
@@@ Child startup directory: /tmp
@@@ Child started as: /bin/tcsh
@@@ Child "/bin/tcsh" PID: 174541
@@@ procServ server started at: Thu Jun  4 17:23:15 2020
@@@ Child "/bin/tcsh" started at: Thu Jun  4 17:23:15 2020
@@@ 0 user(s) and 0 logger(s) connected (plus you)
> 
telnet> q
Connection closed.
(ps-3.1.11) daq-tst-dev03:cnf$ date
Thu Jun  4 17:41:18 PDT 2020
(ps-3.1.11) daq-tst-dev03:cnf$

SOLVED: hsdpva crash on startup. p4p get timeout too short? default is 5, try increase to 10. Real solution: make hsdpvs (hsd pv server) long-lived procmgr process.

(ps-3.1.11) daq-tst-dev03:cnf$ cat /reg/neh/home/cpo/2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log
# ID:      hsdpva
# PLATFORM:4
# HOST:    daq-tst-dev03.pcdsn
# CMDLINE: hsdpva DAQ:LAB2:HSD:DEV07_1A:A DAQ:LAB2:HSD:DEV07_1A:B DAQ:LAB2:HSD:DEV07_88:A DAQ:LAB2:HSD:DEV07_88:B DAQ:LAB2:HSD:DEV07_89:A DAQ:LAB2:HSD:DEV07_89:B DAQ:LAB2:HSD:DEV07_B1:A DAQ:LAB2:HSD:DEV07_B1:B DAQ:LAB2:HSD:DEV07_B2:A DAQ:LAB2:HSD:DEV07_B2:B
# TESTRELDIR:/reg/neh/home/cpo/git/lcls2/install
# CONDA_REL:ps-3.1.11
# GIT_DESCRIBE:3.0.4-39-ga30d0131
@@@ Restarting child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log"
@@@    (as /bin/env)
@@@ The PID of new child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log" is: 23772
@@@ @@@ @@@ @@@ @@@
5.12.3
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 270, in get
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/reg/neh/home/cpo/git/lcls2/install/bin/hsdpva", line 11, in <module>
    load_entry_point('psdaq', 'console_scripts', 'hsdpva')()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 349, in main
    ui.setupUi(MainWindow,args.base)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 313, in setupUi
    maintab.addTab( HsdJesd         (title), 
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 245, in __init__
    pvTtl.get()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 78, in get
    self.__value__ = pvactx.get(self.pvname)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 274, in get
    raise TimeoutError()
TimeoutError

SOLVED: See 100% dead time from all 5 hsd's. Ric suggests looking at the Read Buffers section in /proc/datadev_0 (and 1). this suggests that all buffers are in the kcu: (not stuck in software). disable times out. answer (after working with Matt) msgdelay was set to 91 in hsdpva: too short so overwrote front-end buffers using "pvput DAQ:LAB2:XPM:2:PART:4:L0Delay 98". setting it to 98 displays it as 99 for reasons I don't understand.

  Buffers In User : 0          (number of buffers in software)
  Buffers In Hw : 4095         (number of buffers in the kcu)
  Buffers In Pre-Hw Q : 61334  (number of buffers driver has available to hand to the kcu)
  Buffers In Rx Queue : 0      (buffers transferred from kcu to driver, ready to be received by dmaReadBulkIndex)

UNSOLVED: groupca crashes with a p4p timeout. trace it back to pva get() working but put() failing:

(ps-3.1.11) daq-tst-dev03:cnf$ pvget DAQ:LAB2:XPM:2:PART:4:Master
DAQ:LAB2:XPM:2:PART:4:Master 2020-06-05 17:09:25.695  1 
(ps-3.1.11) daq-tst-dev03:cnf$ pvput DAQ:LAB2:XPM:2:PART:4:Master 1
Old : 2020-06-05 17:09:25.695  1 
Put timeout
(ps-3.1.11)

in turn trace it down to a crash in pyxpm-2:

Group: 4  Master: 1  RateSel: 0  DestSel: 8000  Ena: 1
Group: 3  Master: 1  RateSel: 3  DestSel: 8000  Ena: 0
Group: 7  Master: 1  RateSel: 4  DestSel: 8000  Ena: 0
Traceback (most recent call last):
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 457, in update
    self._links[i].update()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 136, in update
    updatePv(self._pv_remoteLinkId,self._app.remId.get())
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 114, in updatePv
    pv.post(value)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/server/raw.py", line 160, in post
    _SharedPV.post(self, self._wrap(value))
RuntimeError: bad_weak_ptr
Caught exception... retrying.

UNSOLVED: running 5 hsd nodes at 1mhz saw this on dev010 and node became unresponsive (disable timed out?). removed dev010 but saw disable timeout on dev019:

Message from syslogd@drp-tst-dev010 at Jun  5 13:02:43 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [swapper/10:0]

problem goes away if we reduce the trigger rate to 71kHz. even "ls" on the drp node will hang until the trigger rate is reduced to 71kHz. "ls" hanging is reproducible even when running only 1 of the two hsd's on a drp. pgp driver and firmware haven't changed since February. caused by conda compiler work?

UNSOLVED: hsd configure times out the first time, but works second time. Matt says that he seen the transition not received at all by the hsd. Phase 1 completes in Digitizer.cc but phase2 maybe doesn't start, consistent with Matt's observation? log file in /reg/neh/home/cpo/2020/06/09_14:38:23_drp-tst-dev021:tmohsd_0.log. Could it be the clear readout? But we have a 1s delay in control.py.

UNSOLVED: according to hsdpva only one hsd was producing dead time (due to msgdelay being set too short) but xpmpva showed all hsd's having dead time. at other times xpmpva seems to be attributing dead time correctly, so it's somehow an intermittent issue, perhaps only happening when Ben detects his front-end buffers have overflowed? (which causes him to latch dead time until receiving clear readout).

UNSOLVED: running 1 hsd (datadev_0) and after a while the deadtime goes to 100% from that hsd, but timepausecnt is zero for both hsd’s. I think it's caused by ben's headercntof latching to 1 on the “A” hsd, even though msgdelay is set to 99. Do we need to increase msgdelay? Matt says msgdelay is max 100 (corresponding to 1us). This could be caused by slow updates ignoring dead time if we get long dead time due to the soft-lockup issue above. So could be the same issue.

Page tree

Failure Modes

Ric Issues

HSD