Ric Issues

  • PCIe issues in which various cards don't come up right. I think we've seen this with Enet, IB, and KCUs now.
  • ATCA cards not powering up correctly; maybe a sequencing issue?
  • Incredibly long boot times on some nodes, intermittently.

HSD

UNSOLVED: early procmgr telnet timeout? procmgr start gets no prompt from several procServ instances, yet a manual telnet to one of the same host/port pairs succeeds about 18 minutes later (per the procServ start time and the date at the end of the transcript):

(ps-3.1.11) daq-tst-dev03:cnf$ procmgr start hsd.cnf
/reg/neh/home/cpo/git/lcls2/install/bin/procmgr: using config file 'hsd.cnf' to start
*** ERR: no prompt at drp-tst-acc06 port 29400
*** ERR: no prompt at drp-tst-dev016 port 29400
*** ERR: no prompt at drp-tst-dev020 port 29400
*** ERR: no prompt at drp-tst-dev021 port 29400
*** ERR: no prompt at drp-tst-dev022 port 29400
(ps-3.1.11) daq-tst-dev03:cnf$ telnet drp-tst-acc06 29400
Trying 172.21.148.224...
Connected to drp-tst-acc06.
Escape character is '^]'.
@@@ Welcome to procServ (procServ Process Server 2.6.0-SLAC)
@@@ Use ^X to kill the child, auto restart is ON, use ^T to toggle auto restart
@@@ procServ server PID: 174485
@@@ Server startup directory: /
@@@ Child startup directory: /tmp
@@@ Child started as: /bin/tcsh
@@@ Child "/bin/tcsh" PID: 174541
@@@ procServ server started at: Thu Jun  4 17:23:15 2020
@@@ Child "/bin/tcsh" started at: Thu Jun  4 17:23:15 2020
@@@ 0 user(s) and 0 logger(s) connected (plus you)
> 
telnet> q
Connection closed.
(ps-3.1.11) daq-tst-dev03:cnf$ date
Thu Jun  4 17:41:18 PDT 2020
(ps-3.1.11) daq-tst-dev03:cnf$ 
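
One way to test the timing theory would be a script that probes each procServ port and times how long the prompt takes to arrive. A minimal sketch below, using python telnetlib (this is not the actual procmgr code; the host list and port are from the transcript above, and the 5 s timeouts are guesses):

import telnetlib
import time

HOSTS = ['drp-tst-acc06', 'drp-tst-dev016', 'drp-tst-dev020',
         'drp-tst-dev021', 'drp-tst-dev022']
PORT = 29400  # procServ port from the transcript above

for host in HOSTS:
    t0 = time.time()
    try:
        tn = telnetlib.Telnet(host, PORT, timeout=5)
        # procServ prints its '@@@ ...' banner and then the shell prompt
        banner = tn.read_until(b'> ', timeout=5)
        print('%-16s prompt after %.2fs (%d banner bytes)'
              % (host, time.time() - t0, len(banner)))
        tn.close()
    except Exception as exc:
        print('%-16s FAILED: %s' % (host, exc))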



SOLVED: hsdpva crash on startup. p4p get() timeout too short? The default is 5 s; try increasing it to 10 s. Real solution: make hsdpvs (the hsd PV server) a long-lived procmgr process.

(ps-3.1.11) daq-tst-dev03:cnf$ cat /reg/neh/home/cpo/2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log
# ID:      hsdpva
# PLATFORM:4
# HOST:    daq-tst-dev03.pcdsn
# CMDLINE: hsdpva DAQ:LAB2:HSD:DEV07_1A:A DAQ:LAB2:HSD:DEV07_1A:B DAQ:LAB2:HSD:DEV07_88:A DAQ:LAB2:HSD:DEV07_88:B DAQ:LAB2:HSD:DEV07_89:A DAQ:LAB2:HSD:DEV07_89:B DAQ:LAB2:HSD:DEV07_B1:A DAQ:LAB2:HSD:DEV07_B1:B DAQ:LAB2:HSD:DEV07_B2:A DAQ:LAB2:HSD:DEV07_B2:B
# TESTRELDIR:/reg/neh/home/cpo/git/lcls2/install
# CONDA_REL:ps-3.1.11
# GIT_DESCRIBE:3.0.4-39-ga30d0131
@@@ Restarting child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log"
@@@    (as /bin/env)
@@@ The PID of new child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log" is: 23772
@@@ @@@ @@@ @@@ @@@
5.12.3
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 270, in get
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/reg/neh/home/cpo/git/lcls2/install/bin/hsdpva", line 11, in <module>
    load_entry_point('psdaq', 'console_scripts', 'hsdpva')()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 349, in main
    ui.setupUi(MainWindow,args.base)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 313, in setupUi
    maintab.addTab( HsdJesd         (title), 
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 245, in __init__
    pvTtl.get()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 78, in get
    self.__value__ = pvactx.get(self.pvname)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 274, in get
    raise TimeoutError()
TimeoutError
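
For reference, the timeout experiment looks like this in p4p's thread client (a sketch only; the PV name is a stand-in built from the hsdpva command line above, not the exact PV the gui was fetching):

from p4p.client.thread import Context

ctxt = Context('pva')
try:
    # p4p's default get() timeout is 5s (the value that fired in the
    # traceback above); the idea was to try 10s instead
    val = ctxt.get('DAQ:LAB2:HSD:DEV07_1A:A', timeout=10.0)
    print(val)
except TimeoutError:
    print('no response after 10s either; likely a server-side problem')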



SOLVED: Saw 100% dead time from all 5 hsd's. Ric suggests looking at the Read Buffers section in /proc/datadev_0 (and datadev_1). This suggests that all the buffers are in the kcu, not stuck in software. Disable times out. Answer (after working with Matt): msgdelay was set to 91 in hsdpva, which was too short and overwrote the front-end buffers; fixed by raising it with "pvput DAQ:LAB2:XPM:2:PART:4:L0Delay 98". Setting it to 98 displays it as 99 for reasons I don't understand.

  Buffers In User : 0          (number of buffers in software)
  Buffers In Hw : 4095         (number of buffers in the kcu)
  Buffers In Pre-Hw Q : 61334  (number of buffers driver has available to hand to the kcu)
  Buffers In Rx Queue : 0      (buffers transferred from kcu to driver, ready to be received by dmaReadBulkIndex)
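
A small poller over those counters makes the failure mode obvious at a glance. A sketch, assuming the "name : value" layout shown above holds for the rest of the file:

import re, time

FIELDS = ('Buffers In User', 'Buffers In Hw',
          'Buffers In Pre-Hw Q', 'Buffers In Rx Queue')

def read_buffers(path='/proc/datadev_0'):
    counts = {}
    with open(path) as f:
        for line in f:
            m = re.match(r'\s*(.+?)\s*:\s*(\d+)', line)
            if m and m.group(1) in FIELDS:
                counts[m.group(1)] = int(m.group(2))
    return counts

# all 4095 buffers sitting in 'Buffers In Hw' => stuck in the kcu,
# not in software
while True:
    print(read_buffers())
    time.sleep(1)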



UNSOLVED: groupca crashes with a p4p timeout. Traced it back to pva get() working but put() failing:

(ps-3.1.11) daq-tst-dev03:cnf$ pvget DAQ:LAB2:XPM:2:PART:4:Master
DAQ:LAB2:XPM:2:PART:4:Master 2020-06-05 17:09:25.695  1 
(ps-3.1.11) daq-tst-dev03:cnf$ pvput DAQ:LAB2:XPM:2:PART:4:Master 1
Old : 2020-06-05 17:09:25.695  1 
Put timeout
(ps-3.1.11)

In turn, traced it down to a crash in pyxpm-2:

Group: 4  Master: 1  RateSel: 0  DestSel: 8000  Ena: 1
Group: 3  Master: 1  RateSel: 3  DestSel: 8000  Ena: 0
Group: 7  Master: 1  RateSel: 4  DestSel: 8000  Ena: 0
Traceback (most recent call last):
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 457, in update
    self._links[i].update()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 136, in update
    updatePv(self._pv_remoteLinkId,self._app.remId.get())
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 114, in updatePv
    pv.post(value)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/server/raw.py", line 160, in post
    _SharedPV.post(self, self._wrap(value))
RuntimeError: bad_weak_ptr
Caught exception... retrying.
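
The retry comes from the caller; a defensive variant of updatePv() could at least keep one bad post from aborting the whole stats pass. This is only a containment sketch, not a fix for the underlying put() timeout, and the isOpen() precondition is my guess at what bad_weak_ptr is complaining about:

def updatePv(pv, value):
    # pv is a p4p SharedPV; posting to one that isn't open (or whose
    # server-side ref has died) is the suspected source of bad_weak_ptr
    if not pv.isOpen():
        return
    try:
        pv.post(value)
    except RuntimeError as exc:
        print('post failed: %s' % exc)   # e.g. bad_weak_ptr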


SOLVED (1MHz): running 5 hsd nodes at 1 MHz, saw this on dev010 and the node became unresponsive (disable timed out?). Removed dev010 but then saw a disable timeout on dev019:

Message from syslogd@drp-tst-dev010 at Jun  5 13:02:43 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [swapper/10:0]

The problem goes away if we reduce the trigger rate to 71 kHz. Even "ls" on the drp node will hang until the trigger rate is reduced to 71 kHz, and the hang is reproducible even when running only 1 of the two hsd's on a drp. The pgp driver and firmware haven't changed since February. Caused by the conda compiler work? No: the problem is visible with pgpread under both old and new compilers. Maybe it's having 2 hsd's on a drp (interrupts may still be firing even when we don't run the second hsd drp executable). Matt has been unable to reproduce with fakecam so far. Update: found that even when running 1 hsd, interrupts for both hsd's were firing (saw counters in /proc/datadev_1 incrementing). Following Matt's advice, "unconfigure" disabled the second hsd's interrupts, and in that configuration there were no more soft lockups. Checked that it behaved with datadev_0 or datadev_1 running individually, just not both together. Solution: add an interrupt-hold-off parameter to the firmware to throttle the maximum interrupt rate (in addition to the existing pattern of only interrupting when the queue goes non-empty).
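
The "interrupts for both hsd's were firing" observation came from watching counters; a diff over the two proc files shows it directly. A sketch, again assuming the counter lines follow the same "name : value" layout as the Read Buffers section above:

import re, time

def snapshot(path):
    counts = {}
    with open(path) as f:
        for line in f:
            m = re.match(r'\s*(.+?)\s*:\s*(\d+)\s*$', line)
            if m:
                counts[m.group(1)] = int(m.group(2))
    return counts

# with only the datadev_0 hsd running, nothing in datadev_1 should move
for dev in ('/proc/datadev_0', '/proc/datadev_1'):
    before = snapshot(dev)
    time.sleep(5)
    after = snapshot(dev)
    moved = {k: after[k] - before[k]
             for k in after if k in before and after[k] != before[k]}
    print(dev, 'counters that moved in 5s:', moved or 'none')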


UNSOLVED (1MHz): with both the fake cam and the hsd, Matt saw that if he ran at 1 MHz, disabled for a few seconds, then re-enabled, all the buffers were stuck in software and the system would hang. Reproducible in 1 out of 3 attempts.


UNSOLVED: hsd configure times out the first time, but works the second time. Matt says he has seen the transition not received at all by the hsd. Phase 1 completes in Digitizer.cc but phase 2 maybe doesn't start, consistent with Matt's observation? Log file in /reg/neh/home/cpo/2020/06/12_17:41:55_drp-tst-dev010:tmohsd_0.log. What I observed: both tmohsd_0/1 completed phase 1 but failed to receive phase 2. How I think it's supposed to work: tmohsd_0/1 each write some kcu registers (resetting tx/rx/user/qpll). Both tmohsd_0/1 set config epics vars in their own pva server (the "hsdpvs" process). Both of these servers communicate with one "hsdioc" (running the hsd134PVs c++ executable on dev07). hsdioc programs both channels of the hsd ("fmc" in the code) and sets READY (e.g. DAQ:LAB2:HSD:DEV07_1A:B:READY). tmohsd_0/1 each watch for their own A:READY and B:READY. Maybe the resets take a while to recover? Maybe a "reentrancy" problem in hsdioc? Could it be the clear readout? But we have a 1 s delay in control.py after clear readout (sent on both configure/beginrun).
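
If phase 2 really does hinge on the READY handshake, a monitor with a deadline would distinguish "READY never set" from "READY set late". A p4p sketch; the PV names follow the DEV07_1A example above, and the assumption that READY unwraps to a scalar usable as a truth value is mine:

import threading
from p4p.client.thread import Context

def wait_ready(pvnames, deadline=10.0):
    ctxt = Context('pva')
    pending = set(pvnames)
    done = threading.Event()
    subs = []

    def make_cb(name):
        def cb(value):
            if value:                      # READY went non-zero
                pending.discard(name)
                if not pending:
                    done.set()
        return cb

    for name in pvnames:
        subs.append(ctxt.monitor(name, make_cb(name)))
    ok = done.wait(deadline)
    for s in subs:
        s.close()
    ctxt.close()
    return ok, pending                     # pending = PVs that never went READY

ok, missing = wait_ready(['DAQ:LAB2:HSD:DEV07_1A:A:READY',
                          'DAQ:LAB2:HSD:DEV07_1A:B:READY'])
print('ready' if ok else 'timed out waiting for: %s' % sorted(missing))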


UNSOLVED: according to hsdpva only one hsd was producing dead time (due to msgdelay being set too short), but xpmpva showed all hsd's having dead time. At other times xpmpva seems to attribute dead time correctly, so it's an intermittent issue, perhaps only happening when Ben detects that his front-end buffers have overflowed (which causes him to latch dead time until receiving clear readout).

 

PROPOSED SOLUTION: running 1 hsd (datadev_0), after a while the dead time goes to 100% from that hsd, but timepausecnt is zero for both hsd's. I think it's caused by Ben's headercntof latching to 1 on the "A" hsd, even though msgdelay is set to 99. Do we need to increase msgdelay? Matt says msgdelay is max 100 (corresponding to 1 us). This could be caused by slow updates ignoring dead time if we get long dead time due to the soft-lockup issue above (now solved by Matt, we believe), so it could be the same issue. Matt will have SlowUpdates pay attention to dead time to avoid this.


UNSOLVED: Saw this error in /reg/neh/home/cpo/2020/06/15_14:39:26_daq-tst-dev07:hsdioc_89.log when starting up processes after a power cycle (for the new hsd firmware with the interrupt hold-off):

ADC Initial Lane Alignment Failed!
reg7 = 0xC

 hsd "89" was on the slave XPM:1.  another hsd "88" was on the same XPM:1 but worked.  all links looked locked on the xpm side.
