Ric Issues

  • PCIe issues in which various cards don't come up right. I think we've seen this with Enet, IB and KCUs now
  • ATCA cards not powering up correctly; maybe a sequencing issue?
  • Incredibly long boot times on some nodes (intermittent)

HSD

Critical


SOLVED: When doing repeated configure transitions I would get core files from a subset of tmohsd processes (see ~cpo/problems/drp_segv for traceback).  These files suggest that threads are being deleted in the ~PGPDetector destructor before being joined.  Ric thinks this core file may be a red herring, created when we receive an infiniband disconnect that triggers a call to PGPDetector::shutdown() from the wrong thread.  See for example /reg/neh/home/cpo/2020/06/30_17:39:15_drp-tst-dev020:tmohsd_2.log (or grep for "signal 11" in *tmohsd* in that directory).  The error Ric sees that worries him is this one: "tst-drp[109274]: <C> Receiver thread lost connection".  But we see that error message many times in that log file (every time we move up and down the state machine, I think).  To do: see if the thread doing the join is different from the PGPDetector destructor thread (creating an intermittent race condition).  Printed out the thread id and it looks the same for the call to join and the dtor.  Saw a different traceback in PGPReader from psmetric stuff (also listed in the ~cpo/problems/drp_segv file).  Update: Ric thinks he may have fixed it in a recent push (he said it was a problem with psmetric std::vectors doing "move" when vectors of monitored variables were increased in size, creating zero pointers).  Since then I haven't been able to duplicate the error running up/down the state machine for about an hour, so calling it fixed.


UNSOLVED: Saw this timeout in two hsd's after several iterations up/down the state machine, but a pvget a little later on the command line returns the variable correctly.  See /reg/neh/home/cpo/2020/07/22_19:56:56_drp-tst-dev020:tmohsd_5.log.

Traceback (most recent call last):
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/configdb/hsd_config.py", line 15, in hsd_config
    partitionDelay = ctxt.get(epics_prefix+':MONTIMING').msgdelayset
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 274, in get
    raise TimeoutError()
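
Since a pvget on the command line succeeds a little later, one workaround (just a sketch, not the current hsd_config.py code) would be to retry the get with a longer timeout via p4p's thread-client Context; the PV prefix below is only an example:

# Sketch: retry the MONTIMING get with a longer timeout, since a later
# command-line pvget succeeds.  p4p raises TimeoutError on timeout (see the
# traceback above); Context.get() takes a timeout in seconds.
from p4p.client.thread import Context

def get_montiming(epics_prefix, retries=3, timeout=10.0):
    ctxt = Context('pva')
    try:
        for attempt in range(retries):
            try:
                return ctxt.get(epics_prefix + ':MONTIMING', timeout=timeout)
            except TimeoutError:
                print(f'MONTIMING get timed out (attempt {attempt + 1})')
        raise TimeoutError(f'{epics_prefix}:MONTIMING unreachable after {retries} tries')
    finally:
        ctxt.close()

# partitionDelay = get_montiming('DAQ:LAB2:HSD:DEV07_1A:A').msgdelayset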

 

UNSOLVED: Core dump from hsd134PVs in an epics channel "put".  See ~cpo/problems/hsd134PVs_segv for traceback, and log file /reg/neh/home/cpo/2020/07/15_13:55:31_daq-tst-dev07:hsdioc_89.log

 

UNSOLVED: hsd configure times out the first time, but works the second time.  Matt says that he has seen the transition not received at all by the hsd.  Phase 1 completes in Digitizer.cc but phase 2 maybe doesn't start, consistent with Matt's observation?  Log file in /reg/neh/home/cpo/2020/06/12_17:41:55_drp-tst-dev010:tmohsd_0.log.  What I observed: both tmohsd_0/1 completed phase 1 but failed to receive phase 2.  How I think it's supposed to work: tmohsd_0/1 each write some kcu registers (resetting tx/rx/user/qpll).  Both tmohsd_0/1 set config epics vars in their own pva server ("hsdpvs" process).  Both of these servers communicate with one "hsdioc" (running the hsd134PVs c++ executable on dev07).  hsdioc programs both channels of the hsd ("fmc" in the code) and sets READY (e.g. DAQ:LAB2:HSD:DEV07_1A:B:READY).  tmohsd_0/1 both watch for their own A:READY and B:READY for each channel.  Maybe the resets take a while to recover?  Maybe a "reentrancy" problem in hsdioc?  Could it be the clear readout?  But we have a 1 s delay in control.py after clear readout (sent on both configure/beginrun).
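
For reference, a minimal sketch of the READY handshake described above, assuming simple p4p polling (the real drp/hsdioc code is C++ and presumably uses monitors; PV names are the ones from the log):

# Sketch of the READY handshake: wait for both channels of one hsd to report
# READY, e.g. after hsdioc has programmed the fmc in response to the config
# PVs set by tmohsd.  Polling here is an assumption, purely for illustration.
import time
from p4p.client.thread import Context

def wait_ready(base='DAQ:LAB2:HSD:DEV07_1A', timeout=20.0, poll=0.5):
    ctxt = Context('pva')
    pvs = [f'{base}:A:READY', f'{base}:B:READY']
    deadline = time.time() + timeout
    try:
        while time.time() < deadline:
            vals = ctxt.get(pvs, timeout=5.0)
            if all(int(v) != 0 for v in vals):
                return True
            time.sleep(poll)          # resets may take a while to recover
        return False                  # this is where phase 2 would time out
    finally:
        ctxt.close()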

In these log files, saw that only one half of the hsd didn't get phase 2: /reg/neh/home/cpo/2020/06/15_18:39:03_drp-tst-dev021:tmohsd_7.log and /reg/neh/home/cpo/2020/06/15_18:39:03_drp-tst-dev021:tmohsd_6.log.

Took out the resets in Digitizer.cc and didn't see any failures on the configure transition in about 40 tries, but did see one hsd (1A:A) get dead time with headercntof=2, which wasn't cleared until I put the resets back into Digitizer.cc?


SOLVED: Disable phase 2 failed in these log files: /reg/neh/home/cpo/2020/07/13_19:30:*drp*.log.  Added a print statement in PGPDetector.cc to explicitly print out whether disable phase 2 is received by the segment level.  Maybe disable phase 2 conflicts with an l1accept?  Same problem as described below ("all 10 hsds didn't respond to Disable").

 

PROPOSED SOLUTION: Saw this error in /reg/neh/home/cpo/2020/06/16_15:49:13_drp-tst-dev020:tmohsd_4.log:

 

ERROR:root:Web server error: HTTPSConnectionPool(host='pswww.slac.stanford.edu', port=443): Read timed out. (read timeout=3.05)

Murali proposes to increase the 3.05 s read timeout in psdaq/configdb/configdb.py or webconfigdb.py (not sure which).
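
The "HTTPSConnectionPool ... Read timed out" message is the requests/urllib3 style, so the change is presumably something like the following (the base URL, path and function name are placeholders, not the real configdb client API):

# Illustrative only: bump the read timeout on the configdb web requests.
# The 3.05 s in the error above was the read timeout that expired.
import requests

CONFIGDB_URL = 'https://pswww.slac.stanford.edu/'   # placeholder base URL

def get_device_config(path, connect_timeout=3.05, read_timeout=30.0):
    # requests accepts a (connect, read) timeout tuple
    resp = requests.get(CONFIGDB_URL + path, timeout=(connect_timeout, read_timeout))
    resp.raise_for_status()
    return resp.json()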


PROPOSED SOLUTION: Saw hsd config timeouts.  A second (old) hsdpvs process was running, resulting in this warning:

 

[ClientChannel::Impl] warning : More than one channel with name 'DAQ:LAB2:HSD:DEV07_88:A:FWBUILD' detected, connected to: 172.21.148.110:53472, ignored: 172.21.148.150:51349

This resulted in the intermittent failures, since the drp would sometimes connect to the wrong one and fail to learn that the configuration was complete.  Have sent mail to the controls group to see if we can turn this warning into an error.


UNSOLVED: Saw this error in /reg/neh/home/cpo/2020/06/15_14:39:26_daq-tst-dev07:hsdioc_89.log when starting up processes after a power-cycle (for new hsd firmware for the interrupt-holdoff):

 

ADC Initial Lane Alignment Failed!
reg7 = 0xC

 

 hsd "89" was on the slave XPM:1.  another hsd "88" was on the same XPM:1 but worked.  all links looked locked on the xpm side.   saw this again in /reg/neh/home/cpo/2020/06/16_18:19:24_daq-tst-dev07:hsdioc_89.log (same digitizer, but also saw 88 fail below).  Note that this is after I commented out all the configure-resets in Digitizer.cc (that happened on Jun 16 at 14:54).  The various digitizers look similar:

 

 

Restarting the hsdioc_89 process "fixed" the problem.  Also saw this in hsdioc_88: /reg/neh/home/cpo/2020/06/18_18:16:40_daq-tst-dev07:hsdioc_88.log.  And here: /reg/neh/home/cpo/2020/06/18_18:51:48_daq-tst-dev07:hsdioc_1a.log

Tried running hsd134PVs with the "-E" option (ignore errors, so it would retry?) but still see failures /reg/neh/home/cpo/2020/07/02_18:27:00_daq-tst-dev07:hsdioc_89.log and /reg/neh/home/cpo/2020/07/02_18:27:00_daq-tst-dev07:hsdioc_1a.log.  Wasn't able to follow all the "-E" logic yet.  Feels like "-E" doesn't retry?

 

SOLVED: Running at 10 Hz with no recording, phase 2 of disable was not seen by any hsds: /reg/neh/home/cpo/2020/06/16_15:26:24_drp-tst-acc06:control.log.  Can see control.py try to write xpm pv's for disable phase 2.  Saw this again here when recording at 100 Hz: /reg/neh/home/cpo/2020/06/18_15:37:36_drp-tst-acc06:control.log.  Seemed not so reproducible at 10 Hz, but reproducible at 100 Hz.  Saw it with slow update disabled here: /reg/neh/home/cpo/2020/06/18_18:33:51_daq-tst-dev03.pcdsn:control_gui.log.  hsd logs show phase 1 of disable but not phase 2.  Also saw this without recording here: /reg/neh/home/cpo/2020/06/18_18:51:48_drp-tst-acc06:control.log.  Following suggestions from Ric, eventually traced this down to an os.system('cat /proc/datadev_*') line in hsd_config.py:hsd_config().  I don't understand why that triggered this issue.
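
If the shell-out itself is the trigger, one way to capture the same information without it (a sketch; not necessarily the actual fix applied to hsd_config.py):

# Sketch: read the datadev proc entries directly instead of shelling out with
# os.system('cat /proc/datadev_*'), which (for reasons still not understood)
# interfered with the disable phase 2 handshake.
import glob
import logging

def log_datadev_proc():
    for path in sorted(glob.glob('/proc/datadev_*')):
        try:
            with open(path) as f:
                logging.info('%s:\n%s', path, f.read())
        except OSError as exc:
            logging.warning('could not read %s: %s', path, exc)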


PROPOSED SOLUTION: Running 1 hsd (datadev_0), after a while the deadtime goes to 100% from that hsd, but timepausecnt is zero for both hsd's.  I think it's caused by Ben's headercntof latching to 1 on the "A" hsd, even though msgdelay is set to 99.  Do we need to increase msgdelay?  Matt says msgdelay is max 100 (corresponding to 1 us).  This could be caused by slow updates ignoring dead time if we get long dead time due to the soft-lockup issue (the 1 MHz entry under Solved, now fixed by Matt, we believe).  So it could be the same issue.  Matt will have SlowUpdates pay attention to dead time to avoid this.


UNSOLVED: Sept. 11 '20:  The system (4 DRPs on dev004 and 2 HSDs on dev008) ran normally several times this morning.  When a new run was started, phase 2 of Configure wasn't received by the 2 HSDs.  Instead, their DRPs both died with 'Jump in complete l1Count'.  Each DRP reported a different PulseID and TimeStamp, but in both cases the event was a SlowUpdate which should have been a Configure.  hsdpva shows only the A side having a headercntof of 1.  For both sides, msgdelayset is 98, msgdelayget is 0, headerfifow is 0, headerfifor is 16.  Starting a new run proceeded correctly with Configure properly received.

Not Critical


UNSOLVED (any rate): With both the fake cam and hsd, Matt saw that if he ran at 1 MHz, disabled for a few seconds, then re-enabled, all the buffers were stuck in software and the system would hang.  Reproducible 1 out of 3 attempts.  Learned that Pause/Resume does not see this problem - must just disable triggers.  When this happens, the monitoring shows that the DRP does not process all of the L1s.  It's like a batch gets stuck.  After that, nothing gets processed (transitions or more L1s).

Found that SlowUpdates prevent the system from hanging.  Ric suggested that the TEB's in-progress batch doesn't get released because the interval between the last L1 and the next datagram is greater than the size of the DRP's batch ringbuffer, which can accommodate 4 seconds of running.  Thus, the ringbuffer is not empty, and the head would have to pass the tail in order to allocate another batch.  This doesn't happen between transitions, because a transition flushes any in-progress batch.
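
A toy model of the head/tail argument, for illustration only (this is not the real TEB/DRP batching code):

# Toy ring of batches: the head may not pass the tail, so if the oldest
# (in-progress) batch is never flushed, allocation stalls once the ring wraps,
# which is the hang described above.  A transition flushes the in-progress
# batch, which is why SlowUpdates (or any transition) prevent the hang.
class BatchRing:
    def __init__(self, nbatches):
        self.nbatches = nbatches
        self.head = 0   # next batch to allocate
        self.tail = 0   # oldest batch not yet released

    def alloc(self):
        if self.head - self.tail == self.nbatches:
            return None                  # full: head would pass tail
        self.head += 1
        return (self.head - 1) % self.nbatches

    def flush(self):
        self.tail = self.head            # a transition releases everything

ring = BatchRing(nbatches=4)             # "~4 seconds of running"
while ring.alloc() is not None:
    pass                                 # L1s fill the ring; nothing completes the last batch
assert ring.alloc() is None              # next datagram can't be batched -> appears hung
ring.flush()                             # a transition (e.g. SlowUpdate) unblocks it
assert ring.alloc() is not None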


(spurious error) UNSOLVED: Early procmgr telnet timeout?

(ps-3.1.11) daq-tst-dev03:cnf$ procmgr start hsd.cnf
/reg/neh/home/cpo/git/lcls2/install/bin/procmgr: using config file 'hsd.cnf' to start
*** ERR: no prompt at drp-tst-acc06 port 29400
*** ERR: no prompt at drp-tst-dev016 port 29400
*** ERR: no prompt at drp-tst-dev020 port 29400
*** ERR: no prompt at drp-tst-dev021 port 29400
*** ERR: no prompt at drp-tst-dev022 port 29400
(ps-3.1.11) daq-tst-dev03:cnf$ telnet drp-tst-acc06 29400
Trying 172.21.148.224...
Connected to drp-tst-acc06.
Escape character is '^]'.
@@@ Welcome to procServ (procServ Process Server 2.6.0-SLAC)
@@@ Use ^X to kill the child, auto restart is ON, use ^T to toggle auto restart
@@@ procServ server PID: 174485
@@@ Server startup directory: /
@@@ Child startup directory: /tmp
@@@ Child started as: /bin/tcsh
@@@ Child "/bin/tcsh" PID: 174541
@@@ procServ server started at: Thu Jun  4 17:23:15 2020
@@@ Child "/bin/tcsh" started at: Thu Jun  4 17:23:15 2020
@@@ 0 user(s) and 0 logger(s) connected (plus you)
> 
telnet> q
Connection closed.
(ps-3.1.11) daq-tst-dev03:cnf$ date
Thu Jun  4 17:41:18 PDT 2020
(ps-3.1.11) daq-tst-dev03:cnf$ 


 

(diagnostic) POTENTIALLY SOLVED: groupca crashes with a p4p timeout.  Traced it back to the pva get() working but put() failing:

 

(ps-3.1.11) daq-tst-dev03:cnf$ pvget DAQ:LAB2:XPM:2:PART:4:Master
DAQ:LAB2:XPM:2:PART:4:Master 2020-06-05 17:09:25.695  1 
(ps-3.1.11) daq-tst-dev03:cnf$ pvput DAQ:LAB2:XPM:2:PART:4:Master 1
Old : 2020-06-05 17:09:25.695  1 
Put timeout
(ps-3.1.11)

 

In turn, traced it down to a crash in pyxpm-2:

 

Group: 4  Master: 1  RateSel: 0  DestSel: 8000  Ena: 1
Group: 3  Master: 1  RateSel: 3  DestSel: 8000  Ena: 0
Group: 7  Master: 1  RateSel: 4  DestSel: 8000  Ena: 0
Traceback (most recent call last):
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 457, in update
    self._links[i].update()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 136, in update
    updatePv(self._pv_remoteLinkId,self._app.remId.get())
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 114, in updatePv
    pv.post(value)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/server/raw.py", line 160, in post
    _SharedPV.post(self, self._wrap(value))
RuntimeError: bad_weak_ptr
Caught exception... retrying.

 

Sent email to Michael Davidsaver, Hugo Slepicka and the control group to see if they had seen this before.  They had not.  Michael wrote:  "At a guess, a client disconnect will be involved."  Michael wrote back later that he found a race condition that could explain it.  Will be addressed in a future release.
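
Until a fixed p4p release is available, a sketch of the kind of guard around post() that keeps an update loop alive (the PV name and type below are examples, not the real pyxpm PVs; the "retrying" print mirrors what pvstats.py already logs):

# Sketch: keep a stats-update loop alive if SharedPV.post() hits the
# bad_weak_ptr race on a client disconnect.
import time
from p4p.nt import NTScalar
from p4p.server import Server
from p4p.server.thread import SharedPV

pv = SharedPV(nt=NTScalar('I'), initial=0)

def update_loop(pv, get_value, period=1.0):
    while True:
        try:
            pv.post(get_value())
        except RuntimeError as exc:          # e.g. bad_weak_ptr on client disconnect
            print('Caught exception... retrying.', exc)
        time.sleep(period)

# with Server(providers=[{'DAQ:LAB2:XPM:2:EXAMPLE:COUNT': pv}]):   # example name
#     update_loop(pv, get_value=lambda: int(time.time()) & 0xffffffff)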

 

 

(diagnostic) UNSOLVED: Possibly the same issue as above, but this time with hsdpva.  See ~claus/2020/08/27_10:34:36_daq-tst-dev02:hsdpva.log.

 

hsd_6400m: Vivado v2019.1, rdsrv300 (x86_64), Built Fri 31 Jul 2020 03:05:55 PM PDT by weaver
XPM:1:AMC1-0
terminate called after throwing an instance of 'std::bad_weak_ptr'
 what(): bad_weak_ptr

 

Recovered by telneting to the process and restarting using ^X.

On a subsequent restart of the .cnf, the hsdpva process died on signal 11 after printing the XPM:1:AMC1-0 line with no further output.

 

(diagnostic) UNSOLVED: According to hsdpva only one hsd was producing dead time (due to msgdelay being set too short), but xpmpva showed all hsd's having dead time.  At other times xpmpva seems to attribute dead time correctly, so it's somehow an intermittent issue, perhaps only happening when Ben's firmware detects that its front-end buffers have overflowed?  (which causes it to latch dead time until receiving clear readout).

 

(diagnostic) UNSOLVED: Started a run with 5 HSD pairs.  Got to state Running with GroupCa's NumL0Acc going to 23 before the system went into 100% dead time.  GroupCA attributed it to HSD:dev07.89.  xpmpva for XPM:1 shows no deadtime for any device in my group (3), while for XPM:2 the link to XPM:1 shows ~1.0.  Perhaps the XPM:1 DeadTime tab is not updating at all since all Group columns show 0.0000.  Found '172.21.148.110:34458 : Connection closed with RX socket error 113' in the pyxpm-1 log file.  Restarted the process.  No change.  Restarted xpmpva.  No change.  However, now HSD:dev07.1a is shown to be the source of the deadtime in both groupca and the xpmpva deadtime tab for XPM:2.  After 2 further restarts of the full system (due to deadtime), all 5 HSD pairs + timing system DRP ran fine without deadtime.

 

(diagnostic) UNSOLVED: Got a segv in this log file: /reg/neh/home/cpo/2020/06/15_18:39:03_daq-tst-dev03.pcdsn:hsdpva.log (and again in /reg/neh/home/claus/2020/07/24_17:12:50_daq-tst-dev02:hsdpva.log, twice).  Restarting the process fixed it.  Later, saw this log /reg/neh/home/cpo/2020/06/16_15:49:13_daq-tst-dev03.pcdsn:hsdpva.log:

Error processing Subscription event: 'XPM:2:AMC0-1'
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4
p/client/thread.py", line 124, in _handle
    self._cb(E)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 64, in mon
itor_cb
    callback(err=None)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 548, in up
date
    q = self.pv.get().value
AttributeError: 'ntstr' object has no attribute 'value'
hsd_6400m: Vivado v2019.1, rdsrv300 (x86_64), Built Thu 12 Mar 2020 02:42:02 PM 
PDT by weaver
XPM:2:AMC0-1
Fatal Python error: GC object already tracked

perhaps related to https://stackoverflow.com/questions/23178606/debugging-python-fatal-error-gc-object-already-tracked?
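
One possible guard for the failing line in pvedit.py (a sketch; the proper fix may be different): the p4p client can hand back an already-unwrapped scalar such as ntstr (a str subclass), which has no .value attribute, which is exactly what the traceback above shows.

# Sketch of a defensive read: fall back to the value itself when there is no
# .value attribute (e.g. the ntstr case in the traceback above).
def pv_value(pv):
    v = pv.get()
    return v.value if hasattr(v, 'value') else v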

 


(diagnostic) SOLVED: Message sent to mikhail/caf: One small issue I noticed is that the progress bar got "stuck" at 11% when the configure was timing out (it was timing out for reasons that aren't relevant to this message).  You can see the logfile here:
/reg/neh/home/cpo/2020/06/16_15:49:13_daq-tst-dev03.pcdsn:control_gui.log

At 17:34:33 there is a line like:

2020-06-16 17:34:33,789 tst-CGWMain[14913]: <D> received progress msg: {'transition': 'configure', 'elapsed': 17, 'total': 150}

(note that 17/150=11% progress that I saw in control_gui).  Then the GUI didn’t put out any other messages until the transition timed out at 17:36:46 (and correspondingly the progress bar didn’t update during that period).

Not critical, but would be nice to understand why the progress bar stopped updating.  Any thoughts on this?


(spurious error) UNSOLVED: On procmgr restart tmo.cnf -v -p 3 Ric got:

 

...
sending ^Q to 'drp-tst-dev015:tmocam1_0' (drp-tst-dev015 port 29306) ... done
sending ^Q to 'drp-tst-dev004:tmocam2_0' (drp-tst-dev004 port 29304) ... done
sending ^Q to 'localhost:procstat' (localhost port 29307)    ... done
/reg/neh/home/claus/lclsii/daq/test/lcls2-200602/install/bin/procmgr: removed 'p3.cnf.running'
/reg/neh/home/claus/lclsii/daq/test/lcls2-200602/install/bin/procmgr: using config file '../../../../test/lcls2-200602/psdaq/psdaq/cnf/tmo.cnf' to start
*** ERR: failed to read procServ banner for 'groupca' on host localhost
Traceback (most recent call last):
  File "/reg/neh/home/claus/lclsii/daq/test/lcls2-200602/install/bin/procmgr", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/procmgr/procmgr", line 316, in <module>
    rv = _start(sys.argv[0], args, runconfig, lastconfig, configfile, options, procmgr_macro)
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/procmgr/procmgr", line 85, in _start
    handle = ProcMgr(configfile, options.platform, options.Xterm_list, options.xterm_list, procmgr_macro)
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/procmgr/ProcMgr.py", line 601, in __init__
    if ((self.tmpstatus != self.STATUS_NOCONNECT) and \
AttributeError: 'ProcMgr' object has no attribute 'tmpstatus'
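
From the traceback it looks like self.tmpstatus only gets assigned when the procServ banner is read successfully.  A hypothetical sketch of a defensive default (the attribute and constant names are taken from the traceback; the rest is an assumption, not the actual ProcMgr.py fix):

# Hypothetical sketch: default the connection status before attempting to read
# the procServ banner, so a failed banner read ("*** ERR: failed to read
# procServ banner") doesn't leave the attribute undefined for the comparison.
class ProcMgrSketch:
    STATUS_NOCONNECT = 'NOCONNECT'

    def check_status(self, read_banner):
        self.tmpstatus = self.STATUS_NOCONNECT   # safe default
        try:
            banner = read_banner()               # may fail or return None on timeout
            if banner:
                self.tmpstatus = 'RUNNING'       # placeholder for the real status logic
        except OSError:
            pass                                 # keep the NOCONNECT default
        if self.tmpstatus != self.STATUS_NOCONNECT:
            pass                                 # proceed as in ProcMgr.__init__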

Solved

SOLVED: hsdpva crash on startup.  p4p get timeout too short?  Default is 5; try increasing to 10.  Real solution: make hsdpvs (the hsd pv server) a long-lived procmgr process.

(ps-3.1.11) daq-tst-dev03:cnf$ cat /reg/neh/home/cpo/2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log
# ID:      hsdpva
# PLATFORM:4
# HOST:    daq-tst-dev03.pcdsn
# CMDLINE: hsdpva DAQ:LAB2:HSD:DEV07_1A:A DAQ:LAB2:HSD:DEV07_1A:B DAQ:LAB2:HSD:DEV07_88:A DAQ:LAB2:HSD:DEV07_88:B DAQ:LAB2:HSD:DEV07_89:A DAQ:LAB2:HSD:DEV07_89:B DAQ:LAB2:HSD:DEV07_B1:A DAQ:LAB2:HSD:DEV07_B1:B DAQ:LAB2:HSD:DEV07_B2:A DAQ:LAB2:HSD:DEV07_B2:B
# TESTRELDIR:/reg/neh/home/cpo/git/lcls2/install
# CONDA_REL:ps-3.1.11
# GIT_DESCRIBE:3.0.4-39-ga30d0131
@@@ Restarting child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log"
@@@    (as /bin/env)
@@@ The PID of new child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log" is: 23772
@@@ @@@ @@@ @@@ @@@
5.12.3
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 270, in get
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/reg/neh/home/cpo/git/lcls2/install/bin/hsdpva", line 11, in <module>
    load_entry_point('psdaq', 'console_scripts', 'hsdpva')()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 349, in main
    ui.setupUi(MainWindow,args.base)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 313, in setupUi
    maintab.addTab( HsdJesd         (title), 
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 245, in __init__
    pvTtl.get()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 78, in get
    self.__value__ = pvactx.get(self.pvname)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 274, in get
    raise TimeoutError()
TimeoutError

 

SOLVED: Saw 100% dead time from all 5 hsd's.  Ric suggests looking at the Read Buffers section in /proc/datadev_0 (and 1).  This suggests that all buffers are in the kcu (not stuck in software).  Disable times out.  Answer (after working with Matt): msgdelay was set to 91 in hsdpva, too short, so the front-end buffers were overwritten; fixed with "pvput DAQ:LAB2:XPM:2:PART:4:L0Delay 98".  Setting it to 98 displays it as 99 for reasons I don't understand.

  Buffers In User : 0          (number of buffers in software)
  Buffers In Hw : 4095         (number of buffers in the kcu)
  Buffers In Pre-Hw Q : 61334  (number of buffers driver has available to hand to the kcu)
  Buffers In Rx Queue : 0      (buffers transferred from kcu to driver, ready to be received by dmaReadBulkIndex)

 

SOLVED (1MHz): Running 5 hsd nodes at 1 MHz, saw this on dev010 and the node became unresponsive (disable timed out?).  Removed dev010 but saw the disable timeout on dev019:

Message from syslogd@drp-tst-dev010 at Jun  5 13:02:43 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [swapper/10:0]

The problem goes away if we reduce the trigger rate to 71 kHz.  Even "ls" on the drp node will hang until the trigger rate is reduced to 71 kHz.  The "ls" hang is reproducible even when running only 1 of the two hsd's on a drp.  The pgp driver and firmware haven't changed since February.  Caused by the conda compiler work?  No: can see the problem with pgpread with both old/new compilers.  Maybe it's having 2 hsd's on a drp (interrupts may still be firing even when we don't run the second hsd drp executable).  Matt has been unable to reproduce with fakecam so far.  Update: found that even running 1 hsd, interrupts for both hsd's were firing (saw counters in /proc/datadev_1 incrementing).  Followed Matt's advice: an "unconfigure" disabled the second hsd's interrupts, and in this configuration there were no more soft lockups.  Checked that it behaved with datadev_0 or datadev_1 running individually, just not both together.  Solution: add an interrupt-hold-off parameter to the firmware to throttle the maximum interrupt rate (in addition to the existing pattern of only interrupting when the queue goes non-empty).

 

SOLVED: After two tmo.cnf runs consisting of pvcam, epics, bld, ts, 3 fakecams, and 10 hsds, which lasted more than a few minutes, all 10 hsds didn't respond to Disable.  The teb log files (/reg/neh/home/claus/2020/07/24_17:12:50_drp-tst-dev016:teb0.log, /reg/neh/home/claus/2020/07/24_19:01:01_drp-tst-dev016:teb0.log) show two L1Accepts and the Disable were timed out due to missing all HSD contributions.  The HSDs were being triggered at 360 Hz, which matches the time difference between the L1Accepts.  SlowUpdates were also occurring.  On another run attempt lasting no more than a minute or so, the Disable (and subsequent transitions) proceeded correctly.

 

Interacting with Matt, I (Ric) saw that the hsdpva GUI's Timing tab showed msgdelayset to have value 91.  This should normally be 98 or 99, but Matt said this shouldn't make a difference at 360 Hz.  Nonetheless, it feels like doing 'pvput DAQ:LAB2:XPM:2:PART:3:L0Delay 97', as per Chris's entry above, makes the problem harder to reproduce.

 

Matt asked me to watch the hsdpva Timing tab while the Disable is being timed out by the TEB.  The headerfifow and headercntof values stay at zero for the entire time before Disable is selected in control_gui.  About a second after Disable is selected, headerfifow starts to count at 1 Hz.  When it reaches 16, headercntof jumps to 10.  When headerfifow saturates at 31, headercntof becomes 11.  Matt thought that the timing headers being counted at 1 Hz are SlowUpdates, so we tried running with them disabled.  The problem would not reproduce in that case.

 

It seems strange that headerfifow would be counting SlowUpdates (or anything else, for that matter) after Disable is selected in control_gui.  By the time of the first count, I'm pretty sure that all DRPs have printed a line indicating they've received phase 1 of Disable.  Thus, control.py should have stopped issuing SlowUpdates.  So, if these are not SlowUpdates, what are they?  And why at 1 Hz?

 

This was due to a FIFO in the firmware that was capturing the ADC serial link lock status.  The FIFO filled on an L1Accept and drained on any transition; non-L1Accept transitions didn't wait for an entry to be ready.  Thus, a SlowUpdate shortly preceding an L1Accept could "steal" this bit, and the L1Accept would have to wait for the next trigger for completion.  Disable prevents that next trigger from coming, thus deadlock.  This explains why it only happens with SlowUpdates running and why the msgdelayset has an influence (the window for "stealing" is determined by that setting).  The (firmware) fix is to capture and release that lock status only for L1Accepts.  The proper disabling of SlowUpdates is addressed elsewhere.

TS (TimingSystem DRP)

SOLVED: From time to time, the following traceback appears during tmo.cnf testing (cf. ~/2020/07/15_14\:15\:25_drp-tst-dev017\:tmots_0.log):

*** setting xpm link enables ['DAQ:LAB2:XPM:1:LinkGroupMask4', 'DAQ:LAB2:XPM:1:LinkGroupMask2', 'DAQ:LAB2:XPM:1:LinkGroupMask8', 'DAQ:LAB2:XPM:1:LinkGroupMask5', 'DAQ:LAB2:XPM:0:LinkGroupMask255', 'DAQ:LAB2:XPM:1:LinkGroupMask5', 'DAQ:LAB2:XPM:1:LinkGroupMask9'] [8, 4, 8, 128, 128, 4, 4]
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 366, in put
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 135, in ts_connect
    connector = ts_connector(json_connect_info)
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 47, in __init__
    self.xpm_link_enable()
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 111, in xpm_link_enable
    self.ctxt.put(pv_names,values)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 369, in put
    raise TimeoutError()
TimeoutError

In the above dump, one can see a reference to XPM:0 in the PVs being accessed.  This is a nonexistent PV, which is what causes the traceback.  The zero is related to the Opal issue described below.
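
A sketch of one way to make the put step tolerant of the bogus name (illustrative only; the real fix is presumably to keep XPM:0 out of the connect info in the first place):

# Sketch: drop PV names that reference the nonexistent XPM:0 before the bulk
# put in xpm_link_enable(), so one bad entry can't time out the whole step.
from p4p.client.thread import Context

def xpm_link_enable(pv_names, values, timeout=5.0):
    pairs = [(n, v) for n, v in zip(pv_names, values) if ':XPM:0:' not in n]
    if len(pairs) != len(pv_names):
        print('skipping nonexistent XPM:0 PVs:',
              [n for n in pv_names if ':XPM:0:' in n])
    ctxt = Context('pva')
    try:
        if pairs:
            names, vals = zip(*pairs)
            ctxt.put(list(names), list(vals), timeout=timeout)
    finally:
        ctxt.close()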

Opal

SOLVED: There's something weird going on with the Opal, which worked fine yesterday.  Ric has taken several tmo.cnf runs without any complaints from the opal DRP.  All transitions go through fine, as do SlowUpdates.  Yet readout group 7 is 100% dead.  XPM:2 AMC0 channel 6 RemoteLinkId is 'undef' and the LinkRxErrs are counting.  Attempts to reset the channel (either Rx or Tx) don't seem to have any effect.  I'm pretty sure this channel is the one that's connected to the Opal.  There's no indication that there's anything wrong with data taking: no errors or damage.

I'm not sure whether it's related, but the above error where the ts DRP is timing out trying to write to an XPM register is happening more frequently than usual.

Actually, after a few more tries, I find that the system runs as described above after a fresh 'procmgr start tmo.cnf', but cycling the state machine from Running through Reset back to Running causes the TimingSystem XPM register write failure.  Without the Opal included, this does not happen.  With further tries it's not so reproducible as I initially thought.

Matt explained that the KCU's Tx link was down while its Rx link was up, so transitions could be received, but the XPM's deadtime signal was latched asserted because the KCU-to-XPM link was down.  He further traced the KCU's Tx link being down to "the reset that gets asserted when loading the datadev driver that causes the timing link to go into this unrecoverable state."  He contacted Ben for a proper fix.

 

UNSOLVED: After power cycling the timing system the opal stopped working.  This log file shows an access to a _last field that doesn't exist: /reg/neh/home/cpo/2020/07/27_14:41:20_drp-tst-dev003:tmoopal_0.log.  I think that's because in cameralink_gateway/surf/protocols/clink/_UartOpal1000.py the _last attribute is only created if things are working (a defensive-access sketch follows the list below).  Starting up the standalone rogue gui (python git/cameralink-gateway/software/scripts/devGui.py --camType Opal1000), the timing link appears to be OK, but HSIO->PgpMon[0]->RxPhyReady (I believe this is the link to the camera) is toggling to False once per second and RxLocalLinkReady/RxRemLinkReady are false.  A previous time this occurred, Larry said that there is a watchdog that tries to reset the link once per second, hence the toggling.  Tried:

  • hitting rx/tx link resets in PgpMon[0]
  • ConfigureXpmMini->ConfigLCLSTimingV2 (don't think this can affect the link to the camera)
  • Tx/Rx link resets in xpm (don't think this can affect the link to the camera)
  • powercycle the drp node (drp-tst-dev003) and repeat the above
  • loading a yaml file (git/cameralink-gateway/software/config/Opal1000.yml)

Feels like we need to power cycle the FEB?
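
Regarding the missing _last field mentioned above, a defensive-access sketch (the attribute name comes from the log; whether this is the right place to guard is an assumption):

# Sketch: tolerate _UartOpal1000's lazily-created _last attribute, which only
# exists once the UART link to the camera is actually working.
def last_response(uart):
    return getattr(uart, '_last', None)   # None => link to camera not up yet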

AMI

UNSOLVED: Got the following when running ami.cnf against a running tmo.cnf system that includes HSDs:

Traceback (most recent call last):
  File "/reg/neh/home/claus/lclsii/daq/test/lcls2-200602/install/bin/ami-worker", line 11, in <module>
    load_entry_point('ami', 'console_scripts', 'ami-worker')()
  File "/reg/neh/home5/claus/lclsii/daq/ami/ami/worker.py", line 387, in main
    flags)
  File "/reg/neh/home5/claus/lclsii/daq/ami/ami/worker.py", line 253, in run_worker
    return worker.run()
  File "/reg/neh/home5/claus/lclsii/daq/ami/ami/worker.py", line 159, in run
    for msg in self.src.events():
  File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 626, in events
    self._update(run)
  File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 795, in _update
    self._update_group(detname, det_xface_name, det_attr_list, is_env_det)
  File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 738, in _update_group
    group_types[attr] = self.data_types[attr_name]
KeyError: 'tmohsd:raw:peaks'

Message sent to Seshu and CPO.
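
A sketch of tolerating the missing key in the worker's data_types map (illustrative; whether skipping the attribute is the right behaviour for ami is for Seshu/CPO to decide):

# Sketch: skip attributes that the worker has no type information for
# (e.g. 'tmohsd:raw:peaks') instead of raising KeyError and killing ami-worker.
import logging

def build_group_types(attr_names, data_types):
    group_types = {}
    for attr_name in attr_names:
        if attr_name in data_types:
            group_types[attr_name] = data_types[attr_name]
        else:
            logging.warning('no data type for %s; skipping', attr_name)
    return group_types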

BLD

UNSOLVED: In recent tmo.cnf runs with the BLD, the BLD has consistently been marking every event it handles with MissedData damage.  Adding a print to Pgp::next() results in lines being printed like:

Missed BLD: PID 61939cc1184cc327, TH PID 000011aaf02c72, ts 398227b806e0c6f6
tst-drp_bld[200900]: <D> PGPReader  lane 0  size 432  hdr 0c000011aaf02c72.398227b806e0c6f6.a3110008

The first pulse ID value shown is received by Bld::next() from the multicast socket.  It looks nothing like the TimingHeader pulse ID.  On an earlier try, I got:

Missed BLD: PID 3980f65135993731, TH PID 000000c3effd0f
tst-drp_bld[93557]: <D> PGPReader  lane 0  size 432  hdr 0c000000c3effd0f.3980f65433113b5a.9f0f0008

In this case, part of the pulse ID value received from the multicast socket was in common with some of the upper 32 bits of the TimingHeader timestamp (3980f65).
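
The overlap is easier to see with the fields pulled apart (values are taken from the second log excerpt above; reading the hdr string as pulseid.timestamp.env is my interpretation):

# Values from the log above: the BLD multicast "pulse ID" vs. the TimingHeader
# pulse ID and timestamp.  The BLD value shares its leading hex digits
# (3980f65...) with the upper 32 bits of the TimingHeader *timestamp*, which
# suggests the multicast value landing in the PID slot is timestamp-like data.
bld_pid      = 0x3980f65135993731   # from "Missed BLD: PID 3980f65135993731"
th_pid       = 0x000000c3effd0f     # from "TH PID 000000c3effd0f"
th_timestamp = 0x3980f65433113b5a   # middle field of the "hdr ..." line

print(hex(bld_pid >> 32))           # 0x3980f651
print(hex(th_timestamp >> 32))      # 0x3980f654  -> same 3980f65 prefix
print(hex(th_pid))                  # nothing in common with the BLD value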

Miscellaneous

UNSOLVED: On one attempt to record a run with tmo.cnf, the control_gui reported bld failing to respond to BeginRun.  The teb log file (/reg/neh/home/claus/2020/07/24_19:15:28_drp-tst-dev016:teb0.log) shows the BeginRun event to be split.  All contributors but the bld arrived in the teb within the 5 second event build timeout period.  Later (not clear how much later) the bld contribution arrived, starting a new event, for which all the other contributions didn't show up within the 5 second timeout period (since they had already arrived for the previous event).  Because the pulse ID of this event was the same as that of the previous event (i.e., didn't advance), the teb asserted.

hsdpva

UNSOLVED: It doesn't happen very often, but sometimes the hsdpva process segfaults.  Seems very strange since it is python. I've (Ric) not been able to capture it in gdb yet.
