...

Took out the resets in Digitizer.cc and didn't see any failures on the configure transition in about 40 tries, but did see one hsd (1A:A) get dead time with headercntof=2, which wasn't cleared until I put the resets back in Digitizer.cc?

 


SOLVED: Disable phase 2 failed in these log files: /reg/neh/home/cpo/2020/07/13_19:30:*drp*.log.  Added a print statement in PGPDetector.cc to explicitly print out whether disable phase 2 is received by the segment level.  Maybe disable phase 2 conflicts with l1accept?  Same problem as described below, "all 10 hsds didn't respond to Disable".

 

PROPOSED SOLUTION: Saw this error in /reg/neh/home/cpo/2020/06/16_15:49:13_drp-tst-dev020:tmohsd_4.log:

...

PROPOSED SOLUTION: Running 1 hsd (datadev_0), after a while the deadtime goes to 100% from that hsd, but timepausecnt is zero for both hsd's.  I think it's caused by Ben's headercntof latching to 1 on the “A” hsd, even though msgdelay is set to 99.  Do we need to increase msgdelay?  Matt says msgdelay is max 100 (corresponding to 1 us).  This could be caused by SlowUpdates ignoring dead time if we get long dead time due to the soft-lockup issue above (now solved by Matt, we believe), so it could be the same issue.  Matt will have SlowUpdates pay attention to dead time to avoid this.

...


UNSOLVED: Sept. 11 '20: The system (4 DRPs on dev004 and 2 HSDs on dev008) ran normally several times this morning.  When a new run was started, phase 2 of Configure wasn't received by the 2 HSDs.  Instead, their DRPs both died with 'Jump in complete l1Count'.  Each DRP reported a different PulseID and TimeStamp, but in both cases the event was a SlowUpdate which should have been a Configure.  hsdpva shows only the A side having a headercntof of 1.  For both sides, msgdelayset is 98, msgdelayget is 0, headerfifow is 0, headerfifor is 16.  Starting a new run proceeded correctly with Configure properly received.

Not Critical


UNSOLVED (any rate): with both the fake cam and hsd, Matt saw that if he ran at 1 MHz, then disabled for a few seconds, then re-enabled, all the buffers were stuck in software and the system would hang.  Reproducible 1 out of 3 attempts.  Learned that Pause/Resume does not see this problem - must just disable triggers.  When this happens, the monitoring shows that the DRP does not process all of the L1s.  It's like a batch gets stuck.  After that, nothing gets processed (transitions or more L1s).

Found that SlowUpdates prevent the system from hanging.  Ric suggested that the in-progress batch gets stuck because the interval between the last L1 and the next datagram is greater than the capacity of the DRP's batch ringbuffer, which can accommodate 4 seconds of running.  Thus, the ringbuffer is not empty, and the head would have to pass the tail in order to allocate another batch.  This doesn't happen between transitions, because a transition flushes any in-progress batch.
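
To make the head/tail constraint concrete, here is a minimal Python sketch of a batch ringbuffer with this property (names are hypothetical; the DRP's actual implementation is C++):

Code Block
# The head may not pass the tail: if the in-progress batch is never
# completed/released, allocation stalls and everything backs up.
class BatchRing:
    def __init__(self, nbatches):
        self.n = nbatches        # capacity, e.g. ~4 seconds of batches
        self.head = 0            # next batch to allocate
        self.tail = 0            # oldest batch not yet released

    def allocate(self):
        if self.head - self.tail >= self.n:
            return None          # ring full: head would pass tail, so stall
        index = self.head % self.n
        self.head += 1
        return index

    def release(self):           # a completed/flushed batch frees a slot
        if self.tail < self.head:
            self.tail += 1

If the stream stops with a batch still in progress, release() never runs, allocate() eventually returns None forever, and nothing further (transitions or L1s) gets processed, matching the observed hang.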

...

Sent email to Michael Davidsaver, Hugo Slepicka and the control group to see if they had seen this before.  They had not.  Michael wrote: "At a guess, a client disconnect will be involved."  Michael wrote back later that he found a race condition that could explain it.  Will be addressed in a future release.

 

 

(diagnostic) UNSOLVED: According to hsdpva only one hsd was producing dead time (due to msgdelay being set too short), but xpmpva showed all hsd's having dead time.  At other times xpmpva seems to attribute dead time correctly, so it's somehow an intermittent issue, perhaps only happening when Ben detects his front-end buffers have overflowed (which causes him to latch dead time until receiving clear readout).

 

(diagnostic) UNSOLVED: Got a segv in this log file: /reg/neh/home/cpo/2020/06/15_18:39:03.  Possibly the same issue as above, but this time with hsdpva.  See ~claus/2020/08/27_10:34:36_daq-tst-dev03.pcdsndev02:hsdpva.log (and again in /reg/neh/home/claus/2020/07/24_17:12:50_daq-tst-dev02:hsdpva.log, twice).  Restarting the process fixed it.  Later, saw this log /reg/neh/home/cpo/2020/06/16_15:49:13_daq-tst-dev03.pcdsn:hsdpva.log:

Code Block
Error processing Subscription event: 'XPM:2:AMC0-1'
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 124, in _handle
    self._cb(E)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 64, in monitor_cb
    callback(err=None)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 548, in update
    q = self.pv.get().value
AttributeError: 'ntstr' object has no attribute 'value'
hsd_6400m: Vivado v2019.1, rdsrv300 (x86_64), Built Thu 12 Mar 2020 02:42:02 PM PDT by weaver
XPM:2:AMC0-1
Fatal Python error: GC object already tracked

perhaps related to https://stackoverflow.com/questions/23178606/debugging-python-fatal-error-gc-object-already-tracked?

 


 

Code Block
hsd_6400m: Vivado v2019.1, rdsrv300 (x86_64), Built Fri 31 Jul 2020 03:05:55 PM PDT by weaver
XPM:1:AMC1-0
terminate called after throwing an instance of 'std::bad_weak_ptr'
 what(): bad_weak_ptr

 

Recovered by telnetting to the process and restarting using ^X.

On a subsequent restart of the .cnf, the hsdpva process died on signal 11 after printing the XPM:1:AMC1-0 line with no further output.

 


(diagnostic) UNSOLVED: Started a run with 5 HSD pairs.  Got to state Running with GroupCA's NumL0Acc reaching 23 before the system went into 100% dead time.  GroupCA attributed it to HSD:dev07.89.  xpmpva for XPM:1 shows no deadtime for any device in my group (3), while for XPM:2 the link to XPM:1 shows ~1.0.  Perhaps the XPM:1 DeadTime tab is not updating at all, since all Group columns show 0.0000.  Found '172.21.148.110:34458 : Connection closed with RX socket error 113' in the pyxpm-1 log file.  Restarted the process.  No change.  Restarted xpmpva.  No change.  However, now HSD:dev07.1a is shown to be the source of the deadtime in both groupca and the xpmpva deadtime tab for XPM:2.  After 2 further restarts of the full system (due to deadtime), all 5 HSD pairs + timing system DRP ran fine without deadtime.

 


(diagnostic) SOLVED: Message sent to mikhail/caf: One small issue I noticed is that the progress bar got “stuck” at 11% when the configure was timing out (it was timing out for reasons that aren’t relevant to this message).  You can see the logfile here:
/reg/neh/home/cpo/2020/06/16_15:49:13_daq-tst-dev03.pcdsn:control_gui.log

At 17:34:33 there is a line like:

2020-06-16 17:34:33,789 tst-CGWMain[14913]: <D> received progress msg: {'transition': 'configure', 'elapsed': 17, 'total': 150}

(note that 17/150=11% progress that I saw in control_gui).  Then the GUI didn’t put out any other messages until the transition timed out at 17:36:46 (and correspondingly the progress bar didn’t update during that period).

Not critical, but would be nice to understand why the progress bar stopped updating.  Any thoughts on this?


(spurious error) UNSOLVED: On 'procmgr restart tmo.cnf -v -p 3', Ric got:

 

Code Block
...
sending ^Q to 'drp-tst-dev015:tmocam1_0' (drp-tst-dev015 port 29306) ... done
sending ^Q to 'drp-tst-dev004:tmocam2_0' (drp-tst-dev004 port 29304) ... done
sending ^Q to 'localhost:procstat' (localhost port 29307)    ... done
/reg/neh/home/claus/lclsii/daq/test/lcls2-200602/install/bin/procmgr: removed 'p3.cnf.running'
/reg/neh/home/claus/lclsii/daq/test/lcls2-200602/install/bin/procmgr: using config file '../../../../test/lcls2-200602/psdaq/psdaq/cnf/tmo.cnf' to start
*** ERR: failed to read procServ banner for 'groupca' on host localhost
Traceback (most recent call last):
  File "/reg/neh/home/claus/lclsii/daq/test/lcls2-200602/install/bin/procmgr", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/procmgr/procmgr", line 316, in <module>
    rv = _start(sys.argv[0], args, runconfig, lastconfig, configfile, options, procmgr_macro)
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/procmgr/procmgr", line 85, in _start
    handle = ProcMgr(configfile, options.platform, options.Xterm_list, options.xterm_list, procmgr_macro)
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaqinstall/psdaqbin/procmgr/ProcMgr.py", line 601, in __init__
    if ((self.tmpstatus != self.STATUS_NOCONNECT) and \
AttributeError: 'ProcMgr' object has no attribute 'tmpstatus'


SOLVED: hsdpva crash on startup.  p4p get timeout too short? default is 5, try increase to 10.  Real solution: make hsdpvs (hsd pv server) long-lived procmgr process.


Code Block
(ps-3.1.11) daq-tst-dev03:cnf$ cat /reg/neh/home/cpo/2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log
# ID:      hsdpva
# PLATFORM:4
# HOST:    daq-tst-dev03.pcdsn
# CMDLINE: hsdpva DAQ:LAB2:HSD:DEV07_1A:A DAQ:LAB2:HSD:DEV07_1A:B DAQ:LAB2:HSD:DEV07_88:A DAQ:LAB2:HSD:DEV07_88:B DAQ:LAB2:HSD:DEV07_89:A DAQ:LAB2:HSD:DEV07_89:B DAQ:LAB2:HSD:DEV07_B1:A DAQ:LAB2:HSD:DEV07_B1:B DAQ:LAB2:HSD:DEV07_B2:A DAQ:LAB2:HSD:DEV07_B2:B
# TESTRELDIR:/reg/neh/home/cpo/git/lcls2/install
# CONDA_REL:ps-3.1.11
# GIT_DESCRIBE:3.0.4-39-ga30d0131
@@@ Restarting child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log"
@@@    (as /bin/env)
@@@ The PID of new child "2020/06/04_17:39:45_daq-tst-dev03.pcdsn:hsdpva.log" is: 23772
@@@ @@@ @@@ @@@ @@@
5.12.3
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 270, in get
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/reg/neh/home/cpo/git/lcls2/install/bin/hsdpva", line 11, in <module>
    load_entry_point('psdaq', 'console_scripts', 'hsdpva')()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 349, in main
    ui.setupUi(MainWindow,args.base)
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 313, in setupUi
    maintab.addTab( HsdJesd         (title), 
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/hsdpva.py", line 245, in __init__
    pvTtl.get()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/cas/pvedit.py", line 78, in get
    self.__value__ = pvactx.get(self.pvname)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/client/thread.py", line 274, in get
    raise TimeoutError()
TimeoutError
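
For reference, the timeout mentioned above is an ordinary argument to p4p's thread-client get(); a minimal sketch (the PV name here is hypothetical):

Code Block
from p4p.client.thread import Context

# p4p's thread-client get() defaults to a 5 s timeout; a PV server that
# is still starting up may need longer.  The PV name is hypothetical.
ctxt = Context('pva')
value = ctxt.get('DAQ:LAB2:HSD:DEV07_1A:A:SOMEPV', timeout=10.0)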

 

SOLVED: See 100% dead time from all 5 hsd's.  Ric suggests looking at the Read Buffers section in /proc/datadev_0 (and 1).  This suggests that all buffers are in the kcu (not stuck in software).  Disable times out.  Answer (after working with Matt): msgdelay was set to 91 in hsdpva, too short, so the front-end buffers were overwritten.  Fixed using "pvput DAQ:LAB2:XPM:2:PART:4:L0Delay 98".  Setting it to 98 displays it as 99 for reasons I don't understand.

Code Block
  Buffers In User : 0          (number of buffers in software)
  Buffers In Hw : 4095         (number of buffers in the kcu)
  Buffers In Pre-Hw Q : 61334  (number of buffers driver has available to hand to the kcu)
  Buffers In Rx Queue : 0      (buffers transferred from kcu to driver, ready to be received by dmaReadBulkIndex)
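
These counters are easy to watch from a script; a sketch that assumes the /proc/datadev_0 layout shown above:

Code Block
# Parse the "Buffers In ..." counters from /proc/datadev_0 (field names
# taken from the listing above; the layout is otherwise assumed).
def read_buffer_counts(path='/proc/datadev_0'):
    counts = {}
    with open(path) as f:
        for line in f:
            key, sep, val = line.partition(':')
            if sep and key.strip().startswith('Buffers In'):
                counts[key.strip()] = int(val.split()[0])
    return counts

# e.g. {'Buffers In User': 0, 'Buffers In Hw': 4095, ...}
# -> everything sitting in the kcu, nothing stuck in software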

 

SOLVED (1MHz): Running 5 hsd nodes at 1 MHz, saw this on dev010 and the node became unresponsive (disable timed out?).  Removed dev010 but saw a disable timeout on dev019:

Code Block
Message from syslogd@drp-tst-dev010 at Jun  5 13:02:43 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [swapper/10:0]

The problem goes away if we reduce the trigger rate to 71 kHz.  Even "ls" on the drp node will hang until the trigger rate is reduced to 71 kHz.  The "ls" hang is reproducible even when running only 1 of the two hsd's on a drp.  The pgp driver and firmware haven't changed since February.  Caused by the conda compiler work?  No: can see the problem with pgpread with both old and new compilers.  Maybe it's having 2 hsd's on a drp (interrupts may still be firing even when we don't run the second hsd drp executable).  Matt has been unable to reproduce with fakecam so far.  Update: found that even when running 1 hsd, interrupts for both hsd's were firing (saw counters in /proc/datadev_1 incrementing).  Followed Matt's advice: an "unconfigure" disabled the second hsd's interrupts, and in this configuration there were no more soft lockups.  Checked that it behaved with datadev_0 or datadev_1 running individually, just not both together.  Solution: add an interrupt-hold-off parameter to the firmware to throttle the maximum interrupt rate (in addition to the existing pattern of only-interrupt-when-queue-goes-non-empty).
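
The hold-off logic amounts to a small state machine; a toy Python model of the idea (illustrative only - the real fix is in firmware, and the deferred-fire path is elided here):

Code Block
import time

# Toy model of interrupt hold-off: fire only when the queue goes
# non-empty AND at least `holdoff` seconds have elapsed since the last
# interrupt.  (In hardware, a blocked interrupt would fire once the
# hold-off expires; that path is not modeled in this sketch.)
class IrqThrottle:
    def __init__(self, holdoff_s):
        self.holdoff = holdoff_s
        self.last = float('-inf')
        self.queue_empty = True

    def on_enqueue(self):
        fire = self.queue_empty and \
               (time.monotonic() - self.last) >= self.holdoff
        self.queue_empty = False
        if fire:
            self.last = time.monotonic()
        return fire              # True -> raise the interrupt now

    def on_drain(self):
        self.queue_empty = True  # next enqueue may interrupt again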


 

SOLVED: After two tmo.cnf runs consisting of pvcam, epics, bld, ts, 3 fakecams, 10 hsds, which lasted more than a few minutes, all 10 hsds didn't respond to Disable.  The teb log files (/reg/neh/home/claus/2020/07/24_17:12:50_drp-tst-dev016:teb0.log, /reg/neh/home/claus/2020/07/24_19:01:01_drp-tst-dev016:teb0.log) show that two L1Accepts and the Disable were timed out due to missing all HSD contributions.  The HSDs were being triggered at 360 Hz, which matches the time difference between the L1Accepts.  SlowUpdates were also occurring.  On another run attempt lasting no more than a minute or so, the Disable (and subsequent transitions) proceeded correctly.

 

Interacting with Matt, I (Ric) saw that the hsdpva GUI's Timing tab showed msgdelayset to have value 91.  This should normally be 98 or 99, but Matt said this shouldn't make a difference at 360 Hz.  Nonetheless, it feels like doing 'pvput DAQ:LAB2:XPM:2:PART:3:L0Delay 97', as per Chris's entry above, makes the problem harder to reproduce.

 

Matt asked me to watch the hsdpva Timing tab while the Disable is being timed out by the TEB.  The headerfifow and headercntof values stay at zero for the entire time before Disable is selected in control_gui.  About a second after Disable is selected, headerfifow starts to count at 1 Hz.  When it reaches 16, headercntof jumps to 10.  When headerfifow saturates at 31, headercntof becomes 11.  Matt thought that the timing headers being counted at 1 Hz are SlowUpdates, so we tried running with them disabled.  The problem would not reproduce in that case.

 

It seems strange that headerfifow would be counting SlowUpdates (or anything else, for that matter) after Disable is selected in control_gui.  By the time of the first count, I'm pretty sure that all DRPs have printed a line indicating they've received phase 1 of Disable.  Thus, control.py should have stopped issuing SlowUpdates.  So, if these are not SlowUpdates, what are they?  And why at 1 Hz?

 

This was due to a FIFO in the firmware that was capturing the ADC serial link lock status.  The FIFO filled on an L1Accept and drained on any transition; non-L1Accept transitions didn't wait for an entry to be ready.  Thus, a SlowUpdate shortly preceding an L1Accept could "steal" this bit, and the L1Accept would have to wait for the next trigger for completion.  Disable prevents that next trigger from coming, thus deadlock.  This explains why it only happens with SlowUpdates running and why the msgdelayset has an influence (the window for "stealing" is determined by that setting).  The (firmware) fix is to capture and release that lock status only for L1Accepts.  The proper disabling of SlowUpdates is addressed elsewhere.
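
A toy Python model of that race (illustrative only; the actual mechanism is in the HSD firmware):

Code Block
from collections import deque

# The ADC lock status is captured into a FIFO when a trigger fires;
# every completed transition pops one entry, but only L1Accepts wait
# for an entry to be ready.
fifo = deque()

def capture_on_trigger():
    fifo.append('adc-lock-status')

def complete(transition):
    if transition == 'L1Accept':
        while not fifo:
            pass                     # waits -> deadlocks if never refilled
        return fifo.popleft()
    return fifo.popleft() if fifo else None   # e.g. SlowUpdate: no wait

capture_on_trigger()                 # entry intended for the next L1Accept
complete('SlowUpdate')               # the SlowUpdate "steals" it
# complete('L1Accept') would now spin until another trigger refills the
# FIFO; Disable stops triggers from coming, hence the deadlock.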

TS (TimingSystem DRP)

SOLVED: From time to time, the following traceback appears during tmo.cnf testing (cf. ~/2020/07/15_14\:15\:25_drp-tst-dev017\:tmots_0.log):

Code Block
*** setting xpm link enables ['DAQ:LAB2:XPM:1:LinkGroupMask4', 'DAQ:LAB2:XPM:1:LinkGroupMask2', 'DAQ:LAB2:XPM:1:LinkGroupMask8', 'DAQ:LAB2:XPM:1:LinkGroupMask5', 'DAQ:LAB2:XPM:0:LinkGroupMask255', 'DAQ:LAB2:XPM:1:LinkGroupMask5', 'DAQ:LAB2:XPM:1:LinkGroupMask9'] [8, 4, 8, 128, 128, 4, 4]
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 366, in put
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 135, in ts_connect
    connector = ts_connector(json_connect_info)
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 47, in __init__
    self.xpm_link_enable()
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 111, in xpm_link_enable
    self.ctxt.put(pv_names,values)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 369, in put
    raise TimeoutError()
TimeoutError

...

  • hitting rx/tx link resets in PgpMon[0]
  • ConfigureXpmMini->ConfigLCLSTimingV2 (don't think this can affect the link to the camera)
  • Tx/Rx link resets in the xpm (don't think this can affect the link to the camera)
  • powercycle the drp node (drp-tst-dev003) and repeat the above
  • loading a yaml file (git/cameralink-gateway/software/config/Opal1000.yml)

Feels like we need to power cycle the FEB?

AMI

UNSOLVED: Got the following when running ami.cnf against a running tmo.cnf system that includes HSDs:

Code Block
Traceback (most recent call last):
 File "/reg/neh/home/claus/lclsii/daq/test/lcls2-200602/install/bin/ami-worker", line 11, in <module>
   load_entry_point('ami', 'console_scripts', 'ami-worker')()
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/worker.py", line 387, in main
   flags)
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/worker.py", line 253, in run_worker
   return worker.run()
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/worker.py", line 159, in run
   for msg in self.src.events():
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 626, in events
   self._update(run)
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 626795, in events_update
   self._update(run_group(detname, det_xface_name, det_attr_list, is_env_det)
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 795, in _update
   self._update_group(detname, det_xface_name, det_attr_list, is_env_det)
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 738, in _update_group
   group_types[attr] = self.data_types[attr_name]
KeyError: 'tmohsd:raw:peaks'

Message sent to Seshu and CPO.

BLD

UNSOLVED: In recent tmo.cnf runs, the BLD has consistently been marking every event it handles with MissedData damage.  Adding a print to Pgp::next() results in lines being printed like:

Code Block
Missed BLD: PID 61939cc1184cc327, TH PID 000011aaf02c72, ts 398227b806e0c6f6
tst-drp_bld[200900]: <D> PGPReader  lane 0  size 432  hdr 0c000011aaf02c72.398227b806e0c6f6.a3110008

The first pulse ID value shown is received by Bld::next() from the multicast socket.  It looks nothing like the TimingHeader pulse ID.  On an earlier try, I got:

Code Block
Missed BLD: PID 3980f65135993731, TH PID 000000c3effd0f
tst-drp_bld[93557]: <D> PGPReader  lane 0  size 432  hdr 0c000000c3effd0f.3980f65433113b5a.9f0f0008

In this case, part of the pulse ID value received from the multicast socket was in common with some of the upper 32 bits of the TimingHeader timestamp (3980f65).
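
The overlap is easy to check numerically, using the two values quoted just above (field layout taken from the hdr printout):

Code Block
# Values from the second log excerpt above.
mcast_pid = 0x3980f65135993731   # "pulse ID" from the multicast socket
th_ts     = 0x3980f65433113b5a   # TimingHeader timestamp from the hdr

# The top 28 bits (7 hex digits) agree: 0x3980f65.  Consistent with the
# multicast value resembling a timestamp rather than a pulse ID.
print(hex(mcast_pid >> 36), hex(th_ts >> 36))   # 0x3980f65 0x3980f65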

Miscellaneous

UNSOLVED: On one attempt to record a run with tmo.cnf, the control_gui reported bld failing to respond to BeginRun.  The teb log file (/reg/neh/home/claus/2020/07/24_19:15:28_drp-tst-dev016:teb0.log) shows the BeginRun event to be split.  All contributors but the bld arrived in the teb within the 5 second event build timeout period.  Later (not clear how much later) the bld contribution arrived, starting a new event, for which all the other contributions didn't show up within the 5 second timeout period (since they had already arrived for the previous event).  Because the pulse ID of this event was the same as that of the previous event (i.e., didn't advance), the teb asserted.

hsdpva

UNSOLVED: It doesn't happen very often, but sometimes the hsdpva process segfaults.  Seems very strange since it is Python.  I've (Ric) not been able to capture it in gdb yet.