...

The problem goes away if we reduce the trigger rate to 71 kHz. Even "ls" on the drp node will hang until the trigger rate is reduced to 71 kHz. The "ls" hang is reproducible even when running only one of the two hsd's on a drp. The pgp driver and firmware haven't changed since February. Caused by the conda compiler work? No: the problem can be seen with pgpread built with both the old and new compilers. Maybe it's having two hsd's on a drp (interrupts may still be firing even when we don't run the second hsd drp executable). Matt has been unable to reproduce with fakecam so far.

Update: found that even when running only one hsd, interrupts for both hsd's were firing (saw counters in /proc/datadev_1 incrementing). Followed Matt's advice and "unconfigure" disabled the second hsd's interrupts, and in this configuration there were no more soft lockups. Checked that it behaved with datadev_0 or datadev_1 running individually, just not both together.

Solution: add an interrupt-hold-off parameter to the firmware to throttle the maximum interrupt rate (in addition to the existing pattern of only-interrupt-when-queue-goes-non-empty).
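To illustrate why a hold-off bounds the interrupt rate, here is a toy Python model. It is only a sketch and not the firmware logic: the function name `simulate_interrupts`, the integer-nanosecond time base, and the assumption that the driver drains the whole queue instantly on each interrupt are all hypothetical. It combines the only-interrupt-when-queue-goes-non-empty rule with a minimum spacing between interrupts.

```python
def simulate_interrupts(arrival_times_ns, holdoff_ns):
    """Toy model: return the times (ns) at which interrupts would fire.

    Assumptions (hypothetical, for illustration only):
      * an interrupt is requested only when the queue goes non-empty;
      * an interrupt may not fire sooner than holdoff_ns after the last one;
      * the driver drains every queued event instantly when interrupted.
    """
    irq_times = []
    next_allowed = float("-inf")   # earliest time the next interrupt may fire
    drained_until = float("-inf")  # arrivals up to this time are already serviced
    for t in sorted(arrival_times_ns):
        if t <= drained_until:
            # Event arrived while an interrupt was being held off and was
            # picked up when that deferred interrupt finally fired.
            continue
        fire = max(t, next_allowed)  # defer if still inside the hold-off window
        irq_times.append(fire)
        drained_until = fire
        next_allowed = fire + holdoff_ns
    return irq_times


if __name__ == "__main__":
    # Illustrative numbers only: 1000 events spaced 1 us apart.
    arrivals = list(range(0, 1_000_000, 1_000))
    for holdoff in (0, 10_000):
        n = len(simulate_interrupts(arrivals, holdoff))
        print(f"holdoff={holdoff} ns -> {n} interrupts")
```

In this model, with no hold-off every event that finds the queue empty raises its own interrupt, so the interrupt rate follows the trigger rate. With a hold-off, the interrupt rate cannot exceed one per hold-off period no matter how high the trigger rate is.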

TS (TimingSystem DRP)

SOLVED: From time to time, the following traceback appears during tmo.cnf testing (cf. ~/2020/07/15_14\:15\:25_drp-tst-dev017\:tmots_0.log):

*** setting xpm link enables ['DAQ:LAB2:XPM:1:LinkGroupMask4', 'DAQ:LAB2:XPM:1:LinkGroupMask2', 'DAQ:LAB2:XPM:1:LinkGroupMask8', 'DAQ:LAB2:XPM:1:LinkGroupMask5', 'DAQ:LAB2:XPM:0:LinkGroupMask255', 'DAQ:LAB2:XPM:1:LinkGroupMask5', 'DAQ:LAB2:XPM:1:LinkGroupMask9'] [8, 4, 8, 128, 128, 4, 4]
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 366, in put
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 135, in ts_connect
    connector = ts_connector(json_connect_info)
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 47, in __init__
    self.xpm_link_enable()
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 111, in xpm_link_enable
    self.ctxt.put(pv_names,values)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 369, in put
    raise TimeoutError()
TimeoutError

In the above dump, one can see a reference to XPM:0 among the PVs being accessed.  This is a nonexistent PV, and the put to it is what times out and causes the traceback.  The zero is related to the Opal issue described below.
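A check along the following lines could flag such a PV before the put is attempted. This is a minimal sketch and not what ts_connect.py does: the helper `split_by_known_xpm`, the regular expression, and the set of known XPM ids passed in are all hypothetical, chosen only to match the PV names that appear in the log above.

```python
import re

# Matches names shaped like 'DAQ:LAB2:XPM:1:LinkGroupMask4' (shape taken from the log).
_XPM_PV = re.compile(r"^(?P<prefix>.+):XPM:(?P<xpm>\d+):(?P<field>\w+)$")


def split_by_known_xpm(pv_names, known_xpms):
    """Split PV names into (good, suspect) lists.

    A name is 'good' if it parses and its XPM id is in known_xpms.
    known_xpms is supplied by the caller; it is an assumption of this sketch.
    """
    good, suspect = [], []
    for name in pv_names:
        m = _XPM_PV.match(name)
        if m and int(m.group("xpm")) in known_xpms:
            good.append(name)
        else:
            suspect.append(name)
    return good, suspect


if __name__ == "__main__":
    pv_names = [
        "DAQ:LAB2:XPM:1:LinkGroupMask4",
        "DAQ:LAB2:XPM:1:LinkGroupMask2",
        "DAQ:LAB2:XPM:1:LinkGroupMask8",
        "DAQ:LAB2:XPM:1:LinkGroupMask5",
        "DAQ:LAB2:XPM:0:LinkGroupMask255",
        "DAQ:LAB2:XPM:1:LinkGroupMask5",
        "DAQ:LAB2:XPM:1:LinkGroupMask9",
    ]
    # The set of valid XPM ids here is an assumption for the example.
    good, suspect = split_by_known_xpm(pv_names, known_xpms={1, 2})
    print("suspect PVs:", suspect)
```

Reporting the suspect names up front would point directly at the bad PV, whereas the bare TimeoutError in the traceback does not say which of the PVs failed.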

Opal

SOLVED: There's something weird going on with the Opal, which worked fine yesterday.  Ric has taken several tmo.cnf runs without any complaints from the Opal DRP.  All transitions go through fine, as do SlowUpdates.  Yet readout group 7 is 100% dead.  XPM:2 AMC0 channel 6 RemoteLinkId is 'undef' and the LinkRxErrs are counting.  Attempts to reset the channel (either Rx or Tx) don't seem to have any effect.  I'm pretty sure this channel is the one that's connected to the Opal.  There's no indication that there's anything wrong with data taking: no errors or damage.

...

Actually, after a few more tries, I find that the system runs as described above after a fresh 'procmgr start tmo.cnf', but cycling the state machine from Running through Reset back to Running causes the TimingSystem XPM register write failure.  Without the Opal included, this does not happen.  With further tries, it turns out not to be as reproducible as I initially thought.

Matt explained that the KCU's Tx link was down while its Rx link was up, so transitions could be received, but the XPM's deadtime signal was latched asserted because the KCU-to-XPM link was down.  He further traced the KCU's Tx link being down to "the reset that gets asserted when loading the datadev driver that causes the timing link to go into this unrecoverable state."  He contacted Ben for a proper fix.