...

The problem goes away if we reduce the trigger rate to 71 kHz.  Even "ls" on the DRP node hangs until the trigger rate is reduced to 71 kHz.  The "ls" hang is reproducible even when running only one of the two HSDs on a DRP.  The PGP driver and firmware haven't changed since February.  Caused by the conda compiler work?  No: the problem appears with pgpread built with both the old and new compilers.  Maybe it's having two HSDs on one DRP (interrupts may still be firing even when we don't run the second HSD DRP executable).  Matt has been unable to reproduce it with fakecam so far.  Update: found that even when running one HSD, interrupts for both HSDs were firing (saw the counters in /proc/datadev_1 incrementing).  Following Matt's advice, "unconfigure" disabled the second HSD's interrupts, and in that configuration there were no more soft lockups.  Checked that the system behaved with datadev_0 or datadev_1 running individually, just not both together.  Solution: add an interrupt-hold-off parameter to the firmware to throttle the maximum interrupt rate (in addition to the existing pattern of only interrupting when the queue goes non-empty).
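For reference, a minimal sketch of how the incrementing interrupt counters could be watched from the DRP node.  The exact layout of /proc/datadev_1 is an assumption here; the script just polls the file once a second and prints any numeric field that increased.

Code Block
languagepy
#!/usr/bin/env python3
# Sketch only: poll /proc/datadev_1 and report numeric fields that increment.
# The "name : value" layout is an assumption; adjust the regex to the real format.
import re
import time

PROC = '/proc/datadev_1'

def read_counters(path):
    with open(path) as f:
        text = f.read()
    # Loose parse: lines that look like "name : integer"
    return {m.group(1).strip(): int(m.group(2))
            for m in re.finditer(r'^\s*([^:\n]+):\s*(\d+)\s*$', text, re.M)}

prev = read_counters(PROC)
while True:
    time.sleep(1.0)
    cur = read_counters(PROC)
    for name, value in cur.items():
        if value > prev.get(name, value):
            print(f'{name}: {prev[name]} -> {value}')
    prev = cur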

...

 

SOLVED: After two tmo.cnf runs consisting of pvcam, epics, bld, ts, 3 fakecams, and 10 hsds, which lasted more than a few minutes, all 10 hsds didn't respond to Disable.  The teb log files (/reg/neh/home/claus/2020/07/24_17:12:50_drp-tst-dev016:teb0.log, /reg/neh/home/claus/2020/07/24_19:01:01_drp-tst-dev016:teb0.log) show two L1Accepts and the Disable were timed out due to missing all HSD contributions.  The HSDs were being triggered at 360 Hz, which matches the time difference between the two L1Accepts.  SlowUpdates were also occurring.  On another run attempt lasting no more than a minute or so, the Disable (and subsequent transitions) proceeded correctly.

 

Interacting with Matt, I (Ric) saw that the hsdpva GUI's Timing tab showed msgdelayset to have value 91.  This should normally be 98 or 99, but Matt said this shouldn't make a difference at 360 Hz.  Nonetheless, it feels like doing 'pvput DAQ:LAB2:XPM:2:PART:3:L0Delay 97', as per Chris's entry above, makes the problem harder to reproduce.

 

Matt asked me to watch the hsdpva Timing tab while the Disable is being timed out by the TEB.  The headerfifow and headercntof values stay at zero for the entire time before Disable is selected in control_gui.  About a second after Disable is selected, headerfifow starts to count at 1 Hz.  When it reaches 16, headercntof jumps to 10.  When headerfifow saturates at 31, headercntof becomes 11.  Matt thought that the timing headers being counted at 1 Hz are SlowUpdates, so we tried running with them disabled.  The problem would not reproduce in that case.

 

It seems strange that headerfifow would be counting SlowUpdates (or anything else, for that matter) after Disable is selected in control_gui.  By the time of the first count, I'm pretty sure that all DRPs have printed a line indicating they've received phase 1 of Disable.  Thus, control.py should have stopped issuing SlowUpdates.  So, if these are not SlowUpdates, what are they?  And why at 1 Hz?

 

This was due to a FIFO in the firmware that was capturing the ADC serial link lock status.  The FIFO filled on an L1Accept and drained on any transition; non-L1Accept transitions didn't wait for an entry to be ready.  Thus, a SlowUpdate shortly preceding an L1Accept could "steal" this bit, and the L1Accept would have to wait for the next trigger for completion.  Disable prevents that next trigger from coming, thus deadlock.  This explains why it only happens with SlowUpdates running and why the msgdelayset has an influence (the window for "stealing" is determined by that setting).  The (firmware) fix is to capture and release that lock status only for L1Accepts.  The proper disabling of SlowUpdates is addressed elsewhere.
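A toy model of the race described above (plain Python, not the firmware) may make the deadlock easier to see: triggers push the lock-status bit into a FIFO, every datagram pops one entry, but only L1Accepts wait for an entry to be present.

Code Block
languagepy
# Toy model of the FIFO race described above -- illustrative only.
from collections import deque

fifo = deque()

def trigger():
    # An L1Accept trigger captures the ADC serial-link lock status.
    fifo.append('lock-status')

def transition(kind):
    if kind == 'L1Accept':
        if not fifo:
            return 'waiting for next trigger'   # deadlock if Disable stops triggers
        fifo.popleft()
    elif fifo:
        fifo.popleft()                          # e.g. a SlowUpdate takes the entry without waiting
    return 'complete'

trigger()                                       # trigger fires for an L1Accept
print(transition('SlowUpdate'))                 # lands in the window first and "steals" the bit
print(transition('L1Accept'))                   # -> 'waiting for next trigger'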

TS (TimingSystem DRP)

SOLVED: From time to time, the following traceback appears during tmo.cnf testing (cf. ~/2020/07/15_14\:15\:25_drp-tst-dev017\:tmots_0.log):

Code Block
languagetext
*** setting xpm link enables ['DAQ:LAB2:XPM:1:LinkGroupMask4', 'DAQ:LAB2:XPM:1:LinkGroupMask2', 'DAQ:LAB2:XPM:1:LinkGroupMask8', 'DAQ:LAB2:XPM:1:LinkGroupMask5', 'DAQ:LAB2:XPM:0:LinkGroupMask255', 'DAQ:LAB2:XPM:1:LinkGroupMask5', 'DAQ:LAB2:XPM:1:LinkGroupMask9'] [8, 4, 8, 128, 128, 4, 4]
Traceback (most recent call last):
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 366, in put
    value, i = done.get(timeout=timeout)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/queue.py", line 178, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 135, in ts_connect
    connector = ts_connector(json_connect_info)
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 47, in __init__
    self.xpm_link_enable()
  File "/reg/neh/home5/claus/lclsii/daq/test/lcls2-200602/psdaq/psdaq/configdb/ts_connect.py", line 111, in xpm_link_enable
    self.ctxt.put(pv_names,values)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.12/lib/python3.7/site-packages/p4p/client/thread.py", line 369, in put
    raise TimeoutError()
TimeoutError

In the above dump, one can see a reference to XPM:0 in the PVs being accessed.  This is a nonexistent PV, which is what causes the traceback.  The zero is related to the Opal issue described below.
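One way to pin down which PV is at fault (a hypothetical diagnostic, not part of ts_connect.py) is to issue the puts one at a time with a short timeout, so the bogus XPM:0 name shows up by itself instead of timing out the whole batched ctxt.put().  The PV list is shortened here for illustration.

Code Block
languagepy
# Hypothetical diagnostic: put the link-enable PVs individually to identify a
# nonexistent PV (the batched put in xpm_link_enable() times out as a whole).
from p4p.client.thread import Context

pv_names = ['DAQ:LAB2:XPM:1:LinkGroupMask4', 'DAQ:LAB2:XPM:0:LinkGroupMask255']
values   = [8, 128]

ctxt = Context('pva')
for name, value in zip(pv_names, values):
    try:
        ctxt.put(name, value, timeout=2.0)
        print(f'{name} <- {value}: ok')
    except TimeoutError:
        print(f'{name}: put timed out -- PV may not exist')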

Opal

SOLVED: There's something weird going on with the Opal, which worked fine yesterday.  Ric has taken several tmo.cnf runs without any complaints from the opal DRP.  All transitions go through fine, as do SlowUpdates.  Yet readout group 7 is 100% dead.  XPM:2 AMC0 channel 6 RemoteLinkId is 'undef' and the LinkRxErrs are counting.  Attempts to reset the channel (either Rx or Tx) don't seem to have any effect.  I'm pretty sure this channel is the one that's connected to the Opal.  There's no indication that there's anything wrong with data taking: no errors or damage.

I'm not sure whether it's related, but the above error where the ts DRP is timing out trying to write to an XPM register is happening more frequently than usual.

Actually, after a few more tries, I find that the system runs as described above after a fresh 'procmgr start tmo.cnf', but cycling the state machine from Running through Reset back to Running causes the TimingSystem XPM register write failure.  Without the Opal included, this does not happen.  With further tries, it's not as reproducible as I initially thought.

Matt explained that the KCU's Tx link was down while its Rx link was up, so transitions could be received, but the XPM's deadtime signal was latched asserted because the KCU-to-XPM link was down.  He further traced the KCU's Tx link being down to "the reset that gets asserted when loading the datadev driver that causes the timing link to go into this unrecoverable state."  He contacted Ben for a proper fix.

 

UNSOLVED: After power cycling the timing system, the Opal stopped working.  This log file shows an error about a _last field that doesn't exist: /reg/neh/home/cpo/2020/07/27_14:41:20_drp-tst-dev003:tmoopal_0.log.  I think that's because in cameralink_gateway/surf/protocols/clink/_UartOpal1000.py the _last attribute is only created if things are working (a defensive check for this is sketched after the list below).  Starting up the standalone rogue GUI (python git/cameralink-gateway/software/scripts/devGui.py --camType Opal1000), the timing link appears to be OK, but HSIO->PgpMon[0]->RxPhyReady (I believe this is the link to the camera) is toggling to False once per second and RxLocalLinkReady/RxRemLinkReady are false.  A previous time this occurred, Larry said that there is a watchdog that tries to reset the link once per second, hence the toggling.  Tried:

  • hitting rx/tx link resets in PgpMon[0]
  • ConfigureXpmMini->ConfigLCLSTimingV2 (don't think this can affect the link to the camera)
  • Tx/Rx link resets in xpm (don't think this can affect the link to the camera)
  • powercycle the drp node (drp-tst-dev003) and repeat the above
  • loading a yaml file (git/cameralink-gateway/software/config/Opal1000.yml)

Feels like we need to power cycle the FEB?
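Regarding the missing _last attribute mentioned above, a hedged sketch of a defensive read (the node handle is an assumption; the point is only to report a missing _last instead of raising):

Code Block
languagepy
# Sketch only: guard access to the UART node's _last attribute, which the log
# suggests is created only once the camera link has produced a response.
def read_last_response(uart_node):
    last = getattr(uart_node, '_last', None)
    if last is None:
        print('no _last attribute -- camera link has not produced a response yet')
    return last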

AMI

UNSOLVED: Got the following when running ami.cnf against a running tmo.cnf system that includes HSDs:

Code Block
languagetext
Traceback (most recent call last):
 File "/reg/neh/home/claus/lclsii/daq/test/lcls2-200602/install/bin/ami-worker", line 11, in <module>
   load_entry_point('ami', 'console_scripts', 'ami-worker')()
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/worker.py", line 387, in main
   flags)
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/worker.py", line 253, in run_worker
   return worker.run()
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/worker.py", line 159, in run
   for msg in self.src.events():
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 626, in events
   self._update(run)
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 795, in _update
   self._update_group(detname, det_xface_name, det_attr_list, is_env_det)
 File "/reg/neh/home5/claus/lclsii/daq/ami/ami/data.py", line 738, in _update_group
   group_types[attr] = self.data_types[attr_name]
KeyError: 'tmohsd:raw:peaks'

Message sent to Seshu and CPO.
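For illustration only (this is not AMI's actual code), a defensive version of the failing lookup in _update_group() would skip an attribute name the worker has not been told about, such as 'tmohsd:raw:peaks', rather than raising:

Code Block
languagepy
# Sketch of a defensive lookup corresponding to the failing line in
# ami/ami/data.py:_update_group() -- the dict names mirror the traceback, but
# this standalone function is hypothetical.
import logging

logger = logging.getLogger(__name__)

def build_group_types(data_types, attrs):
    """attrs: iterable of (attr, attr_name); data_types: known 'det:xface:attr' -> type."""
    group_types = {}
    for attr, attr_name in attrs:
        if attr_name in data_types:
            group_types[attr] = data_types[attr_name]
        else:
            logger.warning('skipping unknown attribute %r', attr_name)
    return group_types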


BLD

UNSOLVED: In recent tmo.cnf runs with the BLD, the BLD has consistently been marking every event it handles with MissedData damage.  Adding a print to Pgp::next() results in lines being printed like:

...