Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
Group: 4  Master: 1  RateSel: 0  DestSel: 8000  Ena: 1
Group: 3  Master: 1  RateSel: 3  DestSel: 8000  Ena: 0
Group: 7  Master: 1  RateSel: 4  DestSel: 8000  Ena: 0
Traceback (most recent call last):
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 457, in update
    self._links[i].update()
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 136, in update
    updatePv(self._pv_remoteLinkId,self._app.remId.get())
  File "/reg/neh/home4/cpo/git/lcls2/psdaq/psdaq/pyxpm/pvstats.py", line 114, in updatePv
    pv.post(value)
  File "/reg/g/psdm/sw/conda2/inst/envs/ps-3.1.11/lib/python3.7/site-packages/p4p/server/raw.py", line 160, in post
    _SharedPV.post(self, self._wrap(value))
RuntimeError: bad_weak_ptr
Caught exception... retrying.


UNSOLVED SOLVED (1MHz): running 5 hsd nodes at 1mhz saw this on dev010 and node became unresponsive (disable timed out?).  removed dev010 but saw disable timeout on dev019:

...

problem goes away if we reduce the trigger rate to 71kHz.  even "ls" on the drp node will hang until the trigger rate is reduced to 71kHz.  "ls" hanging is reproducible even when running only 1 of the two hsd's on a drp.  pgp driver and firmware haven't changed since February.  caused by conda compiler work? no. can see problem with pgpread with both old/new compilers.  maybe it's having 2 hsd's on a drp (interrupts may still be firing even when we don't run the second hsd drp executable).  Matt has been unable to reproduce with fakecam so far.  Update: found that even running 1 hsd that interrupts for both hsd's were firing (saw counters in /proc/datadev_1 incrementing).  Followed Matt's advice and "unconfigure" disabled the second hsd's interrupts, and in this configuration there were no more soft lockups.  Checked that it behaved with datadev_0 or datadev_1 running individually, just not both together.  Solution: add an interrupt-hold-off parameter to the firmware to throttle maximum interrupt rate (in addition to the existing pattern of only-interrupt-when-queue-goes-non-empty).


UNSOLVED (1MHz): with both the fake cam and hsd Matt saw that if he ran at 1MHz then disabled for a few seconds then reenabled that all the buffers were stuck in software and system would hang.  Reproducible 1 out of 3 attempts.

...

PROPOSED SOLUTIONrunning 1 hsd (datadev_0) and after a while the deadtime goes to 100% from that hsd, but timepausecnt is zero for both hsd’s.  I think it's caused by ben's headercntof latching to 1 on the “A” hsd, even though msgdelay is set to 99.   Do we need to increase msgdelay?  Matt says msgdelay is max 100 (corresponding to 1us).  This could be caused by slow updates ignoring dead time if we get long dead time due to the soft-lockup issue above (now solved by Matt, we believe).  So could be the same issue.  Matt will have SlowUpdates pay attention to dead time to avoid this.