...

PROPOSED SOLUTION: running 1 HSD (datadev_0), after a while the dead time from that HSD goes to 100%, but timepausecnt is zero for both HSDs.  I think it's caused by Ben's headercntof latching to 1 on the “A” HSD, even though msgdelay is set to 99.  Do we need to increase msgdelay?  Matt says msgdelay is max 100 (corresponding to 1 µs).  This could be caused by SlowUpdates ignoring dead time if we get long dead time due to the soft-lockup issue above (now solved by Matt, we believe), so it could be the same issue.  Matt will have SlowUpdates pay attention to dead time to avoid this.
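
As a reference for the numbers above, a minimal sketch of the count-to-time conversion, assuming msgdelay scales linearly (100 counts = 1 µs, i.e. 10 ns per count, per Matt's statement); the helper name is made up for illustration:

Code Block
// Hypothetical helper (not actual hsd code): convert a msgdelay register
// value to nanoseconds, assuming 100 counts corresponds to 1 us (10 ns/count).
#include <cstdio>

static double msgdelayNs(unsigned counts) { return counts * 10.0; }

int main()
{
    std::printf("msgdelay=99  -> %4.0f ns\n", msgdelayNs(99));   // 990 ns
    std::printf("msgdelay=100 -> %4.0f ns\n", msgdelayNs(100));  // 1000 ns (the maximum)
    return 0;
}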


UNSOLVED: Sept. 11 '20:  The system (4 DRPs on dev004 and 2 HSDs on dev008) ran normally several times this morning.  When a new run was started, phase 2 of Configure wasn't received by the 2 HSDs.  Instead, their DRPs both died with 'Jump in complete l1Count'.  Each DRP reported a different PulseID and TimeStamp, but in both cases the event was a SlowUpdate where a Configure was expected.  hsdpva shows only the A side having a headercntof of 1.  For both sides, msgdelayset is 98, msgdelayget is 0, headerfifow is 0, and headerfifor is 16.  Starting a new run proceeded correctly, with Configure properly received.
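
For context, a minimal sketch of the kind of event-counter contiguity check that could raise such an error; the struct and function names here are illustrative assumptions, not the actual DRP code:

Code Block
// Illustrative only: a DRP-style check that the event counter (l1Count)
// advances by exactly one per completed event.  A SlowUpdate arriving where
// a Configure was expected would show up as a jump like the one reported.
#include <cinttypes>
#include <cstdint>
#include <cstdio>

struct Event { uint32_t l1Count; uint64_t pulseId; };

// Returns false on a non-contiguous counter, which is how the DRPs died above.
bool checkL1Count(const Event& evt, uint32_t& lastCount)
{
    uint32_t expected = lastCount + 1;
    if (evt.l1Count != expected) {
        std::fprintf(stderr,
                     "Jump in complete l1Count: got %u, expected %u (pulseId %014" PRIx64 ")\n",
                     evt.l1Count, expected, evt.pulseId);
        return false;
    }
    lastCount = evt.l1Count;
    return true;
}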

Not Critical


UNSOLVED (any rate): with both the fake cam and the HSD, Matt saw that if he ran at 1MHz, disabled for a few seconds, then re-enabled, all the buffers were stuck in software and the system would hang.  Reproducible in 1 out of 3 attempts.  Learned that Pause/Resume does not see this problem - it happens only when triggers are simply disabled.  When this happens, the monitoring shows that the DRP does not process all of the L1s; it's like a batch gets stuck.  After that, nothing gets processed (neither transitions nor more L1s).

Found that SlowUpdates prevent the system from hanging.  Ric suggested that the TEB's in-progress batch doesn't get released because the interval between the last L1 and the next datagram is greater than the size of the DRP's batch ringbuffer, which can accommodate 4 seconds of running.  Thus, the batch ringbuffer is not empty, and the head would have to pass the tail in order to allocate another batch.  This doesn't happen between transitions, because a transition flushes any in-progress batch.
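
A minimal sketch of the head/tail condition Ric describes, with illustrative names rather than the actual batch manager: a new batch can only be allocated while the head stays behind the tail, so an in-progress batch that is never released eventually blocks allocation.

Code Block
// Illustrative ringbuffer of batches: the head hands out new slots and the
// tail marks the oldest batch still outstanding.  If the in-progress batch
// is never released (e.g. no SlowUpdate or transition arrives to flush it),
// a later allocation finds the ring full and stalls.
#include <cstddef>
#include <optional>
#include <vector>

struct Batch { /* accumulated L1s, not shown */ };

class BatchRing {
public:
    explicit BatchRing(size_t n) : _batches(n) {}

    // Returns a batch slot, or nothing if the head would pass the tail.
    std::optional<size_t> allocate() {
        size_t next = (_head + 1) % _batches.size();
        if (next == _tail) return std::nullopt;   // stuck: oldest batch not yet released
        size_t slot = _head;
        _head = next;
        return slot;
    }

    // TEB result frees the oldest batch, letting the head advance again.
    void release() { _tail = (_tail + 1) % _batches.size(); }

    // A transition flushes the in-progress batch so it can be released promptly.
    void flush() { /* post current batch to the TEB even if not full */ }

private:
    std::vector<Batch> _batches;
    size_t _head = 0;   // next slot to hand out
    size_t _tail = 0;   // oldest batch still outstanding
};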

...

Sent email to Michael Davidsaver, Hugo Slepicka and the control group to see if they had seen this before.  They had not.  Michael wrote:  "At a guess, a client disconnect will be involved."  Michael wrote back later that he found a race condition that could explain it.  Will be addressed in a future release.

 

 

(diagnostic) UNSOLVED: Possibly the same issue as above, but this time with hsdpva.  See ~claus/2020/08/27_10:34:36_daq-tst-dev02:hsdpva.log.

 

Code Block
hsd_6400m: Vivado v2019.1, rdsrv300 (x86_64), Built Fri 31 Jul 2020 03:05:55 PM PDT by weaver
XPM:1:AMC1-0
terminate called after throwing an instance of 'std::bad_weak_ptr'
 what(): bad_weak_ptr
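
For reference, std::bad_weak_ptr is thrown when a std::shared_ptr is constructed from an expired std::weak_ptr (or, since C++17, when shared_from_this() is called on an object not owned by a shared_ptr).  A minimal reproduction, unrelated to the hsdpva code itself:

Code Block
// Minimal reproduction of std::bad_weak_ptr (not hsdpva code): constructing
// a shared_ptr from a weak_ptr whose referent has already been destroyed.
#include <iostream>
#include <memory>

int main()
{
    std::weak_ptr<int> w;
    {
        auto s = std::make_shared<int>(42);
        w = s;
    }   // s goes out of scope; w is now expired
    try {
        std::shared_ptr<int> p(w);   // throws std::bad_weak_ptr
    } catch (const std::bad_weak_ptr& e) {
        std::cerr << "caught: " << e.what() << '\n';
    }
    return 0;
}

Michael's remark that "a client disconnect will be involved" fits this pattern: an object tied to a disconnected client is destroyed while something else still tries to take shared ownership of it.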

 

 

 

Recovered by telnetting to the process and restarting using ^X.

On a subsequent restart of the .cnf, the hsdpva process died on signal 11 after printing the XPM:1:AMC1-0 line with no further output. 

 

(diagnostic) UNSOLVED: according to hsdpva, only one HSD was producing dead time (due to msgdelay being set too short), but xpmpva showed all HSDs having dead time.  At other times xpmpva seems to attribute dead time correctly, so it's somehow an intermittent issue, perhaps only happening when Ben detects that his front-end buffers have overflowed (which causes him to latch dead time until receiving clear readout).
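
A minimal sketch of the latching behaviour described above (illustrative only, not Ben's firmware): a buffer overflow asserts dead time, which then stays asserted until a clear readout is received.

Code Block
// Illustrative dead-time latch (not the actual firmware): a front-end buffer
// overflow asserts dead time, which stays latched until clear readout arrives.
struct DeadTimeLatch {
    bool latched = false;

    bool update(bool bufferOverflow, bool clearReadout) {
        if (bufferOverflow) latched = true;   // latch on overflow
        if (clearReadout)   latched = false;  // release only on clear readout
        return latched;                       // dead time currently asserted?
    }
};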

...