...

  • PCIe issues in which various cards don't come up right; I think we've seen this with Enet, IB and KCUs now
  • ATCA cards not powering up correctly; maybe a sequencing issue?
  • The incredibly long boot times that some nodes sometimes have

HSD

Critical


SOLVED: when doing repeat configure transitions I would get core files from a subset of tmohsd processes (see ~cpo/problems/drp_segv for the traceback).  These files suggest that threads are being deleted in the ~PGPDetector destructor before being joined.  Ric thinks this core file may be a red herring, created when we receive an infiniband disconnect that triggers a call to PGPDetector::shutdown() from the wrong thread.  See for example /reg/neh/home/cpo/2020/06/30_17:39:15_drp-tst-dev020:tmohsd_2.log (or grep for "signal 11" in *tmohsd* in that directory).  The error Ric sees that worries him is this one: "tst-drp[109274]: <C> Receiver thread lost connection".  But we see that error message many times in that log file (every time we move up and down the state machine, I think).  To do: see if the thread doing the join is different from the PGPDetector destructor thread (creating an intermittent race condition).  Printed out the thread id and it looks the same for the call to join and the destructor.  Saw a different traceback in PGPReader from the psmetric code (also listed in the ~cpo/problems/drp_segv file).  Update: Ric thinks he may have fixed it in a recent push (he said it was a problem with psmetric std::vectors doing "move" when the vectors of monitored variables were increased in size, leaving zeroed pointers behind).  Since then I haven't been able to duplicate the error running up/down the state machine for about an hour, so I'm calling it fixed.

 

UNSOLVED: hsd configure times out the first time, but works the second time.  Matt says he has seen the transition not received at all by the hsd.  Phase 1 completes in Digitizer.cc but phase 2 maybe doesn't start, consistent with Matt's observation?  Log file in /reg/neh/home/cpo/2020/06/12_17:41:55_drp-tst-dev010:tmohsd_0.log.  What I observed: both tmohsd_0/1 completed phase 1 but failed to receive phase 2.  How I think it's supposed to work: tmohsd_0/1 each write some kcu registers (resetting tx/rx/user/qpll).  Both tmohsd_0/1 set config epics vars in their own pva server ("hsdpvs" process).  Both of these servers communicate with one "hsdioc" (running the hsd134PVs c++ executable on dev07).  hsdioc programs both channels of the hsd ("fmc" in the code) and sets READY (e.g. DAQ:LAB2:HSD:DEV07_1A:B:READY).  tmohsd_0/1 each watch for their own A:READY and B:READY for each channel.  Maybe the resets take a while to recover?  Maybe a "reentrancy" problem in hsdioc?  Could it be the clear readout?  But we have a 1s delay in control.py after clear readout (sent on both configure/beginrun).



In these log files saw that only one half of the hsd didn't get phase2: /reg/neh/home/cpo/2020/06/15_18:39:03_drp-tst-dev021:tmohsd_7.log /reg/neh/home/cpo/2020/06/15_18:39:03_drp-tst-dev021:tmohsd_6.log.

took out the resets in Digitizer.cc and didn't see any failures on the configure transition in about 40 tries, but did see one hsd (1A:A) get dead time with headercntof=2 which wasn't cleared until I put the resets back in Digitizer.cc?
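
For reference, here is a minimal sketch of the READY-PV handshake described above, as seen from the drp side.  This is illustrative only: the PV names follow the pattern quoted above but the "_1A" instance is just an example, and the real logic lives in Digitizer.cc and hsd134PVs (C++), not Python.

Code Block
import time
from p4p.client.thread import Context

# Per-channel READY PVs that hsdioc is expected to set once it has programmed
# both fmc channels.  Example names only, following the pattern quoted above.
READY_PVS = ['DAQ:LAB2:HSD:DEV07_1A:A:READY',
             'DAQ:LAB2:HSD:DEV07_1A:B:READY']

def wait_for_ready(deadline_s=10.0, poll_s=0.25):
    """Poll the READY PVs until they all assert, or give up at the deadline."""
    ctx = Context('pva')
    try:
        deadline = time.monotonic() + deadline_s
        pending = set(READY_PVS)
        while pending and time.monotonic() < deadline:
            for pv in sorted(pending):
                # READY is assumed to be a numeric PV, nonzero when hsdioc is done.
                if ctx.get(pv, timeout=2.0):
                    pending.discard(pv)
            if pending:
                time.sleep(poll_s)
        if pending:
            # This is what the first-configure timeout looks like from the drp
            # side: phase 1 finished but READY never asserted.
            raise TimeoutError('READY not asserted for: %s' % sorted(pending))
    finally:
        ctx.close()

if __name__ == '__main__':
    wait_for_ready()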

 


 

UNSOLVED: Disable phase 2 failed in these log files: /reg/neh/home/cpo/2020/07/13_19:30:*drp*.log.  Added a print statement in PGPDetector.cc to explicitly print out whether disable phase 2 is received by the segment level.  Maybe disable phase 2 conflicts with an l1accept?

 

PROPOSED SOLUTION: Saw this error in /reg/neh/home/cpo/2020/06/16_15:49:13_drp-tst-dev020:tmohsd_4.log:

 

Code Block
ERROR:root:Web server error: HTTPSConnectionPool(host='pswww.slac.stanford.edu', port=443): Read timed out. (read timeout=3.05)

Murali proposes to increase the 3.05 s timeout in psdaq/configdb/configdb.py or webconfigdb.py (not sure which).
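
For reference, this is how the (connect, read) timeout pair works in the requests library.  A minimal sketch only; the URL and the 30 s read timeout below are illustrative, not the actual configdb change:

Code Block
import requests

# Hypothetical endpoint for illustration; the real configdb URL may differ.
URL = 'https://pswww.slac.stanford.edu/ws-auth/configdb/ws/'

try:
    # requests accepts a (connect, read) timeout tuple; raising the second
    # value is the kind of change proposed above.  A bare timeout=3.05 applies
    # the same limit to both the connect and the read.
    r = requests.get(URL, timeout=(3.05, 30))
    r.raise_for_status()
except requests.exceptions.ReadTimeout:
    # The failure seen in the log: the connection succeeded but the server
    # did not reply within the read timeout.
    print('configdb read timed out')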


PROPOSED SOLUTION: Saw hsd config timeouts.  A second (old) hsdpvs process was running, resulting in this warning:

 

Code Block
[ClientChannel::Impl] warning : More than one channel with name 'DAQ:LAB2:HSD:DEV07_88:A:FWBUILD' detected, connected to: 172.21.148.110:53472, ignored: 172.21.148.150:51349

This resulted in the intermittent failures, since the drp would sometimes connect to the wrong one and fail to learn that the configuration was complete.  Have sent mail to the controls group to see if we can turn this warning into an error.


UNSOLVED: Saw this error in /reg/neh/home/cpo/2020/06/15_14:39:26_daq-tst-dev07:hsdioc_89.log when starting up processes after a power-cycle (for new hsd firmware for the interrupt-holdoff):

 


Code Block
ADC Initial Lane Alignment Failed!
reg7 = 0xC

 

 hsd "89" was on the slave XPM:1.  another hsd "88" was on the same XPM:1 but worked.  all links looked locked on the xpm side.   saw this again in /reg/neh/home/cpo/2020/06/16_18:19:24_daq-tst-dev07:hsdioc_89.log (same digitizer, but also saw 88 fail below).  Note that this is after I commented out all the configure-resets in Digitizer.cc (that happened on Jun 16 at 14:54).  The various digitizers look similar:

 

[image]

 

Restarting the hsdioc_89 process "fixed" the problem.  Also saw this in hsdioc_88: /reg/neh/home/cpo/2020/06/18_18:16:40_daq-tst-dev07:hsdioc_88.log.  And here: /reg/neh/home/cpo/2020/06/18_18:51:48_daq-tst-dev07:hsdioc_1a.log

Tried running hsd134PVs with the "-E" option (ignore errors, so it would retry?) but still see failures.  Saw this error again in /reg/neh/home/cpo/2020/07/02_18:27:00_daq-tst-dev07:hsdioc_89.log when starting up processes after a power-cycle (for new hsd firmware for the interrupt-holdoff):

 

Code Block
ADC Initial Lane Alignment Failed!
reg7 = 0xC

 

and /reg/neh/home/cpo/2020/07/02_18:27:00_daq-tst-dev07:hsdioc_1a.log.  Wasn't able to follow all the "-E" logic yet.  Feels like "-E" doesn't retry?

 


SOLVED: Running at 10Hz with no recording, phase 2 of disable was not seen by any hsds: /reg/neh/home/cpo/2020/06/16_15:26:24_drp-tst-acc06:control.log.  Can see control.py try to write the xpm pv's for disable phase 2.  Saw this again here when recording at 100Hz: /reg/neh/home/cpo/2020/06/18_15:37:36_drp-tst-acc06:control.log.  Seemed not so reproducible at 10Hz, but reproducible at 100Hz.  Saw it with slow update disabled here: /reg/neh/home/cpo/2020/06/18_18:33:51_daq-tst-dev03.pcdsn:control_gui.log.  The hsd logs show phase 1 of disable but not phase 2.  Also saw this without recording here: /reg/neh/home/cpo/2020/06/18_18:51:48_drp-tst-acc06:control.log.  Following suggestions from Ric, eventually traced this down to an os.system('cat /proc/datadev_*') line in hsd_config.py:hsd_config().  I don't understand why that triggered this issue.
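
If the shell-out itself turns out to matter, one possible (untested) alternative is to read the /proc/datadev_* entries directly instead of spawning a shell with os.system.  A sketch only, not the change actually made in hsd_config.py:

Code Block
import glob

def dump_datadev_proc():
    """Read the datadev driver's proc entries without spawning a shell."""
    for path in sorted(glob.glob('/proc/datadev_*')):
        try:
            with open(path) as f:
                print('--- %s ---' % path)
                print(f.read())
        except OSError as exc:
            print('could not read %s: %s' % (path, exc))

if __name__ == '__main__':
    dump_datadev_proc()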

PROPOSED SOLUTION: running 1 hsd (datadev_0), after a while the deadtime goes to 100% from that hsd, but timepausecnt is zero for both hsd’s.  I think it's caused by Ben's headercntof latching to 1 on the “A” hsd, even though msgdelay is set to 99.  Do we need to increase msgdelay?  Matt says msgdelay is max 100 (corresponding to 1us).  This could be caused by slow updates ignoring dead time if we get long dead time due to the soft-lockup issue above (now solved by Matt, we believe).  So it could be the same issue.  Matt will have SlowUpdates pay attention to dead time to avoid this.
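
A quick sanity check on the numbers, assuming the msgdelay counts scale linearly with the stated maximum (100 counts = 1us, i.e. 10 ns per count):

Code Block
# Assumption: linear scaling, max 100 counts == 1 us  ->  10 ns per count.
NS_PER_COUNT = 1000.0 / 100

for msgdelay in (99, 100):
    print('msgdelay=%d -> %.0f ns' % (msgdelay, msgdelay * NS_PER_COUNT))
# msgdelay=99 -> 990 ns, msgdelay=100 -> 1000 ns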

...


Not Critical


UNSOLVED (any rate): with both the fake cam and the hsd, Matt saw that if he ran at 1MHz, then disabled for a few seconds, then reenabled, all the buffers were stuck in software and the system would hang.  Reproducible 1 out of 3 attempts.  Learned that Pause/Resume does not see this problem - you must actually disable triggers.  When this happens, the monitoring shows that the DRP does not process all of the L1s.  It's like a batch gets stuck.  After that, nothing gets processed (transitions or more L1s).

Found that SlowUpdates prevent the system from hanging.  Ric suggested that the in-progress batch gets stuck because the interval between the last L1 and the next datagram is greater than the batch interval.  This doesn't happen between transitions, because a transition flushes any in-progress batch.
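
A toy model of the batching behavior Ric describes (purely illustrative Python, not the actual drp/event-builder code): a batch is only pushed out when a later datagram falls outside the batch interval or a transition forces a flush, so if triggers simply stop, the last batch just sits there.

Code Block
BATCH_INTERVAL = 1000  # ticks; illustrative value only

class Batcher:
    """Toy model: group L1 timestamps into batches of width BATCH_INTERVAL."""

    def __init__(self, sink):
        self.sink = sink          # called with each completed batch
        self.batch = []
        self.batch_start = None

    def flush(self):
        # Transitions (and SlowUpdates) force out the in-progress batch.
        if self.batch:
            self.sink(self.batch)
            self.batch = []
            self.batch_start = None

    def post_l1(self, ts):
        # An L1 outside the current interval completes the previous batch.
        if self.batch_start is not None and ts - self.batch_start >= BATCH_INTERVAL:
            self.flush()
        if self.batch_start is None:
            self.batch_start = ts
        self.batch.append(ts)

if __name__ == '__main__':
    out = []
    b = Batcher(out.append)
    for ts in (0, 10, 20):   # a few L1s, then triggers stop
        b.post_l1(ts)
    print(out)               # [] -- the last batch is stuck, nothing flushes it
    b.flush()                # a transition or SlowUpdate would do this
    print(out)               # [[0, 10, 20]]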


(spurious error) UNSOLVED: early procmgr telnet timeout?

...