
On Nov. 18, 2020 we saw a failure mode where hsd_4 did not respond on configure phase 2.  Matt tracked this down to the kcu→hsd links not being locked (visible with "kcuStatus" and "hsdpva").  Note that kcuStatus is currently not an installed binary: it has to be run from the build area.  The cause was that we were running only one half of a pair of hsd's, and the other half is responsible for setting the clock frequencies on the kcu, which is required for link lock.  We set the clock frequencies by hand by running "kcuStatus -I" (I for Initialize, I guess) on the drp node.  Matt is thinking about a more robust solution.
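
The manual workaround could be wrapped in a small script along the following lines. This is only a sketch: the build-area path is a hypothetical placeholder, the helper names are made up for illustration, and the only option used is the "-I" mentioned above. Any other arguments kcuStatus may need on a given node are not covered in these notes and are not assumed here.

```python
import subprocess
from pathlib import Path
from typing import List

# Hypothetical placeholder: kcuStatus is not an installed binary, so point
# this at wherever it lives in your build area.
BUILD_AREA = Path("/path/to/build/area")


def kcu_init_command(build_area: Path = BUILD_AREA) -> List[str]:
    """Return the argv for 'kcuStatus -I' run from the given build area."""
    return [str(build_area / "kcuStatus"), "-I"]


def initialize_kcu(build_area: Path = BUILD_AREA) -> int:
    """Run 'kcuStatus -I' on the local (drp) node and return its exit code."""
    result = subprocess.run(kcu_init_command(build_area), check=False)
    return result.returncode
```

The script has to be run on the drp node itself, since that is where the kcu lives.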

On May 24, 2023 we saw a failure where 3 out of 4 hsd's in rix failed configure phase 2.  The failing channels did not increment txcntsum in the pgp tab in hsdpva.  The timing links looked OK, showing the usual 919kHz (should be 929?) of timing frames.  Restarting the hsdioc's didn't help.  Neither did timrxrst, timpllrst, or reset in hsdpva, nor TxLinkReset in xpmpva.  We eventually recovered by restarting hsdioc and waiting longer, so I think I didn't wait long enough in the previous attempts.  We also saw many errors like this in the hsdioc logs while it was broken:

Code Block
DAQ:RIX:HSD:1_1A:A:MONTIMING putDone Error: Disconnect
DAQ:RIX:HSD:1_1A:A:MONTIMING putDone Error: Disconnect
DAQ:RIX:HSD:1_1A:A:MONTIMING putDone Error: Disconnect
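
To see quickly which PVs are affected, the errors can be tallied per PV. The sketch below assumes every relevant log line has exactly the "<PV name> putDone Error: <reason>" shape shown above; the function name is made up for illustration, and real hsdioc logs may contain other line formats that this ignores.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines of the form "<PV name> putDone Error: <reason>".
PUTDONE_ERROR = re.compile(r"^(?P<pv>\S+) putDone Error: (?P<reason>.+)$")


def count_putdone_errors(lines: Iterable[str]) -> Counter:
    """Count putDone errors keyed by (PV name, reason)."""
    counts: Counter = Counter()
    for line in lines:
        match = PUTDONE_ERROR.match(line.strip())
        if match:
            counts[(match["pv"], match["reason"])] += 1
    return counts


# The three lines quoted above, used as sample input.
sample = [
    "DAQ:RIX:HSD:1_1A:A:MONTIMING putDone Error: Disconnect",
    "DAQ:RIX:HSD:1_1A:A:MONTIMING putDone Error: Disconnect",
    "DAQ:RIX:HSD:1_1A:A:MONTIMING putDone Error: Disconnect",
]

for (pv, reason), n in count_putdone_errors(sample).most_common():
    print(f"{n:4d}  {pv}  {reason}")
```

For a real log, pass an open file object instead of the sample list.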

When we attempted to test the DAQ in SRCF with a couple of HSDs, we initially had trouble getting phase 2 of transitions through.  The two "sides" 1_DA:A and B behaved differently with pgpread.  Sometimes some events came through one side but not the other, and then only with a significant delay, measured from when the Run box in the groupca Events tab was checked rather than from when the transition buttons were clicked.  Some of the entries in hsdpva's tabs were also odd (Buffers:raw:freesz reset to 65535 for one side, 4094 for the other).  Some work had been done on the PV gateway.  hsdioc on daq-tmo-hsd-01 had been restarted and had got into a bad state.  Restarting it again cleared up the problem.

Changing the number of DMA buffers (cfgRxCount) in kcu.service can sometimes lead to the node hanging.  In one case, after recovery from the hang using IPMI power cycling, the tdetsim service was started instead of the kcu service.  After fixing that and starting the kcu service, the KCU was still unhappy.  kcuStatus showed the links unlocked and rx/txClkFreq values of 131.394 instead of the required 156.171.  After power cycling again, kcuStatus reported normal values.  We then found that the hsdioc on daq-tmo-hsd-01 had become unresponsive.  After restarting it, the HSD DAQ ran normally.
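
The clock-frequency check described above can be written down as a small helper. This is a sketch under assumptions: the values are typed in by hand from the kcuStatus output (no output format is assumed), the keys rxClkFreq and txClkFreq are taken from the "rx/txClkFreq" shorthand above, and the tolerance is an arbitrary choice that the notes do not specify.

```python
from typing import Dict, List

REQUIRED_CLK_FREQ = 156.171  # required value quoted in the notes above
TOLERANCE = 0.01             # assumed acceptance window, not from the notes


def clk_freq_ok(value: float,
                required: float = REQUIRED_CLK_FREQ,
                tolerance: float = TOLERANCE) -> bool:
    """Return True if a reported clock frequency is within tolerance."""
    return abs(value - required) <= tolerance


def bad_clocks(readings: Dict[str, float]) -> List[str]:
    """Return the names of the readings that are off the required value."""
    return [name for name, value in readings.items() if not clk_freq_ok(value)]


# Values as seen in the bad state described above.
print(bad_clocks({"rxClkFreq": 131.394, "txClkFreq": 131.394}))
```

An empty list means both clocks are at the required frequency. A non-empty list matches the unlocked-link state seen here, which was cleared by power cycling again.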
