...

Feels like we need to power cycle the FEB?

AMI

UNSOLVED: Got the following when running ami.cnf against a running tmo.cnf system that includes HSDs:

...

Message sent to Seshu and CPO.

HSD

UNSOLVED: After two tmo.cnf runs consisting of pvcam, epics, bld, ts, 3 fakecams and 10 hsds, each of which lasted more than a few minutes, none of the 10 hsds responded to Disable.  The teb log files (/reg/neh/home/claus/2020/07/24_17:12:50_drp-tst-dev016:teb0.log, /reg/neh/home/claus/2020/07/24_19:01:01_drp-tst-dev016:teb0.log) show that two L1Accepts and the Disable timed out because all HSD contributions were missing.  The HSDs were being triggered at 360 Hz, which matches the time difference between the L1Accepts.  SlowUpdates were also occurring.  On another run attempt lasting no more than a minute or so, the Disable (and subsequent transitions) proceeded correctly.
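As a quick sanity check on the rate claim, the expected spacing between consecutive triggers at 360 Hz is 1/360 s, or about 2.78 ms.  A minimal sketch, assuming the two L1Accept timestamps have already been converted to seconds (the helper name and the tolerance are hypothetical, not part of the DAQ code):

```python
def matches_rate(t0_s, t1_s, rate_hz, rel_tol=0.01):
    """Return True if the spacing between two timestamps (in seconds)
    is within rel_tol of one trigger period at rate_hz."""
    period = 1.0 / rate_hz
    return abs((t1_s - t0_s) - period) <= rel_tol * period

# Expected spacing at 360 Hz, in milliseconds (about 2.78 ms)
period_ms = 1000.0 / 360.0
```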


Interacting with Matt, I (Ric) saw that the hsdpva GUI's Timing tab showed msgdelayset with a value of 91.  This should normally be 98 or 99, but Matt said this shouldn't make a difference at 360 Hz.  Nonetheless, it feels like doing 'pvput DAQ:LAB2:XPM:2:PART:3:L0Delay 97', as per Chris's entry above, makes the problem harder to reproduce.

Matt asked me to watch the hsdpva Timing tab while the Disable is being timed out by the TEB.  The headerfifow and headercntof values stay at zero for the entire time before Disable is selected in control_gui.  About a second after Disable is selected, headerfifow starts to count at 1 Hz.  When it reaches 16, headercntof jumps to 10.  When headerfifow saturates at 31, headercntof becomes 11.  Matt thought that the timing headers being counted at 1 Hz are SlowUpdates, so we tried running with them disabled.  The problem would not reproduce in that case.

It seems strange that headerfifow would be counting SlowUpdates (or anything else, for that matter) after Disable is selected in control_gui.  By the time of the first count, I'm pretty sure that all DRPs have printed a line indicating they've received phase 1 of Disable.  Thus, control.py should have stopped issuing SlowUpdates.  So, if these are not SlowUpdates, what are they?  And why at 1 Hz?
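To pin down the rate of these unexplained counts, the headerfifow value could be logged together with wall-clock times and the rate estimated from the log.  A minimal sketch, assuming the samples are collected by hand or by some polling script (the function name is hypothetical, and the sample values below are made up for illustration; they are not measurements):

```python
def estimate_rate_hz(samples):
    """Estimate a counting rate from (time_s, count) samples, using the
    first sample and the last sample at which the counter still rose."""
    rising = [samples[0]]
    for t, c in samples[1:]:
        if c > rising[-1][1]:
            rising.append((t, c))
    if len(rising) < 2:
        return 0.0  # counter never moved
    (t0, c0), (t1, c1) = rising[0], rising[-1]
    return (c1 - c0) / (t1 - t0)

# Illustrative samples only: one count per second, saturating at 31
demo = [(float(t), min(t, 31)) for t in range(40)]
rate = estimate_rate_hz(demo)
```

Ignoring samples taken after the counter stops rising keeps the saturation at 31 from biasing the estimate downward.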

BLD

UNSOLVED: In recent tmo.cnf runs with the BLD, the BLD has consistently been marking every event it handles with MissedData damage.  Adding a print to Pgp::next() results in lines being printed like:

...

In this case, part of the pulse ID value received from the multicast socket was in common with some of the upper 32 bits of the TimingHeader timestamp (3980f65).
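To check this kind of overlap systematically, the timestamp can be split into its two 32-bit halves and both values compared as fixed-width hex.  A minimal sketch; the 64-bit layout follows the description above, while the helper names and the example values are purely illustrative and are not captured data:

```python
def split_u64(value):
    """Split a 64-bit integer into (upper32, lower32)."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value >> 32, value & 0xFFFFFFFF

def common_hex_prefix(a, b, width=16):
    """Return the leading hex digits shared by a and b at fixed width."""
    ha, hb = f"{a:0{width}x}", f"{b:0{width}x}"
    n = 0
    while n < width and ha[n] == hb[n]:
        n += 1
    return ha[:n]

# Made-up values for illustration only
upper, lower = split_u64(0x0123456789ABCDEF)
shared = common_hex_prefix(0x0123456700000000, 0x0123456789ABCDEF)
```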

Miscellaneous

UNSOLVED: On one attempt to record a run with tmo.cnf, the control_gui reported bld failing to respond to BeginRun.  The teb log file (/reg/neh/home/claus/2020/07/24_19:15:28_drp-tst-dev016:teb0.log) shows the BeginRun event to be split.  All contributions but the bld's arrived in the teb within the 5 second event build timeout period.  Later (not clear how much later) the bld contribution arrived, starting a new event, for which none of the other contributions showed up within the 5 second timeout period (since they had already arrived for the previous event).  Because the pulse ID of this event was the same as that of the previous event (i.e., didn't advance), the teb asserted.
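The failure sequence can be reproduced in a toy model.  The following is a minimal sketch and not the teb's actual code: the class, its methods, the contributor names and the explicit flush() call standing in for the 5 second timeout are all hypothetical.  Only the logic described above (a late contribution opening a second event with an unchanged pulse ID) is modeled.

```python
class ToyEventBuilder:
    """Toy model: groups contributions by pulse ID and requires that
    flushed events have strictly increasing pulse IDs."""

    def __init__(self, contributors):
        self.contributors = set(contributors)
        self.pending = {}          # pulse ID -> set of sources seen so far
        self.last_flushed = None   # pulse ID of the last flushed event

    def add(self, pulse_id, source):
        self.pending.setdefault(pulse_id, set()).add(source)

    def flush(self, pulse_id):
        """Stand-in for completion or timeout: emit the event as it is."""
        seen = self.pending.pop(pulse_id)
        # Models the described check: the pulse ID must advance
        if self.last_flushed is not None and pulse_id <= self.last_flushed:
            raise AssertionError("pulse ID did not advance")
        self.last_flushed = pulse_id
        return self.contributors - seen   # contributors that were missing

eb = ToyEventBuilder(["ts", "hsd", "bld"])
eb.add(100, "ts")
eb.add(100, "hsd")
missing_first = eb.flush(100)   # timeout fires with bld still missing
eb.add(100, "bld")              # late arrival opens a new event
try:
    eb.flush(100)               # same pulse ID as the previous event
    asserted = False
except AssertionError:
    asserted = True
```

In this model the first flush reports bld as missing, and the second flush trips the pulse ID check, mirroring the split BeginRun followed by the assert.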

hsdpva

UNSOLVED: It doesn't happen very often, but sometimes the hsdpva process segfaults.  This seems very strange since it is Python.  I (Ric) have not been able to capture it in gdb yet.
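One way to get at least a Python-level traceback without gdb is the standard library's faulthandler module, which dumps the Python stack of every thread when the process receives SIGSEGV.  A minimal sketch, assuming it is acceptable to add a few lines near the top of the hsdpva entry point (the log file name and location are placeholders):

```python
import faulthandler
import os
import tempfile

# Placeholder location; any writable path would do
log_path = os.path.join(tempfile.gettempdir(), "hsdpva_crash.log")

# Keep the file object alive for the life of the process:
# faulthandler writes to its file descriptor at crash time.
crash_log = open(log_path, "w")
faulthandler.enable(file=crash_log, all_threads=True)
```

The same handler can be turned on without code changes by starting the interpreter with 'python -X faulthandler' or by setting PYTHONFAULTHANDLER=1, in which case the traceback goes to stderr.  This only shows where the Python code was when the crash happened; the fault itself is presumably in a C extension, so a gdb backtrace would still be needed for the native frames.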