...

  • I was poking around with daqPipes just to familiarize myself with it, and I was looking at the crash this morning at around 8:30.  I noticed that at 8:25:00 the Opal queue is at 100% and teb0 is starting to give bad signs (again at ID0, from the bit mask).  However, if I make steps of 1 second, I see that it seems to recover, with the queue occupancy dropping to 98, 73 and then 0.  A few seconds later, though, the drp batch pools for all the hsds are blocked.  I would like to ask you (answer when you have time, it is just for me to understand): is this the usual Opal problem that we see?  Why does it seem to recover before the batch pool blocks?  I see that the first batch pool to be exhausted is the Opal one.  Is this somehow related?

  • I’ve still been trying to understand that one myself, but keep getting interrupted to work on something else, so here is my perhaps half-baked thought:  Whatever issue it is that blocks the Opal from writing eventually goes away, and so the Opal can drain.  The problem is that this happens so late that the TEB has started timing out a (many?) partially built event(s).  Events that are missing a contribution don’t produce a Result for the missing contributor, so if that contributor (sorry, DRP) did try to produce a contribution, it never gets an answer, which is needed to release the input batch and PGP DMA buffer.  Then, when the system unblocks, a SlowUpdate (or perhaps an L1A, too, I think) comes along with a timestamp so far in the future that it wraps around the batch pool, which is a ring buffer.  This blocks because there might already be an older contribution there that is waiting to be released (see the sketch after this list).  It scrambles my brain to think about, so apologies if it isn’t clear.  I’m trying to think of a more robust way to do it, but haven’t gotten very far yet.
  • One possibility might be for the contributor/DRP to time out the input buffer in EbReceiver, so that if a result matching that input never arrives, the input buffer and PGP buffer are released.  This could produce some really complicated failure modes that are hard to debug, because the system wouldn’t stop.  Chris discouraged me from going down that path for fear of making things more complicated, rightly so, I think.
  • If a contribution is missing, the *EBs time it out (4 seconds, IIRR), then mark the event with DroppedContribution damage.  The Result dgram (TEB only, and it contains the trigger decision) receives this damage and is sent to all contributors that the TEB heard from for that event.  Sending it to contributors it didn’t hear from might cause problems because they might have crashed.  Thus, if there’s damage raised by the TEB, it appears in all contributions that the DRPs write to disk and send to the monitoring.  This is the way you can tell in the TEB Performance grafana plots whether the DRP or the TEB is raising the damage.
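
To make the ring-buffer wraparound above a bit more concrete, here is a minimal sketch in Python, assuming a batch pool that maps timestamps onto a fixed number of slots by modulo arithmetic.  The class and method names are hypothetical; this is not the actual psdaq implementation.

```python
# Minimal sketch (hypothetical names, not the real psdaq code) of how a
# timestamp-indexed ring buffer can block: a dgram with a timestamp far in
# the future wraps around onto a slot still held by an old, unreleased batch.
import threading

class BatchPool:
    def __init__(self, depth):
        self.depth = depth                  # number of batch slots
        self.slots = [None] * depth         # slot -> timestamp of batch awaiting release
        self.freed = threading.Condition()

    def allocate(self, timestamp):
        idx = timestamp % self.depth        # wraparound: far-future timestamps reuse old slots
        with self.freed:
            # Blocks here if an older batch still occupies the slot, i.e. its
            # Result never came back and release() was never called.
            while self.slots[idx] is not None:
                self.freed.wait()
            self.slots[idx] = timestamp
            return idx

    def release(self, idx):
        with self.freed:
            self.slots[idx] = None
            self.freed.notify_all()
```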

  • Ok thank you. But when you say: "If that contributor (sorry, DRP) tried to produce a contribution, it never gets an answer, which is needed to release the input batch and PGP DMA buffer".  I guess you mean that the DRP tries to collect a contribution for a contributor that maybe is not there. But why would the DRP try to do that? It should know about the damage from the TEB's trigger decision, right? (Do not feel compelled to answer immediately, when you have time)

  • The TEB continues to receive contributions even when there’s an incomplete event in its buffers.  Thus, if an event or series of events is incomplete, but a subsequent event does show up as complete, all those earlier incomplete events are marked with DroppedContribution and flushed, on the assumption that they will never complete.  This happens before the timeout expires.  If the missing contributions then show up anyway (the assumption was wrong), they’re out of time order and thus dropped on the floor, rather than starting a new event that would itself be missing the originally found contributors (aaah, my fingers have formed knots!), causing a split event (don’t know if that’s universal terminology).  A split event is often indicative of the timeout being too short.
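
A rough sketch of that flush logic, with hypothetical class and attribute names (not the actual TEB code); the rule that a Result goes only to the contributors actually heard from is the one described earlier.

```python
# Rough sketch (hypothetical names) of the flush behaviour described above:
# completing a newer event flushes older, still-incomplete ones with
# DroppedContribution damage; contributions arriving after that flush are out
# of time order and are dropped rather than creating a "split event".
DROPPED_CONTRIBUTION = 0x1                      # placeholder damage bit

class EventBuilder:
    def __init__(self, expected_sources):
        self.expected = set(expected_sources)
        self.pending = {}                       # timestamp -> {source: contribution}
        self.last_flushed = -1                  # newest timestamp already flushed

    def add(self, timestamp, source, contribution):
        if timestamp <= self.last_flushed:
            return                              # late arrival: dropped, no split event
        event = self.pending.setdefault(timestamp, {})
        event[source] = contribution
        if set(event) == self.expected:         # this event is now complete
            self._flush_through(timestamp)

    def _flush_through(self, timestamp):
        # Flush everything up to and including `timestamp`; older incomplete
        # events are assumed never to complete and get DroppedContribution.
        for ts in sorted(t for t in self.pending if t <= timestamp):
            event = self.pending.pop(ts)
            damage = 0 if set(event) == self.expected else DROPPED_CONTRIBUTION
            self._send_result(ts, event, damage)
        self.last_flushed = timestamp

    def _send_result(self, ts, event, damage):
        # The Result goes only to the sources actually heard from for this event.
        for source in event:
            print(f"Result ts={ts} -> {source}, damage={damage:#x}")
```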

  • The problem is that if the DRP’s Result thread, what we call EbReceiver, gets stuck for long enough, like in the Opal case, it backpressures into the TEB, which then hangs trying to send some Result to the DRPs.  (I think the half-bakedness of my idea is starting to show…)  Meanwhile, the DRPs have continued to produce Input dgrams and send them to the TEB, until they ran out of batch buffer pool.  That then backs up in the KCU to the point that the Deadtime signal is asserted.  Given the different contribution sizes, some DRPs send more Inputs than others, I think.  After long enough, the TEB’s timeout timer goes off, but because the TEB is paralyzed by being stuck in the Result send(), nothing happens (the TEB is single threaded) and the system comes to a halt.  At long last, the Opal is released, which allows the TEB to complete sending its ancient Result and then carry on to deal with all those timed-out events.  All those timed-out events actually had Inputs, which now start to flow, but because the Results for those events have already been sent to the contributors that produced their Inputs in time, the DRPs that produced their Inputs late never get to release their Input buffers.
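
A schematic of why the timeout can’t fire while the Result send() is blocked; this is a hypothetical single-threaded loop for illustration, not the actual TEB main loop.

```python
# Hypothetical single-threaded event-builder loop (not the actual TEB code):
# the timeout check shares the thread with send_result(), so back-pressure
# from a stuck downstream EbReceiver stalls timeout handling as well.
import time

def event_builder_loop(poll_contribution, build, send_result, timeout=4.0):
    pending = {}                            # timestamp -> time first seen
    while True:
        item = poll_contribution()          # returns (timestamp, data) or None
        if item is not None:
            ts, data = item
            pending.setdefault(ts, time.monotonic())
            result = build(ts, data)        # a complete event yields a Result
            if result is not None:
                send_result(result)         # <-- if this blocks, the loop stalls here
                pending.pop(ts, None)
        # Timeout handling only runs once send_result() returns; during a long
        # stall, timed-out events sit untouched and the system appears hung.
        now = time.monotonic()
        for ts, first_seen in list(pending.items()):
            if now - first_seen > timeout:
                pending.pop(ts)
                # ...mark DroppedContribution damage and flush here...
```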

A thought from Ric on how shmem buffer sizes are calculated:

  • If the buffer for the Configure dgram (in /usr/lib/systemd/system/tdetsim.service) is specified to be larger, it wins.  Otherwise it’s the Pebble buffer size.  The Pebble buffer size is derived from the tdetsim.service cfgSize, unless it’s overridden by the pebbleBufSize kwarg to the drp executable.
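
A small sketch of that precedence in Python; the parameter names just mirror the text (cfgSize, pebbleBufSize) and this is not the actual drp code.

```python
# Hedged sketch of the buffer-size precedence described above; parameter
# names are hypothetical and only mirror the text, not the real drp code.
def shmem_buffer_size(configure_dgram_size, cfg_size, pebble_buf_size=None):
    # The Pebble buffer size comes from the tdetsim.service cfgSize, unless
    # the pebbleBufSize kwarg to the drp executable overrides it.
    pebble = pebble_buf_size if pebble_buf_size is not None else cfg_size
    # If the Configure dgram buffer is specified to be larger, it wins;
    # otherwise the Pebble buffer size is used.
    return max(configure_dgram_size, pebble)
```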

XPM

The timing system should come back up.  If not, use a machine with afs (psdev01, pslab03?)

...