...

  • I was poking around with daqPipes just to familiarize myself with it, and I was looking at the crash this morning at around 8.30. I noticed that at 8.25.00 the Opal queue is at 100% and teb0 is starting to show bad signs (again at ID0, from the bit mask). However, if I step forward 1 second at a time, it seems to recover, with the queue occupancy dropping to 98, 73, then 0. A few seconds later, though, the DRP batch pools for all the hsd lists are blocked. I would like to ask you (answer when you have time, it is just for me to understand): is this the usual Opal problem that we see? Why does it seem to recover before the batch pool blocks? I see that the first batch pool to be exhausted is the Opal one. Is this somehow related?

  • I’ve still been trying to understand that one myself, but keep getting interrupted to work on something else, so here is my perhaps half-baked thought: Whatever is blocking the Opal from writing eventually goes away, and so it can drain. The problem is that this happens so late that the TEB has already started timing out one (many?) partially built event(s). Events for which there is no contributor don’t produce a result for the missing contributor, so if that contributor (sorry, DRP) tried to produce a contribution, it never gets an answer, which is needed to release the input batch and PGP DMA buffer. Then when the system unblocks, a SlowUpdate (perhaps; it could be an L1A, too, I think) comes along with a timestamp so far in the future that it wraps around the batch pool, which is a ring buffer. This blocks because there might already be an older contribution in that slot, still waiting to be released. It scrambles my brain to think about, so apologies if it isn’t clear. I’m trying to think of a more robust way to do it, but haven’t gotten very far yet.
  • One possibility might be for the contributor/DRP to time out the input buffer in EbReceiver, so that if a result matching that input never arrives, the input buffer and PGP buffer are released.  This could produce some really complicated failure modes that are hard to debug, because the system wouldn’t stop.  Chris discouraged me from going down that path for fear of making things more complicated, rightly so, I think.
  • If a contribution is missing, the *EBs time it out (4 seconds, IIRC), then mark the event with DroppedContribution damage. The Result dgram (TEB only; it contains the trigger decision) receives this damage and is sent to all contributors that the TEB heard from for that event. Sending it to contributors it didn’t hear from might cause problems because they might have crashed. Thus, if there’s damage raised by the TEB, it appears in all contributions that the DRPs write to disk and send to the monitoring. This is how you can tell in the TEB Performance Grafana plots whether the DRP or the TEB is raising the damage.
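
The wrap-around described above may be easier to see in a toy model. The sketch below is not the DAQ code: `BatchPool`, `try_allocate`, `release`, and `simulate` are hypothetical names, and mapping a batch to a slot with a plain modulo is an assumption standing in for however the real batch pool derives its index from the timestamp. It only illustrates the mechanism: a slot is freed when a result comes back, a batch whose result never arrives keeps its slot, and a later batch that maps to the same slot cannot be allocated.

```python
class BatchPool:
    """Toy ring buffer (hypothetical model, not the real DRP batch pool)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size  # None = free, otherwise the batch index held

    def try_allocate(self, batch_index):
        """Return True if the slot was free; False is where a real DRP would block."""
        slot = batch_index % self.size  # assumption: index wraps modulo pool size
        if self.slots[slot] is not None:
            return False
        self.slots[slot] = batch_index
        return True

    def release(self, batch_index):
        """Free the slot; in this model, called only when a result arrives."""
        slot = batch_index % self.size
        if self.slots[slot] == batch_index:
            self.slots[slot] = None


def simulate(pool_size=8, lost=(3,), jump_to=11):
    """Allocate batches 0..4, lose the results listed in `lost`, then jump ahead."""
    pool = BatchPool(pool_size)
    for idx in range(5):
        pool.try_allocate(idx)
        if idx not in lost:
            pool.release(idx)  # result came back, slot is freed
        # batches in `lost` never get a result, so their slots stay occupied
    # After the stall, the next batch arrives with an index far in the future.
    allocated = pool.try_allocate(jump_to)
    return allocated, pool.slots


if __name__ == "__main__":
    ok, slots = simulate()
    print("allocation succeeded:", ok)
    print("slots:", slots)
```

With the default arguments, batch 11 maps to the same slot as the never-released batch 3, so the allocation fails; choosing `jump_to=12` lands on a free slot and succeeds. Whether a given far-future timestamp collides therefore depends on where it happens to wrap.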

XPM

The timing system should come back up. If not, use a machine with AFS (psdev01, pslab03?).

...