Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Some thoughts from Ric and Valerio on the opal file-writing problem on. (11/30/2020):

  • I was poking around with daqPipes just to familiarize myself with it and I was looking at the crash this morning at around 8.30.  I noticed that at 8.25.00 the opal queue is at 100% nd teb0 is starting to give bad signs (again at ID0, from the bit mask) However, if I make steps of 1 second, I see that it seems to recover, with the queue occupancy dropping to 98, 73 then 0. However, a few seconds later the drp batch pool for all the hsd lists are blocked. I would like to ask you (answer when you have time, it is just for me to understand): is this the usual Opal problem that we see? Why does it seem to recover before the batch pool blocks? I see that the first batch pool to be exhausted is the opal one. Is this somehow related?

  • I’ve still been trying to understand that one myself, but keep getting interrupted to work on something else, so here is my perhaps half baked thought:  Whatever the issue is that blocks the Opal from writing, eventually goes away and so it can drain.  The problem is that that is so late that the TEB has started timing out a (many?) partially built event(s).  Events for which there is no contributor don’t produce a result for the missing contributor, so if that contributor (sorry, DRP) tried to produce a contribution, it never gets an answer, which is needed to release the input batch and PGP DMA buffer.  Then when the system unblocks, a SlowUpdate (perhaps, could be an L1A, too, I think) comes along with a timestamp so far in the future that it wraps around the batch pool, a ring buffer.  This blocks because there might already be an older contribution there that is waiting to be released.  It scrambles my brain to think about, so apologies if it isn’t clear.  I’m trying to think of a more robust way to do it, but haven’t gotten very far yet.
  • One possibility might be for the contributor/DRP to time out the input buffer in EbReceiver, so that if a result matching that input never arrives, the input buffer and PGP buffer are released.  This could produce some really complicated failure modes that are hard to debug, because the system wouldn’t stop.  Chris discouraged me from going down that path for fear of making things more complicated, rightly so, I think.

...