We propose to use a two-phase approach for each transition (after CONNECT) in the DRP FSM, inspired by the LCLS-I approach

  • The first phase is handled by a ZMQ broadcast, so configures can happen in parallel
  • The second phase is handled in the timing system thread.  This phase "sweeps" out the results from the first phase
  • The control level sends out both phases immediately
  • The FSM runs in the ZMQ thread, except for event(), which runs in the timing system thread
  • the second phase after the first phase is completed
  • We will try to run as much code as possible in the ZMQ thread in order to make the TS thread "sweep" as quick as possible
  • All other code in the timing-system thread (apart from event()) should not be detector-specific
  • The timing system thread is responsible for all xtc writing
  • If a DRP has N segment-level workers, only one of them will receive the timing system transition
  • The completion of the first phase of the transition is "hidden" from the control level (i.e. it only sees the completion of the second phase)
  • Since the mon nodes quickly cache the relevant transition, their "completion" is ignored in this process
  • All timeouts for the two phases are done by the control level
  • For each non-event transition:
    • the timing thread waits for a result from the ZMQ thread, with optional JSON that should be appended to the output xtc
    • each transition (maybe just configure and configUpdate) specifies a timeout value, perhaps with the CONNECT collection message
    • timeouts are used by:
      • each node's timing-system thread, to time out the first-phase message from the corresponding ZMQ thread
      • the control level, to time out the responses from the various DRP timing system threads.  We believe the control level can set the value to the maximum of all the timeout values (plus overhead)
    • hopefully the second phase doesn't need a transition-dependent timeout, but if it does it will be specified in a similar manner to the first phase

In this diagram the ZMQ threads are on the left and the timing system (TS) threads are on the right.

  • "SegWork" is a segment-level worker thread (one of many)
  • "Eb" corresponds to the infiniband event-builder
  • "EbRcvr" corresponds to the thread in the DRP node that receives the dgram response from the Eb.

  • The ZMQ thread should inform the timing-system thread of its config JSON, so it can be appended to the XTC
  • The timing-system thread's "complete" message is transmitted via the ZMQ thread, since that thread has knowledge of the appropriate sockets.
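
As a rough illustration of the timeout rule for the control level (maximum of all the per-node timeout values, plus overhead), here is a minimal Python sketch. The function name, the node names, the timeout values and the overhead are all hypothetical; how the per-transition values would actually be collected (perhaps with the CONNECT collection message) is not shown.

```python
def control_level_timeout(node_timeouts_s, overhead_s=1.0):
    """Return the timeout (seconds) the control level applies to one transition.

    node_timeouts_s: mapping of node name -> timeout in seconds that the node
    specified for this transition (hypothetical input format).
    overhead_s: extra allowance added on top of the slowest node (assumed value).
    """
    if not node_timeouts_s:
        # No node specified a timeout, so only the overhead applies.
        return overhead_s
    return max(node_timeouts_s.values()) + overhead_s


# Hypothetical per-node timeouts for a configure transition.
configure_timeouts = {"drp-0": 5.0, "drp-1": 20.0, "drp-2": 12.5}
timeout_s = control_level_timeout(configure_timeouts, overhead_s=2.0)
print(timeout_s)
```

With this rule the control level never gives up before the slowest node's own timeout has had a chance to expire.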

Some implementation details:

  •  I think this is done with the "inprocSend" ZMQ context in DrpBase.cc.
  • The phase1 response to the control level from the drp nodes is in PGPDetectorApp.cc:handlePhase1()
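
The hand-off between the two threads of one DRP node could look roughly like the sketch below. It is only an illustration of the idea and is not taken from DrpBase.cc or PGPDetectorApp.cc: a Python queue.Queue stands in for the ZMQ "inproc" channel, a list stands in for the xtc output, and every function and field name is hypothetical.

```python
import json
import queue
import threading


def zmq_thread_phase1(handoff, configure, transition):
    """Phase 1: do the (possibly slow) detector-specific work, then hand off."""
    try:
        config_json = configure(transition)  # may be None: the JSON is optional
        handoff.put({"transition": transition, "ok": True, "json": config_json})
    except Exception as exc:
        handoff.put({"transition": transition, "ok": False, "error": str(exc)})


def timing_thread_phase2(handoff, transition, timeout_s, xtc_out):
    """Phase 2: "sweep" up the phase-1 result. Returns True on success."""
    try:
        result = handoff.get(timeout=timeout_s)
    except queue.Empty:
        return False  # phase 1 did not report in time
    if not result["ok"] or result["transition"] != transition:
        return False
    record = {"transition": transition}
    if result["json"] is not None:
        record["json"] = result["json"]
    xtc_out.append(record)  # stand-in for writing the transition to xtc
    return True


handoff = queue.Queue()
xtc_out = []
worker = threading.Thread(
    target=zmq_thread_phase1,
    args=(handoff, lambda t: json.dumps({"detector": "example"}), "configure"),
)
worker.start()
ok = timing_thread_phase2(handoff, "configure", timeout_s=2.0, xtc_out=xtc_out)
worker.join()
print(ok, xtc_out)
```

Keeping the detector-specific call inside zmq_thread_phase1 means the timing-thread side only waits, checks and writes, which is the point of making the "sweep" quick.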



MEB Discussion

April 15, 2022: claus, caf, cpo

Ric found that in UED the disable transitions were being delayed by several seconds, queueing up a few of them and creating buffering problems for the meb and difficult-to-understand crashes (perhaps because we only have 1 buffer for the disable transition?).  We discussed two options to address this, allowing the meb to participate in the control.py decision about when to execute the next transition:

option (1) is having the meb participate in the phase2 sweep (like teb)
  - more work for ric
  - have to generate the "inproc" (complete) message
  - complication: has to handle slowupdate in a special way
  - more self-contained
  - ric worries that meb buffers may not be promptly returned to the drp: maybe wouldn't work?

option (2) is meb becomes like a drp: generate its own phase2 complete and send to control.py via ZMQ "inproc" message
  - more work for caf
  - could more precisely identify the meb as being a problem if meb crashed
  - touches both drp and control.py code

does the above decision affect speed of phase2?
- i think the answer is no: meb doesn't do anything in phase2

tentative decision is to try option (2)
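
For illustration only, here is a sketch of how the control side might treat the meb like a drp under option (2): it waits for a phase2 "complete" from every expected sender before moving to the next transition, and reports who is still missing at timeout. This is not the control.py implementation; the queue stands in for the control-level socket, and the function name and sender ids are hypothetical.

```python
import queue
import time


def wait_for_phase2(responses, expected, timeout_s):
    """Collect phase2 "complete" messages; return the set of ids still missing.

    responses: queue.Queue of sender ids (stand-in for the control socket).
    expected:  ids that must report, e.g. every drp plus the meb.
    """
    pending = set(expected)
    deadline = time.monotonic() + timeout_s
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            sender = responses.get(timeout=remaining)
        except queue.Empty:
            break
        pending.discard(sender)
    return pending


responses = queue.Queue()
for sender in ("drp-0", "drp-1"):  # the meb never answers in this example
    responses.put(sender)

missing = wait_for_phase2(responses, {"drp-0", "drp-1", "meb-0"}, timeout_s=0.2)
print(sorted(missing))
```

Because the meb appears in the expected set under its own id, a stalled or crashed meb shows up by name in the returned set, which is the diagnostic advantage noted for option (2).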