You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 30 Next »

Data Flow and Monitoring

A useful diagram of the pieces of a DRP node that should be monitored via prometheus is here:  https://docs.google.com/presentation/d/1LvvUsV4A1F-7cao4t6mzr8z2qFFZEIwLKR4pXazU3HE/edit?usp=sharing

Generic

  • "clearreadout" and "clear"
  • need to reset the Rx/Tx link (in that order!) for XPM front-panel (note: have learned that RxLink reset can cause link CRC errors (see below) and have to do a TxLink reset to fix.  so order is important). The TxLink reset causes the link to retrain using K characters
  • look for deadtime
  • check that the "partition" window (with the trigger-enable checkbox) is talking to the right XPM: look in the XPM window label, which is something like DAQ:LAB2:XPM:N, where N is the XPM number.  A symptom of this number being incorrect is that the L0InpRate/L0AccRate remain at zeros when triggers are enabled.  This number is a unique identifier within a hierarchy of XPMs.
  • XPM is not configured to forward triggers ("LinkEnable" for that link on the XPM GUI)
  • L0Delay set to 99
  • DST Select (in PART window) set to "DontCare" (could be Dontcare/Internal)
  • check processes in lab3-base.cnf are running
  • run psdaq/build/psdaq/pgp/kcu1500/app/kcuStatus and kcuDmaStatus in kcuDmaStatus "blockspause" and "blocksfree" determine whether or not deadtime is set. if blocksfree drops below blockspause then it will assert deadtime. in hsd window "pgp last rx opcode" 0 means no backpressure, 1 means backpressure. Watch for locPause non zero which causes deadtime.
  • check for multiple drp executables
  • clearReadout broadcasts a message to receiving  kcu's telling them to reset timing-header FIFOs.
  • if running "drp" executable, check that lane mask is correct
  • if events are showing up "sporadically" look for CRC errors from "kcuSim -s -d /dev/datadev_0".  We have seen this caused by doing an XPM RxLink reset without a later TxLink reset.
  • to see if events are flowing from the hardware to the software: "cat /proc/datadev_0" and watch "Tot Buffer Use" counter under "Read Buffers"

HSD

  • look at configured parameters using (for example) "hsdpva DAQ:LAB2:HSD:DEV06_3D:A"
  • for kcu firmware that is built to use both QSFP links, the naming of the qsfp's is swapped.  i.e. the qsfp that is normally called /dev/datadev_0 is now called /dev/datadev_1
  • HSD is not configured to do anything (Check the HSD config tab for no channels enabled)
  • if hsd timing frames are not being received at 926kHz (status here), click TxLink Reset in XPM window.  Typically when this is an issue the receiving rate is ~20kHz.
  • The HSD configuration partition number does not match platform number in .cnf file (Check the HSD "base" tab)
  • also check that HEADERCNTL0 is incrementing in "status" tab of HSD cfg window.
  • in hsd window timepausecnt is number of clocks we are dead
  • in hsd expert window "full threshold(events)" sets threshold for hsd deadtime
  • in hsd status window "fex free events" and "raw free events" are the current free events.
  • in hsd status window "write fifo count" is number of timing headers waiting for HSD data to associate.
  • "readcntsum" on hsd timing tab goes up when we send a transition OR L1Accepts. "trigcntsum" counts L1Accepts only.
  • check kcuStatus for "locPause" non-zero (a low level pgp FIFO being full).  If this happens then:  configure hsd, clear readout, reboot drp node with KCU

update on variables in hsdpva gui from Matt (06/05/2020):

  • timing tab
    • timpausecnt (clock ticks of dead time)
    • trigcnt: triggers over last second
    • trigcntsum: total l1accept
    • readcntsum: number of events readout total
    • msgdelayset: (units: mhz clock ticks) should be 98 or 99 (there is an "off by one" in the epics variable, and what the hsd sees).  if too short, trigger decision is made too early, and there isn't enough buffering in hsd (breaks the buffer model)
    • headercntof: should always be zero (ben's module).  non-zero indicates that too many l1accepts have been sent and we've overwritten the available buffering
    • headercntl0: sum of number of headers received by ben's module.
    • headerfifor: the watermark for when ben's stuff asserts dead time
    • fulltotrig/nfulltotrig:  counters to determine the round trip time from sending an event to getting back the full signal (depends on fiber lengths, for example).  nfulltotrig is same thing but with opposite logic.
  • pgp tab
    • loclinkrdy/remlinkrdy should be ready
    • tx/rx clk frequencies should be 156.
    • txcnt: counts per second of things being send, but only 16 bits so doesn't really display right value
    • txcntsum is transitions and l1accepts (only 16 bits so rolls over frequently)
  • buffers tab
    • freesz: units of rows of adc readout (40samples*2bytes).
    • freeevt: number of free events.  if they go below thresholds set in config tab: dead time.
  • flow tab
    • fmask: bit mask of streams remaining to contribute to event
    • fcurr: current stream being read
    • frdy: bit mask of streams ready to be read
    • srdy: downstream slave (byte packer) is ready
    • mrdy: b0 is master valid assert, b1 is PGP transmit queue (downstream of byte packer) ready
    • rdaddr: current stream cache read pointer
    • npend: next buffer to write
    • ntrig: next buffer to trigger
    • nread: next buffer to read
    • pkoflow: overflow count in byte packer
    • oflow: current stream cache write pointer

Fake Camera

  • use this to check state of the pgp link, and the readout group, size and link mask (instead of kcuStatus): kcuSim -s -d /dev/datadev_0
  • use this to configure readout group, size, link mask: kcuSimValid -d /dev/datadev_0 -c 1 -C 2,320,0xf
    • I think this hangs because it's trying to validate a number of events (specified with the -c argument?)

TEB/MEB

(conversation with Ric on 06/16/20 on TEB grafana page)

BypCt:  number bypassing the TEB

BtWtg: boolean saying whether we're waiting to allocate a batch

TxPdg (MEB, TEB, DRP): boolean.  libfabric saying try again to send to the designated destination (meb, teb, drp)

RxPdg (MEB, TEB, DRP): same as above but for Rx.

(T(eb)M(eb))CtbOInFlt: incremented on a send, decremented on a receive (hence "in flight") 

In tables at the bottom: ToEvtCnt is number of events timed out by teb

WrtCnt MonCnt PsclCnt: the trigger decisions

TOEvtCnt TIEvtCnt: O is outbound from drp to teb, I is inbound from teb to drp

Look in teb log file for timeout messages.  To get contributor id look for messages like this in drp:

/reg/neh/home/cpo/2020/06/16_18:19:24_drp-tst-dev010:tmohsd_0.log:Parameters of Contributor ID 8:

XPM

The timing system should come back up.  If not, use a machine with afs (psdev01, pslab03?)
  • source /afs/slac/g/reseng/IPMC/env.csh
  • fru_deactivate shm-tst-lab2-atca01
  • fru_deactivate shm-tst-lab2-atca02
  • fru_deactivate shm-tst-lab2-atca03
  • fru_activate shm-tst-lab2-atca01
  • fru_activate shm-tst-lab2-atca02
  • fru_activate shm-tst-lab2-atca03

Then have to restart the base processes so that the timing system can reload it's default constants from the database (e.g. msgdelays which protect the buffering)

This error message ("Establishing link...") means that the initial ethernet communication between the ATCA board the pyxpm software is not establishing a connection.  That either means the network is down for some reason, or the board is already connected to another instance of pyxpm somewhere.
'amcc_dump_bsi --all shm-tst-lab2-atca02/2' (from psdev) will tell you if the slot 2 xpm (XPM:2) ethernet link is up.

Not Understood Failure Modes

  • XPM wasn't able to read the remote ID of the hsd in dev03  (5/16/19)
  • kcuStatus for receiving hsd data in dev024 showed pgp links unlocked, with tx/rx clock frequencies at zero.  Power cycled to recover (5/16/19)
  • No labels