Data Flow and Monitoring
A useful diagram of the pieces of a DRP node that should be monitored via prometheus is here: https://docs.google.com/presentation/d/1LvvUsV4A1F-7cao4t6mzr8z2qFFZEIwLKR4pXazU3HE/edit?usp=sharing
Generic
- "clearreadout" and "clear"
- need to reset the Rx/Tx link (in that order!) for the XPM front panel. Order matters: we have learned that an RxLink reset can cause link CRC errors (see below), and a TxLink reset is needed afterwards to fix them. The TxLink reset causes the link to retrain using K characters
- look for deadtime
- check that the "partition" window (with the trigger-enable checkbox) is talking to the right XPM: look in the XPM window label, which is something like DAQ:LAB2:XPM:N, where N is the XPM number. A symptom of this number being incorrect is that L0InpRate/L0AccRate remain at zero when triggers are enabled. This number is a unique identifier within a hierarchy of XPMs.
- XPM is not configured to forward triggers ("LinkEnable" for that link on the XPM GUI)
- L0Delay set to 99
- DST Select (in PART window) set to "DontCare" (could be Dontcare/Internal)
- check processes in lab3-base.cnf are running
- run psdaq/build/psdaq/pgp/kcu1500/app/kcuStatus and kcuDmaStatus. In kcuDmaStatus, "blockspause" and "blocksfree" determine whether or not deadtime is set: if blocksfree drops below blockspause, deadtime is asserted. In the hsd window, "pgp last rx opcode" 0 means no backpressure, 1 means backpressure. Watch for non-zero locPause, which causes deadtime.
- check for multiple drp executables
- clearReadout broadcasts a message to receiving kcu's telling them to reset timing-header FIFOs.
- if running "drp" executable, check that lane mask is correct
- if events are showing up "sporadically" look for CRC errors from "kcuSim -s -d /dev/datadev_0". We have seen this caused by doing an XPM RxLink reset without a later TxLink reset.
- to see if events are flowing from the hardware to the software: "cat /proc/datadev_0" and watch "Tot Buffer Use" counter under "Read Buffers"
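The last check above (watching "Tot Buffer Use") can be scripted. This is a minimal Python sketch that compares the counter between two snapshots of /proc/datadev_0. The "label : value" line layout and the idea that section headers are lines without a colon are assumptions about the driver output, not something recorded in these notes; adjust the parsing if the real file differs. The function names are hypothetical.

```python
def tot_buffer_use(proc_text, section="Read Buffers"):
    """Return the 'Tot Buffer Use' value under the given section, or None if absent."""
    in_section = False
    for line in proc_text.splitlines():
        if ":" not in line:
            # Assumption: a non-empty line without ':' is a section header.
            if line.strip():
                in_section = section in line
            continue
        label, _, value = line.partition(":")
        if in_section and label.strip() == "Tot Buffer Use":
            return int(value.strip())
    return None


def events_flowing(before_text, after_text):
    """True if the Read Buffers 'Tot Buffer Use' counter advanced between snapshots."""
    before = tot_buffer_use(before_text)
    after = tot_buffer_use(after_text)
    if before is None or after is None:
        return False
    return after > before

# Usage on a DRP node (take two snapshots a second or so apart):
#   first = open("/proc/datadev_0").read()
#   second = open("/proc/datadev_0").read()
#   print(events_flowing(first, second))
```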
HSD
- look at configured parameters using (for example) "hsdpva DAQ:LAB2:HSD:DEV06_3D:A"
- for kcu firmware that is built to use both QSFP links, the naming of the qsfp's is swapped. i.e. the qsfp that is normally called /dev/datadev_0 is now called /dev/datadev_1
- HSD is not configured to do anything (Check the HSD config tab for no channels enabled)
- if hsd timing frames are not being received at 926kHz (status here), click TxLink Reset in XPM window. Typically when this is an issue the receiving rate is ~20kHz.
- The HSD configuration partition number does not match platform number in .cnf file (Check the HSD "base" tab)
- also check that HEADERCNTL0 is incrementing in "status" tab of HSD cfg window.
- in hsd window timepausecnt is number of clocks we are dead
- in hsd expert window "full threshold(events)" sets threshold for hsd deadtime
- in hsd status window "fex free events" and "raw free events" are the current free events.
- in hsd status window "write fifo count" is number of timing headers waiting for HSD data to associate.
- "readcntsum" on hsd timing tab goes up when we send a transition OR L1Accepts. "trigcntsum" counts L1Accepts only.
- check kcuStatus for "locPause" non-zero (a low level pgp FIFO being full). If this happens then: configure hsd, clear readout, reboot drp node with KCU
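The deadtime indicators mentioned in these notes (blocksfree vs. blockspause from kcuDmaStatus, locPause from kcuStatus, and "pgp last rx opcode" in the hsd window) can be collected into one check. This is a sketch only: the function and argument names are hypothetical, and the values are assumed to have been read off the tools by hand or by a script of your own.

```python
def deadtime_reasons(blocksfree, blockspause, loc_pause=0, pgp_last_rx_opcode=0):
    """Return the list of deadtime/backpressure indicators that are currently set."""
    reasons = []
    # kcuDmaStatus: deadtime is asserted when blocksfree drops below blockspause.
    if blocksfree < blockspause:
        reasons.append("blocksfree below blockspause")
    # kcuStatus: non-zero locPause means a low-level pgp FIFO is full.
    if loc_pause != 0:
        reasons.append("locPause non-zero")
    # hsd window: 0 means no backpressure, 1 means backpressure.
    if pgp_last_rx_opcode == 1:
        reasons.append("pgp last rx opcode reports backpressure")
    return reasons
```

An empty list means none of these three indicators explains the deadtime, so the next place to look is the hsd buffer counters (free events vs. the configured threshold).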
update on variables in hsdpva gui from Matt (06/05/2020):
- timing tab
- timpausecnt (clock ticks of dead time)
- trigcnt: triggers over last second
- trigcntsum: total l1accept
- readcntsum: number of events readout total
- msgdelayset: (units: MHz clock ticks) should be 98 or 99 (there is an "off by one" between the epics variable and what the hsd sees). If too short, the trigger decision is made too early and there isn't enough buffering in the hsd (breaks the buffer model)
- headercntof: should always be zero (ben's module). non-zero indicates that too many l1accepts have been sent and we've overwritten the available buffering
- headercntl0: sum of number of headers received by ben's module.
- headerfifor: the watermark for when ben's stuff asserts dead time
- fulltotrig/nfulltotrig: counters to determine the round trip time from sending an event to getting back the full signal (depends on fiber lengths, for example). nfulltotrig is same thing but with opposite logic.
- pgp tab
- loclinkrdy/remlinkrdy should be ready
- tx/rx clk frequencies should be 156 (MHz).
- txcnt: counts per second of things being sent, but only 16 bits so doesn't really display the right value
- txcntsum is transitions and l1accepts (only 16 bits so rolls over frequently)
- buffers tab
- freesz: units of rows of adc readout (40samples*2bytes).
- freeevt: number of free events. if they go below thresholds set in config tab: dead time.
- flow tab
- fmask: bit mask of streams remaining to contribute to event
- fcurr: current stream being read
- frdy: bit mask of streams ready to be read
- srdy: downstream slave (byte packer) is ready
- mrdy: b0 is master valid assert, b1 is PGP transmit queue (downstream of byte packer) ready
- rdaddr: current stream cache read pointer
- npend: next buffer to write
- ntrig: next buffer to trigger
- nread: next buffer to read
- pkoflow: overflow count in byte packer
- oflow: current stream cache write pointer
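A few of the variables above need a little arithmetic to interpret: txcntsum is only 16 bits and rolls over, mrdy is a two-bit field, and freesz is in rows of 40 samples * 2 bytes. The helpers below are a hedged Python sketch with hypothetical names; the rollover helper additionally assumes fewer than 65536 counts occur between the two readings.

```python
def wrapped_delta(prev, curr, bits=16):
    """Counts accumulated between two readings of a counter that wraps at 2**bits."""
    return (curr - prev) % (1 << bits)


def decode_mrdy(mrdy):
    """Split mrdy into its two documented bits."""
    return {
        "master_valid": bool(mrdy & 0x1),        # b0: master valid assert
        "pgp_tx_queue_ready": bool(mrdy & 0x2),  # b1: PGP transmit queue ready
    }


def freesz_bytes(rows):
    """Convert freesz (rows of adc readout) to bytes: 40 samples * 2 bytes per row."""
    return rows * 40 * 2
```

For example, a txcntsum reading of 65530 followed by 4 corresponds to 10 transitions plus L1Accepts, not a negative number.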
Fake Camera
- use this to check state of the pgp link, and the readout group, size and link mask (instead of kcuStatus): kcuSim -s -d /dev/datadev_0
- use this to configure readout group, size, link mask: kcuSimValid -d /dev/datadev_0 -c 1 -C 2,320,0xf
- I think this hangs because it's trying to validate a number of events (specified with the -c argument?)
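To double-check what a -C argument such as 2,320,0xf will configure, the three fields can be split out as below. This is a Python sketch, not part of kcuSimValid: the field order (readout group, size, link mask) is taken from the note above, while the function name and the reading of the mask as "bit i enables link i" are assumptions.

```python
def parse_kcusim_c(arg):
    """Split a 'group,size,mask' string into its fields and list the links set in the mask."""
    group_s, size_s, mask_s = arg.split(",")
    mask = int(mask_s, 0)  # base 0 accepts both decimal and 0x-prefixed hex
    return {
        "readout_group": int(group_s, 0),
        "size": int(size_s, 0),
        "link_mask": mask,
        # Assumption: bit i of the mask corresponds to link i.
        "links": [i for i in range(mask.bit_length()) if (mask >> i) & 1],
    }
```

With the example argument, this gives readout group 2, size 320, and links 0 through 3.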
TEB/MEB
(conversation with Ric on 06/16/20 on TEB grafana page)
BypCt: number bypassing the TEB
BtWtg: boolean saying whether we're waiting to allocate a batch
TxPdg (MEB, TEB, DRP): boolean. libfabric saying try again to send to the designated destination (meb, teb, drp)
RxPdg (MEB, TEB, DRP): same as above but for Rx.
(T(eb)M(eb))CtbOInFlt: incremented on a send, decremented on a receive (hence "in flight")
In tables at the bottom: ToEvtCnt is number of events timed out by teb
WrtCnt MonCnt PsclCnt: the trigger decisions
TOEvtCnt TIEvtCnt: O is outbound from drp to teb, I is inbound from teb to drp
Look in teb log file for timeout messages. To get contributor id look for messages like this in drp:
/reg/neh/home/cpo/2020/06/16_18:19:24_drp-tst-dev010:tmohsd_0.log:Parameters of Contributor ID 8:
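When there are many drp logs, the contributor IDs can be collected from grep output of the form shown above. The Python sketch below assumes the log file name ends in _host:process.log and that host names contain no underscore; both are inferred from the single example line, and the function name is hypothetical.

```python
import re

# Matches the tail of "<path>_<host>:<process>.log:Parameters of Contributor ID <N>:"
CONTRIB_RE = re.compile(
    r"_(?P<host>[^_:/]+):(?P<proc>[^:/]+)\.log:Parameters of Contributor ID (?P<id>\d+):"
)


def contributor_map(grep_output):
    """Map contributor ID -> (host, process) for every matching line of grep output."""
    result = {}
    for line in grep_output.splitlines():
        m = CONTRIB_RE.search(line)
        if m:
            result[int(m.group("id"))] = (m.group("host"), m.group("proc"))
    return result

# Typical input: the output of
#   grep "Parameters of Contributor ID" <log directory>/*.log
```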
XPM
- source /afs/slac/g/reseng/IPMC/env.csh
- fru_deactivate shm-tst-lab2-atca01
- fru_deactivate shm-tst-lab2-atca02
- fru_deactivate shm-tst-lab2-atca03
- fru_activate shm-tst-lab2-atca01
- fru_activate shm-tst-lab2-atca02
- fru_activate shm-tst-lab2-atca03
Matt says for the NEH system: For slot 2,
source /afs/slac/g/reseng/IPMC/env.sh
fru_deactivate shm-daq-drp01/2
fru_activate shm-daq-drp01/2
This doesn't work for the new crate in TMO (don't know why yet), so we have to do:
ssh root@shm-tmo-daq01
clia deactivate board 2
clia activate board 2
Then have to restart the base processes so that the timing system can reload its default constants from the database (e.g. msgdelays which protect the buffering)
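The fru_deactivate/fru_activate sequence above follows one pattern: deactivate every target first, then activate them in the same order. A small Python sketch that only builds the command strings (it does not run them, and the function name is hypothetical) is shown here; the IPMC environment still has to be sourced in the shell first, as in the notes.

```python
def power_cycle_commands(targets):
    """Build the deactivate-then-activate command list for the given shelf manager targets."""
    commands = [f"fru_deactivate {t}" for t in targets]
    commands += [f"fru_activate {t}" for t in targets]
    return commands

# Example targets taken from the notes above:
#   power_cycle_commands(["shm-tst-lab2-atca01", "shm-tst-lab2-atca02", "shm-tst-lab2-atca03"])
#   power_cycle_commands(["shm-daq-drp01/2"])   # NEH system, slot 2
```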
Opal
debugging 100% deadtime with larry
https://github.com/slaclab/cameralink-gateway#pgp-channel-mapping
Look in Hsio->PgpRxAxisMon[0]->Ch[1]->FrameCnt to see frames from FEB.
Ch[0] is register for FEB
Ch[2] is the UART
ClinkPcie->Application->AppLane[0]->EventBuilder->DataCnt[0] increments with EITHER a trigger message or a transition message
DataCnt[1] increments when we have camera data. Should be equivalent to the PGP FrameCnt above
Above only increment after the event-builder has done its job
Setting Bypass to 1 turns off the event builder (bit mask, per-lane):
bit 0 to 1 bypasses timing
bit 1 to 1 bypasses the camera
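The per-lane Bypass bit mask can be decoded as follows. This is a minimal Python sketch with a hypothetical function name; it only interprets the two bits described above.

```python
def decode_bypass(value):
    """Interpret the per-lane EventBuilder Bypass bit mask."""
    return {
        "timing_bypassed": bool(value & 0x1),  # bit 0
        "camera_bypassed": bool(value & 0x2),  # bit 1
    }
```

A value of 0 leaves both inputs in the event builder; 3 bypasses both.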
tdest lane mapping:
https://github.com/slaclab/cameralink-gateway/blob/master/firmware/common/rtl/AppLane.vhd#L109-L111
Tried hitting
Commands->ClinkDevRoot->ClinkPcie->Application->AppLane[0]->EventBuilder->CntRst
didn't help
When we get stuck
TriggerEventBuffer[0]->FifoPause goes True
TriggerEventBuffer[0]->XpmPause goes True
setting EventBuilder->Blowoff True resets
Using groupca to send a transition causes
TriggerEventBuffer[0]->TransitionCount to increment, and
TriggerEventBuffer[0]->FifoWrCount to increment
Check that TriggerEventBuffer[0]->Partition is set to the right readout group
Try toggling between XpmMini and LCLS2 timing
Check TimingRx->TimingFrameRx->RxLinkUp is 0x1
Check ClinkTop->LinkLockedA=True
Check ClkInFreq and ClinkClkFreq are 80 MHz
ClinkTop->Ch[0]:
LinkMode Base
DataMode 12Bit
FrameMode Frame
TapCount 0x2
DataEn True
BaudRate 57600
FrameCount increments with L1Accepts, but not during deadtime.
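The ClinkTop->Ch[0] settings listed above can be compared against what the GUI currently shows. The Python sketch below assumes the values have been copied into a dict by hand or by your own script; the expected values are simply the ones recorded in these notes for this Opal setup, not a general specification, and the names are hypothetical.

```python
# Values recorded in the notes for ClinkTop->Ch[0]
EXPECTED_CH0 = {
    "LinkMode": "Base",
    "DataMode": "12Bit",
    "FrameMode": "Frame",
    "TapCount": 0x2,
    "DataEn": True,
    "BaudRate": 57600,
}


def ch0_mismatches(observed, expected=EXPECTED_CH0):
    """Return {name: (expected, observed)} for every setting that differs or is missing."""
    return {
        name: (want, observed.get(name))
        for name, want in expected.items()
        if observed.get(name) != want
    }
```

An empty result means the channel configuration matches the recorded settings, so the remaining checks (link lock, clock frequencies, readout group) are the next suspects.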
Not Understood Failure Modes
- XPM wasn't able to read the remote ID of the hsd in dev03 (5/16/19)
- kcuStatus for receiving hsd data in dev024 showed pgp links unlocked, with tx/rx clock frequencies at zero. Power cycled to recover (5/16/19)