You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 58 Next »

Data Flow and Monitoring

A useful diagram of the pieces of a DRP node that should be monitored via prometheus is here:  https://docs.google.com/presentation/d/1LvvUsV4A1F-7cao4t6mzr8z2qFFZEIwLKR4pXazU3HE/edit?usp=sharing

Generic

  • "clearreadout" and "clear"
  • need to reset the Rx/Tx link (in that order!) for XPM front-panel (note: have learned that RxLink reset can cause link CRC errors (see below) and have to do a TxLink reset to fix.  so order is important). The TxLink reset causes the link to retrain using K characters
  • look for deadtime
  • check that the "partition" window (with the trigger-enable checkbox) is talking to the right XPM: look in the XPM window label, which is something like DAQ:LAB2:XPM:N, where N is the XPM number.  A symptom of this number being incorrect is that the L0InpRate/L0AccRate remain at zeros when triggers are enabled.  This number is a unique identifier within a hierarchy of XPMs.
  • XPM is not configured to forward triggers ("LinkEnable" for that link on the XPM GUI)
  • L0Delay set to 99
  • DST Select (in PART window) set to "DontCare" (could be Dontcare/Internal)
  • check processes in lab3-base.cnf are running
  • run psdaq/build/psdaq/pgp/kcu1500/app/kcuStatus and kcuDmaStatus in kcuDmaStatus "blockspause" and "blocksfree" determine whether or not deadtime is set. if blocksfree drops below blockspause then it will assert deadtime. in hsd window "pgp last rx opcode" 0 means no backpressure, 1 means backpressure. Watch for locPause non zero which causes deadtime.
  • check for multiple drp executables
  • clearReadout broadcasts a message to receiving  kcu's telling them to reset timing-header FIFOs.
  • if running "drp" executable, check that lane mask is correct
  • if events are showing up "sporadically" look for CRC errors from "kcuSim -s -d /dev/datadev_0".  We have seen this caused by doing an XPM RxLink reset without a later TxLink reset.
  • to see if events are flowing from the hardware to the software: "cat /proc/datadev_0" and watch "Tot Buffer Use" counter under "Read Buffers"

HSD

  • look at configured parameters using (for example) "hsdpva DAQ:LAB2:HSD:DEV06_3D:A"
  • for kcu firmware that is built to use both QSFP links, the naming of the qsfp's is swapped.  i.e. the qsfp that is normally called /dev/datadev_0 is now called /dev/datadev_1
  • HSD is not configured to do anything (Check the HSD config tab for no channels enabled)
  • if hsd timing frames are not being received at 926kHz (status here), click TxLink Reset in XPM window.  Typically when this is an issue the receiving rate is ~20kHz.
  • The HSD configuration partition number does not match platform number in .cnf file (Check the HSD "base" tab)
  • also check that HEADERCNTL0 is incrementing in "status" tab of HSD cfg window.
  • in hsd window timepausecnt is number of clocks we are dead
  • in hsd expert window "full threshold(events)" sets threshold for hsd deadtime
  • in hsd status window "fex free events" and "raw free events" are the current free events.
  • in hsd status window "write fifo count" is number of timing headers waiting for HSD data to associate.
  • "readcntsum" on hsd timing tab goes up when we send a transition OR L1Accepts. "trigcntsum" counts L1Accepts only.
  • check kcuStatus for "locPause" non-zero (a low level pgp FIFO being full).  If this happens then:  configure hsd, clear readout, reboot drp node with KCU

update on variables in hsdpva gui from Matt (06/05/2020):

  • timing tab
    • timpausecnt (clock ticks of dead time)
    • trigcnt: triggers over last second
    • trigcntsum: total l1accept
    • readcntsum: number of events readout total
    • msgdelayset: (units: mhz clock ticks) should be 98 or 99 (there is an "off by one" in the epics variable, and what the hsd sees).  if too short, trigger decision is made too early, and there isn't enough buffering in hsd (breaks the buffer model)
    • headercntof: should always be zero (ben's module).  non-zero indicates that too many l1accepts have been sent and we've overwritten the available buffering
    • headercntl0: sum of number of headers received by ben's module.
    • headerfifor: the watermark for when ben's stuff asserts dead time
    • fulltotrig/nfulltotrig:  counters to determine the round trip time from sending an event to getting back the full signal (depends on fiber lengths, for example).  nfulltotrig is same thing but with opposite logic.
  • pgp tab
    • loclinkrdy/remlinkrdy should be ready
    • tx/rx clk frequencies should be 156.
    • txcnt: counts per second of things being send, but only 16 bits so doesn't really display right value
    • txcntsum is transitions and l1accepts (only 16 bits so rolls over frequently)
  • buffers tab
    • freesz: units of rows of adc readout (40samples*2bytes).
    • freeevt: number of free events.  if they go below thresholds set in config tab: dead time.
  • flow tab
    • fmask: bit mask of streams remaining to contribute to event
    • fcurr: current stream being read
    • frdy: bit mask of streams ready to be read
    • srdy: downstream slave (byte packer) is ready
    • mrdy: b0 is master valid assert, b1 is PGP transmit queue (downstream of byte packer) ready
    • rdaddr: current stream cache read pointer
    • npend: next buffer to write
    • ntrig: next buffer to trigger
    • nread: next buffer to read
    • pkoflow: overflow count in byte packer
    • oflow: current stream cache write pointer

Fake Camera

  • use this to check state of the pgp link, and the readout group, size and link mask (instead of kcuStatus): kcuSim -s -d /dev/datadev_0
  • use this to configure readout group, size, link mask: kcuSimValid -d /dev/datadev_0 -c 1 -C 2,320,0xf
    • I think this hangs because it's trying to validate a number of events (specified with the -c argument?)

TEB/MEB

(conversation with Ric on 06/16/20 on TEB grafana page)

BypCt:  number bypassing the TEB

BtWtg: boolean saying whether we're waiting to allocate a batch

TxPdg (MEB, TEB, DRP): boolean.  libfabric saying try again to send to the designated destination (meb, teb, drp)

RxPdg (MEB, TEB, DRP): same as above but for Rx.

(T(eb)M(eb))CtbOInFlt: incremented on a send, decremented on a receive (hence "in flight") 

In tables at the bottom: ToEvtCnt is number of events timed out by teb

WrtCnt MonCnt PsclCnt: the trigger decisions

TOEvtCnt TIEvtCnt: O is outbound from drp to teb, I is inbound from teb to drp

Look in teb log file for timeout messages.  To get contributor id look for messages like this in drp:

/reg/neh/home/cpo/2020/06/16_18:19:24_drp-tst-dev010:tmohsd_0.log:Parameters of Contributor ID 8:

XPM

The timing system should come back up.  If not, use a machine with afs (psdev01, pslab03?)
  • source /afs/slac/g/reseng/IPMC/env.csh
  • fru_deactivate shm-tst-lab2-atca01
  • fru_deactivate shm-tst-lab2-atca02
  • fru_deactivate shm-tst-lab2-atca03
  • fru_activate shm-tst-lab2-atca01
  • fru_activate shm-tst-lab2-atca02
  • fru_activate shm-tst-lab2-atca03

Matt says for the NEH system: For slot 2,

source /afs/slac/g/reseng/IPMC/env.sh
fru_deactivate shm-daq-drp01/2
fru_activate shm-daq-drp01/2

This doesn't work for the new crate in TMO (don't know why yet), so we have to do:

ssh root@shm-tmo-daq01
clia deactivate board 2
clia activate board 2

Then have to restart the base processes so that the timing system can reload it's default constants from the database (e.g. msgdelays which protect the buffering)

This error message ("Establishing link...") means that the initial ethernet communication between the ATCA board the pyxpm software is not establishing a connection.  That either means the network is down for some reason, or the board is already connected to another instance of pyxpm somewhere.
'amcc_dump_bsi --all shm-tst-lab2-atca02/2' (from psdev) will tell you if the slot 2 xpm (XPM:2) ethernet link is up.

Opal

Larry's instructions are here:

https://github.com/slaclab/cameralink-gateway#example-of-starting-up-opal1000

running standalone rogue gui:

python ~/git/cameralink-gateway/software/scripts/devGui.py --camType Opal1000 --enLclsI False --enLclsII True --startupMode True

warnings reprogramming FEB fpga:

(note it looks like we use the "1ch" firmware for opal since it's base camlink?)

(rogue_pre-release) [cpo@lcls-pc83236 software]$ python software/scripts/updateFebFPGA.py --mcs ~/ClinkFebPgp2b_1ch-0x02000000-20191210103130-ruckman-d6618cc.mcs --lane 0

https://github.com/slaclab/rogue

to update the pcie kcu1500 board:

python software/scripts/updatePcieFpga.py --path /reg/neh/home/cpo/junk/

Debugging 100% Deadtime with Larry and Ben

https://github.com/slaclab/cameralink-gateway#pgp-channel-mapping

Look in Hsio->PgpRxAxisMon[0]->Ch[1]->FrameCnt to see frames from FEB.
Ch[0] is register for FEB
Ch[2] is the UART

ClinkPcie->Application->AppLane[0]->EventBuilder->DataCnt[0] increments with EITHER a trigger message or a transition message
DataCnt[1] increments when we have camera data. Should be equivalent to the PGP FrameCnt above
Above only increment after the event-builder has done its job

Setting Bypass to 1 turns off the event builder (bit mask, per-lane):
bit 0 to 1 bypasses timing
bit 1 to 1 bypasses the camera

tdest lane mapping:

https://github.com/slaclab/cameralink-gateway/blob/master/firmware/common/rtl/AppLane.vhd#L109-L111

Tried hitting
Commands->ClinkDevRoot->ClinkPcie->Application->Applane[0]->Eventbuilder->CntRst
didn't help

When we get struck
TriggerEventBuffer[0]=>FifoPause goes True
TriggerEventBuffer[0]=>XpmPause goes True

setting EventBuilder->Blowoff True resets

Using groupca to send a transition causes these under Hsio->TimingRx->TriggerEventManager:
TriggerEventBuffer[0]=>TransitionCount to increment, and
TriggerEventBuffer[0]=>FifoWrCount to increment

XpmOverflow=True is latched version of FifoOverflow.  Need ClearReadout to reset.

Check that TrigerEventBuffer[0]->Partition is set to right readout group

Try toggling between XpmMini and LCLS2 timing:  I think this fixed the kcu1500→XPM link not locking one time.

Check TimingRx->RimingFrameRx->RxLinkUp is 0x1 (link to XPM)

Check Hsio->PgpMon[0]→RemLinkReady=True (link to FEB, I think)

Check ClinkTop→LinkLockedA=True (link to opal)
Check ClkInFreq and ClinkClkFreq are 80Mhz (camlink frequencies)

ClinkTop->Ch[0]:
LinkMode Base
DataMode 12Bit
FrameMode Frame
TapCount 0x2
DataEn True
BaudRate 57600
FrameCount increments with L1Accepts, but not if deadtime.

ClinkTop→TrigCtrl[0]:

TrigCnt counts, but not if dead time.

Tried hitting all the resets on the rogue GUI's "commands" tab (CntRst, TimerRst, HardRst, SoftRst) but didn't help with 100% deadtime.  Powercycling the KCU1500 node.  After that the XPM link wouldn't lock at all.  Was able to get it to re-lock by hitting the resets here ... wasn't watching the right column on xpmpva, so I don't know which one did the trick (but a second power-cycle later on suggests tat it was the TxPhyReset):

Run pgpread like this (-c 1 says to use virtual-channel 1, for camera images, I think, and -r tells it to use rogue mode since it's getting an EventBatcherHeader not a TimingSystemHeader):

psdaq/build/drp/pgpread -d /dev/datadev_0 -c 1 -r

REMEMBER:  can run DAQ (or at least pgpread) and rogue GUI at the same time since rogue disables opening data stream virtual channel with this linehttps://github.com/slaclab/lcls2-pgp-fw-lib/blob/d57de0f1da359b3f1ab8d899968c022f92b58a28/python/lcls2_pgp_fw_lib/hardware/shared/_Root.py#L22.  LATER: the drp couldn't seem to open /dev/datadev_0 VC 0 with this error when rogue GUI was running, but maybe the data (VC 1?) is OK:

rogue.GeneralError: AxiStreamDma::AxiStreamDma: General Error: Failed to open device file /dev/datadev_0 with dest 0x0! Another process may already have it open!


See 1Hz of SlowUpdates going through (transition id 10).  The SlowUpdates cause TransactionCnt to increment at 1Hz (zero'd by toggling blowoff):

Disabled slow updates, but didn't help with 100% dead time.  Do blowoff (to empty queues) and clear readout (to clear latched XpmOverflow which in turn latches deadtime?).  Then pgpread sees transitions and some l1accepts (although haltingly?) before deadtime kicks in.

Tried the FifoReset here in addition to blowoff and clearreadout but don't see any l1accepts in pgpread:

Do see these registers counting (RxFrameCount, TxFrameCount).  Perhaps the rogue register reads?

Tried this UserRst with blowoff/clearreadout, but doesn't help:

Also tried FpgaReload above and running the DAQ (including configure, since presumably the FEB lost its configuration) but still 100% deadtime.

Tried power cycling camera.  Camlink converter box LED was GREEN after power cycling, but still 100% deadtime.

Power cycled the camlink converter box.  LED was RED after power cycling.  Failed to configure.  Power cycling the camera.  Camlink converter box LED stayed RED.  After configuring went to GREEN.  Makes it feel like there might be a power-ordering issue:  after cycling camlink converter box need to cycle the camera and reconfigure.  Still get deadtime.  Checked camlink cables:  screws needed tightening, but felt ok.

Now can't even get a transition through.

Measure light levels:

  • camlink converter to kcu (1310nm): -5.2dBm at box,  -4.7dBm at KCU
  • XPM to kcu (850nm): -8.2dBm at XPM, can't measure at KCU because of quad fiber

Powercycled cmp011 again.  After resetting the XPM link (toggled XpmMiniTiming/ConfigLCLS2Timing and hit TxPhyReset) I can configure.  Back in the mode where we get 100% deadtime, but I can get transitions and a few events through by doing Blowoff/ClearReadout.

Tried increasing the TriggerPulseWidth to 100us (was 0.092us?) but doesn't help with deadtime:

This TxOpCodeCount under Hsio→PgpMon[0] seems to only increase with L1Accepts.  We are getting far fewer of these than the L1Accept rate says we should.  Suggests we're dropping camera frames?

Possible breakthrough:  if I power cycle the camera and run without reconfiguring then L1Accepts show up in pgpreader at the correct rate.  Feels like our configuration is breaking the camera?  Compared the power-on defaults (https://www.slac.stanford.edu/exp/npa/manuals/opal1000man.pdf) to our config.  Don't see an obvious difference in UART settings except we have VR=1 (vertical remap) and OVL=1 (overlay frame counter and integration time on early pixel data)?.  Other differences (BL(black level), DPE (defect pixel correction) OFS (output offset)) feel like they don't matter.  Tried setting VR=0 and OVL=0 but it didn't help.  Set them back to 1.

Horrible workaround:  if I configure the camera, then powercycle, it appears to run.  

The XPM link went bad again, and recovered with TxPhyReset.

Ben pointed out that in the case of failure, these L0Count/L1AcceptCount counters increment at the right rate, but TxOpCodeCount doesn't, so not all L1's are getting translated into TxOpCode's (!), so it's a problem in the KCU.  Appears to contradict the camera-powercycle approach above, but perhaps that is caused by the exposure time getting set smaller so it can handle messed up delays?  (see L0Delay solution below):

Questions for Larry/Ben:

  • does clearreadout trigger blowoff and fiforeset? (and reset the latched xpmoverflow)
  • why would TxPhyReset help the xpm link?
  • is TxOpCodeCount the right thing to watch?
  • why doesn't send string work for the opal?  Answer from Larry:  need to prepend an "@", e.g. "@SN?" seems to work.

Try moving back to old timing values.  Current settings:

(ps-3.1.16) tmo-daq:lcls2$ pvget DAQ:NEH:XPM:0:XTPG:CuDelay
DAQ:NEH:XPM:0:XTPG:CuDelay 2020-10-21 09:33:26.010  134682 
(ps-3.1.16) tmo-daq:lcls2$ pvget DAQ:NEH:XPM:0:PART:4:L0Delay
DAQ:NEH:XPM:0:PART:4:L0Delay 2020-10-21 09:33:26.334  10 

Matt writes that the pre-Oct.-15 values were:

These were the values prior to 10/15
pvput DAQ:NEH:XPM:0:PART:3:L0Delay 90 (readout group 3)
pvput DAQ:NEH:XPM:0:XTPG:CuDelay 160000
Changes after that are in the tmoc00118 logbook.

This appeared to fix the problem!:

(ps-3.1.16) tmo-daq:lcls2$ pvput DAQ:NEH:XPM:0:PART:4:L0Delay 90
Old : 2020-10-21 09:33:26.334  10 
New : 2020-10-21 12:48:30.590  90 
(ps-3.1.16) tmo-daq:lcls2$ pvput DAQ:NEH:XPM:0:XTPG:CuDelay 160000
Old : 2020-10-21 09:33:26.010  134682 
New : 2020-10-21 12:48:58.977  160000 

It turns out the CuTiming didn't mater, only the L0Delay change fixed it.  From the opal log file:

I guess this is output from the logfile when it fails:  triggerDelay 17810
And when it succeeds: triggerDelay 1810 (after changing L0Delay to 90)

Made it feel like we're running out of bits in firmware for triggerDelay, but (a) rogue reads back the register correctly (b) Matt says he thinks we have 32-bits.

Not Understood Failure Modes

  • XPM wasn't able to read the remote ID of the hsd in dev03  (5/16/19)
  • kcuStatus for receiving hsd data in dev024 showed pgp links unlocked, with tx/rx clock frequencies at zero.  Power cycled to recover (5/16/19)
  • No labels