Larry thinks that these are in the raw units read out from the device (mW) and says that to convert to dBm, use the formula 10*log10(val/1mW).  For example, 0.6 corresponds to -2.2dBm.  The same information is now displayed with xpmpva in the "SFPs" tab.

Code Block
(ps-4.1.2) tmo-daq:scripts> pvget DAQ:NEH:XPM:0:SFPSTATUS
DAQ:NEH:XPM:0:SFPSTATUS 2021-01-13 14:36:15.450    
LossOfSignal ModuleAbsent TxPower RxPower
           0            0  6.5535  6.5535
           1            0  0.5701  0.0001
           0            0  0.5883  0.7572
           0            0  0.5746  0.5679
           0            0  0.8134   0.738
           0            0  0.6844    0.88
           0            0  0.5942  0.4925
           0            0  0.5218  0.7779
           1            0   0.608  0.0001
           0            0  0.5419  0.3033
           1            0  0.6652  0.0001
           0            0  0.5177  0.8751
           1            1       0       0
           0            0  0.7723   0.201
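
The mW-to-dBm conversion quoted above can be sketched in Python as follows. This is a minimal illustration, not part of the DAQ software: the function name mw_to_dbm is made up here, and returning -inf for a reading of 0 (e.g. an absent module) is an assumption about how one might want to handle that case.

```python
import math

def mw_to_dbm(power_mw):
    """Convert optical power in mW to dBm using 10*log10(P / 1 mW)."""
    if power_mw <= 0:
        # Assumption: treat a zero reading (e.g. ModuleAbsent) as "no light".
        return float("-inf")
    return 10.0 * math.log10(power_mw)

# RxPower values (mW) copied from three rows of the pvget output above.
for rx_mw in (0.7572, 0.0001, 0.0):
    print(f"{rx_mw:7.4f} mW -> {mw_to_dbm(rx_mw):6.1f} dBm")
```

With this conversion, the RxPower readings of 0.0001 on the rows flagged LossOfSignal come out at about -40 dBm.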

Programming Firmware

From Matt.  He says the current production version (which still suffers from xpm-link-glitch storms) is 0x030504. The git repo with firmware is here:

https://github.com/slaclab/l2si-xpm

Please remember to stop the pyxpm process associated with the xpm before proceeding.
Connect to tmo-daq as tmoopr and run "procmgr stop neh_base.cnf pyxpm-xx".

Code Block
ssh drp-neh-ctl01    # has ethernet access to the ATCA switch; use drp-srcf-mon001 for production hutches
~weaver/FirmwareLoader/rhel6/FirmwareLoader -a <XPM_IPADDR> <MCS_FILE>    # binary copied from afs

ssh psdev
source /cds/sw/package/IPMC/env.sh
fru_deactivate shm-fee-daq01/<SLOT>
fru_activate shm-fee-daq01/<SLOT>

The MCS_FILE can be found at:
/cds/home/w/weaver/mcs/xpm/xpm-0x03060000-20231009210826-weaver-a0031eb.mcs
/cds/home/w/weaver/mcs/xpm/xpm_noRTM-0x03060000-20231010072209-weaver-a0031eb.mcs
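
The procedure above spans three hosts, so it can help to write it out as an ordered checklist before starting. The sketch below only builds and prints the command strings (it runs nothing). The function name and the example IP address, shelf name, slot, and pyxpm process name are hypothetical placeholders, not values from these notes.

```python
def firmware_update_steps(xpm_ip, mcs_file, shelf, slot, pyxpm_name,
                          loader_host="drp-neh-ctl01", cnf="neh_base.cnf"):
    """Return the firmware update steps as (host, command) pairs, in order."""
    return [
        ("tmo-daq", f"procmgr stop {cnf} {pyxpm_name}"),
        (loader_host,
         f"~weaver/FirmwareLoader/rhel6/FirmwareLoader -a {xpm_ip} {mcs_file}"),
        ("psdev", "source /cds/sw/package/IPMC/env.sh"),
        ("psdev", f"fru_deactivate {shelf}/{slot}"),
        ("psdev", f"fru_activate {shelf}/{slot}"),
    ]

# Hypothetical example values, for illustration only.
steps = firmware_update_steps(
    xpm_ip="10.0.0.1",
    mcs_file="/cds/home/w/weaver/mcs/xpm/xpm-0x03060000-20231009210826-weaver-a0031eb.mcs",
    shelf="shm-fee-daq01",
    slot="2",
    pyxpm_name="pyxpm-xx",
)
for host, command in steps:
    print(f"[{host}] {command}")
```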


Incorrect Fiducial Rates

In Jan. 2023 Matt saw a failure mode where xpmpva showed 2kHz fiducial rate instead of the expected 930kHz.  This was traced to an upstream accelerator timing distribution module being uninitialized.

(later):  This is now understood to be due to an overflow of a delay FIFO in the receiver (XPM) under link error conditions.  XPM firmware has been updated to clear this error state.

In April 2023, DAQs run on SRCF machines had 'PGPReader: Jump in complete l1Count' errors.  Matt found XPM:0 receiving 929kHz of fiducials but only transmitting 22.5kHz, which he thought was due to CRC errors on its input.  Also XPM:0's FbClk seemed frozen.  Matt said:

I could see the outbound fiducials were 22.5kHz by clicking one of the outbound ports LinkLoopback on.  The received rate on that outbound link is then the outbound fiducial rate.

At least now we know this error state is somewhere within the XPM and not upstream.

The issue was cleared by resetting XPM:0 with fru_deactivate/fru_activate.

Note that when the XPMs are in a good state, the following values should be seen:

  • Global tab:
    • RecClk: 185 MHz
    • FbClk: 185 MHz
  • UsTiming tab:
    • RxClks: 185 MHz
    • RxLinkUp: 1
    • CrcErrs: 0
    • RxDecErrs: 0
    • RxDspErrs: 0
    • FIDs: 929 kHz
    • SOFs: 929 kHz
    • EOFs: 929 kHz
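
The good-state values above can be turned into a simple checklist comparison. The sketch below assumes the readings have already been collected by hand (or by some other tool) into a dictionary keyed by (tab, field); it does not read anything from the XPM itself. The 1% tolerance on clock and frame rates is an assumption, not a number from these notes.

```python
# Expected values from the list above; rates are in Hz.
EXPECTED_RATES_HZ = {
    ("Global", "RecClk"): 185e6,
    ("Global", "FbClk"): 185e6,
    ("UsTiming", "RxClks"): 185e6,
    ("UsTiming", "FIDs"): 929e3,
    ("UsTiming", "SOFs"): 929e3,
    ("UsTiming", "EOFs"): 929e3,
}
EXPECTED_EXACT = {
    ("UsTiming", "RxLinkUp"): 1,
    ("UsTiming", "CrcErrs"): 0,
    ("UsTiming", "RxDecErrs"): 0,
    ("UsTiming", "RxDspErrs"): 0,
}
RATE_TOLERANCE = 0.01  # assumed relative tolerance on rates

def find_problems(readings):
    """Compare readings {(tab, field): value} with the good-state values."""
    problems = []
    for (tab, field), expected in EXPECTED_RATES_HZ.items():
        value = readings.get((tab, field))
        if value is None or abs(value - expected) > RATE_TOLERANCE * expected:
            problems.append(f"{tab}/{field}: got {value}, expected about {expected:g} Hz")
    for (tab, field), expected in EXPECTED_EXACT.items():
        value = readings.get((tab, field))
        if value != expected:
            problems.append(f"{tab}/{field}: got {value}, expected {expected}")
    return problems

# Example: a frozen FbClk, as in the April 2023 incident described above.
readings = {**EXPECTED_RATES_HZ, **EXPECTED_EXACT, ("Global", "FbClk"): 0.0}
for line in find_problems(readings):
    print(line)
```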

No RxRcv/RxErr Frames in xpmpva

If RxRcv/RxErr frames are stuck in xpmpva it may be that the network interface to the ATCA crate is not set up for jumbo frames.
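
One quick way to check the jumbo-frame setting on a Linux host is to read the interface MTU from sysfs. This is a sketch under stated assumptions: the interface name "eth0" is a placeholder (the interface that faces the ATCA crate will differ), and any MTU above the standard 1500 bytes is treated as jumbo.

```python
from pathlib import Path

STANDARD_MTU = 1500  # standard Ethernet MTU; anything larger counts as jumbo here

def read_mtu(interface, sysfs_root="/sys/class/net"):
    """Read the MTU of a network interface from Linux sysfs."""
    return int((Path(sysfs_root) / interface / "mtu").read_text().strip())

def is_jumbo(mtu):
    """Return True if the MTU is larger than the standard Ethernet MTU."""
    return mtu > STANDARD_MTU

def report(interface):
    """Print whether jumbo frames appear to be enabled on the interface."""
    mtu = read_mtu(interface)
    state = "jumbo frames enabled" if is_jumbo(mtu) else "NOT set up for jumbo frames"
    print(f"{interface}: MTU {mtu} ({state})")

# Usage on the host connected to the ATCA crate (placeholder interface name):
#   report("eth0")
```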

Link Issues

If XPM links don't lock, here are some past causes:

  • check that transceivers (especially QSFP, which can be difficult) are fully plugged in.
  • for opal detectors:
    • use devGui to toggle between xpmmini/LCLS2 timing (Matt has added this to the opal config script, but to the part that executes at startup time)
    • hit TxPhyReset in the devGui (this is now done in the opal drp executable)
  • if timing frames are stuck in a camlink node, hitting TxPhyPllReset started the timing frame counters going (and it is lighter-weight than the xpmmini→lcls2 timing toggle)
  • on a TDet node, "kcuSim -T" (reset timing PLL) made a link lock
  • for timing system detectors: run "kcuSim -s -d /dev/datadev_1".  This should also be done when one runs a drp process on the drp node (to initialize the timing registers).  The drp executable in this case doesn't need any transitions.
  • hit Tx/Rx reset on xpmpva gui (AMC tabs).
  • use loopback fibers (or click a loopback checkbox in xpmpva) to determine which side has the problem
  • try swapping fibers in the BOS to see if the problem is on the xpm side or the kcu side
  • we once had to power cycle a camlink drp node to make the xpm timing link lock.  Matt suggests that perhaps hitting PLL resets in the rogue gui could be a more delicate way of doing this.
  • (old information with the old/broken BOS) Valerio and Matt had noticed that the BOS sometimes lets its connections deteriorate. To fix:
    • ssh root@osw-daq-calients320
    • omm-ctrl --reset

Timing Frames Not Properly Received

  • do TXreset on appropriate port
  • toggling between xpmmini and lcls2 timing can fix (we have put this in the code now, previously was lcls1-to-lcls2 timing toggle in the code)
  • sometimes xpm's have become confused and think they are receiving 26MHz timing frames when they should be 0.9MHz (this can be seen in the upstream-timing tab of xpmpva, "UsTiming").  You can determine which xpm is responsible by putting each link in loopback mode: if it is working properly you should see 0.9MHz of rx frames in loopback mode (compared with the usual 20MHz of frames in normal mode).  Proceed upstream until you find a working xpm, then do tx resets (and rx?) downstream to fix them.
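
The "proceed upstream until you find a working xpm" procedure in the last bullet can be sketched as a small planning helper. It assumes the loopback rx frame rates have already been read off xpmpva by hand for each xpm in the chain; the xpm names in the example and the 0.1MHz tolerance are hypothetical.

```python
EXPECTED_LOOPBACK_MHZ = 0.9  # healthy loopback rx frame rate quoted above
TOLERANCE_MHZ = 0.1          # assumed tolerance, not from the original notes

def is_healthy(loopback_rate_mhz):
    """Return True if the loopback rx frame rate is close to 0.9MHz."""
    return abs(loopback_rate_mhz - EXPECTED_LOOPBACK_MHZ) <= TOLERANCE_MHZ

def plan_resets(chain):
    """Walk a chain of (xpm_name, loopback_rate_mhz), ordered downstream to upstream.

    Returns (first_working_xpm, xpms_to_reset), where xpms_to_reset is ordered
    from just below the working xpm down to the most downstream one.
    first_working_xpm is None if no xpm in the chain looks healthy.
    """
    confused = []
    for name, rate in chain:
        if is_healthy(rate):
            return name, list(reversed(confused))
        confused.append(name)
    return None, list(reversed(confused))

# Hypothetical chain: two confused xpms below a working one.
working, to_reset = plan_resets([("XPM:3", 26.0), ("XPM:2", 26.0), ("XPM:0", 0.93)])
print("first working xpm:", working)
print("reset in this order:", to_reset)
```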

Network Connection Difficulty

Saw this error on Nov. 2 2021 in lab3 over and over:

Code Block
WARNING:pyrogue.Device.UdpRssiPack.rudpReg:host=10.0.2.102, port=8193 -> Establishing link ...

Matt writes:

That error could mean that some other pyxpm process is connected to it.  Using ping should show if the device is really off the network, which seems to be the case.  You can also use "amcc_dump_bsi --all shm-tst-lab2-atca02" to see the status of the ATCA boards from the shelf manager's view.  (source /afs/slac/g/reseng/IPMC/env.sh[csh] or source /cds/sw/package/IPMC/env.sh[csh])  It looks like the boards in slots 2 and 4 had lost ethernet connectivity (with the ATCA switch) but should be good now.  None of the boards respond to ping, so I'm guessing it's the ATCA switch that's failed.  The power on that board can also be cycled with "fru_deactivate, fru_activate".  I did that, and now they all respond to ping.
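
As a small aid for the ping check Matt describes, the host address can be pulled out of the warning line and pinged once. This is a sketch only: the helper names are made up, the regular expression assumes the "host=..., port=..." layout seen in the warning above, and "ping -c 1" assumes a Linux-style ping.

```python
import re
import subprocess

HOST_PORT_RE = re.compile(r"host=(?P<host>[\d.]+),\s*port=(?P<port>\d+)")

def parse_rssi_warning(line):
    """Extract (host, port) from a UdpRssiPack warning line, or None if absent."""
    match = HOST_PORT_RE.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))

def responds_to_ping(host, timeout_s=5):
    """Send a single ping; return True if the host answered."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

line = ("WARNING:pyrogue.Device.UdpRssiPack.rudpReg:host=10.0.2.102, "
        "port=8193 -> Establishing link ...")
print(parse_rssi_warning(line))
# To actually test connectivity: responds_to_ping(parse_rssi_warning(line)[0])
```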

Firmware Varieties and Switching Between Internal/External Timing

NOTE: these instructions only apply to XPM boards running "xtpg" firmware.  This is the only version that supports internal timing for the official XPM boards.  It has software-selectable internal/external timing using the "CuInput" variable.  KCU1500s running the xpm firmware have a different image for internal timing with "Gen" in the name (see /cds/home/w/weaver/mcs/xpm/*Gen*, which currently contains only a KCU1500 internal-timing version).

If the xpm board is in external mode in the database we believe we have to reinitialize the database by running:

python pyxpm_db.py --inst tmo --name DAQ:NEH:XPM:10 --prod --user tmoopr --password pcds --alias XPM

The CuInput flag (DAQ:NEH:XPM:0:XTPG:CuInput) is set to 1 for internal timing, instead of 0 (external timing with the first RTM SFP input, presumably labelled "EVR[0]" on the RTM, but we are not certain) or 3 (the second RTM SFP timing input, labelled "EVR[1]" on the RTM).
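
The CuInput values described above can be captured in a small lookup for scripts or notes. The enum and member names below are made up for illustration; only the numeric values 0, 1 and 3 and their meanings come from the text, including the uncertainty about the "EVR[0]" label.

```python
from enum import IntEnum

class CuInput(IntEnum):
    """CuInput settings for xtpg firmware, as described in these notes."""
    EXTERNAL_RTM_SFP_FIRST = 0   # presumably labelled "EVR[0]" on the RTM (not certain)
    INTERNAL = 1                 # internally generated timing
    EXTERNAL_RTM_SFP_SECOND = 3  # labelled "EVR[1]" on the RTM

DESCRIPTIONS = {
    CuInput.EXTERNAL_RTM_SFP_FIRST: "external timing, first RTM SFP input",
    CuInput.INTERNAL: "internal timing",
    CuInput.EXTERNAL_RTM_SFP_SECOND: "second RTM SFP timing input",
}

def describe_cu_input(value):
    """Return a human-readable description of a CuInput value."""
    try:
        return DESCRIPTIONS[CuInput(value)]
    except ValueError:
        return f"unknown CuInput value {value}"

for value in (0, 1, 2, 3):
    print(value, "->", describe_cu_input(value))
```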

Matt says there are three types of XPM firmware: (1) an XTPG version which requires an RTM input, (2) a standard XPM version which requires RTM input, and (3) a version which gets its timing input from AMC0 port 0 (with "noRTM" in the name).  The xtpg version can take lcls1 input timing and convert to lcls2 or can generate internal lcls2 timing.  Now that we have switched the tmo/rix systems to lcls2 timing this version is not needed anymore: the "xpm" firmware version should be used.  The one exception is the detector group running in MFX from LCLS1 timing which currently uses xpm7 running xtpg firmware.


This file puts xpm-0 in internal timing mode: https://github.com/slac-lcls/lcls2/blob/master/psdaq/psdaq/cnf/internal-neh-base.cnf.  Note that in internal timing mode the L0Delay (per-readout-group) seems to default to 90.  Fix it with "pvput DAQ:NEH:XPM:0:PART:0:L0Delay 80".
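
Since L0Delay is per readout group, the fix may need repeating for each group in use. The sketch below just prints the pvput command lines and executes nothing. It assumes that the PART:<n> index in the PV name corresponds to the readout group number, which is inferred from the PV name above and not confirmed; the group list in the example is hypothetical.

```python
def l0delay_commands(groups, xpm_prefix="DAQ:NEH:XPM:0", delay=80):
    """Build one pvput command line per readout group (assumed to map to PART:<n>)."""
    return [f"pvput {xpm_prefix}:PART:{group}:L0Delay {delay}" for group in groups]

# Hypothetical example: readout groups 0 and 1 are in use.
for command in l0delay_commands([0, 1]):
    print(command)
```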

One should switch back to external mode by setting CuInput to 0 in the xpmpva CuTiming tab.  You still need to switch to the external-timing cnf file after this is done.  Check that the FiducialErr box is not checked (try ClearErr to see if it fixes it).  If this doesn't clear, it can be a sign that ACR has put in the "wrong divisor" on their end.
