...

A useful diagram of the pieces of a DRP node that should be monitored via Prometheus is here: https://docs.google.com/presentation/d/1LvvUsV4A1F-7cao4t6mzr8z2qFFZEIwLKR4pXazU3HE/edit?usp=sharing

Updating TPR Firmware

See: SLAC EVR and TPR

Code Block
sudo /reg/g/pcds/package/slac_evr_upgrade/UPGRADE.sh /dev/tpra /reg/g/pcds/package/slaclab/evr-card-g2/images/latest-tpr

Generic

  • "clearreadout" and "clear"
  • reset the RxLink and then the TxLink (in that order!) for the XPM front panel. The order matters: we have learned that an RxLink reset can cause link CRC errors (see below), and a TxLink reset is needed afterwards to fix them. The TxLink reset causes the link to retrain using K characters.
  • look for deadtime
  • check that the "partition" window (with the trigger-enable checkbox) is talking to the right XPM: look at the XPM window label, which is something like DAQ:LAB2:XPM:N, where N is the XPM number. A symptom of this number being incorrect is that L0InpRate/L0AccRate remain at zero when triggers are enabled. This number is a unique identifier within a hierarchy of XPMs.
  • XPM is not configured to forward triggers ("LinkEnable" for that link on the XPM GUI)
  • L0Delay set to 99
  • DST Select (in PART window) set to "DontCare" (could be Dontcare/Internal)
  • check processes in lab3-base.cnf are running
  • run psdaq/build/psdaq/pgp/kcu1500/app/kcuStatus and kcuDmaStatus. In kcuDmaStatus, "blockspause" and "blocksfree" determine whether or not deadtime is set: if blocksfree drops below blockspause, deadtime is asserted. In the hsd window, a "pgp last rx opcode" of 0 means no backpressure and 1 means backpressure. Watch for a non-zero locPause, which causes deadtime.
  • check for multiple drp executables
  • clearReadout broadcasts a message to the receiving KCUs telling them to reset their timing-header FIFOs.
  • if running "drp" executable, check that lane mask is correct
  • if events are showing up "sporadically" look for CRC errors from "kcuSim -s -d /dev/datadev_0".  We have seen this caused by doing an XPM RxLink reset without a later TxLink reset.
  • for the pgp driver, this parameter needs to be increased in /etc/sysctl.conf:

    Code Block
    [root@drp-neh-cmp005 cpo]# grep vm /etc/sysctl.conf 
    vm.max_map_count=1000000
    [root@drp-neh-cmp005 cpo]# 
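
As a rough aid for the sysctl check above, here is a minimal shell sketch that compares a vm.max_map_count value against the 1000000 shown on drp-neh-cmp005. The function name check_max_map_count is hypothetical (it is not part of psdaq), and the 1000000 default is simply the value from the listing above, not a documented minimum for the pgp driver.

```shell
#!/bin/sh
# Hypothetical helper: compare a vm.max_map_count value against a required value.
# The default of 1000000 is taken from the listing above (assumption: this is
# large enough for the pgp driver; it is not a documented minimum).
check_max_map_count() {
    current="$1"
    required="${2:-1000000}"
    if [ "$current" -ge "$required" ]; then
        echo "ok: vm.max_map_count=$current"
    else
        echo "too low: vm.max_map_count=$current (want >= $required)"
        return 1
    fi
}

# Example with a typical kernel default; reports "too low".
check_max_map_count 65530 || true
```

On a node this could be called as `check_max_map_count "$(sysctl -n vm.max_map_count)"` to test the live kernel value. After editing /etc/sysctl.conf, `sudo sysctl -p` reloads the file without a reboot.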
    
    

...

Matt's document showing the location of each hsd in the tmo chassis: https://docs.google.com/document/d/1SzPwrJsoJR0brlQG-mCNILPFh8njXYHGrQz7tl39Thw/edit?usp=sharing

Supermicro manual for hsd chassis: https://www.supermicro.com/manuals/superserver/4U/MNL-2107.pdf

Supermicro document discussing pcie root complexes for some different systems: https://www.supermicro.com/products/system/4U/4029/PCIe-Root-Architecture.cfm

General Debugging

  • look at configured parameters using (for example) "hsdpva DAQ:LAB2:HSD:DEV06_3D:A"
  • for KCU firmware that is built to use both QSFP links, the naming of the QSFPs is swapped, i.e. the QSFP that is normally called /dev/datadev_0 is now called /dev/datadev_1
  • HSD is not configured to do anything (Check the HSD config tab for no channels enabled)
  • if hsd timing frames are not being received at 929kHz (status here), click TxLink Reset in XPM window.  Typically when this is an issue the receiving rate is ~20kHz.
  • The HSD readoutGroup number does not match platform number in .cnf file (Check the HSD "Config" tab)
  • also check that HEADERCNTL0 is incrementing in "Timing" tab of HSD cfg window.
  • in the hsd Timing tab, timpausecnt is the number of 156.25MHz clock ticks during which we are dead. The dead-time fraction is timpausecnt/156.25e6
  • in hsd expert window "full threshold(events)" sets threshold for hsd deadtime
  • in hsd Buffer tab "fex free events" and "raw free events" are the current free events.
  • in the hsd status window, "write fifo count" is the number of timing headers waiting for HSD data to associate with.
  • "readcntsum" on the hsd Timing tab goes up when we send a transition or an L1Accept. "trigcntsum" counts L1Accepts only.
  • "txcntsum" on the PGP tab goes up when we send a transition or an L1Accept.
  • check kcuStatus for "locPause" non-zero (a low level pgp FIFO being full).  If this happens then:  configure hsd, clear readout, reboot drp node with KCU
  • if links aren't locking in hsdpva use "kcuStatus" to check that the tx/rx clock frequencies are 156MHz.  If not (we have seen lower rates like 135MHz) a node power cycle (to reload the KCU FPGA) can fix this.  Matt writes: "kcuStatus should have an option to reset the clock to its factory input before attempting to program it to the standard input."  It looks like there is a "kcuStatus -R" which "kcuStatus -h" says should reset the clock to 156MHz, but cpo tried this twice and it seems to be stuck at 131MHz still.
  • If the drp doesn't complete rollcall and the log file shows messages about PADDR_U being zero, restarting the corresponding hsdioc process may help.
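
The dead-time arithmetic for timpausecnt in the list above can be sketched as a small shell helper. This assumes timpausecnt is read as a count accumulated over one second, which is what dividing by 156.25e6 implies. The function name deadtime_fraction is hypothetical, and the count has to be read off the hsd Timing tab by hand.

```shell
#!/bin/sh
# Hypothetical helper: convert a timpausecnt reading into a dead-time fraction.
# Assumption: the count was accumulated over one second of the 156.25 MHz clock.
deadtime_fraction() {
    awk -v n="$1" 'BEGIN { printf "%.4f\n", n / 156.25e6 }'
}

# 15625000 dead ticks out of 156250000 per second is 10% dead time.
deadtime_fraction 15625000   # prints 0.1000
```

A result near 0 means the hsd is essentially live; a result near 1 means it is dead for almost the whole second.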

...