A useful diagram of the pieces of a DRP node that should be monitored via Prometheus is here: https://docs.google.com/presentation/d/1LvvUsV4A1F-7cao4t6mzr8z2qFFZEIwLKR4pXazU3HE/edit?usp=sharing

Updating TPR Firmware

See: SLAC EVR and TPR

sudo /reg/g/pcds/package/slac_evr_upgrade/UPGRADE.sh /dev/tpra /reg/g/pcds/package/slaclab/evr-card-g2/images/latest-tpr

Generic

"clearreadout" and "clear"
reset the Rx and Tx links (in that order!) for the XPM front-panel. Note: we have learned that an RxLink reset can cause link CRC errors (see below) and that a TxLink reset is needed to fix them, so the order is important. The TxLink reset causes the link to retrain using K characters.
look for deadtime
check that the "partition" window (with the trigger-enable checkbox) is talking to the right XPM: look in the XPM window label, which is something like DAQ:LAB2:XPM:N, where N is the XPM number. A symptom of this number being incorrect is that the L0InpRate/L0AccRate remain at zeros when triggers are enabled. This number is a unique identifier within a hierarchy of XPMs.
XPM is not configured to forward triggers ("LinkEnable" for that link on the XPM GUI)
L0Delay set to 99
DST Select (in PART window) set to "DontCare" (could be Dontcare/Internal)
check processes in lab3-base.cnf are running
run psdaq/build/psdaq/pgp/kcu1500/app/kcuStatus and kcuDmaStatus. In kcuDmaStatus, "blockspause" and "blocksfree" determine whether or not deadtime is set: if blocksfree drops below blockspause, deadtime is asserted. In the hsd window, "pgp last rx opcode" 0 means no backpressure and 1 means backpressure. Watch for a non-zero locPause, which causes deadtime.
check for multiple drp executables
clearReadout broadcasts a message to the receiving KCUs telling them to reset their timing-header FIFOs.
if running "drp" executable, check that lane mask is correct
if events are showing up "sporadically" look for CRC errors from "kcuSim -s -d /dev/datadev_0". We have seen this caused by doing an XPM RxLink reset without a later TxLink reset.
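
The two deadtime conditions from the kcuStatus/kcuDmaStatus item above can be sketched as a small check. This is a minimal illustration in Python, not part of psdaq: the function name and the sample numbers are hypothetical, and the values are assumed to have been read by hand from the tool output.

```python
def deadtime_reasons(blocksfree, blockspause, loc_pause=0):
    """List the conditions from the notes above that would cause deadtime.

    All three arguments are assumed to be copied by hand from the
    kcuDmaStatus / kcuStatus output; this helper does not run those tools.
    """
    reasons = []
    # Deadtime is asserted when the free blocks drop below the pause threshold.
    if blocksfree < blockspause:
        reasons.append("blocksfree below blockspause")
    # A non-zero locPause (low-level pgp FIFO full) also causes deadtime.
    if loc_pause != 0:
        reasons.append("locPause non-zero")
    return reasons


# Hypothetical sample values, not real tool output.
print(deadtime_reasons(blocksfree=100, blockspause=128))
print(deadtime_reasons(blocksfree=500, blockspause=128, loc_pause=1))
```

An empty list means neither condition is present, so the deadtime source should be looked for elsewhere in the checklist.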

For the pgp driver, this parameter needs to be increased in /etc/sysctl.conf:

[root@drp-neh-cmp005 cpo]# grep vm /etc/sysctl.conf
vm.max_map_count=1000000
[root@drp-neh-cmp005 cpo]#
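
The same check can be scripted. The following is a minimal Python sketch that extracts vm.max_map_count from sysctl.conf-style text. The helper name is hypothetical, and treating 1000000 as the required value is an assumption taken from the example output above; the notes only say the parameter needs to be increased.

```python
def max_map_count_from_conf(conf_text):
    """Return the last vm.max_map_count value set in sysctl.conf-style text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments ('#' or ';' in sysctl.conf).
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "vm.max_map_count":
            value = int(val.strip())
    return value


# Sample text mirroring the grep output above; on a real node one would
# pass the contents of /etc/sysctl.conf instead.
sample = "# pgp driver\nvm.max_map_count=1000000\n"
ASSUMED_MINIMUM = 1000000  # assumption: value shown in the example output
count = max_map_count_from_conf(sample)
print(count is not None and count >= ASSUMED_MINIMUM)
```

Note that this only inspects the configuration file; the value the running kernel is actually using is exposed separately in /proc/sys/vm/max_map_count.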

...

Matt's document showing the location of each hsd in the tmo chassis: https://docs.google.com/document/d/1SzPwrJsoJR0brlQG-mCNILPFh8njXYHGrQz7tl39Thw/edit?usp=sharing.

Supermicro manual for hsd chassis: https://www.supermicro.com/manuals/superserver/4U/MNL-2107.pdf

Supermicro document discussing pcie root complexes for some different systems: https://www.supermicro.com/products/system/4U/4029/PCIe-Root-Architecture.cfm

General Debugging

look at configured parameters using (for example) "hsdpva DAQ:LAB2:HSD:DEV06_3D:A"
for kcu firmware that is built to use both QSFP links, the naming of the QSFPs is swapped, i.e. the QSFP that is normally called /dev/datadev_0 is now called /dev/datadev_1
HSD is not configured to do anything (Check the HSD config tab for no channels enabled)
if hsd timing frames are not being received at 929kHz (status here), click TxLink Reset in XPM window. Typically when this is an issue the receiving rate is ~20kHz.
The HSD readoutGroup number does not match platform number in .cnf file (Check the HSD "Config" tab)
also check that HEADERCNTL0 is incrementing in "Timing" tab of HSD cfg window.
in hsd Timing tab, timpausecnt is the number of clock ticks we are dead (156.25MHz clock). The dead-time fraction is timpausecnt/156.25e6
in hsd expert window "full threshold(events)" sets threshold for hsd deadtime
in hsd Buffer tab "fex free events" and "raw free events" are the current free events.
in hsd status window "write fifo count" is number of timing headers waiting for HSD data to associate.
"readcntsum" on hsd timing tab goes up when we send a transition OR L1Accepts. "trigcntsum" counts L1Accepts only.
"txcntsum" on PGP tab goes up when we send a transition or l1accepts.
check kcuStatus for "locPause" non-zero (a low level pgp FIFO being full). If this happens then: configure hsd, clear readout, reboot drp node with KCU
if links aren't locking in hsdpva, use "kcuStatus" to check that the tx/rx clock frequencies are 156MHz. If not (we have seen lower rates like 135MHz), a node power cycle (to reload the KCU FPGA) can fix this. Matt writes: "kcuStatus should have an option to reset the clock to its factory input before attempting to program it to the standard input." There appears to be a "kcuStatus -R" option which, according to "kcuStatus -h", should reset the clock to 156MHz, but cpo tried this twice and the clock still seems to be stuck at 131MHz.
If the drp doesn't complete rollcall and the log file shows messages about PADDR_U being zero, restarting the corresponding hsdioc process may help.
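
The dead-time arithmetic from the timpausecnt item above can be written out as a short Python sketch. The function name and the interval_s parameter are hypothetical. The formula in the notes divides by 156.25e6, which implies the count is accumulated over one second; that is treated here as an assumption and made explicit as the default.

```python
CLOCK_HZ = 156.25e6  # clock rate quoted in the notes above


def deadtime_fraction(timpausecnt, interval_s=1.0):
    """Fraction of clock ticks spent dead during interval_s seconds.

    Assumes timpausecnt was accumulated over interval_s seconds; with the
    default of one second this reduces to timpausecnt / 156.25e6.
    """
    return timpausecnt / (CLOCK_HZ * interval_s)


# Hypothetical reading: 15,625,000 paused ticks in one second.
print(deadtime_fraction(15_625_000))
```

A result near 0 means the hsd is rarely pausing, while a result near 1 means it is dead for almost the whole interval.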

...
