PENDING important improvement: Julian has timing-link reset fixes (for all detectors that receive timing). We need to update all of that firmware everywhere. Note: the VHDL interface has changed (in lcls_timing_core), so it is more work (the register map is the same). This could solve many of the problems below. List of firmware: tdet, wave8, camlink, epixhr, epixm, epixuhr, hsd, hrencoder, xpm + varieties, tpr (others?). The data systems group should deploy the tdet firmware everywhere in production as a first-pass test. If that is successful, then we do everything else. (A non-production firmware build from Julian can be found here: /cds/home/j/jumdz/mcs/DrpTDet-0x04000400-20240413131956-jumdz-dirty)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff2, size 0, source 0 (andor_norm_0)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff2, size 0, source 1 (andor_dir_0)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff3, size 0, source 2 (manta_0)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff3, size 0, source 3 (mono_encoder_0)
(ps-4.6.3) drp-srcf-cmp004:software$ cat /proc/datadev_0
-------------- Axi Version ----------------
Firmware Version : 0xffffffff
ScratchPad : 0xffffffff
Up Time Count : 4294967295
Device ID : 0xffffffff
Git Hash : ffffffffffffffffffffffffffffffffffffffff
DNA Value : 0xffffffffffffffffffffffffffffffff
Build String : (garbled)
-------------- General HW -----------------
Int Req Count : 4294967295
Hw Dma Wr Index : 4294967295
Sw Dma Wr Index : 3136
Hw Dma Rd Index : 4294967295
Sw Dma Rd Index : 323
Missed Wr Requests : 4294967295
Missed IRQ Count : 27819533
Continue Count : 0
Address Count : 4096
Hw Write Buff Count : 4095
Hw Read Buff Count : 0
Cache Config : 0xffffffff
Desc 128 En : 1
Enable Ver : 0xffffffff
Driver Load Count : 255
IRQ Hold : 4294967295
BG Enable : 0x0
-------------- General --------------------
Dma Version : 0x6
Git Version : 5.17.3
-------------- Read Buffers ---------------
Buffer Count : 1048572
Buffer Size : 8192
Buffer Mode : 2
Buffers In User : 0
Buffers In Hw : 4095
Buffers In Pre-Hw Q : 1044477
Buffers In Rx Queue : 0
Missing Buffers : 0
Min Buffer Use : 2
Max Buffer Use : 227890
Avg Buffer Use : 1116
Tot Buffer Use : 1170295872
-------------- Write Buffers ---------------
Buffer Count : 16
Buffer Size : 8192
Buffer Mode : 2
Buffers In User : 0
Buffers In Hw : 0
Buffers In Pre-Hw Q : 16
Buffers In Sw Queue : 0
Missing Buffers : 0
Min Buffer Use : 5141
Max Buffer Use : 5142
Avg Buffer Use : 5141
Tot Buffer Use : 82259

drp-srcf-cmp002:~$ cat /proc/datadev_0 shows the same all-0xffffffff Axi Version block (Firmware Version, ScratchPad, Up Time Count, Device ID, Git Hash, DNA Value all saturated; Build String garbled); the capture is truncated there.
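The all-ones readback above is the classic sign that the firmware register space is not readable over PCIe. A hypothetical triage helper (not part of any production tool) that scans a /proc/datadev_N dump and flags saturated registers, a sketch only:

```python
import re

def find_saturated(proc_text):
    """Scan a /proc/datadev_N dump and return register names whose
    values read back as all-ones (0xffffffff, 4294967295, or an
    all-'f' hex string) -- a typical sign that the firmware register
    space is not readable over PCIe."""
    bad = []
    for line in proc_text.splitlines():
        m = re.match(r"\s*(.+?)\s*:\s*(\S+)", line)
        if not m:
            continue  # separator lines have no "name : value" pair
        name, value = m.group(1), m.group(2).lower()
        v = value[2:] if value.startswith("0x") else value
        if value == "4294967295" or (v and set(v) == {"f"}):
            bad.append(name)
    return bad
```

Running it over a dump like the one above would list Firmware Version, Git Hash, DNA Value, etc., while leaving healthy counters (e.g. Sw Dma Wr Index) alone.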
XPM RxDspErrs (rates): xpmpva shows XPM5 looks mostly healthy except for RxLinkUp; a TxLinkReset was needed to restore RxLinkUp (on 2024/04/11 ~09:15).

These are the results of the tests conducted in the FEE alcove to determine whether the XPM glitch can be reproduced.
Every test is run from a starting behavior where the DAQ can allocate, configure, run, and disable.
Whenever the DAQ does not follow the starting behavior, remedies are applied to recover it.
xpm10 and xpm11 connections
XPM firmware 3.5.4
opal_config.py has the xpm mini → timing2 hack
action | result | remedy | outcome |
---|---|---|---|
Remove XPM10 fiber timing in the back while DAQ running | *** XpmDetector: timing link ID is ffffffff = 4294967295 | TxLinkReset of cmp015 in XPM11 | DAQ recovers |
Repeat XPM10 fiber timing removal | DAQ cannot disable | --- | DAQ recovers by itself at restart |
Repeat XPM10 fiber timing removal | --- | --- | no issue |
Repeat XPM10 fiber timing removal | DAQ cannot disable | --- | DAQ recovers by itself at restart |
Remove XPM10 fiber timing in the back while DAQ stopped | --- | --- | DAQ starts with no issue |
Repeat XPM10 fiber timing removal while DAQ stopped | --- | --- | DAQ starts with no issue |
Remove transceiver from XPM10 in the back (DAQ stopped) | --- | --- | DAQ starts with no issue |
Remove transceiver from XPM10 in the back (DAQ started) | --- | --- | DAQ starts with no issue |
timing 1 shuts down by itself | --- | TxLinkReset on XPM10 for XPM11 | DAQ recovers |
Remove fiber on XPM10 to XPM11 | --- | --- | DAQ starts with no issue |
Remove transceiver on XPM10 to XPM11 | --- | --- | DAQ starts with no issue |
Remove fiber on XPM11 AMC0 port 0 | --- | --- | DAQ starts with no issue |
Remove transceiver on XPM11 AMC0 port 0 | --- | --- | DAQ starts with no issue |
opal disappears from the list of detectors | --- | restart DAQ | DAQ starts with no issue |
power cycle xpm10 via switch (AMC0 only) | XPM11 loses timing node | restart pyxpm 10 and 11, then restart pyxpm 11 | DAQ restarts but opal shuts down |
opal still shut down | --- | devGui xpmmini timing v2 | to no avail |
stop pyxpm 10 and 11 | --- | --- | DAQ starts with no issue |
action | issue found | error stat | remedy |
---|---|---|---|
stop pyxpm 10 and 11 | no issue has been detected | 0/10 | --- |
stop pyxpm 10 and 11 | at first xpmpva DAQ:NEH:XPM:11 does not come up, then Opal shuts down | 3/20 | stop pyxpm 10 and 11 |
Perhaps fixed by Matt in a later firmware version?
XPM11 glitches between 4 pm and after 6 pm, and also around 10:10 am the next day.
xpm11 to xpm_noRTM-0x030601000-20231011111938-weaver-645bee8.mcs
xpm10 to xpm-0x030601000-20231011111954-weaver-645bee8.mcs
XPM firmware 3.6.0 (?3.6.1?)
Opal config does not have the xpm mini → timing2 hack
cnf file uses -D fakecam for additional timing nodes
action | issue found | stat | remedy |
---|---|---|---|
stop pyxpm 10 and 11 | --- | 9/20 | --- |
stop pyxpm 10 and 11 | Opal fails in configuration | 5/20 | reboot timing nodes |
stop pyxpm 10 and 11 | groupca and xpmpva are shut down at startup | 4/20 | ctrl-x in the terminal successfully restarts them |
example of the timing shift in the timing nodes (before -D fakecam).
action | issue found | stat | remedy |
---|---|---|---|
stop pyxpm 10 and 11; fru-deactivate and activate xpm11; restart pyxpm 10 and 11; start DAQ | bucket issue | 1/10 | reboot timing node cmp001 |
reboot timing node cmp001 | no issue | 0/5 | --- |
remove the xpm10-to-xpm11 fiber 10 times | no issue | 0/10 | --- |
remove the xpm10-to-timing1 fiber 10 times for 5 seconds each (AMC1 port 0) | no issue | 0/10 | --- |
remove the xpm11-to-opal fiber 10 times for 5 seconds each (AMC1 port 1) | no issue | 0/10 | --- |
New opal_config.py: remove the sleep while requesting mini/v2; instead, introduce a check for RxId with a timeout of 10 repeats.
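A minimal sketch of that RxId polling (names are hypothetical; the real opal_config.py reads the register through the device's own API):

```python
import time

def wait_for_rxid(read_rxid, retries=10, delay=1.0, sleep=time.sleep):
    """Poll the timing receiver's RxId until it reads something other
    than the all-ones 'no link' value, instead of using a fixed sleep.
    `read_rxid` is a callable returning the current RxId register value.
    Returns the RxId on success; raises TimeoutError after `retries` polls."""
    for _ in range(retries):
        rxid = read_rxid()
        if rxid != 0xFFFFFFFF:
            return rxid
        sleep(delay)
    raise TimeoutError(f"RxId still 0xffffffff after {retries} polls")
```

This is the same idea as the later theory that ConfigLclsTimingV2 should poll on rxid != 0xffffffff rather than trust a delay.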
action | issue found | stat | remedy |
---|---|---|---|
power cycle she-fee-daq01/2 10 times | bucket issues | 3/10 | power cycling xpm10 (TxLinkReset didn't fix it) |
RTM disconnected for increasing durations: 5 min, 10 min, 40 min, 2 hours | bucket issue (2 hours) | 1/4 | power cycle of xpm10 (TxLinkReset didn't fix it) |
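The stat columns in these tables record failures/trials. A small convenience helper (my addition, not an existing tool) to pool repeated runs into an overall failure fraction:

```python
def combined_rate(stats):
    """Pool several 'k/n' stat entries (e.g. '3/10', '1/4' from the
    test tables) into total failures, total trials, and the pooled
    failure fraction."""
    fails = trials = 0
    for s in stats:
        k, n = (int(x) for x in s.split("/"))
        fails += k
        trials += n
    return fails, trials, fails / trials
```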
from drp-neh-ctl002
~weaver/FirmwareLoader/rhel6/FirmwareLoader -a 10.0.5.104 /cds/home/w/weaver/mcs/xpm//xpm_noRTM-0x03050400-20230409095511-weaver-dirty.mcs
~weaver/FirmwareLoader/rhel6/FirmwareLoader -a 10.0.5.102 /cds/home/w/weaver/mcs/xpm//xpm-0x03050400-20230419122542-weaver-c6987c4.mcs
Then fru-restart xpm 10 and xpm 11 from psdev in sequence, not together.
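Since flashing and fru-restart must happen one XPM at a time, a small hypothetical helper to compose the FirmwareLoader invocations in order (loader path and `-a address file` arguments are the ones used above; running them is left to subprocess):

```python
FWLOADER = "~weaver/FirmwareLoader/rhel6/FirmwareLoader"

def load_commands(targets):
    """Compose one FirmwareLoader argv per (ip_address, mcs_path) pair,
    preserving order so each XPM is flashed (and later fru-restarted)
    in sequence, never together."""
    return [[FWLOADER, "-a", ip, mcs] for ip, mcs in targets]
```

Each command would be run with subprocess.run(...) and allowed to finish before the next one starts.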
First light presents a problem with xpmpva and groupca:
- xpmpva XPM11 does not work; fru-restart brings it back alive.
- OPAL does not respond to roll call; a TxLinkReset of XPM11 in XPM10 brings OPAL back to life.
action | issue found | stat | remedy |
---|---|---|---|
remove fiber in RTM and restart DAQ | TxLinkReset timing1 (on rollcall); TxLinkReset timing1,2,3 (on alloc) | 2/10 | TxLinkReset solved rollcall; TxLinkReset and RxLinkReset on xpm10,11 timing1,2,3,4 and opal |
just restart the DAQ | Opal RxId issue; RxLinkReset on timing4 (on rollcall); RxId issue on Opal connect | 3/10 | TxLinkReset |
Observing the logs, there are several instances of:
21_10:02:16_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3245855222 3245855222 resetting. Iteration: 1
21_10:02:16_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3249109743 3249109743 resetting. Iteration: 2
21_10:11:36_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3561772053 3561772053 resetting. Iteration: 1
21_10:11:36_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3565026528 3565026528 resetting. Iteration: 2
21_10:11:36_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3568281227 3568281227 resetting. Iteration: 3
21_10:11:36_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3571538383 3571538383 resetting. Iteration: 4
21_10:36:38_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 0 0 resetting. Iteration: 1
21_10:36:38_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3255210 3255210 resetting. Iteration: 2
21_11:34:43_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 94190235 94190235 resetting. Iteration: 1
21_11:34:43_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 97444648 97444648 resetting. Iteration: 2
21_11:34:43_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 100699006 100699006 resetting. Iteration: 3
21_11:34:43_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 103957466 103957466 resetting. Iteration: 4
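The Iteration counter in these lines restarts at 1 for each new reset episode, so the excerpt above contains four episodes of 2, 4, 2, and 4 resets. A small hypothetical parser (my sketch, matching the log format shown) to tally episodes:

```python
import re

# One "Timing link stuck" log line: timestamp_host:file:message
STUCK = re.compile(
    r"(\d+_\d\d:\d\d:\d\d)_([^:]+):([^:]+):"
    r"\*\*\* Timing link stuck: (\d+) (\d+) resetting\. Iteration: (\d+)")

def stuck_episodes(lines):
    """Group '*** Timing link stuck' log lines into reset episodes:
    a new episode starts whenever Iteration drops back to 1.
    Returns a list of (timestamp, n_iterations) pairs."""
    episodes = []
    for line in lines:
        m = STUCK.match(line)
        if not m:
            continue
        ts, iteration = m.group(1), int(m.group(6))
        if iteration == 1:
            episodes.append([ts, 1])
        elif episodes:
            episodes[-1][1] = iteration  # iterations arrive in order
    return [tuple(e) for e in episodes]
```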
switching XPM firmware back to 3.6.1
~weaver/FirmwareLoader/rhel6/FirmwareLoader -a 10.0.5.102 /cds/home/w/weaver/mcs/xpm/xpm-0x030601000-20231011111954-weaver-645bee8.mcs
~weaver/FirmwareLoader/rhel6/FirmwareLoader -a 10.0.5.104 /cds/home/w/weaver/mcs/xpm/xpm_noRTM-0x030601000-20231011111938-weaver-645bee8.mcs
action | issue found | stat | remedy |
---|---|---|---|
startup DAQ | none | 0/10 | --- |
no instances of "*** Timing link stuck" in the logs
We powered on the tixel computer (the equivalent of cmp005) with the fiber unplugged; then we plugged in the fiber and it didn't lock until we did xpmmini→lcls2. It appears that yanking the timing fiber can cause disturbances in the system, but they are not repeatable 100% of the time. XPM power spikes can set the DAQ into behavior similar to the XPM glitch, but only if pyxpms are running. To be repeated.
Upgrading the XPM firmware (from 3.5.4 to 3.6.0) seems to have mitigated all the issues. The bucket issue becomes more prominent, probably because the other issues are no longer happening; it appears when power cycling xpm11. Also, the xpmmini issue can appear when connecting already powered-up nodes.
Testing a double offence: rebooting a node with camlink without the fiber connected, then connecting the fiber afterwards.
action | issue found | stat | remedy |
---|---|---|---|
reboot cmp005 with the timing fiber disconnected from the xpm, then connect the fiber when the cmp is back online | none | 0/5 | xpmpva does not see the opal until the DAQ is booted up; no issues |
Nov. 16, 23 with mona, dan, weaver, caf, claus, melchior, cpo
proposal:
- move ric/mona/christos to xpm10 (for the future)
- give riccardo the whole system for the day and he messes with xpm10
- add startupMode=1 kwarg to opal
new xpm firmware (leaving xpm10 alone, no xpmmini->lcls2 hack):
riccardo can't reproduce the errors, except for bucket skipping
(txlinkreset fixed it for matt, but not riccardo and ric)
old xpm firmware (also messing with xpm10 with xpmmini->lcls2): riccardo could reproduce
xpm link glitch and txlinkreset (once) and (likely) xpmmini issue
theories:
- maybe ConfigLclsTimingV2 isn't reliable (should perhaps poll
on something like rxid!=0xffffffff)
- either new xpm firmware makes things better
- or we need to mess with xpm10 to reproduce problems
- or we're unlucky and can't reproduce (or we're not doing the right
things to reproduce)
- might need a minimum length of time to tickle the issues (matt says
try 30 minutes to 1 hour)
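Since the issues may need a minimum soak time to appear (30 minutes to 1 hour per Matt), a sketch of a soak-test driver; the health check, clock, and sleep are injected so it can be dry-run, and all names are hypothetical:

```python
def soak(check, duration_s, interval_s, now, sleep):
    """Repeatedly run `check()` (returning True when healthy) for
    `duration_s` seconds (e.g. 1800-3600 s as suggested), sampling
    every `interval_s` seconds. `now` and `sleep` are injected
    callables so the loop is testable without real waiting.
    Returns (n_checks, n_failures)."""
    end = now() + duration_s
    checks = failures = 0
    while now() < end:
        checks += 1
        if not check():
            failures += 1
        sleep(interval_s)
    return checks, failures
```

In real use `check` would be something like "RxId reads non-0xffffffff and no new 'Timing link stuck' lines appeared", with `now=time.monotonic` and `sleep=time.sleep`.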
matt has an idea for bucket-jumps. could direct julian.
(from mtg on Nov. 27, 2023)
(Julian, matt, Riccardo, cpo)
To do: