To-Do List

PENDING important improvement: Julian has timing link reset fixes (for all detectors that receive timing). We need to update that firmware everywhere. Note: the VHDL interface has changed (in lcls_timing_core), so it is more work, although the register map is the same. Could this solve many of the problems below? List of firmware: tdet, wave8, camlink, epixhr, epixm, epixuhr, hsd, hrencoder, xpm+varieties, tpr (others?). The data systems group should deploy the tdet firmware everywhere in production as a first-pass test. If that is successful, then we do everything else. (A non-production firmware build from Julian can be found here: /cds/home/j/jumdz/mcs/DrpTDet-0x04000400-20240413131956-jumdz-dirty)

rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff2, size 0, source 0 (andor_norm_0)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff2, size 0, source 1 (andor_dir_0)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff3, size 0, source 2 (manta_0)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff3, size 0, source 3 (mono_encoder_0)

Summary Of Testing

These are the results of the tests conducted in the FEE alcove to determine whether the XPM glitch can be reproduced.
Every test starts from a baseline in which the DAQ can allocate, configure, run, and disable.
Whenever the DAQ deviates from that baseline, remedies are applied to recover it.


xpm10 and 11 connections



XPM schematics

Testing Details

In 2023/10/24:

XPM firmware 3.5.4
Opal_config.py has xpm mini – timing2 hack

| action | result | remedy | result |
| Remove XPM10 timing fiber in the back while DAQ running | *** XpmDetector: timing link ID is ffffffff = 4294967295; Timing 1 shuts down | TxLinkReset of cmp015 in XPM11 | DAQ recovers |
| Repeat XPM10 timing fiber removal | DAQ cannot disable | --- | DAQ recovers by itself at restart |
| Repeat XPM10 timing fiber removal | --- | --- | no issue |
| Repeat XPM10 timing fiber removal | DAQ cannot disable | --- | DAQ recovers by itself at restart |
| Remove XPM10 timing fiber in the back while DAQ stopped | --- | --- | DAQ starts with no issue |
| Repeat XPM10 timing fiber removal while DAQ stopped | --- | --- | DAQ starts with no issue |
| Remove transceiver from XPM10 in the back (DAQ stopped) | --- | --- | DAQ starts with no issue |
| Remove transceiver from XPM10 in the back (DAQ started) | --- | --- | DAQ starts with no issue |
|  | timing 1 shuts down by itself | TxLinkReset on XPM10 for XPM11 | DAQ recovers |
| Remove fiber on XPM10 to XPM11 | --- | --- | DAQ starts with no issue |
| Remove transceiver on XPM10 to XPM11 | --- | --- | DAQ starts with no issue |
| Remove fiber on XPM11 AMC0 port 0 | --- | --- | DAQ starts with no issue |
| Remove transceiver on XPM11 AMC0 port 0 | --- | --- | DAQ starts with no issue |
|  | opal disappears from the list of detectors | restart DAQ | DAQ starts with no issue |
| power cycle xpm10 via switch only AMC0 | XPM 11 loses timing node; Opal not in the list of detectors | Restart pyxpm 10 and 11; Power cycle xpm 11 with handles; fru-deactivate xpm11 (3 times); fru-deactivate xpm10; restart pyxpm 11 | DAQ restarts but opal shuts down |
|  | opal still shut down | devGui xpmmini timing v2; TxLinkReset; Opal still not back BadDetector Paddr; Xpmpva died xpm11 | no avail |
|  |  | Stop pyxpm 10 and 11; fru-deactivate 10 and 11; start pyxpm 10 and 11 | DAQ starts with no issue |


In 2023/10/26:

| action | issue found | error stat | remedy |
| stop pyxpm 10 and 11; fru-deactivate and activate xpm 11; restart pyxpm 10 and 11; start DAQ | no issue has been detected | 0/10 | --- |
| stop pyxpm 10 and 11; fru-deactivate and activate xpm 10; restart pyxpm 10 and 11; start DAQ | at first xpmpva DAQ:NEH:XPM:11 does not come up; then Opal shuts down | 3/20 | stop pyxpm 10 and 11; fru deactivate activate 10; fru deactivate activate 11 (in order); devGui switch xpm mini/timing2; if needed restart opal from terminal |

Observation Of Front-Panel XPM Link Glitch With Version 3.5.4

Perhaps fixed by Matt in later firmware version?

XPM11 glitches between 4pm and after 6pm and also around 10:10 am the next day

2023/10/27: updating firmware

xpm11  to xpm_noRTM-0x030601000-20231011111938-weaver-645bee8.mcs
xpm10 to xpm-0x030601000-20231011111954-weaver-645bee8.mcs


XPM firmware 3.6.0 (?3.6.1?)
Opal config does not have xpm mini -timing2 hack
cnf file uses -D fakecam for additional timing nodes

In 2023/10/30:

| action | issue found | stat | remedy |
| stop pyxpm 10 and 11; fru-deactivate and activate xpm 10; restart pyxpm 10 and 11; start DAQ |  | 9/20 | --- |
| stop pyxpm 10 and 11; fru-deactivate and activate xpm 10; restart pyxpm 10 and 11; start DAQ | Opal fails in configuration | 5/20 | reboot timing nodes |
| stop pyxpm 10 and 11; fru-deactivate and activate xpm 10; restart pyxpm 10 and 11; start DAQ | groupca and xpmpva are shut down at startup | 4/20 | ctrl-x in the terminal successfully restarts them |


example of the timing shift in the timing nodes (before -D fakecam).

In 2023/11/14:

| action | issue found | stat | remedy |
| stop pyxpm 10 and 11; fru-deactivate and activate xpm 11; restart pyxpm 10 and 11; start DAQ | bucket issue | 1/10 | rebooting timing node cmp001 |
| rebooting timing node cmp001 | no issue | 0/5 |  |
| Removing fiber from xpm10 to xpm11 10 times for 5 seconds (amc0 port1) | no issue | 0/10 |  |
| Removing fiber from xpm10 to timing 1 10 times for 5 seconds (Amc1 port0) | no issue | 0/10 |  |
| Removing fiber from xpm11 to opal 10 times for 5 seconds (Amc1 port1) | no issue | 0/10 |  |

In 2023/11/17:

New opal_config.py: removed the sleep while requesting mini/v2; instead, RxId is checked, giving up after 10 retries.
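
The RxId check that replaces the sleep can be sketched roughly as follows. This is a minimal illustration, not the code in opal_config.py: `read_rxid` is a hypothetical callable standing in for whatever register read the real code performs, the 0xffffffff sentinel is taken from the "timing link ID is ffffffff" message seen in the 2023/10/24 tests, and the delay between retries is an assumption.

```python
import time

# Value reported when the timing link ID is invalid (from the XpmDetector log message).
UNLOCKED_RXID = 0xFFFFFFFF


def wait_for_rxid(read_rxid, max_tries=10, delay_s=0.1):
    """Poll read_rxid() until it returns something other than UNLOCKED_RXID.

    read_rxid is a hypothetical zero-argument callable; the real register
    access is not shown here. Returns the RxId, or None if the link did not
    report a valid ID within max_tries reads.
    """
    for _ in range(max_tries):
        rxid = read_rxid()
        if rxid != UNLOCKED_RXID:
            return rxid
        time.sleep(delay_s)  # assumed pause between retries
    return None
```

Bounding the number of reads keeps configure from hanging forever when the link never locks, while still returning as soon as a valid ID appears.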

| action | issue found | stat | remedy |
| test power cycle she-fee-daq01/2 10 times | bucket issues | 3/10 | power cycling the xpm10 (txlinkreset didn't fix) |
| RTM disconnected for increasing durations: 5 min, 10 min, 40 min, 2 hours | bucket issue (2 hours) | 1/4 | power cycle of xpm10 (txlinkreset didn't fix) |


In 2023/11/21: switching XPM firmware

from drp-neh-ctl002

~weaver/FirmwareLoader/rhel6/FirmwareLoader -a 10.0.5.104 /cds/home/w/weaver/mcs/xpm//xpm_noRTM-0x03050400-20230409095511-weaver-dirty.mcs
~weaver/FirmwareLoader/rhel6/FirmwareLoader -a 10.0.5.102 /cds/home/w/weaver/mcs/xpm//xpm-0x03050400-20230419122542-weaver-c6987c4.mcs

then fru-restart from psdev xpm 10 and xpm 11 in sequence, not together.

first light presents a problem with XPMPVA and GROUPCA

xpmpva XPM11 does not work
fru-restart brings it back alive
OPAL does not respond to roll call
TXLINKRESET XPM11 in XPM10 brings OPAL back to life

| action | issue found | stat | remedy |
| remove fiber in RTM and restart DAQ | TXLINKRESET timing1 (on rollcall); TXLINKRESET timing1,2,3 (on alloc) | 2/10 | TxLinkReset solved rollcall; TxLinkReset and RxLinkReset on xpm10,11 timing1,2,3,4 and opal |
| just restart the DAQ | Opal RxId issue; RxLinkReset on timing4 (on rollcall); rxid issue on connect opal | 3/10 | TxLinkReset |

Observing the logs, there are several instances of:
21_10:02:16_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3245855222 3245855222 resetting. Iteration: 1
21_10:02:16_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3249109743 3249109743 resetting. Iteration: 2
21_10:11:36_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3561772053 3561772053 resetting. Iteration: 1
21_10:11:36_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3565026528 3565026528 resetting. Iteration: 2
21_10:11:36_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3568281227 3568281227 resetting. Iteration: 3
21_10:11:36_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3571538383 3571538383 resetting. Iteration: 4
21_10:36:38_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 0 0 resetting. Iteration: 1
21_10:36:38_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 3255210 3255210 resetting. Iteration: 2
21_11:34:43_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 94190235 94190235 resetting. Iteration: 1
21_11:34:43_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 97444648 97444648 resetting. Iteration: 2
21_11:34:43_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 100699006 100699006 resetting. Iteration: 3
21_11:34:43_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: 103957466 103957466 resetting. Iteration: 4
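
A small helper like the following can tally these messages per log file, which makes it easier to compare firmware versions. It is only a sketch: it assumes the grep-style `<log file>:<message>` layout of the lines above, and the function name is made up for illustration.

```python
import re

# Matches lines such as
# "<log file>:*** Timing link stuck: 3245855222 3245855222 resetting. Iteration: 1"
STUCK_RE = re.compile(
    r"^(?P<source>.+):\*\*\* Timing link stuck: "
    r"(?P<first>\d+) (?P<second>\d+) resetting\. "
    r"Iteration: (?P<iteration>\d+)\s*$"
)


def summarize_stuck_links(lines):
    """Return {log file: highest reset iteration seen}, in order of first appearance."""
    episodes = {}
    for line in lines:
        match = STUCK_RE.match(line.strip())
        if match is None:
            continue  # not a "Timing link stuck" message
        source = match["source"]
        episodes[source] = max(episodes.get(source, 0), int(match["iteration"]))
    return episodes


if __name__ == "__main__":
    sample = [
        "21_10:02:16_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: "
        "3245855222 3245855222 resetting. Iteration: 1",
        "21_10:02:16_drp-neh-cmp005:fee_fzpopal_0.log:*** Timing link stuck: "
        "3249109743 3249109743 resetting. Iteration: 2",
    ]
    for source, iterations in summarize_stuck_links(sample).items():
        print(source, iterations)
```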


switching XPM firmware back to 3.6.1

~weaver/FirmwareLoader/rhel6/FirmwareLoader -a 10.0.5.102 /cds/home/w/weaver/mcs/xpm/xpm-0x030601000-20231011111954-weaver-645bee8.mcs
~weaver/FirmwareLoader/rhel6/FirmwareLoader -a 10.0.5.104 /cds/home/w/weaver/mcs/xpm/xpm_noRTM-0x030601000-20231011111938-weaver-645bee8.mcs

| action | issue found | stat | remedy |
| startup DAQ | none | 0/10 |  |

no instances of "*** Timing link stuck" in the logs


We powered on the tixel computer (the equivalent of cmp005) with the fiber unplugged, then plugged in the fiber, and it didn’t lock until we did xpmmini→lcls2. It appears that yanking the timing fiber can cause disturbances in the system, but they are not reproducible 100% of the time. XPM power spikes can put the DAQ into a state similar to the XPM glitch, but only if the pyxpms are running. To be repeated.

Upgrading the XPM firmware (to 3.6.0 from 3.5.4) seems to have mitigated all the issues. The bucket issue becomes more prominent, probably because other issues are not happening. This issue appears when power cycling the xpm11. Also, the xpmmini issue could appear when connecting already powered-up nodes.


In 2023/11/27:

Testing double offence: rebooting a node with cameralink without the fiber connected, then connecting the fiber afterwards.

| action | issue found | stat | remedy |
| rebooting cmp005 with timing fiber disconnected from xpm, then connect fiber when cmp is back on line | none | 0/5 | xpmpva does not see the opal until the daq is booted up. No ISSUES. |

Brainstorming Session

Nov. 16, 23 with mona, dan, weaver, caf, claus, melchior, cpo

proposal:

- move ric/mona/christos to xpm10 (for the future)
- give riccardo the whole system for the day and he messes with xpm10
- add startupMode=1 kwarg to opal

new xpm firmware (leaving xpm10 alone, no xpmmini->lcls2 hack):
riccardo can't reproduce the errors, except for bucket skipping
(txlinkreset fixed it for matt, but not riccardo and ric)

old xpm firmware (also messing with xpm10 with xpmmini->lcls2): riccardo could reproduce
xpm link glitch and txlinkreset (once) and (likely) xpmmini issue

theories:
- maybe ConfigLclsTimingV2 isn't reliable (should perhaps poll
  on something like rxid!=0xffffffff) 
- either new xpm firmware makes things better
- or we need to mess with xpm10 to reproduce problems
- or we're unlucky and can't reproduce (or we're not doing the right
  things to reproduce)
- might need a minimum length of time to tickle the issues (matt says
  try 30 minutes to 1 hour)
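
The "unlucky" theory can be made quantitative with a back-of-the-envelope estimate. Assuming, purely for illustration, that each trial fails independently with the 3/20 rate seen on 2023/10/26, the chance of a clean run of N trials is (1 - p)^N. The function name below is hypothetical.

```python
def prob_no_failures(failure_rate, trials):
    """Probability of zero failures in `trials` independent attempts.

    Assumes every attempt fails independently with the same probability
    `failure_rate`, which is an idealization of the real tests.
    """
    return (1.0 - failure_rate) ** trials


if __name__ == "__main__":
    rate = 3 / 20  # illustrative rate taken from the 2023/10/26 table
    for n in (5, 10, 20):
        print(n, round(prob_no_failures(rate, n), 3))
```

With a rate of 3/20 and 10 trials this comes to roughly 0.2, so a 0/10 result on its own does not rule out a failure rate of that size.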

matt has an idea for bucket-jumps.  could direct julian.

Results from Julian

Going Forward

(from mtg on Nov. 27, 2023)

Touch Base on Jan. 5, 2024

(Julian, matt, Riccardo, cpo)

To do: