PENDING important improvement: Julian has timing link reset fixes (for all detectors that receive timing). We need to update all that firmware everywhere. Note: the VHDL interface has changed (in lcls_timing_core), so it's more work (the register map is the same). Could this solve many of the problems below? List of firmware: tdet, wave8, camlink, epixhr, epixm, epixuhr, hsd, hrencoder, xpm+varieties, tpr (others?). The data systems group should deploy the tdet firmware everywhere in production as a first-pass test. If that is successful then we do everything else. (A non-production firmware build from Julian can be found here: /cds/home/j/jumdz/mcs/DrpTDet-0x04000400-20240413131956-jumdz-dirty)
- (important) eye-scans for all transceivers
- hsd eyescan status on May 15, 2024: data links working, but the timing link scan needs work?
- xpm eyescan is documented on debugging daq (in pyxpm folder)
- Julian can hopefully add the kcu eyescan to debugging daq
- Let's put all the eyescan software in psdaq/psdaq/eyescan
- wave8 may not work because we don't have the rogue package in lcls2
- work on high-speed-digitizer timing robustness using teststand
- occasional need to restart hsdioc process
- kcu1500 can lose link and hsd loses/regains power, and can only be recovered by power cycling cmp node
- check wave8 timing robustness
- (done) program hsd firmware over pcie?
- (important) manufacture new xpm boards (4 for txi)
- May 1, 2024: Minh is testing new cards. May 15, 2024: gave two cards to Julian, who is going to check them.
- do we need another xpm/crate order for mfx? (separate from LCLS-II-HE?). go from mfx hutch back to 208 or the mezzanine?
- could use xpm7 in room 208. but would like a crate longterm
- on May 1, 2024 a crate has not been ordered yet (and none for HE either)
- (important) reproduce/fix timing nodes assigning wrong timestamp to configure transition by 1 or 2 buckets
- matt thinks this is on the receiver side: some fifos that carry daq data separate from timing data. matt thinks perhaps we have to connect the resets to those fifos.
- have seen this in hsd/wave8. See both being problematic after a power outage here: /cds/home/opr/tmoopr/2024/03/04_17:11:56_drp-srcf-cmp030:teb0.log (and Riccardo saw it in his tests, below)
- (important) (perhaps done by fixing reset logic?) reproduce/fix link-lock failure on timing system KCUs
- after Julian's fixes in late 2023, on April 7 we had a failure where the cmp002 kcu wouldn't lock to its timing link. Power cycling "fixed" the problem. However, the cmp002 kcu has had other issues (see below)
- (important) saw xpm5 link not recover on its own
- Saw this on April 10, 2024 (see below for details)
- make pyxpm processes robust to timing outages?
- (done) ensure that Matt's latest xpm firmware fixes the xpm link-glitch storms
- (perhaps done by fixing reset logic?) reproduce/fix TxLinkReset workaround
- on May 1, 2024 it feels like we may have fixed this?
- (perhaps done by fixing reset logic?) reproduce/fix xpmmini-to-lcls2timing workaround
- on May 1, 2024 it feels like we may have fixed this?
- (done, fixed with equalizer 0x3 setting) check/fix loopback fiber problem in production xpms in room 208
- also saw two incidents in April 2024 where "cat /proc/datadev_0" showed all 1's (0xffffffff) everywhere as well as nonsensical string values. Triggered by timing outages? One of the instances was on cmp002 and I think the other one was on another node that I don't recall.
- May 1, 2024: cpo recollection that we saw this twice on cmp002
- in all cases "fixed" by power cycling
- Matt says: this means one can't read anything on the pcie bus. Not clear who the culprit is. The clock from the pci bus is used for register reads.
- (important) TPR readout group intermittently wrong
- matt thinks this is a design flaw with a delay fifo in the timing receiver that's not present in all designs (present in TPR and ATCA on controls systems, but NOT xpm)
- (also after Julian's fixes in late 2023) a tdet kcu1500 on drp-srcf-cmp010 showed a failure mode where its pulse-ids were off by one (the "bucket jumping" problem that Riccardo reproduced on the teststand). teb log file showing the cmp010 problem: /cds/home/opr/rixopr/scripts/logfiles/2024/04/08_11:58:28_drp-srcf-cmp013:teb0.log. Power cycling "fixed" the problem. The split-event partial output from that log shows that the timestamps of the two Andors on cmp010 were incorrect, since all other detectors showed 0x8ff3 at the end. A similar failure on drp-srcf-cmp025 can be seen here: /cds/home/opr/rixopr/scripts/logfiles/2024/04/13_12:43:08_drp-srcf-cmp013:teb0.log. There was a timing outage two days previously, I believe.
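A rough sketch of how the "bucket jumping" symptom could be flagged automatically: given the timestamp each detector reported for one event, take the most common value as the reference and list the detectors that disagree, with their offset. This does not parse the teb log format; the function name, detector names and timestamp values below are all hypothetical, and only the 0x8ff3 ending is taken from the note above.

```python
from collections import Counter


def find_timestamp_outliers(timestamps):
    """Return {detector: offset} for detectors that disagree with the majority.

    `timestamps` maps a detector name to the integer timestamp (or pulse-id)
    it reported for one event. The offset is reported minus majority.
    """
    if not timestamps:
        return {}
    # The most common value is treated as the "correct" timestamp (an assumption).
    majority, _count = Counter(timestamps.values()).most_common(1)[0]
    return {
        name: ts - majority
        for name, ts in timestamps.items()
        if ts != majority
    }


# Hypothetical split-event contributions: two detectors are one count behind.
example = {
    "andor_0": 0x12348FF2,
    "andor_1": 0x12348FF2,
    "timing_0": 0x12348FF3,
    "hsd_0": 0x12348FF3,
    "wave8_0": 0x12348FF3,
}

outliers = find_timestamp_outliers(example)
for name, offset in sorted(outliers.items()):
    print(f"{name}: off by {offset:+d}")
```

Using the majority as the reference only works when most contributors are healthy; if several nodes jump together after a timing outage, a trusted reference (e.g. the timing node itself) would be a better choice.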
...