...

PENDING important improvement: Julian has timing-link reset fixes (for all detectors that receive timing).  We need to update that firmware everywhere.  Note: the VHDL interface has changed (in lcls_timing_core), so it's more work (the register map is the same).  Could this solve many of the problems below?  List of firmware: tdet, wave8, camlink, epixhr, epixm, epixuhr, hsd, hrencoder, xpm+varieties, tpr (others?).  The data systems group should deploy the tdet firmware everywhere in production as a first-pass test.  If that is successful then we do everything else.  (A non-production firmware build from Julian can be found here: /cds/home/j/jumdz/mcs/DrpTDet-0x04000400-20240413131956-jumdz-dirty)

  • (important) eye-scans for all transceivers
    • hsd eyescan status on May 15, 2024: data links working, but the timing link scan needs work?
    • xpm eyescan is documented on debugging daq (in pyxpm folder)
    • Julian can hopefully add the kcu eyescan to debugging daq
    • Let's put all the eyescan software in psdaq/psdaq/eyescan
      • wave8 may not work because we don't have the rogue package in lcls2
  • work on high-speed-digitizer timing robustness using teststand
    • occasional need to restart hsdioc process
    • kcu1500 can lose link and hsd loses/regains power, and can only be recovered by power cycling cmp node
  • check wave8 timing robustness
  • (done) program hsd firmware over pcie?
  • (important) manufacture new xpm boards (4 for txi)
    • Minh is testing new cards on May 1, 2024:  gave two cards to Julian on May 15, 2024.  Julian is going to check.
    • do we need another xpm/crate order for mfx? (separate from LCLS-II-HE?).  go from mfx hutch back to 208 or the mezzanine?
      • could use xpm7 in room 208.  but would like a crate longterm
      • on May 1, 2024 a crate has not been ordered yet (and none for HE either)
  • (important) reproduce/fix timing nodes assigning wrong timestamp to configure transition by 1 or 2 buckets
    • matt thinks this is on the receiver side: some fifos that carry daq data separate from timing data.  matt thinks perhaps we have to connect the resets to those fifos.
    • have seen this in hsd/wave8.  see both being problematic after a power outage here: /cds/home/opr/tmoopr/2024/03/04_17:11:56_drp-srcf-cmp030:teb0.log (and Riccardo saw it in his tests, below)
  • (important) (perhaps done by fixing reset logic?) reproduce/fix link-lock failure on timing system KCUs
    • on April 7 (after Julian's fixes in late 2023) we had a failure where the cmp002 kcu wouldn't lock to its timing link.  Power cycling "fixed" the problem.  However, the cmp002 kcu has had other issues (see below)
  • (important) saw xpm5 link not recover on its own
    • Saw this on April 10, 2024 (see below for details)
  • make pyxpm processes robust to timing outages?
  • (done) ensure that Matt's latest xpm firmware fixes the xpm link-glitch storms
  • (perhaps done by fixing reset logic?) reproduce/fix TxLinkReset workaround
    • on May 1, 2024 it feels like we may have fixed this?
  • (perhaps done by fixing reset logic?) reproduce/fix xpmmini-to-lcls2timing workaround
    • on May 1, 2024 it feels like we may have fixed this?
  • (done, fixed with equalizer 0x3 setting) check/fix loopback fiber problem in production xpms in room 208
  • also saw two incidents in April 2024 where "cat /proc/datadev_0" showed all 1's (0xffffffff) everywhere as well as nonsensical string values.  Triggered by timing outages?  One of the instances was on cmp002 and I think the other one was on another node that I don't recall.
    • May 1, 2024: cpo recollection that we saw this twice on cmp002
    • in all cases "fixed" by power cycling
    • Matt says this means one can't read anything on the pcie bus.  Not clear who the culprit is.  The clock from the pcie bus is used for register reads.
  • (important) TPR readout group intermittently wrong
    • matt thinks this is a design flaw with a delay fifo in the timing receiver that's not present in all designs (present in TPR and ATCA on controls systems, but NOT xpm)
  • (also after Julian's fixes in late 2023) a tdet kcu1500 on drp-srcf-cmp010 showed a failure mode where its pulse-ids were off by one pulse-id (the "bucket jumping" problem that Riccardo reproduced on the teststand).  teb log file showing the cmp010 problem: /cds/home/opr/rixopr/scripts/logfiles/2024/04/08_11:58:28_drp-srcf-cmp013:teb0.log.  Power cycling "fixed" the problem.  The split-event partial output in that log shows that the timestamps of the two Andors on cmp010 were incorrect, since all other detectors showed 0x8ff3 at the end.  A similar failure on drp-srcf-cmp025 can be seen here: /cds/home/opr/rixopr/scripts/logfiles/2024/04/13_12:43:08_drp-srcf-cmp013:teb0.log.  I believe there was a timing outage two days previously.
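
The "all other detectors showed 0x8ff3 at the end" check in the last item can be sketched in a few lines of Python.  This is only an illustration under assumptions: the function name, the contributor names and the timestamp values are hypothetical, and it does not parse the real teb log format.  It compares the low 16 bits of each contributor's timestamp against the majority value and reports whoever disagrees.

```python
from collections import Counter


def find_timestamp_outliers(timestamps, mask=0xFFFF):
    """Return {name: low_bits} for contributors whose masked timestamp
    differs from the value reported by the majority of contributors.

    `timestamps` maps a contributor name to an integer timestamp.
    The 16-bit mask is an assumption, chosen to match the four hex
    digits ("0x8ff3") quoted in the notes above.
    """
    if not timestamps:
        return {}
    low_bits = {name: ts & mask for name, ts in timestamps.items()}
    majority_value, _count = Counter(low_bits.values()).most_common(1)[0]
    return {name: bits for name, bits in low_bits.items() if bits != majority_value}


# Hypothetical values for illustration only (not taken from a real log):
# two contributors are one count behind everyone else.
example = {
    "timing_0": 0x123458FF3,
    "hsd_0": 0x123458FF3,
    "wave8_0": 0x123458FF3,
    "andor_0": 0x123458FF2,
    "andor_1": 0x123458FF2,
}
outliers = find_timestamp_outliers(example)
for name, bits in sorted(outliers.items()):
    print(f"{name}: timestamp ends in {bits:#06x}, differs from majority")
```

A majority vote is used here because the notes identify the bad nodes by comparing them against what every other detector reported; if fewer than half of the contributors are healthy this sketch would flag the wrong ones.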

...