...

  • eye-scans for all transceivers
  • work on high-speed-digitizer timing robustness using teststand
    • occasional need to restart hsdioc process
    • kcu1500 can lose link and hsd loses/regains power; the only recovery is power cycling the cmp node
  • check wave8 timing robustness
  • (almost done) program hsd firmware over pcie?
  • manufacture new xpm boards (4 for txi)
    • do we need another xpm/crate order for mfx (separate from LCLS-II-HE)?  would it go from the mfx hutch back to 208 or to the mezzanine?
  • reproduce/fix timing nodes assigning wrong timestamp to configure transition by 1 or 2 buckets
    • matt thinks this is on the receiver side: some fifos carry daq data separately from timing data, and perhaps we have to connect the resets to those fifos.
    • have seen this in hsd/wave8.  both were problematic after a power outage, see: /cds/home/opr/tmoopr/2024/03/04_17:11:56_drp-srcf-cmp030:teb0.log (Riccardo also saw it in his tests, below)
  • (perhaps done by fixing reset logic?) reproduce/fix link-lock failure on timing system KCUs
  • make pyxpm processes robust to timing outages?
  • (done) ensure that Matt's latest xpm firmware fixes the xpm link-glitch storms
  • (perhaps done by fixing reset logic?) reproduce/fix TxLinkReset workaround
  • (perhaps done by fixing reset logic?) reproduce/fix xpmmini-to-lcls2timing workaround
  • (done, fixed with equalizer 0x3 setting) check/fix loopback fiber problem in production xpms in room 208
  • after Julian's fixes in late 2023: on April 7 we had a failure where the cmp002 kcu wouldn't lock to its timing link.  power cycling "fixed" the problem.
  • also saw two incidents in April 2024 where "cat /proc/datadev_0" showed all 1's (0xffffffff) everywhere as well as nonsensical string values.  Likely triggered by timing outages?
  • (also after Julian's fixes in late 2023) a tdet kcu1500 on drp-srcf-cmp010 produced timestamps that were off by one pulse-id (the "bucket jumping" problem that Riccardo reproduced on the teststand).  teb log file showing the cmp010 problem: /cds/home/opr/rixopr/scripts/logfiles/2024/04/08_11:58:28_drp-srcf-cmp013:teb0.log.  Power cycling "fixed" the problem.  In the split-event partial output from that log, the timestamps of the two Andors on cmp010 were incorrect: all other detectors showed 0x8ff3 at the end.  A similar failure on drp-srcf-cmp025 can be seen here: /cds/home/opr/rixopr/scripts/logfiles/2024/04/13_12:43:08_drp-srcf-cmp013:teb0.log.  There was a timing outage two days previously, I believe.
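
The "bucket jumping" check described in the last item (spot the contributors whose pulse-id disagrees with what all the other detectors report) could be sketched roughly as below. This is only an illustration, not the teb's actual logic: the function name, the detector names, and the Andor values are hypothetical, and only the 0x8ff3 ending comes from the log mentioned above.

```python
from collections import Counter


def find_pulse_id_outliers(pulse_ids):
    """Return {name: (pulse_id, delta)} for contributors that disagree
    with the most common pulse-id in one (split) event.

    pulse_ids: dict mapping a contributor name to the pulse-id it reported.
    delta is the signed offset from the majority value, so +1 or -1
    would correspond to a jump of one pulse-id.
    """
    if not pulse_ids:
        return {}
    # Majority vote: assume most contributors are correct.
    # On a tie, Counter returns the value that was seen first.
    majority, _count = Counter(pulse_ids.values()).most_common(1)[0]
    return {
        name: (pid, pid - majority)
        for name, pid in pulse_ids.items()
        if pid != majority
    }


# Hypothetical event: names and the 0x8ff4 values are made up for
# illustration; only the low bits 0x8ff3 appear in the log above.
example = {
    "andor_0": 0x8FF4,
    "andor_1": 0x8FF4,
    "timing_0": 0x8FF3,
    "hsd_0": 0x8FF3,
    "wave8_0": 0x8FF3,
}
outliers = find_pulse_id_outliers(example)
for name, (pid, delta) in sorted(outliers.items()):
    print(f"{name}: pulse-id {pid:#x} is off by {delta:+d}")
```

A majority vote only makes sense when most contributors are healthy; if several nodes jump together after a timing outage, a comparison against a trusted reference (for example the timing system's own pulse-id) would be needed instead.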

...