  1. (important) eye-scans for all transceivers
    1. hsd eyescan status on May 15, 2024: data links working, but the timing link scan needs work?
    2. xpm eyescan is documented on debugging daq (in pyxpm folder)
    3. Julian can hopefully add the kcu eyescan to debugging daq
    4. Let's put all the eyescan software in psdaq/psdaq/eyescan
      1. wave8 may not work because we don't have the rogue package in lcls2
  2. (important) eye-scan for hsd jesd links?
    1. in progress
  3. work on high-speed-digitizer timing robustness using teststand
    1. occasional need to restart hsdioc process
    2. kcu1500 can lose link and hsd loses/regains power, and can only be recovered by power cycling cmp node
  4. check wave8 timing robustness
  5. (done) program hsd firmware over pcie?
  6. (important) manufacture new xpm boards (4 for txi)
    1. May 1, 2024: Minh is testing new cards.  May 15, 2024: gave two cards to Julian, who is going to check them.
    2. do we need another xpm/crate order for mfx? (separate from LCLS-II-HE?).  go from mfx hutch back to 208 or the mezzanine?
      1. could use xpm7 in room 208.  but would like a crate longterm
      2. on May 1, 2024 a crate has not been ordered yet (and none for HE either)
  7. (important) reproduce/fix timing nodes assigning wrong timestamp to configure transition by 1 or 2 buckets
    1. matt thinks this is on the receiver side: some fifos that carry daq data separate from timing data.  matt thinks perhaps we have to connect the resets to those fifos.
    2. have seen this in hsd/wave8.  see both being problematic after a power outage here: /cds/home/opr/tmoopr/2024/03/04_17:11:56_drp-srcf-cmp030:teb0.log (and Riccardo saw it in his tests, below)
    3. saw this on May 27 or 28 on drp-srcf-cmp025 running Julian's new 0x4000400 firmware.
  8. (important) (perhaps done by fixing reset logic?) reproduce/fix link-lock failure on timing system KCUs
    1. on April 7, after Julian's fixes in late 2023, we had a failure where the cmp002 kcu wouldn't lock to its timing link.  power cycling "fixed" the problem.  However, the cmp002 kcu has had other issues (see below)
  9. (important) saw xpm5 link not recover on its own
    1. Saw this on April 10, 2024 (see below for details)
  10. (important) after a day-long ACR timing outage on May 22, 2024, xpm3 timing frames got largely "stuck".  Seen using the xpmpva "RxLinkDump" button with the unused lane in loopback mode.  Details are here: https://confluence.slac.stanford.edu/display/PSDMInternal/Debugging+DAQ#DebuggingDAQ-DecodingXPMPackets
    1. this was "fixed" on xpm3 with TxLinkReset from xpm0 to xpm3.  there is a RxReset on the UsTiming tab of xpm3 that might have also worked.  "CLEAR" on the groupca events-tab resets counters and also some xpm logic, but this didn't fix the issue.
  11. make pyxpm processes robust to timing outages?
  12. (done) ensure that Matt's latest xpm firmware fixes the xpm link-glitch storms
  13. (perhaps done by fixing reset logic?) reproduce/fix TxLinkReset workaround
    1. on May 1, 2024 it feels like we may have fixed this?
  14. (perhaps done by fixing reset logic?) reproduce/fix xpmmini-to-lcls2timing workaround
    1. on May 1, 2024 it feels like we may have fixed this?
  15. (done, fixed with equalizer 0x3 setting) check/fix loopback fiber problem in production xpms in room 208
  16. also saw two incidents in April 2024 where "cat /proc/datadev_0" showed all 1's (0xffffffff) everywhere as well as nonsensical string values.  Triggered by timing outages?  One of the instances was on cmp002 and I think the other one was on another node that I don't recall.
    1. May 1, 2024: cpo recollection that we saw this twice on cmp002
    2. in all cases "fixed" by power cycling
    3. Matt says: means one can't read anything on the pcie bus.  Not clear who the culprit is.  clock is used from the pci bus for register reads.
  17. (important) TPR readout group intermittently wrong
    1. matt thinks this is a design flaw with a delay fifo in the timing receiver that's not present in all designs (present in TPR and ATCA on controls systems, but NOT xpm)
  18. (also after Julian's fixes in late 2023) a failure mode of a tdet kcu1500 on drp-srcf-cmp010 where its pulse-ids were off by one pulse-id (the "bucket jumping" problem that Riccardo reproduced on the teststand).  teb log file showing the cmp010 problem: /cds/home/opr/rixopr/scripts/logfiles/2024/04/08_11:58:28_drp-srcf-cmp013:teb0.log.  Powercycling "fixed" the problem.  Split-event partial output from that log: the timestamps of the two Andors on cmp010 were incorrect, since all other detectors showed 0x8ff3 at the end.  A similar failure on drp-srcf-cmp025 can be seen here: /cds/home/opr/rixopr/scripts/logfiles/2024/04/13_12:43:08_drp-srcf-cmp013:teb0.log.  There was a timing outage two days previously, I believe.
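
When reading split-event output like the teb logs above, the "bucket jumping" symptom amounts to a few contributors disagreeing with the timestamp that most others report. A minimal sketch of that comparison follows. It assumes the timestamps have already been extracted from the log into a dictionary; the function name, detector names and values are hypothetical and are not taken from the actual log files.

```python
from collections import Counter


def find_timestamp_outliers(timestamps):
    """Return {name: delta} for contributors whose timestamp differs
    from the most common (majority) value; delta is in raw timestamp units."""
    if not timestamps:
        return {}
    # The most common value is taken as the "correct" timestamp.
    majority, _count = Counter(timestamps.values()).most_common(1)[0]
    return {
        name: ts - majority
        for name, ts in timestamps.items()
        if ts != majority
    }


# Hypothetical low-order timestamp bits per contributor (illustrative only).
example = {
    "andor_0": 0x8FF4,
    "andor_1": 0x8FF4,
    "timing_0": 0x8FF3,
    "hsd_0": 0x8FF3,
    "hsd_1": 0x8FF3,
}
outliers = find_timestamp_outliers(example)
for name, delta in sorted(outliers.items()):
    print(f"{name}: off by {delta:+d}")
```

The majority vote only makes sense when most contributors are healthy; if half the nodes jump together, the "majority" value is ambiguous and the raw values need to be inspected by hand.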

...

  • Julian:
    • focus on the stuck frames in the epixHR system
    • four prototype XPM boards are in production with new connector (only 1 so far?).  Larry will work with Julian (with advice from Matt) to test the boards.  One goes to BPM group, another to low-level-RF test stand.  Not clear who these are going to (we're not the only customer)
    • will implement bucket-hopping fix (with advice from Matt)
  • Riccardo
    • will test when bucket-hopping fix is available
    • non-self-locking xpm ports
    • longer term: add hsd/wave8 systems to test stand
  • cpo will try to reproduce the stuck-frames (which we "fix" with xpmmini→lcls2 workaround) with the tixel system that Christos Bakalis is using.  Now scheduled for Dec. 12

Meetings

...

Jan. 5, 2024

(Julian, matt, Riccardo, cpo)

...

  • Julian:
    • provides equalizer parameters
    • works on remaining 3 XPM prototypes to make sure they're good
    • provides lcls2-pgp-pcie-apps firmware for camlink
  • Riccardo:
    • check the room 208 xpm fiber-loopback with the new equalizer parameters when they are available
    • check the bucket hopping fix when it is available
    • (lower priority) ideally add hsd/wave8 to the teststand
      • to try to get the data links to be robust to power outages and fiber-unplugs
      • possibility to put it in an existing drp node (if we remove infiniband card) or cpo thinks we may have a spare chassis (setup from Omar might block us)
  • Matt:
    • provides new firmware with above fixes, and programming of equalizer values
  • Chris:
    • work with Christos to get the fixes in the tixel

May 29, 2024

Update from Julian:

  • hsd-kcu eyescan software needs to be pushed
  • Julian will merge the eyediag and/or eyescan branches as needed
  • hsd jesd scan is in progress
  • XPM order: need to clarify whether there are one or two orders, and how many xpms the photon side gets.  We need 2 for mfx, 2 for txi plus spares.  Matt is asking Thuy for this info.
    • Update: Matt writes that we own the following hardware: "2 xpm boards are ready (Thuy has them), 2 xpm boards are still being completed by Lupe.  We have 8 AMC cards (need 2 per carrier).  We have 2 passive timing fanouts and 2 network switches (one is in use elsewhere).  She lists 2 atca crates with no location - I think those are not yet acquired".  Matt says we may only have 4 carriers.  Julian only has 2 AMC cards, but perhaps those are for the "other order" that isn't LCLS.
  • ATCA crate vendor still in-progress (need 2 for mfx (hutch/mezzanine) and 1 for txi)
  • Matt says there might be a board that can do low-jitter xpm functionality.  This would be cool.

June 12, 2024

  • Whole xpm→hsd→drp path is done except the JESD links.  Needs latest xpm/hsd/drp firmware.  New hsd firmware is 0x5000100.  The Feb. 6 hsd-drp firmware here https://github.com/slaclab/l2si-drp/releases (4.0.4) has this.  hsd needs new software since the address map changed (backward-incompatible), but that software is still only in the eyediag branch.  Be careful when we merge this branch (do this in July downtime?)
  • JESD link eye-scan is in-progress (getting close)
    • JESD links are over the mezzanine connector (not firefly)
  • cmp005 datadev_1 eye-scan looks terrible (rix hsd_2, hsd_3, which are known bad).  lab3 hsd looks "poor" (BER 10**-19), but we expect 10**-21 in loopback mode.
    • Matt says there are two parameters in the MGT to try to improve lab3 loopback (exposed in the pgp axilite interface): "diff control" and "post cursor" (tx params).  Matt exposed these in epics.
    • not aware of any QSFP parameters (and we don't have I2C bus hooked up)