...

PENDING important improvement: Julian has timing-link reset fixes (for all detectors that receive timing).  We need to update all of that firmware everywhere.  Note: the VHDL interface has changed (in lcls_timing_core), so it is more work (the register map is the same). Could this solve many of the problems below? List of firmware: tdet, wave8, camlink, epixhr, epixm, epixuhr, hsd, hrencoder, xpm+varieties, tpr (others?).  The data systems group should deploy the tdet firmware everywhere in production as a first-pass test.  If that is successful then we do everything else. (A non-production firmware build from Julian can be found here: /cds/home/j/jumdz/mcs/DrpTDet-0x04000400-20240413131956-jumdz-dirty)

  1. (important) eye-scans for all transceivers
    1. hsd eyescan status on May 15, 2024: data links working, but the timing link scan needs work?
    2. xpm eyescan is documented on the Debugging DAQ page (in the pyxpm folder)
    3. Julian can hopefully add the kcu eyescan to the Debugging DAQ page
    4. Let's put all the eyescan software in psdaq/psdaq/eyescan (a hypothetical layout sketch follows this list item)
      1. wave8 may not work because we don't have the rogue package in lcls2
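A purely hypothetical layout sketch for the proposed psdaq/psdaq/eyescan collection (none of these module or function names exist; they only illustrate how the per-device scans could hang off a single entry point):
Code Block
#!/usr/bin/env python
# Hypothetical sketch of psdaq/psdaq/eyescan/__main__.py -- the module names
# and the per-module scan() convention are invented for illustration only.
import argparse
import importlib

SCANNERS = {
    'hsd':   'psdaq.eyescan.hsd',
    'xpm':   'psdaq.eyescan.xpm',
    'kcu':   'psdaq.eyescan.kcu',
    'wave8': 'psdaq.eyescan.wave8',   # may need rogue, which isn't in the lcls2 env
}

def main():
    parser = argparse.ArgumentParser(description='Run a transceiver eye-scan')
    parser.add_argument('device', choices=sorted(SCANNERS), help='device type to scan')
    parser.add_argument('target', help='device address (e.g. PV base or PCIe id)')
    args = parser.parse_args()

    mod = importlib.import_module(SCANNERS[args.device])
    mod.scan(args.target)   # each per-device module would expose scan(target)

if __name__ == '__main__':
    main()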
  2. (important) eye-scan for hsd jesd links?
  3. work on high-speed-digitizer timing robustness using teststand
    1. occasional need to restart hsdioc process
    2. the kcu1500 can lose link when the hsd loses/regains power, and can only be recovered by power cycling the cmp node
  4. check wave8 timing robustness
  5. (done) program hsd firmware over pcie?
  6. (important) manufacture new xpm boards (4 for txi)
    1. Minh was testing the new cards on May 1, 2024; two cards were given to Julian on May 15, 2024, and Julian is going to check them.
    2. do we need another xpm/crate order for mfx? (separate from LCLS-II-HE?).  go from mfx hutch back to 208 or the mezzanine?
      1. could use xpm7 in room 208, but we would like a crate long term
      2. as of May 1, 2024 a crate had not been ordered yet (and none for HE either)
  7. (important) reproduce/fix timing nodes assigning a timestamp to the configure transition that is wrong by 1 or 2 buckets
    1. Matt thinks this is on the receiver side: there are some fifos that carry daq data separately from the timing data, and perhaps we have to connect the resets to those fifos.
    2. have seen this in hsd/wave8.  see both being problematic after a power outage here: /cds/home/opr/tmoopr/2024/03/04_17:11:56_drp-srcf-cmp030:teb0.log (and Riccardo saw it in his tests, below)
  8. (important) (perhaps done by fixing reset logic?) reproduce/fix link-lock failure on timing system KCUs
    1. after Julian's fixes in late 2023, on April 7, 2024 we had a failure where the cmp002 kcu wouldn't lock to its timing link.  power cycling "fixed" the problem.  However, the cmp002 kcu has had other issues (see below)
  9. (important) saw xpm5 link not recover on its own
    1. Saw this on April 10, 2024 (see below for details)
  10. (important) after a day-long ACR timing outage on May 22, 2024, xpm3 timing frames got largely "stuck".  Seen using the xpmpva "RxLinkDump" button with the unused lane in loopback mode.  Details are here: https://confluence.slac.stanford.edu/display/PSDMInternal/Debugging+DAQ#DebuggingDAQ-DecodingXPMPackets
    1. this was "fixed" on xpm3 with a TxLinkReset from xpm0 to xpm3. There is an RxReset on the UsTiming tab of xpm3 that might also have worked. "CLEAR" on the groupca events tab resets counters (and also some xpm logic), but it didn't fix the issue. (A hedged scripting sketch of the TxLinkReset is below.)
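For reference, the TxLinkReset/RxReset buttons in xpmpva are just PV writes, so a recovery like the one above could in principle be scripted. This is only a hedged sketch: the PV name below is a hypothetical placeholder (check the actual xpmpva PV prefix and link-reset field names before using anything like it).
Code Block
#!/usr/bin/env python
# Hedged sketch: scripting the TxLinkReset that recovered xpm3.
# The PV name is a placeholder, NOT the real one -- look up the actual
# xpmpva PVs for the xpm0 link that feeds xpm3 before trying this.
from p4p.client.thread import Context

ctx = Context('pva')

TX_LINK_RESET_PV = 'DAQ:NEH:XPM:0:LinkTxReset3'   # hypothetical name

def pulse(pv):
    """Write 1 then 0, emulating a button press in xpmpva."""
    ctx.put(pv, 1)
    ctx.put(pv, 0)

pulse(TX_LINK_RESET_PV)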
  11. make pyxpm processes robust to timing outages?
  12. (done) ensure that Matt's latest xpm firmware fixes the xpm link-glitch storms
  13. (perhaps done by fixing reset logic?) reproduce/fix TxLinkReset workaround
    1. on May 1, 2024 it feels like we may have fixed this?
  14. (perhaps done by fixing reset logic?) reproduce/fix xpmmini-to-lcls2timing workaround
    1. on May 1, 2024 it feels like we may have fixed this?
  15. (done, fixed with equalizer 0x3 setting) check/fix loopback fiber problem in production xpms in room 208
  16. also saw two incidents in April 2024 where "cat /proc/datadev_0" showed all 1's (0xffffffff) everywhere, as well as nonsensical string values.  Triggered by timing outages?  One of the instances was on cmp002, and I think the other was on another node that I don't recall.
    1. May 1, 2024: cpo recollection that we saw this twice on cmp002
    2. in all cases "fixed" by power cycling
    3. Matt says this means one can't read anything on the pcie bus; it is not clear who the culprit is.  The clock used for register reads comes from the pci bus.
  17. (important) TPR readout group intermittently wrong
    1. Matt thinks this is a design flaw with a delay fifo in the timing receiver that's not present in all designs (present in the TPR and ATCA on controls systems, but NOT the xpm)
  18. (also after Julian's fixes in late 2023) this file shows a failure mode of a tdet kcu1500 on drp-srcf-cmp010 where its pulse-ids were off by one pulse-id (the "bucket jumping" problem that Riccardo reproduced on the teststand).  teb log file showing the cmp010 problem: /cds/home/opr/rixopr/scripts/logfiles/2024/04/08_11:58:28_drp-srcf-cmp013:teb0.log.  Power cycling "fixed" the problem.  The split-event partial output from that log is below (the timestamps of the two Andors on cmp010 were incorrect, since all other detectors showed 0x8ff3 at the end); a sketch for pulling these offsets out of a teb log follows the excerpt.  A similar failure on drp-srcf-cmp025 can be seen here: /cds/home/opr/rixopr/scripts/logfiles/2024/04/13_12:43:08_drp-srcf-cmp013:teb0.log.  There was a timing outage two days previously, I believe.
Code Block
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff2, size 0, source 0 (andor_norm_0)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff2, size 0, source 1 (andor_dir_0)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff3, size 0, source 2 (manta_0)
rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff3, size 0, source 3 (mono_encoder_0)
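The off-by-one node can be picked out of a teb log mechanically: collect the "Fixup Configure" pulse-ids per source and flag the sources that disagree with the majority. A minimal sketch, grounded only in the log format shown above (the script and its output format are otherwise my own invention):
Code Block
#!/usr/bin/env python
# Sketch: find which sources reported a different Configure pulse-id
# than the majority, from "Fixup Configure" teb log lines like those above.
import re
import sys
from collections import Counter

# e.g. rix-teb[2111]: <W> Fixup Configure, 008a4a15bf8ff2, size 0, source 0 (andor_norm_0)
PATTERN = re.compile(r'Fixup Configure, ([0-9a-f]+), size \d+, source \d+ \((\S+)\)')

def main(logfile):
    pulse_by_source = {}            # last Configure fixup seen per source
    with open(logfile) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                pulseid, source = m.groups()
                pulse_by_source[source] = int(pulseid, 16)

    if not pulse_by_source:
        print('no Fixup Configure lines found')
        return

    # Treat the most common pulse-id as the "correct" one.
    majority, _ = Counter(pulse_by_source.values()).most_common(1)[0]
    for source, pid in sorted(pulse_by_source.items()):
        tag = 'OK' if pid == majority else f'OFF BY {pid - majority:+d} bucket(s)'
        print(f'{source:20s} 0x{pid:014x}  {tag}')

if __name__ == '__main__':
    main(sys.argv[1])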
  1. Saw this issue on drp-srcf-cmp002; also saw it on drp-srcf-cmp004 on May 7, 2024.  May 22, 2024: seems to be better after replacing the kcu1500 on cmp002?  It was happening about once per day; haven't seen it in about a week now.  (A detection sketch for the all-1's /proc/datadev_0 state follows the output below.)
    1. Code Block
      (ps-4.6.3) drp-srcf-cmp004:software$ cat /proc/datadev_0 
      -------------- Axi Version ----------------
           Firmware Version : 0xffffffff
                 ScratchPad : 0xffffffff
              Up Time Count : 4294967295
                  Device ID : 0xffffffff
                   Git Hash : ffffffffffffffffffffffffffffffffffffffff
                  DNA Value : 0xffffffffffffffffffffffffffffffff
               Build String : ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������[�d�
      
      -------------- General HW -----------------
                Int Req Count : 4294967295
              Hw Dma Wr Index : 4294967295
              Sw Dma Wr Index : 3136
              Hw Dma Rd Index : 4294967295
              Sw Dma Rd Index : 323
           Missed Wr Requests : 4294967295
             Missed IRQ Count : 27819533
               Continue Count : 0
                Address Count : 4096
          Hw Write Buff Count : 4095
           Hw Read Buff Count : 0
                 Cache Config : 0xffffffff
                  Desc 128 En : 1
                  Enable Ver  : 0xffffffff
            Driver Load Count : 255
                     IRQ Hold : 4294967295
                    BG Enable : 0x0
      
      -------------- General --------------------
                Dma Version : 0x6
                Git Version : 5.17.3
      
      -------------- Read Buffers ---------------
               Buffer Count : 1048572
                Buffer Size : 8192
                Buffer Mode : 2
            Buffers In User : 0
              Buffers In Hw : 4095
        Buffers In Pre-Hw Q : 1044477
        Buffers In Rx Queue : 0
            Missing Buffers : 0
             Min Buffer Use : 2
             Max Buffer Use : 227890
             Avg Buffer Use : 1116
             Tot Buffer Use : 1170295872
      
      -------------- Write Buffers ---------------
               Buffer Count : 16
                Buffer Size : 8192
                Buffer Mode : 2
            Buffers In User : 0
              Buffers In Hw : 0
        Buffers In Pre-Hw Q : 16
        Buffers In Sw Queue : 0
            Missing Buffers : 0
             Min Buffer Use : 5141
             Max Buffer Use : 5142
             Avg Buffer Use : 5141
             Tot Buffer Use : 82259
      
      (ps-4.6.3) drp-srcf-cmp004:software$ 
      drp-srcf-cmp002:~$ cat /proc/datadev_0 
      -------------- Axi Version ----------------
           Firmware Version : 0xffffffff
                 ScratchPad : 0xffffffff
              Up Time Count : 4294967295
                  Device ID : 0xffffffff
                   Git Hash : ffffffffffffffffffffffffffffffffffffffff
                  DNA Value : 0xffffffffffffffffffffffffffffffff
               Build String : �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������A
      
      
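When PCIe register reads have failed in this way, every register in /proc/datadev_0 reads back as all 1's, so the "Firmware Version : 0xffffffff" line is a convenient fingerprint to alarm on. A minimal detection sketch (the field name comes from the output above; the check itself is only a suggestion, not an existing tool):
Code Block
#!/usr/bin/env python
# Sketch: flag kcu1500 cards whose PCIe register reads have gone to all 1's,
# using the "Firmware Version" field of /proc/datadev_* as a fingerprint.
import glob
import sys

def pcie_reads_dead(procfile):
    with open(procfile) as f:
        for line in f:
            if 'Firmware Version' in line:
                # A healthy card reports a real version; a wedged one reads 0xffffffff.
                return line.split(':')[-1].strip().lower() == '0xffffffff'
    return False   # field not found; assume OK for this sketch

bad = [p for p in sorted(glob.glob('/proc/datadev_*')) if pcie_reads_dead(p)]
if bad:
    print('PCIe register reads failing (power cycle likely needed):', ', '.join(bad))
    sys.exit(1)
print('all datadev devices readable')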
  2. XPM Link issues 2024/04/10-2024/04/11:
    1. Around 14:00-14:10 on 2024/04/10, the RIX Grafana page shows there were fairly global XPM issues (measured by XPM RxDspErrs rates)
    2. XPM5 link (XPM3-5) goes down around 14:07 on 2024/04/10
    3. Other XPMs recover but 5 does not, and the link stays down.
    4. xpmpva shows XPM5 looks mostly healthy except for RxLinkUp
    5. Required TxLinkReset to restore RxLinkUp (on 2024/04/11 ~09:15).
  1. Summary Of Testing

These are the results of the tests conducted in the FEE alcove to determine whether the XPM glitch can be reproduced.
Every test is run from a starting state in which the DAQ can allocate, configure, run, and disable.
Whenever the DAQ cannot reach that starting state, remedies are applied to recover it. (An illustrative sketch of such a test loop follows.)
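The per-test procedure can be thought of as repeated DAQ cycles with a health check in between; the sketch below is illustrative only (daq_cycle and daq_healthy are hypothetical placeholders for the real DAQ control and status calls, which are not shown on this page).
Code Block
#!/usr/bin/env python
# Illustrative shape of a reproduce-the-glitch test loop; the two helpers
# are placeholders, not real psdaq functions.

def daq_cycle():
    """Placeholder: run one allocate/configure/run/disable cycle."""
    raise NotImplementedError('replace with the real DAQ control sequence')

def daq_healthy():
    """Placeholder: check the DAQ is back in the known-good starting state."""
    raise NotImplementedError('replace with real DAQ status checks')

def run_test(n_cycles=100):
    for i in range(n_cycles):
        daq_cycle()
        if not daq_healthy():
            print(f'glitch reproduced on cycle {i}; apply remedies and record state')
            return
    print(f'no glitch seen in {n_cycles} cycles')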

...