OVERVIEW

RTEMS 4.10.2, the version of RTEMS used at SLAC, has several issues that are disrupting operations and IT:

  1. IOCs intermittently lose access to the NFS server. Autosave is the EPICS module that uses NFS most frequently, so it is usually the first to report the problem. After an unplanned power outage in March 2023, the issue became substantially more frequent. Power-cycling the VME crate usually fixes the problem for a while, but it eventually returns.
  2. When running EPICS 7.0.3.1-1.0 IOCs with RTEMS 4.10.2 on MVME6100 (Beatnik) boards in our VME crates, the system does not recognize the second Ethernet port. This port is used by systems that communicate with the Coldfire-uC5282 pizza boxes. The issue is blocking the upgrade of all LCLS IOCs to EPICS 7. It was not observed with EPICS R3.14.12/RTEMS 4.9.4 or EPICS R7.0.2-1.0/RTEMS 4.10.1.
  3. This version of RTEMS is not compatible with NFS v4. This prevents the lab from switching completely to S3DF and turning off the NFS v2 server currently in use.

This proposal addresses these issues on two fronts:

  1. Fix the issues in RTEMS 4.10.2 so the fixes can be deployed as soon as they are available.
  2. Upgrade to RTEMS 6, carrying the fixes above forward, so that all VME-based systems can be upgraded in the future.

GOALS

  1. With RTEMS 4.10.2:
    1. Solve the issue with the second Ethernet port when running EPICS 7.0.3.1-1.0 IOCs with RTEMS 4.10.2. Related CATERs:
      1. 146947 - RTEMS 4.10.2 does not work consistently with the MVME6100 second Ethernet NIC.
      2. 148332 - IOC-BSY0-BP01, IOC-LTUH-BP02 second Ethernet port communication fails with updated RTEMS and EPICS.
    2. Solve the issue where IOCs running RTEMS 4.10.2 disconnect from the NFS server. Related CATERs (J. Lorelli):
      1. 164090 - Undulator VME IOC has NFS communication issues after power-cycling crate crat-undh-uc47.
      2. 97639 - Autosave running in sioc-li20-ky00 and ioc-li23-cv01 can no longer write save files, so values will be lost after a reboot.
      3. 162668 - IOC logs are filling up with timeout messages; VME IOC screenlogs report errors about autosave restore files.
      4. 153346 - IOC-LI24-IM01 fails to boot with BOOTP timeout.
    3. Fix the memory leak seen in only one VME IOC, built with EPICS R7.0.3.1-1.0/RTEMS 4.10.2. The error message in the screenlog.0 file is "Can't create mutex semaphore: too many". In an email thread with Michael Davidsaver and Till Straumann, Michael said it is a bug he introduced when adding qsrv to RTEMS. He suggested stopping qsrv, but this did not fix the problem; the only working solution was to build the beatnik binary without qsrv. CATER 162073 describes the issue.
    4. Solve CATER 146950 - IOC-LI30-CV01 fails when loading Beam Code 2 FTP databases:
      1. Can't release semaphore: invalid object id. sevr=fatal Wait event error in fidPHASETask. We die now.
  2. With RTEMS 6:
    1. Port all the RTEMS 4.10.2 fixes above to RTEMS 6 for the MVME6100, MVME3100, and Coldfire-uC5282.
    2. Run a complete EPICS 7 BPM IOC on an MVME6100 SBC that accesses an NFS v4 server and communicates with a Coldfire-uC5282 box.
    3. Run RTEMS 6 on a Coldfire-uC5282 box.
    4. Run an EPICS 7 IOC on an MVME3100 SBC.

LINKS

ISSUES

  • Solve the issue where IOCs in RTEMS 4.10.2 disconnect from the NFS server (J. Lorelli)
    • I think Jeremy Lorelli has made very good progress, and this might be almost done? I see some GitHub activity between Till and Jeremy on the RPCIO driver.
      • What is next? Testing on a Coldfire-uC5282 IOC? Jeremy Mock can help with this.
  • Solve the issue with the second Ethernet port when running EPICS 7.0.3.1-1.0 IOCs with RTEMS 4.10.2
    • After the NFS issue above, this is probably the most pressing RTEMS 4.10.2 issue. It affects anything that needs to connect to FCOM or other multicast groups.
  • Memory leak seen in only 1 VME IOC
    • This is not causing any problems beyond the fact that the app is currently built without EPICS v4 (qsrv) support.
  • CATER 146950 - IOC-LI30-CV01 fails when loading Beam Code 2 FTP databases