Notes on debugging common issues

What to do if the IOC won't reboot:

Figure 1 is a GIF showing the full recovery process. The GIF is sped up 4X!

  1. Hit "Reboot POE Hub"
    1. If there are cameras from other SIOCs on the same POE hub, they will go down too. The following SIOCs share POE hubs:
      1. sioc-lr10-pm01 and sioc-in10-pm01 share poe-la10-pm01 (aka POE:LA10:1)
      2. sioc-li20-pm07 and sioc-li20-pm08 share poe-li20-ex01 (aka POE:LI20:3)
      3. sioc-li20-pm02 and sioc-li20-pm03 share poe-li20-pm01 (aka POE:LI20:1)
    2. In general, it is safe to reboot a POE hub.
  2. After POE comes back (purple to green), reboot the SIOC.
  3. After the IOC comes back, jiggle the cameras.
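
Before rebooting a hub, it can help to see which other SIOCs will go down with it. The following is a minimal sketch that encodes only the shared-hub list above. The function names `poe_hub_for_sioc` and `siocs_affected_by_reboot` are hypothetical helpers, not part of facethome, and any SIOC missing from the list is assumed (possibly wrongly) not to share a hub.

```shell
#!/usr/bin/env bash
# Lookup built from the shared-hub list above; extend it as hubs are added.
# Both function names are hypothetical helpers, not part of facethome.

# Print the POE hub for a given SIOC, or return 1 if it is not in the list.
poe_hub_for_sioc() {
  case "$1" in
    sioc-lr10-pm01|sioc-in10-pm01) echo "poe-la10-pm01" ;;
    sioc-li20-pm07|sioc-li20-pm08) echo "poe-li20-ex01" ;;
    sioc-li20-pm02|sioc-li20-pm03) echo "poe-li20-pm01" ;;
    *) return 1 ;;
  esac
}

# Print every SIOC that loses cameras when the given SIOC's hub is rebooted.
# ASSUMPTION: an SIOC that is not in the list does not share its hub.
siocs_affected_by_reboot() {
  case "$(poe_hub_for_sioc "$1")" in
    poe-la10-pm01) echo "sioc-lr10-pm01 sioc-in10-pm01" ;;
    poe-li20-ex01) echo "sioc-li20-pm07 sioc-li20-pm08" ;;
    poe-li20-pm01) echo "sioc-li20-pm02 sioc-li20-pm03" ;;
    *) echo "$1" ;;
  esac
}
```

For example, `siocs_affected_by_reboot sioc-li20-pm07` lists both sioc-li20-pm07 and sioc-li20-pm08, so both sets of cameras should be expected to drop.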


The following assumes that the camera(s) in question were working at some point, so they are known to be connected correctly.

  1. A single camera is not responsive (the CNCT circle is not green)
    1. Hit the reboot button (Figure 1) and wait 30 seconds.
  2. If the camera does not come back after 30 seconds, follow the decision tree below:
    1. If you have a terminal associated with the instance of facethome, you will be able to see the output/status when you push the reboot button. The following is example output from a successful reboot:

      Rebooting CAMR:LI20:300
      POWER-ETHERNET-MIB::pethPsePortAdminEnable.0.1 = INTEGER: false(2)
      POWER-ETHERNET-MIB::pethPsePortAdminEnable.0.1 = INTEGER: true(1)
      Jiggle CAMR:LI20:300

      The following is an example of an unsuccessful reboot:

      POWER-ETHERNET-MIB::pethPsePortAdminEnable.0.2 = INTEGER: false(2)
      POWER-ETHERNET-MIB::pethPsePortAdminEnable.0.2 = INTEGER: true(1)
      Camera CAMR:LI20:301 could not be revived.
      Jiggle CAMR:LI20:301
      CA.Client.Exception...............................................
          Warning: "Channel write request failed"
          Context: "op=1, channel=CAMR:LI20:301:TriggerMode, type=DBR_ENUM, count=1, ctx="CAMR:LI20:301:TriggerMode""
          Source File: ../oldChannelNotify.cpp line 158
          Current Time: Fri Aug 27 2021 00:49:44.749875033
      ..................................................................
      CA.Client.Exception...............................................
          Warning: "Channel write request failed"
          Context: "op=1, channel=CAMR:LI20:301:DataType, type=DBR_ENUM, count=1, ctx="CAMR:LI20:301:DataType""
          Source File: ../oldChannelNotify.cpp line 158
          Current Time: Fri Aug 27 2021 00:49:44.753613620
      ..................................................................
      

      This is another example of an unsuccessful reboot:

      Rebooting CAMR:LI20:300
      Timeout: No Response from poe-li20-pm02
      Timeout: No Response from poe-li20-pm02

      This indicates a problem with the PoE hub. Skip to step 2.4.

    2. If the reboot is unsuccessful (with or without confirmation from the terminal), the next step is to reboot the IOC. This can be done by navigating to the Network/LI20 page (Figure 2) and selecting the appropriate IOC. Hit the reboot IOC button and wait (Figure 3). Note that this will take down all cameras on the IOC.
    3. If the camera is still not revived after the IOC reboot, rebooting the server is the final check. Go to the Network/LI20 page (Figure 2) and select the appropriate IOC. At the bottom, there is a Power Off button (Figure 3). Hit it and wait at least 10 seconds before powering on again. Some IOC panels are missing the button. In that case, run:

      ipmitool -H cpu-li20-pm04-mgt -U ADMIN -P Ipmi1Admin power off
      <wait 10 seconds>
      ipmitool -H cpu-li20-pm04-mgt -U ADMIN -P Ipmi1Admin power on

      If that doesn't fix it, the camera is probably dead.

    4. If the reboot was unsuccessful due to a "No Response from poe" error, that means the PoE hub has stopped responding. This is a known bug and the manufacturer has been notified. The problem can be resolved by power cycling the PoE hub.
      1. For the laser room cameras, go to the Network/LI20 page (Figure 2), select ACSW-LI20-NW19, and turn the PoE hub off and back on. Note that this will take down all of the cameras in the laser room and laser transport.
      2. For the tunnel cameras, go to the Network/LI20 page (Figure 2) and select ACSW-LI20-NW13 (Figure 4). SIOC-LI20-PM02 and SIOC-LI20-PM03 are associated with POE-LI20-PM01 which is labeled "POE01 for PM02 PM03" (there is a character limit so apologies for being cryptic). Rebooting this device will take down all cameras on those IOCs. SIOC-LI20-PM04 is associated with POE-LI20-PM02 which is labeled "POE02 for PM04". Rebooting this device will take down all cameras on that IOC.
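
The decision tree above keys off a few strings in the reboot output. The following is a rough sketch of that classification, assuming the example outputs shown above are representative. `classify_reboot_output` and `next_step` are hypothetical helper names, and the matched strings are taken only from those examples, so real output may contain cases this does not recognize.

```shell
#!/usr/bin/env bash
# Classify the terminal output of a camera reboot, read from stdin.
# Prints one of: poe_timeout, camera_failed, ok, unknown.
# ASSUMPTION: the marker strings below come from the example outputs above.
classify_reboot_output() {
  local out
  out="$(cat)"
  case "$out" in
    *"Timeout: No Response from"*) echo "poe_timeout" ;;
    *"could not be revived"*|*"Channel write request failed"*) echo "camera_failed" ;;
    *"Jiggle "*) echo "ok" ;;
    *) echo "unknown" ;;
  esac
}

# Map a classification to the next step in the decision tree above.
next_step() {
  case "$1" in
    ok) echo "No further action needed." ;;
    camera_failed) echo "Reboot the IOC; if that fails, power cycle the server." ;;
    poe_timeout) echo "Power cycle the PoE hub." ;;
    *) echo "Output not recognized; check the terminal by hand." ;;
  esac
}
```

The PoE timeout is checked first because that output never reaches the "Jiggle" line, and a failed camera is checked before "ok" because the unsuccessful example also prints a "Jiggle" line.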

Figure 1 - GIF of recovery from when SIOC is down and can't be restarted. The GIF is sped up 4X!

Figure 2 - facethome Network/LI20 page

Figure 3 - Example IOC screen for Advantech Server (LinuxRT)

Figure 4 - ACSW-LI20-NW13 panel with PoE hubs

SIOC won't come back

One issue that keeps an SIOC from coming back online is a camera in a bad state that interferes with IOC startup. The best thing to do when the IOC does not come back online is step 2.4 above.




Things you can check

  1. Is the server pinging?

    1. Good

      [fphysics@facet-srv01 ~ ]$ ping cpu-li20-pm02
      PING cpu-li20-pm02.slac.stanford.edu (172.27.72.99) 56(84) bytes of data.
      64 bytes from cpu-li20-pm02.slac.stanford.edu (172.27.72.99): icmp_seq=1 ttl=64 time=0.113 ms
    2. Bad (go to step 2.3 above)

      [fphysics@facet-srv01 ~ ]$ ping cpu-li20-pm03
      PING cpu-li20-pm03.slac.stanford.edu (172.27.72.100) 56(84) bytes of data.
      From facet-srv01.slac.stanford.edu (172.27.72.22) icmp_seq=1 Destination Host Unreachable
  2. Is the IOC running?
    1. Good (internal errors in IOC are normal):

      [fphysics@facet-srv01 ~ ]$ iocConsole sioc-li20-pm02
      <start screen>
      2022/08/05 14:31:04.778 prosilica:readStats: error, status=7
      2022/08/05 14:31:05.485 prosilica:frameCallback: ERROR, frame has error code 16
      2022/08/05 14:31:07.015 prosilica:frameCallback: ERROR, frame has error code 16
      2022/08/05 14:31:09.106 prosilica:frameCallback: ERROR, frame has error code 16
      2022/08/05 14:31:09.182 prosilica:frameCallback: ERROR, frame has error code 16
      2022/08/05 14:31:09.688 prosilica:frameCallback: ERROR, frame has error code 16
      2022/08/05 14:31:09.688 prosilica:frameCallback: ERROR, frame has error code 16
      2022/08/05 14:31:09.778 prosilica:readStats: error, status=7
      2022/08/05 14:31:10.285 prosilica:frameCallback: ERROR, frame has error code 16
      2022/08/05 14:31:14.778 prosilica:readStats: error, status=7
      <ctrl+a,d to exit screen>
    2. Bad



Log into camera server

[fphysics@facet-srv01 ~]$ ssh flaci@cpu-li20-pm02
Entering envSet.sh
Entering commonSetup.sh
Pacific timezone is PST8PDT
EPICS_IOCS=/usr/local/facet/epics/iocCommon
EPICS_CPUS=/usr/local/facet/epics/cpuCommon
CPU_ARCH=linuxRT-x86_64
Adding to PATH EPICS_CPU_BIN=/usr/local/facet/epics/iocTop/ProfileMonitorAD/R6.1.1/bin/linuxRT-x86_64
caRepeater=/usr/local/facet/epics/iocTop/ProfileMonitorAD/R6.1.1/bin/linuxRT-x86_64/caRepeater
EPICS_CPU_LIB=/usr/local/facet/epics/iocTop/ProfileMonitorAD/R6.1.1/lib/linuxRT-x86_64
Exiting commonSetup.sh
Exiting envSet.sh
[ flaci@cpu-li20-pm02]$ 
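
The ping and IOC console checks above can be sketched as two small filters. This is only an illustration: `ping_output_reachable` and `unexpected_ioc_lines` are hypothetical helper names, and the second one assumes that the two prosilica messages shown in the "Good" console example are the only ones that count as normal.

```shell
#!/usr/bin/env bash
# Read ping output on stdin; succeed only if at least one echo reply
# ("... bytes from ...") is present, as in the "Good" example above.
ping_output_reachable() {
  grep -q 'bytes from'
}

# Read IOC console output on stdin and print only the lines that are NOT
# one of the two prosilica messages shown as normal above.
# ASSUMPTION: those two messages are the only benign ones.
unexpected_ioc_lines() {
  grep -v -E 'prosilica:(readStats: error, status=7|frameCallback: ERROR, frame has error code 16)' || true
}
```

A possible use is `ping -c 1 cpu-li20-pm02 | ping_output_reachable && echo reachable`. Any lines printed by `unexpected_ioc_lines` are worth reading by hand, since they fall outside the messages the page describes as normal.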

