TLDR

The function CtrlSys.pv_checker() in the meclas package is a basic tool for helping narrow down potential error sources quickly.

Use CtrlSys.pv_checker("YOUR:PV:NAME:HERE") to help determine if the problem is with the PV/IOC, the host, or something else.

Disclaimer

This was written by a laser scientist, NOT a member of the ECS department. Seek out more authoritative sources if you require better answers!

Example: PV issue upon system start

You try to turn on the YFE, and it gives you an error message that looks like the one below. What do you do?



Side note: understanding Python error messages

For the Python-averse, here's a tip on how to make general sense of the "Python barf":

  • The top line will often tell you about what kind of error was generated.
    • In the example above, this is "TimeoutError" in red letters at the top of the screen
  • The brightly-colored blocks of code give context about where within the code the issue arose.
    • In the example above, the yellow highlighting and arrows show that the error arose in nested functions: line 1116 of "EpicsSignalBase._get_with_timeout" and line 1495 of "EpicsSignal.wait_for_connection". This isn't code that we wrote ourselves, so we don't care much about it; we most likely need to hunt down a problem with what we fed into this code rather than a bug within the code itself.
  • The bottom of the message repeats the error generated and briefly tells you why the error was generated
    • In the example above, it states that there was a timeout error when trying to connect to a PV (Process Variable) called MEC:LPL:MBC:01:RunningMode_RBV. This is the culprit!
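
The same three pieces of information (what kind of error, where it was raised, and why) can also be pulled out of an exception in code. Here is a minimal sketch using only the Python standard library; read_pv is a hypothetical stand-in that always times out, not the real function that raised the error in the example.

```python
import traceback


def read_pv(pv_name):
    # Hypothetical stand-in for a real PV read; it always times out here.
    raise TimeoutError(f"Failed to connect to {pv_name}")


def summarize_error(exc):
    """Return (error type, innermost function name, final message)."""
    frames = traceback.extract_tb(exc.__traceback__)
    innermost = frames[-1].name if frames else "<unknown>"
    return type(exc).__name__, innermost, str(exc)


try:
    read_pv("MEC:LPL:MBC:01:RunningMode_RBV")
except TimeoutError as exc:
    kind, where, why = summarize_error(exc)
    print(f"What kind of error: {kind}")   # the top line of the traceback
    print(f"Where it was raised: {where}")  # the last code block of the traceback
    print(f"Why: {why}")                    # the bottom line of the traceback
```

The last frame is the place where the error was actually raised; the frames before it only show how the program got there.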

What to do when a PV seems to fail to read out appropriately?

Secure a basic understanding of the system

If necessary, first review what a PV is and what role it plays in communicating with a piece of equipment – you won't be able to perform any basic troubleshooting without a basic understanding of PVs, IOCs, and the like!

Side note: basic understanding of PVs, IOCs, and hosts

Let's use the example pictured above to illustrate how our controls system communicates with devices like the MBC:

  1. The MBC mentioned above is a bias control box for the pulse shaper of the front end of the LPL system in MEC.
  2. The MBC is made up of several different sub-components with different functions – for example, a voltage operating mode controller, a voltage operating mode readback indicator, a voltage controller, a voltage readback indicator, a reset button, a fault indicator, and so on.
  3. Lots of different laser staff or instrument scientists might want to control or read out information from these sub-components at any given time, but these sometimes asynchronous/sometimes simultaneous/sometimes random requests can be overwhelming and cause problems.

  4. How can the MBC avoid issues with multiple people trying to talk to it all at the same time?

  5. To manage these potential communication issues, an "Input / Output Controller" (or IOC) gets created for the MBC.
  6. The IOC assigns a "Process Variable" (or PV) to each sub-component of the MBC.
  7. All communication in or out of an MBC sub-component goes through the PV.
  8. The communication of all the PVs is coordinated by the IOC in a way that doesn't overwhelm the MBC.
  9. The IOC runs on a nearby computer (i.e. the computer acts as "host" or "server" for the IOC; this computer may be responsible for running one or possibly multiple IOCs at once). 
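
For those who think better in code, the relationships in the list above can be sketched as a tiny data model. This is only an illustration of the host → IOC → PV hierarchy, not how EPICS is actually implemented; the class names and the second IOC and PV are hypothetical, while the other names are the ones that appear later on this page.

```python
from dataclasses import dataclass, field


@dataclass
class IOC:
    """An IOC and the PVs it serves."""
    name: str
    pvs: list = field(default_factory=list)


@dataclass
class Host:
    """A computer that runs one or more IOCs."""
    name: str
    iocs: list = field(default_factory=list)


def locate_pv(hosts, pv_name):
    """Return (host name, IOC name) for the IOC that serves pv_name, or None."""
    for host in hosts:
        for ioc in host.iocs:
            if pv_name in ioc.pvs:
                return host.name, ioc.name
    return None


mbc_ioc = IOC("ioc-mec-lpl-mbc-01", pvs=["MEC:LPL:MBC:01:RunningMode_RBV"])
other_ioc = IOC("ioc-example-02", pvs=["EXAMPLE:PV:01"])  # hypothetical second IOC
host = Host("ioc-mec-ipimb01", iocs=[mbc_ioc, other_ioc])

print(locate_pv([host], "MEC:LPL:MBC:01:RunningMode_RBV"))
```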


In order to help solidify this picture, let's consider an analogy with a rock band.

  1. Imagine a rock band that's super popular.
    1. (The MBC is like the rock band – they're both super popular!)
  2. The rock band is made up of several different group members with different functions – for example, a lead guitarist, a lead singer, a bassist, a drummer, and so on.
    1. (The MBC's sub-components are like the different group members – they all have their own functions!)
  3. Lots of different superfans or concert venues might want to communicate with the band members at any given time, but these sometimes asynchronous/sometimes simultaneous/sometimes random requests can be overwhelming and cause problems.
    1. (The laser staff or instrument scientists are like the superfans or concert venues – they all can be too much to handle without some coordination!)
  4. How can the rock band avoid issues with multiple people trying to talk to it all at the same time?

  5. To manage these potential communication issues, a talent agency gets hired for the rock band.
    1. (The IOC is like the talent agency – they both help make communication requests manageable!)
  6. The talent agency assigns an agent to each group member of the rock band.
    1. (The PV is like the agent – they are uniquely assigned to taking care of communication with a single entity!)
  7. All communication with a group member goes through their agent.
    1. (The PV is like the agent – all communication in or out is managed through the single official channel!)
  8. The communication of all the agents is coordinated by the talent agency in a way that doesn't overwhelm the rock band.
    1. (The IOC is like the talent agency – communications can be metered such that all the requests can be handled appropriately!)
  9. The talent agency rents out space in a nearby building (i.e. the building acts as landlord for the talent agency; this building may rent out space to one or possibly multiple talent agencies at once).
    1. (The nearby computer host/server is like the building where the talent agency rents space – there may be one or several entities hosted in the same location)

Hope that helps!!

Initial troubleshooting

From what we know about EPICS (see above), the most likely problems could come from any of the following:

  1. An issue with the piece of equipment itself (e.g. not turned on, fried and dead, disconnected from the network, etc.)
  2. An issue with the PV/IOC managing the communication with that device (e.g. the program crashed or had a runtime error, etc.)
  3. An issue with the computer host/server managing the IOC in question (e.g. the computer lost power, crashed, got replaced without migrating the IOCs, etc.)
  4. Another peripheral issue (e.g. the network responsible for facilitating all these communications is down; the computer making the request isn't connected properly; or the communication request to the IOC wasn't made correctly because of an EDM screen problem, a typo in EPICS code, or other user error)

The function CtrlSys.pv_checker() in the meclas package is a basic tool for helping narrow down potential error sources quickly. This function utilizes the ECS group's engineering_tools package (see here for more info) to quickly troubleshoot a few of the likely error sources listed above. (Note that this function executes automatically as part of the LPL YFE start-up procedure, so you may just be able to scroll up in your terminal and look for the issue there.)

In the example above, the error messaging has informed us that the PV called MEC:LPL:MBC:01:RunningMode_RBV is causing an error. By looking in the start-up report in the terminal or by using the CtrlSys.pv_checker function with our problematic PV (like this: CtrlSys.pv_checker("MEC:LPL:MBC:01:RunningMode_RBV")), we see the following information:

  • The left column specifies which PV has been checked
    • in the highlighted row above, this is MEC:LPL:MBC:01:RunningMode_RBV
  • The next column states which IOC runs that particular PV (using the grep_pv tool)
    • in the highlighted row above, this indicates that the PV MEC:LPL:MBC:01:RunningMode_RBV is run by the IOC ioc-mec-lpl-mbc-01
  • The next column states which host runs that particular IOC (using the grep_ioc tool)
    • in the highlighted row above, this indicates that the IOC ioc-mec-lpl-mbc-01 is run by the computer ioc-mec-ipimb01
  • The next column states where the host is located (using the netconfig search tool)
    • in the highlighted row above, this indicates that the computer ioc-mec-ipimb01 is located at B999 H6 R64B E39
  • The next column states whether your computer was successfully able to ping the host or not (using the ping tool)
    • in the highlighted row above, this indicates that the computer ioc-mec-ipimb01 was pinged successfully
  • The last column states the readback value of the PV or indicates the PV's failure to return a value 
    • in the highlighted row above, this indicates that the PV MEC:LPL:MBC:01:RunningMode_RBV failed to read out successfully
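
The chain of lookups behind the columns above can be sketched roughly as follows. This is NOT the actual implementation of CtrlSys.pv_checker: the function names are hypothetical, and the lookups are passed in as placeholder callables standing in for grep_pv, grep_ioc, the netconfig search, and a PV read. Only the ping helper calls a real tool, assuming a Linux/macOS-style ping command.

```python
import subprocess


def ping_host(host, timeout_s=2.0):
    """Return True if one ping to host succeeds (assumes Linux/macOS `ping -c 1`)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def check_pv(pv_name, find_ioc, find_host, find_location, read_pv, ping=ping_host):
    """Build one report row: PV -> IOC -> host -> location -> ping -> readback.

    All lookup callables are hypothetical placeholders supplied by the caller.
    """
    ioc = find_ioc(pv_name)
    host = find_host(ioc) if ioc else None
    location = find_location(host) if host else None
    ping_ok = ping(host) if host else False
    try:
        value = read_pv(pv_name)
    except TimeoutError:
        value = None  # the PV failed to return a value
    return {"pv": pv_name, "ioc": ioc, "host": host,
            "location": location, "ping_ok": ping_ok, "value": value}


def failing_read(pv_name):
    # Stub that mimics the timeout seen in the example.
    raise TimeoutError(f"Failed to connect to {pv_name}")


# Stubbed lookups that reproduce the highlighted row; no network is needed.
row = check_pv(
    "MEC:LPL:MBC:01:RunningMode_RBV",
    find_ioc=lambda pv: "ioc-mec-lpl-mbc-01",
    find_host=lambda ioc: "ioc-mec-ipimb01",
    find_location=lambda host: "B999 H6 R64B E39",
    read_pv=failing_read,
    ping=lambda host: True,
)
print(row)
```

Each step depends on the one before it, which is why a row that stops partway through (for example, no IOC found) already tells you where to look.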

Returning to our possible issue sources above, we update the likelihood of where we might suspect there could be an issue:

  1. Equipment side: (In this case, the piece of equipment was already checked, and it seemed to be turned on and connected OK)
    1. If this were an issue, we would definitely try to fix this first – an IOC can't communicate with an unpowered or broken piece of equipment!
  2. PV/IOC side: The PV failed to read out!
    1. At this point, the iocmanager tool may be used (usually by your ECS POC, but occasionally by yourself with permission...) to try to fix the IOC issue
      1. in a terminal, type imgr -h for help on using iocmanager from the command line (ultimately, this functionality could be added to CtrlSys.pv_checker!)
      2. alternatively, open the IOC Manager GUI by typing iocmanager in a terminal
      3. alternatively, open the IOC Manager GUI by double-clicking the appropriate icon on the mecopr desktop 
  3. Computer side: The computer was pinged successfully!
    1. If this were an issue, use the "Host location" information to check out the computer to see if it's powered up or not, etc.
  4. Peripheral issue: (The terminal was working, and the code partially executed, so there were no immediate indications that we should look deeper for peripheral issues)
    1. (These are often harder and more nebulous causes to catch and track down, but paying attention to whether or not other equipment also seems to be broken could give an indication as to whether you should prioritize spending time looking for other underlying issues)
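
The reasoning in the list above can be boiled down to a small decision function. This is just a sketch of the triage order described on this page (equipment first, then host, then PV/IOC); the function name and its return strings are made up for illustration.

```python
def likely_source(equipment_ok, host_ping_ok, pv_read_ok):
    """Suggest where to look first, following the triage order above."""
    if pv_read_ok:
        # The PV reads fine, so any remaining trouble lies elsewhere.
        return "peripheral"
    if not equipment_ok:
        # An IOC can't talk to unpowered or broken equipment; fix this first.
        return "equipment"
    if not host_ping_ok:
        # Use the host location to go check on the computer.
        return "host"
    # Equipment and host look fine, but the PV still fails to read out.
    return "PV/IOC"


# The example from this page: equipment OK, host pinged OK, PV failed to read.
print(likely_source(equipment_ok=True, host_ping_ok=True, pv_read_ok=False))
```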

Even if the steps above do not solve your problem, communicating this basic troubleshooting gives your ECS POC a huge head-start in helping figure out other possible reasons why a piece of equipment might not be working properly.

