...


Recovering from errors:  restarting the DAQ

Use the "restartdaq <-w>" command. If you only want to stop the DAQ, call "stopdaq".

...


Recovering from errors:  restarting AMI (the client)

Use the "startami" command. If you are not on the main DAQ machine, this starts a second client. On the main DAQ machine, it restarts the DAQ AMI client after asking for confirmation, so you will need to run it from a terminal. Should the server side be unhappy, you will need to restart the DAQ.

Troubleshooting the DAQ

AMI does not work: the DAQ is fine after a restart, but I see no updates/detectors in AMI

1) Make sure that all dss nodes are selected in the daq partition. If you needed to take a node out due to problems, you need to edit the <hutch>.cnf file. You can use "serverStat <ip>" to get the name for the node that causes a problem and then edit the list of dss_nodes to exclude this node. If the problematic node is the last one, you might see that you have to reselect the Bld upon a restart. This means that the DAQ will make you allocate twice (the first time it'll fail with a complaint about a Bld change).

...

Note that this is a temporary solution. Notify the ECS point of contact who will escalate or resolve the calib mount issue.

The DAQ shows "NO CONNECT" in procStat: 

Use "serverStat <DAQ device alias> <command>". The "cycle" command power cycles the node, with the script waiting some time between off and on. It will tie up the terminal, so if you have to deal with several nodes, you can also call "serverStat <DAQ device alias> off" and "serverStat <DAQ device alias> on" explicitly. Remember to wait a few seconds between off and on. After the script returns from turning the node(s) on, keep running "serverStat <ip/node name>" until both pings work. Once you can ssh into the node(s), you can restart the DAQ.
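For several nodes, the explicit off/on sequence described above could be scripted roughly as follows. This is only a sketch: the function name `power_cycle_nodes`, the default pause of 5 seconds, and the injectable `run`/`sleep` arguments are assumptions made for illustration, and only the "serverStat <alias> off" and "serverStat <alias> on" calls come from the text. The sketch does not interpret the serverStat output, so you still check by hand that both pings work afterwards.

```python
import subprocess
import time


def power_cycle_nodes(aliases, wait_s=5.0, run=subprocess.run, sleep=time.sleep):
    """Turn all listed nodes off, pause, then turn them all back on.

    wait_s is an assumed value; the page only says "a few seconds".
    run and sleep can be replaced with stand-ins for a dry run.
    """
    issued = []
    for alias in aliases:
        cmd = ["serverStat", alias, "off"]
        run(cmd, check=True)
        issued.append(cmd)
    sleep(wait_s)  # pause between off and on
    for alias in aliases:
        cmd = ["serverStat", alias, "on"]
        run(cmd, check=True)
        issued.append(cmd)
    return issued
```

Calling `power_cycle_nodes(["<node 1>", "<node 2>"])` on a machine where serverStat is available would issue the same commands you would otherwise type one by one. Given the technical note on alias handling, machine names or IPs may be the safer argument.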

The DAQ shows an error message, requesting a restart,  indicating a given IP as culprit

Use "serverStat <ip>" to check if both interfaces of the node in question are up. This script will also tell you which machine has the issue.

...

Depending on the data rate, you can run with 2 or 3 nodes (cspad + other detectors: 3 nodes, two EPIX: 2 nodes). As we run all the data into a single ami session and the best mapping allows at most one ami node per dss node, you have less ami power if you have fewer dss nodes.

One of my DAQ devices has a problem (damage,....):

Use "serverStat <DAQ device alias>" to check on the health of the node. Most likely it is prudent to power cycle this node. If this does not help, you should power cycle the detector/camera/device itself as well.

...

Should your detector not configure, it is either not turned on, needs power cycling or some cable is not patched correctly. Please let CDS folks know which detectors you would like to use so we can test them beforehand.

My ipimb has an issue:

Troubleshooting ipimbs is described on this page: Troubleshooting for Controls IPIMB and Wave8s

My OPAL damages on every shot (but configures):

The possible problems are that there is an issue with the EVR, the RCX boxes (data coming back from the camera) or the EDT card.

...

The address of the sources is here.

My BLD has an issue:

For LCLS-I, we use the last server in the list of used dss nodes to run the BLD process. To check whether the data comes in, you can ssh to that node and call:

...

If data are not coming through, first check that the BLD source is properly selected in the DAQ partition.

My detector runs without errors, but I can't see any beam:

Assuming that there actually are photons on the detector that could be seen, the next thing to check is the timing settings. Find the values you should use here:

...

Go to the EVR configuration (this does not apply to pgp-triggered devices such as the EPIX). If you are using aliases in your DAQ, it should be straightforward to find the right trigger channel. Look at all available EVR cards. For the run trigger setting for e.g. CsPad/Cs140k detectors, it might be best to contact your POC.

My data does not seem to be moving well:

The Files-in-WAIT-state table in the PSDM-Mover-Tape Grafana dashboard shows all files that are waiting to be transferred by the movers, including files that are currently being transferred.

...

[Take out the nodes with the issue, assuming your problem is limited to a single node. If it is more widespread, it might warrant a call.]

I have included a recorder source and the DAQ won't start:

Most likely, the recorder process cannot connect to the node that you are trying to run the recording process on. This node depends on the source and is listed in your .iocrc file. You can use serverStat to check on the server health and power cycle if appropriate.

The IOC recording seems to work fine, but my output file is of size 0:

This means your recording process is up and you can connect to the PVs. If a PV does not have well-formed timestamps when you request it (necessary for cameras!), the data the process wants to record cannot be found, as the timestamp is used to determine what should be recorded.

The XTCAV recording does not work:

First, please make sure the image in the camViewer updates. This means the IOC is up and running. The recorder process uses a special interface on a special machine, so in addition to this, that machine needs to be up (see "I have included a recorder source and the DAQ won't start"). More details can be found here:

XTCAV Controls Recorder Troubleshooting

Acqiris

Some things to check if the Acqiris is causing problems:

  • Reduce the total number of samples (especially at higher rates)
  • Make sure the acquisition window (sampling interval * number of samples) is not too close to the trigger period (the inverse of the trigger rate).
  • Better cooling
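The acquisition-window check in the list above comes down to a short calculation. The sketch below assumes that "sampling" means the sampling interval in seconds. The 0.9 safety fraction and the function names are hypothetical choices for illustration, not documented Acqiris limits.

```python
def acquisition_window_s(sample_interval_s, n_samples):
    """Length of the acquisition window in seconds."""
    return sample_interval_s * n_samples


def window_fits(sample_interval_s, n_samples, trigger_rate_hz, max_fraction=0.9):
    """Return True if the window stays below max_fraction of the trigger period.

    max_fraction is an assumed safety margin, not a documented limit.
    """
    trigger_period_s = 1.0 / trigger_rate_hz
    window_s = acquisition_window_s(sample_interval_s, n_samples)
    return window_s <= max_fraction * trigger_period_s
```

For example, 10,000 samples at a 1 ns interval give a 10 µs window, far below the roughly 8.3 ms period at 120 Hz. The same number of samples at a 1 µs interval gives a 10 ms window, which no longer fits.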

Rayonix:

  • Rayonix MX[340/170]-HS (expert information)
  • Taking a dark
    • In the DAQ, "shutdown" first,  then go to 'edit' -> 'rayonix' .
    • Make sure binning is what you are going to use to take data
    • Select on the menu 'Update background on config' and click 'apply'. Run the DAQ for few seconds.
    • Go back to the 'expert mode', select 'keep current background' on the menu and click 'apply'.
    • If this doesn't work, go to the terminal where the Rayonix server is running (ask Silke), stop it (CTRL-X) and restart it after about 10 sec (CTRL-R). Restart the DAQ.
  • 'rayonix' will start a session as hsuser on the Rayonix server. capXure can be run here and the DAQ process also runs in a procServ on this machine
  • Troubleshooting on the Rayonix machine (skip the first 3 lines if the DAQ process is already running):
    • cd ~/slac
      ./startDaqInterface
      Now you can connect to the _active_ session using telnet localhost 30099.
      This window behaves kind of like an IOC: ctrl+x kills the process (necessary in several cases....), ctrl+R restarts things. It is not running auto restart, as the detector interface sometimes needs more time to become available again than auto restart would give it. After the initial start, I have found that I typically needed to restart the process (ctrl+x; wait; ctrl+R).
      What you should see (after a bit) are either lines showing the detector status:

      13_11:46:45:2015-Nov-13 11:46:45,[STATUS  ],[STATUS_DATA],-79.95,-79.35,-79.95,-79.75,-156.15,52.25,-860.95,-870.25,-774.25,33.05,49.85,49.85,276,273,275,36,0,0,1200,00:00:04

      or, if you are already running it in the DAQ, one line for each image you record:

      12_15:30:30:2015-Nov-12 15:30:30,[VERBOSE ],RxDetector::FrameAcquisitionFrameProcessor(83, 0x7f55d403bc20,0x7f55d42d2230) - starting

      12_15:30:30:FrameReady: Frame #84  size=7372800  timestamp=59989 ms

      In the DAQ config, go to the Rayonix settings and choose the binning you need (10 Hz at 2x2, 120 Hz at 10x10).
  • Troubleshooting (includes controller reboot)
    Sometimes the detector reports wrong status information and appears to believe that many state bits are flipping. You can see this when some of the temperatures read unreasonable numbers, far from the "usual" ones (you should see 4 times ~ -80 and 2 times ~ -110):

    13_11:38:45:2015-Nov-13 11:38:45,[STATUS  ],[STATUS_DATA],-79.95,-79.35,-79.95,-79.65,-109.25,-109.95,34.05,42.65,45.45,34.75,49.85,49.85,276,273,275,37,0,0,1200,20:50:31

    13_11:39:45:2015-Nov-13 11:39:45,[STATUS  ],[STATUS_DATA],-79.95,-79.35,-79.95,-79.65,-109.25,-109.95,34.05,42.35,45.45,34.85,49.85,49.85,276,273,275,36,0,0,1200,20:51:31

    13_11:40:45:2015-Nov-13 11:40:45,[STATUS  ],[STATUS_DATA],-79.95,-79.35,-79.95,-79.75,-156.15,52.05,-860.85,-870.25,-774.15,34.75,49.85,49.85,276,273,275,37,0,0,1200,00:00:04

    13_11:41:45:2015-Nov-13 11:41:45,[STATUS  ],[STATUS_DATA],-79.95,-79.35,-79.85,-79.65,-156.15,52.05,-860.85,-870.25,-774.15,33.95,49.85,49.85,276,273,275,36,0,0,1200,00:00:04

    To recover from that, you need to stop the DAQ process (ctrl+x in the telnet localhost 30099 terminal window) and open the Rayonix software capXure.

    Reboot the detector controller (there is a button for that). Re-enable cooling. You might have to quit this process and start it again to see the now reasonable temperatures. Once this looks fine, you can quit this software and restart the DAQ process (ctrl+R in the telnet localhost 30099 terminal window).
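To spot the bad state without reading every status line by eye, a check along these lines could be used. It is a sketch only. It assumes that the first six values after [STATUS_DATA] are the temperatures mentioned above, which is inferred from the sample lines and not from Rayonix documentation. The 5 degree tolerance and the function names are hypothetical.

```python
def status_temperatures(line):
    """Return the first six numeric fields after [STATUS_DATA], or None."""
    marker = "[STATUS_DATA],"
    if marker not in line:
        return None
    fields = line.split(marker, 1)[1].split(",")
    if len(fields) < 6:
        return None
    return [float(value) for value in fields[:6]]


def temperatures_look_sane(line, tol=5.0):
    """Check for 4 readings near -80 followed by 2 readings near -110.

    tol is an assumed tolerance; adjust it to what you normally see.
    """
    temps = status_temperatures(line)
    if temps is None:
        return False
    near_80 = all(abs(t + 80.0) <= tol for t in temps[:4])
    near_110 = all(abs(t + 110.0) <= tol for t in temps[4:6])
    return near_80 and near_110
```

Applied to the four sample lines above, the 11:38:45 and 11:39:45 lines pass, and the 11:40:45 and 11:41:45 lines fail (-156.15 and 52.05 in place of the two ~ -110 readings).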

...

If this doesn't work, go to the terminal where the Rayonix server is running (ask Silke), stop it (CTRL-X) and restart it after about 10 sec (CTRL-R). Restart the DAQ.

Technical note:

"serverStat" at this moment works in all hutches when using the machine name or IP, but the "DAQ alias" interpretation feature might not quite work. We hope to improve on this soon.

...