...

  • single batch host is killing several jobs ('rogue' LSF host)
  • group of batch hosts crashed or went offline
  • AFS crashed on one of the host machines
  • scratch disk is full on one or more of the host machines - see what to do here (This is a log of a conversation, probably posted from the OpsLog; it should be made into a child page)
  • A staging disk is full (these are accessed from /afs/slac.stanford.edu/g/glast/ground/PipelineStaging[1-7]; need instructions on how to check xroot files)
  • /nfs/farm/g/glast/u41 or /nfs/farm/g/glast/u28 is full

...

  • G4 propagator error. If recon dies complaining about g4propagator, we can't fix it. If this happens, email Heather (heather625{at}gmail) and Anders (borgland{at}slac) (and possibly Warren (focke{at}slac) and Maria Elena (monzani{at}slac)). Include a link to the log file, which will tell them where the core file is.
  • Overlapping chunks. If findChunks complains about overlapping chunks, tell Steve Tether; if trending merges complain, tell datamonlist@glast2.Stanford.EDU. Rollback won't help.
  • Deliveries arriving too close together will mangle the display. The underlying processes are fine, but email opsproblist{at}glast2.stanford.edu, Steve Tether (need e-mail) or Jim (jchiang{at}slac) to fix the display. (Need instructions for this added to the HTF FC Incoming page)
  • Digitization crashes. Sometimes we get a bad event and have to skip it. Instructions for that are in the rollback section below.

...

As of 2011/04/19, the prod command-line client fails if you try to roll back substreams, but the dev client running in prod mode can do it:
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline -m PROD ... (was this ever fixed? why or why not?)

When to rollback

Wait for the "setStatus" stream to have run.

...

Notice that the GRB search is executed per delivery and depends on all the FT1 and FT2 files in each run being registered (therefore, it depends on almost "everything" that belongs to that delivery). For this reason, you might need to wait for the entire delivery to be completed before being able to roll back any failed recon jobs. And because of the run lock (see below), some of the (parts of) runs in the delivery might have to wait for other deliveries to finish, which might have their own failures... It's possible, but rare, to get deadlocks, where nothing can proceed until a lock is removed by hand. Best to ask for help then. (Who do we ask? And if us, what do we do to fix it?)

In general, experience will tell you when you can roll back what. When in doubt, you can try anyway (if it's too soon, nothing will happen and you will get an error)!

...

Now you should have a list of (what??) complaining about not being able to find input files. If any of the complaining merges are downstream of a failed process, you can ignore them. This is usually the case. But if they are not downstream of a failure, it means there was a disk read error while setting up the merge. Look at the logfile for that merge; it will say which files are missing. If they are really there, the merge can be rolled back and will probably work the second time. If they are not there, there's a bigger problem, and this should be considered a "permanent failure".

...

This process is now automatically retried like most of the others. When it fails, it attempts to remove the run lock (see above) and the throttle lock (next section) on its own (and usually succeeds). But if it fails harder than usual, you might still have to remove them by hand. Also, you'll probably have to:
mv /nfs/farm/g/glast/u41/L1/${runId}/${runId}${deliveryid}_chunkList.txt /nfs/farm/g/glast/u41/L1/${runId}/${runId}${deliveryid}_chunkList.txt.tmp
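
If you'd rather not type the path twice, here is a minimal Python sketch of the same rename. It only mirrors the mv command above; the function names and the dry_run flag are made up for illustration, and it refuses to overwrite an existing .tmp file, which plain mv would not. The run and delivery IDs are joined exactly as in the command above, with no extra separator.

```python
import os

# Root directory taken from the mv command above.
L1_ROOT = "/nfs/farm/g/glast/u41/L1"


def chunklist_paths(run_id, delivery_id, root=L1_ROOT):
    """Return (source, destination) for the chunk-list rename."""
    src = os.path.join(root, run_id, f"{run_id}{delivery_id}_chunkList.txt")
    return src, src + ".tmp"


def rename_chunklist(run_id, delivery_id, root=L1_ROOT, dry_run=True):
    """Rename the chunk list to *.tmp; with dry_run=True only report the paths."""
    src, dst = chunklist_paths(run_id, delivery_id, root)
    if not os.path.exists(src):
        raise FileNotFoundError(src)
    if os.path.exists(dst):
        # Hypothetical safety check: do not clobber an earlier rename.
        raise FileExistsError(dst)
    if not dry_run:
        os.rename(src, dst)
    return src, dst
```

Run it with dry_run=True first to see which file would be moved.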

...

The instructions on how to retry FASTCopy ingestion or ProcessSCI can be found here

AFS buffers & idle threads

Skip this whole section, we buffer on xroot now.

When the AFS servers where we keep temporary files hiccup, it's usually because they ran low on idle threads. It is possible to monitor this value and intervene to stave off disaster. It can be viewed with Ganglia or Nagios. Nagios only works inside SLAC's firewall, but is somewhat more reliable. Ganglia works from anywhere, and shows higher time resolution, but sometimes it stops updating and just shows a flat line with some old value.

Ganglia

Nagios

  • Using ssh as a proxy: You can use ssh as a SOCKS proxy to relay requests through a SLAC computer.
  • Pick a port. Should be a 4-digit number. Referred to as <port> below.
  • ssh -D <port> noric.slac.stanford.edu
  • Then configure your browser to use a SOCKS proxy at localhost:<port>

Now you should be able to access SLAC-only pages. There are 3 places to get the idle-thread counts:

  • new nagios (02): click "Hostgroup Overview" on the left side, then the host you want in the "AFS fileservers" table. This shows idle threads for that server on the "AFSfs" row.
  • old nagios (01): click "AFS FS Graphs" in the upper left. This shows most or all SLAC AFS servers. Not as reliable as nagios02, and it's going away eventually.
  • Warren's "old" 1-hour page

What to do with the info

When the servers are idle, idle threads should be 122. The SLAC IT people consider it a warning if it goes below 110 and an error at 100. For us, it seems like things usually work as long as it stays above 60 or so (although afs112 has been known to go as low as 40, since PipelineStaging6 is for the HalfPipe). Low idle-thread counts are likely if there are more than ~300 chunk jobs running, usually after recon finishes and the chunk-level jobs downstream of recon start up. I've written a script that will suspend jobs that are using a specified server, wait a bit, and then resume them with a little delay between each. Suspending jobs will make the server recover sooner, but some of the jobs are likely to get timeout errors and fail due to the delay of being suspended, so it may be best to wait until idle threads hits 0 before suspending.
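
The thresholds above can be summarised in a few lines of Python. This is only a sketch for reading the numbers off a monitoring page. The labels are made up, "60 or so" is treated as exactly 60, and "an error at 100" is assumed to mean 100 or lower; none of this is taken from the actual Nagios configuration.

```python
# Numbers quoted in the text above; the exact boundary handling is an assumption.
IDLE_WHEN_QUIET = 122   # idle threads on an unloaded server
IT_WARNING_BELOW = 110  # SLAC IT warning level
IT_ERROR_AT = 100       # SLAC IT error level
L1_TROUBLE_BELOW = 60   # "60 or so": where L1 jobs tend to start failing


def classify_idle_threads(idle):
    """Map an idle-thread count to a rough, hypothetical severity label."""
    if idle <= 0:
        return "exhausted"    # the point where suspending jobs is suggested
    if idle < L1_TROUBLE_BELOW:
        return "l1-at-risk"
    if idle <= IT_ERROR_AT:
        return "it-error"     # IT alarms, but L1 usually still works
    if idle < IT_WARNING_BELOW:
        return "it-warning"
    return "ok"
```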

  • From a noric, as glastraw:
    • setenv L1ProcROOT /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/1.70 (any L1 version >= 1.70 should work)
    • $L1ProcROOT/tools/grepJobs.py PipelineStaging2 1200 60
      • first arg is the volume to be suspended
      • second arg is seconds to wait between stopping and resuming
      • third arg is seconds to wait between resuming jobs
    • If you want to do afs106, you have to say "PipelineStaging/"; if you say "PipelineStaging" it will suspend everything.
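
The trailing-slash rule makes sense if the volume argument is matched as a plain substring of the paths the jobs are using. That is an assumption suggested by the behaviour described above, not something checked against grepJobs.py. A small sketch with made-up job IDs and run directories shows the effect:

```python
# Hypothetical job -> working-path table; job IDs and run directories are invented.
JOB_PATHS = {
    101: "/afs/slac.stanford.edu/g/glast/ground/PipelineStaging/run1",
    102: "/afs/slac.stanford.edu/g/glast/ground/PipelineStaging2/run2",
    103: "/afs/slac.stanford.edu/g/glast/ground/PipelineStaging6/run3",
}


def jobs_matching(pattern, job_paths):
    """Return the job IDs whose path contains `pattern` as a plain substring."""
    return [job for job, path in job_paths.items() if pattern in path]


# "PipelineStaging" is a prefix of every numbered volume, so it matches all jobs.
everything = jobs_matching("PipelineStaging", JOB_PATHS)
# The trailing slash restricts the match to the un-numbered volume.
unnumbered_only = jobs_matching("PipelineStaging/", JOB_PATHS)
```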

The association between disk and server can be found in several ways. Here's one:

  • $ fs whereis PipelineStaging*

You can also find the amount of disk usage with the following command:

  • $ fs lq PipelineStaging*
    Volume Name                 Quota       Used  %Used  Partition
    G.glast.ground.PipStg   350000000  128422194    37%        26%
    G.glast.ground.PipStg   350000000  128422194    37%        26%
    G.glast.ground.PipStg2  350000000  118129732    34%        24%
    G.glast.ground.PipStg3  350000000  124826187    36%        26%
    G.glast.ground.PipStg4  350000000  124744630    36%        26%
    G.glast.ground.PipStg5  350000000  132978520    38%        27%
    G.glast.ground.PipStg6  350000000   59700367    17%        12%
    G.glast.ground.PipStg7  350000000  107201000    31%        22%
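
If you want to check this from a script rather than by eye, the output above is easy to parse. The sketch below assumes the five-column layout shown here (sizes in kilobytes, as fs lq reports them) and skips anything that doesn't fit, such as the header or a volume with no quota. The function names and the 37% threshold in the usage lines are made up for illustration.

```python
# A few rows copied from the listing above.
SAMPLE = """\
Volume Name Quota Used %Used Partition
G.glast.ground.PipStg 350000000 128422194 37% 26%
G.glast.ground.PipStg2 350000000 118129732 34% 24%
G.glast.ground.PipStg5 350000000 132978520 38% 27%
G.glast.ground.PipStg6 350000000 59700367 17% 12%
"""


def parse_fs_lq(text):
    """Parse `fs lq` output into a list of dicts, one per volume row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have a numeric quota in the second column; the header does not.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        name, quota, used, pct_used, pct_part = fields[:5]
        rows.append({
            "volume": name,
            "quota_kb": int(quota),
            "used_kb": int(used),
            # Strip the percent sign (and any trailing '<' warning marks).
            "pct_used": int(pct_used.rstrip("%<")),
            "pct_partition": int(pct_part.rstrip("%<")),
        })
    return rows


def volumes_over(rows, threshold_pct):
    """Return the sorted, de-duplicated volume names at or above threshold_pct."""
    return sorted({r["volume"] for r in rows if r["pct_used"] >= threshold_pct})


rows = parse_fs_lq(SAMPLE)
busy = volumes_over(rows, 37)
```

De-duplicating by volume name matters because the same volume can show up more than once in the listing, as PipStg does above.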

Useful resources

  • List of failed runs. Report failed runs and their status here. When a run is fixed (meaning: no red cross appears anywhere anymore), you can delete it from the list. (The runs listed as 'ignore error (it is fixed)' are the ones that were fixed with a fake delivery, so they still appear as failed in the Data Processing page.)
  • AFS hiccups. We keep track here of the AFS hiccups (or at least we try to). You can add date/time, LSF job ID and the 'missing' file.
  • LSF hiccups. We keep track here of the LSF hiccups (or at least we try to). You can add date/time, LSF batch host(s) and the reason, if known.
  • List of runs incorrectly marked as InProgress. These are runs where acq status is Complete, but L1 status is InProgress.

You should definitely join the following mailing lists:

  • Data Handling (datalist)
  • Level 1 processing (l1proclist)
  • Data processing pipeline (pipelinelist)

And probably these:

...

* The monitoring shifter doesn't need to do anything about it. The L1 shifter should figure out why it happened. In this case, it was because I rolled back a merge job in a run that was already done. (Warren)

 

cancelProcessInstance

Here's the syntax to cancel a process that is not in a final state, and all its dependencies (might be useful when you don't want to wait for it to finish before rolling something back. However, it's usually faster to wait, as the cancel can take a long time):
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline --mode PROD cancelProcessInstance 8073657

...

Occasionally, if deliveries arrive too close together, there may be a problem with the assignment of IDs. In this case, there may be several lines that appear for each run in a particular delivery, as shown here. It is a display issue and not a problem with the processing itself, and Jim Panetta (panetta{at}slac) or Steve Tether (tether{at}slac) should be able to resolve the problem with the display (Who should be doing this going forward? Also, documentation needed on how this is accomplished). Also email OPSPROB (opsproblist{at}glast2{dot}stanford{dot}edu).

...

If you suspect a process is stuck, visit the process page from the run page (i.e., click on the green task bar in the Data Processing Page, then click on the particular job that is running in the list at the bottom of the page). Each entry in the table on the page will contain a link to "Messages", "Log", and "Files". If you click on "Log", it should show you the log, but more importantly, it'll show the location of the log file right above the box containing the log information (note: if the file is too big, it will not open, but the link to "Files" will show where the log is located, so you can get the information that way as well). If you visit the location of the file on the noric public machines, you can determine when the file was last updated (ls -l, or tail -f to see if it's still streaming). If it is not updating, find out the batch job number (JobId from the stream table), log into noric as glastraw, and issue a bkill <JobId> (bkill -r might be necessary if the jobs don't go away after the bkill). The process should automatically be restarted after this command is issued.
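
The "is the log still updating?" check can also be scripted. Here's a minimal sketch that does the same thing as eyeballing ls -l: it compares the log file's modification time to the current time. The function names and the one-hour default are assumptions for illustration; pick a cutoff that matches how long the process is normally quiet.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_stuck(path, max_idle_seconds=3600, now=None):
    """True if the log has not been written to for more than max_idle_seconds.

    The 3600-second default is an arbitrary placeholder, not a documented limit.
    """
    return seconds_since_update(path, now) > max_idle_seconds
```

A stale log is only a hint; check the process page before issuing the bkill.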

...

When weird things are happening with the delivery that you can't figure out or it has too many chunks to process, it may be necessary to repipe the stream (probably a good idea to check with someone first - Who?). Log in as glastraw and issue the following command:

...

The repiped run(s) will create run and throttle locks just like any other, so you don't need to worry about what else is going on in L1 when you inject them. Except there shouldn't be anything else for this run running or waiting, or you'll have trouble renaming the chunklists as described above. But they won't show up on the data processing page, and the regular runs that do show up there will appear to be slow starting for no obvious reason, so you might want to put a note in opslog (is the opslog still being used?  where is it?) to avoid confusing the monitoring shifter.

...

From Steve Tether: I have a prototype script that corrects a test problem I created in the NIGHTLY database. If you see an ingestion failure for one or more Level zero files of a delivery, first take a look at the FASTCopy logs available through the Data Processing web app. (Click on the progress bar for FASTCopy, then on the links for those files under Input Products that show a status of INGESTFAIL.) If it looks like an NFS glitch, i.e., a message says such-and-such a directory or file does not exist, run the following command in a terminal window that is logged into the SLAC AFS cell and has the ISOC PROD environment set up (where are the instructions for setting up that environment?):

python ~tether/retryIngestion.py delivery-id-number

...

The script will check whether the files for that delivery are still on disk. If any are not, or if the script reports a failure, then you'll have to refer the problem to me or Jim P. (will be Don? Need procedures for that person). Normally the files received for a given day (UTC) are archived and removed from disk at about noon (Pacific time) the next day. If all goes well, the last output from the script will be a listing of the L0 files whose statuses were reset to NEW. After that, the L0 file states should go through SUBMITTED and stop at INGESTDONE. If ingestion fails again, then Jim or I will have to handle it.

...