Task of the Data Processing on-call expert

...

How to recognize transient failures: they usually affect only a single job (disk or DB failures), or a small number of jobs all on the same machine (LSF failure). If a process complains about a missing file but the file exists, or gets a read error after opening a file, it's probably a disk server hiccup. If the log file (linked from the pipeline web client) ends suddenly, without the usual LSF postlog, the batch host probably crashed. There will probably be several jobs failing the same way at the same time.

What to do in case of transient failures: roll back the affected process(es) when possible (see below for the rollback procedure). Look for the dontCleanUp file and check the Log Watcher (see below) (check for what?). If recon segfaults for no apparent reason, email Heather and Anders (general emergency contacts are on the How to Fix page) before the rollback, including a link to the log file, which will tell them where the core file is. For pipeline deadlocks, email the Pipeline Expert (currently Brian) and include a link to the process instance.

...

single batch host is killing several jobs ('rogue' LSF host)
group of batch hosts crashed or went offline
afs crashed on one of the host machines
scratch disk is full on one or more of the host machines - see what to do here (This is a log of a conversation, probably posted from the OpsLog, should be made into a child page)
A staging disk is full (these were accessed from /afs/slac.stanford.edu/g/glast/ground/PipelineStaging[1-7]; need instructions on how to check xroot files)
/nfs/farm/g/glast/u41, /nfs/farm/g/glast/u15 or /nfs/farm/g/glast/u28 is full

How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.

What to do in case of infrastructure failures: taking care of these failures involves a large number of people (the infrastructure expert on-call and often also the SCCS), so e-mail the appropriate on-call expert (see L1 shift schedule - need to get link to new one once it's ready) and the opsprob list if you think that one of those failures might be happening during the night (if in doubt, e-mail anyway).

Permanent failures

G4 propagator error. If recon dies complaining about g4propagator, we can't fix it. If this happens, email Heather (heather625{at}gmail) (and possibly Warren (focke{at}slac) and Maria Elena (monzani{at}slac)). Include a link to the log file, which will tell them where the core file is.
Overlapping chunks. If findChunks complains about overlapping chunks, tell Steve Tether; if trending merges complain, tell datamonlist@glast2.Stanford.EDU. Rollback won't help.
Deliveries arriving too closely will mangle the display. The underlying processes are fine, but email opsproblist{at}glast2.stanford.edu, Steve Tether (need e-mail) or Jim (jchiang{at}slac) to fix the display. (Need instructions for this added to the HTF FC Incoming page)
Digitization crashes. Sometimes we get a bad event and have to skip it. Instructions for that are in the rollback section below.

...

What to do in case of permanent failures: contact the appropriate people above, if you are sure you know what happened. Otherwise, e-mail the opsprob list and the appropriate on-call expert (see L1 shift schedule - link). If there is another part of the run waiting, the run lock (see below) will have to be removed by hand; e-mail unless you're really sure of what you're doing.

...

Too few events in the run, or gaps - can lead to too few events in the magic7 file (example log file of an FT2 failure). Try to copy the /nfs/farm/g/glast/u28/stage/XXX/magic7_XXX.txt (where XXX is the delivery number) to /nfs/farm/g/glast/u28/stage/YYY/magic7_YYY.txt (where YYY is the delivery where fakeFT2 failed ... remember to change XXX to YYY in the magic7 file name). Then roll back fakeFT2. If this fails, email Giacomo Vianello (need e-mail) for additional information regarding problems with FT2.
- See opsprob discussion of FT2 failure due to a very small chunk containing only 1 crumb.
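The magic7 copy described above is easy to get wrong because the delivery number appears twice in each path. A minimal Python sketch of the path handling follows; the helper names `magic7_path` and `copy_magic7` are hypothetical and not part of L1Proc, and the staging root is the one quoted in the text.

```python
import shutil
from pathlib import Path

# Staging root quoted in the text; pass another root to try the helper elsewhere.
STAGE_ROOT = Path("/nfs/farm/g/glast/u28/stage")


def magic7_path(delivery, stage_root=STAGE_ROOT):
    """Return <stage_root>/<delivery>/magic7_<delivery>.txt."""
    return Path(stage_root) / str(delivery) / f"magic7_{delivery}.txt"


def copy_magic7(good_delivery, failed_delivery, stage_root=STAGE_ROOT):
    """Copy the magic7 file of delivery XXX into the directory of delivery YYY.

    The destination file name uses the failed delivery number, which is the
    renaming step the text warns about. The destination directory must exist.
    """
    src = magic7_path(good_delivery, stage_root)
    dst = magic7_path(failed_delivery, stage_root)
    if not src.is_file():
        raise FileNotFoundError(f"no magic7 file at {src}")
    shutil.copyfile(src, dst)
    return dst
```

After the copy, fakeFT2 still has to be rolled back as described above; the sketch only handles the file.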

...

This will roll back all of the failed, terminated, or cancelled processes in delivery 80819007. If you don't say --minimum, it will roll back the whole delivery. That's usually not what you want. Also note that it will not roll back processes that have succeeded, but with incomplete information (i.e., problems arising from afs/nfs hiccups). Such processes may need to be rolled back via the front end. (For a list of pipeline commands, `/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline --help` shows what is available.)

After a rollback, the red x on the data processing page will be gone, but the L1 status will still say Failed. This tends to confuse the duty scientists. You might want to use the setL1Status task (see below) to make it say Running. This is really optional; it won't affect the processing in any way. But there will be fewer pagers beeping.

...

As of 2011/04/19, the prod command-line client fails if you try to roll back substreams. But the dev client running in prod mode can.
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline -m PROD ... (was this ever fixed? why or why not?)

When to rollback

Wait for the "setStatus" stream to have run.

Rollback won't work unless everything downstream of the failed process is in a final state. Trying too soon is harmless: the attempt fails and you just get an unhelpful error message, so don't be afraid to try. Most things at run level can be rolled back right away. If a recon job fails, you'll have to wait at least an hour, maybe half a day, until all downstream processes have finished.

Notice that the GRB search is executed per delivery and depends on all the FT1 and FT2 files in each run being registered (therefore, it depends on almost "everything" that belongs to that delivery). For this reason, you might need to wait for the entire delivery to be completed before being able to roll back any failed recon jobs. And because of the run lock (see below), some of the (parts of) runs in the delivery might have to wait for other deliveries to finish, which might have their own failures... It's possible, but rare, to get deadlocks, where nothing can proceed until a lock is removed by hand. Best to ask for help then (who do we ask? and if us, what do we do to fix it?).

In general, experience will tell you when you can roll back what. So, when in doubt, try anyway (if it's too soon, nothing will happen and you will get an error)!

Often you can roll things back sooner if you cancel some processes. If there is a delivery with some runs that are ready to roll back and others that aren't, you can do the rollback if you cancel kludgeAsp:
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline -m PROD cancelProcessInstance 13388768
The number (or numbers; you can use a space-separated list to do more than one at a time) is the Oracle PK for the process instance; it's in the URL for the process instance page in the frontend. The cancel takes a long time, 10-30 minutes, and often takes so long that it times out and fails, so check to make sure the cancel actually happened.
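When several process instances have to be cancelled at once, it can help to assemble the command line programmatically so that a mistyped PK is caught before anything runs. The following is a sketch only: `cancel_command` is a hypothetical helper, it merely builds the argument list quoted above, and it does not contact the pipeline.

```python
# Dev client path as quoted in the text.
PIPELINE_DEV = "/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline"


def cancel_command(process_instance_pks, client=PIPELINE_DEV):
    """Build the argv list for cancelProcessInstance in PROD mode."""
    # int() rejects anything that is not a plain number, such as a pasted URL.
    pks = [str(int(pk)) for pk in process_instance_pks]
    if not pks:
        raise ValueError("need at least one process instance PK")
    return [client, "-m", "PROD", "cancelProcessInstance", *pks]


print(" ".join(cancel_command([13388768])))
```

The resulting list could be handed to `subprocess.run`; since the cancel can take 10-30 minutes, any timeout would need to be generous.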

The dontCleanUp file

Any time one of the merge processes can't find all of its input files, a message is generated in the Log Watcher (and there will be errors in the log of the failed processes complaining about "Different number of events ...") and cleanup for the run is disabled by a file called dontCleanUp in the run directory on u41/L1. All cleanup jobs will fail if the dontCleanUp file is present. If everything is OK (see instructions below), that file can be removed and the jobs rolled back.

...

Now you should have a list of (what??) complaining about not being able to find input files. If any of the complaining merges are downstream of a failed process, you can ignore them. This is usually the case. But if they are not downstream from a failure, it means there was a disk read error while setting up the merge. Look at the logfile for that merge; it will say what files are missing. If they are really there, the merge can be rolled back and will probably work the second time. If they are not there, there's a bigger problem, and this should be considered a "permanent failure".

...

This process is now automatically retried like most of the others. When it fails, it attempts to remove the run lock (see above) and the throttle lock (next section) by hand (and usually succeeds). But if it fails harder than usual, you might still have to do that by hand. Also, you'll probably have to:
mv /nfs/farm/g/glast/u41/L1/${runId}/${runId}${deliveryid}_chunkList.txt /nfs/farm/g/glast/u41/L1/${runId}/${runId}${deliveryid}_chunkList.txt.tmp
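The rename above can also be sketched in Python. This is an illustration under assumptions: the helper names are hypothetical, and the file name pattern is copied exactly as it appears in the mv command, so check it against a real run directory before relying on it.

```python
from pathlib import Path

# L1 run directory root as quoted in the mv command.
L1_ROOT = Path("/nfs/farm/g/glast/u41/L1")


def chunklist_path(run_id, delivery_id, l1_root=L1_ROOT):
    """Return the chunk list path, following ${runId}${deliveryid}_chunkList.txt."""
    return Path(l1_root) / run_id / f"{run_id}{delivery_id}_chunkList.txt"


def set_aside_chunklist(run_id, delivery_id, l1_root=L1_ROOT):
    """Rename the chunk list to <name>.tmp, like the mv command."""
    src = chunklist_path(run_id, delivery_id, l1_root)
    dst = src.with_name(src.name + ".tmp")
    src.rename(dst)  # raises FileNotFoundError if the chunk list is missing
    return dst
```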

...

The instructions on how to retry FASTCopy ingestion or ProcessSCI can be found here

AFS buffers & idle threads

Skip this whole section, we buffer on xroot now.

When the AFS servers where we keep temporary files hiccup, it's usually because they ran low on idle threads. It is possible to monitor this value and intervene to stave off disaster. It can be viewed with Ganglia or Nagios. Nagios only works inside SLAC's firewall, but is somewhat more reliable. Ganglia works from anywhere, and shows higher time resolution, but sometimes it stops updating and just shows a flat line with some old value.

Ganglia

Ganglia threads page shows idle threads for all servers.
Warren's custom disk monitoring pages shows just the servers we're actually interested in, and shows the mapping between server and volume names.

Nagios

Using ssh as a proxy: You can use ssh as a SOCKS proxy to relay requests through a SLAC computer.

Pick a port. Should be a 4-digit number. Referred to as <port> below.
ssh -D <port> noric.slac.stanford.edu
Then configure your browser to use a SOCKS proxy at localhost:<port>

Now you should be able to access SLAC-only pages. There are three places to get the threads:

new nagios (02) click "Hostgroup Overview" on the left side, then the host you want in the "AFS fileservers" table. This shows idle threads for that server on the "AFSfs" row.
old nagios (01): click "AFS FS Graphs" in the upper left. This shows most or all SLAC AFS servers. Not as reliable as nagios02, and it's going away eventually.
Warren's "old" 1-hour page

There's 3 places to get the information...

What to do with the info

When the servers are idle, idle threads should be 122. The SLAC IT people consider it a warning if it goes below 110 and an error at 100. For us, it seems like things usually work as long as it stays above 60 or so (although afs112 has been known to go as low as 40, since PipelineStaging6 is for the HalfPipe). Low values are likely to occur if there are more than ~300 chunk jobs running, usually after recon finishes and the chunk-level jobs downstream of recon start up. I've written a script that will suspend jobs that are using a specified server, wait a bit, and then resume them with a little delay between. Suspending jobs will make the server recover sooner, but some of the jobs are likely to get timeout errors and fail due to the delay of being suspended, so it may be best to wait until idle threads hits 0 before suspending.
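The thresholds in the paragraph above can be written down as a small classifier, which may be handy when scanning many servers at once. This is a sketch: the function name and the four labels are made up for illustration, and the cut at 60 is the approximate "60 or so" from the text, not a hard limit.

```python
def classify_idle_threads(idle):
    """Map an AFS idle-thread count to a rough severity label.

    Thresholds follow the text: 122 is fully idle, SLAC IT warns below 110
    and flags an error at 100, and L1 usually copes while the count stays
    above about 60.
    """
    if idle >= 110:
        return "ok"
    if idle > 100:
        return "warning"      # below 110: SLAC IT warning level
    if idle > 60:
        return "error"        # 100 or lower: SLAC IT error level, L1 usually still fine
    return "l1-at-risk"       # 60 or lower: L1 jobs may start to fail


for count in (122, 105, 80, 40):
    print(count, classify_idle_threads(count))
```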

From a noric, as glastraw:

setenv L1ProcROOT /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/1.70 (any L1 version >= 1.70 should work)
$L1ProcROOT/tools/grepJobs.py PipelineStaging2 1200 60
first arg is the volume to be suspended
second arg is seconds to wait between stopping and resuming
third arg is seconds to wait between resuming jobs
If you want to do afs106, you have to say "PipelineStaging/"; if you say "PipelineStaging" it will suspend everything.
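The warning about "PipelineStaging" versus "PipelineStaging/" makes sense if the volume argument is matched as a plain substring of the paths a job uses. That matching rule is an assumption inferred from the text, not taken from grepJobs.py, and the job names and paths below are hypothetical. A sketch of the effect:

```python
def jobs_matching(pattern, job_paths):
    """Return the jobs whose path contains the pattern as a substring."""
    return sorted(job for job, path in job_paths.items() if pattern in path)


# Hypothetical jobs and the staging paths they are working in.
jobs = {
    "job-a": "/afs/slac.stanford.edu/g/glast/ground/PipelineStaging/chunk1",
    "job-b": "/afs/slac.stanford.edu/g/glast/ground/PipelineStaging2/chunk2",
    "job-c": "/afs/slac.stanford.edu/g/glast/ground/PipelineStaging6/chunk3",
}

print(jobs_matching("PipelineStaging", jobs))   # every job matches
print(jobs_matching("PipelineStaging/", jobs))  # only the unnumbered volume
```

Under that assumption, the trailing slash is what stops the unnumbered volume name from also matching PipelineStaging1 through PipelineStaging7.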

The association between disk and server can be found in several ways. Here's one:

$ fs whereis PipelineStaging*

File PipelineStaging is on host afs106.slac.stanford.edu
File PipelineStaging1 is on host afs106.slac.stanford.edu
File PipelineStaging2 is on host afs104.slac.stanford.edu
File PipelineStaging3 is on host afs107.slac.stanford.edu
File PipelineStaging4 is on host afs108.slac.stanford.edu
File PipelineStaging5 is on host afs109.slac.stanford.edu
File PipelineStaging6 is on host afs112.slac.stanford.edu
File PipelineStaging7 is on host afs113.slac.stanford.edu
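To go from a server that is low on idle threads back to the volumes it hosts, the listing above can be inverted. The sketch below parses only the line format shown in this listing; the function names are hypothetical, and other `fs whereis` output shapes (for example replicated volumes) are not handled.

```python
import re
from collections import defaultdict

# Matches lines such as "File PipelineStaging2 is on host afs104.slac.stanford.edu".
WHEREIS_LINE = re.compile(r"^File (\S+) is on host (\S+)$")


def parse_whereis(output):
    """Return {volume directory: host} from 'fs whereis' output."""
    mapping = {}
    for line in output.splitlines():
        match = WHEREIS_LINE.match(line.strip())
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


def volumes_by_host(mapping):
    """Invert the mapping: {host: sorted list of volume directories}."""
    hosts = defaultdict(list)
    for volume, host in mapping.items():
        hosts[host].append(volume)
    return {host: sorted(volumes) for host, volumes in hosts.items()}
```

With the listing above as input, `volumes_by_host` would show that afs106 serves both PipelineStaging and PipelineStaging1.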

You can also find the amount of disk usage with the following command:

$ fs lq PipelineStaging*
Volume Name               Quota      Used       %Used  Partition
G.glast.ground.PipStg     350000000  128422194  37%    26%
G.glast.ground.PipStg     350000000  128422194  37%    26%
G.glast.ground.PipStg2    350000000  118129732  34%    24%
G.glast.ground.PipStg3    350000000  124826187  36%    26%
G.glast.ground.PipStg4    350000000  124744630  36%    26%
G.glast.ground.PipStg5    350000000  132978520  38%    27%
G.glast.ground.PipStg6    350000000  59700367   17%    12%
G.glast.ground.PipStg7    350000000  107201000  31%    22%
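The quota listing can be checked automatically for volumes that are filling up. The sketch below parses the five columns shown above; the function names are hypothetical, and the 90% default threshold is an arbitrary assumption, not a value from this page.

```python
def parse_listquota(output):
    """Parse 'fs lq' output into a list of dicts, one per data row."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have a numeric quota in the second column; this skips
        # the "Volume Name ..." header and any blank lines.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        name, quota, used, pct_used, pct_partition = fields[:5]
        rows.append({
            "volume": name,
            "quota": int(quota),
            "used": int(used),
            # Keep only the digits before '%', ignoring any trailing marker.
            "pct_used": int(pct_used.split("%")[0]),
            "pct_partition": int(pct_partition.split("%")[0]),
        })
    return rows


def nearly_full(rows, threshold_pct=90):
    """Return the volume names whose %Used is at or above the threshold."""
    return [row["volume"] for row in rows if row["pct_used"] >= threshold_pct]
```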

Useful resources

...

can be found here

Useful resources

...

You should definitely join the following mailing lists:
...
l1RunStatus defaults to Running, but you can set it to any of the allowed values (Complete, InProgress, Incomplete, Running, Failed).
Advanced: in case a run needs to be set as Complete/Incomplete in the ACQSUMMARY table, see here: HTF Run completion status.

setL1Quality task

This can be used when you need to preset a run to Bad (upcoming tests, etc.).

...

Send an email to unix-admin (this usually works, even at night and on weekends).
If you don't get an answer and the issue is urgent, call xHELP (650-926-4357). Choose 3 to page the on-call person.
If your call/email isn't answered and the issue is REALLY urgent, page the on-call person again at 650-926-2230.

Log Watcher Messages

Message text: Can't open lockfile /nfs/farm/g/glast/u41/L1/r0248039911/r0248039911.lock.

* The monitoring shifter doesn't need to do anything about it. The L1 shifter should figure out why it happened. In this case, it was because I rolled back a merge job in a run that was already done. (Warren)

cancelProcessInstance

Here's the syntax to cancel a process that is not in final state, and all its dependencies (might be useful when you don't want to wait for it to finish before rolling something back; however, it's usually faster to wait, as the cancel can take a long time):
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline --mode PROD cancelProcessInstance 8073657

...

Occasionally, if deliveries arrive too close together, there may be a problem with the assignment of IDs. In this case, there may be several lines that appear for each run in a particular delivery, as shown here. It is a display issue and not a problem with the processing itself, and Jim Panetta (panetta{at}slac) or Steve Tether (tether{at}slac) should be able to resolve the problem with the display (Who should be doing this going forward? Also, documentation needed on how this is accomplished). Also email OPSPROB (opsproblist{at}glast2{dot}stanford{dot}edu)

...

If you suspect a process is stuck, visit the process page from the run page (i.e., click on the green task bar in the Data Processing Page, then click on the particular job that is running in the list at the bottom of the page). Each entry in the table on the page will contain a link to "Messages", "Log", and "Files". If you click on "Log", it should show you the log, but more importantly, it'll show the location of the log file right above the box containing the log information (note: if the file is too big, it will not open, but the link to "Files" will show where the log is located, so you can get the information that way as well). If you visit the location of the file on the noric public machines, you can determine when the file was last updated (ls -l, or tail -f to see if it's still streaming). If it is not updating, find out the batch job number (JobId from the stream table), log into noric as glastraw, and issue a bkill <JobId> (bkill -r might be necessary if the jobs are not going away after the bkill). The process should automatically be restarted after this command is issued.
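The "is the log still updating?" check can be scripted by comparing the log file's modification time with the current time. This is a minimal sketch: the function names are hypothetical, and the one-hour default is an arbitrary assumption, since the text gives no figure for how long a silent log means a stuck job.

```python
import os
import time


def seconds_since_update(log_path, now=None):
    """Return how many seconds ago the file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def looks_stuck(log_path, max_idle_seconds=3600, now=None):
    """Return True if the log has not been written to for longer than the limit."""
    return seconds_since_update(log_path, now) > max_idle_seconds
```

A True result is only a hint. Some processes may legitimately run for a long time without logging, so check the job by hand before issuing bkill.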

...

When weird things are happening with the delivery that you can't figure out or it has too many chunks to process, it may be necessary to repipe the stream (probably a good idea to check with someone first - Who?). Log in as glastraw and issue the following command:

...

The repiped run(s) will create run and throttle locks just like any other, so you don't need to worry about what else is going on in L1 when you inject them. Except there shouldn't be anything else for this run running or waiting, or you'll have trouble renaming the chunklists as described above. But they won't show up on the data processing page, and the regular runs that do show up there will appear to be slow starting for no obvious reason, so you might want to put a note in opslog (is the opslog still being used? where is it?) to avoid confusing the monitoring shifter.

You should be able to see if the process restarted correctly by visiting here (updated to point to L1 v5.7). Click on "Last" to get to the end of the list.

...

From Steve Tether: I have a prototype script that corrects a test problem I created in the NIGHTLY database. If you see an ingestion failure for one or more Level zero files of a delivery, first take a look at the FASTCopy logs available through the Data Processing web app. (Click on the progress bar for FASTCopy, then on the links for those files under Input Products that show a status of INGESTFAIL.) If it looks like an NFS glitch, i.e., a message says such-and-such a directory or file does not exist, run the following command in a terminal window that is logged into the SLAC AFS cell and has the ISOC PROD environment set up (where are the instructions for setting up that environment?):

python ~tether/retryIngestion.py delivery-id-number

...

The script will check whether the files for that delivery are still on disk. If any are not, or if the script reports a failure, then you'll have to refer the problem to me (will be Don? Need procedures for that person). Normally the files received for a given day (UTC) are archived and removed from disk at about noon (Pacific time) the next day. If all goes well, the last output from the script will be a listing of the L0 files whose statuses were reset to NEW. After that, the L0 file states should go through SUBMITTED and stop at INGESTDONE. If ingestion fails again then Jim or I will have to handle it.

...

Here's a little script that is useful for monitoring:

#!/bin/bash

lines=$1

export LSB_BJOBS_FORMAT="id name:15 user submit_time stat:5 exec_host:10 start_time mem cpu_used"

...

busers -w glast glastmc glastraw

bqueues -w glastdataq

Shift sign up

Sign up for shifts here. View shift calendar here

Useful L1Proc links (these are designed to keep you from waiting for the entire L1Proc history to be retrieved ... which could be 10 min and up).

All L1Proc Deliveries that have not completed
All L1Proc Runs that have not completed
All L1Proc Runs that have failed with a timestamp (it is a good idea to update your timestamp to shorten the search).

...

Versions Compared

Old Version 166

New Version Current
