...
How to recognize transient failures: they usually affect a single job (disk or DB failures), or a small number of jobs all on the same machine (LSF failure). If a process complains about a missing file but the file exists, or gets a read error after opening a file, it's probably a disk server hiccup. If the log file (linked from the pipeline web client) ends suddenly, without the usual LSF postlog, the batch host probably crashed. In that case there will probably be several jobs failing the same way at the same time.
What to do in case of transient failures: roll back the affected process(es) when possible (see below for the rollback procedure). Look for the dontCleanUp file and check the Log Watcher (see below) (check for what?). If recon segfaults for no apparent reason, email Heather and Anders (general emergency contacts are on the How to Fix page) before the rollback, including a link to the log file, which will tell them where the core file is. For pipeline deadlocks, email Dan the Pipeline Expert (currently Brian) and include a link to the process instance.
...
How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.
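The rules of thumb for telling transient and infrastructure failures apart can be written down as a small triage helper. This is only a sketch in Python, assuming the failed jobs are available as (job id, host) pairs. The `classify_failures` name and the `many` cutoff of 10 are illustrative assumptions, not values taken from the pipeline.

```python
from collections import Counter


def classify_failures(failed_jobs, many=10):
    """Guess the failure type from a list of (job_id, host) pairs.

    `many` is an assumed cutoff for "a large number of jobs";
    adjust it from experience.
    """
    if not failed_jobs:
        return "none"
    hosts = Counter(host for _job_id, host in failed_jobs)
    # Many failures at once, on one host or several: infrastructure.
    if len(failed_jobs) >= many:
        return "infrastructure"
    # A single job, or a handful on the same machine: transient.
    if len(failed_jobs) == 1 or len(hosts) == 1:
        return "transient"
    # A few failures spread over several hosts: look at the logs.
    return "unclear"
```

The result is a hint for where to look first; it does not replace reading the logs.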
What to do in case of infrastructure failures: resolving these failures involves a large number of people (the on-call infrastructure expert and often also the SCCS), so for the time being e-mail the appropriate on-call expert (see L1 shift schedule - need to get link to new one once it's ready) and the opsprob list if you think that one of these failures might be happening during the night (if in doubt, e-mail anyway).
...
What to do in case of permanent failures: contact the appropriate people above, if you are sure you know what happened. Otherwise, e-mail the opsprob list and the appropriate on-call expert (see L1 shift schedule - link). If there is another part of the run waiting, the run lock (see below) will have to be removed by hand; e-mail unless you're really sure of what you're doing.
...
...
This will roll back all of the failed, terminated, or cancelled processes in delivery 80819007. If you don't say --minimum, it will roll back the whole delivery, which is usually not what you want. Also note that it will not roll back processes that have succeeded but with incomplete information (i.e., problems arising from afs/nfs hiccups). Such processes may need to be rolled back via the front end. (For a list of pipeline commands, `/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline --help` shows what is available.)
After a rollback, the red x on the data processing page will be gone, but the L1 status will still say Failed. This tends to confuse the duty scientists. You might want to use the setL1Status task (see below) to make it say Running. This is really optional; it won't affect the processing in any way. But there will be fewer pagers beeping.
...
As of 2011/04/19, the prod command-line client fails if you try to roll back substreams. But the dev client running in prod mode can.
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline -m PROD ... (was this ever fixed? why or why not?)
Wait for the "setStatus" stream to have run.
Rollback won't work unless everything downstream of the failed process is in a final state. It's generally not harmful to try too soon; you just get an unhelpful error message, so don't be afraid to try. Most things at run level can be rolled back right away. If a recon job fails, you'll have to wait at least an hour, maybe half a day: basically, wait until all downstream processes have finished.
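The "everything downstream must be in a final state" rule is a reachability check on the process dependency graph. Here is a minimal sketch in Python, assuming you have the dependencies as a dict that maps each process to its direct downstream processes, and a dict of current statuses. The state names in `FINAL_STATES`, the function names, and the example process names are all assumptions for illustration; the pipeline's real state list may differ.

```python
# Assumed set of final states; the pipeline's real list may differ.
FINAL_STATES = {"SUCCESS", "FAILED", "TERMINATED", "CANCELED"}


def downstream_of(process, children):
    """Return every process reachable downstream of `process`."""
    seen = set()
    stack = list(children.get(process, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def can_roll_back(process, children, status):
    """True if every process downstream of `process` is in a final state."""
    return all(status[p] in FINAL_STATES
               for p in downstream_of(process, children))
```

For example, with `children = {"recon": ["merge"], "merge": ["cleanup"]}`, a failed `recon` cannot be rolled back while `merge` or `cleanup` is still waiting or running.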
Notice that the GRB search is executed per delivery and depends on all the FT1 and FT2 files in each run to be registered (therefore, it depends on almost "everything" that belongs to that delivery). For this reason, you might need to wait for the entire delivery to be completed before being able to roll back any failed recon jobs. And because of the run lock (see below), some of the (parts of) runs in the delivery might have to wait for other deliveries to finish, which might have their own failures... It's possible, but rare, to get deadlocks, where nothing can proceed until a lock is removed by hand. Best to ask for help then (who do we ask? and if us, what do we do to fix it?).
In general, experience will tell you when you can roll back what. So, when in doubt, you can try anyway (if it's too soon, nothing will happen and you will get an error)!
Often you can roll things back sooner if you cancel some processes. If there is a delivery with some runs that are ready to roll back and others that aren't, you can do the rollback if you cancel kludgeAsp. "/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline -m PROD cancelProcessInstance 13388768" The number (or numbers; you can use a space-separated list to do more than one at a time) is the Oracle PK for the process instance, which appears in the URL for the process instance page in the frontend. The cancel takes a long time, 10-30 minutes, and often takes so long that it times out and fails, so check to make sure the cancel actually happened.
Any time one of the merge processes can't find all of its input files, a message is generated in the Log Watcher (and there will be errors in the log of the failed processes complaining about "Different number of events ...") and cleanup for the run is disabled by a file called dontCleanUp in the run directory on u41/L1. All cleanup jobs will fail if the dontCleanUp file is present. If everything is OK (see instructions below), that file can be removed and the jobs rolled back.
...
Now you should have a list of (what??) complaining about not being able to find input files. If any of the complaining merges are downstream of a failed process, you can ignore them. This is usually the case. But if they are not downstream from a failure, it means there was a disk read error while setting up the merge. Look at the logfile for that merge; it will say what files are missing. If they are really there, the merge can be rolled back and will probably work the second time. If they are not there, there's a bigger problem, and this should be considered a "permanent failure".
...
This process is now automatically retried like most of the others. When it fails, it attempts to remove the run lock (see above) and the throttle lock (next section) by hand (and usually succeeds). But if it fails harder than usual, you might still have to do that by hand. Also, you'll probably have to:
mv /nfs/farm/g/glast/u41/L1/${runId}/${runId}${deliveryid}_chunkList.txt /nfs/farm/g/glast/u41/L1/${runId}/${runId}${deliveryid}_chunkList.txt.tmp
...
The instructions on how to retry FASTCopy ingestion or ProcessSCI can be found here
Skip this whole section, we buffer on xroot now.
When the AFS servers where we keep temporary files hiccup, it's usually because they ran low on idle threads. It is possible to monitor this value and intervene to stave off disaster. It can be viewed with Ganglia or Nagios. Nagios only works inside SLAC's firewall, but is somewhat more reliable. Ganglia works from anywhere, and shows higher time resolution, but sometimes it stops updating and just shows a flat line with some old value.
Now you should be able to access SLAC-only pages. There's 2 places to get the threads:
There's 3 places to get the information...
When the servers are idle, idle threads should be 122. The SLAC IT people consider it a warning if it goes below 110 and an error at 100. For us, it seems like things usually work as long as it stays above 60 or so (although afs112 has been known to go as low as 40, since PipelineStanging6 is for the HalfPipe). Low idle threads are likely to occur if there are more than ~300 chunk jobs running, usually after recon finishes and the chunk-level jobs downstream of recon start up. I've written a script that will suspend jobs that are using a specified server, wait a bit, and then resume them with a little delay between. Suspending jobs will make the server recover sooner, but some of the jobs are likely to get timeout errors and fail due to the delay of being suspended, so it may be best to wait until idle threads hits 0 before suspending.
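The thresholds quoted above can be collected in one place. The following is a sketch in Python; the function name and the status labels are made up for illustration, and "above 60 or so" is treated as a hard cutoff at 60, which is an assumption.

```python
def idle_thread_status(idle_threads):
    """Map an AFS server's idle-thread count to (SLAC IT level, L1 level).

    SLAC IT: warning below 110, error at 100 or less.
    L1: things usually work while the count stays above ~60 (assumed cutoff).
    """
    if idle_threads <= 100:
        it_level = "error"
    elif idle_threads < 110:
        it_level = "warning"
    else:
        it_level = "ok"
    l1_level = "ok" if idle_threads > 60 else "at risk"
    return it_level, l1_level
```

The two levels are reported separately because a count that SLAC IT already flags as an error can still be fine for L1 processing.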
The association between disk and server can be found in several ways. Here's one:
You can also find the amount of disk usage with the following command:
...
can be found here
...
You should definitely join the following mailing lists:
...
l1RunStatus defaults to Running, but you can set it to any of the allowed values (Complete, InProgress, Incomplete, Running, Failed).
Advanced: in case a run needs to be set as Complete/Incomplete in the ACQSUMMARY table, see here: HTF Run completion status.
This can be used when you need to preset a run to Bad (upcoming tests, etc.).
...
Message text: Can't open lockfile /nfs/farm/g/glast/u41/L1/r0248039911/r0248039911.lock.
* The monitoring shifter doesn't need to do anything about it. The L1 shifter should figure out why it happened. In this case, it was because I rolled back a merge job in a run that was already done. (Warren)
Here's the syntax to cancel a process that is not in final state, and all its dependencies (might be useful when you don't want to wait for it to finish before rolling something back. However, it's usually faster to wait as the cancel can take a long time):
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline --mode PROD cancelProcessInstance 8073657
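If you cancel process instances often, building the command line in one place helps avoid typos in the path or the PKs. This sketch in Python only assembles the argument list from the command shown above; the `cancel_command` helper is hypothetical, and actually running the result (for example with `subprocess.run`) only makes sense on a SLAC machine with the pipeline client available.

```python
PIPELINE = "/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline"


def cancel_command(*process_instance_pks):
    """Build the argument list to cancel one or more process instances.

    Each PK is the number from the process instance page URL.
    """
    if not process_instance_pks:
        raise ValueError("need at least one process instance PK")
    # int() rejects anything that is not a plain number.
    pks = [str(int(pk)) for pk in process_instance_pks]
    return [PIPELINE, "--mode", "PROD", "cancelProcessInstance", *pks]
```

Printing `" ".join(cancel_command(8073657))` reproduces the command line above, so you can check it before running anything.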
...
Occasionally, if deliveries arrive too close together, there may be a problem with the assignment of IDs. In this case, there may be several lines that appear for each run in a particular delivery, as shown here. It is a display issue and not a problem with the processing itself, and Jim Panetta (panetta{at}slac) or Steve Tether (tether{at}slac) should be able to resolve the problem with the display (Who should be doing this going forward? Also, documentation needed on how this is accomplished). Also email OPSPROB (opsproblist{at}glast2{dot}stanford{dot}edu)
...
If you suspect a process is stuck, visit the process page from the run page (i.e., click on the green task bar in the Data Processing Page, then click on the particular job that is running in the list at the bottom of the page). Each entry in the table on the page will contain a link to "Messages", "Log", and "Files". If you click on "Log", it should show you the log, but more importantly, it'll show the location of the log file right above the box containing the log information (note: if the file is too big, it will not open, but the link to "Files" will show where the log is located, so you can get the information that way as well). If you visit the location of the file on the noric public machines, you can determine when the file was last updated (ls -l, or tail -f to see if it's still streaming). If it is not updating, find out the batch job number (JobId from the stream table), log into noric as glastraw, and issue a bkill <JobId> (bkill -r might be necessary if the jobs are not going away after the bkill). The process should automatically be restarted after this command is issued.
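The "is the log still being updated?" check can also be scripted. This is a minimal sketch in Python that looks at the file's modification time, which is the same information `ls -l` shows. The function names and the one-hour default are assumptions; pick a limit that fits the job type.

```python
import os
import time


def seconds_since_update(log_path, now=None):
    """Seconds since the log file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def looks_stuck(log_path, max_idle_seconds=3600, now=None):
    """True if the log has not been written to for `max_idle_seconds`.

    The one-hour default is an assumed value, not a pipeline setting.
    """
    return seconds_since_update(log_path, now) > max_idle_seconds
```

A quiet log is only a hint that the job is stuck; some steps legitimately run for a long time without writing anything.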
...
When weird things are happening with the delivery that you can't figure out or it has too many chunks to process, it may be necessary to repipe the stream (probably a good idea to check with someone first - Who?). Log in as glastraw and issue the following command:
...
The repiped run(s) will create run and throttle locks just like any other, so you don't need to worry about what else is going on in L1 when you inject them. Except there shouldn't be anything else for this run running or waiting, or you'll have trouble renaming the chunklists as described above. But they won't show up on the data processing page, and the regular runs that do show up there will appear to be slow starting for no obvious reason, so you might want to put a note in opslog (is the opslog still being used? where is it?) to avoid confusing the monitoring shifter.
You should be able to see if the process restarted correctly by visiting here. (Note that the link only works for L1 1.85. Update: it now points to L1 v5.7.) Click on "Last" to get to the end of the list.
...
From Steve Tether: I have a prototype script that corrects a test problem I created in the NIGHTLY database. If you see an ingestion failure for one or more Level zero files of a delivery, first take a look at the FASTCopy logs available through the Data Processing web app. (Click on the progress bar for FASTCopy, then on the links for those files under Input Products that show a status of INGESTFAIL.) If it looks like an NFS glitch, i.e., a message says such-and-such a directory or file does not exist, run the following command in a terminal window that is logged into the SLAC AFS cell and has the ISOC PROD environment set up (where are the instructions for setting up that environment?):
python ~tether/retryIngestion.py delivery-id-number
...
The script will check whether the files for that delivery are still on disk. If any are not, or if the script reports a failure, then you'll have to refer the problem to me or Jim P (will be Don? Need procedures for that person). Normally the files received for a given day (UTC) are archived and removed from disk at about noon (Pacific time) the next day. If all goes well, the last output from the script will be a listing of the L0 files whose statuses were reset to NEW. After that the L0 file states should go through SUBMITTED and stop at INGESTDONE. If ingestion fails again, then Jim or I will have to handle it.
...
Here's a little script that is useful for monitoring:
#!/bin/bash
lines=$1
export LSB_BJOBS_FORMAT="id name:15 user submit_time stat:5 exec_host:10 start_time mem cpu_used"
...
busers -w glast glastmc glastraw
bqueues -w glastdataq
Sign up for shifts here. View shift calendar here
...