So probably really only one realistic option at the moment.

 

Also you should probably notify unix-admin.

 

Files missing from NFS file server

This one came up in the incident referenced by this thread in the opsprob mailing list: https://www-glast.stanford.edu/protected/mail/opsprob/11338.html

Basically, several jobs were hung. When they were killed, they still didn't work, and rolling things back made it worse.  In the end, we needed to copy the files back from xroot to NFS and restart the process.  Some questions, comments, and answers from the thread (some of these still need answers):

1) How do we know that? Is it just because the logs complain that they are not there? How do we check for their existence?

2) Where do they live to copy them from?

3) I'm assuming they need to get copied to the directory where the jobs can't find them (root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/542/r0542185591/e00000000000002729588/event/ in this case). Or do they go somewhere else?

4) How do we perform the copy?

See #10 below

5) I'm guessing we'll need to move/rename the existing chunkList file so a new one can be created at this point? Is this correct? (BTW the notes say we should have a section on chunkList files that no one has written yet)

6) Where do we roll back to in order to get everything going again? Just a general rollback on the command line? Or is there a specific task that can be rolled back to kick everything off properly again?

7) The .evt files referenced by the chunkList file (in /nfs/farm/g/glast/u28/stage/180308007/r0542185591/) don't exist.

They've been moved to xroot. It really would be better if they were copied; I [Warren] don't remember why I get discouraged every time I try to make that change.

8) The directory in xroot (root://glast-rdr.slac.stanford.edu//glast/Scratch/l1Stage/runs/542/r0542185591/e00000000000002729588/event/) only contains a single evt file: r0542185591_e00000000000002729588_v246043240_event.evt.  The failed jobs are complaining about other files which are missing.

Each was moved to its own directory.

9) In the /nfs/farm/g/glast/u41/L1/deliveries/1803 directory, there is no entry for this delivery (180308007).  The entries for 180308006 and 180308008 are there though.  Found it.  It's in the /nfs/farm/g/glast/u28/stage directory.  It looks like that gets moved over to deliveries once it's done.

10) Still can't seem to find the original evt files.

Each has a directory corresponding to the chunk number: the second-to-last component in the path you give above.

I use a command like this to generate a script to move them back to NFS:

awk '/^\/afs.*xrdcp/{print $1, $2, $3, $5, $4}' /nfs/farm/g/glast/u41/L1/logs/PROD/L1Proc/5.5/doRun/findChunks/160xxxxxx/515xxx/011/485xxxxxx/001xxx/694/archive/198092024/logFile.txt

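To see what that awk command does, here is a small sketch run against one made-up log line. It assumes the findChunks log records each copy as a line starting with the xrdcp path under /afs, followed by two options, then the source and the destination; the sample line, host name, paths, and options below are placeholders, not taken from a real log. The awk program keeps the first three fields and swaps the last two, so the output is the same copy in the opposite direction (xroot back to NFS).

```shell
# Hypothetical log line; the real logFile.txt format is assumed, not verified.
# Fields: $1 = xrdcp binary under /afs, $2 and $3 = options,
#         $4 = source (NFS), $5 = destination (xroot).
sample='/afs/example/bin/xrdcp -np -f /nfs/example/stage/chunk.evt root://example-host//example/chunk/event/chunk.evt'

# Same awk program as in the command above:
# keep the command and its options, swap source and destination.
reversed=$(printf '%s\n' "$sample" | awk '/^\/afs.*xrdcp/{print $1, $2, $3, $5, $4}')

# Show the generated "move it back" command without running it.
printf '%s\n' "$reversed"
```

Lines that don't start with /afs or don't mention xrdcp are dropped, so the rest of the log doesn't end up in the generated script.
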
Final thoughts from Warren on this particular issue:

There are some things here I don't understand. LSF is definitely screwing up and maybe the pipeline too, but:

Among the jobs that were launched after 2:30 and failed was findChunks, which had moved some of the evt files. The automatic retry moved the rest. Then some of the digi and fastMon jobs couldn't find their inputs. This was a little trickier than the usual "move them back" scenario, because their locations in xroot had to be gathered from 2 log files. So I did that and rolled back findChunks and it all looks OK to me now.
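
When the locations are split across two logs like this, one way to gather them is to hand both log files to the same awk program and write the result to a script that can be checked before running. This is only a sketch: the log contents, file names, and host below are hypothetical stand-ins, and it is not necessarily how Warren did it.

```shell
# Hypothetical stand-ins for the two findChunks logs
# (the original attempt and the automatic retry).
workdir=$(mktemp -d)
printf '%s\n' 'some other log output' \
  '/afs/example/bin/xrdcp -np -f /nfs/example/stage/a.evt root://example-host//example/1/event/a.evt' \
  > "$workdir/first.log"
printf '%s\n' \
  '/afs/example/bin/xrdcp -np -f /nfs/example/stage/b.evt root://example-host//example/2/event/b.evt' \
  > "$workdir/retry.log"

# awk reads the files in the order given, so one pass covers both attempts.
awk '/^\/afs.*xrdcp/{print $1, $2, $3, $5, $4}' \
  "$workdir/first.log" "$workdir/retry.log" > "$workdir/moveBack.sh"

# Inspect the generated commands before running anything.
cat "$workdir/moveBack.sh"
count=$(wc -l < "$workdir/moveBack.sh")
```

After the files are back on NFS, findChunks still has to be rolled back as described above.
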