...

Seems to be an NFS error.  I can't write to that directory as myself or as the glast account.  Need to check on the proper account/permissions.

 

Handling Stuck Jobs

...

You'll have to be glastraw for the last 2 steps.

But then I messed up. I rolled back findChunks 180502012.546962321 and it failed, complaining about overlapping data. The right thing to do is roll back the whole doRun stream while defining deliveriesToIgnore=180502013 (How?), which I've now done.

And now there are more errors, which I'll have to investigate later, but they probably involve the magic7 file.

 

Multiple Submissions of the same Job

This is usually indicated by one or more instances of the LSF summary report appearing in the log file, usually (although not always) near the end.  The summary looks like this:


------------------------------------------------------------
Sender: LSF System <lsf@hequ0119>
Subject: Job 217648: <findChunks> in cluster <slac> Done

Job <findChunks> was submitted from host <fermilnx-v08> by user <glastraw> in cluster <slac> at Thu Aug 2 11:37:06 2018.
Job was executed on host(s) <hequ0119>, in queue <glastdataq>, as user <glastraw> in cluster <slac> at Thu Aug 2 11:37:08 2018.
</u/gl/glastraw> was used as the home directory.
</nfs/farm/g/glast/u41/L1/logs/PROD/L1Proc/5.6/doRun/findChunks/180xxxxxx/802xxx/015/554xxxxxx/915xxx/869> was used as the working directory.
Started at Thu Aug 2 11:37:08 2018.
Terminated at Thu Aug 2 11:39:29 2018.
Results reported at Thu Aug 2 11:39:29 2018.

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
bash pipeline_wrapper
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time : 8.33 sec.
    Max Memory : 66 MB
    Average Memory : 24.33 MB
    Total Requested Memory : -
    Delta Memory : -
    Max Swap : -
    Max Processes : 8
    Max Threads : 13
    Run time : 141 sec.
    Turnaround time : 143 sec.

The output (if any) is above this job summary.

In about 95% of cases, a simple rollback of the affected process is all that is needed.  In the other 5%, some other underlying problem is also preventing the job from completing, and you'll need to search through the log files for the error.
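A quick way to spot this sign in a log is to count the summary reports it contains. The following is only a sketch: it assumes every LSF summary contains the "Sender: LSF System <" text seen in the sample above, and the function names are made up for illustration (they are not part of the pipeline).

```python
# Assumption: each LSF summary report contains this sender text,
# as in the sample summary shown above.
LSF_MARKER = "Sender: LSF System <"


def count_lsf_summaries(log_text):
    """Return how many LSF summary reports appear in the log text."""
    return log_text.count(LSF_MARKER)


def has_lsf_summary(log_text):
    """True if the log contains at least one LSF summary report,
    the usual sign of a multiple submission described above."""
    return count_lsf_summaries(log_text) >= 1
```

To use it, read the process log into a string and pass it to has_lsf_summary. A positive result only points at the usual symptom; the log still needs to be read to rule out an underlying problem.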

 

HalfPipe progress bar field has red question mark

This occurs when there are no files in the delivery for the half-pipe to process. There are three possible causes:

  1. There really is no data.  The list of files in the delivery (click on the progress bar for FastCopy on that delivery) shows no *.LSCI.* files.  Nothing needs to be done.
  2. It's duplicate data.  An example of this is delivery 18729012.  In this case, the delivered files were the same ones (but with a .02 version number) that had come in a previous delivery (here, delivery 18729011).  The system determined that there were no new events, so the half-pipe wasn't started.  Nothing needs to be done.
  3. The half-pipe failed to start.  For some reason, the FastCopy processing completed successfully but the half-pipe failed to launch.  When this happens and the delivery contains science data, follow the instructions on the HTF FC Incoming page for "Dispatching Runs" (item 2).  This will launch the half-pipe for that run.
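The first check in the list above (does the delivery contain any *.LSCI.* files?) can be sketched as a filename filter. This is only an illustration: it assumes you have copied the delivery's file list into a Python list, and the function name and the example file names in the comment are hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern taken from the text above: science data files match *.LSCI.*
SCIENCE_PATTERN = "*.LSCI.*"


def science_files(filenames):
    """Return the file names in a delivery listing that match *.LSCI.*
    (case-sensitive match)."""
    return [name for name in filenames if fnmatchcase(name, SCIENCE_PATTERN)]


# Hypothetical listing, for illustration only:
# science_files(["example.LSCI.01", "example.OTHER.01"])
```

An empty result corresponds to cause 1. A non-empty result does not distinguish cause 2 from cause 3; that still requires comparing against the previous delivery.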

 

Unknown LSE_Keys error

The symptom of this is an error of the form:

LSEReader::read: unknown LSE_Keys typeid 551075433 from /nfs/farm/g/glast/u42/ISOC-flight/Downlinks/180619008/0000f5a2-20d8be69-03bd-00a21.evt

In this particular case, the LSE_Key was set to the run number instead of its proper value (-1 through 3, from an enum in the code).

To date, I've only ever seen this in the mergeEvt task.  It is symptomatic of a problem upstream in the doChunk streams (which run the makeEvt task).  In every instance, I've found that one or more of those makeEvt tasks had a multiple submission.  Rolling back all the doChunk streams that had multiple submissions fixes the problem.
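When searching logs for this error, it can help to pull out the typeid and the file path. The sketch below assumes the message always has the form shown above and that the valid values are -1 through 3, as stated in the text; the regular expression and function names are my own and are not taken from the LSEReader code.

```python
import re

# Assumption: the error line always follows the format quoted above.
LSE_ERROR_RE = re.compile(
    r"LSEReader::read: unknown LSE_Keys typeid (-?\d+) from (\S+)"
)

# Valid values per the text above: -1 through 3 (from an enum in the code).
VALID_TYPEIDS = range(-1, 4)


def parse_lse_keys_error(line):
    """Return (typeid, evt_path) if the line is an unknown-LSE_Keys error,
    otherwise None."""
    match = LSE_ERROR_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def typeid_is_valid(typeid):
    """True if the typeid falls in the -1..3 range named in the text."""
    return typeid in VALID_TYPEIDS
```

The extracted path identifies the .evt file involved, which is a starting point for finding the doChunk stream whose makeEvt task was submitted more than once.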