
  • Current shifting scheme:
    •  Warren & Michael each take responsibility for 1/2 day.
    •  Maria Elena covers when Warren is on vacation.
    •  No coverage for Michael.
  • Large disorganized document 
    https://confluence.slac.stanford.edu/display/ds/Things+to+know+while+on-call+for+Data+Processing
    •  Needs to be split into child pages and updated
    •  As new person gets trained, would be good to start working on reorganization
    •  Some work done via command line to interact with LSF. 
    •  Most (95%) done via the web interface.
    •  Luca L. was looking at adding pipeline monitoring for young Torino folks.

 

Halfpipe (Steve Tether, maybe Gregg?):

Halfpipe has a specific Operations Task - gap reporting? (from Rob)

Warren: No, occurs before the Halfpipe…during FastCopy post-processing

 

Fast copy: https://confluence.slac.stanford.edu/display/ISOC/FASTCopy+processing+chain

processsci.py (?? I didn't really catch this)

  •  Launches gap finder to look for missing pieces
  •  Divides delivery into chunks and writes xml file
  •  Up to 20 chunks/delivery. Usually ~12 or so
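The gap-finding and chunking steps above can be sketched roughly as follows. This is only an illustration, not the actual script: the function names, the use of integer datagram ids, and the XML element and attribute names are all assumptions. Only the limit of 20 chunks per delivery comes from the notes.

```python
import math
import xml.etree.ElementTree as ET

MAX_CHUNKS = 20  # from the notes: up to 20 chunks per delivery


def find_gaps(datagram_ids):
    """Return (first_missing, last_missing) ranges in a sorted id sequence.

    Integer, consecutive datagram ids are an assumption for this sketch.
    """
    gaps = []
    for prev, cur in zip(datagram_ids, datagram_ids[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


def split_into_chunks(datagram_ids, target_chunk_size):
    """Split an ordered id list into at most MAX_CHUNKS contiguous chunks."""
    if not datagram_ids:
        return []
    # Grow the chunk size when the target would yield more than MAX_CHUNKS.
    size = max(target_chunk_size, math.ceil(len(datagram_ids) / MAX_CHUNKS))
    return [datagram_ids[i:i + size] for i in range(0, len(datagram_ids), size)]


def chunks_to_xml(delivery_id, chunks):
    """Describe the chunks in a small XML document (hypothetical schema)."""
    root = ET.Element("delivery", id=str(delivery_id))
    for index, chunk in enumerate(chunks):
        ET.SubElement(
            root,
            "chunk",
            index=str(index),
            first=str(chunk[0]),
            last=str(chunk[-1]),
            count=str(len(chunk)),
        )
    return ET.tostring(root, encoding="unicode")
```

With a small target size the chunk size is raised automatically so that the 20-chunk limit is never exceeded.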

Halfpipe reads chunked files:

  •  Extracts events from the datagrams
    •  Requires exactly 2 EPUs; otherwise the delivery is rejected (in mergeEvents)
    •  i.e., if numEPUs != 2, stop
  •  Merges data from EPUs into a single time-ordered stream
    •  Necessary for down-stream processing
  •  Launches Level 1 pipeline
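The EPU check and merge described above might look something like the sketch below. It assumes each EPU stream is already time-ordered and that events can be represented as (timestamp, payload) tuples. The names `merge_events` and `EpuCountError` are hypothetical and are not taken from the real mergeEvents code.

```python
import heapq


class EpuCountError(ValueError):
    """Raised when the input does not contain exactly two EPU streams."""


def merge_events(streams_by_epu):
    """Merge per-EPU event streams into a single time-ordered list.

    streams_by_epu maps an EPU name to a list of (timestamp, payload)
    tuples; each list is assumed to be sorted by timestamp already.
    """
    if len(streams_by_epu) != 2:
        # Mirrors the rule in the notes: if numEPUs != 2, stop.
        raise EpuCountError(f"expected 2 EPUs, got {len(streams_by_epu)}")
    # heapq.merge performs a lazy k-way merge of already sorted inputs.
    return list(heapq.merge(*streams_by_epu.values(), key=lambda event: event[0]))
```

For example, `merge_events({"epu0": [(1, "a"), (4, "d")], "epu1": [(2, "b"), (3, "c")]})` interleaves the two streams by timestamp, while passing a single stream raises `EpuCountError`.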

 

Question: What issues arise with the Halfpipe that need to be addressed?

  •  Datagrams being omitted due to issue with merging logic.
    •  Runs broken up between deliveries. Datagrams fall through cracks.
    •  Has to be repiped.

 

ISOC Software Monitoring: 

Also known as....what does Steve T. need to do and how often?

  •  Monitor the backup process
    •  60 days of data on disk, rest on xrootd server
    •  Archive of FastCopy packages backed up to tape
    •  Raw data backed up to tape outside of xrootd
    •  Not uncommon for a job to be terminated for no reason and have to be restarted
    •  Check once a week, commonly needs intervention
      •  Partly due to trscron and a change in the Kerberos implementations
      •  Jobs get AFS tokens but no Kerberos tickets, so the batch job semi-fails.
      •  Backup works but cleanup fails.
      •  Only operating at ~60% of capacity… 1/2% per week. 
    •  Documented in the problem report submitted. Jim also complained.
  •  Regular maintenance documented on a page for Jim Pancetta that explains how to check up on data maintenance tasks.
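The "backup works but cleanup fails" failure mode noted above could be caught with a credential check before the cleanup step, sketched below. The sketch assumes the batch environment provides the standard `klist` command, whose `-s` option reports ticket status through the exit code only. The `backup` and `cleanup` callables and the returned status strings are hypothetical stand-ins, not part of the real backup job.

```python
import subprocess


def has_kerberos_ticket(run=subprocess.run):
    """Return True if `klist -s` exits 0 (a valid ticket cache exists).

    The `run` parameter exists only so the check can be exercised
    without a real Kerberos installation.
    """
    result = run(["klist", "-s"])
    return result.returncode == 0


def run_backup(backup, cleanup, ticket_check=has_kerberos_ticket):
    """Run the backup, then the cleanup only when a ticket is present.

    Returns "complete" or "backup-only" so a weekly check can spot jobs
    that need intervention.
    """
    backup()
    if not ticket_check():
        # Per the notes, cleanup fails when only AFS tokens are present.
        return "backup-only"
    cleanup()
    return "complete"
```

Reporting "backup-only" explicitly makes the semi-failed state visible rather than leaving it to be discovered during the weekly check.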

Other issues that Steve T. deals with?

  •  Monitor event log for Error or Fatal messages in the last week.
    •  Usually they arrive as pages via email
    •  Example…changing how gap finder is triggered. Now invoked by science delivery.
    •  FSW errors go to Steve, Gregg, Jana
    •  For others, need to review the paging configuration to see who is contacted.
  •  Vetting of orbital event reports. 
    •  Would reject if events shifted from one orbit to another.
    •  No more than 3 in any single report.
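The weekly event-log check mentioned above (Error or Fatal messages in the last week) amounts to a filter on severity and time. A minimal sketch follows, assuming log entries are available as dictionaries with `level` and `time` keys. That record layout and the function name are assumptions, not the real event log format.

```python
from datetime import datetime, timedelta

ALERT_LEVELS = {"ERROR", "FATAL"}


def recent_alerts(entries, now, window=timedelta(days=7)):
    """Return entries at Error or Fatal level logged within the window.

    Each entry is assumed to be a dict with a "level" string and a
    "time" datetime.
    """
    cutoff = now - window
    return [
        entry
        for entry in entries
        if entry["level"].upper() in ALERT_LEVELS and cutoff <= entry["time"] <= now
    ]
```

Passing `now` explicitly rather than calling `datetime.now()` inside the function keeps the check reproducible when reviewing a past week.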

In migrating (what?) to RHEL6, how much was touched?

  •  CHS package and offline were not touched
  •  All others had to be touched in some way
  •  Depends on a lot of external packages.
  •  Unlikely to be portable past RHEL6
  •  Currently using RHEL5 release under RHEL6
  •  RHEL5 very near end of life


Data Distribution:

Get rid of FastCopy?
