Current shifting scheme:
- Warren & Michael each take responsibility for 1/2 day.
- Maria Elena covers when Warren is on vacation.
- No coverage for Michael.
Large, disorganized document:
https://confluence.slac.stanford.edu/display/ds/Things+to+know+while+on-call+for+Data+Processing
- Needs to be split into child pages and updated
- As a new person gets trained, it would be good to start working on the reorganization.
- Some work done via command line to interact with LSF.
- Most (95%) done via the web interface.
- Luca L. was looking at adding pipeline monitoring for young Torino folks.

Halfpipe (Steve Tether, maybe Gregg?):

Halfpipe has a specific Operations Task - gap reporting? (from Rob)

Warren: No, gap reporting occurs before the Halfpipe, during FastCopy post-processing.

FastCopy: https://confluence.slac.stanford.edu/display/ISOC/FASTCopy+processing+chain

processsci.py (?? I didn't really catch this)

Launches gap finder to look for missing pieces
Divides delivery into chunks and writes xml file
Up to 20 chunks/delivery. Usually ~12 or so
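
The chunking step could look roughly like the sketch below. This is an illustration only, not the actual script: the function names, the `target_size` parameter, and the XML element and attribute names are all hypothetical. The only detail taken from the notes is the cap of 20 chunks per delivery.

```python
import math
import xml.etree.ElementTree as ET

MAX_CHUNKS = 20  # from the notes: up to 20 chunks per delivery


def split_into_chunks(items, target_size, max_chunks=MAX_CHUNKS):
    """Split items into contiguous chunks of roughly target_size,
    never producing more than max_chunks chunks (hypothetical policy)."""
    if not items:
        return []
    n_chunks = min(max_chunks, math.ceil(len(items) / target_size))
    size = math.ceil(len(items) / n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]


def chunks_to_xml(delivery_id, chunks):
    """Describe the chunks in an XML string (element names are made up)."""
    root = ET.Element("delivery", id=str(delivery_id))
    for index, chunk in enumerate(chunks):
        ET.SubElement(root, "chunk", index=str(index),
                      first=str(chunk[0]), last=str(chunk[-1]))
    return ET.tostring(root, encoding="unicode")


chunks = split_into_chunks(list(range(120)), target_size=10)
xml_text = chunks_to_xml(7, chunks)
```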

Halfpipe reads chunked files:

Extracts events from the datagrams

Requires exactly 2 EPUs; anything else is rejected (in mergeEvents)
i.e., if numEPUs != 2, stop

Merges data from EPUs into a single time-ordered stream

Necessary for down-stream processing
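
The EPU check and merge described above can be sketched as follows. This is a minimal illustration, not the real mergeEvents code: the function name `merge_events`, the `(timestamp, payload)` event layout, and the assumption that each EPU stream is already time-ordered are all hypothetical.

```python
import heapq

REQUIRED_EPUS = 2  # from the notes: exactly 2 EPUs, otherwise reject


def merge_events(streams):
    """Merge per-EPU event streams into one time-ordered list.

    streams: dict mapping EPU id -> list of (timestamp, payload) tuples,
    each list assumed to be sorted by timestamp already.
    """
    if len(streams) != REQUIRED_EPUS:
        raise ValueError(
            f"expected exactly {REQUIRED_EPUS} EPUs, got {len(streams)}")
    # heapq.merge lazily interleaves the sorted inputs by timestamp.
    return list(heapq.merge(*streams.values(), key=lambda event: event[0]))


merged = merge_events({
    0: [(1, "a"), (4, "d")],
    1: [(2, "b"), (3, "c")],
})
```

A k-way merge keeps the cost linear in the number of events, which matters if the downstream steps expect a single ordered stream.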

Launches Level 1 pipeline

Question: What issues arise with the Halfpipe that need to be addressed?

Datagrams are omitted due to an issue with the merging logic.

When runs are broken up between deliveries, datagrams fall through the cracks.
Has to be repiped.

ISOC Software Monitoring:

Also known as: what does Steve T. need to do, and how often?

Monitor the backup process

60 days of data on disk, rest on xrootd server
Archive of FastCopy packages backed up to tape
Raw data backed up to tape outside of xrootd
Not uncommon for a job to be terminated for no apparent reason and have to be restarted
Check once a week; commonly needs intervention

Partly due to trscron: change in the Kerberos implementation.
Jobs get AFS tokens but no Kerberos tokens, so the batch job semi-fails.
Backup works but cleanup fails.
Only operating at ~60% of capacity… 1/2% per week.

Documented in the problem report submitted. Jim also complained.

Regular maintenance is documented on a page for Jim Pancetta that explains how to check up on data maintenance tasks.

Other issues that Steve T. deals with?

Monitor event log for Error or Fatal messages in the last week.

Usually they come in as pages via email
Example: a change in how the gap finder is triggered. It is now invoked by science delivery.
FSW errors go to Steve, Gregg, Jana
For others, need to review the paging configuration to see who is contacted.
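
The weekly check for Error or Fatal messages could be scripted along these lines. This is a sketch under assumptions: the record layout `(timestamp, level, message)` and the function name `recent_alerts` are hypothetical, and the real event log would need its own reader.

```python
from datetime import datetime, timedelta

ALERT_LEVELS = {"ERROR", "FATAL"}


def recent_alerts(records, now, window=timedelta(days=7)):
    """Return records at Error or Fatal level from the last `window`.

    records: iterable of (timestamp: datetime, level: str, message: str).
    """
    cutoff = now - window
    return [r for r in records
            if r[1].upper() in ALERT_LEVELS and r[0] >= cutoff]


now = datetime(2015, 6, 15, 12, 0)
log = [
    (datetime(2015, 6, 14, 8, 0), "Error", "gap finder failed"),
    (datetime(2015, 6, 13, 9, 0), "Info", "delivery received"),
    (datetime(2015, 6, 1, 7, 0), "Fatal", "too old to matter"),
]
alerts = recent_alerts(log, now)
```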

Vetting of orbital event reports.

Would reject if events shifted from one orbit to another.
No more than 3 in any single report.

In migrating (what?) to RHEL6, how much was touched?

CHS package and offline were not touched
All others had to be touched in some way
Depends on a lot of external packages.
Unlikely to be portable past RHEL6
Currently using RHEL5 release under RHEL6
RHEL5 very near end of life

Data Distribution:

Get rid of FastCopy?
