...
- Current shifting scheme:
- Warren & Michael each take responsibility for 1/2 day.
- Maria Elena covers when Warren is on vacation.
- No coverage for Michael.
- Large disorganized document
https://confluence.slac.stanford.edu/display/ds/Things+to+know+while+on-call+for+Data+Processing
- Needs to be split into child pages and updated
- As a new person gets trained, it would be good to start working on the reorganization
- Some work done via command line to interact with LSF.
- Most (95%) done via the web interface.
- Luca L. was looking at adding pipeline monitoring for young Torino folks.
Halfpipe (Steve Tether, maybe Gregg?):
Halfpipe has a specific Operations Task - gap reporting? (from Rob)
Warren: No, occurs before the Halfpipe…during FastCopy post-processing
Fast copy: https://confluence.slac.stanford.edu/display/ISOC/FASTCopy+processing+chain
processsci.py (?? I didn't really catch this)
- Launches gap finder to look for missing pieces
- Divides delivery into chunks and writes xml file
- Up to 20 chunks/delivery. Usually ~12 or so
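A minimal sketch of the chunking step described above, in Python. Everything here is an assumption made for illustration: the function names, the `target_per_chunk` parameter, and the XML element and attribute names are hypothetical and are not taken from the actual FastCopy post-processing code. Only the cap of 20 chunks per delivery comes from the notes.

```python
import math
import xml.etree.ElementTree as ET

MAX_CHUNKS = 20  # from the notes: up to 20 chunks per delivery


def plan_chunks(n_datagrams, target_per_chunk, max_chunks=MAX_CHUNKS):
    """Return (start, stop) index ranges covering n_datagrams items.

    Hypothetical helper: aims for target_per_chunk items per chunk,
    but never produces more than max_chunks ranges.
    """
    if n_datagrams <= 0:
        return []
    n_chunks = min(max_chunks, math.ceil(n_datagrams / target_per_chunk))
    size = math.ceil(n_datagrams / n_chunks)
    return [(start, min(start + size, n_datagrams))
            for start in range(0, n_datagrams, size)]


def chunks_to_xml(delivery_id, ranges):
    """Serialize the chunk plan to an XML string (illustrative schema only)."""
    root = ET.Element("delivery", id=str(delivery_id))
    for index, (start, stop) in enumerate(ranges):
        ET.SubElement(root, "chunk", index=str(index),
                      first=str(start), last=str(stop - 1))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    plan = plan_chunks(n_datagrams=1000, target_per_chunk=80)
    print(len(plan), "chunks")
    print(chunks_to_xml("demo-delivery", plan))
```

With a target of about 80 items per chunk, a 1000-item delivery lands near the "usually ~12 or so" figure, while very large deliveries are capped at 20.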
Halfpipe reads chunked files:
- Extracts events from the datagrams
- Requires exactly 2 EPUs; anything else is rejected (in mergeEvents)
- i.e. if numEPUs != 2, stop
- Merges data from EPUs into a single time-ordered stream
- Necessary for down-stream processing
- Launches Level 1 pipeline
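The EPU check and merge described above can be sketched as follows. This is an illustration under stated assumptions, not the actual mergeEvents code: the function name, the `(timestamp, payload)` event representation, and the dictionary input are hypothetical. It assumes each EPU stream is already time-ordered, so a standard k-way merge (`heapq.merge`) yields a single time-ordered stream.

```python
import heapq

REQUIRED_EPUS = 2  # from the notes: exactly 2 EPUs, otherwise reject


def merge_events(streams):
    """Merge per-EPU event lists into one time-ordered list.

    streams: dict mapping an EPU label to a list of (timestamp, payload)
             tuples, each list already sorted by timestamp (assumption).
    """
    if len(streams) != REQUIRED_EPUS:
        raise ValueError(
            f"expected {REQUIRED_EPUS} EPUs, got {len(streams)}; rejecting"
        )
    # heapq.merge performs a lazy merge of already-sorted inputs.
    return list(heapq.merge(*streams.values(), key=lambda event: event[0]))


if __name__ == "__main__":
    epu0 = [(1, "a"), (4, "d")]
    epu1 = [(2, "b"), (3, "c")]
    for timestamp, payload in merge_events({"EPU0": epu0, "EPU1": epu1}):
        print(timestamp, payload)
```

A merge like this only works if every event from both EPUs is present at merge time, which is one way to picture why runs split across deliveries can lose datagrams.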
Question: What issues arise with the Halfpipe that need to be addressed?
- Datagrams are omitted due to an issue with the merging logic.
- Runs are broken up between deliveries, and datagrams fall through the cracks.
- Has to be repiped.
ISOC Software Monitoring:
Also known as... what does Steve T. need to do and how often?
- Monitor the backup process
- 60 days of data on disk, rest on xrootd server
- Archive of FastCopy packages backed up to tape
- Raw data backed up to tape outside of xrootd
- Not uncommon for a job to be terminated for no reason and have to be restarted
- Check once a week, commonly needs intervention
- Partly due to trscron: a change in the Kerberos implementation
- Jobs get AFS tokens, but no Kerberos tokens. Batch job semi-fails.
- Backup works but cleanup fails.
- Only operating at ~60% of capacity… 1/2% per week.
- Documented in the problem report submitted. Jim also complained.
- Regular maintenance is documented on a page for Jim Panetta that explains how to check up on data maintenance tasks.
Other issues that Steve T. deals with?
- Monitor event log for Error or Fatal messages in the last week.
- Usually they arrive as pages via email
- Example…changing how gap finder is triggered. Now invoked by science delivery.
- FSW errors go to Steve, Gregg, Jana
- For others, need to review the paging configuration to see who is contacted.
- Vetting of orbital event reports.
- Would reject if events shifted from one orbit to another.
- No more than 3 in any single report.
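The weekly event-log check mentioned above (looking for Error or Fatal messages in the last week) can be sketched as a simple filter. This is a hypothetical illustration: the record layout (a dict with `time`, `level`, and `message` keys) and the function name are assumptions, and the real event log is presumably queried through its own interface.

```python
from datetime import datetime, timedelta

ALERT_LEVELS = {"ERROR", "FATAL"}  # severities named in the notes


def recent_alerts(records, now, window=timedelta(days=7)):
    """Return records at Error or Fatal level logged within the window.

    records: iterable of dicts with 'time' (datetime), 'level' (str),
             and 'message' (str) keys (assumed layout).
    """
    cutoff = now - window
    return [
        record for record in records
        if record["level"].upper() in ALERT_LEVELS
        and cutoff <= record["time"] <= now
    ]


if __name__ == "__main__":
    now = datetime(2015, 1, 15, 12, 0)
    log = [
        {"time": now - timedelta(days=1), "level": "Error", "message": "recent error"},
        {"time": now - timedelta(days=2), "level": "Info", "message": "routine"},
        {"time": now - timedelta(days=10), "level": "Fatal", "message": "too old"},
    ]
    for record in recent_alerts(log, now):
        print(record["time"], record["level"], record["message"])
```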
In migrating (what?) to RHEL6, how much was touched?
- CHS package and offline were not touched
- All others had to be touched in some way
- Depends on a lot of external packages.
- Unlikely to be portable past RHEL6
- Currently using RHEL5 release under RHEL6
- RHEL5 very near end of life
Data Distribution:
Get rid of FastCopy?
...