Notes from Closeout Session: https://docs.google.com/document/d/1Ka4CkBBdhAkowJEzxd8_LyQCbSI2po41P4Tsrp6xviM
DOE Commitment:
- 1.5 FTEs after 2018, down from 3 FTEs. Needs to be complete by end of FY ’18. Gradual ramp-up and transition desired.
- Hardware purchasing? Server infrastructure, databases, etc.
- IFC still funding hardware purchases
- Hardware will come to stay at SLAC
- ORACLE server migration soon
...
- Went through externals and determined statuses/necessity of each package
- Also determined whether we needed to provide packages ourselves or whether they were available in conda
- ScienceTools Package Author Review
Other Notes About Science Tools:
- Many pieces of analysis rely on ST being backwards-compatible.
- Is the FSSC going to be doing more than just bug-fixes?
- Will there be more development?
- Fermipy unit tests catch a lot of issues
- Advisable to run after Likelihood updates
Data Flow
Data Pipeline:
- Data arrives from GSFC
- L0 ingest triggered by FastCopy post-processing script (into database)
- Halfpipe keeps track of what has been delivered
- Once tidy & ordered, hands off to L1
- L1 registers in data catalog & sends FITS to GSFC
Pipeline Monitoring (Warren & Michael Kuss):
- Current shifting scheme:
- Warren & Michael each take responsibility for 1/2 day.
- Maria Elena covers when Warren is on vacation.
- No coverage for Michael.
- Large disorganized document: https://confluence.slac.stanford.edu/display/ds/Things+to+know+while+on-call+for+Data+Processing
- Needs to be split into child pages and updated
- As new person gets trained, would be good to start working on reorganization
- Some work done via command line to interact with LSF (batch submission).
- Most (95%) done via the web interface.
- Luca L. was looking at adding pipeline monitoring for young Torino folks.
Halfpipe (Steve Tether, maybe Gregg?):
Halfpipe has a specific Operations Task - gap reporting? (from Rob)
Warren: No, occurs before the Halfpipe…during FastCopy post-processing
Fast copy: https://confluence.slac.stanford.edu/display/ISOC/FASTCopy+processing+chain
processsci.py (?? I didn't really catch this)
- Launches gap finder to look for missing pieces
- Divides delivery into chunks and writes xml file
- Up to 20 chunks/delivery. Usually ~12 or so
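The gap-finding and chunking steps above can be sketched roughly as follows. This is only an illustration, not the actual FastCopy post-processing code: the function names, the use of integer sequence numbers to detect gaps, and the XML layout (`delivery` and `chunk` elements) are all assumptions made for the example. Only the 20-chunk ceiling comes from the notes.

```python
import xml.etree.ElementTree as ET

MAX_CHUNKS = 20  # from the notes: up to 20 chunks per delivery


def find_gaps(sequence_numbers):
    """Return (first_missing, last_missing) ranges in a run of integers.

    Assumes each piece of a delivery carries an integer sequence number
    (hypothetical; the real gap finder may key on something else).
    """
    ordered = sorted(set(sequence_numbers))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal pieces."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, MAX_CHUNKS, len(items)))
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def chunks_to_xml(delivery_id, chunks):
    """Describe the chunks in a small XML document (hypothetical schema)."""
    root = ET.Element("delivery", id=str(delivery_id))
    for index, chunk in enumerate(chunks):
        ET.SubElement(
            root, "chunk",
            index=str(index),
            first=str(chunk[0]),
            last=str(chunk[-1]),
            count=str(len(chunk)),
        )
    return ET.tostring(root, encoding="unicode")
```

For example, `find_gaps([1, 2, 3, 6, 7, 10])` reports the missing ranges 4-5 and 8-9, and asking for 50 chunks is capped at 20.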
Halfpipe reads chunked files:
- Extracts events from the datagrams
- Requires exactly 2 EPUs, otherwise will be rejected (in mergeEvents)
- i.e. if numEPUs != 2, stop
- Merges data from EPUs into a single time-ordered stream
- Necessary for down-stream processing
- Launches Level 1 pipeline
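A minimal sketch of the merge step described above, assuming each EPU delivers an already time-ordered list of `(timestamp, payload)` events. The function name `merge_events`, the dictionary input, and the event tuples are hypothetical and are not taken from the real mergeEvents code. Only the "exactly 2 EPUs" check and the single time-ordered output come from the notes.

```python
import heapq
from operator import itemgetter

REQUIRED_EPUS = 2  # from the notes: exactly 2 EPUs, otherwise stop


def merge_events(epu_streams):
    """Merge per-EPU event lists into one time-ordered list.

    epu_streams: dict mapping an EPU name to a list of
    (timestamp, payload) tuples, each list already sorted by timestamp.
    Returns a list of (timestamp, epu_name, payload) tuples.
    """
    if len(epu_streams) != REQUIRED_EPUS:
        raise ValueError(
            f"expected exactly {REQUIRED_EPUS} EPUs, got {len(epu_streams)}"
        )
    # Tag every event with its source EPU so the origin survives the merge.
    tagged = (
        [(t, name, payload) for t, payload in events]
        for name, events in sorted(epu_streams.items())
    )
    # heapq.merge performs a streaming merge of sorted inputs; the key
    # restricts the comparison to the timestamp.
    return list(heapq.merge(*tagged, key=itemgetter(0)))
```

Usage: `merge_events({"EPU0": [(1, "a"), (4, "d")], "EPU1": [(2, "b"), (3, "c")]})` yields the four events ordered by timestamp, while passing a single EPU raises `ValueError`.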
Question: What issues arise with the Halfpipe that need to be addressed?
- Datagrams being omitted due to issue with merging logic.
- Runs broken up between deliveries. Datagrams fall through cracks.
- Has to be repiped.
ISOC Software Monitoring:
Also known as....what does Steve T. need to do and how often?
https://confluence.slac.stanford.edu/display/SAS/ISOC+software+summary
- Monitor the backup process
- 60 days of data on disk, rest on xrootd server
- Archive of FastCopy packages backed up to tape
- Raw data backed up to tape outside of xrootd
- Not uncommon for a job to be terminated for no reason and have to be restarted
- Check once a week, commonly needs intervention
- Partly due to trscron. Change in the Kerberos implementation.
- Get AFS tokens, but no kerberos tokens. Batch job semi-fails.
- Backup works but cleanup fails.
- Only operating at ~60% of capacity… 1/2% per week.
- Documented in the problem report submitted. Jim also complained.
- Regular maintenance documented on a page for Jim Panetta that explains how to check up on data maintenance tasks.
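As an illustration of the 60-day on-disk window mentioned above, here is a small sketch that sorts files into "keep on disk" and "archive only" by age. The notes do not describe how the real backup and cleanup jobs are implemented, so the function name, the `(name, last_modified)` input, and the use of modification time as the age criterion are all assumptions.

```python
from datetime import datetime, timedelta

DISK_RETENTION = timedelta(days=60)  # from the notes: 60 days of data on disk


def split_by_retention(files, now):
    """Partition files by age relative to the on-disk retention window.

    files: iterable of (name, last_modified) pairs, last_modified a datetime.
    Returns (keep_on_disk, archive_only) as two lists of names.
    """
    cutoff = now - DISK_RETENTION
    keep, archive = [], []
    for name, modified in files:
        (keep if modified >= cutoff else archive).append(name)
    return keep, archive
```

A weekly check could compare the `archive_only` list against what is still on disk to spot a cleanup step that did not run.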
...
/afs/slac/g/glast/ground/releases/calibrations/CAL/p7repro//
/afs/slac/g/glast/ground/releases/calibrations/TKR/
ACD calibration code location: /nfs/farm/afs/slac/g/glast/groundtak/releases/calibrations/TKR/GR-20-09-10.
ASP Discussion (Jim Chiang):
...
- Needs someone that he could show the pipeline code and train to do heavy lifting when it comes to kicking the pipeline
- Docker containers for something like the batch system may cause some problems, since
- For something like the L1 pipeline, a number of images would need to be launched simultaneously
- Would size of the software cause problems with deployment?
- We would need a system where you restrict loading images to the batch farm to prevent collisions/problems
- There is probably a precedent for this; however, Matt has no experience deploying on this scale
- File size of ~1 GB is best; a few GB is manageable for production.
- IT dept supportive of docker@SLAC. There is 1 machine with RHEL7
- Lyon is a much larger computing center - likely they will upgrade to Docker first
- Now full support for Docker at Lyon (Fred)
Infrastructure:
- Last purchase went into dev cluster
- many nodes @RHEL6, upgrade to RHEL7 and doing docker with this
- Still getting NFS/AFS sorted out with RHEL7. GPFS?
- It's good to come up with a plan because of security implications if NFS underneath.
- Use right docker (UID issues w/security)
- SLAC will give us a few nodes for testing docker. Fallback is to install on user machines. (Brian)
- AFS on RHEL6 docker
- read files if world readable.
- NFS is hardest.
- Timeline for RHEL7, 12mo? 2018? (Matt)
- RHEL7 support is dodgy.
- Configuration stuff is hard part
Flight Software:
...
- Julie: No path to having anyone other than SLAC supporting flight software
LAT On-board Configuration:
If we desired to change the on-board configuration, what happens?
- Jim Panetta knew the most about that. Took knowledge with him. But see this link.
- Gregg Thayer can do that nominally.
- Handshake in ISOC and GlastRelease has to be done as well
- MOOT/MOOD Table where the key is stored before it’s transmitted to flight software.
- Seems like it takes a while for the ground system to catch up before we can use it on the instrument.
From Gregg:
- Forming the input to the configuration build system is the least remembered part
- System for turning the crank and building configurations is fine
- Instructions for turning crank may need work
- Then need to check content of product before upload
- May 2010: Halt in data processing due to MOOT key mismatch with MPT
- Do we know how to handle the MPT?
- Gregg…yes.
Mission Planning/Flight Operations
Actions:
- Finalize and document list of needed permissions on the ISOC Mission Planning page
- Get Fermi SharePoint access for NASA people (SLAC Windows account is not enough)
- Robin/Elizabeth/Jerry to propose round robin schedule for weekly mission planning
- document support cron scripts for SAA checks and planning product generation
- document occasional mission-week-boundary LAT command scheduling problem
- NASA planners to take over LCI calibrations planning, effective immediately
- FSSC to consider having LAT operations workstation/tool installation
September 2017 Actions:
- any?
Routine Flight Operations Tasks
- Monitoring, Trending, Reporting: for Weekly LAT Science reports and Quarterly LAT Science reports
Actions:
- improve documentation in Confluence
- make use of API to Google sheets for auto-updating LAT SSR usage and LAT trigger histories
- move LAT monitoring tools and data files from /afs/slac/u/gl/rac/LATmetrics/ to /afs/slac/g/glast/isoc/flightOps/LATmetrics/ for easier shared use and support
- Fix known bugs in agepoly.pl and x4saa.pl
- re-discover use of the LAT Configuration GUI tool, and document it
- migrate other Excel spreadsheet usage to non-Excel implementations
- TKR on-board and ground bad strip history trending: need a solution (ipython notebook?) that makes time plots, and also 4x4 grid-based info output
- CAL light output history trending
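For the "4x4 grid-based info output" item in the TKR bad strip trending action above, a minimal sketch of a text formatter is shown below. It assumes the 16 towers are indexed 0..15 with tower = 4 * row + column and that row 0 is printed first. Both the numbering convention and the function name are assumptions for illustration, not an existing tool.

```python
def format_tower_grid(counts_by_tower, width=6):
    """Render 16 per-tower values as a 4x4 text grid.

    counts_by_tower: dict mapping tower index (0..15) to a value,
    e.g. the number of bad strips in that tower.
    Assumed layout: tower = 4 * row + col, row 0 printed first.
    """
    if sorted(counts_by_tower) != list(range(16)):
        raise ValueError("expected values for towers 0..15")
    rows = []
    for row in range(4):
        cells = (
            f"{counts_by_tower[4 * row + col]:>{width}}" for col in range(4)
        )
        rows.append(" ".join(cells))
    return "\n".join(rows)
```

Calling `print(format_tower_grid(counts))` once per time bin gives a compact snapshot that can sit alongside the time plots in a notebook.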
...
Mission Planning notes from Elizabeth:
...