Notes from Closeout Session: https://docs.google.com/document/d/1Ka4CkBBdhAkowJEzxd8_LyQCbSI2po41P4Tsrp6xviM
DOE Commitment:
- 1.5 FTEs after 2018, down from 3 FTEs. Needs to be complete by end of FY ’18. Gradual ramp-up and transition desired.
- Hardware purchasing? Server infrastructure, databases, etc.
- IFC still funding hardware purchases
- Hardware will come to stay at SLAC
- ORACLE server migration soon
...
- Went through externals and determined statuses/necessity of each package
- Also determined whether we needed to provide packages or if they were available in conda
- ScienceTools Package Author Review
Other Notes About Science Tools:
- Many pieces of analysis rely on ST being backwards-compatible.
- Is the FSSC going to be doing more than just bug-fixes?
- Will there be more development?
- Fermipy unit tests catch a lot of issues
- Advisable to run after Likelihood updates
Data Flow
Data Pipeline:
- Data arrives from GSFC
- L0 ingest triggered by FastCopy post-processing script (into database)
- Halfpipe keeps track of what has been delivered
- Once tidy & ordered, hands off to L1
- L1 registers in data catalog & sends FITS to GSFC
Pipeline Monitoring (Warren & Michael Kuss):
- Current shifting scheme:
- Warren & Michael each take responsibility for 1/2 day.
- Maria Elena covers for when Warren on vacation.
- No coverage for Michael.
- Large disorganized document
https://confluence.slac.stanford.edu/display/ds/Things+to+know+while+on-call+for+Data+Processing
- Needs to be split into child pages and updated
- As new person gets trained, would be good to start working on reorganization
- Some work done via command line to interact with LSF (batch submission).
- Most (95%) done via the web interface.
- Luca L. was looking at adding pipeline monitoring for young Torino folks.
Halfpipe (Steve Tether, maybe Gregg?):
Halfpipe has a specific Operations Task - gap reporting? (from Rob)
Warren: No, occurs before the Halfpipe…during FastCopy post-processing
Fast copy: https://confluence.slac.stanford.edu/display/ISOC/FASTCopy+processing+chain
processsci.py (?? I didn't really catch this)
- Launches gap finder to look for missing pieces
- Divides delivery into chunks and writes xml file
- Up to 20 chunks/delivery. Usually ~12 or so
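The chunking step above can be sketched roughly as follows; the 20-chunk cap and typical ~12 chunks come from the notes, while the function names, datagram-ID bookkeeping, and XML manifest layout are illustrative assumptions, not processsci.py's actual logic.

```python
# Sketch of dividing a delivery into chunks and writing an XML manifest.
# The 20-chunk cap is from the notes; everything else is an assumption.
import math
import xml.etree.ElementTree as ET

MAX_CHUNKS = 20

def chunk_delivery(datagram_ids, target_chunks=12):
    """Split a delivery's datagrams into at most MAX_CHUNKS chunks."""
    if not datagram_ids:
        return []
    n_chunks = min(max(1, target_chunks), MAX_CHUNKS, len(datagram_ids))
    size = math.ceil(len(datagram_ids) / n_chunks)
    return [datagram_ids[i:i + size] for i in range(0, len(datagram_ids), size)]

def write_manifest(chunks):
    """Serialize the chunk list as a small XML manifest (layout is assumed)."""
    root = ET.Element("delivery", nchunks=str(len(chunks)))
    for i, chunk in enumerate(chunks):
        ET.SubElement(root, "chunk", id=str(i), ndatagrams=str(len(chunk)))
    return ET.tostring(root, encoding="unicode")
```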
Halfpipe reads chunked files:
- Extracts events from the datagrams
- Requires exactly 2 EPUs, otherwise the delivery is rejected (in mergeEvents)
- i.e., if numEPUs != 2, stop
- Merges data from EPUs into a single time-ordered stream
- Necessary for downstream processing
- Launches Level 1 pipeline
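A minimal sketch of the merge behavior described above, assuming events arrive as (timestamp, payload) tuples per EPU; the exactly-2-EPUs rejection mirrors the mergeEvents check in the notes, everything else is illustrative.

```python
# Sketch of the mergeEvents logic: reject unless exactly 2 EPUs, then
# merge the per-EPU event streams into one time-ordered stream.
# The (timestamp, payload) event representation is an assumption.
import heapq

def merge_events(streams_by_epu):
    """streams_by_epu: {epu_id: [(timestamp, event), ...]}, each list
    already time-ordered. Raises if numEPUs != 2, per the notes."""
    if len(streams_by_epu) != 2:
        raise ValueError(f"expected exactly 2 EPUs, got {len(streams_by_epu)}")
    # heapq.merge lazily combines already-sorted inputs.
    return list(heapq.merge(*streams_by_epu.values()))
```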
Question: What issues arise with the Halfpipe that need to be addressed?
- Datagrams being omitted due to issue with merging logic.
- Runs broken up between deliveries. Datagrams fall through cracks.
- Has to be repiped.
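A simple cross-delivery gap check for the failure mode above might look like this; tracking datagrams by per-run sequence number is an assumed bookkeeping scheme, not the real Halfpipe data model.

```python
# Sketch of a gap finder for runs split across deliveries: given the
# sequence numbers actually processed for one run, report the missing
# ranges that would need to be repiped. Sequence-number bookkeeping
# is an assumption for illustration.
def find_gaps(seen_seq_numbers):
    """Return missing (start, end) inclusive ranges between the
    smallest and largest sequence numbers seen."""
    gaps = []
    seqs = sorted(seen_seq_numbers)
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps
```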
ISOC Software Monitoring:
Also known as....what does Steve T. need to do and how often?
https://confluence.slac.stanford.edu/display/SAS/ISOC+software+summary
- Monitor the backup process
- 60 days of data on disk, rest on xrootd server
- Archive of FastCopy packages backed up to tape
- Raw data backed up to tape outside of xrootd
- Not uncommon for a job to be terminated for no reason and have to be restarted
- Check once a week, commonly needs intervention
- Partly due to TRSchron; a change in the Kerberos implementation
- Jobs get AFS tokens but no Kerberos tokens; the batch job semi-fails.
- Backup works but cleanup fails.
- Only operating at ~60% of capacity… 1/2% per week.
- Documented in the problem report submitted. Jim also complained.
- Regular maintenance documented on a page for Jim Panetta that explains how to check up on data maintenance tasks.
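The token problem above suggests guarding the cleanup step explicitly. A hedged sketch, using the standard MIT Kerberos `klist -s` ticket check; wiring this into the actual TRSchron job is hypothetical, not the current implementation.

```python
# Sketch: skip cleanup when a batch job has no Kerberos ticket, instead
# of letting it semi-fail partway through (backup works, cleanup fails).
# The `run` parameter is injectable for testing.
import subprocess

def have_kerberos(run=subprocess.run):
    # `klist -s` exits 0 only when a valid ticket cache exists.
    return run(["klist", "-s"]).returncode == 0

def cleanup_if_credentialed(cleanup, run=subprocess.run):
    """Only attempt cleanup when Kerberos credentials are present."""
    if not have_kerberos(run):
        return "skipped: no Kerberos ticket"
    return cleanup()
```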
...
/afs/slac/g/glast/ground/releases/calibrations/TKR/
ACD calibration code location: /nfs/farm/g/glast/tak/releases/GR-20-09-10.
ASP Discussion (Jim Chiang):
Current status at: http://glast-ground.slac.stanford.edu/DataProcessing/
...
Infrastructure Maintenance
RHEL5 issues:
Virtualization:
- What needs virtualization?
- Halfpipe sounds like a candidate.
- No, it runs on RHEL6. But unlikely to move beyond. So yes, virtualize at RHEL6.
- GlastRelease is also stuck on RHEL6
- A couple of APIs need Qt, using the commercial version
- Release Manager uses the free version of Qt
- Unsure why the commercial version is used.
- Might be worth exploring a move to the free version
- Need to have a discussion about FastCopy, as it requires RHEL5.
- ISOC ops boxes are mostly under RHEL5. Demonstrated that the tools can be run under RHEL6.
- Backup ISOC is no longer supported.
What kind of virtualization?
Flight Software:
...
VM or container?
GlastRelease:
- GlastRelease needs virtualization
- RHEL 6 is last release that we have the personnel to support
- A few people running GlastRelease (Developers) - nice use case for Docker. Getting GlastRelease to run on your laptop is painful.
- GlastRelease carries around geant4
- Is there a distinction between Users and Developers for GlastRelease?
- No
- Focus with ScienceTools is just ease of distribution
- Would it be useful to distribute the tools in VMs? Containers? Both?
- Are there external dependencies (like xrootd) that would cause problems with virtualization if the backend changes?
We need an automated build system for ST: Release Manager vs. manual builds
- GR uses xrootd; ST does not (Eric)
- Use of virtualization is for convenience - which is most useful thing to do? (Richard)
- Don't depend on NFS/AFS if the container is built right. xrootd is stable for data
- getting files/libraries and also output data.
- Container helps with the diffuse model
- on nodes, not on NFS
- on nodes there's low overhead.
- Caching the image on all of the nodes.
- Fermi ST image will have the diffuse model in it.
Release Manager: doesn't talk to Oracle, but it does talk to a database. Not user friendly.
- For the SLAC farm - docker containers for GlastRelease. Need a docker registry
- Docker containers are the right solution for the batch farm (Brian)
Use their system to run an RHEL6 container, but the batch host is RHEL7.
- Carefully build container (nice with xrootd)
- need to find out from Warren if FT1, FT2 files included (Richard)
What systems need what kinds of containers?
- Samuel needed to discuss w/simulations at Lyon. (He is sick today)
- What is different for developers/users?
- Same image for all the GR uses.
- Don't want to pull a 3 GB image just to pull FT1; GR is 3x bigger. Just have 1 image at the moment.
- One giant image - good command line interface installed in that image.
- Images built such that the top looks the same between GR and ST. Keep same image.
- Separate builds for debugging purposes?
- GlastRelease is frozen, ST is constantly evolving. Debugging GR is not a problem, debugging ST is important
- Giacomo
- Mount code at runtime, container doesn't have debugging tools.
- Container provides environment.
- Compile inside the container.
- run debugger inside container.
- User image has everything - compiled.
- Lightweight container for developers then they can compile. Users have full compiled.
- Debugging in GR and ST is very different
- The computing center will have a docker cache.
- Every project will be asked what docker images it wants on the batch nodes.
- Plan for managing cached images. Work out allocations for collaborations.
- Cost of using docker?
Pipeline:
- Needs someone he could show the pipeline code to and train to do the heavy lifting when it comes to kicking the pipeline
- Docker containers for something like the batch system may cause some problems, since for something like the L1 pipeline, a number of images would need to be launched simultaneously
- Would the size of the software cause problems with deployment?
- We would need a system that restricts loading images onto the batch farm to prevent collisions/problems
- There is probably a precedent for this; however, Matt has no experience deploying at this scale
- An image file size of ~1 GB is best; a few GB is manageable for production.
- IT dept supportive of docker@SLAC. There is 1 machine with RHEL7
- Lyon is a much larger computing center - likely they will upgrade to Docker first
- Now full support for Docker at Lyon (Fred)
Infrastructure:
- Last purchase went into the dev cluster
- many nodes @ RHEL6; upgrade to RHEL7 and do docker with this
- Still getting NFS/AFS sorted out with RHEL7. GPFS?
- It's good to come up with a plan because of the security implications if NFS is underneath.
- Use right docker (UID issues w/security)
- SLAC will give us a few nodes for testing docker. Fall-back way to install on user machines. (Brian)
- AFS on RHEL6 docker
- can read files if world-readable.
- NFS is hardest.
- Timeline for RHEL7, 12mo? 2018? (Matt)
- RHEL7 support is dodgy.
- Configuration stuff is hard part
Flight Software:
- Julie: No path to having anyone other than SLAC supporting flight software
LAT On-board Configuration:
If we desired to change the on-board configuration, what happens?
- Jim Panetta knew the most about that. Took knowledge with him. But see this link.
- Gregg Thayer can do that nominally.
- Handshake in ISOC and GlastRelease has to be done as well
- MOOT/MOOD Table where the key is stored before it’s transmitted to flight software.
- Seems like it takes a while for the ground system to catch up before we can use on instrument.
From Gregg:
- Forming the input to the configuration build system is the least remembered part
- System for turning the crank and building configurations is fine
- Instructions for turning crank may need work
- Then need to check content of product before upload
- May 2010: Halt in data processing due to MOOT key mismatch with MPT
- Do we know how to handle the MPT?
- Gregg…yes.
Mission Planning/Flight Operations
Actions:
- Finalize and document list of needed permissions on the ISOC Mission Planning page
- Get Fermi Sharepoint access for NASA people (SLAC windows account is not enough)
- Robin/Elizabeth/Jerry to propose round robin schedule for weekly mission planning
- document support cron scripts for SAA checks and planning product generation
- document occasional mission-week-boundary LAT command scheduling problem
- NASA planners to take over LCI calibrations planning, effective immediately
- FSSC to consider having LAT operations workstation/tool installation
September 2017 Actions:
- any?
Routine Flight Operations Tasks
- Monitoring, Trending, Reporting: for Weekly LAT Science reports and Quarterly LAT Science reports
Actions:
- improve documentation in Confluence
- make use of API to Google sheets for auto-updating LAT SSR usage and LAT trigger histories
- move LAT monitoring tools and data files from /afs/slac/u/gl/rac/LATmetrics/ to /afs/slac/g/glast/isoc/flightOps/LATmetrics/ for easier shared use and support
- Fix known bugs in agepoly.pl and x4saa.pl
- re-discover use of the LAT Configuration GUI tool, and document it
- migrate other Excel spreadsheet usage to non-Excel implementations
- TKR on-board and ground bad strip history trending: need a solution (ipython notebook?) that makes time plots, and also 4x4 grid-based info output
- CAL light output history trending
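For the TKR bad-strip trending item above, the 4x4 grid-based output could start from folding per-tower counts into the LAT's 16-tower layout; the row-major tower-to-grid mapping here is an assumption for illustration.

```python
# Sketch: arrange per-tower bad-strip counts into the 4x4 grid the
# notes ask for. The row-major mapping of tower IDs 0..15 onto the
# grid is assumed, not taken from existing tooling.
def tower_grid(bad_strip_counts):
    """bad_strip_counts: {tower_id 0..15: count} -> 4x4 nested list,
    with missing towers defaulting to 0."""
    return [[bad_strip_counts.get(row * 4 + col, 0) for col in range(4)]
            for row in range(4)]
```

A time-plot layer (e.g. one such grid per calibration epoch, fed to an ipython notebook) could then be built on top of this.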
...
Mission Planning notes from Elizabeth:
...