Notes from Closeout Session: 


DOE Commitment:

  • 1.5 FTEs after 2018, down from 3 FTEs. Needs to be complete by end of FY ’18. Gradual ramp-up and transition desired.
  • Hardware purchasing? Server infrastructure, databases, etc.
  • IFC still funding hardware purchases
  • Hardware will stay at SLAC
  • ORACLE server migration soon

Want to prepare for a long-term mission continuation.

Revise if appropriate and effort not excessive.

https://docs.google.com/document/d/1Ka4CkBBdhAkowJEzxd8_LyQCbSI2po41P4Tsrp6xviM


Science Tools

Transitioning to Git: 

Joanne foresees a large amount of work in transitioning to Git, because the SLAC CVS structure involves large numbers of symlinked packages. She is concerned about possible drift between GlastRelease and ScienceTools, since both contain overlapping tools.

We settled on keeping the release manager/CVS system the same for now, but having all commits go through Git. An automated script/utility would mirror any Git changes to CVS and would be the only 'user' with commit privileges on the CVS repository, in order to prevent collisions.
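The one-way bridge above could be sketched as a small helper that turns a Git commit's file changes (as `git diff --name-status` would report them) into the CVS command sequence the bridge account would replay. The function name and the `(status, path)` input shape are illustrative, not an agreed design:

```python
# Hypothetical sketch of the proposed one-way Git -> CVS bridge.
# 'changes' is a list of (status, path) pairs with status in {'A', 'M', 'D'},
# mirroring "git diff --name-status" output for one commit.

def cvs_commands_for_commit(changes, message):
    """Return the CVS command lines needed to replay one Git commit."""
    cmds = []
    touched = []
    for status, path in changes:
        if status == "A":
            cmds.append(["cvs", "add", path])
        elif status == "D":
            cmds.append(["cvs", "remove", "-f", path])
        # 'M' needs no per-file step; the commit below picks it up.
        touched.append(path)
    # One commit for the whole changeset, issued as the sole CVS committer.
    cmds.append(["cvs", "commit", "-m", message] + touched)
    return cmds
```

Running these commands under a single dedicated account is what would enforce the "only one CVS user" rule from the notes.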

Updates from Tuesday Morning:

Brian and Jim began the process of porting CVS to Git using the cvstogit utility:

  • In principle, you can restrict the groups that have write permission
  • The newer CVS-to-Git conversion seems to retain the history (it is based on the SVN-to-Git utility)
  • The main hurdle is the permissions problem
    • Need to restrict CVS check-ins to the Git script only
    • Need to allow users to access the CVS users area without affecting/conflicting with Git/CVS check-ins
  • Submodules/details not sorted yet; just prepping for the transition now by moving the repo to an SSD

Brian update:

  • Took the list and made a CSV file for the table on Confluence; grepped through the repo for needed files
  • Copied the entire CVS repo locally to a machine in order to ease the transition
  • Doesn't anticipate too many problems with this, but there are a few hiccups like author matching
  • Org: fermi-lat
  • Alex and Joe will be added to the fermi-lat team on GitHub
  • Test conversion of Likelihood (github.com/brianv0/Likelihood)

 

Externals:

Generally in favor of jettisoning most external packages. Python can be provided by users; no need to carry our own versions around.

IRF generation requires ROOT; nothing else does. It can be removed from the main FSSC ScienceTools for the public release; however, it would be good to keep it around for internal/developer use.

Updates from Tuesday Morning:

  • Went through the externals and determined the status/necessity of each package
  • Also determined whether we needed to provide packages or whether they were available in conda
  • ScienceTools Package Author Review

 

Other Notes About Science Tools:

  • Many pieces of analysis rely on ST being backwards-compatible.
  • Is the FSSC going to be doing more than just bug-fixes?
  • Will there be more development?
  • Fermipy unit tests catch a lot of issues
    • Advisable to run after Likelihood updates

...


Data Flow

Data Pipeline:

  • Data arrives from GSFC
  • L0 ingest triggered by FastCopy post-processing script (into database)

Fast copy: https://confluence.slac.stanford.edu/display/ISOC/FASTCopy+processing+chain

processsci.py (?? I didn't really catch this)

  • Launches gap finder to look for missing pieces
  • Divides delivery into chunks and writes an XML file
  • Up to 20 chunks/delivery; usually ~12 or so
  • Halfpipe keeps track of what has been delivered
  • Once tidy & ordered, hands off to L1
  • L1 registers in the data catalog & sends FITS to GSFC
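The chunking step described above (up to 20 chunks per delivery) can be illustrated with a minimal sketch; the real processsci.py also writes the per-chunk XML description, which is omitted here, and the function name is made up:

```python
import math

MAX_CHUNKS = 20  # "up to 20 chunks/delivery" from the notes

def chunk_delivery(datagrams, max_chunks=MAX_CHUNKS):
    """Split a delivery into at most max_chunks contiguous chunks.

    Hypothetical stand-in for the chunking step of processsci.py.
    """
    if not datagrams:
        return []
    n_chunks = min(max_chunks, len(datagrams))
    size = math.ceil(len(datagrams) / n_chunks)  # roughly even chunk size
    return [datagrams[i:i + size] for i in range(0, len(datagrams), size)]
```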


Halfpipe (Steve Tether, maybe Gregg?):

Halfpipe has a specific Operations Task - gap reporting? (from Rob)

Warren: No, occurs before the Halfpipe…during FastCopy post-processing

Halfpipe reads chunked files:

  • Extracts events from the datagrams
    • Requires exactly 2 EPUs, otherwise the chunk will be rejected (in mergeEvents)
    • i.e., if numEPUs ≠ 2, stop
  • Merges data from the EPUs into a single time-ordered stream
    • Necessary for downstream processing
  • Launches the Level 1 pipeline
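The merge step above can be sketched as follows, assuming events are (timestamp, payload) tuples and each per-EPU stream is already time-ordered; mergeEvents' real data structures are certainly different:

```python
import heapq

def merge_events(streams_by_epu):
    """Merge per-EPU event streams into one time-ordered stream.

    Sketch of the mergeEvents behaviour described above: exactly two EPU
    streams are required, otherwise the chunk is rejected.
    """
    if len(streams_by_epu) != 2:  # "requires 2 and exactly 2 EPUs"
        raise ValueError(f"expected 2 EPUs, got {len(streams_by_epu)}")
    # heapq.merge assumes each input stream is already sorted by timestamp.
    return list(heapq.merge(*streams_by_epu.values()))
```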

 

Question: What issues arise with the Halfpipe that need to be addressed?

  • Datagrams being omitted due to an issue with the merging logic.
    • Runs get broken up between deliveries; datagrams fall through the cracks.
    • The data has to be re-piped.
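The gap finder mentioned in the FastCopy post-processing notes can be illustrated by a sketch that reports missing ranges in a sequence of datagram numbers; the real tool's inputs and bookkeeping are assumptions here:

```python
def find_gaps(seen_sequence_numbers):
    """Report (first_missing, last_missing) ranges in a datagram sequence.

    Illustrative version of the gap finder; per the notes, the real one runs
    during FastCopy post-processing, before the Halfpipe.
    """
    gaps = []
    seen = sorted(set(seen_sequence_numbers))
    for prev, cur in zip(seen, seen[1:]):
        if cur - prev > 1:  # a hole between consecutive seen numbers
            gaps.append((prev + 1, cur - 1))
    return gaps
```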

 

Pipeline Monitoring (Warren & Michael Kuss): 

  • Current shift scheme:
    • Warren & Michael each take responsibility for half a day.
    • Maria Elena covers when Warren is on vacation.
    • No coverage for Michael.
  • Large disorganized document:
    https://confluence.slac.stanford.edu/display/ds/Things+to+know+while+on-call+for+Data+Processing
    • Needs to be split into child pages and updated
    • As the new person gets trained, it would be good to start working on reorganization
    • Some work is done via the command line to interact with LSF (batch submission).
    • Most (95%) is done via the web interface.
    • Luca L. was looking at adding pipeline monitoring for the young Torino folks.

 

ISOC Software Monitoring: 

Also known as: what does Steve T. need to do, and how often?

https://confluence.slac.stanford.edu/display/SAS/ISOC+software+summary

  • Monitor the backup process
    • 60 days of data on disk, the rest on the xrootd server
    • Archive of FastCopy packages backed up to tape
    • Raw data backed up to tape outside of xrootd
    • Not uncommon for a job to be terminated for no reason and have to be restarted
    • Check once a week; commonly needs intervention
      • Partly due to TRSchron. Change in the Kerberos implementations.
      • Jobs get AFS tokens but no Kerberos tokens, so the batch job semi-fails.
      • Backup works but cleanup fails.
      • Only operating at ~60% of capacity… 1/2% per week.
    • Documented in the problem report submitted. Jim also complained.
  • Regular maintenance is documented on a page for Jim Panetta that explains how to check up on data maintenance tasks.
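The weekly check above (jobs going stale, cleanup failing even when the backup itself works) could in principle be automated. A sketch with entirely made-up record fields, since the real state lives in the pipeline/batch system rather than a dict:

```python
import time

def jobs_needing_intervention(job_records, max_age_days=7):
    """Flag backup jobs that look stuck, per the weekly check in the notes.

    job_records: list of dicts with 'name', 'last_success' (epoch seconds),
    and 'cleanup_ok' (bool). These field names are hypothetical.
    """
    now = time.time()
    flagged = []
    for job in job_records:
        stale = now - job["last_success"] > max_age_days * 86400
        # "Backup works but cleanup fails" also warrants intervention.
        if stale or not job["cleanup_ok"]:
            flagged.append(job["name"])
    return flagged
```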

...

  • The CHS package and offline were not touched
  • All others had to be touched in some way
  • Depends on a lot of external packages
  • Unlikely to be portable past RHEL6
  • Currently using the RHEL5 release under RHEL6
  • RHEL5 is very near end of life

 

FSSC (Don) will review Steve T.'s page describing what he does on a regular basis.


Data Distribution:

Get rid of FastCopy?

...

/afs/slac/g/glast/ground/releases/calibrations/TKR/

ACD calibration code location: /nfs/farm/g/glast/tak/releases/GR-20-09-10.



ASP Discussion (Jim Chiang):

Current status at: http://glast-ground.slac.stanford.edu/DataProcessing/

...

If I think of other things to document, I'll check the text into that repo and will let you know about it.   Plus there are a number of maintenance scripts that I need to add to that repo...



Data Reprocessing (Maria Elena):

  • The cutoff date is vital.
  • Make a new version of all packages, even if there are no changes
  • Test L1 on the dev pipeline to determine that all is OK
  • Tom produces all the reprocessed files on a system that requires no monitoring.
  • Start shipping reprocessed files to the FSSC (database queries by hand)
    • Ship based on file size (~GB range, 4-8 at a time)
    • Takes several weeks
    • 1 week for LS1, 1 day for FT1
  • Switch over the L1 pipeline XML at a specific date/time. New version & new processing numbers.
  • Tom reprocesses the gap data ("backfill")
    • 1-2 days up to 2 months, depending on how detailed the reprocessing is
  • Send backfill data by hand
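The hand-run shipping step above (files sent 4-8 at a time, roughly in the GB range) can be sketched as a simple batching helper. The exact limits are guesses from the notes, and the function is illustrative only:

```python
def batch_by_size(files, max_files=8, max_bytes=8 * 1024**3):
    """Group (name, size_in_bytes) pairs into shipments.

    Hypothetical helper mirroring the manual shipping step: at most
    max_files per shipment, and at most max_bytes total per shipment.
    """
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        # Start a new batch if adding this file would exceed either limit.
        if current and (len(current) >= max_files or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```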

...

 


Tracker Recon (was Tracy Usher):

  • No one else understands it
  • Eric Charles knows the downstream work
  • Need to figure out how to handle this going forward

 



Datasets (Richard):

Question is what are people using?

  • Short-timescale dataset (no ISOC change needed)
  • Catalog-like dataset
  • Electron data
  • Merit files (long-term); may be useful for finding something novel in the data
  • Cosmic-ray data
  • Monte Carlo datasets, for generating new event classes (usable in 15 years? Can they be put in context?)

Maybe C&A can ask the collaboration what they need that isn’t being provided by the FSSC.

  • If reasonable, can then look at adding datasets
  • Ask the FUG? Not initially planning to make them public.

 

 

Infrastructure Maintenance

 Virtualization: 

  • What needs virtualization?
    • Halfpipe sounds like a candidate.
      • No, it runs on RHEL6, but it is unlikely to move beyond that. So yes, virtualize at RHEL6.
    • GlastRelease is also stuck on RHEL6
    • A couple of APIs need Qt, using the commercial version
      • Release Manager uses the free version of Qt
      • Unsure why the commercial version is used.
      • Might be worth exploring a move to the free version
  • Need to have a discussion about FastCopy, as it requires RHEL5.
  • ISOC ops boxes are mostly under RHEL5. It has been demonstrated that the tools can run under RHEL6.
  • The backup ISOC is no longer supported.

What kind of virtualization?  VM or container?

GlastRelease: 

  • GlastRelease needs virtualization
    • RHEL6 is the last release that we have the personnel to support
    • A few people run GlastRelease (developers) - a nice use case for Docker. Getting GlastRelease to run on your laptop is painful.
    • GlastRelease carries around Geant4
  • Is there a distinction between users and developers for GlastRelease?
    • No

 

Science Tools:

 

  • The focus with ScienceTools is just ease of distribution
  • Would it be useful to distribute the tools in VMs? Containers? Both?
  • Are there external dependencies (like xrootd) that would cause problems with virtualization if the backend changes?
  • We need an automated build system for ST: Release Manager vs. manual builds
  • GR uses xrootd; ST does not (Eric)
  • Use of virtualization is for convenience - which is the most useful thing to do? (Richard)

 

    • Don't depend on NFS/AFS if the container is built right. xrootd is stable for data.
    • Covers getting files/libraries and also output data.
    • A container helps with the diffuse model
      • It lives on the nodes, not on NFS
      • On the nodes there's low overhead.
      • Cache the image on all of the nodes.
      • The Fermi ST image will have the diffuse model in it.
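The idea above (bake the diffuse model into the Fermi ST image so batch nodes don't pull it from NFS) might look something like the following minimal Dockerfile sketch. The base image, paths, and environment variable names are all assumptions, not an agreed layout:

```dockerfile
# Hypothetical Fermi ScienceTools image with the diffuse model baked in,
# so batch nodes need neither NFS nor AFS at runtime.
FROM centos:6

# ScienceTools build installed into the image (assumed location)
COPY ScienceTools/ /opt/sciencetools/
# Galactic diffuse + isotropic templates shipped inside the image
COPY diffuseModels/ /opt/diffuseModels/

# Illustrative variable names, not the real ST environment setup
ENV FERMI_DIR=/opt/sciencetools \
    FERMI_DIFFUSE_DIR=/opt/diffuseModels
ENV PATH=$FERMI_DIR/bin:$PATH

CMD ["/bin/bash"]
```

Baking the model in trades a larger image (cached once per node) for the removal of a shared-filesystem dependency, which matches the low-overhead argument in the notes.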

Release Manager: the Release Manager doesn't talk to Oracle - but it does talk to a database. Not user-friendly.

  • For the SLAC farm - Docker containers for GlastRelease. Need a Docker registry.
  • Docker containers are the right solution for the batch farm (Brian)
  • Use their system to run an RHEL6 container, while the batch host is RHEL7.
    • Carefully build the container (plays nicely with xrootd)
  • Need to find out from Warren whether FT1, FT2 files are included (Richard)

What systems need what kinds of containers?

  • Samuel needs to discuss with the simulations folks at Lyon. (He is sick today.)
  • What is different for developers vs. users?
  • Same image for all the GR uses.
  • Don't want to pull a 3 GB image just to pull FT1; GR is 3x bigger. Just have one image at the moment.
  • One giant image - a good command-line interface installed in that image.
  • Images built such that the top looks the same between GR and ST. Keep the same image.
  • Separate builds for debugging purposes?
  • GlastRelease is frozen; ST is constantly evolving. Debugging GR is not a problem; debugging ST is important.
  • Giacomo:
    • Mount the code at runtime; the container doesn't have debugging tools.
    • The container provides the environment.
    • Compile inside the container.
    • Run the debugger inside the container.
    • The user image has everything, compiled.
  • Lightweight container for developers so they can compile; users get the fully compiled one.
  • Debugging in GR and ST is very different
  • The computing center will have a cache of Docker images.
  • Every project will say which Docker images it wants on the batch nodes.
  • Plan for managing cached images. Work out allocations for collaborations.
  • Cost of using Docker?

 

Pipeline:

 

  • Needs someone he can show the pipeline code to and train to do the heavy lifting when it comes to kicking the pipeline
  • Docker containers for something like the batch system may cause some problems, since for something like the L1 pipeline a number of images would need to be launched simultaneously
  • Would the size of the software cause problems with deployment?
  • We would need a system where loading images onto the batch farm is restricted, to prevent collisions/problems
  • There is probably a precedent for this; however, Matt has no experience deploying at this scale
  • A file size of ~1 GB is best; a few GB is manageable for production.
  • The IT dept is supportive of Docker at SLAC. There is 1 machine with RHEL7.
  • Lyon is a much larger computing center - likely they will upgrade to Docker first
    • Now full support for Docker at Lyon (Fred)

Infrastructure:

  • The last purchase went into the dev cluster
    • Many nodes @ RHEL6; upgrade to RHEL7 and do Docker with these
    • Still getting NFS/AFS sorted out with RHEL7. GPFS?
  • It's good to come up with a plan because of the security implications if NFS is underneath.
    • Use the right Docker setup (UID issues w/ security)
  • SLAC will give us a few nodes for testing Docker. The fallback is to install on user machines. (Brian)
    • AFS on RHEL6 Docker
    • Can read files if world-readable.
    • NFS is hardest.
  • Timeline for RHEL7: 12 mo? 2018? (Matt)
    • RHEL7 support is dodgy.
    • The configuration stuff is the hard part



Flight Software:

  • Julie: No path to having anyone other than SLAC supporting flight software

LAT On-board Configuration:

If we wanted to change the on-board configuration, what would happen?

  • Jim Panetta knew the most about that and took the knowledge with him. But see this link.
  • Gregg Thayer can nominally do it.
  • A handshake in the ISOC and GlastRelease has to be done as well
  • MOOT/MOOD: the table where the key is stored before it's transmitted to the flight software.
  • It seems like it takes a while for the ground system to catch up before we can use it on the instrument.

From Gregg:

  • Forming the input to the configuration build system is the least-remembered part
  • The system for turning the crank and building configurations is fine
  • The instructions for turning the crank may need work
  • Then the content of the product needs to be checked before upload
  • May 2010: Halt in data processing due to a MOOT key mismatch with the MPT
    • Do we know how to handle the MPT?
    • Gregg… yes.

Mission Planning/Flight Operations

Actions:

  • Finalize and document list of needed permissions on the ISOC Mission Planning page
  • Get Fermi Sharepoint access for NASA people (SLAC windows account is not enough)
  • Robin/Elizabeth/Jerry to propose round robin schedule for weekly mission planning
  • document support cron scripts for SAA checks and planning product generation
  • document occasional mission-week-boundary LAT command scheduling problem
  • NASA planners to take over LCI calibrations planning, effective immediately
  • FSSC to consider having LAT operations workstation/tool installation

September 2017 Actions:

  • any?

Routine Flight Operations Tasks

  • Monitoring, Trending, Reporting: for Weekly LAT Science reports and Quarterly LAT Science reports

Actions:

  • improve documentation in Confluence
  • make use of the Google Sheets API for auto-updating LAT SSR usage and LAT trigger histories
  • move LAT monitoring tools and data files from /afs/slac/u/gl/rac/LATmetrics/ to /afs/slac/g/glast/isoc/flightOps/LATmetrics/ for easier shared use and support
    • fix known bugs in agepoly.pl and x4saa.pl
  • re-discover the use of the LAT Configuration GUI tool, and document it
  • migrate other Excel spreadsheet usage to non-Excel implementations
    • TKR on-board and ground bad-strip history trending: need a solution (IPython notebook?) that makes time plots, and also 4x4 grid-based info output
    • CAL light-output history trending

...

Mission Planning notes from Elizabeth:

...