Notes from Closeout Session: https://docs.google.com/document/d/1Ka4CkBBdhAkowJEzxd8_LyQCbSI2po41P4Tsrp6xviM

DOE Commitment:

  • 1.5 FTEs after 2018, down from 3 FTEs. Needs to be complete by end of FY ’18. Gradual ramp-up and transition desired.
  • Hardware purchasing? Server infrastructure, databases, etc.
  • IFC still funding hardware purchases
  • Hardware will come to stay at SLAC
  • ORACLE server migration soon

...

  • Went through externals and determined statuses/necessity of each package
  •  Also determined whether we needed to provide packages ourselves or whether they were available in conda
  • ScienceTools Package Author Review

 

Other Notes About Science Tools:

  • Many pieces of analysis rely on ST being backwards-compatible.
  • Is the FSSC going to be doing more than just bug-fixes?
  • Will there be more development?
  • Fermipy unit tests catch a lot of issues
    • Advisable to run after Likelihood updates


Data Flow

Data Pipeline:

  • Data arrives from GSFC
  • L0 ingest triggered by FastCopy post-processing script (into database)
  • Halfpipe keeps track of what has been delivered
  • Once tidy & ordered, hands off to L1
  • L1 registers in data catalog & sends FITS to GSFC
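The flow above can be sketched roughly as follows; every function and field name here is invented for illustration and does not come from the real pipeline code:

```python
# Hypothetical sketch of the L0 -> L1 handoff described above;
# all names are invented, not from the real pipeline code.

def ingest_delivery(delivery_files):
    """L0 ingest: record the delivery (triggered by the FastCopy
    post-processing script)."""
    return {"files": sorted(delivery_files), "status": "ingested"}

def halfpipe_track(delivery):
    """Halfpipe: keep track of what has been delivered, tidy & order it."""
    delivery["status"] = "ordered"
    return delivery

def launch_l1(delivery):
    """L1: register products in the data catalog and send FITS to GSFC."""
    assert delivery["status"] == "ordered"
    return {"catalog_entries": len(delivery["files"]), "sent_to_gsfc": True}

result = launch_l1(halfpipe_track(ingest_delivery(["r0001_c01.dat", "r0001_c00.dat"])))
```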

Pipeline Monitoring (Warren & Michael Kuss): 

  • Current shifting scheme:
    •  Warren & Michael each take responsibility for half a day.
    •  Maria Elena covers when Warren is on vacation.
    •  No coverage when Michael is away.
  • Large disorganized document 
    https://confluence.slac.stanford.edu/display/ds/Things+to+know+while+on-call+for+Data+Processing
    •  Needs to be split into child pages and updated
    •  As new person gets trained, would be good to start working on reorganization
    •  Some work done via command line to interact with LSF (batch submission). 
    •  Most (95%) done via the web interface.
    •  Luca L. was looking at adding pipeline monitoring for young Torino folks.

 

Halfpipe (Steve Tether, maybe Gregg?):

Halfpipe has a specific Operations Task - gap reporting? (from Rob)

Warren: No, it occurs before the Halfpipe, during FastCopy post-processing

 

Fast copy: https://confluence.slac.stanford.edu/display/ISOC/FASTCopy+processing+chain

processsci.py (exact name uncertain):

  •  Launches gap finder to look for missing pieces
  •  Divides delivery into chunks and writes xml file
  •  Up to 20 chunks/delivery. Usually ~12 or so
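A minimal sketch of the chunking step as described above (the function name and file naming are invented; the real processsci.py logic may differ):

```python
# Hypothetical sketch of the chunking step: one delivery's datagram
# files are divided into at most 20 chunks (names invented here).
MAX_CHUNKS = 20

def chunk_delivery(files, max_chunks=MAX_CHUNKS):
    """Divide a delivery into up to max_chunks roughly equal chunks."""
    if not files:
        return []
    n = min(max_chunks, len(files))
    size = -(-len(files) // n)  # ceiling division
    return [files[i:i + size] for i in range(0, len(files), size)]

# A large delivery caps out at 20 chunks.
chunks = chunk_delivery([f"dg_{i:04d}" for i in range(60)])
```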

Halfpipe reads chunked files:

  •  Extracts events from the datagrams
    •  Requires exactly 2 EPUs, otherwise the delivery is rejected (in mergeEvents)
    •  i.e. if numEPUs != 2, stop
  •  Merges data from EPUs into a single time-ordered stream
    •  Necessary for down-stream processing
  •  Launches Level 1 pipeline
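The merge step above could look roughly like this sketch, assuming each EPU stream is already time-ordered; the names are invented and this is not the real mergeEvents code:

```python
import heapq

# Hypothetical sketch of the merge step: events from exactly two EPUs
# are combined into a single time-ordered stream (names invented,
# not the real mergeEvents implementation).

def merge_events(epu_streams):
    """Stop unless exactly 2 EPUs contributed (numEPUs != 2 -> reject),
    then merge the already time-ordered (timestamp, event) streams."""
    if len(epu_streams) != 2:
        raise ValueError(f"expected exactly 2 EPUs, got {len(epu_streams)}")
    return list(heapq.merge(*epu_streams))

merged = merge_events([[(0, "a"), (2, "c")], [(1, "b"), (3, "d")]])
```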

 


Question: What issues arise with the Halfpipe that need to be addressed?

  •  Datagrams being omitted due to an issue with the merging logic.
    •  Runs get broken up between deliveries, and datagrams fall through the cracks.
    •  Affected data have to be repiped.
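One way to picture this failure mode, as a hedged sketch (the data layout and function name are invented): flag runs whose datagrams span more than one delivery as repipe candidates.

```python
from collections import defaultdict

# Hypothetical illustration: a run whose datagrams are split across two
# deliveries can fall through the cracks of per-delivery merging.
# (Function name and data layout are invented for this sketch.)

def runs_spanning_deliveries(deliveries):
    """deliveries: {delivery_id: [run_id, ...]}.
    Return run ids seen in more than one delivery."""
    seen = defaultdict(set)
    for delivery_id, run_ids in deliveries.items():
        for run_id in run_ids:
            seen[run_id].add(delivery_id)
    return {run for run, dids in seen.items() if len(dids) > 1}

# Run 101 straddles deliveries d1 and d2 -> repipe candidate.
spanning = runs_spanning_deliveries({"d1": [100, 101], "d2": [101, 102]})
```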

 


ISOC Software Monitoring: 

Also known as: what does Steve T. need to do, and how often?

https://confluence.slac.stanford.edu/display/SAS/ISOC+software+summary

  •  Monitor the backup process
    •  60 days of data on disk, rest on the xrootd server
    •  Archive of FastCopy packages backed up to tape
    •  Raw data backed up to tape outside of xrootd
    •  Not uncommon for a job to be terminated for no reason and have to be restarted
    •  Check once a week, commonly needs intervention
      •  Partly due to TRSchron and a change in the kerberos implementations
      •  Jobs get AFS tokens but no kerberos tokens; the batch job semi-fails. 
      •  Backup works but cleanup fails.
      •  Only operating at ~60% of capacity… 1/2% per week. 
    •  Documented in the problem report submitted. Jim also complained.
  •  Regular maintenance documented on a page for Jim Panetta that explains how to check up on data maintenance tasks.

...

/afs/slac/g/glast/ground/releases/calibrations/TKR/

ACD calibration code location: /nfs/farm/g/glast/tak/releases/GR-20-09-10.



ASP Discussion (Jim Chiang):

Current status at: http://glast-ground.slac.stanford.edu/DataProcessing/

...

Infrastructure Maintenance

RHEL5 issues:

 Virtualization: 

     
  • What needs virtualization?
    • Halfpipe sounds like a candidate.
      •  No, it runs on RHEL6. But unlikely to move beyond. So yes, virtualize at RHEL6.
    •  GlastRelease is also stuck on RHEL6
    •  A couple of APIs need Qt, using the commercial version
      •  Release Manager uses the free version of Qt
      •  Unsure why the commercial version is used.
      •  Might be worth exploring a move to the free version
    •  Need to have a discussion about FastCopy, as it requires RHEL5.
    •  ISOC ops boxes are mostly under RHEL5. Demonstrated that the tools can run under RHEL6.
    •  Backup ISOC is no longer supported.

    What kind of virtualization? 

    Flight Software:

    ...

    VM or container?

    GlastRelease: 

    •     GlastRelease needs virtualization
      • RHEL 6 is last release that we have the personnel to support
      • A few people running GlastRelease (Developers) - nice use case for Docker. Getting GlastRelease to run on your laptop is painful. 
      • GlastRelease carries around geant4
    • Is there a distinction between Users and Developers for GlastRelease? 
      • No

     

    Science Tools:

     

    • Focus with ScienceTools is just ease of distribution
    • Would it be useful to distribute the tools in VMs? Containers? Both?
    • Are there external dependencies (like xroot-d) that would cause problems with virtualization if backend changes?
    • We need an automated build system for ST: Release Manager vs. manual builds 

    • GR uses xrootd; ST does not (Eric)
    • Use of virtualization is for convenience - which is most useful thing to do? (Richard)

     

      • Don't depend on NFS/AFS if the container is built right; xrootd is stable for data
      • Covers getting input files/libraries and also writing output data. 
      • Container helps with the diffuse model
        • The model lives on the nodes, not on NFS
        • On the nodes there's low overhead. 
        • Cache the image on all of the nodes. 
        • Fermi ST image will have the diffuse model in it. 

    Release Manager: Release manager doesn't talk to Oracle - but it does talk to a database. Not user friendly. 

    • For the SLAC farm: Docker containers for GlastRelease. Need a Docker registry
    • Docker containers are the right solution for the batch farm (Brian) 
    • Use their system to run a RHEL6 container, even though the batch host is RHEL7.
      • Carefully build the container (works nicely with xrootd)
    • Need to find out from Warren whether FT1, FT2 files are included (Richard)

    What systems need what kinds of containers?

    • Samuel needs to discuss with the simulations group at Lyon. (He is sick today.)
    •  What is different for developers vs. users? 
    •  Same image for all the GR use cases. 
    •  Don't want to pull a 3 GB image just to get FT1; GR is 3x bigger. There is just one image at the moment. 
    •  One giant image - good command line interface installed in that image. 
    •  Images built such that the top looks the same between GR and ST. Keep same image. 
    •  Separate builds for debugging purposes? 
    •  GlastRelease is frozen, ST is constantly evolving. Debugging GR is not a problem, debugging ST is important
    • Giacomo
      • Mount code at runtime, container doesn't have debugging tools. 
      • Container provides environment. 
      • Compile inside the container. 
      • run debugger inside container. 
      • User image has everything - compiled. 
    • Lightweight container for developers, who then compile themselves; users get the fully compiled image. 
    • Debugging in GR and ST is very different
    • The computing center will have a cache of Docker images. 
    • Every project will be asked which Docker images it wants on the batch nodes. 
    • Plan for managing cached images. Work out allocations for collaborations. 
    • Cost of using Docker? 

     

    Pipeline:

     

    • Needs someone he could show the pipeline code to, and train to do the heavy lifting when it comes to kicking the pipeline
    • Docker containers for something like the batch system may cause some problems, since for something like the L1 pipeline a number of images would need to be launched simultaneously
    •  Would size of the software cause problems with deployment?
    • We would need a system where you restrict loading images to the batch farm to prevent collisions/problems
    • There is probably a precedent for this, however, Matt has no experience deploying on this scale 
    • Image size of ~1 GB is best; a few GB is manageable for production. 
    • IT dept supportive of docker@SLAC. There is 1 machine with RHEL7
    • Lyon is a much larger computing center - likely they will upgrade to Docker first
      • Now full support for Docker at Lyon (Fred)

    Infrastructure:

    • Last purchase went into the dev cluster
      • Many nodes at RHEL6; upgrade to RHEL7 and do Docker with this
      • Still getting NFS/AFS sorted out with RHEL7. GPFS? 
    • It's good to come up with a plan because of the security implications if NFS is underneath. 
      • Use the right Docker version (UID issues with security)
    • SLAC will give us a few nodes for testing Docker. Fallback is to install on user machines. (Brian)
      • AFS on RHEL6 Docker
      • Can read files if world-readable. 
      • NFS is hardest. 
    • Timeline for RHEL7: 12 months? 2018? (Matt)
      • RHEL7 support is dodgy. 
      • Configuration stuff is the hard part



    Flight Software:

    •  Julie: No path to having anyone other than SLAC supporting flight software

    LAT On-board Configuration:

     If we desired to change the on-board configuration, what happens?

    •  Jim Panetta knew the most about that. Took knowledge with him. But see this link.
    •  Gregg Thayer can do that nominally.
    •  Handshake in ISOC and GlastRelease has to be done as well
    •  MOOT/MOOD Table where the key is stored before it’s transmitted to flight software.
    •  Seems like it takes a while for the ground system to catch up before we can use on instrument.

    From Gregg:

    •  Forming the input to the configuration build system is the least remembered part
    •  System for turning the crank and building configurations is fine
    •  Instructions for turning crank may need work
    •  Then need to check content of product before upload
    •  May 2010: Halt in data processing due to MOOT key mismatch with MPT
      •  Do we know how to handle the MPT?
      •  Gregg…yes.

    Mission Planning/Flight Operations

    Actions:

    • Finalize and document list of needed permissions on the ISOC Mission Planning page
    • Get Fermi Sharepoint access for NASA people (SLAC windows account is not enough)
    • Robin/Elizabeth/Jerry to propose round robin schedule for weekly mission planning
    • document support cron scripts for SAA checks and planning product generation
    • document occasional mission-week-boundary LAT command scheduling problem
    • NASA planners to take over LCI calibrations planning, effective immediately
    • FSSC to consider having LAT operations workstation/tool installation

    September 2017 Actions:

    • any?

    Routine Flight Operations Tasks

    • Monitoring, Trending, Reporting: for Weekly LAT Science reports and Quarterly LAT Science reports

    Actions:

    • improve documentation in Confluence
    • make use of API to Google sheets for auto-updating LAT SSR usage and LAT trigger histories
    • move LAT monitoring tools and data files from /afs/slac/u/gl/rac/LATmetrics/ to /afs/slac/g/glast/isoc/flightOps/LATmetrics/ for easier shared use and support
      • Fix known bugs in agepoly.pl and x4saa.pl
    • re-discover use of the LAT Configuration GUI tool, and document it
    • migrate other Excel spreadsheet usage to non-Excel implementations
      • TKR on-board and ground bad strip history trending: need a solution (ipython notebook?) that makes time plots, and also 4x4 grid-based info output
      • CAL light output history trending


    ...

    Mission Planning notes from Elizabeth:

    ...