Notes from Closeout Session: https://docs.google.com/document/d/1Ka4CkBBdhAkowJEzxd8_LyQCbSI2po41P4Tsrp6xviM

DOE Commitment:

  • 1.5 FTEs after 2018, down from 3 FTEs. The transition needs to be complete by the end of FY ’18. A gradual ramp-up and transition is desired.
  • Hardware purchasing? Server infrastructure, databases, etc.
  • IFC still funding hardware purchases
  • Hardware will stay at SLAC
  • ORACLE server migration soon

Want to prepare for a long-term mission continuation.

Revise if appropriate and effort not excessive.

  

Science Tools

Transitioning to Git: 

Joanne foresees a large amount of work in transitioning to git because the SLAC CVS structure involves large numbers of symlinked packages. She is also concerned about possible drift between GlastRelease and ScienceTools, since both have overlapping tools.

We settled on keeping the release manager/CVS system the same for now, but having all commits go through git. An automated script/utility would port any git changes to CVS and would be the only 'user' with commit privileges on the CVS repository, in order to prevent collisions.
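To make the intended workflow concrete, here is a minimal sketch of what that one-way mirroring utility could look like. All paths, the branch name, and the dedicated mirror account are assumptions for illustration; the real utility would also need to handle file additions, removals, and binary files.

    #!/usr/bin/env python
    """Hypothetical one-way git -> CVS mirror (sketch only).

    Assumes a dedicated account is the only one with CVS commit rights,
    so this script is the single CVS writer and collisions cannot occur.
    """
    import subprocess

    GIT_CLONE    = "/scratch/mirror/ScienceTools"      # placeholder path
    CVS_CHECKOUT = "/scratch/mirror/cvs/ScienceTools"  # placeholder path
    STATE_FILE   = "/scratch/mirror/last_mirrored_sha"

    def run(cmd, cwd):
        subprocess.run(cmd, cwd=cwd, check=True)

    def git_out(args):
        return subprocess.run(["git"] + args, cwd=GIT_CLONE, check=True,
                              capture_output=True, text=True).stdout

    def mirror_commit(sha):
        # Check out the git tree at this commit, copy it over the CVS
        # working copy, then commit under the dedicated CVS account.
        run(["git", "checkout", "--force", sha], cwd=GIT_CLONE)
        run(["rsync", "-a", "--exclude=.git", "--exclude=CVS",
             GIT_CLONE + "/", CVS_CHECKOUT + "/"], cwd=CVS_CHECKOUT)
        message = git_out(["log", "-1", "--format=%B", sha])
        run(["cvs", "commit", "-m", f"[git {sha[:8]}] {message}"],
            cwd=CVS_CHECKOUT)

    if __name__ == "__main__":
        last = open(STATE_FILE).read().strip()
        run(["git", "fetch", "origin"], cwd=GIT_CLONE)
        for sha in git_out(["rev-list", "--reverse",
                            f"{last}..origin/master"]).split():
            mirror_commit(sha)
            open(STATE_FILE, "w").write(sha)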

Updates from Tuesday Morning:

Brian and Jim began the process of porting CVS to git using the cvstogit utility

  • in principle you can restrict the groups that have write permission
  • the newer CVS-to-git utility (based on the svn-to-git utility) seems to retain the history
  • the main hurdle is the permissions problem
    • Need to restrict check-ins to CVS to only the git sync script
    • Need to allow users to access the CVS users area without affecting/conflicting with git/CVS check-ins
  • submodules/details not sorted out yet; just prepping for the transition now by moving the repo to SSD

Brian update:

  • Took the list and made a CSV file for the table on Confluence; grepped through the repo for needed files
  •  Copied the entire CVS repo locally to a machine in order to ease the transition
  •  Doesn't anticipate too many problems with this, but there are a few hiccups, like author matching (a possible approach is sketched after this list)
  •  GitHub org: fermi-lat
  •  Alex and Joe will be added to the fermi-lat team on GitHub
  •  Test conversion of Likelihood (github.com/brianv0/Likelihood)
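For the author-matching hiccup mentioned above, one possible approach is to generate an authors map (CVS username to git identity) that the conversion utility can consume. The CSV file name and its columns are hypothetical; the output format shown is the one several CVS/SVN-to-git converters accept.

    import csv

    # Hypothetical input: the Confluence author table exported as CSV with
    # columns "cvs_user" and "email"; file name and columns are assumptions.
    with open("package_authors.csv", newline="") as fin, \
         open("authors-map.txt", "w") as fout:
        for row in csv.DictReader(fin):
            cvs_user = row["cvs_user"].strip()
            email = row["email"].strip()
            name = email.split("@")[0].replace(".", " ").title()
            # Format accepted by several CVS/SVN-to-git converters:
            #   cvsuser = Full Name <email>
            fout.write(f"{cvs_user} = {name} <{email}>\n")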

 

Externals:

Generally in favor of jettisoning most external packages.  Python can be provided by users, no need to carry our own versions around.

IRF generation requires ROOT; nothing else does. It can be removed from the main FSSC ScienceTools for the public release; however, it would be good to keep it around for internal/developer use.

Updates from Tuesday Morning:

  • Went through the externals and determined the status/necessity of each package
  •  Also determined whether we needed to provide packages ourselves or whether they were available in conda (see the sketch after this list)
  • ScienceTools Package Author Review
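A rough sketch of how the "available in conda?" check could be scripted; the package list here is a placeholder for the real externals list from the Confluence table.

    import json
    import subprocess

    # Placeholder list of externals to check; the real list comes from the
    # package review table on Confluence.
    EXTERNALS = ["cfitsio", "fftw", "swig", "wcslib", "healpix"]

    for pkg in EXTERNALS:
        result = subprocess.run(["conda", "search", "--json", pkg],
                                capture_output=True, text=True)
        found = result.returncode == 0 and json.loads(result.stdout).get(pkg)
        status = "available in conda" if found else "needs to be provided"
        print(f"{pkg:12s} {status}")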

 

Other Notes About Science Tools:

  • Many pieces of analysis rely on ST being backwards-compatible.
  • Is the FSSC going to be doing more than just bug-fixes?
  • Will there be more development?
  • Fermipy unit tests catch a lot of issues
    • Advisable to run them after Likelihood updates (a minimal invocation is sketched below)
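A minimal way to invoke the Fermipy unit tests from Python after a Likelihood update, assuming Fermipy and pytest are installed in the current environment (test discovery options may differ in practice):

    import sys
    import pytest

    # Run the Fermipy test suite installed in the current environment; a
    # non-zero exit code indicates failures worth investigating.
    sys.exit(pytest.main(["--pyargs", "fermipy", "-v"]))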


Data Flow

Data Pipeline:

  • Data arrives from GSFC
  • L0 ingest triggered by FastCopy post-processing script (into database)

Fast copy: https://confluence.slac.stanford.edu/display/ISOC/FASTCopy+processing+chain

processsci.py (?? I didn't really catch this)

  •  Launches gap finder to look for missing pieces
  •  Divides the delivery into chunks and writes an XML file (sketched after this list)
  •  Up to 20 chunks/delivery. Usually ~12 or so
  • Halfpipe keeps track of what has been delivered
  • Once tidy & ordered, hands off to L1
  • L1 registers in data catalog & sends FITS to GSFC
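As an illustration of the chunking step described above, here is a sketch of dividing a delivery into chunks and writing an XML manifest. The element and attribute names are invented; the real schema is whatever processsci.py and the downstream L1 tasks actually use.

    import math
    import xml.etree.ElementTree as ET

    MAX_CHUNKS = 20  # per the notes: up to 20 chunks per delivery

    def write_chunk_manifest(delivery_id, files, out_path):
        """Split a delivery's files into chunks and write an XML manifest.

        Element and attribute names are illustrative only; the real schema
        is whatever the downstream L1 tasks expect.
        """
        n_chunks = min(MAX_CHUNKS, max(1, math.ceil(len(files) / 50)))
        per_chunk = math.ceil(len(files) / n_chunks)

        root = ET.Element("delivery", id=str(delivery_id))
        for i in range(n_chunks):
            chunk = ET.SubElement(root, "chunk", index=str(i))
            for name in files[i * per_chunk:(i + 1) * per_chunk]:
                ET.SubElement(chunk, "file", path=name)
        ET.ElementTree(root).write(out_path)

    # Example: 600 datagram files end up in ~12 chunks.
    write_chunk_manifest(42, [f"datagram_{i:04d}.evt" for i in range(600)],
                         "delivery_42_chunks.xml")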


Halfpipe (Steve Tether, maybe Gregg?):

Halfpipe has a specific Operations Task - gap reporting? (from Rob)

Warren: No, gap reporting occurs before the Halfpipe, during FastCopy post-processing.

Halfpipe reads chunked files:

  •  Extracts events from the datagrams
    •  Requires exactly 2 EPUs, otherwise the data will be rejected (in mergeEvents)
    •  i.e., if numEPUs != 2, stop
  •  Merges data from the EPUs into a single time-ordered stream (see the sketch after this list)
    •  Necessary for down-stream processing
  •  Launches Level 1 pipeline
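A toy sketch of the merging rule described above: exactly two EPU streams are required, and their events are merged into a single time-ordered stream. The real mergeEvents operates on datagrams; the tuple representation here is only for illustration.

    import heapq

    def merge_events(epu_streams):
        """Merge per-EPU event streams into one time-ordered stream.

        Mirrors the rule above: exactly two EPU streams are required,
        otherwise the data are rejected.  Events are assumed to be
        (timestamp, payload) tuples already ordered within each stream.
        """
        if len(epu_streams) != 2:
            raise ValueError(f"expected exactly 2 EPUs, got {len(epu_streams)}")
        return heapq.merge(*epu_streams, key=lambda event: event[0])

    # Example with two toy streams:
    epu0 = [(1.0, "a"), (3.0, "c")]
    epu1 = [(2.0, "b"), (4.0, "d")]
    for timestamp, payload in merge_events([epu0, epu1]):
        print(timestamp, payload)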

 

Question: What issues arise with the Halfpipe that need to be addressed?

  •  Datagrams being omitted due to an issue with the merging logic.
    •  Runs get broken up between deliveries, and datagrams fall through the cracks.
    •  The data then has to be repiped.

 

Pipeline Monitoring (Warren & Michael Kuss): 

  • Current shifting scheme:
    •  Warren & Michael each take responsibility for 1/2 day.
    •  Maria Elena covers when Warren is on vacation.
    •  No coverage for Michael.
  • Large disorganized document 
    https://confluence.slac.stanford.edu/display/ds/Things+to+know+while+on-call+for+Data+Processing
    •  Needs to be split into child pages and updated
    •  As new person gets trained, would be good to start working on reorganization
    •  Some work done via command line to interact with LSF (batch submission). 
    •  Most (95%) done via the web interface.
    •  Luca L. was looking at adding pipeline monitoring for young Torino folks.

 

ISOC Software Monitoring: 

Also known as: what does Steve T. need to do, and how often?

https://confluence.slac.stanford.edu/display/SAS/ISOC+software+summary

  •  Monitor the backup process
    •  60 days of data on disk, the rest on the xrootd server
    •  Archive of FastCopy packages backed up to tape
    •  Raw data backed up to tape outside of xrootd
    •  Not uncommon for a job to be terminated for no reason and have to be restarted
    •  Check once a week, commonly needs intervention
      •  Partly due to TRSchron and a change in the Kerberos implementations
      •  Jobs get AFS tokens but no Kerberos tickets, so the batch job semi-fails (a quick check is sketched after this list).
      •  The backup works but the cleanup fails.
      •  Only operating at ~60% of capacity… 1/2% per week. 
    •  Documented in the problem report submitted. Jim also complained.
  •  Regular maintenance is documented on a page for Jim Panetta that explains how to check up on data maintenance tasks.
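For the AFS-token-without-Kerberos-ticket failure mode noted above, a quick check along these lines could be added to the weekly look; the exact commands available on the batch hosts are an assumption.

    import subprocess

    def have_kerberos_ticket():
        # "klist -s" exits 0 only if a valid Kerberos ticket cache exists.
        return subprocess.run(["klist", "-s"]).returncode == 0

    def have_afs_token():
        # "tokens" lists live AFS tokens with an "Expires" field; parsing its
        # output this way is an assumption about the local AFS client.
        out = subprocess.run(["tokens"], capture_output=True, text=True).stdout
        return "Expires" in out

    if have_afs_token() and not have_kerberos_ticket():
        print("WARNING: AFS token present but no Kerberos ticket; "
              "the batch cleanup step is likely to fail.")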

Other issues that Steve T. deals with?

  •  Monitor event log for Error or Fatal messages in the last week.
    •  Usually they come in as pages via email
    •  Example: changing how the gap finder is triggered. It is now invoked by the science delivery.
    •  FSW errors go to Steve, Gregg, Jana
    •  For others, need to review the paging configuration to see who is contacted.
  •  Vetting of orbital event reports (a rough sketch follows this list).
    •  Would reject events if they shifted from one orbit to another.
    •  No more than 3 in any single report.
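A rough sketch of one reading of the vetting rules above; the report format and field names are invented for illustration.

    def vet_orbit_report(events, max_rejections=3):
        """Sketch of vetting an orbital event report.

        `events` is a list of dicts with hypothetical fields 'predicted_orbit'
        and 'actual_orbit'.  Events that shifted from one orbit to another are
        flagged for rejection; per the notes, no single report should carry
        more than `max_rejections` such rejections.
        """
        rejected = [e for e in events
                    if e["predicted_orbit"] != e["actual_orbit"]]
        return rejected, len(rejected) <= max_rejections

    # Example with one shifted event:
    report = [{"predicted_orbit": 101, "actual_orbit": 101},
              {"predicted_orbit": 101, "actual_orbit": 102}]
    rejected, report_ok = vet_orbit_report(report)
    print(len(rejected), "event(s) to reject; report ok:", report_ok)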

In migrating (what?) to RHEL6, how much was touched?

  •  CHS package and offline were not touched
  •  All others had to be touched in some way
  •  Depends on a lot of external packages.
  •  Unlikely to be portable past RHEL6
  •  Currently using RHEL5 release under RHEL6
  •  RHEL5 very near end of life

 

FSSC (Don) will review Steve T.’s page with information on what he does on a regular basis.


Data Distribution:

Get rid of FastCopy?

  • New version of FastCopy won't work with current hardware & vice versa
  • Use another transfer mechanism?
    • Would require changing all scripts downstream
    • Since everyone is having issues, may need to look at alternatives
  • New transfer mechanism would require many changes across the players
    • SSC, White Sands, MOC, GIOC, LISOC
    • FOT using VMs w/ SL5 to maintain

  

GlastRelease / Level-1 updates (Heather & Warren):

  •  C&A or operations communicates to ME that something needs to change.
  •  Maria Elena communicates the specific tags to Heather.
  •  Change gets communicated to Heather via mailing lists & JIRAs
  •  Heather makes a new tag and develops on dev.
  •  Run a few datasets through the dev copy of the pipeline.
  •  System tests need to be run. May be picked up by Simone.
  •  SAS-CCB has to happen after tests.
  •  Usually this only happens for leap seconds.
  •  An additional version is kept in waiting for external library changes.
  •  Activating new version just requires a new L1 XML file. 
    •  XML file generated by tools, never by hand.

 

Schematic of L1-Proc: http://glast-ground.slac.stanford.edu/Pipeline-II/exp/Fermi/TaskImageServlet?task=215560961&gvOrientation=LR

 

System Tests (Liz Hays, orphan for a long time, now Simone):

Original Documentation: https://confluence.slac.stanford.edu/display/SYSTEST/System+test+overview+and+instructions



Astroserver vs. Data Server:

Do we need it?

  • What does AS do that DS doesn't?
    • Allows for multiple private data sets that can be checked before approving for release
    • Example: multiple P8 revisions 
  • How integrated is it into LAT team everyday life?
    • Limited subset of people
  • Gamma-ray bursts - not all event classes available at FSSC
    • Make available, but use SLAC authentication to access?
  • Method of maintaining some functionality but without a database?
  • Scripting interface available on the astroserver but only from SLAC
  • May be available via ssh tunneling
  • Remote authentication an option for the astroserver.

 Note: the Astroserver web interface had 65 uses in the last 2 months by 13 users, but most people likely use the scripting interface.

 

Data Catalog: 

Data Catalog functionality is something that MUST be maintained. Need someone for this, definitely.

Data catalog issues that can arise: 

  • Changes to data file path /nfs/...
  • Trying to display too many folders in a browser is a problem. Not really an issue on the command line, but the command line is only available onsite.
  • Need to log and get statistics on who is using these applications
  • MC data, intermediate data products for trending, etc.

Data server & astroserver may be combinable?

  • Host everything in FITS format
  • Perhaps include merit data
  • No need to host instrument monitoring quantities
  • Create a "catalog-ready" data set that is queryable via Python? (A sketch follows this list.)
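As a sketch of what "queryable via Python" could look like for such a catalog-ready data set, assuming FT1-style FITS event files (the file name is a placeholder; TIME, ENERGY, RA, DEC are the standard FT1 columns):

    from astropy.table import Table

    # Read an FT1-style event file and apply a simple query; the file name is
    # a placeholder, but TIME/ENERGY/RA/DEC are standard FT1 columns.
    events = Table.read("ft1_events.fits", hdu="EVENTS")

    selection = events[(events["ENERGY"] > 1000.0) &   # energy cut in MeV
                       (events["TIME"] > 2.4e8)]        # mission elapsed time, s
    print(f"{len(selection)} events pass the cut")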


Updating Calibrations (Warren Focke):

https://confluence.slac.stanford.edu/display/SCIGRPS/LAT+calibrations

  •  Database doesn’t know about leap seconds…slowly drifting out of sync
    •  Must run updates during SAA passages.
  •  If a bad file is received, it shows up in data monitoring. Check the datamon history for its signature.
    •  CAL pedestals were out of whack. 
    •  Moved validity of previous calibration file, and then reprocessed affected runs.
  •  There is a RHEL6 version of the GUI
  •  Just as easy to use an SQL query instead (an example follows this list)
    •  Read-only account: glast-reader; password is glast-reader
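Since an SQL query is noted as being just as easy as the GUI, here is a hedged example of what that might look like from Python with the cx_Oracle client. Only the glast-reader account comes from the notes; the DSN, table, and column names are placeholders.

    import cx_Oracle

    # Connection details other than the read-only account are placeholders.
    conn = cx_Oracle.connect(user="glast-reader", password="glast-reader",
                             dsn="GLASTDB")
    cursor = conn.cursor()

    # Hypothetical query: list calibration entries and their validity ranges.
    cursor.execute("""
        SELECT calib_type, vstart, vend, completion
          FROM calib_metadata
         ORDER BY vstart
    """)
    for calib_type, vstart, vend, completion in cursor:
        print(calib_type, vstart, vend, completion)
    conn.close()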

/afs/slac/g/glast/ground/releases/calibrations/CAL/p7repro/

/afs/slac/g/glast/ground/releases/calibrations/TKR/

ACD calibration code location: /nfs/farm/g/glast/tak/releases/GR-20-09-10.



ASP Discussion (Jim Chiang):

Current status at: http://glast-ground.slac.stanford.edu/DataProcessing/

  • GRB alert processes are not an issue if things don’t work.
  • Lots of automated processing. FAs use the tarballs that come from the Sky Monitor processes.

Maintenance:

  1. Adding monitored sources: Jim runs a script based on ATels from the FA list.
  2. Putting leap seconds in. Requires testing. Needed for the GRB ID assignment.
    1. Confluence page about Leap Seconds. Rob has tried to gather information:
      https://confluence.slac.stanford.edu/display/ISOC/Leap+Second+Adjustments

 

When ASP falls over, roll back at either the process level or the stream level.

ASP is not well documented:

  •  Jim has notes; nothing is in Confluence.
  •  Cron jobs are somewhere in Jim’s account.
  •  GCN parsing is somewhere (Tony Johnson).
  •  Pipeline code is in CVS. Code runs under glast-raw.
  •  No CCB process for changes

Initial note from Jim about how to handle ASP: http://glast.stanford.edu/cgi-bin/viewcvs/users/jchiang/ASP/

From J. Chiang email:

The "install_procedures.txt" file is my rather elliptical description of all the steps needed to update the ASP software with a new release, so it points to all of the software elements I mentioned this morning. I'll try to clarify it over the next few days, so if you have any questions on it, please let me know.

The "insert_pointsource.py" script is what I use to add a monitored source to the FSSC list, and I use it with this page: https://confluence.slac.stanford.edu/display/SCIGRPS/New+DRP+sources+based+on+ATels which you've probably seen.  I'll put together more detailed notes on the full procedure and will check that into to the above repository.

If I think of other things to document, I'll check the text into that repo and will let you know about it.   Plus there are a number of maintenance scripts that I need to add to that repo...



Data Reprocessing (Maria Elena):

  •  Cutoff date is vital.
  •  Make a new version of all packages, even if there are no changes
  •  Test L1 on the dev pipeline to determine that all is OK
  •  Tom produces all the reprocessed files on a system that requires no monitoring.
  •  Start shipping reprocessed files to FSSC (database queries by hand)
    •  Ship based on file size (~ GB range, 4-8 at a time)
    •  Takes several weeks
    •  1 week for LS1, 1 day for FT1
  •  Switch over L1 Pipeline XML at a specific date/time. New version & new processing #s
  •  Tom reprocesses the gap data (“backfill”)
    •  1-2 days up to 2 months depending on how detailed reprocessing is
  •  Send backfill data by hand

 


Tracker Recon (was Tracy Usher):

  •  No one else understands it
  •  Eric Charles knows downstream work
  • Need to figure out how to handle this going forward


Datasets (Richard):

The question is: what are people using?

  •  Short timescale dataset (no ISOC change needed)
  •  Catalog-like dataset
  •  Electron data
  •  Merit Files (long-term), may be useful for finding something novel in the data
  •  Cosmic Ray data
  •  Monte Carlo datasets, for generating new event classes (Usable in 15 years? Can it be put in context?)

Maybe C&A can ask the collaboration what they need that isn’t being provided by the FSSC.

  •  If reasonable, can then look at adding datasets
  •  Ask the FUG? Not initially planning to make them public.

 

 

Infrastructure Maintenance

 Virtualization: 

  • What needs virtualization?
    • Halfpipe sounds like a candidate…
      •  No, it runs on RHEL6, but it is unlikely to move beyond that. So yes, virtualize at RHEL6.
    •  GlastRelease is also stuck on RHEL6
    •  A couple of APIs need Qt, using the commercial version
      •  Release Manager uses the free version of Qt
      •  Unsure why the commercial version is being used.
      •  Might be worth exploring a move to the free version
  •  Need to have a discussion about FastCopy, as it requires RHEL5.
  •  ISOC ops boxes are mostly under RHEL5. Demonstrated that the tools can be run under RHEL6.
  •  Backup ISOC is no longer supported.

What kind of virtualization?  VM or container?

GlastRelease: 

  • GlastRelease needs virtualization
    • RHEL 6 is last release that we have the personnel to support
    • A few people running GlastRelease (Developers) - nice use case for Docker. Getting GlastRelease to run on your laptop is painful. 
    • GlastRelease carries around geant4
  • Is there a distinction between Users and Developers for GlastRelease? 
    • No

 

Science Tools:

 

  • Focus with ScienceTools is just ease of distribution
  • Would it be useful to distribute the tools in VMs? Containers? Both?
  • Are there external dependencies (like xrootd) that would cause problems with virtualization if the backend changes?
  • We need an automated build system for ST: Release Manager vs. manual builds

  • GR uses xrootd, ST does not (Eric)
  • Use of virtualization is for convenience; which is the most useful thing to do? (Richard)

 

    • Don't depend on NFS/AFS if the container is built right. xrootd is stable for data.
    • Covers getting files/libraries in and also getting output data out.
    • Container helps with the diffuse model
      • on the nodes, not on NFS
      • on the nodes there's low overhead.
      • Cache the image on all of the nodes.
      • The Fermi ST image will have the diffuse model in it.

Release Manager: the Release Manager doesn't talk to Oracle, but it does talk to a database. Not user friendly.

  • For the SLAC farm: Docker containers for GlastRelease. Need a Docker registry.
  • Docker containers are the right solution for the batch farm (Brian)
  • Use their system to run a RHEL6 container, but the batch host is RHEL7.

    • Carefully build the container (works nicely with xrootd)
  • Need to find out from Warren if FT1, FT2 files are included (Richard)

What systems need what kinds of containers?

  • Samuel needs to discuss with the simulations group at Lyon (he is sick today).
  •  What is different for developers/users? 
  •  Same image for all the GR uses.
  •  Don't want to pull a 3 GB image just to get FT1; GR is 3x bigger. Just have 1 image at the moment.
  •  One giant image, with a good command-line interface installed in that image.
  •  Images built such that the top looks the same between GR and ST. Keep the same image.
  •  Separate builds for debugging purposes?
  •  GlastRelease is frozen; ST is constantly evolving. Debugging GR is not a problem; debugging ST is important.
  • Giacomo
    • Mount the code at runtime; the container doesn't have debugging tools.
    • The container provides the environment.
    • Compile inside the container.
    • Run the debugger inside the container.
    • The user image has everything, compiled.
  • Lightweight container for developers, who can then compile; users get the full compiled image.
  • Debugging in GR and ST is very different
  • The computing center will have a cache of Docker images.
  • Every project will be asked which Docker images it wants on the batch nodes.
  • Plan for managing cached images. Work out allocations for collaborations.
  • Cost of using Docker?

 

Pipeline:

 

  • Needs someone he can show the pipeline code to and train to do the heavy lifting when it comes to kicking the pipeline.
  • Docker containers for something like the batch system may cause some problems, since for something like the L1 pipeline a number of images would need to be launched simultaneously.
  •  Would size of the software cause problems with deployment?
  • We would need a system where you restrict loading images to the batch farm to prevent collisions/problems
  • There is probably a precedent for this, however, Matt has no experience deploying on this scale 
  • A file size of ~1 GB is best; a few GB is manageable for production.
  • The IT department is supportive of Docker at SLAC. There is 1 machine with RHEL7.
  • Lyon is a much larger computing center - likely they will upgrade to Docker first
    • Now full support for Docker at Lyon (Fred)

Infrastructure:

  • Last purchase went into dev cluster
    • many nodes at RHEL6; upgrade to RHEL7 and do Docker with this
    • Still figuring out how NFS/AFS get sorted out with RHEL7. GPFS?
  • It's good to come up with a plan because of the security implications if NFS is underneath.
    • Use the right Docker setup (UID issues with security)
  • SLAC will give us a few nodes for testing Docker. A fallback is to install on user machines. (Brian)
    • AFS on RHEL6 Docker
    • Can read files if they are world-readable.
    • NFS is hardest. 
  • Timeline for RHEL7, 12mo? 2018? (Matt)
    • RHEL7 support is dodgy. 
    • Configuration stuff is the hard part



Flight Software:

  •  Julie: No path to having anyone other than SLAC supporting flight software

LAT On-board Configuration:

 If we wanted to change the on-board configuration, what would happen?

  •  Jim Panetta knew the most about that and took the knowledge with him. But see this link.
  •  Gregg Thayer can nominally do that.
  •  The handshake between the ISOC and GlastRelease has to be done as well.
  •  MOOT/MOOD: the table where the key is stored before it’s transmitted to the flight software.
  •  Seems like it takes a while for the ground system to catch up before we can use it on the instrument.

From Gregg:

  •  Forming the input to the configuration build system is the least remembered part
  •  System for turning the crank and building configurations is fine
  •  Instructions for turning crank may need work
  •  Then need to check content of product before upload
  •  May 2010: Halt in data processing due to MOOT key mismatch with MPT
    •  Do we know how to handle the MPT?
    •  Gregg…yes.

Mission Planning/Flight Operations

Actions:

  • Finalize and document list of needed permissions on the ISOC Mission Planning page
  • Get Fermi Sharepoint access for NASA people (SLAC windows account is not enough)
  • Robin/Elizabeth/Jerry to propose round robin schedule for weekly mission planning
  • document support cron scripts for SAA checks and planning product generation
  • document occasional mission-week-boundary LAT command scheduling problem
  • NASA planners to take over LCI calibrations planning, effective immediately
  • FSSC to consider having LAT operations workstation/tool installation

September 2017 Actions:

  • any?

Routine Flight Operations Tasks

  • Monitoring, Trending, Reporting: for Weekly LAT Science reports and Quarterly LAT Science reports

Actions:

  • improve documentation in Confluence
  • make use of the Google Sheets API for auto-updating LAT SSR usage and LAT trigger histories (a sketch follows this list)
  • move LAT monitoring tools and data files from /afs/slac/u/gl/rac/LATmetrics/ to /afs/slac/g/glast/isoc/flightOps/LATmetrics/ for easier shared use and support
    • Fix known bugs in agepoly.pl and x4saa.pl
  • re-discover use of the LAT Configuration GUI tool, and document it
  • migrate other Excel spreadsheet usage to non-Excel implementations
    • TKR on-board and ground bad strip history trending: need a solution (ipython notebook?) that makes time plots, and also 4x4 grid-based info output
    • CAL light output history trending
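A sketch of the Google Sheets auto-update idea using the gspread client library; the credentials file and the sheet/worksheet names are placeholders, and only the idea of appending SSR-usage/trigger rows comes from the notes.

    import gspread

    # The service-account credentials file and the sheet/worksheet names are
    # placeholders; only the idea of auto-appending SSR-usage/trigger rows is
    # from the notes.
    gc = gspread.service_account(filename="isoc-service-account.json")
    worksheet = gc.open("LAT SSR usage").worksheet("history")

    # Append one row per update, e.g. (date, SSR fill fraction, trigger count).
    worksheet.append_row(["2017-09-12", 0.42, 153021])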

Mission Planning notes from Elizabeth:

  • fermilnx01 must be used for mission planning when doing anything that requires file management
  • Use ssh -XY when logging into SLAC
  • LAT planning typically performed on Monday. Requires products from the FOT on Monday:
    •  SAA predicts
    •  Fermi ephemeris
    • A script is available to verify that the products are good. It sends emails to Rob.
    • Info about the scripts is on the Confluence MP page.
    • Whether products are good or bad can be seen via the link on the Confluence MP page.

 
