Notes from Closeout Session: https://docs.google.com/document/d/1Ka4CkBBdhAkowJEzxd8_LyQCbSI2po41P4Tsrp6xviM

DOE Commitment:

  • 1.5 FTEs after 2018, down from 3 FTEs. The transition needs to be complete by the end of FY ’18. A gradual ramp-up and transition is desired.
  • Hardware purchasing? Server infrastructure, databases, etc.
  • IFC still funding hardware purchases
  • Hardware will stay at SLAC
  • ORACLE server migration soon

Want to prepare for a long-term mission continuation.

Revise if appropriate and effort not excessive.

  

Science Tools

Transitioning to Git: 

Joanne foresees a large amount of work in transitioning to git because the SLAC CVS structure involves large numbers of symlinked packages. She is also concerned about possible drift between GlastRelease and ScienceTools, since both have overlapping tools.

We settled on keeping the release manager/CVS system the same for now, but having all commits go through git. An automated script/utility would port any git changes to CVS and would be the only 'user' with commit privileges on the CVS repository, in order to prevent collisions.
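To make the intended workflow concrete, here is a minimal sketch of what that one-way mirroring utility could look like. All paths, the branch name, and the dedicated mirror account are assumptions for illustration; the real utility would also need to handle file additions, removals, and binary files.

    #!/usr/bin/env python
    """Hypothetical one-way git -> CVS mirror (sketch only).

    Assumes a dedicated account is the only one with CVS commit rights,
    so this script is the single CVS writer and collisions cannot occur.
    """
    import subprocess

    GIT_CLONE    = "/scratch/mirror/ScienceTools"      # placeholder path
    CVS_CHECKOUT = "/scratch/mirror/cvs/ScienceTools"  # placeholder path
    STATE_FILE   = "/scratch/mirror/last_mirrored_sha"

    def run(cmd, cwd):
        subprocess.run(cmd, cwd=cwd, check=True)

    def git_out(args):
        return subprocess.run(["git"] + args, cwd=GIT_CLONE, check=True,
                              capture_output=True, text=True).stdout

    def mirror_commit(sha):
        # Check out the git tree at this commit, copy it over the CVS
        # working copy, then commit under the dedicated CVS account.
        run(["git", "checkout", "--force", sha], cwd=GIT_CLONE)
        run(["rsync", "-a", "--exclude=.git", "--exclude=CVS",
             GIT_CLONE + "/", CVS_CHECKOUT + "/"], cwd=CVS_CHECKOUT)
        message = git_out(["log", "-1", "--format=%B", sha])
        run(["cvs", "commit", "-m", f"[git {sha[:8]}] {message}"],
            cwd=CVS_CHECKOUT)

    if __name__ == "__main__":
        last = open(STATE_FILE).read().strip()
        run(["git", "fetch", "origin"], cwd=GIT_CLONE)
        for sha in git_out(["rev-list", "--reverse",
                            f"{last}..origin/master"]).split():
            mirror_commit(sha)
            open(STATE_FILE, "w").write(sha)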

Updates from Tuesday Morning:

Brian and Jim began the process of porting CVS to git using the cvstogit utility

  • in principle you can restrict the groups that have write permission
  • the newer CVS-to-git utility (based on the svn-to-git utility) seems to retain the history
  • the main hurdle is the permissions problem
    • Need to restrict check-ins to CVS to only the git sync script
    • Need to allow users to access the CVS users area without affecting/conflicting with git/CVS check-ins
  • submodules/details not sorted out yet; just prepping for the transition now by moving the repo to SSD

Brian update:

  • Took the list and made a CSV file for the table on Confluence; grepped through the repo for needed files
  •  Copied the entire CVS repo locally to a machine in order to ease the transition
  •  Doesn't anticipate too many problems with this, but there are a few hiccups, like author matching (a possible approach is sketched after this list)
  •  GitHub org: fermi-lat
  •  Alex and Joe will be added to the fermi-lat team on GitHub
  •  Test conversion of Likelihood (github.com/brianv0/Likelihood)
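For the author-matching hiccup mentioned above, one possible approach is to generate an authors map (CVS username to git identity) that the conversion utility can consume. The CSV file name and its columns are hypothetical; the output format shown is the one several CVS/SVN-to-git converters accept.

    import csv

    # Hypothetical input: the Confluence author table exported as CSV with
    # columns "cvs_user" and "email"; file name and columns are assumptions.
    with open("package_authors.csv", newline="") as fin, \
         open("authors-map.txt", "w") as fout:
        for row in csv.DictReader(fin):
            cvs_user = row["cvs_user"].strip()
            email = row["email"].strip()
            name = email.split("@")[0].replace(".", " ").title()
            # Format accepted by several CVS/SVN-to-git converters:
            #   cvsuser = Full Name <email>
            fout.write(f"{cvs_user} = {name} <{email}>\n")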

 

Externals:

Generally in favor of jettisoning most external packages.  Python can be provided by users, no need to carry our own versions around.

IRF generation requires ROOT; nothing else does. It can be removed from the main FSSC ScienceTools for the public release; however, it would be good to keep it around for internal/developer use.

Updates from Tuesday Morning:

  • Went through the externals and determined the status/necessity of each package
  •  Also determined whether we needed to provide packages ourselves or whether they were available in conda (see the sketch after this list)
  • ScienceTools Package Author Review
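A rough sketch of how the "available in conda?" check could be scripted; the package list here is a placeholder for the real externals list from the Confluence table.

    import json
    import subprocess

    # Placeholder list of externals to check; the real list comes from the
    # package review table on Confluence.
    EXTERNALS = ["cfitsio", "fftw", "swig", "wcslib", "healpix"]

    for pkg in EXTERNALS:
        result = subprocess.run(["conda", "search", "--json", pkg],
                                capture_output=True, text=True)
        found = result.returncode == 0 and json.loads(result.stdout).get(pkg)
        status = "available in conda" if found else "needs to be provided"
        print(f"{pkg:12s} {status}")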

 

Other Notes About Science Tools:

  • Many pieces of analysis rely on ST being backwards-compatible.
  • Is the FSSC going to be doing more than just bug-fixes?
  • Will there be more development?
  • Fermipy unit tests catch a lot of issues
    • Advisable to run them after Likelihood updates (a minimal invocation is sketched below)
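A minimal way to invoke the Fermipy unit tests from Python after a Likelihood update, assuming Fermipy and pytest are installed in the current environment (test discovery options may differ in practice):

    import sys
    import pytest

    # Run the Fermipy test suite installed in the current environment; a
    # non-zero exit code indicates failures worth investigating.
    sys.exit(pytest.main(["--pyargs", "fermipy", "-v"]))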


Data Flow

Data Pipeline:

  • Data arrives from GSFC
  • L0 ingest triggered by FastCopy post-processing script (into database)

Fast copy: https://confluence.slac.stanford.edu/display/ISOC/FASTCopy+processing+chain

processsci.py (?? I didn't really catch this)

  •  Launches gap finder to look for missing pieces
  •  Divides the delivery into chunks and writes an XML file (sketched after this list)
  •  Up to 20 chunks/delivery. Usually ~12 or so
  • Halfpipe keeps track of what has been delivered
  • Once tidy & ordered, hands off to L1
  • L1 registers in data catalog & sends FITS to GSFC
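As an illustration of the chunking step described above, here is a sketch of dividing a delivery into chunks and writing an XML manifest. The element and attribute names are invented; the real schema is whatever processsci.py and the downstream L1 tasks actually use.

    import math
    import xml.etree.ElementTree as ET

    MAX_CHUNKS = 20  # per the notes: up to 20 chunks per delivery

    def write_chunk_manifest(delivery_id, files, out_path):
        """Split a delivery's files into chunks and write an XML manifest.

        Element and attribute names are illustrative only; the real schema
        is whatever the downstream L1 tasks expect.
        """
        n_chunks = min(MAX_CHUNKS, max(1, math.ceil(len(files) / 50)))
        per_chunk = math.ceil(len(files) / n_chunks)

        root = ET.Element("delivery", id=str(delivery_id))
        for i in range(n_chunks):
            chunk = ET.SubElement(root, "chunk", index=str(i))
            for name in files[i * per_chunk:(i + 1) * per_chunk]:
                ET.SubElement(chunk, "file", path=name)
        ET.ElementTree(root).write(out_path)

    # Example: 600 datagram files end up in ~12 chunks.
    write_chunk_manifest(42, [f"datagram_{i:04d}.evt" for i in range(600)],
                         "delivery_42_chunks.xml")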


Halfpipe (Steve Tether, maybe Gregg?):

Halfpipe has a specific Operations Task - gap reporting? (from Rob)

Warren: No, gap reporting occurs before the Halfpipe, during FastCopy post-processing.

Halfpipe reads chunked files:

  •  Extracts events from the datagrams
    •  Requires exactly 2 EPUs, otherwise the data will be rejected (in mergeEvents)
    •  i.e., if numEPUs != 2, stop
  •  Merges data from the EPUs into a single time-ordered stream (see the sketch after this list)
    •  Necessary for down-stream processing
  •  Launches Level 1 pipeline
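A toy sketch of the merging rule described above: exactly two EPU streams are required, and their events are merged into a single time-ordered stream. The real mergeEvents operates on datagrams; the tuple representation here is only for illustration.

    import heapq

    def merge_events(epu_streams):
        """Merge per-EPU event streams into one time-ordered stream.

        Mirrors the rule above: exactly two EPU streams are required,
        otherwise the data are rejected.  Events are assumed to be
        (timestamp, payload) tuples already ordered within each stream.
        """
        if len(epu_streams) != 2:
            raise ValueError(f"expected exactly 2 EPUs, got {len(epu_streams)}")
        return heapq.merge(*epu_streams, key=lambda event: event[0])

    # Example with two toy streams:
    epu0 = [(1.0, "a"), (3.0, "c")]
    epu1 = [(2.0, "b"), (4.0, "d")]
    for timestamp, payload in merge_events([epu0, epu1]):
        print(timestamp, payload)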

 

Question: What issues arise with the Halfpipe that need to be addressed?

  •  Datagrams being omitted due to an issue with the merging logic.
    •  Runs get broken up between deliveries, and datagrams fall through the cracks.
    •  The data then has to be repiped.

 

Pipeline Monitoring (Warren & Michael Kuss): 

  • Current shifting scheme:
    •  Warren & Michael each take responsibility for 1/2 day.
    •  Maria Elena covers when Warren is on vacation.
    •  No coverage for Michael.
  • Large disorganized document 
    https://confluence.slac.stanford.edu/display/ds/Things+to+know+while+on-call+for+Data+Processing
    •  Needs to be split into child pages and updated
    •  As new person gets trained, would be good to start working on reorganization
    •  Some work done via command line to interact with LSF (batch submission). 
    •  Most (95%) done via the web interface.
    •  Luca L. was looking at adding pipeline monitoring for young Torino folks.

 

ISOC Software Monitoring: 

Also known as: what does Steve T. need to do, and how often?

https://confluence.slac.stanford.edu/display/SAS/ISOC+software+summary

  •  Monitor the backup process
    •  60 days of data on disk, the rest on the xrootd server
    •  Archive of FastCopy packages backed up to tape
    •  Raw data backed up to tape outside of xrootd
    •  Not uncommon for a job to be terminated for no reason and have to be restarted
    •  Check once a week, commonly needs intervention
      •  Partly due to TRSchron and a change in the Kerberos implementations
      •  Jobs get AFS tokens but no Kerberos tickets, so the batch job semi-fails (a quick check is sketched after this list).
      •  The backup works but the cleanup fails.
      •  Only operating at ~60% of capacity… 1/2% per week. 
    •  Documented in the problem report submitted. Jim also complained.
  •  Regular maintenance is documented on a page for Jim Panetta that explains how to check up on data maintenance tasks.
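For the AFS-token-without-Kerberos-ticket failure mode noted above, a quick check along these lines could be added to the weekly look; the exact commands available on the batch hosts are an assumption.

    import subprocess

    def have_kerberos_ticket():
        # "klist -s" exits 0 only if a valid Kerberos ticket cache exists.
        return subprocess.run(["klist", "-s"]).returncode == 0

    def have_afs_token():
        # "tokens" lists live AFS tokens with an "Expires" field; parsing its
        # output this way is an assumption about the local AFS client.
        out = subprocess.run(["tokens"], capture_output=True, text=True).stdout
        return "Expires" in out

    if have_afs_token() and not have_kerberos_ticket():
        print("WARNING: AFS token present but no Kerberos ticket; "
              "the batch cleanup step is likely to fail.")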

Other issues that Steve T. deals with?

  •  Monitor event log for Error or Fatal messages in the last week.
    •  Usually they come in as pages via email
    •  Example: changing how the gap finder is triggered. It is now invoked by the science delivery.
    •  FSW errors go to Steve, Gregg, Jana
    •  For others, need to review the paging configuration to see who is contacted.
  •  Vetting of orbital event reports (a rough sketch follows this list).
    •  Would reject events if they shifted from one orbit to another.
    •  No more than 3 in any single report.
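A rough sketch of one reading of the vetting rules above; the report format and field names are invented for illustration.

    def vet_orbit_report(events, max_rejections=3):
        """Sketch of vetting an orbital event report.

        `events` is a list of dicts with hypothetical fields 'predicted_orbit'
        and 'actual_orbit'.  Events that shifted from one orbit to another are
        flagged for rejection; per the notes, no single report should carry
        more than `max_rejections` such rejections.
        """
        rejected = [e for e in events
                    if e["predicted_orbit"] != e["actual_orbit"]]
        return rejected, len(rejected) <= max_rejections

    # Example with one shifted event:
    report = [{"predicted_orbit": 101, "actual_orbit": 101},
              {"predicted_orbit": 101, "actual_orbit": 102}]
    rejected, report_ok = vet_orbit_report(report)
    print(len(rejected), "event(s) to reject; report ok:", report_ok)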

In migrating (what?) to RHEL6, how much was touched?

  •  CHS package and offline were not touched
  •  All others had to be touched in some way
  •  Depends on a lot of external packages.
  •  Unlikely to be portable past RHEL6
  •  Currently using RHEL5 release under RHEL6
  •  RHEL5 very near end of life

 

FSSC (Don) will review Steve T.’s page with information on what he does on a regular basis.


Data Distribution:

Get rid of FastCopy?

  • New version of FastCopy won't work with current hardware & vice versa
  • Use another transfer mechanism?
    • Would require changing all scripts downstream
    • Since everyone is having issues, may need to look at alternatives
  • New transfer mechanism would require many changes across the players
    • SSC, White Sands, MOC, GIOC, LISOC
    • FOT using VMs w/ SL5 to maintain

  

GlastRelease / Level-1 updates (Heather & Warren):

  •  C&A or operations communicates to ME that something needs to change.
  •  Maria Elena communicates the specific tags to Heather.
  •  Change gets communicated to Heather via mailing lists & JIRAs
  •  Heather makes a new tag and develops on dev.
  •  Run a few datasets through the dev copy of the pipeline.
  •  System tests need to be run. May be picked up by Simone.
  •  SAS-CCB has to happen after tests.
  •  Usually this only happens for leap seconds.
  •  An additional version is kept in waiting for external library changes.
  •  Activating new version just requires a new L1 XML file. 
    •  XML file generated by tools, never by hand.

 

Schematic of L1-Proc: http://glast-ground.slac.stanford.edu/Pipeline-II/exp/Fermi/TaskImageServlet?task=215560961&gvOrientation=LR

 

System Tests (Liz Hays, orphan for a long time, now Simone):

Original Documentation: https://confluence.slac.stanford.edu/display/SYSTEST/System+test+overview+and+instructions



Astroserver vs. Data Server:

Do we need it?

  • What does AS do that DS doesn't?
    • Allows for multiple private data sets that can be checked before approving for release
    • Example: multiple P8 revisions 
  • How integrated is it into LAT team everyday life?
    • Limited subset of people
  • Gamma-ray bursts - not all event classes available at FSSC
    • Make available, but use SLAC authentication to access?
  • Method of maintaining some functionality but without a database?
  • Scripting interface available on the astroserver but only from SLAC
  • May be available via ssh tunneling
  • Remote authentication an option for the astroserver.

 Note: the Astroserver web interface had 65 uses in the last 2 months by 13 users, but most people likely use the scripting interface.

 

Data Catalog: 

Data Catalog functionality is something that MUST be maintained. Need someone for this, definitely.

Data catalog issues that can arise: 

  • Changes to data file path /nfs/...
  • Trying to display too many folders in a browser is a problem. Not really an issue on the command line, but the command line is only available onsite.
  • Need to log and get statistics on who is using these applications
  • MC data, intermediate data products for trending, etc.

Data server & astroserver may be combinable?

  • Host everything in FITS format
  • Perhaps include merit data
  • No need to host instrument monitoring quantities
  • Create a "catalog-ready" data set that is queryable via Python? (A sketch follows this list.)
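As a sketch of what "queryable via Python" could look like for such a catalog-ready data set, assuming FT1-style FITS event files (the file name is a placeholder; TIME, ENERGY, RA, DEC are the standard FT1 columns):

    from astropy.table import Table

    # Read an FT1-style event file and apply a simple query; the file name is
    # a placeholder, but TIME/ENERGY/RA/DEC are standard FT1 columns.
    events = Table.read("ft1_events.fits", hdu="EVENTS")

    selection = events[(events["ENERGY"] > 1000.0) &   # energy cut in MeV
                       (events["TIME"] > 2.4e8)]        # mission elapsed time, s
    print(f"{len(selection)} events pass the cut")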


Updating Calibrations (Warren Focke):

https://confluence.slac.stanford.edu/display/SCIGRPS/LAT+calibrations

  •  Database doesn’t know about leap seconds…slowly drifting out of sync
    •  Must run updates during SAA passages.
  •  If a bad file is received, it shows up in data monitoring. Check the datamon history for its signature.
    •  CAL pedestals were out of whack. 
    •  Moved validity of previous calibration file, and then reprocessed affected runs.
  •  There is a RHEL6 version of the GUI
  •  Just as easy to use an SQL query instead (an example follows this list)
    •  Read-only account: glast-reader; password is glast-reader
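Since an SQL query is noted as being just as easy as the GUI, here is a hedged example of what that might look like from Python with the cx_Oracle client. Only the glast-reader account comes from the notes; the DSN, table, and column names are placeholders.

    import cx_Oracle

    # Connection details other than the read-only account are placeholders.
    conn = cx_Oracle.connect(user="glast-reader", password="glast-reader",
                             dsn="GLASTDB")
    cursor = conn.cursor()

    # Hypothetical query: list calibration entries and their validity ranges.
    cursor.execute("""
        SELECT calib_type, vstart, vend, completion
          FROM calib_metadata
         ORDER BY vstart
    """)
    for calib_type, vstart, vend, completion in cursor:
        print(calib_type, vstart, vend, completion)
    conn.close()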

/afs/slac/g/glast/ground/releases/calibrations/CAL/p7repro/

/afs/slac/g/glast/ground/releases/calibrations/TKR/

ACD calibration code location: /nfs/farm/g/glast/tak/releases/GR-20-09-10.



ASP Discussion (Jim Chiang):

Current status at: http://glast-ground.slac.stanford.edu/DataProcessing/

  • GRB alert processes are not an issue if things don’t work.
  • Lots of automated processing. FAs use the tarballs that come from the Sky Monitor processes.

Maintenance:

  1. Adding monitored sources: Jim runs a script based on ATels from the FA list.
  2. Putting leap seconds in. Requires testing. Needed for the GRB ID assignment.
    1. Confluence page about Leap Seconds. Rob has tried to gather information:
      https://confluence.slac.stanford.edu/display/ISOC/Leap+Second+Adjustments

 

When ASP falls over, roll back at either the process level or the stream level.

ASP is not well documented:

  •  Jim has notes; nothing is in Confluence.
  •  Cron jobs are somewhere in Jim’s account.
  •  GCN parsing is somewhere (Tony Johnson).
  •  Pipeline code is in CVS. Code runs under glast-raw.
  •  No CCB process for changes

Initial note from Jim about how to handle ASP: http://glast.stanford.edu/cgi-bin/viewcvs/users/jchiang/ASP/

From J. Chiang email:

The "install_procedures.txt" file is my rather elliptical description of all the steps needed to update the ASP software with a new release, so it points to all of the software elements I mentioned this morning. I'll try to clarify it over the next few days, so if you have any questions on it, please let me know.

The "insert_pointsource.py" script is what I use to add a monitored source to the FSSC list, and I use it with this page: https://confluence.slac.stanford.edu/display/SCIGRPS/New+DRP+sources+based+on+ATels which you've probably seen.  I'll put together more detailed notes on the full procedure and will check that into to the above repository.

If I think of other things to document, I'll check the text into that repo and will let you know about it.   Plus there are a number of maintenance scripts that I need to add to that repo...



Data Reprocessing (Maria Elena):

  •  Cutoff date is vital.
  •  Make a new version of all packages, even if there are no changes
  •  Test L1 on the dev pipeline to determine that all is OK
  •  Tom produces all the reprocessed files on a system that requires no monitoring.
  •  Start shipping reprocessed files to FSSC (database queries by hand)
    •  Ship based on file size (~ GB range, 4-8 at a time)
    •  Takes several weeks
    •  1 week for LS1, 1 day for FT1
  •  Switch over L1 Pipeline XML at a specific date/time. New version & new processing #s
  •  Tom reprocesses the gap data (“backfill”)
    •  1-2 days up to 2 months depending on how detailed reprocessing is
  •  Send backfill data by hand

 


Tracker Recon (was Tracy Usher):

  •  No one else understands it
  •  Eric Charles knows downstream work
  • Need to figure out how to handle this going forward


Datasets (Richard):

The question is: what are people using?

  •  Short timescale dataset (no ISOC change needed)
  •  Catalog-like dataset
  •  Electron data
  •  Merit Files (long-term), may be useful for finding something novel in the data
  •  Cosmic Ray data
  •  Monte Carlo datasets, for generating new event classes (Usable in 15 years? Can it be put in context?)

Maybe C&A can ask the collaboration what they need that isn’t being provided by the FSSC.

  •  If reasonable, can then look at adding datasets
  •  Ask the FUG? Not initially planning to make them public.

 

 

Infrastructure Maintenance

 Virtualization: 

  • What needs virtualization?
    • Halfpipe sounds like a candidate…
      •  No, it runs on RHEL6, but it is unlikely to move beyond that. So yes, virtualize at RHEL6.
    •  GlastRelease is also stuck on RHEL6
    •  A couple of APIs need Qt, using the commercial version
      •  Release Manager uses the free version of Qt
      •  Unsure why the commercial version is being used.
      •  Might be worth exploring a move to the free version
  •  Need to have a discussion about FastCopy, as it requires RHEL5.
  •  ISOC ops boxes are mostly under RHEL5. Demonstrated that the tools can be run under RHEL6.
  •  Backup ISOC is no longer supported.

What kind of virtualization?  VM or container?

GlastRelease: 

  • GlastRelease needs virtualization
    • RHEL 6 is last release that we have the personnel to support
    • A few people running GlastRelease (Developers) - nice use case for Docker. Getting GlastRelease to run on your laptop is painful. 
    • GlastRelease carries around geant4
  • Is there a distinction between Users and Developers for GlastRelease? 
    • No

 

Science Tools:

 

  • Focus with ScienceTools is just ease of distribution
  • Would it be useful to distribute the tools in VMs? Containers? Both?
  • Are there external dependencies (like xrootd) that would cause problems with virtualization if the backend changes?
  • We need an automated build system for ST: Release Manager vs. manual builds

  • GR uses xrootd, ST does not (Eric)
  • Use of virtualization is for convenience; which is the most useful thing to do? (Richard)

 

    • Don't depend on NFS/AFS if the container is built right. xrootd is stable for data.
    • Covers getting files/libraries in and also getting output data out.
    • Container helps with the diffuse model
      • on the nodes, not on NFS
      • on the nodes there's low overhead.
      • Cache the image on all of the nodes.
      • The Fermi ST image will have the diffuse model in it.

Release Manager: the Release Manager doesn't talk to Oracle, but it does talk to a database. Not user friendly.

  • For the SLAC farm: Docker containers for GlastRelease. Need a Docker registry.
  • Docker containers are the right solution for the batch farm (Brian)
  • Use their system to run a RHEL6 container, but the batch host is RHEL7.

    • Carefully build the container (works nicely with xrootd)
  • Need to find out from Warren if FT1, FT2 files are included (Richard)

What systems need what kinds of containers?

  • Samuel needs to discuss with the simulations group at Lyon (he is sick today).
  •  What is different for developers/users? 
  •  Same image for all the GR uses.
  •  Don't want to pull a 3 GB image just to get FT1; GR is 3x bigger. Just have 1 image at the moment.
  •  One giant image, with a good command-line interface installed in that image.
  •  Images built such that the top looks the same between GR and ST. Keep the same image.
  •  Separate builds for debugging purposes?
  •  GlastRelease is frozen; ST is constantly evolving. Debugging GR is not a problem; debugging ST is important.
  • Giacomo
    • Mount the code at runtime; the container doesn't have debugging tools.
    • The container provides the environment.
    • Compile inside the container.
    • Run the debugger inside the container.
    • The user image has everything, compiled.
  • Lightweight container for developers, who can then compile; users get the full compiled image.
  • Debugging in GR and ST is very different
  • The computing center will have a cache of Docker images.
  • Every project will be asked which Docker images it wants on the batch nodes.
  • Plan for managing cached images. Work out allocations for collaborations.
  • Cost of using Docker?

 

Pipeline:

 

  • Needs someone he can show the pipeline code to and train to do the heavy lifting when it comes to kicking the pipeline.
  • Docker containers for something like the batch system may cause some problems, since for something like the L1 pipeline a number of images would need to be launched simultaneously.
  •  Would size of the software cause problems with deployment?
  • We would need a system where you restrict loading images to the batch farm to prevent collisions/problems
  • There is probably a precedent for this, however, Matt has no experience deploying on this scale 
  • A file size of ~1 GB is best; a few GB is manageable for production.
  • The IT department is supportive of Docker at SLAC. There is 1 machine with RHEL7.
  • Lyon is a much larger computing center - likely they will upgrade to Docker first
    • Now full support for Docker at Lyon (Fred)

Infrastructure:

  • Last purchase went into dev cluster
    • many nodes at RHEL6; upgrade to RHEL7 and do Docker with this
    • Still figuring out how NFS/AFS get sorted out with RHEL7. GPFS?
  • It's good to come up with a plan because of the security implications if NFS is underneath.
    • Use the right Docker setup (UID issues with security)
  • SLAC will give us a few nodes for testing Docker. A fallback is to install on user machines. (Brian)
    • AFS on RHEL6 Docker
    • Can read files if they are world-readable.
    • NFS is hardest. 
  • Timeline for RHEL7, 12mo? 2018? (Matt)
    • RHEL7 support is dodgy. 
    • Configuration stuff is the hard part



Flight Software:

  •  Julie: No path to having anyone other than SLAC supporting flight software

LAT On-board Configuration:

 If we wanted to change the on-board configuration, what would happen?

  •  Jim Panetta knew the most about that and took the knowledge with him. But see this link.
  •  Gregg Thayer can nominally do that.
  •  The handshake between the ISOC and GlastRelease has to be done as well.
  •  MOOT/MOOD: the table where the key is stored before it’s transmitted to the flight software.
  •  Seems like it takes a while for the ground system to catch up before we can use it on the instrument.

From Gregg:

  •  Forming the input to the configuration build system is the least remembered part
  •  System for turning the crank and building configurations is fine
  •  Instructions for turning crank may need work
  •  Then need to check content of product before upload
  •  May 2010: Halt in data processing due to MOOT key mismatch with MPT
    •  Do we know how to handle the MPT?
    •  Gregg…yes.

Mission Planning/Flight Operations

Actions:

  • Finalize and document list of needed permissions on the ISOC Mission Planning page
  • Get Fermi Sharepoint access for NASA people (SLAC windows account is not enough)
  • Robin/Elizabeth/Jerry to propose round robin schedule for weekly mission planning
  • document support cron scripts for SAA checks and planning product generation
  • document occasional mission-week-boundary LAT command scheduling problem
  • NASA planners to take over LCI calibrations planning, effective immediately
  • FSSC to consider having LAT operations workstation/tool installation

September 2017 Actions:

  • any?

Routine Flight Operations Tasks

  • Monitoring, Trending, Reporting: for Weekly LAT Science reports and Quarterly LAT Science reports

Actions:

  • improve documentation in Confluence
  • make use of the Google Sheets API for auto-updating LAT SSR usage and LAT trigger histories (a sketch follows this list)
  • move LAT monitoring tools and data files from /afs/slac/u/gl/rac/LATmetrics/ to /afs/slac/g/glast/isoc/flightOps/LATmetrics/ for easier shared use and support
    • Fix known bugs in agepoly.pl and x4saa.pl
  • re-discover use of the LAT Configuration GUI tool, and document it
  • migrate other Excel spreadsheet usage to non-Excel implementations
    • TKR on-board and ground bad strip history trending: need a solution (ipython notebook?) that makes time plots, and also 4x4 grid-based info output
    • CAL light output history trending
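A sketch of the Google Sheets auto-update idea using the gspread client library; the credentials file and the sheet/worksheet names are placeholders, and only the idea of appending SSR-usage/trigger rows comes from the notes.

    import gspread

    # The service-account credentials file and the sheet/worksheet names are
    # placeholders; only the idea of auto-appending SSR-usage/trigger rows is
    # from the notes.
    gc = gspread.service_account(filename="isoc-service-account.json")
    worksheet = gc.open("LAT SSR usage").worksheet("history")

    # Append one row per update, e.g. (date, SSR fill fraction, trigger count).
    worksheet.append_row(["2017-09-12", 0.42, 153021])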

Mission Planning notes from Elizabeth:

  • fermilnx01 must be used for mission planning when doing anything that requires file management
  • Use ssh -XY when logging into SLAC
  • LAT planning typically performed on Monday. Requires products from the FOT on Monday:
    •  SAA predicts
    •  Fermi ephemeris
    • A script is available to verify that the products are good. It sends emails to Rob.
    • Info about the scripts is on the Confluence MP page.
    • Whether products are good or bad can be seen via the link on the Confluence MP page.

 
