Meeting and tutorial sessions with Warren F. during the September 2017 Software Week. My (brief) notes are below, along with the recordings I took of the Zoom sessions.
Videos
zoom_meeting_Warren_F_20170912a.mp4
zoom_meeting_Warren_F_20170912b.mp4
zoom_meeting_Warren_F_20170912c.mp4
zoom_meeting_Warren_F_20170912d.mp4
zoom_meeting_Warren_F_20170914.mp4
Notes
2017-09-12
- Implementing a new Calibration
- perform during SAA pass in the future (as far as processing is concerned)
- log in with "ssh -XY jeggen@rhel6-64.slac.stanford.edu"
- had to edit my .cshrc file to get some environment stuff enabled
- bring up window with "$ rdbGUI &" command
- File -> Open DB Schema
- File name: /afs/slac/g/glast/applications/dbSchemas/calib
- make a bookmark for this location (GUI button)
- double-click "calib.xml"
- Session -> Open connection
- got info from Warren to fill in fields
- ongoing infrastructure issue still seems to be preventing us from progressing
- cd $LATCalibRoot/TKR
- latest file is LAT_BadStrips_44.xml, expecting "45"
- file not copied to dir
- file copied to appropriate place, continuing
- "Open connection" password: calibr8tor
- select: instrument
- click on "more"
- select: flavor
- 3rd field: type "L1current"
- click on "more"
- select: calib_type
- 3rd field: select "TKR_DeadChan"
- cp /afs/slac/g/glast/users/lsrea/badStrips/xml/LAT_BadStrips_45.xml /afs/slac/g/glast/applications/dbSchemas/calib
- check that you have write permissions to destination first
- select last row at bottom of rdbGUI
- right-click, "copy latest option"
- change date to current date
- change time to middle of SAA passage chosen earlier
- change data_ident to the filename of the .xml file above
- click "send"
- check the "vstart" time of the newest (last) row
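Some of the manual checks in the calibration steps above (finding the latest LAT_BadStrips file, confirming write access before copying, and picking the middle of an SAA passage) could be scripted. The following is only a rough bash sketch, not something Warren showed: the function names are hypothetical, it assumes GNU date, and the timestamp format printed by midpoint_utc is a guess that may not match what rdbGUI expects.

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not part of any LAT tooling.

# Print the highest N among LAT_BadStrips_N.xml files in directory $1.
latest_badstrips_index() {
    local dir=$1 f n max=-1
    for f in "$dir"/LAT_BadStrips_*.xml; do
        [ -e "$f" ] || continue            # the glob matched nothing
        n=${f##*_}                         # drop everything through the last "_"
        n=${n%.xml}                        # drop the extension, leaving the number
        case $n in
            ''|*[!0-9]*) continue ;;       # skip suffixes that are not plain digits
        esac
        if [ "$n" -gt "$max" ]; then max=$n; fi
    done
    if [ "$max" -lt 0 ]; then return 1; fi # no matching file found
    echo "$max"
}

# Copy file $1 into directory $2, refusing if the destination is not writable.
# Note: [ -w ] looks at what the OS reports; on AFS the ACLs decide, so
# "fs listacl <dir>" is the more reliable check there.
install_calib_file() {
    local src=$1 dest=$2
    if [ ! -r "$src" ]; then
        echo "cannot read source: $src" >&2
        return 1
    fi
    if [ ! -d "$dest" ] || [ ! -w "$dest" ]; then
        echo "no write permission on: $dest" >&2
        return 1
    fi
    cp -- "$src" "$dest"/
}

# Print the UTC midpoint between two timestamps (GNU date syntax assumed).
midpoint_utc() {
    local start end mid
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    mid=$(( (start + end) / 2 ))
    date -u -d "@$mid" '+%Y-%m-%d %H:%M:%S'
}

# Example calls (paths taken from the notes above):
#   latest_badstrips_index "$LATCalibRoot/TKR"
#   midpoint_utc "2017-09-12 10:00:00" "2017-09-12 10:20:00"
```

The functions only print or copy; nothing here talks to the calibration database, so the rdbGUI steps still have to be done by hand.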
- Fixing error in HalfPipe
- look at "Fermi LAT Data Processing" page on the portal site
- this issue likely caused by an infrastructure problem from the previous night
- clicked on a doChunks stream, then "messages", saw a "read timed out" message, indicating network trouble
- use "bjobs" command to test connection to LSF server
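The bjobs connectivity test above can hang for a long time when the network is in trouble, so it may help to wrap it in a time limit. This is a minimal bash sketch under assumptions: responds_within is a hypothetical helper name, and it relies on the GNU coreutils timeout command, which exits with status 124 when the limit is reached.

```shell
#!/usr/bin/env bash
# Sketch only: responds_within is a hypothetical helper, not an LSF tool.

# Run a command with a time limit of $1 seconds and report how it ended.
# Returns the command's own exit status, or 124 if it timed out.
responds_within() {
    local limit=$1 rc=0
    shift
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    elif [ "$rc" -ne 0 ]; then
        # A non-zero exit is not necessarily a network problem;
        # run the command by hand to read its message.
        echo "failed (exit $rc): $*" >&2
    fi
    return "$rc"
}

# Example call:
#   responds_within 30 bjobs && echo "LSF server answered"
```

A timeout points toward the kind of network trouble seen in the "read timed out" message, while a quick failure suggests looking at the command's own error output instead.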
2017-09-14
- Monitoring batch farm
- need to be on a machine with a batch client (all public machines do)
- e.g. rhel6-64n
- script in "Things to Know" page Warren created to monitor batch system
- shouldn't run too many things at once, otherwise the system is overloaded
- excess jobs are stored in subdirs
- shouldn't be more than a couple hours old
- look for pending jobs, find lock files
- look at pipeline page
- look at the "summary" link
- look at the "flagFT2" link
- look at the "Show streams" link
- saw a "TERMINATED" job, clicked on it
- saw a "LOCK_RUN" job, clicked on it
- log file never created, dead-end
- go to dir that contains log file
- tail logfile: lots of "permission denied" messages due to expired AFS tokens
- back at pipeline page, click "View messages" link
- put 10000 in "last ___ minutes" box
- indicates problem between pipeline and batch system
- only way to rectify a failed job is to bkill it
- (in terminal)
- jobs="288191 288152"
- bjobs $jobs
- takes an hour for the pipeline to realize the job is dead
- (in terminal)
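As a rough sketch of the staleness check from the batch-farm notes above (files that "shouldn't be more than a couple hours old"), and not something Warren showed, a find call can list old files under a directory. The helper name stale_files, the directory, and the "*lock*" pattern are all assumptions, since the notes do not record where the lock files live or what they are called.

```shell
#!/usr/bin/env bash
# Sketch only: stale_files is a hypothetical helper; the real lock-file
# location and naming are not recorded in these notes.

# List regular files under directory $1 whose name matches pattern $2
# and that were last modified more than $3 minutes ago.
stale_files() {
    local dir=$1 pattern=$2 minutes=$3
    find "$dir" -type f -name "$pattern" -mmin +"$minutes" -print
}

# Example call (directory and pattern are placeholders):
#   stale_files /path/to/job/subdirs '*lock*' 120
```

Once bjobs confirms that a listed job is really dead, the standard LSF command for removing it is bkill followed by the job IDs, e.g. bkill $jobs with the variable set as in the notes above.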