Meeting and tutorial sessions with Warren F. during the September 2017 Software Week. My (brief) notes are below, along with the recordings I made of the Zoom sessions.

Videos

zoom_meeting_Warren_F_20170912a.mp4

zoom_meeting_Warren_F_20170912b.mp4

zoom_meeting_Warren_F_20170912c.mp4

zoom_meeting_Warren_F_20170912d.mp4

zoom_meeting_Warren_F_20170914.mp4

Notes

2017-09-12

  • Implementing a new Calibration
    • schedule it to take effect during a future SAA passage (as far as processing is concerned)
    • log in with "ssh -XY jeggen@rhel6-64.slac.stanford.edu"
    • had to edit my .cshrc file to enable some environment settings
    • bring up the rdbGUI window with the "rdbGUI &" command
    • File -> Open DB Schema
    • File name: afs/slac/g/glast/applications/dbSchemas/calib
    •   make a bookmark for this location (GUI button)
    •   double-click "calib.xml"
  • Session -> Open connection
    • got the connection details from Warren to fill in the fields
  • an ongoing infrastructure issue still seems to be preventing us from progressing
  • cd $LATCalibRoot/TKR
  • the latest file there is LAT_BadStrips_44.xml, but we expected LAT_BadStrips_45.xml
    • the file had not been copied to the directory
    • copied the file to the appropriate place and continued
  • "Open connection" password: calibr8tor
    • select: instrument
    • click on "more"
    • select: flavor
    • 3rd field: type "L1current"
    • click on "more"
    • select: calib_type
    • 3rd field: select "TKR_DeadChan"
  • cp /afs/slac/g/glast/users/lsrea/badStrips/xml/LAT_BadStrips_45.xml afs/slac/g/glast/applications/dbSchemas/calib
    • check that you have write permission on the destination first
  • select the last row at the bottom of the rdbGUI window
    • right-click and choose the "copy latest" option
    • change the date to the current date
    • change the time to the middle of the SAA passage chosen earlier
    • change data_ident to the filename of the .xml file above
    • click "send"
    • check the "vstart" time of the new last row
  • fixing an error in HalfPipe
    • look at the "Fermi LAT Data Processing" page on the portal site
    • this issue was likely caused by an infrastructure problem the previous night
    • clicked on a doChunks stream, then "messages"; a "read timed out" message indicated network trouble
    • use the "bjobs" command to test the connection to the LSF server
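A minimal bash sketch of the two file checks in the calibration steps above: finding the highest-numbered LAT_BadStrips_N.xml in a directory (which is how the missing "45" showed up when "44" was the latest), and refusing to copy a new file unless the destination is writable. The function names are hypothetical. The sketch assumes bash; the notes use a tcsh login shell (.cshrc), so source it from a bash shell or run it as a script.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for staging a new TKR bad-strips calibration file.

# Print the highest N among LAT_BadStrips_N.xml files in a directory,
# or print nothing if there are none.
latest_badstrips_index() {
  local dir="$1" f n max=""
  for f in "$dir"/LAT_BadStrips_*.xml; do
    [ -e "$f" ] || continue                  # glob matched nothing
    n="${f##*/LAT_BadStrips_}"               # strip directory and prefix
    n="${n%.xml}"                            # strip extension
    case "$n" in ''|*[!0-9]*) continue ;; esac  # skip non-numeric names
    if [ -z "$max" ] || [ "$n" -gt "$max" ]; then
      max="$n"
    fi
  done
  if [ -n "$max" ]; then
    echo "$max"
  fi
}

# Copy a calibration file into a directory only if the source exists and
# the destination is writable, then confirm the copy is there.
stage_calib_file() {
  local src="$1" dest="$2"
  if [ ! -f "$src" ]; then
    echo "missing source file: $src" >&2
    return 1
  fi
  if [ ! -w "$dest" ]; then
    echo "no write permission on $dest" >&2
    return 1
  fi
  cp -p "$src" "$dest"/ && [ -f "$dest/${src##*/}" ]
}
```

For example, `latest_badstrips_index "$LATCalibRoot/TKR"` should print 45 once the new file is in place.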

 

2017-09-14

  • Monitoring the batch farm
    • need to be on a machine with a batch client (all public machines have one)
    • e.g. rhel6-64n
  • a script on the "Things to Know" page that Warren created monitors the batch system
    • don't run too many things at once, otherwise the system is overloaded
    • excess jobs are stored in subdirectories
    • these shouldn't be more than a couple of hours old
    • look for pending jobs and find their lock files
    • look at the pipeline page
      • look at the "summary" link
      • look at the "flagFT2" link
      • look at the "Show streams" link
      • saw a "TERMINATED" job, clicked on it
      • saw a "LOCK_RUN" job, clicked on it
      • the log file was never created: a dead end
    • go to the directory that contains the log file
      • tail the log file: lots of "permission denied" messages due to expired AFS tokens
    • back on the pipeline page, click the "View messages" link
      • enter 10000 in the "last ___ minutes" box
      • the messages indicate a problem between the pipeline and the batch system
  • the only way to rectify a failed job is to bkill it
    • (in a terminal)
      • jobs="288191 288152"
      • bjobs $jobs
    • it takes an hour for the pipeline to realize the job is dead
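The batch-farm checks above can be sketched as two bash functions: one lists lock files older than a threshold (the notes say these shouldn't be more than a couple of hours old), and one shows and then kills failed jobs with the standard LSF `bjobs` and `bkill` commands. The `*.lock` pattern, the 120-minute default, and the function names are assumptions for illustration; this is not Warren's monitoring script.

```bash
#!/usr/bin/env bash
# Hypothetical batch-farm helpers. The lock-file pattern and age
# threshold are assumptions, not taken from the pipeline setup.

# List files matching a pattern anywhere under a directory tree that were
# last modified more than max_age_min minutes ago.
# Usage: stale_files DIR [PATTERN] [MAX_AGE_MIN]
stale_files() {
  local dir="$1" pattern="${2:-*.lock}" max_age_min="${3:-120}"
  find "$dir" -type f -name "$pattern" -mmin +"$max_age_min" -print
}

# Show the given LSF jobs, then kill them.
# Usage: kill_failed_jobs JOBID...
kill_failed_jobs() {
  if [ "$#" -eq 0 ]; then
    echo "usage: kill_failed_jobs JOBID..." >&2
    return 1
  fi
  bjobs "$@"   # confirm these are the jobs you mean to kill
  bkill "$@"
}
```

For example, `kill_failed_jobs 288191 288152` runs `bjobs` on the two jobs from the notes and then kills them; as noted above, the pipeline may still take about an hour to register them as dead.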