Notes relating to monitoring the Spring 2012 Full Pass7 Fermi Reprocessing project.

  • Task control
    • trickleStream.py - long-running script that monitors current batch usage and submits new top-level streams. Run only one instance of this script! I run it on my desktop machine, comet.slac.stanford.edu (a 64-bit RHEL6 machine), redirecting the output to a log file.
      • Starting trickleStream
        $ cd /nfs/farm/g/glast/u38/Reprocess-tasks/P202-ROOT/config
        $ ./trickleStream.py > tslogNN.txt &
        
        (Choose "NN" to be the next number in the sequence.)
      • Stopping trickleStream - either kill the process explicitly or, to halt gracefully, create a file called 'stop' in the config directory, e.g.,
        $ touch stop
        
      • Monitoring trickleStream - watch the log file, e.g.,
        $ tail -f tslogNN.txt
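
The log-numbering and stop-file conventions above can be sketched in Python. This is only an illustration under stated assumptions, not the contents of trickleStream.py: the helper names next_log_name and run_until_stopped, the two-digit width of "NN", and the 60-second polling interval are all hypothetical.

```python
import re
import time
from pathlib import Path


def next_log_name(config_dir: Path) -> str:
    """Return the next unused name in the tslogNN.txt sequence.

    Assumption: NN is a zero-padded, two-digit counter starting at 01.
    """
    pattern = re.compile(r"^tslog(\d+)\.txt$")
    numbers = [
        int(m.group(1))
        for p in config_dir.iterdir()
        if (m := pattern.match(p.name))
    ]
    return f"tslog{max(numbers, default=0) + 1:02d}.txt"


def run_until_stopped(config_dir: Path, do_cycle, poll_seconds: float = 60.0) -> int:
    """Call do_cycle() repeatedly until a file named 'stop' exists.

    do_cycle stands in for one pass of "check batch usage, submit new
    streams"; it is a placeholder, not the real submission logic.
    Returns the number of cycles completed before the stop file was seen.
    """
    stop_file = config_dir / "stop"
    cycles = 0
    while not stop_file.exists():
        do_cycle()
        cycles += 1
        time.sleep(poll_seconds)
    return cycles
```

In this sketch the stop file is checked only between cycles, so a cycle already in progress always finishes; that is what makes the halt graceful compared with killing the process. The sketch does not delete the 'stop' file, so a leftover file would end the loop immediately on the next start. Whether the real script cleans it up is not stated in these notes.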
        
  • Job Step Failures
    • Monitor failures from the main pipeline page for this task
    • Typically there have been ~10-30 failures per day, mostly transient disk-access problems (NFS, xroot) solved by a single rollback.
    • All rollbacks are manual (the task has no internal automatic retries), so that a serious problem cannot trigger runaway rollbacks.