...

  • 2/13/2012 - begin trials with final calibration and alignments from Leon; 5 runs reprocessed
  • 2/14/2012 - trials continue with blocks of 15, 20, 25 and 50 runs reprocessed (each run generates ~20 batch jobs)
  • 2/16/2012 - begin trickleStream production. Initial config:
    Code Block
    ===============================================================================
      TRICKLE PARMS
    ===============================================================================
    task =  P202-ROOT
    maxRuns =  19172
    firstStep =  setupRun
    steps =  [['/processRun processClump', 1500, 20], ['mergeClumps', 70, 1]]
    maxStreamsPerCycle =  20
    timePerCycle =  300
    ===============================================================================
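The parameter dump above is plain `key =  value` text, so it can be read back into a dictionary for comparison between configurations. The following is a minimal sketch, not part of trickleStream itself: `parse_parms` is a hypothetical helper name, and the meaning of the two numbers in each `steps` entry is not documented here, so the sketch only parses them.

```python
import ast

SAMPLE = """
===============================================================================
  TRICKLE PARMS
===============================================================================
task =  P202-ROOT
maxRuns =  19172
firstStep =  setupRun
steps =  [['/processRun processClump', 1500, 20], ['mergeClumps', 70, 1]]
maxStreamsPerCycle =  20
timePerCycle =  300
===============================================================================
"""

def parse_parms(text):
    """Parse 'key =  value' lines of a TRICKLE PARMS dump into a dict."""
    parms = {}
    for line in text.splitlines():
        # Skip rules (=====), the title line and blank lines
        if "=" not in line or line.lstrip().startswith("="):
            continue
        key, _, raw = line.partition("=")
        raw = raw.strip()
        try:
            value = ast.literal_eval(raw)  # numbers, lists, True/False
        except (ValueError, SyntaxError):
            value = raw                    # bare words such as P202-ROOT
        parms[key.strip()] = value
    return parms

parms = parse_parms(SAMPLE)
print(parms["maxRuns"], parms["steps"])
```

Values that are not Python literals (the task name, the first step) are kept as strings, so the same helper should also accept the later dump that carries the extra DEBUG fields.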
    
  • 2/21/2012 - One clump reprocessed with pointer to new MySQL DB (stream 710.0)
  • 2/22/2012 - 776 runs complete. Pausing task.

    S/W component       modification                                                  status
    ------------------  ------------------------------------------------------------  --------------
    FILTEREDMERIT TCut  CTBClassLevel>0 changed to ((FT1EventClass & 0x00003EFF)!=0)  done 3/14/2012
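The new TCut keeps an event when any bit selected by the mask 0x00003EFF is set in FT1EventClass. A small sketch of that arithmetic follows; the function names are illustrative only, and what each bit means physically is not stated in this log.

```python
MASK = 0x00003EFF  # mask from the new FILTEREDMERIT TCut

def passes_cut(ft1_event_class):
    """True when any masked bit is set: (FT1EventClass & MASK) != 0."""
    return (ft1_event_class & MASK) != 0

def set_bits(mask):
    """Return the positions of the bits set in mask, lowest first."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

print(set_bits(MASK))  # [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13]
```

The mask covers bits 0-7 and 9-13, so an event whose only set bit is bit 8 fails the cut.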

  • 3/15/2012 - resume task. New goal is 1-year of data (~5600 runs)
  • 3/31/2012 - 1-year complete (5600 runs). There have been a few nasty problems which need to be fixed before continuing.

    S/W component                    bug fix                                                                            status
    -------------------------------  ---------------------------------------------------------------------------------  ------------------------------------------------
    New ROOT version                 5-min 'transaction timeout' triggered by xroot data server reboot                  done 4/3/2012
    New GlastRelease                 1) include new ROOT version (above); 2) exit with non-zero RC on ROOT write error  done 4/5/2012, GR 17-35-24-rp04 (or -rp07)
    New GPL_TOOLS (?)                check size/checksum of file written to xroot with known size/checksum              pending
    Tuned xroot on new Dell servers  silent file truncation when volume fills up JIRA                                   done 4/4/2012 (100 MB min space limit -> 100 GB;
                                                                                                                        file system space check cadence changed from
                                                                                                                        10 min to 2 min)
    New xroot client tools           complain when xroot data server fails on write                                     done 4/3/2012, v3.1.1
    New TSkim                        1) new ROOT version (above); 2) complain on ROOT write errors                      done 4/5/2012, v08-02-01
    New xroot redirector             required step toward enabling HPSS staging                                         done 4/3/2012, v3.1.1

    Note also that the FILTEREDMERIT files contain 42 more events than the EXTENDEDFT1 files; they should be identical.
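The pending GPL_TOOLS item (check the size/checksum of a file written to xroot against the known size/checksum) could look roughly like the sketch below. It is not the GPL_TOOLS code: it works on local paths only, the xroot read-back step is left out, Adler-32 is an assumed checksum choice, and both function names are hypothetical.

```python
import zlib

def file_size_and_adler32(path, chunk_size=1 << 20):
    """Return (size in bytes, Adler-32 checksum) of a local file."""
    checksum = 1  # Adler-32 starting value
    size = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
            size += len(chunk)
    return size, checksum & 0xFFFFFFFF

def verify_copy(source_path, copied_path):
    """Raise IOError if the copy differs in size or checksum from the source."""
    expected = file_size_and_adler32(source_path)
    actual = file_size_and_adler32(copied_path)
    if expected != actual:
        raise IOError("copy mismatch: expected %s, got %s" % (expected, actual))
    return actual
```

A check of this kind would catch the silent truncation case listed in the table, because a truncated copy differs from the source in size even before the checksum is compared.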
  • 4/5/2012 - resume task. New goal is entire science dataset.
  • 4/10/2012 - An unknown 'glitch' may have caused a few hundred jobs to crash and take sulky46 along with them.
  • 4/11/2012 - Due to a possible overload of sulky46/u18 from writing many core files, one change has been introduced in processClumps.py: prepend "ulimit -c 0;" to the gleam command to disable all core file generation. This takes effect at approximately run 7605.
  • 4/12/2012 - 10:40pm lightning strikes SLAC power lines. Site-wide power outage. Stream 7795 was the last stream submitted prior to the outage.
  • 4/15/2012 - Batch farm back in operation, resume task...
  • 5/9/2012 - major pipeline issue...shut down pipeline and allow to drain (due to tomorrow's major outage)
  • 5/10/2012 - 13:40 outage over.
    • Update GR from 17-35-24-rp04 to 17-35-24-rp07, in which the only change is replacing the 5-minute xroot time-out with 8 hours. This change is effective with stream 14314 and with previously failed pieces of four other runs: 14247.6, 14273.23, 14274.8, 14231.9.
    • Leon advises that as of today, calibrations are valid only through ~15 Dec 2011 (run 345574915), which is somewhere around stream 18,400. He asks Sasha to produce more up-to-date calibrations.
  • 5/18/2012 - all calibrations now valid through 6 May 2012. No need to pause P202 task.
  • 5/28/2012 - 15:30 Complete (through 31 March 2012)
    • Data Catalog summary:  

      Name           Type   Files  Events          Size      Created (UTC)         Links
      -------------  -----  -----  --------------  --------  --------------------  -----
      CAL            Group  20229  44,125,599,595  128.7 TB  25-Jan-2012 00:53:31  Files
      ELECTRONMERIT  Group  20229  90,904,582      205.7 GB  25-Jan-2012 00:53:32  Files
      FILTEREDMERIT  Group  20229  6,291,396,710   5.3 TB    25-Jan-2012 00:53:29  Files
      GCR            Group  20229  44,123,014,456  942.7 GB  25-Jan-2012 00:53:31  Files
      MERIT          Group  20229  44,125,679,961  35.4 TB   25-Jan-2012 00:53:30  Files
      RECON          Group  20229  44,123,612,977  590.0 TB  25-Jan-2012 00:53:33  Files

      There are discrepancies to track down!
      Turns out to be three problematic runs/streams:
      • 272707024/5723 - I/O problem, corrupt files, entire stream rolled back
      • 279108810/6847 - transient xroot access problem, re-registered in dataCat
      • 284813327/7848 - transient xroot access problem, re-registered in dataCat
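The size of each discrepancy in the 5/28/2012 summary can be worked out directly from the Events column. The sketch below assumes that CAL, GCR, MERIT and RECON should all carry the same event count and uses MERIT as the reference (the tally after cleanup lists all four at the MERIT value); `shortfalls` is a hypothetical helper name.

```python
# Event counts from the Data Catalog summary of 5/28/2012
events = {
    "CAL":   44_125_599_595,
    "GCR":   44_123_014_456,
    "MERIT": 44_125_679_961,
    "RECON": 44_123_612_977,
}

def shortfalls(tallies, reference="MERIT"):
    """Return {product: reference count - product count} for mismatches."""
    expected = tallies[reference]
    return {name: expected - count
            for name, count in tallies.items() if count != expected}

print(shortfalls(events))
# {'CAL': 80366, 'GCR': 2665505, 'RECON': 2066984}
```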
  • Final trickleStream configuration:
    Code Block
    ===============================================================================
      TRICKLE PARMS
    ===============================================================================
    task =  P202-ROOT
    maxRuns =  20229
    firstStep =  setupRun
    steps =  [['/processRun processClump', 2000, 21], ['mergeClumps', 200, 1]]
    maxStreamsPerCycle =  20
    timePerCycle =  300
    ------DEBUG----------------
    maxCycles =  0
    chatter =  False
    dryRun =  False
    ===============================================================================
    
  • 5/31/2012 - Cleanup and summary
    • Rolling back all or part of the three runs above solved the discrepancies in # events.  New dataCatalog tally looks like this:

      Name           Type   Files  Events          Size      Created (UTC)         Links
      -------------  -----  -----  --------------  --------  --------------------  -----
      CAL            Group  20229  44,125,679,961  128.7 TB  25-Jan-2012 00:53:31  Files
      ELECTRONMERIT  Group  20229  90,904,582      205.7 GB  25-Jan-2012 00:53:32  Files
      FILTEREDMERIT  Group  20229  6,291,396,711   5.3 TB    25-Jan-2012 00:53:29  Files
      GCR            Group  20229  44,125,679,961  942.7 GB  25-Jan-2012 00:53:31  Files
      MERIT          Group  20229  44,125,679,961  35.4 TB   25-Jan-2012 00:53:30  Files
      RECON          Group  20229  44,125,679,961  590.0 TB  25-Jan-2012 00:53:33  Files

    • Total run time for 20,229 runs was ~74 days (or about 273 runs/day reprocessed). This includes periods of changing trickleStream configuration as we figured out how much load we could safely put on the system.
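The quoted rate follows from the two figures in the summary; a one-function sketch of the arithmetic (the helper name is illustrative):

```python
def runs_per_day(total_runs, elapsed_days):
    """Average number of runs reprocessed per day."""
    return total_runs / elapsed_days

rate = runs_per_day(20229, 74)
print(round(rate))  # 273
```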
  • 6/5/2012 - Three streams rolled back and minor code changes for cleanup (see FITS chronology below for details)
  • 8/10/2012 - Update task for block3 of backfill (1 Apr 2012 - 31 July 2012) and restart reprocessing.
  • 8/22/2012 - backfill complete
  • 10/6/2012 - Rollback the following seven streams to fix apparently corrupt MERIT files.

    Stream  Run
    ------  ---------  ----------------
    3345    259101994  <- found by FSSC
    4122    263571912
    4707    266893978
    13927   319436826
    16181   332306548  <- found by FSSC
    17430   339161346
    17479   339408141

  • 10/8/2012 - Update task for block4 of backfill (1051 new runs for a total of 23,141) and start reprocessing.

    First run of block 4  365473283  2012-08-01 00:21:20 UTC
    Last run of block 4   371258376  2012-10-06 23:19:33 UTC
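The two rows above are consistent with run IDs being a count of elapsed seconds: the IDs differ by 5,785,093 and so do the two timestamps. Under that assumption, a run's start time can be estimated by offsetting from one known pair. This is a sketch only; `run_start_utc` is a hypothetical helper, and it ignores any leap second that might fall between the anchor and the run being converted.

```python
from datetime import datetime, timedelta, timezone

# Anchor pair taken from the block 4 table (last run of block 4)
ANCHOR_RUN = 371258376
ANCHOR_UTC = datetime(2012, 10, 6, 23, 19, 33, tzinfo=timezone.utc)

def run_start_utc(run_id):
    """Estimate a run's start time by offsetting from the anchor pair."""
    return ANCHOR_UTC + timedelta(seconds=run_id - ANCHOR_RUN)

print(run_start_utc(365473283).strftime("%Y-%m-%d %H:%M:%S UTC"))
# 2012-08-01 00:21:20 UTC
```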

  • 10/15/2012 - Block 4 complete. One problem with run 22240 (see below), rolled back successfully.
  • 12/13/2012 - Update task for block5 of backfill (1001 new runs for a total of 24,142)

    First run of block 5  371264424  2012-10-07 01:00:21 UTC
    Last run of block 5   376959687  2012-12-11 23:01:24 UTC

    Status of Calibrations: unknown
  • 12/14/2012 - Status of dataCatalog prior to beginning block 5 catch-up:

...

  • 3/2/2012 - Define block 1 as the 776 runs in P202-ROOT block 1. Configure trickleStream and begin (14:08)
  • 3/31/2012 - Define block 2 as 5600 runs. Reconfig trickleStream and begin (18:05)
  • 4/01/2012 - Block 2 complete (most of the 4824 jobs completed in about six hours w/1000 job limit).
  • 5/31/2012
    • Discover stream 5599 (run 271999199) requires rollback - new MERIT file version (v203->v205).
    • Regenerate runlist with 20,229 input MERIT files.
    • Minor config change: twoClumpMin=False (formerly True, but caused unnecessary extra processClump dummy jobs)
  • 6/5/2012 - Final cleanup
    • Five runs are responsible for discrepant event tallies:

      Run        Stream  Reason                Action
      ---------  ------  --------------------  -----------------------------------------
      239557414  0       bad TCut              rollback ROOT + FITS
      241599746  352     bad processing order  rollback FITS
      245403855  1019    bad start time        change start time in runList and rollback
      332661583  16244   silent ROOT errors    rollback ROOT
      339081502  17416   silent ROOT errors    rollback ROOT

    • Code changes:

      directory             script       modification
      --------------------  -----------  ---------------------------------------------------------------------------
      commonTools/00-01-00  repTools.py  added new getKey() function to extract #events from FITS event files
      P202-FITS/config      config.py    added os.environ['HEADASNOQUERY']='true' to enable 'ftlist' to run in batch
      P202-FITS/config      makeFITS.py  add diagnostic print of #evts in FITS files for each processing sub-step

    • After these actions, the dataCatalog tallies are now consistent:

      Name           Type   Files  Events          Size        Created (UTC)         Links
      -------------  -----  -----  --------------  ----------  --------------------  -----
      ELECTRONFT1    Group  20229  0               8.5 GB      02-Mar-2012 00:06:07  Files
      ELECTRONMERIT  Group  20229  90,904,582      205.7 GB    25-Jan-2012 00:53:32  Files
      FT1            Group  20229  189,323,074     17.8 GB     02-Mar-2012 00:06:06  Files
      LS1            Group  20229  1,325,204,821   215.3 GB    02-Mar-2012 00:06:08  Files
      EXTENDEDFT1    Group  20229  6,291,424,926   574.7 GB    02-Mar-2012 00:06:09  Files
      EXTENDEDLS1    Group  20229  6,291,424,926   1,020.1 GB  02-Mar-2012 00:06:09  Files
      FILTEREDMERIT  Group  20229  6,291,424,926   5.3 TB      25-Jan-2012 00:53:29  Files
      MERIT          Group  20229  44,125,679,961  35.4 TB     25-Jan-2012 00:53:30  Files
      RECON          Group  20229  44,125,679,961  590.0 TB    25-Jan-2012 00:53:33  Files
      GCR            Group  20229  44,125,679,961  942.7 GB    25-Jan-2012 00:53:31  Files
      CAL            Group  20229  44,125,679,961  128.7 TB    25-Jan-2012 00:53:31  Files

      Note that the number of events in ELECTRONFT1 files is not currently tallied by the dataCatalog.
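Two of the code changes listed for 6/5/2012 (the HEADASNOQUERY setting in config.py and the getKey() helper in repTools.py) can be illustrated together. The actual getKey() implementation is not shown in this log, so the sketch below is hypothetical: it assumes the event count is read from the NAXIS2 keyword (the row count of a FITS table extension) in a header listing that has already been captured as text, and it does not show how that listing is produced.

```python
import os
import re

def batch_env():
    """Copy of the environment with HEADAS prompting disabled, as in config.py."""
    env = dict(os.environ)
    env["HEADASNOQUERY"] = "true"
    return env

def rows_from_header(header_text):
    """Extract NAXIS2 (table row count) from a FITS header listing."""
    match = re.search(r"^NAXIS2\s*=\s*(\d+)", header_text, re.MULTILINE)
    if match is None:
        raise ValueError("NAXIS2 not found")
    return int(match.group(1))

# Made-up header cards, for illustration only
sample = (
    "XTENSION= 'BINTABLE'           / binary table extension\n"
    "NAXIS1  =                  100 / width of table in bytes\n"
    "NAXIS2  =                12345 / number of rows in table\n"
)
print(rows_from_header(sample))  # 12345
```

The dictionary returned by `batch_env()` can be passed as the `env` argument of `subprocess.run` when a HEADAS tool is launched from a batch job.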
  • 8/24/2012 - Configure and run backfill through 31 July 2012
  • 10/7/2012 - Roll back the following seven streams to fix corrupt FITS files. The makeFT1 app received a large number of ROOT errors but terminated normally. The resulting FT1/LS1 files had multiple symptoms, including one discovered by the FSSC during transfer to them: one EVENT_ID was repeated ~1000 times. A scan of the log files uncovered an additional five mergeClumps with these errors.
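The repeated-EVENT_ID symptom described in the 10/7/2012 entry is easy to test for once the EVENT_ID column has been read into a sequence. A minimal sketch, assuming the IDs are already in memory (reading them from the FITS file is not shown) and using a hypothetical helper name and threshold:

```python
from collections import Counter

def repeated_ids(event_ids, min_repeats=2):
    """Return {event_id: count} for IDs seen at least min_repeats times."""
    counts = Counter(event_ids)
    return {eid: n for eid, n in counts.items() if n >= min_repeats}

print(repeated_ids([101, 102, 102, 103, 102]))  # {102: 3}
```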

...

Data Product  destination  data content [1]  event selection [1]                  makeFT1  gtselect  gtdiffrsp    gtmktime
------------  -----------  ----------------  -----------------------------------  -------  --------  -----------  --------
EXTENDEDFT1   SLAC         FT1variables      ((FT1EventClass & 0x00003EFF)!=0)    ✓        ✗         ✓            ✓
                                             pass7.6_Extended_cuts_L1

FT1           FSSC+SLAC    FT1variables      'source' and above                   ✗        ✓         (inherited)  ✓
                                             EVENT_CLASS bits 2,3,4
                                             evclass=2 filtered from EXTENDEDFT1

EXTENDEDLS1   SLAC         LS1variables      ((FT1EventClass & 0x00003EFF)!=0)    ✓        ✗         ✓            ✓
                                             pass7.6_Extended_cuts_L1

LS1           FSSC+SLAC    LS1variables      'transient' and above                ✗        ✓         (inherited)  ✓
                                             EVENT_CLASS bits 0,2,3,4
                                             evclass=0 filtered from EXTENDEDLS1

ELECTRONFT1   SLAC         FT1variables      CTBParticleType==1                   ✓        ✗         ✗            ✓
                                             pass7.6_Electrons_cuts_L1

...