...
- 2/13/2012 - begin trials with final calibration and alignments from Leon; 5 runs reprocessed
- 2/14/2012 - trials continue with blocks of 15, 20, 25 and 50 runs reprocessed (each run generates ~20 batch jobs)
- 2/16/2012 - begin trickleStream production. Initial config:
    ===============================================================================
    TRICKLE PARMS
    ===============================================================================
    task = P202-ROOT
    maxRuns = 19172
    firstStep = setupRun
    steps = [['/processRun processClump', 1500, 20], ['mergeClumps', 70, 1]]
    maxStreamsPerCycle = 20
    timePerCycle = 300
    ===============================================================================
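The two throttle parameters in this configuration put a ceiling on how fast streams can be submitted. The following is only a sketch of that arithmetic. It assumes, which the log does not state, that `timePerCycle` is in seconds and that at most `maxStreamsPerCycle` new streams are created in each cycle. The helper name `max_streams_per_day` is hypothetical and is not part of trickleStream.

```python
def max_streams_per_day(max_streams_per_cycle: int, time_per_cycle_s: int) -> float:
    """Upper bound on streams submitted per day under the assumed throttle.

    Assumption: one cycle lasts time_per_cycle_s seconds and creates at most
    max_streams_per_cycle new streams.
    """
    seconds_per_day = 24 * 60 * 60
    cycles_per_day = seconds_per_day / time_per_cycle_s
    return max_streams_per_cycle * cycles_per_day


# Values from the initial TRICKLE PARMS block: maxStreamsPerCycle = 20, timePerCycle = 300
print(max_streams_per_day(20, 300))  # 5760.0
```

Under these assumptions the submission throttle alone would allow far more streams per day than were actually processed, so the other limits in the configuration and the system load presumably set the real pace.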
- 2/21/2012 - One clump reprocessed with pointer to new mySQL DB (stream 710.0)
- 2/22/2012 - 776 runs complete. Pausing task.
- 3/15/2012 - resume task. New goal is 1-year of data (~5600 runs)
- 3/31/2012 - 1-year complete (5600 runs). There have been a few nasty problems which need to be fixed before continuing. Note also that the FILTEREDMERIT files contain 42 more events than the EXTENDEDFT1 files; they should be identical.
S/W component | bug fix | status
---|---|---
New ROOT version | 5-min 'transaction timeout' triggered by xroot data server reboot | done 4/3/2012
New GlastRelease | 1) include new ROOT version (above); 2) exit with non-zero RC on ROOT write error | done 4/5/2012, GR 17-35-24-rp04
New GPL_TOOLS | check size/checksum of file written to xroot with known size/checksum | pending
Tuned xroot on new Dell servers | silent file truncation when volume fills up (JIRA) | done 4/4/2012 (100 MB min space limit -> 100 GB; file system space check cadence changed from 10 min to 2 min)
New xroot client tools | complain when xroot data server fails on write | done 4/3/2012, v3.1.1
New TSkim | 1) new ROOT version (above); 2) complain on ROOT write errors | done 4/5/2012, v08-02-01
New xroot redirector | required step toward enabling HPSS staging | done 4/3/2012, v3.1.1
Modified FILTEREDMERIT filter | CTBClassLevel>0 changed to ((FT1EventClass & 0x00003EFF)!=0) | done 3/14/2012
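The modified FILTEREDMERIT filter is a plain bitmask test, so an event passes when at least one of the masked bits is set in `FT1EventClass`. The sketch below shows which bit positions the mask `0x00003EFF` covers and how the test behaves on a few made-up values. The function names `passes_filter` and `mask_bits` are hypothetical, and the meaning of the individual bits is not given in this log.

```python
FILTER_MASK = 0x00003EFF  # mask quoted in the FILTEREDMERIT filter above


def passes_filter(ft1_event_class: int) -> bool:
    """Python equivalent of ((FT1EventClass & 0x00003EFF) != 0)."""
    return (ft1_event_class & FILTER_MASK) != 0


def mask_bits(mask: int) -> list:
    """Return the bit positions (0 = least significant) that are set in mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


print(mask_bits(FILTER_MASK))     # [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13]
print(passes_filter(0x00000004))  # True: bit 2 is inside the mask
print(passes_filter(0x00000100))  # False: bit 8 is the one low bit the mask skips
print(passes_filter(0))           # False: no bits set at all
```

The example values are illustrative only and are not taken from real event data.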
- 4/5/2012 - resume task. New goal is entire science dataset.
- 4/10/2012 - An unknown 'glitch' may have caused a few hundred jobs to crash and take sulky46 along with them.
- 4/11/2012 - Because sulky46/u18 may have been overloaded by writing a lot of core files, one change has been introduced in processClumps.py: prepend "ulimit -c 0;" to the gleam command to disable all core file generation. This starts at approximately run 7605.
- 4/12/2012 - 10:40pm lightning strikes SLAC power lines. Site-wide power outage. Stream 7795 was the last stream submitted prior to the outage.
- 4/15/2012 - Batch farm back in operation, resume task...
- 5/9/2012 - major pipeline issue...shut down pipeline and allow to drain (due to tomorrow's major outage)
- 5/10/2012 - 13:40 outage over.
- Update GR from 17-35-24-rp04 to 17-35-24-rp07, in which the only change is replacing the 5-minute xroot time-out with an 8-hour one. This change is effective with stream 14314 and with the previously failed pieces of four other runs: 14247.6, 14273.23, 14274.8, 14231.9.
- Leon advises that, as of today, calibrations are valid only through ~15 Dec 2011 (run 345574915), which is somewhere around stream 18,400. He asks Sasha to produce more up-to-date calibrations.
- 5/18/2012 - all calibrations now valid through 6 May 2012. No need to pause P202 task.
- 5/28/2012 - 15:30 Complete (through 31 March 2012)
- Data Catalog summary: There are discrepancies to track down!
| Name | Type | Files | Events | Size | Created (UTC) | Links |
|---|---|---|---|---|---|---|
| | Group | 20229 | 44,125,599,595 | 128.7 TB | 25-Jan-2012 00:53:31 | |
| | Group | 20229 | 90,904,582 | 205.7 GB | 25-Jan-2012 00:53:32 | |
| | Group | 20229 | 6,291,396,710 | 5.3 TB | 25-Jan-2012 00:53:29 | |
| | Group | 20229 | 44,123,014,456 | 942.7 GB | 25-Jan-2012 00:53:31 | |
| | Group | 20229 | 44,125,679,961 | 35.4 TB | 25-Jan-2012 00:53:30 | |
| | Group | 20229 | 44,123,612,977 | 590.0 TB | 25-Jan-2012 00:53:33 | |
- The discrepancies turn out to come from three problematic runs/streams:
- 272707024/5723 - I/O problem, corrupt files, entire stream rolled back
- 279108810/6847 - transient xroot access problem, re-registered in dataCat
- 284813327/7848 - transient xroot access problem, re-registered in dataCat
- Final trickleStream configuration:
    ===============================================================================
    TRICKLE PARMS
    ===============================================================================
    task = P202-ROOT
    maxRuns = 20229
    firstStep = setupRun
    steps = [['/processRun processClump', 2000, 21], ['mergeClumps', 200, 1]]
    maxStreamsPerCycle = 20
    timePerCycle = 300
    ------DEBUG----------------
    maxCycles = 0
    chatter = False
    dryRun = False
    ===============================================================================
- 5/31/2012 - Cleanup and summary
- Rolling back all or part of the three runs above solved the discrepancies in # events. New dataCatalog tally looks like this:
| Name | Type | Files | Events | Size | Created (UTC) | Links |
|---|---|---|---|---|---|---|
| | Group | 20229 | 44,125,679,961 | 128.7 TB | 25-Jan-2012 00:53:31 | |
| | Group | 20229 | 90,904,582 | 205.7 GB | 25-Jan-2012 00:53:32 | |
| | Group | 20229 | 6,291,396,711 | 5.3 TB | 25-Jan-2012 00:53:29 | |
| | Group | 20229 | 44,125,679,961 | 942.7 GB | 25-Jan-2012 00:53:31 | |
| | Group | 20229 | 44,125,679,961 | 35.4 TB | 25-Jan-2012 00:53:30 | |
| | Group | 20229 | 44,125,679,961 | 590.0 TB | 25-Jan-2012 00:53:33 | |
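To see how much the rollbacks changed, the two Data Catalog tallies can be compared group by group. This is a minimal sketch using the event counts quoted in the two tallies. The group names did not survive in this log, so each group is keyed by its reported size, which is an arbitrary choice made for this illustration. The helper `recovered_events` is hypothetical.

```python
from collections import Counter

# Event counts per group, keyed by the size shown in the tallies
# (the group names are not preserved in this log).
before = {
    "128.7 TB": 44_125_599_595,
    "205.7 GB": 90_904_582,
    "5.3 TB": 6_291_396_710,
    "942.7 GB": 44_123_014_456,
    "35.4 TB": 44_125_679_961,
    "590.0 TB": 44_123_612_977,
}
after = {
    "128.7 TB": 44_125_679_961,
    "205.7 GB": 90_904_582,
    "5.3 TB": 6_291_396_711,
    "942.7 GB": 44_125_679_961,
    "35.4 TB": 44_125_679_961,
    "590.0 TB": 44_125_679_961,
}


def recovered_events(old: dict, new: dict) -> dict:
    """Return the event-count change for every group whose count changed."""
    return {key: new[key] - old[key] for key in old if new[key] != old[key]}


for group, delta in recovered_events(before, after).items():
    print(f"{group}: +{delta:,} events")

# After cleanup the most common event count is shared by four groups.
count, n_groups = Counter(after.values()).most_common(1)[0]
print(f"{n_groups} groups agree at {count:,} events")
```

Four groups gain events and end up at the same total, while the 205.7 GB and 35.4 TB groups are unchanged.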
- Total run time for 20,229 runs was ~74 days (or about 273 runs/day reprocessed). This includes periods of changing trickleStream configuration as we figured out how much load we could safely put on the system.
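The quoted throughput follows directly from the run count and the run time. As a quick check, the sketch below reproduces it and also computes the calendar span between the start of trickleStream production (2/16/2012) and completion (5/28/2012). The variable names are illustrative only.

```python
from datetime import date

TOTAL_RUNS = 20_229   # runs reprocessed, from the log
RUN_TIME_DAYS = 74    # approximate total run time, from the log

runs_per_day = TOTAL_RUNS / RUN_TIME_DAYS
print(round(runs_per_day))  # 273

# Calendar span from the start of trickleStream production to completion.
calendar_days = (date(2012, 5, 28) - date(2012, 2, 16)).days
print(calendar_days)  # 102
```

The calendar span is longer than the ~74 days of run time. The log does not say how the 74 days were counted, but the task was paused several times in between.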
...
Data Product | destination | data content [1] | event selection [1] | makeFT1 | gtselect | gtdiffrsp | gtmktime |
---|---|---|---|---|---|---|---|---|---|
EXTENDEDFT1 | SLAC | FT1variables | ((FT1EventClass & 0x00003EFF)!=0) | | | | | ||
FT1 | FSSC+SLAC | FT1variables | 'source' and above | | | (inherited) | | ||
EXTENDEDLS1 | SLAC | LS1variables | ((FT1EventClass & 0x00003EFF)!=0) | | | | | ||
LS1 | FSSC+SLAC | LS1variables | 'transient' and above | | | (inherited) | | ||
ELECTRONFT1 | SLAC | FT1variables | CTBParticleType==1 | | | | |
...