status: Complete through 31 March 2012
last update: 31 May 2012
This page records the configuration and execution of the P202 reprocessing project: a full reprocessing from DIGIs using Pass7 analysis code, including Pass7 classification trees and new IRFs. The task will read DIGI files and emit RECON, MERIT, GCR and CAL ROOT files, plus the standard array of FITS files. It will be CPU- and storage-intensive, requiring months of elapsed time and of order 1 PB of storage. When the task begins, there will be about 20,000 Fermi science runs (3.5 years of accumulated data).
...
- 2/13/2012 - begin trials with final calibration and alignments from Leon; 5 runs reprocessed
- 2/14/2012 - trials continue with blocks of 15, 20, 25 and 50 runs reprocessed (each run generates ~20 batch jobs)
- 2/16/2012 - begin trickleStream production. Initial config:
Code Block
===============================================================================
TRICKLE PARMS
===============================================================================
task = P202-ROOT
maxRuns = 19172
firstStep = setupRun
steps = [['/processRun processClump', 1500, 20], ['mergeClumps', 70, 1]]
maxStreamsPerCycle = 20
timePerCycle = 300
===============================================================================
- 2/21/2012 - One clump reprocessed with pointer to new MySQL DB (stream 710.0)
- 2/22/2012 - 776 runs complete. Pausing task.
| S/W component | modification | status |
|---|---|---|
| FILTEREDMERIT TCut | CTBClassLevel>0 changed to ((FT1EventClass & 0x00003EFF)!=0) | done 3/14/2012 |
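The replacement TCut keeps an event when any bit selected by the mask 0x00003EFF is set in FT1EventClass. A minimal Python sketch of that test follows. The function names and sample values are illustrative assumptions, not part of the P202 code, and the meaning of the individual bits is not taken from this page.

```python
# Mask copied from the FILTEREDMERIT TCut: ((FT1EventClass & 0x00003EFF) != 0)
EVENT_CLASS_MASK = 0x00003EFF


def passes_filteredmerit_cut(ft1_event_class):
    """Return True when at least one masked bit is set (hypothetical helper)."""
    return (ft1_event_class & EVENT_CLASS_MASK) != 0


def masked_bits(mask):
    """List the bit positions that the mask selects."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


if __name__ == "__main__":
    # The mask selects bits 0-7 and 9-13; bit 8 is not part of the selection.
    print(masked_bits(EVENT_CLASS_MASK))
    # Sample values are made up: no bits, bit 2, bit 8 only, bits 2 and 8.
    for value in (0x0, 0x4, 0x100, 0x104):
        print(hex(value), passes_filteredmerit_cut(value))
```

An event whose only set bit is bit 8 fails the cut, while any event with at least one of the other masked bits set passes.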
- 3/15/2012 - resume task. New goal is one year of data (~5600 runs)
- 3/31/2012 - One year complete (5600 runs). A few serious problems need to be fixed before continuing. Note also that the FILTEREDMERIT files contain 42 more events than the EXTENDEDFT1 files; the two should be identical.
| S/W component | bug fix | status |
|---|---|---|
| New ROOT version | 5-min 'transaction timeout' triggered by xroot data server reboot | done 4/3/2012 |
| New GlastRelease | 1) include new ROOT version (above); 2) exit with non-zero RC on ROOT write error | done 4/5/2012, GR 17-35-24-rp04 |
| New GPL_TOOLS | check size/checksum of file written to xroot with known size/checksum | pending |
| Tuned xroot on new Dell servers | silent file truncation when volume fills up (JIRA) | done 4/4/2012 (100 MB min space limit -> 100 GB; file system space check cadence changed from 10 min to 2 min) |
| New xroot client tools | complain when xroot data server fails on write | done 4/3/2012, v3.1.1 |
| New TSkim | 1) new ROOT version (above); 2) complain on ROOT write errors | done 4/5/2012, v08-02-01 |
| New xroot redirector | required step toward enabling HPSS staging | done 4/3/2012, v3.1.1 |
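The pending GPL_TOOLS item calls for comparing the size and checksum of a file written to xroot against known values. A rough sketch of such a check on locally readable files is given here. The function names, the choice of Adler-32, and the use of local paths instead of an xroot client call are all assumptions made for illustration; this is not the GPL_TOOLS implementation.

```python
import os
import tempfile
import zlib


def adler32_of(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def verify_written_file(path, expected_size, expected_checksum):
    """Return a list of problems; an empty list means the file looks intact."""
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        problems.append(
            "size mismatch: expected %d, got %d" % (expected_size, actual_size)
        )
    actual_checksum = adler32_of(path)
    if actual_checksum != expected_checksum:
        problems.append(
            "checksum mismatch: expected %08x, got %08x"
            % (expected_checksum, actual_checksum)
        )
    return problems


if __name__ == "__main__":
    payload = b"P202 test payload\n" * 1000  # made-up file content
    size, checksum = len(payload), zlib.adler32(payload) & 0xFFFFFFFF
    with tempfile.TemporaryDirectory() as workdir:
        good = os.path.join(workdir, "good.root")
        truncated = os.path.join(workdir, "truncated.root")
        with open(good, "wb") as out:
            out.write(payload)
        with open(truncated, "wb") as out:
            out.write(payload[: len(payload) // 2])  # simulate silent truncation
        print(verify_written_file(good, size, checksum))
        print(verify_written_file(truncated, size, checksum))
```

A job wrapper could treat a non-empty problem list as a failure and exit with a non-zero return code, in the same spirit as the GlastRelease fix listed in the table.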
- 4/5/2012 - resume task. New goal is entire science dataset.
- 4/10/2012 - An unknown 'glitch' may have caused a few hundred jobs to crash and take sulky46 down with them.
- 4/11/2012 - Because sulky46/u18 may have been overloaded by jobs writing many core files, one change was made to processClumps.py: prepend "ulimit -c 0;" to the gleam command to disable all core file generation. This takes effect at approximately run 7605.
- 4/12/2012 - 10:40pm lightning strikes SLAC power lines. Site-wide power outage. Stream 7795 was the last stream submitted prior to the outage.
- 4/15/2012 - Batch farm back in operation, resume task...
- 5/9/2012 - major pipeline issue...shut down pipeline and allow to drain (due to tomorrow's major outage)
- 5/10/2012 - 13:40 outage over.
- Update GR from 17-35-24-rp04 to 17-35-24-rp07 in which the only change is replacing the 5-minute xroot time-out with 8 hours. This change effective with stream 14314 and previously failed pieces of four other runs: 14247.6, 14273.23, 14274.8, 14231.9.
- Leon advises that as of today, calibrations are valid only thru ~15 Dec 2011 (run 345574915) - which is somewhere around stream 18,400. He asks Sasha to produce more up-to-date calibs.
- 5/18/2012 - all calibrations now valid through 6 May 2012. No need to pause P202 task.
- 5/28/2012 - 15:30 Complete (through 31 March 2012)
- Data Catalog summary: There are discrepancies to track down!
| Name | Type | Files | Events | Size | Created (UTC) | Links |
|---|---|---|---|---|---|---|
| | Group | 20229 | 44,125,599,595 | 128.7 TB | 25-Jan-2012 00:53:31 | |
| | Group | 20229 | 90,904,582 | 205.7 GB | 25-Jan-2012 00:53:32 | |
| | Group | 20229 | 6,291,396,710 | 5.3 TB | 25-Jan-2012 00:53:29 | |
| | Group | 20229 | 44,123,014,456 | 942.7 GB | 25-Jan-2012 00:53:31 | |
| | Group | 20229 | 44,125,679,961 | 35.4 TB | 25-Jan-2012 00:53:30 | |
| | Group | 20229 | 44,123,612,977 | 590.0 TB | 25-Jan-2012 00:53:33 | |
Turns out to be three problematic runs/streams:
- 272707024/5723 - I/O prob, corrupt files, entire stream rolled back
- 279108810/6847 - xroot transient access prob., re-registered in dataCat
- 284813327/7848 - xroot transient access prob., re-registered in dataCat
- Final trickleStream configuration:
Code Block
===============================================================================
TRICKLE PARMS
===============================================================================
task = P202-ROOT
maxRuns = 20229
firstStep = setupRun
steps = [['/processRun processClump', 2000, 21], ['mergeClumps', 200, 1]]
maxStreamsPerCycle = 20
timePerCycle = 300
------DEBUG----------------
maxCycles = 0
chatter = False
dryRun = False
===============================================================================
- 5/31/2012 - Cleanup and summary
- Rolling back all or part of the three runs above solved the discrepancies in # events. New dataCatalog tally looks like this:
| Name | Type | Files | Events | Size | Created (UTC) | Links |
|---|---|---|---|---|---|---|
| | Group | 20229 | 44,125,679,961 | 128.7 TB | 25-Jan-2012 00:53:31 | |
| | Group | 20229 | 90,904,582 | 205.7 GB | 25-Jan-2012 00:53:32 | |
| | Group | 20229 | 6,291,396,711 | 5.3 TB | 25-Jan-2012 00:53:29 | |
| | Group | 20229 | 44,125,679,961 | 942.7 GB | 25-Jan-2012 00:53:31 | |
| | Group | 20229 | 44,125,679,961 | 35.4 TB | 25-Jan-2012 00:53:30 | |
| | Group | 20229 | 44,125,679,961 | 590.0 TB | 25-Jan-2012 00:53:33 | |
- Total run time for 20,229 runs was ~74 days (or about 273 runs/day reprocessed). This includes periods of changing trickleStream configuration as we figured out how much load we could safely put on the system.
- 6/5/2012 - Three streams rolled back and minor code changes for cleanup (see FITS chronology below for details)
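The 4/11/2012 entry in the chronology above describes prepending "ulimit -c 0;" to the gleam command so that crashing jobs write no core files. A minimal sketch of that kind of wrapping is shown here. The helper names and the stand-in command are hypothetical, and this is not the actual processClumps.py code.

```python
import subprocess


def disable_core_dumps(command):
    """Prefix a shell command so its core file size limit is zero (hypothetical helper)."""
    return "ulimit -c 0; " + command


def run_in_shell(command):
    """Run a command line through /bin/sh and return its stripped stdout."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # "ulimit -c" stands in for the real job command; it reports the core
    # file size limit in effect for the shell that would run the job.
    print(run_in_shell(disable_core_dumps("ulimit -c")))
```

Because the limit is set inside the same shell invocation that starts the job, it applies to that job and its children only and leaves the submitting process untouched.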
Configuration
Task Location | /nfs/farm/g/glast/u38/Reprocess-tasks/P202-ROOT |
Task Status | |
GlastRelease | 17-35-24-gr17 (SCons RHEL4-32 build) |
Run Selection | based on a modified "standard" selection, see https://confluence.slac.stanford.edu/display/SCIGRPS/Official+LAT+Datasets |
s/c data | "standard" Public Release 2 https://confluence.slac.stanford.edu/display/SCIGRPS/Official+LAT+Datasets |
Input Run List | ftp://ftp-glast.slac.stanford.edu/glast.u38/Reprocess-tasks/P202-ROOT/config/runList.txt |
photonFilter | CTBParticleType==1 && ((FT1EventClass & 0x00003EFF)!=0) |
electronFilter | CTBParticleType==1 |
Code Variants | redhat4-i686-32bit-gcc34 (Optimized) |
jobOpts | ftp://ftp-glast.slac.stanford.edu/glast.u38/Reprocess-tasks/P202-ROOT/config/doRecon.txt |
Output Data Products |
...
- 3/2/2012 - Define block 1 as the 776 runs in P202-ROOT block 1. Configure trickleStream and begin (14:08)
- 3/31/2012 - Define block 2 as 5600 runs. Reconfig trickleStream and begin (18:05)
- 4/01/2012 - Block 2 complete (most of the 4824 jobs completed in about six hours w/1000 job limit).
- 5/31/2012
- Discover stream 5599 (run 271999199) requires rollback - new MERIT file version (v203->v205).
- Regenerate runlist with 20,229 input MERIT files.
- Minor config change: twoClumpMin=False (formerly True, but caused unnecessary extra processClump dummy jobs)
- 6/5/2012 - Final cleanup
- Five runs are responsible for discrepant event tallies:
| Run | Stream | Reason | Action |
|---|---|---|---|
| 239557414 | 0 | bad TCut | rollback ROOT + FITS |
| 241599746 | 352 | bad processing order | rollback FITS |
| 245403855 | 1019 | bad start time | change start time in runList and rollback |
| 332661583 | 16244 | silent root errors | rollback ROOT |
| 339081502 | 17416 | silent root errors | rollback ROOT |
- After these actions, the dataCatalog tallies are now consistent:
| Name | Type | Files | Events | Size | Created (UTC) | Links |
|---|---|---|---|---|---|---|
| CAL | Group | 20229 | 44,125,679,961 | 128.7 TB | 25-Jan-2012 00:53:31 | Files |
| | Group | 20229 | 0 | 8.5 GB | 02-Mar-2012 00:06:07 | |
| | Group | 20229 | 90,904,582 | 205.7 GB | 25-Jan-2012 00:53:32 | |
| | Group | 20229 | 6,291,424,926 | 574.7 GB | 02-Mar-2012 00:06:09 | |
| | Group | 20229 | 6,291,424,926 | 1,020.1 GB | 02-Mar-2012 00:06:09 | |
| | Group | 20229 | 6,291,424,926 | 5.3 TB | 25-Jan-2012 00:53:29 | |
| | Group | 20229 | 189,323,074 | 17.8 GB | 02-Mar-2012 00:06:06 | |
| | Group | 20229 | 44,125,679,961 | 942.7 GB | 25-Jan-2012 00:53:31 | |
| | Group | 20229 | 1,325,204,821 | 215.3 GB | 02-Mar-2012 00:06:08 | |
| | Group | 20229 | 44,125,679,961 | 35.4 TB | 25-Jan-2012 00:53:30 | |
| | Group | 20229 | 44,125,679,961 | 590.0 TB | 25-Jan-2012 00:53:33 | |
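Checking that the tallies are consistent amounts to confirming that data products which should hold the same events report the same event count. A small sketch of such a comparison follows, using the four discrepant counts from the 5/28/2012 Data Catalog summary. The product names are not preserved on this page, so the keys are placeholder labels built from each row's size, and the function is an illustration, not a dataCatalog tool.

```python
# Event counts copied from the 5/28/2012 summary. Keys are placeholder labels
# (the size shown in each row), since the product names are not preserved.
COUNTS_BEFORE_ROLLBACK = {
    "row_128.7TB": 44_125_599_595,
    "row_942.7GB": 44_123_014_456,
    "row_35.4TB": 44_125_679_961,
    "row_590.0TB": 44_123_612_977,
}


def event_deficits(counts, reference):
    """Return {label: count - reference count} for every row that disagrees."""
    expected = counts[reference]
    return {
        label: count - expected
        for label, count in counts.items()
        if count != expected
    }


if __name__ == "__main__":
    # Assumption: the 35.4 TB row is the reference, because its count equals
    # the value that every one of these rows reports after the rollbacks.
    deficits = event_deficits(COUNTS_BEFORE_ROLLBACK, "row_35.4TB")
    for label, delta in deficits.items():
        print(f"{label}: {delta:+,d} events")
```

An empty result means every row agrees with the reference, which is the state reached after the rollbacks listed above.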
Configuration
Task Location | /nfs/farm/g/glast/u38/Reprocess-tasks/P202-FITS |
Task Status | http://glast-ground.slac.stanford.edu/Pipeline-II/task.jsp?task=75031156 |
Input Data | MERIT (direct from P202-ROOT) |
spacecraft data | same as P202-ROOT |
Input Run List | ftp://ftp-glast.slac.stanford.edu/glast.u38/Reprocess-tasks/P202-FITS/config/runList.txt |
evtClassDefs | 00-19-04 (6/23/2011) |
eventClassMap | EvtClassDefs_P7V6.xml |
ScienceTools | 09-27-01 (2/15/2012) |
Code Variants | redhat5-i686-32bit-gcc41 (Optimized) |
Diffuse Model | based on contents of /afs/slac.stanford.edu/g/glast/ground/GLAST_EXT/diffuseModels/v2r0 |
Diffuse Response | 'source' using P7SOURCE_V6 IRF |
IRFs | P7V6, contained within ScienceTools release |
Output Data Products |
...
| Data Product | destination | data content [1] | event selection [1] | makeFT1 | gtselect | gtdiffrsp | gtmktime |
|---|---|---|---|---|---|---|---|
| EXTENDEDFT1 | SLAC | FT1variables | ((FT1EventClass & 0x00003EFF)!=0) | | | | |
| FT1 | FSSC+SLAC | FT1variables | 'source' and above | | | (inherited) | |
| EXTENDEDLS1 | SLAC | LS1variables | ((FT1EventClass & 0x00003EFF)!=0) | | | | |
| LS1 | FSSC+SLAC | LS1variables | 'transient' and above | | | (inherited) | |
| ELECTRONFT1 | SLAC | FT1variables | CTBParticleType==1 | | | | |
...