This page describes the design of CAL crystal recon. In particular, it is intended to contain the definition of the xtal status word, the algorithm to mitigate xtal saturation, the algorithms to mitigate electronic failures, etc.

It includes some discussion of mitigation of the failure of the HE channels of a single GCFE. There are 3072 GCFEs in CAL, two for each of the 1536 xtals. The problem occurred in the run starting at MET = 301753824, which is MJD 55402.52108796, or 2010 Jul 25 at 12:30:22 UTC, or 2010 day 206 at 12:30:22 UTC.

Symptoms of the failure

On 27 Jul 2010 (Day 208), at 1:55 PM EDT, Anders Borgland wrote:

Starting with run 301753824 we have two problems:
1/ We do not see any signal in the high energy diode in tower 4, layer X1, column 4, + side. From Eric Siskind: "Whether the failure is within the diode itself or in the GCFE electronics chain (HE preamp or slow shaper) is currently unknown". You can read the whole thread here:
https://www-glast.stanford.edu/protected/mail/datamon/4835.html
2/ While problem 1/ affects all (high energy) events in that channel, GCR events (4-range and zero-suppressed events) tickle FSW bug 1156:
https://jira.slac.stanford.edu/browse/FSW-1156
This means that about 10 events per run will fail in the decompression. Because of the way the Halfpipe works we lose the complete datagram for each of these events. Since a datagram contains about 110 events we are currently losing about 1100 events per run. This corresponds to about 2.5 seconds of data for each 90 minute run.
The FSW group have a fix for bug 1156 and will upload a new build asap.
Note that currently there is no failure mode in CalRecon so events from this channel is not treated in any special way. NRL is working on this.
It should also be noted that the problem was caught immediately by two separate parts of the Data monitoring. The automatic alarms caught both the missing datagrams and the missing signal from the diode. These runs are marked as 'GOOD' by the DQM shifter, but with a comment attached to them. Obviously we will have to live with the missing diode signal from now on.
Some of you will not have failed to notice the irony that it's GCR events tickling FSW bug 1156 (hint: SSC-258)
anders
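
The loss estimate quoted in the email above can be reproduced with a few lines of arithmetic. This is a rough sketch that uses only the approximate figures from the email; the variable names are illustrative, and the implied event rate and lost fraction are derived from those figures, not measured values.

```python
# Approximate figures quoted in the email (all "about" values).
failed_events_per_run = 10      # events per run that fail decompression
events_per_datagram = 110       # events carried by one datagram
lost_seconds_per_run = 2.5      # data lost per run, as quoted
run_length_seconds = 90 * 60    # one 90-minute run

# Each failed event costs its whole datagram.
lost_events_per_run = failed_events_per_run * events_per_datagram

# Derived quantities (not stated in the email).
implied_event_rate_hz = lost_events_per_run / lost_seconds_per_run
lost_fraction = lost_seconds_per_run / run_length_seconds

print(lost_events_per_run)       # 1100
print(implied_event_rate_hz)     # 440.0
print(f"{lost_fraction:.3%}")    # 0.046%
```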

What has failed?

As soon as we understand exactly what has failed in this GCFE, I'll type something here.

Consequences of this failure

On 27 Jul 2010 (Day 208), at 2:15 PM EDT, J. Eric Grove wrote:

Additional clarification:

We do not see any signal in the high energy diode in tower 4, layer X1, column 4, + side. From Eric Siskind: "Whether
[...]
Note that currently there is no failure mode in CalRecon so events from this channel is not treated in any special way. NRL is working on this.

What this means is that (until xtal recon is fixed):

(1) Any photon that MISSES this one crystal is not affected. It is correctly and properly reconstructed.
(2) Any photon that deposits LESS THAN about 1 GeV in this one crystal is not affected. It is correctly and properly reconstructed.
(3) Any photon that deposits MORE THAN about 1 GeV in this one crystal has an incorrect energy and position measurement in this one crystal, and therefore has an incorrect reconstructed incident energy and direction. The level of error in reconstructed incident energy and direction is surely energy-dependent, and I don't have an estimate of the magnitude yet.

In the above sentences, "photon" means "any event that is not read out by Trigger Engine 4, i.e. any event that is not read out in 4-range, zero-suppressed mode". I used the word photon to focus the discussion.

The combination of cases (1) and (2) covers the overwhelming majority of photons in the LAT dataset, so most events are perfectly fine, but clearly we need to implement a fix for this particular failure in the code that reconstructs crystal energy and position.

Eric

Once CAL xtal recon has been modified with the changes outlined below, the performance of this xtal will be essentially nominal, and the performance of the LAT will be unaffected.

Changes necessary to recon

For more details and related discussion, see CAL July 2010 channel failure.

CalXtalRecAlg and related objects will be changed for Pass 8 (and perhaps earlier) to modify the xtal response in case of failures and for saturated readouts.

Crystal status word

As of July 2010, CAL recon code contained no mitigation against failures, despite our having discussed adding such code for the last 10+ years. This failure requires that we do something. The intent here is to fix the reconstruction of events that hit a xtal with compromised readout at the level of xtal recon, so that the failure and fix are transparent to downstream energy reconstruction and clustering. We are adding status bits to xtal recon to indicate that the energy and position information returned by such a xtal have been modified, along with some information about why the modification was necessary. Reconstruction code that follows xtal recon could make use of those bits to accept/reject that xtal or modify higher level recon algorithms, if the authors and architects of that code wish.

On 4 Aug 2010 (Day 216), at 4:59 PM EDT, J. Eric Grove wrote:

All,

We at NRL had a very productive discussion with Tracy and Leon last week about how to mitigate against the failure in a CAL channel on 25 July. This email fleshes out a proposal for a new calibration DB quantity, how it'll be used, and requests your comments. And it requests that we start the process of adding this new calib quantity to the DB.

Over the years, we've discussed having a status word to indicate known h/w failures, and now we have a good reason to implement it. I wish we'd done it before. Our CAL online s/w used an "exception list" XML file to document failures or out-of-family conditions, and this is similar – but it's not XML.

Each of the 1536 xtals will have a 32-bit status word indicating its hardware and configuration status, and that 1536-element array will be stored as a calibration database quantity. The initial value of the status word for each event comes from the calibration database. In addition, some of the bits in that status word are used by xtal recon to indicate whether the energy and position calculated by xtal recon for a given event are good, need to be corrected, or have been corrected for the appropriate failure mode indicated by the hardware and config status bits. The bit map for the status word is given at the end of the email. Your comments or proposed modifications to the word are welcome.

The status word array needs to have an epoch of validity, like all other calib quantities. Until this failure, the status word for each xtal end was 0x0. Now we have one xtal with two bits set, the bits that indicate bad plus-face HEX8 and HEX1, and that condition began on 25 July.
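
A minimal sketch of how a per-xtal status word with an epoch of validity might be looked up. The names (`StatusEpoch`, `status_word_for`), the tuple key, and the in-memory list are hypothetical and do not reflect the real calibration database interface. The bit positions assume the revised numbering of the bit map at the end of the email (bad plus-face HEX8 = bit 22, HEX1 = bit 23), and the run-start MET is used as a stand-in for the start of validity.

```python
from dataclasses import dataclass, field

BAD_PLUS_HEX8 = 1 << 22   # assumed: revised bit map numbering
BAD_PLUS_HEX1 = 1 << 23

FAILURE_MET = 301753824   # start of the run in which the failure appeared


@dataclass
class StatusEpoch:
    """Status words valid from start_met (inclusive) until the next epoch."""
    start_met: int
    # Only non-zero words are stored; every other xtal defaults to 0x0.
    exceptions: dict = field(default_factory=dict)


# Hypothetical key: (tower, layer name, column). The real indexing of the
# 1536-element array is not specified in the email.
FAILED_XTAL = (4, "X1", 4)

EPOCHS = [
    StatusEpoch(start_met=0),  # before the failure: all words are 0x0
    StatusEpoch(start_met=FAILURE_MET,
                exceptions={FAILED_XTAL: BAD_PLUS_HEX8 | BAD_PLUS_HEX1}),
]


def status_word_for(xtal, met):
    """Return the initial status word for one xtal at mission elapsed time met."""
    current = None
    for epoch in EPOCHS:          # EPOCHS is sorted by start_met
        if epoch.start_met <= met:
            current = epoch
    if current is None:
        raise ValueError("no calibration epoch covers this time")
    return current.exceptions.get(xtal, 0x0)


print(hex(status_word_for(FAILED_XTAL, FAILURE_MET - 1)))  # 0x0
print(hex(status_word_for(FAILED_XTAL, FAILURE_MET)))      # 0xc00000
print(hex(status_word_for((0, "X0", 0), FAILURE_MET)))     # 0x0
```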

...

I've divided the 32 bits into 16 for recon status (and used 10), 10 for hardware status (and used 8 so far), and 5+1 for config status (and used 4). Perhaps it'd make more sense to reserve more bits for recon status. Note that I've included two bits for Bill's mods to the longitudinal position calculation to indicate whether he's corrected for direct light in the near diode or corrected for an ambiguous asymmetry. Those can be set in the initial CalXtalRecAlg pass and used as desired later.

...

– bit map for CalXtalRecData status word –

(rev. 16 Feb 2010, Pass 8 discussion Grove, Bruel, Baldini, Usher)

bit	function
	status of recon calculation
0	bad energy
1	bad longitudinal position
2	energy has been provided by external means
3	energy has been calculated by failure mitigation algorithm
4	energy has been calculated for corrected longitudinal position
5	minus-face energy measurement is saturated
6	plus-face energy measurement is saturated
7	position has been provided by external means
8	longitudinal position has been corrected for direct light
9	longitudinal position has been corrected for ambiguous ratio
10-15	unused
	status of h/w
16	bad minus-face LEX8
17	bad minus-face LEX1
18	bad minus-face HEX8
19	bad minus-face HEX1
20	bad plus-face LEX8
21	bad plus-face LEX1
22	bad plus-face HEX8
23	bad plus-face HEX1
24-25	unused
	status of config
26	minus-face LE autoranging disabled
27	minus-face HE autoranging disabled
28	plus-face LE autoranging disabled
29	plus-face HE autoranging disabled
30-31	unused
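
The bit map above could be encoded roughly as follows. This is a sketch, not the CalXtalRecData implementation: the class name, member names, and mask constants are invented for illustration, and the bit positions assume the revised numbering (recon bits 0-15, hardware bits 16-25, config bits 26-31).

```python
from enum import IntFlag


class XtalStatus(IntFlag):
    # status of recon calculation (bits 0-15, 10 used)
    BAD_ENERGY = 1 << 0
    BAD_LONG_POS = 1 << 1
    ENERGY_EXTERNAL = 1 << 2
    ENERGY_MITIGATED = 1 << 3
    ENERGY_FOR_CORRECTED_POS = 1 << 4
    MINUS_SATURATED = 1 << 5
    PLUS_SATURATED = 1 << 6
    POS_EXTERNAL = 1 << 7
    POS_DIRECT_LIGHT = 1 << 8
    POS_AMBIGUOUS_RATIO = 1 << 9
    # status of h/w (bits 16-25, 8 used)
    BAD_MINUS_LEX8 = 1 << 16
    BAD_MINUS_LEX1 = 1 << 17
    BAD_MINUS_HEX8 = 1 << 18
    BAD_MINUS_HEX1 = 1 << 19
    BAD_PLUS_LEX8 = 1 << 20
    BAD_PLUS_LEX1 = 1 << 21
    BAD_PLUS_HEX8 = 1 << 22
    BAD_PLUS_HEX1 = 1 << 23
    # status of config (bits 26-31, 4 used)
    MINUS_LE_AUTORANGE_OFF = 1 << 26
    MINUS_HE_AUTORANGE_OFF = 1 << 27
    PLUS_LE_AUTORANGE_OFF = 1 << 28
    PLUS_HE_AUTORANGE_OFF = 1 << 29


# Field masks for the three groups; together they cover all 32 bits.
RECON_MASK = 0x0000FFFF    # bits 0-15
HW_MASK = 0x03FF0000       # bits 16-25
CONFIG_MASK = 0xFC000000   # bits 26-31


def describe(word):
    """List the names of all defined bits that are set in a status word."""
    return [flag.name for flag in XtalStatus if word & flag.value]


# Status word of the xtal that failed on 25 July 2010.
failed = int(XtalStatus.BAD_PLUS_HEX8 | XtalStatus.BAD_PLUS_HEX1)

print(hex(failed))                                      # 0xc00000
print(describe(failed))                                 # ['BAD_PLUS_HEX8', 'BAD_PLUS_HEX1']
print(hex(RECON_MASK | HW_MASK | CONFIG_MASK))          # 0xffffffff
```

Keeping the hardware and config bits in their own masked fields lets downstream code test for "any hardware problem" with a single `word & HW_MASK` without enumerating individual bits.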

Algorithms in CalXtalRecData

Algorithms for correcting current and not-unlikely future failures in FixXtalResp

On 5 Aug 2010 (Day 217), at 5:24 PM EDT, J. Eric Grove wrote:

Chul,
Here I've detailed the actions for FixXtalResp in the cases we currently understand. I've added a new CalXtalRecData status word definition [at the bottom of this email] adding Bill's corrected-energy bit and a bit to indicate that we've mitigated the failure.
Eric

Code Block

case == bad plus-face HEX8 && bad plus-face HEX1 && ! plus-face HE autoranging disabled
/* this is the case for the current failure and current configuration */
/* if one of the HEX ranges is best range, fix it */
/* if plus-face first range == (HEX8 || HEX1),
	calculate E using the opposite face and the externally provided longitudinal position
	clear the bad energy bit
	set the externally provided longitudinal position bit
	set the failure-mitigation energy bit
*/


case == bad plus-face HEX8 && bad plus-face HEX1 && plus-face HE autoranging disabled
/* this is the case for the current failure, but it requires a configuration we have not used */
/* if LEX1 is best range and saturated, fix it */
/* if plus-face first range == LEX1 && plus-face > 4050 (i.e. saturated),
	calculate E using the opposite face and the externally provided longitudinal position
	clear the bad energy bit
	set the externally provided longitudinal position bit
	set the failure-mitigation energy bit
*/


case == bad plus-face LEX8 && bad plus-face LEX1 && bad plus-face HEX8 && bad plus-face HEX1
/* this is the case we may get to soon if this GCFE continues to degrade and the LE ranges fail */
/* in all cases,
	calculate E using the opposite face and the externally provided longitudinal position
	clear the bad energy bit
	set the externally provided longitudinal position bit
	set the failure-mitigation energy bit
*/
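
The three cases above can be sketched as a small decision function. This is an illustration only, not the FixXtalResp code: the function names, the string range labels, and the `opposite_face_energy` argument (standing in for the energy computed from the opposite face and the externally provided longitudinal position) are assumptions, and the bit positions assume the revised status word numbering. The all-ranges-bad case is tested first because its hardware bits are a superset of the other two cases.

```python
BAD_ENERGY = 1 << 0
ENERGY_MITIGATED = 1 << 3        # energy calculated by failure mitigation
POS_EXTERNAL = 1 << 7            # position provided by external means
BAD_PLUS_LEX8 = 1 << 20
BAD_PLUS_LEX1 = 1 << 21
BAD_PLUS_HEX8 = 1 << 22
BAD_PLUS_HEX1 = 1 << 23
PLUS_HE_AUTORANGE_OFF = 1 << 29

SATURATION_ADC = 4050            # plus-face value above which LEX1 is saturated

BAD_PLUS_HE = BAD_PLUS_HEX8 | BAD_PLUS_HEX1
BAD_PLUS_ALL = BAD_PLUS_HE | BAD_PLUS_LEX8 | BAD_PLUS_LEX1


def needs_mitigation(status, plus_first_range, plus_adc):
    """Decide whether the plus-face measurement must be replaced."""
    if (status & BAD_PLUS_ALL) == BAD_PLUS_ALL:
        return True                              # all plus-face ranges bad
    if (status & BAD_PLUS_HE) == BAD_PLUS_HE:
        if status & PLUS_HE_AUTORANGE_OFF:
            # HE never selected: only a saturated LEX1 readout needs fixing
            return plus_first_range == "LEX1" and plus_adc > SATURATION_ADC
        # current failure and configuration: fix if an HE range is best range
        return plus_first_range in ("HEX8", "HEX1")
    return False


def fix_xtal_resp(status, plus_first_range, plus_adc, opposite_face_energy):
    """Return (new_status, energy); energy is None when nothing was changed."""
    if not needs_mitigation(status, plus_first_range, plus_adc):
        return status, None
    status &= ~BAD_ENERGY                        # clear the bad energy bit
    status |= POS_EXTERNAL | ENERGY_MITIGATED    # record what was done
    return status, opposite_face_energy


status, energy = fix_xtal_resp(BAD_PLUS_HE | BAD_ENERGY, "HEX8", 1200, 2.3)
print(hex(status), energy)                       # 0xc00088 2.3
print(fix_xtal_resp(BAD_PLUS_HE, "LEX1", 3000, 2.3)[1])   # None
```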

Fix for Failed Electronics on Cal Crystal

...

Algorithms for correcting for saturated xtals

This is the rough outline that we (Eric G., Philippe, Tracy and myself) agreed on during the F2F pass8 meeting in Pisa (Feb 2011).

Perform the start xtal rec step.
- We'll put in all the new bits to keep track of what's happening, according to the scheme outlined above. Chul and Tracy will follow up on this and get it done.
If there are one or more saturated xtals, the first thing we'll do in CalRecon is a fit using the transverse information only with the full xtal collection (i.e. before the clustering), to be used to fix the position for saturated xtals. Luca B. and Philippe will take care of this.
- This will require moving the current fitting routine (now living in CalRecon/src/Clustering/MomentsClusterInfo.cxx) to some common area (Tracy suggests CalRecon/src/Utilities/) so that it can be used in more than one place. Luca B. will take care of this.
- As a side project, the possibility of not using Minuit for the fit will be investigated, in order to gain speed and avoid a dependency on ROOT. This is a low-priority project that Luca B. will look into.
At this point (if there are saturated xtals), from CalRecon we will call a dedicated tool in CalXtalResponse to set the longitudinal position to our best value, based on the transverse fit (this will need some coordination between all of us).
- After this point the CalXtalRecData objects in the TDS will have reasonable longitudinal position information even for the saturated xtals, and the correct information will propagate transparently to the downstream code (with the proper status bit set).
The next step is the calorimeter clustering (i.e. the splitting of the xtal collection into clusters).
After that we'll go through the standard transverse fit/moments analysis/classification on a cluster-by-cluster basis.
- We decided to keep the CalMomentsData as the basic object the CAL moments analysis acts on. Luca B. will take care of propagating all the new information (i.e. the status bits) to the object.
- Further possible adjustments to the xtals that need a transverse fit at the cluster level will be done by calling the proper tool at this point.
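
The ordering of the steps in this outline can be sketched as a simple driver. Everything here is hypothetical scaffolding: the step names, the `run_cal_recon` function, and the representation of the xtal collection as a list of status words are invented for illustration and are not the CalRecon interfaces. Only the saturation bits (5 and 6 in the status word bit map) and the order of the steps come from the text above.

```python
MINUS_SATURATED = 1 << 5   # minus-face energy measurement is saturated
PLUS_SATURATED = 1 << 6    # plus-face energy measurement is saturated
SATURATED = MINUS_SATURATED | PLUS_SATURATED

STEP_NAMES = (
    "xtal_rec",
    "pre_cluster_transverse_fit",
    "set_longitudinal_position",
    "clustering",
    "per_cluster_analysis",
)


def run_cal_recon(xtal_status_words, steps):
    """Run the steps in the agreed order and return the names that ran.

    xtal_status_words: one status word per xtal in the event.
    steps: dict mapping each name in STEP_NAMES to a callable.
    """
    executed = []

    def run(name):
        steps[name](xtal_status_words)
        executed.append(name)

    run("xtal_rec")
    # The pre-clustering fit and position fix only run for saturated xtals.
    if any(word & SATURATED for word in xtal_status_words):
        run("pre_cluster_transverse_fit")
        run("set_longitudinal_position")
    run("clustering")
    run("per_cluster_analysis")
    return executed


# Stand-in steps that do nothing, just to show the control flow.
no_op_steps = {name: (lambda words: None) for name in STEP_NAMES}

print(run_cal_recon([0x0, PLUS_SATURATED], no_op_steps))
print(run_cal_recon([0x0, 0x0], no_op_steps))
```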
