
List of Issues and Problems encountered

Crashes of Run Control and Pipeline processing of the LDF data files 

In the initial days we had several runs that would cause either Run Control or the Offline LDF to Digi conversion program to crash.  Runs of particular note are 700000609, 700000610, and 700000611.  The operating conditions were new to both the hardware and the software: non-trivial events with non-trivial timing at a fairly high rate.

Mon 7/31/2006 10:57 AM

Yes, the bottom line in all of this is that, in the last 2 days, we've found two new bugs in the TEM.  One only affects your old EM TEMs, while the other affects flight as well.  The good news is that the software work-around (your change to the LDF parser) for the one that affects flight is simple and the data lost are essentially meaningless (the details of the errors encountered by the GTCCs as a result of the original TEM bug).  On the other hand, I'm still worried about the "off-by-one" problem as well as the rate of the phasing errors introduced by the original TEM bug.

Mon 7/31/2006 9:59 AM

1)     I can't do much more on the "off-by-one" issue until we have another example or some data from the transition to this state to look at.

2)     The "iterated past the end" problem looks like a bug in the GTCC firmware that can't always handle the bitstream produced by some variant of the original TEM bug.  However, the problem can be avoided by detecting the overall "cable phasing error" in the error summary and subsequently NOT trying to parse the GTCC error contributions to determine the detailed cause of the GTCC errors (including TKR FIFO overflows) in those events.

3)     The seg fault error MAY be caused by the failure to tag all truncated TEM event contributions with the "truncated" status in the EBM word.  This is hard to confirm because the evidence is destroyed when the seg fault occurs.  For the moment, it certainly seems like the "grin and bear it approach" is going to be employed.  In conjunction with the increased dead time from the ancillary DAQ, this problem is likely to recede as far as the running to determine PSF and other physics-related issues in the BT is concerned, but remains an issue in proving that the overall LAT DAQ is functional in a high-rate beam environment.

4)     The overall rate of the original TEM bug remains a concern in the high-rate high-event-size environment.

5)     The rate of TKR FIFO overflows in that environment also remains a concern.  We'll have to wait for beam to see what effect increasing the "almost full" threshold actually has.
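The work-around described in item 2 can be sketched as follows.  This is a hypothetical illustration only, not the actual LDF parser code; the constant `CABLE_PHASING_ERROR` and the function names are invented for the example.

```python
# Hypothetical sketch of the work-around described above: if the event-level
# error summary flags a cable phasing error, skip the detailed (and possibly
# corrupt) per-GTCC error contributions entirely.  All names and bit
# positions are invented for illustration; they do not come from the real
# LDF parser.

CABLE_PHASING_ERROR = 0x1  # assumed bit in the error summary word

def parse_error_contributions(error_summary, gtcc_contributions):
    """Return parsed per-GTCC errors, or None when parsing is unsafe."""
    if error_summary & CABLE_PHASING_ERROR:
        # The TEM bug has corrupted the bitstream; the per-GTCC error
        # contributions cannot be trusted, so do not try to parse them.
        return None
    return [parse_one_gtcc(c) for c in gtcc_contributions]

def parse_one_gtcc(contribution):
    # Placeholder for the detailed per-GTCC error parsing.
    return {"raw": contribution}
```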
 

Sun 7/30/2006 11:40 PM

The short form of this is that, in the presence of the TEM bug, the GTCC can lose synch with the incoming bitstream from the GTRC daisy chain.

 

It appears that the GTCC firmware assumes a minimum time period between the last parity bit at the end of one GTRC's bitstream and the start bit at the beginning of the next GTRC's bitstream.  During that time period, it writes (into the GTCC error FIFO) the projected word count (if a TKR FIFO overflow has occurred) and then the layer error summary, if a flag (ErrorFlagLayer) is set that indicates that there was any error encountered in processing the bitstream from the current layer.  It looks to me as if that flag is cleared by the arrival of the start bit from the next GTRC in anticipation of processing the bitstream from that new GTRC.  If the start bit for the next GTRC arrives immediately after the parity bit at the end of the current GTRC's bitstream, the flag will be cleared too early, the layer error summary for the previous layer won't be written into the error FIFO, and the corresponding count of words written into that FIFO won't be incremented.
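The failure mode described above can be illustrated with a toy model.  Everything here is an invented simplification of the firmware behavior (the real logic lives in the GTCC ASIC); it only shows how a zero-length gap between one layer's parity bit and the next layer's start bit drops the error summary write.

```python
# Toy model of the inter-GTRC gap assumption described above.  Each layer's
# bitstream ends with a parity bit; the GTCC uses the idle gap before the
# next start bit to write the layer error summary into its error FIFO, but
# the arrival of the next start bit clears ErrorFlagLayer.  With a
# zero-length gap the summary write is lost.  This is an invented
# simplification, not the actual firmware logic.

def process_layers(layers):
    """layers: list of (had_error, gap_after) tuples, one per GTRC layer.
    Returns the number of error-summary words written to the error FIFO."""
    fifo_words = 0
    for had_error, gap_after in layers:
        error_flag_layer = had_error
        if gap_after > 0:
            # Normal operation: there is idle time after the parity bit, so
            # the pending error summary is written before the flag is
            # cleared by the next start bit.
            if error_flag_layer:
                fifo_words += 1
        else:
            # TEM bug case: the next GTRC's start bit immediately follows
            # the parity bit, clearing ErrorFlagLayer before the summary
            # can be written; the FIFO word is silently lost.
            pass
    return fifo_words
```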

 

During normal operation, the process of passing the token from one GTRC to the next one enforces a minimum space between the bitstreams from one GTRC and the next one.  However, in the presence of the TEM bug, if the GTCC actually does lose synch with the incoming bitstream, its understanding of the boundaries between data from one GTRC and the next is incorrect.  In this situation, the enforcement of the minimum space can be violated and the start bit from the next layer can immediately follow the parity bit at the end of the previous layer.

 

Note that this problem is distinct from the possible bug in EM TEM GTIU firmware (for contribution truncation) discovered yesterday, and unlike that one, it definitely does affect flight hardware.

 

At this point, there is obviously no way that we can either correct the TEM bug (in the GTIU FPGA firmware) or this new shortcoming in the firmware for the GTCC ASIC.  We'll just have to live with it, and make the software that parses error contributions more robust about missing words in the GTCC's error contribution.  This shouldn't be very difficult, as the word count for each GTCC error contribution still correctly describes the number of words actually present.  It's just that some of the words shown in figure 132 of the TEM document can be missing.  Assuming that we can keep the number of TKR FIFO overflow errors to a reasonable rate by appropriate configuration, the code that parses the GTCC error contributions is run relatively infrequently, and thus we can afford to make it somewhat inefficient by checking that the parsing pointer doesn't overflow the actual contribution on a word-by-word basis.  Note that since parameters are pushed into the error FIFO before their corresponding descriptors, one always parses the error contribution from any one GTCC from the tail towards the head.  All one has to do is ensure that one doesn't parse past the head.  Since the TEM bug is always signified by overall phasing errors, one can even just skip parsing the individual GTCC error contributions once the overall phasing error has been detected.
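The robust tail-to-head parse described above can be sketched roughly as follows.  The word layout is invented for illustration (figure 132 of the TEM document defines the real format); the point is only the per-word bounds check that stops the parse before it runs past the head.

```python
# Hedged sketch of tail-to-head parsing of a GTCC error contribution, as
# described above.  Parameters are pushed into the error FIFO before their
# corresponding descriptors, so parsing walks from the tail toward the head;
# because some words may be missing, the parser must never step past the
# head.  The descriptor encoding used here (low nibble = number of
# preceding parameter words) is invented purely for illustration.

def parse_gtcc_errors(words):
    """words: list of ints, head at index 0, tail at the end.
    Returns a list of (descriptor, parameters) tuples, tail first."""
    records = []
    i = len(words) - 1
    while i >= 0:  # never parse past the head
        descriptor = words[i]
        n_params = descriptor & 0xF  # invented encoding
        lo = i - n_params
        if lo < 0:
            # Some parameter words are missing (e.g. a dropped layer error
            # summary); stop rather than reading past the head.
            break
        records.append((descriptor, words[lo:i]))
        i = lo - 1
    return records
```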

Sun 7/30/2006 9:10 AM

Although it is not my place to decide, the BT coordinators must weigh both the ability to do the physics tests (PSF et al.) that are the raison d'être for the BT and the ability to demonstrate that the DAQ holds together at high rate in a real beam.  If we don't demonstrate that the DAQ works at high rate with the flight-like firmware in the CU, then our only demonstration is the use of a high rate of Monte Carlo events on the test bed.  Maybe that's enough - it's not for me to judge.  However, so far it appears that the runs taken using the CU in self-trigger mode (analogous to flight) put far more stress on the DAQ than anything that we've yet done on the test bed.  This is probably because the majority of the rate is concentrated in a single tower in the CU, while it is spread over all 16 towers in any Monte Carlo dataset that we've run to date on the test bed.  In particular, when compared with the test bed runs:

1) There is a higher than usual rate of TKR FIFO overflows when using the "nominal" value of the "almost full" threshold in the GTCC data FIFOs.

2) There is a higher than usual rate of incidents of the "TEM bug" causing phasing errors and the like.

3) The rate of truncation is much higher than I believe was modeled early on by Curt Brune.  In Mike's original design conception, a truncation was a very rare, but not completely avoidable, error much akin to a parity error.  Although the flow control model was changed to eliminate truncation on the grounds that it might introduce a physics bias, there certainly was no expectation before that change that the likelihood of truncation would be anything like that seen with the CU in self-trigger mode.
 
