This page is preliminary until confirmed by LAT Electronics

List of Issues and Problems encountered

Data anomalies

In the initial days we had several runs that caused either Run Control or the Offline LDF to Digi conversion program to crash.  Runs of particular note are 700000609, 700000610 and 700000611.  The conditions were new to the hardware and software: non-trivial events with non-trivial timing at a fairly high rate.  It should be noted that the CU is built with version 9 TEMs (EM) whilst the LAT TEMs (Flight) are version 11.  One of the differences between these versions is the flow control model used to transfer data from the TEM to the EBM.

The summary is that there are three issues:

  1. All events in run 700000611 have a TEM 3 contribution with an event number that is one behind all the other contributions in the event.  The TEM 3 contributions are presumably from the previous event and indicate a failure by the CU hardware to build the event properly.  Only one such run has been found so far.  The cause of this problem is not understood.  This problem is referred to as the "off-by-one" problem, below.
  2. Corrupted TEM error contributions in the presence of a "cable phasing error".  This problem causes the LDF parser to report "<N> bytes found after the end of the TEM <N> contribution" messages.  This is a manifestation of "the TEM bug" that exists in the flight TEMs as well.  Attempts to parse such error contributions can lead to segmentation faults.  Since the condition is recognizable, the LDF parser has been modified to not parse such contributions.  Reports about junk existing after the end of the contribution still appear and the cable phasing error is still reported, but the details are not analyzed.
  3. Failure to tag some truncated events as truncated.  Because of the mismatch between the EM TEM and EBM flow control models, truncated events can occur.  These are normally reported as a "packet error" with the "truncated" bit set.  Eric found an error in the firmware logic in which the truncated bit is not always set.  It's not clear that we've seen any instances of this issue.
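A check for the first issue in the list above (the "off-by-one" problem) might look like the following minimal sketch.  It is not taken from the LDF parser: the function name and the dictionary of contributor names to event numbers are hypothetical, and any wrap-around of the hardware event counter is ignored.

```python
from collections import Counter


def find_lagging_contributors(event_numbers):
    """Return contributors whose event number is one behind the majority.

    event_numbers: dict mapping a contributor name (hypothetical labels such
    as "TEM 3") to the event number carried by its contribution.
    Counter wrap-around is NOT handled; this is an illustration only.
    """
    # The event number shared by most contributions is taken as the reference.
    majority, _ = Counter(event_numbers.values()).most_common(1)[0]
    return sorted(
        name for name, number in event_numbers.items()
        if number == majority - 1
    )
```

For a run like 700000611, every event would be expected to return the TEM 3 contribution from such a check.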

The following are from correspondence between Ric and Eric Siskind: 

Mon 7/31/2006 10:57 AM - EJS

Yes, the bottom line in all of this is that, in the last 2 days, we've found two new bugs in the TEM.  One only affects your old EM TEMs, while the other affects flight as well.  The good news is that the software work-around (your change to the LDF parser) for the one that affects flight is simple and the data lost are essentially meaningless (the details of the errors encountered by the GTCCs as a result of the original TEM bug).  On the other hand, I'm still worried about the "off-by-one" problem as well as the rate of the phasing errors introduced by the original TEM bug.

Mon 7/31/2006 9:59 AM - EJS

1)     I can't do much more on the "off-by-one" issue until we have another example or some data from the transition to this state to look at.

2)     The "iterated past the end" problem looks like a bug in the GTCC firmware that can't always handle the bitstream produced by some variant of the original TEM bug.  However, the problem can be avoided by detecting the overall "cable phasing error" in the error summary and subsequently NOT trying to parse the GTCC error contributions to determine the detailed cause of the GTCC errors (including TKR FIFO overflows) in those events.

3)     The seg fault error MAY be caused by the failure to tag all truncated TEM event contributions with the "truncated" status in the EBM word.  This is hard to confirm because the evidence is destroyed when the seg fault occurs.  For the moment, it certainly seems like the "grin and bear it approach" is going to be employed.  In conjunction with the increased dead time from the ancillary DAQ, this problem is likely to recede as far as the running to determine PSF and other physics-related issues in the BT is concerned, but remains an issue in proving that the overall LAT DAQ is functional in a high-rate beam environment.

4)     The overall rate of the original TEM bug remains a concern in the high-rate high-event-size environment.

5)     The rate of TKR FIFO overflows in that environment also remains a concern.  We'll have to wait for beam to see what effect increasing the "almost full" threshold actually has.
 

Sun 7/30/2006 11:40 PM - EJS

The short form of this is that, in the presence of the TEM bug, the GTCC can lose synch with the incoming bitstream from the GTRC daisy chain.

It appears that the GTCC firmware assumes a minimum time period between the last parity bit at the end of one GTRC's bitstream and the start bit at the beginning of the next GTRC's bitstream.  During that time period, it writes (into the GTCC error FIFO) the projected word count (if a TKR FIFO overflow has occurred) and then the layer error summary, if a flag (ErrorFlagLayer) is set that indicates that there was any error encountered in processing the bitstream from the current layer.  It looks to me as if that flag is cleared by the arrival of the start bit from the next GTRC in anticipation of processing the bitstream from that new GTRC.  If the start bit for the next GTRC arrives immediately after the parity bit at the end of the current GTRC's bitstream, the flag will be cleared too early, the layer error summary for the previous layer won't be written into the error FIFO, and the corresponding count of words written into that FIFO won't be incremented.

During normal operation, the process of passing the token from one GTRC to the next one enforces a minimum space between the bitstreams from one GTRC and the next one.  However, in the presence of the TEM bug, if the GTCC actually does lose synch with the incoming bitstream, its understanding of the boundaries between data from one GTRC and the next is incorrect.  In this situation, the enforcement of the minimum space can be violated and the start bit from the next layer can immediately follow the parity bit at the end of the previous layer.

Note that this problem is distinct from the possible bug in EM TEM GTIU firmware (for contribution truncation) discovered yesterday, and unlike that one, it definitely does affect flight hardware.

At this point, there is obviously no way that we can either correct the TEM bug (in the GTIU FPGA firmware) or this new shortcoming in the firmware for the GTCC ASIC.  We'll just have to live with it, and make the software that parses error contributions more robust about missing words in the GTCC's error contribution.  This shouldn't be very difficult, as the word count for each GTCC error contribution still correctly describes the number of words actually present.  It's just that some of the words shown in figure 132 of the TEM document can be missing.  Assuming that we can keep the number of TKR FIFO overflow errors to a reasonable rate by appropriate configuration, the code that parses the GTCC error contributions is run relatively infrequently, and thus we can afford to make it somewhat inefficient by checking that the parsing pointer doesn't overflow the actual contribution on a word-by-word basis.  Note that since parameters are pushed into the error FIFO before their corresponding descriptors, one always parses the error contribution from any one GTCC from the tail towards the head.  All one has to do is ensure that one doesn't parse past the head.  Since the TEM bug is always signified by overall phasing errors, one can even just skip parsing the individual GTCC error contributions once the overall phasing error has been detected.
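The tail-to-head parsing with a word-by-word bounds check described above can be sketched as follows.  This is only an illustration under stated assumptions: the record layout used here (a descriptor word whose low 4 bits give the number of parameter words that precede it) is a toy format invented for the example, not the layout of figure 132 of the TEM document, and the function name and status strings are hypothetical.

```python
def parse_tail_to_head(words, overall_phasing_error=False):
    """Parse a toy GTCC-style error contribution from tail to head.

    Toy layout (an assumption, NOT the real format): each record consists of
    `count` parameter words followed by one descriptor word, where
    count = descriptor & 0xF.

    Returns (records, status); records are listed in tail-to-head order and
    status is "ok", "truncated" or "skipped".
    """
    if overall_phasing_error:
        # The TEM bug is signified by the overall phasing error, so the
        # detailed contents are not trusted and are not parsed at all.
        return [], "skipped"

    records = []
    pos = len(words)  # index one past the last word not yet parsed
    while pos > 0:
        descriptor = words[pos - 1]
        count = descriptor & 0xF
        start = pos - 1 - count
        if start < 0:
            # Some parameter words are missing: stop here instead of
            # reading past the head of the contribution.
            return records, "truncated"
        records.append((descriptor, list(words[start:pos - 1])))
        pos = start
    return records, "ok"
```

The word count of the contribution is taken from `len(words)`, which matches the observation that the count still correctly describes the number of words actually present.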

Sun 7/30/2006 9:10 AM - EJS

Although it is not at all a matter for me to say anything about, the BT coordinators must weigh both the ability to do the physics tests (PSF et al.) that are the raison d'etre for the BT, as well as the ability to demonstrate that the DAQ holds together at high rate in a real beam.  If we don't demonstrate that the DAQ works at high rate with the flight-like firmware in the CU, then our only demonstration is the use of a high rate of Monte Carlo events on the test bed.  Maybe that's enough - it's not for me to judge.  However, so far it appears that the runs taken using the CU in self-trigger mode (analogous to flight) seem to be putting far more stress on the DAQ than anything that we've yet done on the test bed.  This is probably the result of the fact that the majority of the rate is concentrated in a single tower in the CU, while it is spread over all 16 towers in any Monte Carlo dataset that we've run to date on the test bed.  In particular, when compared with the test bed runs:

1) There is a higher than usual rate of TKR FIFO overflows when using the "nominal" value of the "almost full" threshold in the GTCC data FIFOs.

2) There is a higher than usual rate of incidents of the "TEM bug" causing phasing errors and the like.

3) The rate of truncation is much higher than I believe was modeled early on by Curt Brune.  In Mike's original design conception, a truncation was a very rare, but not completely avoidable, error much akin to a parity error.  Although the flow control model was changed to eliminate truncation on the grounds that it might introduce a physics bias, there certainly was no expectation before that change that the likelihood of truncation would be anything like that seen with the CU in self-trigger mode.

The "TEM bug" (EJS)

A bit more information about the TEM bug.  It actually comes in 3 flavors:

1) You expect an event with tag "N" but you actually receive the subsequent event, with tag "N+1" from all GTRCs on a given TKR cable.  In this case, the cable only contributes to the overall phasing error, which is reported when the tag from GTRC 0 on any cable doesn't match the tag from the event summary word.

2) You expect an event with tag "N" but you actually receive a duplicate of the previous event, with tag "N-1" from GTRC 0 on a given TKR cable.  However, by the time layer 0 finishes sending its data to the GTCC, the data for the subsequent event which populates the "wrong" GTRC buffer, with tag "N+1" have arrived in the remaining GTRCs.  Therefore, layers 1-8 send the event with tag "N+1."  This results in an overall phasing error (because layer 0 has an incorrect tag) plus GTRC phasing errors for layers 1-8 on a given cable (because the tags for layers 1-8 do not match the tag for layer 0).

3) The worst case occurs when the GTRC for layer 0 fails to send a self-consistent bitstream.  In particular, a GTRC can send a header with tag "N-1" and include in that header an asserted bit indicating that strip data are to follow, but then fail to send the subsequent word count, strip data, and TOT.  In that case, the GTCC interprets the subsequent blank bitstream as a zero word count (which it interprets as 128 words, consisting of 127 strip addresses plus a TOT) with a parity error, and then interprets subsequent bits as a mix of strip data (frequently with parity errors) and headers for subsequent layers with parity errors. You can also accidentally pick up other errors if you accidentally interpret other bits in the subsequent bitstream as layer headers for subsequent GTRCs.  In effect, the GTCC loses synch with the incoming bitstream from the GTRCs once the GTRC for layer 0 sends a bitstream that is not self-consistent.
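The phasing checks that distinguish flavors 1 and 2 above can be sketched as follows.  This is a minimal illustration, assuming tags are plain integers (the real tag width and any wrap-around are not modelled); the function name is hypothetical.  Flavor 3 cannot be recognized from tags alone, since the GTCC has lost synch with the bitstream.

```python
def phasing_errors(summary_tag, layer_tags):
    """Evaluate the phasing checks for one TKR cable.

    summary_tag: tag from the event summary word.
    layer_tags:  tags received from the GTRCs on the cable, layer 0 first.

    Returns (overall_error, gtrc_error_layers):
      overall_error      True if the layer 0 tag differs from the summary tag
      gtrc_error_layers  layers whose tag differs from the layer 0 tag
    """
    overall_error = layer_tags[0] != summary_tag
    gtrc_error_layers = [
        layer for layer, tag in enumerate(layer_tags)
        if layer > 0 and tag != layer_tags[0]
    ]
    return overall_error, gtrc_error_layers
```

With an expected tag N, flavor 1 (all nine layers send N+1) yields only the overall error, while flavor 2 (layer 0 sends N-1, layers 1-8 send N+1) yields the overall error plus GTRC phasing errors for layers 1-8.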

If you recall, the TEM has to send 3 different commands, on 3 different wires, to the TKR front ends in order to read out an event.  First, there is the TACK, which is sent on its own dedicated wire.  This instructs the GTFEs to latch the hit strip bit pattern in one of its 4 buffers.  Next, there is the "read event" command to the GTRCs.  This is an ordinary data-less command that gets sent on the usual command line rather than on a dedicated wire.  This instructs the GTRCs to scan one of the 4 GTFE buffers and store the resulting hit strip addresses in one of the 2 GTRC buffers.  Finally, there is the readout token.  This is a daisy-chained dedicated wire that instructs a GTRC to send the contents of one of its 2 buffers to the GTCC.

At any time, the TEM may have up to 4 outstanding TACKs, up to 2 outstanding "read event" commands, and up to 1 outstanding token.

If the TKR FIFOs are "almost full" then the TEM defers sending even its one token until the "almost full" condition is removed.
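The intended limits on outstanding requests, including the deferral of the token under backpressure, can be written down as a small model.  This is a sketch of the rules as stated above, not of the TEM firmware; the names `LIMITS` and `may_send` are hypothetical.

```python
# Maximum number of outstanding requests of each kind, as stated above.
LIMITS = {"tack": 4, "read_event": 2, "token": 1}


def may_send(kind, outstanding, tkr_fifo_almost_full=False):
    """Return True if the TEM may issue another request of the given kind.

    kind:        "tack", "read_event" or "token"
    outstanding: dict mapping kind -> number currently outstanding
    """
    if kind == "token" and tkr_fifo_almost_full:
        # The token is deferred until the "almost full" condition is removed.
        return False
    return outstanding.get(kind, 0) < LIMITS[kind]
```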

The TEM bug bites when the TEM has deferred sending the token for the next event (with tag "N") because of backpressure from the TKR FIFOs in one or more GTCCs.  Now, along comes a trigger which causes the TEM to send the "read event" command for its second outstanding GTRC buffer, with tag "N+1."  The bug exists because the "read event" is "on the wire" for thirty-something clock ticks, because it is a full-fledged data-less command on the usual command wire.  The token, on the other hand, is only a 3-bit bitstream on a dedicated wire.

If the backpressure from the TKR FIFOs is removed while the "read event" command is on-the-wire, the bug causes the TEM to send the token for event "N+1" instead of event "N," and this token may arrive at some point before the "read event" command is decoded.  The token only specifies the GTRC buffer to read, which is a one-bit value.  Thus the token for "N-1" is identical to that for "N+1."  If the token arrives before GTRC 0 has decoded the new "read event" command, then that buffer is full and still contains the data from event "N-1," but if the command has already been processed then the buffer is empty and is starting to fill with data from event "N+1."

The worst case occurs when the buffer is still full with data from event "N-1" and there was a non-zero count of hit strips in that layer for that event.  In that case the GTRC sends a layer header for tag "N-1" along with an asserted "data present" flag.  However, by the time the strobed copy of that header has been serialized and sent to the GTCC, if the "read event" command has now been processed, then the "data present" flag will have been cleared and the serialization logic thinks that no word count, list of hit strip addresses, and TOT should follow the layer header.  It simply passes the token to the GTRC for layer 1.  That GTRC in turn waits until the "read event" command for event "N+1" has been completely processed before sending its own layer header.  This can be a variable time delay, depending on the contents of the event data stored in the GTFEs for that event.  The bottom line is that the logic gets seriously confused if GTRC 0 fails to send a self-consistent bitstream, and that can happen if the TEM sends an incorrect token because of its internal logic bug.
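The aliasing at the heart of the bug can be illustrated with a few lines of code.  This is a toy model, assuming that consecutive event tags alternate between the 2 GTRC buffers so that the one-bit buffer selection is the low bit of the tag; both function names are hypothetical.

```python
def buffer_for(tag):
    """One-bit GTRC buffer selection (assumed: buffers alternate per event)."""
    return tag & 1


def layer0_tag_sent(n, read_event_decoded):
    """Tag sent by GTRC 0 when the TEM wrongly issues the token for N+1.

    Events N-1 and N+1 share a buffer, so the outcome depends only on
    whether the "read event" command for N+1 has already been decoded:
      not yet decoded -> buffer still holds event N-1
      already decoded -> buffer is filling with event N+1
    """
    return n + 1 if read_event_decoded else n - 1
```

In either case the tag differs from the expected tag N, which is why the bug always shows up as an overall phasing error.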
 
