2021, March 8-16


hardware:

  • cannot work over a data transmission chain with 12 dB signal loss
  • ITK IS demo 19-1 expects ~14-16 dB of loss; readout is expected to be done by linkAgg+ phase-2 FELIX DAQ
    • the software and part of the firmware will be the same for direct-FELIX and phase-2 FELIX

firmware: 

  • data reception does not work every time: the RD53A has to be power-cycled several times to establish stable data communication. 
    • the FW bit alignment is the suspect
    • waiting for a reply from ANL

software:

  • felixCore SW updated from felix-04-01-01-rc3 to felix-04-01-01
    • felixCore no longer needs to be restarted for every single chip scan. 
  • bug fixes and performance improvements for the FELIX version of the YARR SW
    • a new git branch was created: devel_rd53a_felixNetio_multichip_rebase_zixu_Working. 
    • a merge of this branch has been requested and is waiting for approval. 

Achievement: 

  • can get stable digital scan results
  • can run a digital scan with the SW trigger frequency raised from 1 kHz to 30 kHz: digital scan time reduced from 40 s to 20 s. 
  • got the Sync FE working


Appendix:

==========================================================================
Issue 1) need to re-start felixcore for every single YARR scan:
solution: 
a) update to the latest felixcore version felix-04-01-01-stand-alone (The rc version has bugs); 
b) need to unsubscribe Rx sockets, https://gitlab.cern.ch/YARR/YARR/-/merge_requests/311/diffs?commit_id=0159358c0039c0f2931745c85acd44f47ae40900
Now the YARR digital scan can be run hundreds of times in a row without restarting felixcore. An occasional felixcore restart is needed only after a crash. 

==========================================================================
Issue 2) the Sync FE is not working
solution: 
a) a GlobalPulse is needed for chip configuration; https://gitlab.cern.ch/YARR/YARR/-/merge_requests/311/diffs?commit_id=fc8d980d8ad16d2dd6e4699ba5ee05538c2622e3 
b) the netio Tx was using a buffered socket to send normal Cmds and a low_latency socket to send Triggers. I changed it to use only the low_latency socket. 
The problem with the buffered socket is that we sometimes need to keep a time gap between Cmds, and the buffering can absorb that gap. So, to keep things simple, I use only the low_latency socket for Tx. https://gitlab.cern.ch/YARR/YARR/-/merge_requests/311/diffs?commit_id=9665d45622264606d38a88dd7429b757a2c825fe 
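
The way buffering can absorb the spacing between Cmds can be illustrated with a toy model. This is only a sketch: ToyLowLatencyTx and ToyBufferedTx are hypothetical stand-ins, not the netio API, and the "flush after N queued commands" rule and the timestamps are assumptions chosen for illustration.

```python
class ToyLowLatencyTx:
    """Hypothetical sender that forwards each command as soon as it is sent."""

    def __init__(self):
        self.delivered = []  # list of (delivery_time_us, cmd)

    def send(self, now_us, cmd):
        self.delivered.append((now_us, cmd))


class ToyBufferedTx:
    """Hypothetical sender that holds commands until flush_count are queued."""

    def __init__(self, flush_count):
        self.flush_count = flush_count
        self.pending = []
        self.delivered = []  # list of (delivery_time_us, cmd)

    def send(self, now_us, cmd):
        self.pending.append(cmd)
        if len(self.pending) >= self.flush_count:
            # Everything queued so far leaves together, at the flush time.
            self.delivered.extend((now_us, c) for c in self.pending)
            self.pending.clear()


def delivery_gaps(tx):
    """Time differences between consecutive delivered commands."""
    times = [t for t, _ in tx.delivered]
    return [b - a for a, b in zip(times, times[1:])]


# Three commands sent 100 us apart (made-up numbers).
schedule = [(0, "cmd_a"), (100, "cmd_b"), (200, "cmd_c")]

low_latency = ToyLowLatencyTx()
buffered = ToyBufferedTx(flush_count=3)
for now_us, cmd in schedule:
    low_latency.send(now_us, cmd)
    buffered.send(now_us, cmd)

print(delivery_gaps(low_latency))  # [100, 100]
print(delivery_gaps(buffered))     # [0, 0]
```

In this model the low-latency path preserves the 100 us spacing, while the buffered path delivers all three commands back to back.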

Now the Sync FE works fine for the digital scan. (It may still have issues with analog or other types of scan, but that is a different issue, or an intrinsic "feature" of the Sync FE.)

==========================================================================

Issue 3) some pixels (in a special pattern) lose 1 hit
This was observed and reported by Egor a long time ago. The issue happened once in ~10 digital scans, on average. 

We had guessed that the missing hits came from a data packet lost somewhere in the chain FELIX FW -> FELIX SW -> YARR SW. Marco is pretty sure that no packets are lost between the FELIX FW and the FELIX SW, which left the YARR SW as the usual suspect. After testing with netio_cat, I could confirm that the YARR SW and netio_cat always receive the same packets. 

After a long investigation, I now believe the problem occurs when the Trigger Cmd is sent.

see here: 
https://gitlab.cern.ch/YARR/YARR/-/blob/devel_FelixNetIO_StarChip/src/libRd53a/Rd53aTriggerLoop.cpp#L47-67

Each trigger() call sends 16 words. 
word[15, 14] are hit injection, https://gitlab.cern.ch/YARR/YARR/-/blob/devel_FelixNetIO_StarChip/src/libRd53a/Rd53aTriggerLoop.cpp#L48
word[14-(delay/8)-i] is the Rd53a Trig Cmd, https://gitlab.cern.ch/YARR/YARR/-/blob/devel_FelixNetIO_StarChip/src/libRd53a/Rd53aTriggerLoop.cpp#L60
By default the delay value is 56, which puts the Trig Cmds in word[6, 7].
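
The index arithmetic above can be checked with a few lines. This is a minimal sketch in Python, not the YARR C++ code: trigger_word_indices is a hypothetical helper, and the assumption that i runs over two trigger words is inferred from the word[6, 7] result quoted above.

```python
def trigger_word_indices(delay, n_trig_words=2):
    """Indices of the trigger words, following word[14 - (delay/8) - i].

    n_trig_words=2 is an assumption inferred from the text, not read
    from the YARR source.
    """
    return sorted(14 - delay // 8 - i for i in range(n_trig_words))


print(trigger_word_indices(56))  # [6, 7]
```

Calling the helper with other delay values shows how the trigger words move relative to the hit injection words in word[15, 14].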

All 16 words must be sent to the RD53A in sequence and continuously, without any pause. Only then do we receive all the expected event headers and hits. 

But if, for some reason, the FELIX SW/FW splits the hit injection words and the trigger words into 2 packets, there will be a time gap between them. In that case we still receive the expected number of event headers, because all the Trig Cmds are still valid. But because the latency between the hit injection and the Trig Cmd is too large, we never see any hits from the RD53A. 
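
A toy timing model makes the effect of such a split concrete. This is a sketch under stated assumptions, not a description of the FELIX SW/FW: it assumes the words go out one per time slot starting from word[15], and the split position and gap length are made-up numbers. send_slots and injection_to_trigger are hypothetical helpers.

```python
N_WORDS = 16


def send_slots(split_before=None, gap_slots=0):
    """Map each word index to the time slot in which it is sent.

    Assumption: word[15] goes first, one word per slot. If split_before
    is given, the second packet starts with that word and is delayed by
    gap_slots extra slots.
    """
    slots = {}
    t = 0
    for idx in range(N_WORDS - 1, -1, -1):
        if idx == split_before:
            t += gap_slots  # the second packet starts late
        slots[idx] = t
        t += 1
    return slots


def injection_to_trigger(trig_idx, inj_idx=14, **split):
    """Slots elapsed between a hit injection word and a trigger word."""
    slots = send_slots(**split)
    return slots[trig_idx] - slots[inj_idx]


# Trigger word[7] with no split, then with a split before word[10].
print(injection_to_trigger(7))                                # 7
print(injection_to_trigger(7, split_before=10, gap_slots=5))  # 12
```

In this model every word is still delivered, so the number of triggers is unchanged; only the spacing between the hit injection and the trigger grows by the length of the gap.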

So I reduced the delay value from 56 to 24, which moves the Trig Cmds to word[10, 11], closer to the hit injection in word[14, 15] than before. Since then I have not seen the missing-hits issue in more than 500 digital scans (reminder: it used to happen once in ~10 digital scans, on average). So it looks like the word split no longer happens. 

So, this is a workaround: reduce the delay value, https://gitlab.cern.ch/YARR/YARR/-/merge_requests/311/diffs?commit_id=77eb79088799a680e85ac3caa25d489de3bded1c

But we need a real solution: check the FELIX FW/SW to confirm the word-split issue, and find a way to protect a blob of words from being split further into smaller objects. 

==========================================================================
Issue 4) the digital scan only ran with Trig Freq=1kHz
Now I can run with Trig Freq=30kHz, using the default setup for the digital scan. 

==========================================================================
Issue 5) with a short cable, GBT_ALIGNMENT_DONE is already up, but no data return from the RD53A
Issue 6) with a short cable, GBT_ALIGNMENT_DONE is already up, but the received data are very strange: the number of event headers is much lower than expected, and the hit Occupancy or Masking shows a strange pattern that differs from time to time. 

I think issues 5 and 6 have the same cause. A workaround is to keep power-cycling the RD53A: once a digital scan gives correct results, the following digital scans keep giving correct results. But after the next RD53A power cycle the issue may come back. 

