Carrier Tests

Bring up environment variables

The prototype is located in a chamber with external cooling.


Carrier 1

Carrier 1 has a single functioning ASIC. Once the setup is brought up, the current consumption reads 954 mA. When that ASIC is enabled, the current consumption rises to 1.435 A.



The ASIC temperature was measured during operation with a 100 Hz trigger and was 27 C under these conditions.


Carrier 2

Carrier 2 has all 4 ASICs functional. Once the setup is brought up with this carrier, the current consumption also reads 954 mA. Once the carrier is powered up, the current reaches 2.164 A, and once the ASICs are configured, it reaches 2.68 A. The temperature seems to reach 54 C.



Carrier 2 laser tests

The laser images were generated for all 4 ASICs as follows


First laser light ePixHrM platform

Carrier 2 - ASIC0 (U1) - all lanes locked


Carrier 2 - ASIC3 (U4) - one lane not locked


Carrier 2 - ASIC1 (U2) - 2 lanes unlocked. For some reason, the laser light is very faint. Temperature reached 64 degrees C.


Carrier 2 - ASIC2 (U3) - 7 lanes unlocked


Digital board 001

The status of the component tests is shown in the following table.

| Module | Description | Simulation | Test in hardware |
|---|---|---|---|
| RegisterControlDualClock | Tested AXI lite reads and writes, and waveform generation | Tested | Tested |
| TrigControlAxi | Software and hardware trigger | Tested | Tested at 250 MHz |
| AXiStreamRepeater | | | |
| DigitalAsicStreamAxiV2 | Generated data and sent to software | Tested | |
| AxiStreamBatcherEventBuilder | Generated data and sent to software | Tested | |
| AxiLiteSaciMaster | Read all values from ASIC | | Tested |
| AppClk | All logic using clocks seems to work | OK | OK |
| AppDeser | | | |
| PwrCtrl | Enable and disable power | Tested | Partially tested |
| DAC - Max5443 | Changed output in software and probed | | Tested 0x0/0xffff → 0/3 V |
| DAC - DacWaveformGenAxi | Apply and measure on board | | Tested 0x0/0xfffff → 0/2.5 V |
| Slow ADCs | | | Work in progress. The P&CB ADC works, but the digital board ADC is not responding. |
| Fast ADCs | | | |
| Oscope | | | |
| TimingRx | | | |
| Chip scope pro | Trigger on read of AXI lite | | JTAG and virtual cable tested |
| SI5345 Jitter cleaner | Wrote new csv files and testpoint | | Tested |
| PROM | Wrote to PROM | | Tested |
| JTAG | Write bitstreams | - | Tested |
| Ref Clock | AXI clock working | | Tested |
| ASIC lanes | | | ASIC U3 (2) is outputting data on lane 1. Debugging. |
| Serial number | | | Carrier serial number still does not work |


DAC measurements at C559 on digital board

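| Applied digital value | Measured value (V) |
|---|---|
| 0x1ffff | 0.32 |
| 0x2ffff | 0.477 |
| 0x3ffff | 0.633 |
| 0x4ffff | 0.789 |
| 0x5ffff | 0.946 |
| 0x6ffff | 1.102 |
| 0x7ffff | 1.259 |
| 0x8ffff | 1.415 |
| 0x9ffff | 1.572 |
| 0xaffff | 1.728 |
| 0xbffff | 1.884 |
| 0xcffff | 2.041 |
| 0xdffff | 2.197 |
| 0xeffff | 2.352 |
| 0xfffff | 2.497 |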
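
The DAC measurements can be checked for linearity with a few lines of Python. This is only a sketch: the values are copied from the measurements listed here, and the helper names (`steps`, `volts_per_lsb`) are made up for illustration and are not part of the project software.

```python
# (applied code, measured volts at C559), copied from the DAC measurements.
measurements = [
    (0x1FFFF, 0.320), (0x2FFFF, 0.477), (0x3FFFF, 0.633), (0x4FFFF, 0.789),
    (0x5FFFF, 0.946), (0x6FFFF, 1.102), (0x7FFFF, 1.259), (0x8FFFF, 1.415),
    (0x9FFFF, 1.572), (0xAFFFF, 1.728), (0xBFFFF, 1.884), (0xCFFFF, 2.041),
    (0xDFFFF, 2.197), (0xEFFFF, 2.352), (0xFFFFF, 2.497),
]


def steps(points):
    """Voltage change between consecutive measurements (one per 0x10000 codes)."""
    return [round(v1 - v0, 3) for (_, v0), (_, v1) in zip(points, points[1:])]


def volts_per_lsb(points):
    """Average slope in volts per code between the first and last measurement."""
    (c0, v0), (c1, v1) = points[0], points[-1]
    return (v1 - v0) / (c1 - c0)


deltas = steps(measurements)
print("step per 0x10000 codes (V):", deltas)
print("average slope: %.4f uV/LSB" % (volts_per_lsb(measurements) * 1e6))
print("smallest step: %.3f V at step %d" % (min(deltas), deltas.index(min(deltas)) + 1))
```

All steps are 0.155 V to 0.157 V except the last one (0xeffff to 0xfffff), which is 0.145 V. One possible reading is that the output is approaching its 2.5 V ceiling, but that has not been verified.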


SN testing

Probed R7 on the carrier board and verified that the voltage on the DS2411 is as expected (2.5 V). I was also able to observe the input signal coming from the FPGA. It seems that the FPGA is either not interpreting the reply correctly or not reading it. The ILA shows that the signal inside the FPGA is dead.



After switching the pin assignments of serialNumber[2] and serialNumber[0], the digital board serial number stopped working as well, which means that the problem is somewhere outside of the FPGA.


Slow ADC testing

The digital board ADC is not responding. The first time the FPGA tries to interact with the ADC, I can see some signals on the digital output and on the CLK input of the ADC. After that, the FPGA waits indefinitely for a reply from the ADC. Both ADC supply voltages were checked and verified to be as expected.

I will ask Lupe to solder more wires for testing, and I will probe the ADC to make sure the other pins are as expected.


Digital board C00-02

Increased the digital and analog voltages to 2.65 V instead of 2.5 V. Characterizing the lanes again. For the time being, the serial number does not work on this board either. Something seems to be wrong on the path from the FPGA to the carrier board.

| ASIC | Functioning lanes (automatic calibration) |
|---|---|
| 0 | All (locked 0xffffff) |
| 1 | 2 lanes unlocked (0x20001) |
| 2 | 5 lanes unlocked (0xd90000) |
| 3 | 2 lanes unlocked (0x010100) |
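
The hexadecimal masks can be decoded into lane numbers with a short helper. This is a sketch that assumes bit n of the mask corresponds to lane n (least significant bit is lane 0); that assumption is consistent with 0xc90000 being reported elsewhere in these notes as lanes 23, 22, 19 and 16. The function name `lanes_from_mask` is hypothetical.

```python
def lanes_from_mask(mask, n_lanes=24):
    """Return the lane indices whose bit is set in a lane mask (bit n = lane n, assumed)."""
    return [lane for lane in range(n_lanes) if (mask >> lane) & 1]


# Unlocked-lane masks from the automatic calibration results (ASIC -> mask).
unlocked = {1: 0x20001, 2: 0xD90000, 3: 0x010100}

for asic, mask in unlocked.items():
    lanes = lanes_from_mask(mask)
    print(f"ASIC{asic}: {len(lanes)} unlocked lanes -> {lanes}")
```

Note that the ASIC0 entry (0xffffff) is a mask of locked lanes, so it decodes to all 24 lanes rather than to a list of failures.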


Slow ADC testing


Work in progress. The P&CB ADC works, but the digital board ADC is not responding.

Carrier 3 testing w/ digital board 002

Carrier 3 has all 4 ASICs functional. Once the setup is brought up with this carrier, the current consumption reads 965 mA. Once the carrier is powered up, the current reaches 2.29 A, and once the ASICs are configured, it reaches 2.808 A. The temperature seems to reach 35.7 C. When the system is triggered, the total current consumption goes above 3 A (3.030 A).



Carrier 3 laser tests automatic lock

ASIC0 (U1) two lanes disabled (0x1002)

ASIC1 (U2) two lanes disabled (0x100001)

ASIC2 (U3) 4 lanes disabled (0xc90000)

ASIC3 (U4) All lanes active


ASIC3 DAC test 


ASIC3 Charge injection columns 50 to 100

Carrier 3 lane delay eye plots

ASIC0 (all lanes recovered) 

After testing with EQ_LEVEL0 discovered: 
  • Lane 1: the lowest error rate is ~1/second. No delay value achieves 0 errors while no trigger is provided. The eye diagram also seems to change when power cycling the ASIC. Setting the delay to 0 seems to get it to lock permanently after a while, although at high speeds (5000 FPS) lane 1 still seems to cause some timeouts. It will be disabled.
  • Although lane 8 is locked and never counts an error, it occasionally times out in DigAsicStrmRegisters0. This lane is disabled in the yml. At times we tested with this lane enabled.
  • The frames of the first second are always lost
    • 2000 frames @ 1000 FPS : 1001 frames arrive
    • 1000 frames @ 1000 FPS : 1 frame
    • 5000 frames @ 5000 FPS : 0 frames
    • 8000 frames @ 2000 FPS : 6000 frames
    • 8000 frames @ 1000 FPS : 6991 frames
    • 15000 frames @ 5000 FPS : 10483 frames (5784 at writer. Buffer size not enough. After talking to Ryan and Ben, seems like corruption is happening.)
  • Occasionally lanes timed out. In a 15000-frame acquisition:
    •       DigAsicStrmRegisters0:
              enable: True
              FrameCount: 10483
              FrameSize: 3071
              FrameMaxSize: 3071
              FrameMinSize: 3071
              asicDataReq: 3071
              DisableLane: 0x100
              EnumerateDisLane: 0xffffff
              TimeoutCntLane[0]: 3
              TimeoutCntLane[1]: 98
              TimeoutCntLane[2]: 3
              TimeoutCntLane[3]: 3
              TimeoutCntLane[4]: 0
              TimeoutCntLane[5]: 2
              TimeoutCntLane[6]: 2
              TimeoutCntLane[7]: 2
              TimeoutCntLane[8]: 72
              TimeoutCntLane[9]: 3
              TimeoutCntLane[10]: 3
              TimeoutCntLane[11]: 4
              TimeoutCntLane[12]: 2
              TimeoutCntLane[13]: 3
              TimeoutCntLane[14]: 3
              TimeoutCntLane[15]: 2
              TimeoutCntLane[16]: 25
              TimeoutCntLane[17]: 3
              TimeoutCntLane[18]: 2
              TimeoutCntLane[19]: 3
              TimeoutCntLane[20]: 2
              TimeoutCntLane[21]: 2
              TimeoutCntLane[22]: 3
              TimeoutCntLane[23]: 3
Testing with EQ_LEVEL1:
  • Lane 1 still times out and has errors detected.
  • Lane 8 seems to have improved slightly. No timeouts were observed during the tests. Increasing the delay to 400 seems to have resolved its problems.
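
To pick out the problematic lanes from a register dump like the DigAsicStrmRegisters0 one, the timeout counters can be compared against their median. The sketch below uses the TimeoutCntLane values copied from that dump; the threshold (`factor=5`) and the function name `outlier_lanes` are arbitrary choices for illustration, not something defined in the firmware or software.

```python
from statistics import median

# TimeoutCntLane[0..23] copied from the DigAsicStrmRegisters0 dump (EQ_LEVEL0 run).
timeout_counts = [3, 98, 3, 3, 0, 2, 2, 2, 72, 3, 3, 4,
                  2, 3, 3, 2, 25, 3, 2, 3, 2, 2, 3, 3]


def outlier_lanes(counts, factor=5):
    """Lanes whose timeout count exceeds `factor` times the median count."""
    baseline = median(counts)
    return [lane for lane, n in enumerate(counts) if n > factor * baseline]


print("median timeouts per lane:", median(timeout_counts))
print("lanes well above the median:", outlier_lanes(timeout_counts))
```

With these numbers the lanes that stand out are 1 and 8, which matches the notes, plus lane 16, which is not discussed and may be worth a closer look.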

ASIC1 (1 lane not recovered - lane 0)

After testing with EQ_LEVEL0 discovered: 
  • Lane 0 dead
  • The frames of the first second are always lost
    • 5000 frames @ 1000 FPS : 4000 frames arrive
EQ_LEVEL1:
  • Lane 0 still dead

ASIC2 (4 lanes not recovered - lanes 23,22, 19, 16)

After testing with EQ_LEVEL0 discovered: 
  • Lanes 23, 22, 19 and 16 are dead. No delay value seems to bring them back to life.

  • 5000 frames @ 1000 FPS : 3997 frames arrive
  •       DigAsicStrmRegisters2:
            enable: True
            FrameCount: 8430
            FrameSize: 3071
            FrameMaxSize: 3071
            FrameMinSize: 3071
            asicDataReq: 3071
            DisableLane: 0xc90000
            EnumerateDisLane: 0xffffff
            TimeoutCntLane[0]: 3
            TimeoutCntLane[1]: 3
            TimeoutCntLane[2]: 3
            TimeoutCntLane[3]: 3
            TimeoutCntLane[4]: 3
            TimeoutCntLane[5]: 3
            TimeoutCntLane[6]: 3
            TimeoutCntLane[7]: 3
            TimeoutCntLane[8]: 3
            TimeoutCntLane[9]: 3
            TimeoutCntLane[10]: 3
            TimeoutCntLane[11]: 3
            TimeoutCntLane[12]: 3
            TimeoutCntLane[13]: 3
            TimeoutCntLane[14]: 3
            TimeoutCntLane[15]: 3
            TimeoutCntLane[16]: 0
            TimeoutCntLane[17]: 3
            TimeoutCntLane[18]: 3
            TimeoutCntLane[19]: 0
            TimeoutCntLane[20]: 3
            TimeoutCntLane[21]: 3
            TimeoutCntLane[22]: 0
            TimeoutCntLane[23]: 0

ASIC3 (all lanes ok)


After testing with EQ_LEVEL0 discovered: 
  • Manual delay used, and all lanes seem ok
  • 4000 frames @ 1000 FPS : 3000 frames arrive
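
The "first second is lost" observation can be compared against the frame counts recorded for all four ASICs. The sketch below assumes a simple model in which exactly one second worth of triggers is dropped; the function name `expected_frames` is hypothetical and the run list is copied from the notes.

```python
def expected_frames(requested, fps, lost_seconds=1.0):
    """Frames expected to arrive if the first `lost_seconds` of triggers are dropped."""
    return max(0, requested - int(fps * lost_seconds))


# (ASIC, requested frames, FPS, frames that arrived), copied from the notes.
runs = [
    (0, 2000, 1000, 1001),
    (0, 1000, 1000, 1),
    (0, 5000, 5000, 0),
    (0, 8000, 2000, 6000),
    (0, 8000, 1000, 6991),
    (0, 15000, 5000, 10483),
    (1, 5000, 1000, 4000),
    (2, 5000, 1000, 3997),
    (3, 4000, 1000, 3000),
]

for asic, requested, fps, arrived in runs:
    model = expected_frames(requested, fps)
    print(f"ASIC{asic}: {requested} @ {fps} FPS -> "
          f"arrived {arrived}, model {model}, difference {arrived - model}")
```

Most runs agree with the model to within a few frames. The 15000 frames @ 5000 FPS run does not, which fits the note that corruption was happening in that acquisition.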


Changed the EQUALIZATION property on all ASIC data lanes to EQ_LEVEL1 in an attempt to improve lane locking, but no difference was observed.

set_property -dict {IOSTANDARD LVDS DIFF_TERM_ADV TERM_100 DQS_BIAS TRUE EQUALIZATION EQ_LEVEL1} [get_ports {asicDataP[*][*]}]

set_property -dict {IOSTANDARD LVDS DIFF_TERM_ADV TERM_100 DQS_BIAS TRUE EQUALIZATION EQ_LEVEL1} [get_ports {asicDataN[*][*]}]


Laser images with the new descrambling

U1 (ASIC0) All lanes active


U2 (ASIC1) 1 lane not locked

U3 (ASIC2) 4 lanes not locked

U4 (ASIC3) all lanes locked

After fixing the descrambling by inverting the read order inside the ADC (top to bottom, right to left)

U1 (ASIC0)

U2 (ASIC1)

U3 (ASIC2)

U4 (ASIC3)

Rogue bugs

Issue 1

Occasionally interface errors occur, and the reason is not clear. When this happens, nothing can be written or read.

1689356297.917865:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356298.918923:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356299.919966:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356300.921032:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356301.922090:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356302.923134:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356303.924196:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356304.925257:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356305.926327:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356306.927388:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356307.928423:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356308.929486:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356309.929565:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.
1689356310.930659:pyrogue.axi.AxiStreamDma: AxiStreamDma::acceptReq: Timeout waiting for outbound buffer after 1.0 seconds! May be caused by outbound back pressure.


Ben believes that the blowoff is creating corrupted frames, and that the software unbatcher then segfaults while trying to parse them. He needs a dump to debug.


Issue 2

This issue happens when I send a lot of frames and the software cannot handle them. I configured the hardware to transmit 15000 frames at 5000 frames/second, and it seems that the software cannot keep up.


1690836274.390990:pyrogue.batcher.CoreV1: Not enough space (131648) for frame (147504)
1690836274.390995:pyrogue.batcher.CoreV1: Not enough space (146144) for frame (147504)
1690836274.390998:pyrogue.batcher.CoreV1: Not enough space (146144) for frame (147504)
1690836274.391000:pyrogue.batcher.CoreV1: Not enough space (146144) for frame (147504)
1690836275.476448:pyrogue.batcher.CoreV1: Not enough space (23264) for frame (1240877856)
1690836275.476462:pyrogue.batcher.CoreV1: Not enough space (23264) for frame (1240877856)
1690836275.476466:pyrogue.batcher.CoreV1: Not enough space (23264) for frame (1240877856)
1690836275.476469:pyrogue.batcher.CoreV1: Not enough space (134240) for frame (147504)
1690836275.476473:pyrogue.batcher.CoreV1: Not enough space (134240) for frame (147504)
1690836275.476475:pyrogue.batcher.CoreV1: Not enough space (134240) for frame (147504)

Ryan: That error means that there is an error parsing the frame: the parser needs to process 1240877856 bytes but only has 23264 remaining in the current frame.

Ben : A fifo is overflowing and truncating frames, corrupting them.


10000 frames @ 1000 FPS caused corruption as well.

Started the investigation at DigitalAsicStreamAxiV2.vhd. Surprisingly, corruption seems to happen even at a much slower rate (2 frames @ 10 Hz). The overflow counters start counting as soon as the second frame arrives. It is not yet clear how the second image is arriving at the server.

For each ASIC there are 24 fifos (1 per lane), each 512 deep and 19 bits wide, running at 42 MHz (the ASICs read out at 168 MHz). Each lane carries 3072 pixels. Data is read from these fifos, a header is appended, and the result is fed into a single fifo that is 8192 deep and 48 bytes wide, written at 42 MHz but read at 156.25 MHz. An AxiStreamResizer then reduces the width to 16 bytes. These 16 bytes are combined with timing and sent to the core, which also runs at 156.25 MHz. Back pressure is applied at several stages and propagates all the way to the dual clock fifo, whose full flag is used to increment the overflow counter.
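
The buffering figures can be put into numbers with a quick calculation. This is a sketch under two assumptions that are not stated in the notes: each pixel occupies 2 bytes in the 48-byte word (48 bytes / 24 lanes), and the header is a single 48-byte word. Together they reproduce the 147504-byte frame size that appears in the batcher log, which is why they were chosen.

```python
LANES = 24                # fifos per ASIC, one per lane
PIXELS_PER_LANE = 3072    # pixels per lane per frame
BYTES_PER_PIXEL = 2       # assumption: 48-byte word shared by 24 lanes
HEADER_BYTES = 48         # assumption: one 48-byte header word per frame
LANE_FIFO_DEPTH = 512     # entries in each per-lane fifo
MAIN_FIFO_DEPTH = 8192    # entries in the dual clock fifo
MAIN_WORD_BYTES = 48      # width of the dual clock fifo

frame_bytes = LANES * PIXELS_PER_LANE * BYTES_PER_PIXEL + HEADER_BYTES
main_fifo_bytes = MAIN_FIFO_DEPTH * MAIN_WORD_BYTES
write_rate = 42e6 * MAIN_WORD_BYTES    # peak bytes/s into the dual clock fifo
read_rate = 156.25e6 * 16              # bytes/s on the 16-byte stream after the resizer

print("frame size (bytes):", frame_bytes)
print("lane fifo holds %.3f of one lane's frame" % (LANE_FIFO_DEPTH / PIXELS_PER_LANE))
print("dual clock fifo holds %.2f frames" % (main_fifo_bytes / frame_bytes))
print("peak write %.3f GB/s, downstream read %.3f GB/s" % (write_rate / 1e9, read_rate / 1e9))
print("payload at 5000 FPS: %.1f MB/s per ASIC" % (frame_bytes * 5000 / 1e6))
```

Under these assumptions each per-lane fifo holds only one sixth of a lane's frame and the dual clock fifo holds fewer than three frames, so any stall downstream that lasts more than a couple of frames will show up as overflow.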

Overflow detected

Got to the bottom of this. The number of triggers sent to the ASICs was higher than the number sent to the hardware: the ASICs were already being triggered while the readout logic was not yet ready, so the fifos were storing data. Fixing the order in software (StrtAutoTrigger function) fixed the issue. Acquisitions of 5000 frames at up to 5000 FPS were done with ASIC3 with no overflow detected.

Next, testing storing data, and all ASICs together.

Continued: when storing data is enabled, data overflow is observed in the fifos of the DigAsicStrmRegisters. The back pressure seems to propagate all the way to the beginning (which is reasonable and good), meaning that all the fifos are used to their maximum potential. Around 70000 frames can be written to disk before an overflow is detected.


Mysterious horizontal flipped lines

The origin of the horizontal lines denoted in this figure cannot be explained. The first line seems to be flipped with the last. 

A series of tests were done to identify if these lines come from the same image or from the previous acquisition.

I performed the following acquisition in sequence

  1. I acquired an image without any special setting
  2. Enabled charge injection on columns 80 to 100, and acquired an image
  3. I acquired another image with no special setting.

The charge injection image is as follows

The subtraction of 1 and 3 is the following

Bottom line: there does not seem to be any artifact propagating from one image to another, so we will solve the issue by exchanging these lines in the descrambler.

After investigation, it seems that:

  • Data of row 47 comes in row 95
  • Data of row 95 comes in row 143
  • Data of row 143 comes in row 191 
  • Data of row 191 comes in row 47

See image below after organizing the lines
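
The row exchange described by the list above can be written as a small lookup. This is a sketch of the idea only, operating on an image stored as a list of rows; it is not the descrambler code, and the names `ARRIVES_IN` and `reorder_rows` are made up for illustration.

```python
# Observed displacement: data of row 47 arrives in row 95, 95 in 143, 143 in 191, 191 in 47.
ARRIVES_IN = {47: 95, 95: 143, 143: 191, 191: 47}


def reorder_rows(image):
    """Return a copy of `image` (a list of rows) with the four displaced rows put back."""
    fixed = list(image)
    for true_row, arrived_row in ARRIVES_IN.items():
        fixed[true_row] = image[arrived_row]
    return fixed


# Self-check with a synthetic 192-row image in which every row is labelled by its index.
true_image = [[r] * 4 for r in range(192)]
received = list(true_image)
for true_row, arrived_row in ARRIVES_IN.items():
    received[arrived_row] = true_image[true_row]

print(reorder_rows(received) == true_image)
```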

This shift does not happen when lanes are disabled and the enumerate feature is activated, meaning that the bug is somewhere before the enumerate assignment and is not in the descrambler.

Next a simulation from end to end was done by injecting a fixed pattern in each lane and picking it up on software. The horizontal line shift is evident, but that is because the patch is applied. See images below

Row 142 is the row before the color shifts, and it should be row 143. The same holds for rows 95, 47 and 191. Debugging in simulation.

Once the patch is removed, the images come out without shift. So this rules out that the bug is in the firmware or the descrambler, and has to be in the ASIC.


After discussing with Lorenzo and Dionisio, we ran some tests that inject patterns before the 8b10b encoder by setting the ro_mode_i register to 0x1 (vertical strips) and 0x3 (ramp), without the workaround.

The workaround above did not seem reasonable, as the rows are not from the same bank (lane) and the fix had to stay within the bank (lane), so our next guess is the following

Shifting all rows down by 1, then setting

  • row 47 to 0
  • row 95 to 48
  • row 143 to 96
  • row 191 to 144

In other words, a downward rotation within each bank. The final image with a cross laser is as follows
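
The per-bank rotation can be sketched as follows. It assumes 4 banks of 48 rows (192 rows in total), which is inferred from the row numbers 47, 95, 143 and 191 rather than stated explicitly, and it works on an image stored as a list of rows. The function name `rotate_banks_down` is hypothetical.

```python
ROWS_PER_BANK = 48  # assumption: 4 banks of 48 rows, 192 rows in total


def rotate_banks_down(image, rows_per_bank=ROWS_PER_BANK):
    """Shift every row down by one inside its bank; the last row of a bank wraps to its first.

    The number of rows must be a multiple of `rows_per_bank`.
    """
    out = [None] * len(image)
    for src, row in enumerate(image):
        bank_start = (src // rows_per_bank) * rows_per_bank
        dst = bank_start + (src - bank_start + 1) % rows_per_bank
        out[dst] = row
    return out


# Synthetic image whose rows are labelled with their original index.
image = [[r] for r in range(192)]
rotated = rotate_banks_down(image)
print([rotated[r][0] for r in (0, 48, 96, 144)])  # original rows now at each bank's first row
```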

Applying the workaround will screw up the image coming from any source at or after the 8b10b encoder. Here is an example of setting the ASIC register ro_mode_i to 0x3. The rows that stand out are 96 and 144 that come from 144 and 191 respectively.


Locking to LCLS-II timing


The l2si-xpm server configuration is here. Some extra tips:

XPM/EVR: This setting means your triggering either follows the timing input XPM Partition/ReadoutGroup selection (XPM source), or is decided separately from the timing input using EVR-type event logic (FixedRate + Destination), which appears in the EvrV2ChannelReg/EvrV2TriggerReg modules.

You can enable loopback in the xpm-server by choosing the link number and then setting the loopback register to 1.

You can test timing by looping back both sides and checking whether the link locks.

Two issues are observed with the LCLS-II timing integration. 

  1. Timing does not lock at all. The decoder and disparity error counters count indefinitely.
  2. Decoder and disparity errors stop counting, and sof and eof start counting, but RxLinkUp never goes up.

For 1, for some reason, the transceiver is not locking. For 2, the state machine that sets RxLinkUp does not seem to function correctly. With a bad link, the transceiver seems to struggle to lock, and the signals do not seem to be set in the expected sequence. However, if an ILA is synthesized to monitor these signals, locking happens every time, and relatively fast.


RAM usage


+--------------------------+------------+-----------+-------+-----------+
| Memory Type              | Total Used | Available | Util% | Inferred% |
+--------------------------+------------+-----------+-------+-----------+
| URAM                     |          0 |       128 |  0.00 |      0.00 |
| BlockRAM                 |      500.5 |       984 | 50.86 |    100.00 |
|  RAMB36E2                |        444 |           |       |    100.00 |
|  RAMB18E2                |        113 |           |       |    100.00 |
| LUTMs as Distributed RAM |       7868 |    161280 |  4.88 |    100.00 |
|  LUTMs as RAM32X1D       |        704 |           |       |    100.00 |
|  LUTMs as RAM32M16       |       4112 |           |       |    100.00 |
|  LUTMs as RAM32M         |        380 |           |       |    100.00 |
|  LUTMs as RAM256X1D      |       2672 |           |       |    100.00 |
+--------------------------+------------+-----------+-------+-----------+


URAM usage is 0. Each URAM block is 288 Kb.
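
The figures in the utilization report can be cross-checked, and the unused URAM capacity estimated, with a few lines of arithmetic. This sketch takes the counts from the report and uses 36 Kb per RAMB36E2 and 288 Kb per URAM block; the variable names are just for illustration.

```python
RAMB36 = 444            # RAMB36E2 primitives used
RAMB18 = 113            # RAMB18E2 primitives used
BRAM_AVAILABLE = 984    # available BlockRAM tiles (RAMB36 equivalents)
URAM_AVAILABLE = 128    # available URAM blocks, none used
KBIT_PER_RAMB36 = 36
KBIT_PER_URAM = 288

bram_tiles = RAMB36 + RAMB18 / 2              # a RAMB18 counts as half a tile
bram_util = 100 * bram_tiles / BRAM_AVAILABLE
bram_used_kbit = bram_tiles * KBIT_PER_RAMB36
uram_free_kbit = URAM_AVAILABLE * KBIT_PER_URAM

print("BlockRAM tiles used:", bram_tiles)
print("BlockRAM utilization: %.2f %%" % bram_util)
print("BlockRAM used: %.0f Kb" % bram_used_kbit)
print("URAM unused: %d Kb" % uram_free_kbit)
```

The tile count and utilization agree with the report (500.5 tiles, 50.86 %). The unused URAM amounts to roughly twice the BlockRAM currently in use, so it could be an option if deeper fifos are needed.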

Debugging failing lanes

ASIC 2 lanes 23, 22 and 16 seem to be always dead. Here is an acquisition from chipscope. From the looks of it, lanes 23 and 22 seem out of the norm, while lane 16 seems to be sending data. Next step: analyse the data.



Talked to Dionisio; the conclusions are:

  • If it were the ASIC, we would see the issue in all ASICs
  • If it were the carrier board, since the schematic is hierarchical, we would see the issue in all ASICs
  • Since all carriers were tested with only one digital board, the defect could be anywhere on the digital board.
  • We plan to try continuity tests from the connector to the AC coupler capacitor on the digital board. 

Observations

| Digital | Carrier | Image | Notes |
|---|---|---|---|
| C00-02 | Copper strong back, John Doe (used in ASC) | | ASIC 2 lanes 22 and 16 working |
| C00-02 | RX000 | | ASIC 2 lanes 23, 22, 21, 16, 13, 9, 5, 1 not working |
| C00-01 | RX000 | | ASIC 2 lanes 23, 22 and 16 not working |
| C00-03 | RX001 | | ASIC 2 lanes 22, 21, 20, 19, 16, 13, 5, 1 not working |
| C00-01 | Copper strong back, John Doe (used in ASC) | | No thermal pad. No screws. |
| C00-01 | Copper strong back, John Doe (used in ASC) | | No thermal pad. With carrier cover and with screws. |

From these observations, we suspect that the digital board has a problem, most probably the AC coupling capacitors. Next, test with the prototype in ASC and one of our carriers to see whether the lane failure problem persists.
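
To see which lanes fail regardless of the board combination, the observations can be compared as sets. The sketch below uses the non-working ASIC 2 lanes recorded for the three RX carrier combinations; the function names are hypothetical.

```python
# Non-working ASIC 2 lanes per (digital board, carrier), copied from the observations.
failing = {
    ("C00-02", "RX000"): {23, 22, 21, 16, 13, 9, 5, 1},
    ("C00-01", "RX000"): {23, 22, 16},
    ("C00-03", "RX001"): {22, 21, 20, 19, 16, 13, 5, 1},
}


def always_failing(observations):
    """Lanes that fail in every listed combination."""
    return set.intersection(*observations.values())


def failing_anywhere(observations):
    """Lanes that fail in at least one listed combination."""
    return set.union(*observations.values())


print("fail in every combination:", sorted(always_failing(failing)))
print("fail in at least one combination:", sorted(failing_anywhere(failing)))
```

Lanes 16 and 22 fail in all three combinations, and these are the same two lanes that were seen working with the copper strong back carrier on C00-02.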

List of tasks


Add this submodule to both projects https://github.com/slaclab/AsicRegMapping

version stuff and reproduce timing lock success at least 3 times: bitstream 6bbaaf4

Make power cable (Julian) - on it

Test all ASICs together at 5000 FPS

Port ADC stuff from TXI

Img descrambling in firmware

Connect transceivers to MM-SM converter

Fix timing: What goes on when RxLinkUp is up?

Make script for Lorenzo

Send digital board to increase analog voltage to 2V0

Fix scrambling: replace first and last lines in each horizontal line and test

High speed acquisition seems to cause corruption. Fifo in firmware seems to get full. See what is going on.

Investigate delays

Fixed descrambling algorithm

Make a Jupyter notebook for Lorenzo

Charge injection seems messed up for columns

Test MM to SM converter boxes



