June 17, 2024

With Ryan, Larry, Mudit, Ric, Gabriel, CPO

...

Debugging session with Jeremy, Ryan, Gabriel, CPO and Ric
  - The issue with test_dma.cu failing on rdsrv403 is that the frame is too big for the buffer
    - Change PrbsTx.PacketLength to 0xff from 0xfff
    - test_dma then works on rdsrv403
    - Ryan found this from either AxiGpuAsyncCore.WriteAxiErrorVal or ReadAxiErrorVal having value 0x4
  - We modified test_dma.cu to write the AxiPcieCore.AxiVersion.scratchpad register with 0xdeadbeef
    - We saw 0xdeadbeef appear in the GUI
    - So GPU to KCU writes seem to be working
      - There is some worry that we don't know whether the path to the KCU might be going through the CPU
  - We modified test_dma.cu to replace the spin on the handshake location with a
    getchar() so we can dump the buffer after we trigger the read (both
    test_dma.cu changes are sketched after this list)
    - We see that all the data is zeros rather than junk or random data
    - This explains why the spin never returns
    - AxiGpuAsyncCore.ReadAxiErrorVal shows an error value of 0x3 after
      attempting to transfer one frame
    - PrbsTx.Busy is 0
    - (Didn't catch the meaning of the error code other than that the transfer failed)
  - Jeremy determined that on our system gpu001 we're using the closed source nvidia
    driver rather than the open source one installed on rdsrv403
    - He'll set us up with the open source driver
  - Ryan points out that the two GPU cards are different
    - CPO will lend them gpu002's A5000 to try in rdsrv403
    - rdsrv403 looks to have only 1 root complex with a number of hubs and
      bridges, different from our nodes'
  - If the problem is the root complex, it's not clear that we can rearrange the
    cards in our nodes to be on the same root complex due to slots and space constraints
    - CPO suggests moving to the HSD box in Lab 3 in that case because it has
      a much larger PCIe bus
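
For reference, a minimal sketch of the two test_dma.cu modifications above, assuming the KCU register BAR is already mapped into the GPU's address space. The names (barGpu, scratch, buf, dumpAfterTrigger) are illustrative, not the actual test_dma.cu symbols, and the mapping setup is elided:

    #include <cstdio>
    #include <cstdint>

    // GPU -> KCU write test: store 0xdeadbeef into the ScratchPad register,
    // then check the GUI for the value.  Launched e.g. as
    // writeScratch<<<1,1>>>(barGpu, SCRATCH_OFFSET);
    __global__ void writeScratch(volatile uint32_t* barGpu, size_t scratch) {
        barGpu[scratch / sizeof(uint32_t)] = 0xdeadbeef;
    }

    // KCU -> GPU check: instead of spinning on the handshake word (which
    // never returned), pause with getchar() so the frame can be triggered,
    // then dump the buffer.  All zeros means no data ever landed.
    void dumpAfterTrigger(volatile uint32_t* buf, int nWords) {
        getchar();
        for (int i = 0; i < nWords; i++)
            printf("word %2d: 0x%08x\n", i, buf[i]);
    }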

June 24, 2024

We learned that there will be two separate free-lists: one for CPU, one for GPU
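
A minimal sketch of what the two free-lists could look like on the software side, assuming DMA buffers are handed around by index; the names are illustrative, not the actual driver interface:

    #include <cstdint>
    #include <vector>

    // One free-list per owner: a buffer index lives on exactly one list,
    // depending on whether the CPU or the GPU receives the next DMA into it.
    struct FreeLists {
        std::vector<uint32_t> cpuFree;   // buffers available for CPU-bound DMA
        std::vector<uint32_t> gpuFree;   // buffers available for GPU-bound DMA

        void releaseToCpu(uint32_t idx) { cpuFree.push_back(idx); }
        void releaseToGpu(uint32_t idx) { gpuFree.push_back(idx); }

        bool allocCpu(uint32_t& idx) {
            if (cpuFree.empty()) return false;
            idx = cpuFree.back(); cpuFree.pop_back(); return true;
        }
        bool allocGpu(uint32_t& idx) {
            if (gpuFree.empty()) return false;
            idx = gpuFree.back(); gpuFree.pop_back(); return true;
        }
    };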

Two firmware requests:

  • LCLS would like a GPU-enabled TDet firmware (timing system triggers a fake-payload of user-defined size)
    • generates fake payloads 
    • Ryan said that TID needs to come up with a block-diagram for this
  • For a real detector: LCLS would like multiple detector lanes (8 for the KCU) built by the Batching EB in the kcu1500 firmware
    • this is separate from the BEB on the FEB, which joins timing to data
    • a "partial" event builder (detector segments remain split across KCUs)
    • this is currently done for epixHR (we think the old pre-existing timing-stream is disabled in the KCU BEB)
    • Mudit could modify the existing epixHR kcu1500 firmware, but eliminate timing system and expand to 8 data lanes
    • could we eliminate the front-end BEB to avoid a level of tail-iteration?  Ryan thinks we might even be able to avoid the KCU tail-iteration somehow

There are two different uses of the word "batching": batching-event-builder (which CPO thinks of as being just an event builder) and batching in the sense of giving the KCU one "wakeup" call for multiple events (a "batch").
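
To illustrate the second sense, a sketch of a batched wakeup, assuming events are described by entries in a ring buffer; the names are illustrative, not the actual firmware or driver interface:

    #include <cstdint>

    struct Desc { uint32_t index; uint32_t size; };  // one event descriptor

    void processEvent(const Desc&);  // event handling elided

    // One "wakeup" per batch: the consumer is signalled once, then drains
    // every descriptor that accumulated since the last wakeup, rather than
    // taking one wakeup per event.
    void onWakeup(volatile const uint32_t* writePtr, uint32_t& readPtr,
                  const Desc* ring, uint32_t ringSize) {
        uint32_t wr = *writePtr;   // snapshot of the producer position
        while (readPtr != wr) {
            processEvent(ring[readPtr % ringSize]);
            ++readPtr;
        }
    }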

Example of two-level batching event builder (BEB) in the FEB and KCU:

FEB "a" event-build for event 1 with two detector segments a,b and a timestamp t1: t1,d1a,d1b event-built into "f1a"
FEB "b" event-build for event 1 with two detector segments c,d and a timestamp t1: t1,d1c,d1d event-built into "f1b"

KCU batch event-build for event 1: f1a,f1b (static size for HR, UHR, but variable for sparkPIX).  Note that this is a partial event-build because other detector segments will be connected to other KCUs.  Existing LCLS software does that high-level event-build.

Note: we will always get a packet from sparkPIX, even with an empty payload
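
A sketch of the two-level layout just described, with illustrative types; the actual frame formats are defined by the firmware:

    #include <cstdint>

    // Level 1 (FEB BEB): one timestamp joined with that FEB's detector
    // segments, e.g. f1a = { t1, d1a, d1b } and f1b = { t1, d1c, d1d }.
    struct FebFrame {
        uint64_t timestamp;    // t1
        uint32_t payloadSize;  // static for HR/UHR, variable for sparkPIX
        // ...followed in memory by the segment data (d1a,d1b or d1c,d1d)
    };

    // Level 2 (KCU BEB): the partial event-build for event 1 concatenates
    // the FEB frames reaching this KCU, f1a then f1b.  Other detector
    // segments land on other KCUs, and existing LCLS software completes
    // the event-build above this level.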

July 1, 2024

  • We think the GPU can write to the KCU, but Jeremy will try to confirm that this is really the case.  We think KCU-to-GPU is broken.
  • Mudit will work on TDet firmware
  • Jeremy and TID will use rdsrv419
  • LCLS should find another machine where we can control the root-complex topology better.  Lab 3's daq-tst-dev06?
    • could be that the CPU datadev driver is incompatible with the GPU datadev driver
    • should probably find a different machine.  A FEE-alcove machine?
  • chris and gabriel on vacation July 6-20
  • Jeremy unavailable July 8-10

July 8, 2024

  • Jeremy and CPO are on vacation
  • Chris and I have set up daq-tst-dev06 in Lab 3 for testing with a GPU & KCU
    • The KCU appears as /dev/datagpu_1 (see the probe sketch after this list)
    • After some debugging to get the software to recognize /dev/datagpu_1 (as opposed to _0), the interCardGui comes up and shows sensible values
    • test_dma also runs, but the AxiVersion.ScratchPad register does not go from 0x0 to 0xdeadbeef
      • Neither does test_dma see any DMAs
  • Larry urges moving axi-pcie-devel forward to v4.1.0 rather than working with v4.0.0
    • Mudit has created v4.2.0 (CPU/GPU DMA switch?) but it has not been tested with hardware yet
    • It is unclear whether Jeremy has gotten a baseline working system yet
  • Ryan agreed that my hacks of test_dma.cu and _Root.py to target /dev/datagpu_1 should be sufficient
  • Ryan would like us all to work on one machine and get confidence in it before we branch out to different machines
  • I suggested that we could pull out the current /dev/datagpu_0 from dev06 so that the KCU of interest becomes _0
    • Ryan & Larry suggested pullling the other ones out as well
  • Ryan suggests setting iommu=no in the BIOS as well as on the command line
    • I later found that there is no iommu parameter in dev06's BIOS
  • There's no experience with multiple KCU cards and the datagpu driver - it is untested
    • Larry is aware of some issue with the usual datadev driver and multiple KCUs
  • Work on rdsrv419
    • A GPU is installed
    • CUDA is installed
    • A KCU is installed at PCIe address 03:00
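
A small probe for the device-numbering question above, assuming the datagpu driver exposes one node per card as /dev/datagpu_N; this uses plain POSIX open(), and the real test_dma.cu/_Root.py handling may differ:

    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    // Report which /dev/datagpu_N nodes exist; on dev06 the KCU of interest
    // currently enumerates as /dev/datagpu_1, not _0.
    int main() {
        for (int i = 0; i < 4; i++) {
            char path[32];
            snprintf(path, sizeof(path), "/dev/datagpu_%d", i);
            int fd = open(path, O_RDWR);
            if (fd >= 0) { printf("%s: present\n", path); close(fd); }
            else         { printf("%s: absent\n",  path); }
        }
        return 0;
    }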

July 15, 2024

  • CPO and Ryan on vacation
  • Let's wait to contact NVIDIA about our GPUDirect issues until Chris gets back
  • Ric to look into the BIOS settings again to see if we can glean something

  • Jeremy is finding that when he executes KCU register reads with various methods, he always gets back the Version value from register offset 0
    • So far, he's baffled
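
A read-back sketch that could help narrow this down, assuming the node supports the dmaReadRegister() helper from aes-stream-drivers' DmaDriver.h (if this driver's register path differs, treat it as pseudocode). If every offset returns the Version value, the reads are likely all resolving to address 0 somewhere along the path:

    #include <cstdio>
    #include <cstdint>
    #include <fcntl.h>
    #include <unistd.h>
    #include <DmaDriver.h>   // dmaReadRegister(), from aes-stream-drivers

    int main() {
        int fd = open("/dev/datagpu_0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }
        // Offsets are illustrative; distinct registers should return
        // distinct values if address decoding is working.
        uint32_t offsets[] = { 0x0000, 0x0004, 0x0008, 0x0400 };
        for (uint32_t off : offsets) {
            uint32_t val = 0;
            dmaReadRegister(fd, off, &val);
            printf("offset 0x%04x -> 0x%08x\n", off, val);
        }
        close(fd);
        return 0;
    }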