...

  • Jeremy and CPO are on vacation
  • Chris and I have set up daq-tst-dev06 in Lab 3 for testing with a GPU & KCU
    • The KCU appears as /dev/datagpu_1
    • After some debugging to get the software to recognize /dev/datagpu_1 (as opposed to _0), the interCardGui comes up and shows sensible values
    • test_dma also runs, but the AxiVersion.ScratchPad register does not go from 0x0 to 0xdeadbeef
      • Nor does test_dma see any DMAs
  • Larry urges us to move axi-pcie-devel forward to v4.1.0 rather than working with v4.0.0
    • Mudit has created v4.2.0 (CPU/GPU DMA switch?) but it has not been tested with hardware yet
    • It is unclear whether Jeremy has gotten a baseline working system yet
  • Ryan agreed that my hacks of test_dma.cu and _Root.py to target /dev/datagpu_1 should be sufficient
  • Ryan would like us all to work on one machine and get confidence in it before we branch out to different machines
  • I suggested that we could pull out the current /dev/datagpu_0 from dev06 so that the KCU of interest becomes _0
    • Ryan & Larry suggested pulling the other ones out as well
  • Ryan suggests setting iommu=no in the BIOS as well as on the command line
    • I later found that there is no iommu parameter in dev06's BIOS
  • There's no experience with multiple KCU cards and the datagpu driver - it is untested
    • Larry is aware of some issue with the usual datadev driver and multiple KCUs
  • Work on rdsrv419
    • A GPU is installed
    • CUDA is installed
    • A KCU is installed and has PCIe slot 03:00
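
The scratchpad check noted above (AxiVersion.ScratchPad should go from 0x0 to 0xdeadbeef) boils down to a write followed by a readback. A minimal sketch of that logic, assuming hypothetical `write`/`read` callables and an in-memory `FakeRegister` stand-in; none of this is the actual test_dma or _Root.py code:

```python
from typing import Callable

SCRATCH_PATTERN = 0xDEADBEEF  # value the notes expect to see in the scratchpad


def scratchpad_roundtrip(write_reg: Callable[[int], None],
                         read_reg: Callable[[], int],
                         pattern: int = SCRATCH_PATTERN) -> bool:
    """Write `pattern`, read it back, and report whether the write stuck."""
    write_reg(pattern)
    return read_reg() == pattern


class FakeRegister:
    """Hypothetical in-memory stand-in for a 32-bit register (illustration only)."""

    def __init__(self, writable: bool = True) -> None:
        self.value = 0x0
        self.writable = writable

    def write(self, value: int) -> None:
        # A non-writable register silently drops the write, like the failure seen on dev06.
        if self.writable:
            self.value = value & 0xFFFFFFFF

    def read(self) -> int:
        return self.value


good = FakeRegister(writable=True)
stuck = FakeRegister(writable=False)
good_ok = scratchpad_roundtrip(good.write, good.read)
stuck_ok = scratchpad_roundtrip(stuck.write, stuck.read)
print(f"writable register: {good_ok}, stuck register: {stuck_ok}")
```

On real hardware the two callables would have to be replaced by whatever register access path is in use; the fake only shows the pass/fail condition.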

July 15, 2024

  • CPO and Ryan on vacation
  • Let's hold off on contacting NVIDIA about the GPUDirect issues until Chris gets back
  • Ric to look into the BIOS settings again to see if we can glean something
  • Jeremy is finding that when he executes KCU register reads with various methods, he always gets back the Version value from register offset 0
    • So far, he's baffled
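
One way to make the symptom above concrete is to read several offsets and flag every one that returns the same value as offset 0. A rough sketch, assuming a hypothetical `read_reg(offset)` callable and a made-up version value; it is a heuristic only, since a register could legitimately hold the same value as the version register:

```python
from typing import Callable, Dict, Iterable


def find_aliased_offsets(read_reg: Callable[[int], int],
                         offsets: Iterable[int]) -> Dict[int, int]:
    """Return {offset: value} for nonzero offsets that read back the offset-0 value."""
    base = read_reg(0)
    aliased: Dict[int, int] = {}
    for off in offsets:
        if off == 0:
            continue
        val = read_reg(off)
        if val == base:
            aliased[off] = val
    return aliased


HYPOTHETICAL_VERSION = 0x04000000  # made-up value, not read from any KCU

# Healthy device: each offset holds its own value (illustrative contents).
healthy_map = {0x0: HYPOTHETICAL_VERSION, 0x4: 0xDEADBEEF, 0x8: 0x12345678}


def healthy_read(offset: int) -> int:
    return healthy_map[offset]


def broken_read(offset: int) -> int:
    # Mimics the reported failure: every read returns the version register.
    return HYPOTHETICAL_VERSION


probe = [0x0, 0x4, 0x8]
healthy_hits = find_aliased_offsets(healthy_read, probe)
broken_hits = find_aliased_offsets(broken_read, probe)
print("healthy:", healthy_hits)
print("broken: ", {hex(k): hex(v) for k, v in broken_hits.items()})
```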

July 22, 2024

  • rdsrv419 scratchpad read/write tests: Jeremy and Ryan don't understand the failure mode. Reads seem to return only the register at offset 0 (the version register). Lower priority, since this concerns register-space access, which we don't need.
  • Jeremy wants to work on implementing the cuda-graph control mode (one for each event buffer). cuda-graphs can't launch themselves, but there are "scheduling kernels" that can do "tail launches".
  • Larry's update on configuring rdsrv415: he can build the kernel driver, but gets a complaint about an incompatibility between the kernel driver and the user application. Ryan will work with Larry on this.
  • Report from Ric on dev06 and gpu001:
    • Looked at the differences with rdsrv419 and made as many of them the same as possible, but it didn't help.
    • gpu001: noticed that the gpu001 drivers were proprietary (dkms). Tried to install the open drivers, but CUDA magically updated from 12.3 to 12.4. Status is the same: can write the scratchpad register, but DMAs still don't work.
      • Gabriel suggested changing the access control ("ACS") settings, but so far that hasn't helped.
    • dev06: pulled out all the other cards from dev06, but scratchpad register write and DMA don't work.
      • Changed BIOS settings (disabling the access-control-system parameter), and that allowed the scratchpad to work!
  • Mudit's tdet firmware:
    • Firmware is done, with a switch to move data between CPU/GPU
    • Learning about scripts from Matt/Ric to test. Needs a machine to test: just to CPU for now (until we get GPU DMA working)
      • Matt suggests using the KCU in dev02 as an XPM to drive a KCU in dev06, and using the datadev_gpu driver there to test. CPO will work with Mudit to get this working.
  • Some ideas for debugging the DMA issue:
    • Test gpu-direct with standard GPU-GPU (CPO/Ric/Gabriel). Use dev06 for this.
    • Test gpu-direct with KCU-KCU (can control both sides). TID has a "pip" tool for this (perhaps it was used for Ryan Coffee's stuff or dune in 2019?). https://github.com/slaclab/axi-pcie-core/blob/main/protocol/pip/rtl/AxiPciePipCore.vhd. Larry/Mudit could work on this? But Larry is gone until Aug 7 or so. Use rdsrv416 for this (Mudit's machine).
    • PCIe bus analyzer? Ryan thinks it doesn't exist.
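
Since the ACS BIOS change made a difference on dev06, it may be worth comparing the ACS control bits that `lspci -vvv` reports on each machine. A minimal parsing sketch: the `SAMPLE` text below is illustrative and was not captured from dev06, gpu001, or rdsrv419, and the helper name `parse_acsctl` is made up for this note.

```python
import re
from typing import Dict

# Illustrative fragment in the style of `lspci -vvv`; NOT real output from our machines.
SAMPLE = """\
    Capabilities: [148 v1] Access Control Services
        ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
"""


def parse_acsctl(lspci_text: str) -> Dict[str, bool]:
    """Return the flags of the first ACSCtl line as {name: enabled}; {} if none found."""
    for line in lspci_text.splitlines():
        line = line.strip()
        if line.startswith("ACSCtl:"):
            body = line[len("ACSCtl:"):]
            # Each flag is a name followed by '+' (enabled) or '-' (disabled).
            return {name: sign == "+" for name, sign in re.findall(r"(\w+)([+-])", body)}
    return {}


flags = parse_acsctl(SAMPLE)
enabled = sorted(name for name, on in flags.items() if on)
print("ACS control bits enabled:", enabled)
```

On a real host the text would come from running `lspci -vvv` with sufficient privileges. Whether any particular bit matters for our DMA problem is an open question, so this only helps to diff the machines.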