  • CPO and Ryan on vacation
  • Let's hold off on contacting NVIDIA about the GPUDirect issues until Chris gets back
  • Ric to look into the BIOS settings again to see if we can glean something

  • Jeremy is finding that when he executes KCU register reads with various methods, he always gets back the Version value from register offset 0

    • So far, he's baffled

July 22, 2024

  • rdsrv419 scratchpad read/write tests: Jeremy/Ryan don't understand the failure mode. Reads seem to return only the register at offset 0 (the version register). Lower priority, since this is register-space access, which we don't need.
  • Jeremy wants to work on implementing the cuda-graph control mode (one for each event buffer).  cuda-graphs can't launch themselves, but there are "scheduling kernels" that can do "tail launches".
  • Larry update on configuring rdsrv415: the kernel driver builds, but there is a complaint about an incompatibility between the kernel driver and the user application. Ryan will work with Larry on this.
  • Report from Ric on dev06 and gpu001:
    • looked at differences with rdsrv419. made as many of the settings the same as possible, but that didn't help.
    • gpu001: noticed the gpu001 drivers were proprietary (dkms). Tried to install open drivers, but cuda magically updated from 12.3 to 12.4. Status is the same: can write scratchpad register, but DMAs still don't work.
      • Gabriel suggested access control ("ACS") settings, but so far that hasn't helped.
    • dev06: pulled out all the other cards from dev06, but scratchpad register write and DMA still don't work.
      • changed bios settings (disabling access-control-system parameter) and that allowed scratchpad to work!
  • mudit tdet firmware:
    • firmware is done with a switch to move data between cpu/gpu
    • learning about scripts from Matt/Ric to test.  needs a machine to test: just to CPU for now (until we get GPU dma working)
      • matt suggests using kcu in dev02 as an xpm and drive a kcu in dev06 and use the datadev_gpu driver there to test.  cpo will work with mudit to get this working.
  • some ideas for debugging DMA issue
    • test gpu-direct with standard gpu-gpu (cpo/ric/Gabriel).  use dev06 for this.
    • test gpu-direct with kcu-kcu (can control both sides). TID has a "pip" tool for this (perhaps it was used for Ryan Coffee's stuff or dune in 2019?): https://github.com/slaclab/axi-pcie-core/blob/main/protocol/pip/rtl/AxiPciePipCore.vhd. Larry/Mudit could work on this? But Larry is gone until Aug 7 or so. use rdsrv416 for this (mudit's machine).
    • pcie bus analyzer? Ryan thinks it doesn't exist.
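
A possible way to make the rdsrv419 scratchpad symptom (every read returning the version register at offset 0) easy to recognize in a test script is sketched below. This is only an illustration under stated assumptions: RegisterSpace and StuckAtZeroSpace are hypothetical in-memory stand-ins, not the real driver or KCU register map, and the scratchpad offset 0x4 is made up for the example. The test patterns need to differ from the version value for the diagnosis to mean anything.

```python
class RegisterSpace:
    """Hypothetical in-memory stand-in for a 32-bit register block.

    Word 0 holds the version value; everything else starts at zero.
    """

    def __init__(self, version, size_words=16):
        self.words = [0] * size_words
        self.words[0] = version

    def read(self, offset):
        # offset is in bytes; registers are 4 bytes wide
        return self.words[offset // 4]

    def write(self, offset, value):
        self.words[offset // 4] = value & 0xFFFFFFFF


class StuckAtZeroSpace(RegisterSpace):
    """Models the observed failure mode: every read returns offset 0."""

    def read(self, offset):
        return self.words[0]


def scratchpad_check(regs, scratch_offset=0x4,
                     patterns=(0xA5A5A5A5, 0x5A5A5A5A)):
    """Write each pattern to the scratchpad and classify the readback.

    scratch_offset is an assumed value, not taken from the firmware.
    Returns one of "ok", "aliased_to_version", "mismatch" per pattern.
    """
    version = regs.read(0)
    results = []
    for pattern in patterns:
        regs.write(scratch_offset, pattern)
        got = regs.read(scratch_offset)
        if got == pattern:
            results.append("ok")
        elif got == version:
            # Readback equals the version register: the read address
            # is probably being ignored somewhere along the path.
            results.append("aliased_to_version")
        else:
            results.append("mismatch")
    return results


if __name__ == "__main__":
    print(scratchpad_check(RegisterSpace(0x01020304)))
    print(scratchpad_check(StuckAtZeroSpace(0x01020304)))
```

Replacing the stand-in classes with whatever read/write access method is under test would give the same three-way classification on real hardware, which separates "write did not land" from "read address is being ignored".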