...
- CPO and Ryan on vacation
- Let's hold off on contacting NVIDIA about the GPUDirect issues until Chris gets back
- Ric to look into the BIOS settings again to see if we can glean something
- Jeremy is finding that whichever method he uses to execute KCU register reads, he always gets back the Version value from register offset 0. So far, he's baffled.
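One way to pin down the register-read symptom is to separate "every read returns the offset-0 value" from "the scratchpad write did not stick". The sketch below is only an illustration: it uses a `bytearray` as a stand-in for the KCU register space, and the scratchpad offset, the version value, and all helper names are hypothetical (only "version register at offset 0" comes from the notes).

```python
import struct

VERSION_OFFSET = 0x00     # version register at offset 0, per the notes
SCRATCHPAD_OFFSET = 0x04  # hypothetical offset, for illustration only


def make_accessors(mem, ignore_offset=False):
    """Return (read32, write32) over a bytearray standing in for the register space.

    ignore_offset=True mimics the observed failure: every read is served
    from offset 0 regardless of the requested address.
    """
    def read32(offset):
        effective = 0 if ignore_offset else offset
        return struct.unpack_from("<I", mem, effective)[0]

    def write32(offset, value):
        struct.pack_into("<I", mem, offset, value)

    return read32, write32


def reads_alias_to_version(read32, offsets):
    """True if every probed offset returns the same value as offset 0."""
    version = read32(VERSION_OFFSET)
    return all(read32(off) == version for off in offsets)


def scratchpad_roundtrip(read32, write32, pattern=0xA5A55A5A):
    """Write a pattern to the scratchpad and check that it reads back."""
    write32(SCRATCHPAD_OFFSET, pattern)
    return read32(SCRATCHPAD_OFFSET) == pattern


def fake_register_space():
    """Small made-up register map with distinct values at each offset."""
    mem = bytearray(16)
    struct.pack_into("<I", mem, VERSION_OFFSET, 0x01020304)  # made-up version
    struct.pack_into("<I", mem, 0x08, 0xDEADBEEF)
    struct.pack_into("<I", mem, 0x0C, 0x12345678)
    return mem
```

On real hardware the two accessors would wrap whatever read/write method is under test; probing offsets known to hold values different from the version register avoids a false "aliased" result.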
July 22, 2024
- rdsrv419 scratchpad read/write tests: Jeremy/Ryan don't understand the failure mode. Reads seem to return only the register at offset 0 (the version register). Lower priority, since this is register-space access, which we don't need.
- Jeremy wants to work on implementing the cuda-graph control mode (one for each event buffer). cuda-graphs can't launch themselves, but there are "scheduling kernels" that can do "tail launches".
- Larry update on configuring rdsrv415: he can build the kernel driver, but gets a complaint about an incompatibility between the kernel driver and the user application. Ryan will work with Larry on this.
- Report from Ric on dev06 and gpu001:
  - looked at the differences from rdsrv419 and made as many of them the same as possible, but it didn't help.
  - gpu001: noticed that the gpu001 drivers were proprietary (dkms). Tried to install the open drivers, but CUDA magically updated from 12.3 to 12.4. Status is the same: can write the scratchpad register, but DMAs still don't work.
  - Gabriel suggested the PCIe Access Control Services ("ACS") settings, but changing them hasn't helped so far.
  - dev06: pulled all the other cards out of dev06, but scratchpad register writes and DMA still didn't work.
  - changed BIOS settings (disabling the access-control-system parameter), and that allowed the scratchpad to work!
- Mudit tdet firmware:
  - firmware is done, with a switch to move data to either CPU or GPU
  - learning about test scripts from Matt/Ric. Needs a machine to test on: data to CPU only for now (until we get GPU DMA working)
  - Matt suggests using the kcu in dev02 as an xpm to drive a kcu in dev06, and using the datadev_gpu driver there to test. cpo will work with Mudit to get this working.
- some ideas for debugging the DMA issue:
  - test gpu-direct with standard gpu-gpu transfers (cpo/Ric/Gabriel). Use dev06 for this.
  - test gpu-direct with kcu-kcu transfers (we can control both sides). TID has a "pip" tool for this (perhaps it was used for Ryan Coffee's stuff or dune in 2019?). https://github.com/slaclab/axi-pcie-core/blob/main/protocol/pip/rtl/AxiPciePipCore.vhd. Larry/Mudit could work on this? But Larry is gone until Aug 7 or so. Use rdsrv416 for this (Mudit's machine).
  - pcie bus analyzer? Ryan thinks it doesn't exist.
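Since the ACS setting changed the scratchpad behaviour on dev06, it may help to check from the OS which bridges still have ACS redirection enabled, because redirection forces peer-to-peer traffic up through the root complex. The sketch below parses `lspci -vvv` text for `ACSCtl` lines. The flag layout is an assumption based on typical pciutils output and may differ between versions, and the sample text is invented, not captured from any of these machines.

```python
import re

ACS_CTL_RE = re.compile(r"ACSCtl:\s*(.*)")


def parse_acs(lspci_text):
    """Map PCI address -> {flag name: enabled} from `lspci -vvv` style text."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented lines start a new device; first token is its address.
            current = line.split()[0]
            continue
        match = ACS_CTL_RE.search(line)
        if match and current:
            flags = {}
            for token in match.group(1).split():
                if token[-1] in "+-":
                    flags[token[:-1]] = token.endswith("+")
            result[current] = flags
    return result


def devices_with_redirect(acs):
    """Addresses whose ACS control has request or completion redirect enabled."""
    return sorted(
        addr for addr, flags in acs.items()
        if flags.get("ReqRedir") or flags.get("CmpltRedir")
    )


# Invented sample in the assumed lspci format, for illustration only.
SAMPLE = (
    "00:01.0 PCI bridge: Example Root Port\n"
    "\tCapabilities: [2a0 v1] Access Control Services\n"
    "\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+\n"
    "\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-\n"
    "00:02.0 PCI bridge: Example Root Port\n"
    "\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-\n"
)
```

On a real machine the input would be the output of `lspci -vvv` run as root (capabilities are usually hidden from unprivileged users), and `devices_with_redirect(parse_acs(text))` lists the bridges still worth a look.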