- Jeremy and CPO are on vacation
- Chris and I have set up daq-tst-dev06 in Lab 3 for testing with a GPU & KCU
  - The KCU appears as /dev/datagpu_1
  - After some debugging to get the software to recognize /dev/datagpu_1 (as opposed to _0), the interCardGui comes up and shows sensible values
  - test_dma also runs, but the AxiVersion.ScratchPad register does not go from 0x0 to 0xdeadbeef
    - Neither does test_dma see any DMAs
- Larry urges us to move axi-pcie-devel forward to v4.1.0 rather than working with v4.0.0
- Mudit has created v4.2.0 (CPU/GPU DMA switch?) but it has not been tested with hardware yet
- It is unclear whether Jeremy has gotten a baseline working system yet
- Ryan agreed that my hacks of test_dma.cu and _Root.py to target /dev/datagpu_1 should be sufficient
  - Ryan would like us all to work on one machine and get confidence in it before we branch out to different machines
- I suggested that we could pull out the current /dev/datagpu_0 from dev06 so that the KCU of interest becomes _0
  - Ryan & Larry suggested pulling the other ones out as well
- Ryan suggests setting iommu=no in the BIOS as well as on the command line
  - I later found that there is no iommu parameter in dev06's BIOS
- There's no experience with multiple KCU cards and the datagpu driver - it is untested
- Larry is aware of some issue with the usual datadev driver and multiple KCUs
- Work on rdsrv419
- A GPU is installed
- CUDA is installed
- A KCU is installed and has PCIe slot 03:00
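For reference, the scratchpad check mentioned above (AxiVersion.ScratchPad going from 0x0 to 0xdeadbeef) amounts to a write followed by a readback. The sketch below is only an illustration of that pass/fail criterion: the `write`/`read` callables and the `FakeScratchPad` class are hypothetical stand-ins, not the datagpu driver or the Python register interface used by interCardGui.

```python
from typing import Callable

SCRATCHPAD_PATTERN = 0xDEADBEEF  # value the notes expect to read back


def scratchpad_roundtrip(write: Callable[[int], None],
                         read: Callable[[], int],
                         pattern: int = SCRATCHPAD_PATTERN) -> bool:
    """Write `pattern` to the scratchpad and report whether it reads back."""
    before = read()
    write(pattern)
    after = read()
    print(f"ScratchPad: 0x{before:08x} -> 0x{after:08x}")
    return after == pattern


class FakeScratchPad:
    """Hypothetical in-memory register; `writable=False` mimics a write that never lands."""

    def __init__(self, writable: bool) -> None:
        self.value = 0x0
        self.writable = writable

    def write(self, value: int) -> None:
        if self.writable:
            self.value = value & 0xFFFFFFFF

    def read(self) -> int:
        return self.value


working = FakeScratchPad(writable=True)
broken = FakeScratchPad(writable=False)
ok_working = scratchpad_roundtrip(working.write, working.read)
ok_broken = scratchpad_roundtrip(broken.write, broken.read)
```

On real hardware the two callables would be replaced by whatever register access path is in use; the `broken` case corresponds to the behaviour seen on dev06, where the register stays at 0x0.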
July 15, 2024
- CPO and Ryan on vacation
- Let's hold off on contacting NVIDIA about GPUDirect issues until Chris gets back
- Ric to look into the BIOS settings again to see if we can glean something
- Jeremy is finding that when he executes KCU register reads with various methods, he always gets back the Version value from register offset 0
  - So far, he's baffled
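One way to make the "every read returns the value at offset 0" symptom explicit is to read a handful of offsets and flag those that match the value at offset 0. This is a rough sketch under assumptions: the `read` callable is a hypothetical stand-in for whichever access method is being tried, and the offsets and values in the fake register map are illustrative only, not the real AxiVersion layout.

```python
from typing import Callable, Iterable, List


def offsets_aliasing_zero(read: Callable[[int], int],
                          offsets: Iterable[int]) -> List[int]:
    """Return the non-zero offsets whose value equals the value read at offset 0."""
    base = read(0)
    return [off for off in offsets if off != 0 and read(off) == base]


# Illustrative register contents (made-up values, hypothetical offsets).
FAKE_REGS = {0x0: 0x04000000, 0x4: 0xDEADBEEF, 0x8: 0x12345678}
OFFSETS = sorted(FAKE_REGS)


def healthy_read(offset: int) -> int:
    """Each offset returns its own register."""
    return FAKE_REGS[offset]


def stuck_read(offset: int) -> int:
    """Mimics the failure mode: the offset is ignored and offset 0 is returned."""
    return FAKE_REGS[0x0]


print([hex(o) for o in offsets_aliasing_zero(healthy_read, OFFSETS)])
print([hex(o) for o in offsets_aliasing_zero(stuck_read, OFFSETS)])
```

A register can legitimately hold the same value as offset 0, so a single match proves little; the symptom is that every offset tried matches.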
July 22, 2024
- rdsrv419 scratchpad read/write tests: Jeremy/Ryan don't understand the failure mode: Seems to only read the register at offset 0 (the version register). Lower priority since it is for register-space stuff which we don't need.
- Jeremy wants to work on implementing the cuda-graph control mode (one for each event buffer). cuda-graphs can't launch themselves, but there are "scheduling kernels" that can do "tail launches".
- Larry update on configuring rdsrv415: can build kernel driver, but complains about an incompatibility between kernel driver and user application. Ryan will work with Larry on this.
- Report from Ric on dev06 and gpu001:
- looked at differences with rdsrv419. made as many of them the same as possible, but didn't help.
- gpu001: noticed the gpu001 drivers were proprietary (dkms). Tried to install open drivers but cuda magically updated from 12.3 to 12.4. Status is the same: can write scratchpad register but DMAs still don't work.
- Gabriel suggested access control ("ACS") settings, but so far hasn't helped.
- dev06: pulled out all the other cards from dev06, but scratchpad register write and dma don't work.
- changed bios settings (disabling access-control-system parameter) and that allowed scratchpad to work!
- mudit tdet firmware:
- firmware is done with a switch to move data between cpu/gpu
- learning about scripts from Matt/Ric to test. needs a machine to test: just to CPU for now (until we get GPU dma working)
- matt suggests using kcu in dev02 as an xpm and drive a kcu in dev06 and use the datadev_gpu driver there to test. cpo will work with mudit to get this working.
- some ideas for debugging DMA issue
- test gpu-direct with standard gpu-gpu (cpo/ric/Gabriel). use dev06 for this.
- test gpu-direct with kcu-kcu (can control both sides). TID has a "pip" tool for this (perhaps was used for Ryan Coffee's stuff or dune in 2019?). https://github.com/slaclab/axi-pcie-core/blob/main/protocol/pip/rtl/AxiPciePipCore.vhd. Larry/Mudit could work on this? But Larry gone until Aug 7 or so. use rdsrv416 for this (mudit's machine).
- pcie bus analyzer? Ryan thinks it doesn't exist.
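Since the ACS setting made a difference for the scratchpad on dev06, it may help to compare ACS state across machines from the OS side as well as in the BIOS. The sketch below parses `ACSCtl:` lines in the format printed by `lspci -vvv` and lists devices with request or completion redirect enabled. The sample text is made up for illustration (not captured from dev06), and the helper names are hypothetical. On a real machine the input would come from running `lspci -vvv` with root privileges.

```python
import re
from typing import Dict, List

# Made-up excerpt in the style of `lspci -vvv` output; not from a real machine.
SAMPLE_LSPCI = """
00:01.0 PCI bridge: Example root port
    Capabilities: [148 v1] Access Control Services
        ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
03:00.0 Memory controller: Example KCU card
    Capabilities: [40] Power Management version 3
"""


def parse_acsctl(lspci_text: str) -> Dict[str, Dict[str, bool]]:
    """Map each device address to its ACSCtl flags (True for '+', False for '-')."""
    acs: Dict[str, Dict[str, bool]] = {}
    device = None
    for line in lspci_text.splitlines():
        # Device header lines start in column 0; detail lines are indented.
        if line and not line[0].isspace():
            device = line.split()[0]
        match = re.search(r"ACSCtl:\s*(.*)", line)
        if match and device is not None:
            acs[device] = {tok[:-1]: tok.endswith("+")
                           for tok in match.group(1).split()
                           if tok[-1] in "+-"}
    return acs


def redirecting_devices(acs: Dict[str, Dict[str, bool]]) -> List[str]:
    """Devices with request or completion redirect enabled."""
    return [dev for dev, flags in acs.items()
            if flags.get("ReqRedir") or flags.get("CmpltRedir")]


acs = parse_acsctl(SAMPLE_LSPCI)
redirecting = redirecting_devices(acs)
print(redirecting)
```

Devices without an ACS capability (the example KCU entry here) simply do not appear in the result.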