  • Install CUDA 12.3 on all machines
    • Unsuccessful on gpu001, which went to 12.4
  • Install the NVIDIA open drivers v545 on all machines
    • Unsuccessful on gpu001, which went to v550
  • Disable the IOMMU by adding iommu=off, amd_iommu=off and/or intel_iommu=off to GRUB_CMDLINE_LINUX
  • Ensure Rogue is at the latest version (v6.1.3)
  • Place the KCU and GPU in dev06 with just one PCIe bridge between them (adjacent slots)
  • Remove all but one KCU from dev06
  • Turn off PCIe Access Control Services (ACS) using setpci or the BIOS setting
    • Disabling the ACS Control parameter in the dev06 BIOS appeared to allow the GPU in dev06 to access the KCU's registers. However, doing the equivalent with setpci seemed to have no effect on gpu001 (which doesn't appear to have such a BIOS parameter)
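
Whether the IOMMU options from the list above actually reached the running kernel can be checked against the kernel command line (on Linux, the contents of /proc/cmdline). The following is a minimal sketch, not part of the original procedure; the helper name iommu_off_flags and the sample command line are hypothetical, and only the three options named above are checked.

```python
# Options from the troubleshooting list that disable the IOMMU.
IOMMU_OFF_FLAGS = ("iommu=off", "amd_iommu=off", "intel_iommu=off")


def iommu_off_flags(cmdline):
    """Return the IOMMU-disabling options present in a kernel command line.

    Matching is done on whole whitespace-separated tokens, so
    "amd_iommu=off" is not mistaken for "iommu=off".
    """
    tokens = cmdline.split()
    return [flag for flag in IOMMU_OFF_FLAGS if flag in tokens]


# Hypothetical sample; on a real host, pass open("/proc/cmdline").read() instead.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro iommu=off intel_iommu=off"
print(iommu_off_flags(sample))
```

An empty result after a reboot would suggest that the GRUB configuration was edited but not regenerated, or that a different boot entry was used.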

Resolution

Both gpu001 and dev06 appeared able to access the KCU's registers and start the DMA. This was inferred from the latency counters in the interCardGui, which began counting when the test_dma code triggered the DMA. By modifying the code, both the write and read DMAs could be started, but neither ever completed, judging by the counters never stopping. The read DMAs caused ReadAxiErrorCnt to increment, and a value of 3 was seen in ReadAxiErrorVal; the same was sometimes seen for the write DMAs. Changing the DMA sizes (even down to 0) had no effect on the behavior.
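
One possible reading of the error value, offered as an assumption rather than something confirmed by the firmware: if ReadAxiErrorVal holds the raw 2-bit AXI response code, then 3 is DECERR (decode error), the response an AXI interconnect returns when no slave exists at the requested address. The response encodings below are the standard AXI ones; the helper name decode_axi_resp is hypothetical.

```python
# Standard AXI RRESP/BRESP encodings (2 bits).
AXI_RESP = {
    0b00: "OKAY",    # normal access success
    0b01: "EXOKAY",  # exclusive access success
    0b10: "SLVERR",  # slave signalled an error
    0b11: "DECERR",  # decode error: no slave at that address
}


def decode_axi_resp(value):
    """Name the AXI response held in the low two bits of value.

    Assumes the register stores the raw response code; that mapping
    is not confirmed for ReadAxiErrorVal.
    """
    return AXI_RESP[value & 0b11]


print(decode_axi_resp(3))
```

Under that assumption, a decode error would point toward an addressing problem rather than a size or timing problem.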

The AxiPcieCore.AxiVersion block shown by interCardGui has a parameter called AXI_PCIE_CONFIG_C_ADDR_WIDTH_C, whose value was originally 40. The addresses displayed in AxiGpuAsyncCore and in the datagpu driver output in dmesg showed that PCI addresses are 40 bits wide on the TID machines, but 42 bits wide on gpu001 and 47 bits wide on dev06, both wider than the configured 40 bits. Larry updated all affected firmware to allow up to 64-bit PCIe addresses, which allowed the test to work correctly on gpu001 and dev06. The AXI_PCIE_CONFIG_C_ADDR_WIDTH_C parameter now has a value of 64 in interCardGui.
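
The width mismatch can be illustrated with a short sketch. The addresses below are hypothetical (each simply has the top bit of the stated width set), and the helper names are illustrative; how the firmware actually handled the out-of-range bits is not stated above, so the truncation shown is only one possible failure mode.

```python
def fits_in_width(addr, width_bits):
    """True if addr can be represented in width_bits bits."""
    return 0 <= addr < (1 << width_bits)


def truncate(addr, width_bits):
    """Keep only the low width_bits bits of addr (assumed failure mode)."""
    return addr & ((1 << width_bits) - 1)


# Hypothetical example addresses: highest bit of each observed width set.
hosts = (("TID", 40), ("gpu001", 42), ("dev06", 47))

for host, bits in hosts:
    addr = 1 << (bits - 1)
    print(
        host,
        hex(addr),
        "fits in 40:", fits_in_width(addr, 40),
        "fits in 64:", fits_in_width(addr, 64),
        "truncated to 40:", hex(truncate(addr, 40)),
    )
```

Only the 40-bit example fits in the original width, while all three fit in 64 bits, which matches the test working on the TID machines before the firmware change and on gpu001 and dev06 only after it.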