The axi-pcie-devel package includes a test application for verifying KCU/GPU interaction.  It consists of the interCardGui application and the test_dma program, which are used with the InterCardTest firmware resident on the KCU.  interCardGui brings up a devGui-like GUI that gives access to the registers exposed by the firmware.

Running the test

After building and installing the datagpu driver using comp_and_load_drivers.sh (in the aes-stream-drivers package), go to the gpu directory in an axi-pcie-devel check-out to launch interCardGui, e.g.:

cd ~/git/axi-pcie-devel/software/gpu/
../scripts/interCardGui

Launching it from another directory was problematic at one point, but this may have been fixed.  The GUI connects to the /dev/datagpu_0 device by default, but another may be selected with the --dev option.  The following GUI should come up:

Depending on the state of the system, it may be desirable to reset the firmware state by opening the AxiPcieCore.AxiVersion blocks and clicking on UserRst to clear any previous state:

From another prompt, start the test_dma program:

cd ~/git/axi-pcie-devel/software/gpu/
sudo ./bin/test_dma

This program also interacts with /dev/datagpu_0 by default.  Another device may be selected using the -d option.  The program prints some status output and then pauses:

(rogue_v6.1.3) claus@drp-srcf-gpu001:gpu$ sudo ./bin/test_dma
[sudo] password for claus:
Total devices 1
Selected device: NVIDIA RTX A5000
Global memory: 24026 MB
64-bit Memory Address support
Setting write pointer: 0x7f9885600000 - 65536
Setting read pointer: 0x7f9885610000 - 65536
Done with pointers
Mapping FPGA registers
swFpgaRegs = 0x7f98a36bd000
Enabling IO memory for FPGA registers
Mapping write start register
Mapping read start register
Mapped FPGA registers
Create stream
write memory
Trigger write
Wait memory value

At this point it is useful to click Read All at the bottom of the interCardGui to verify that various registers look reasonable.  The dmesg program also shows some output from the datagpu driver that might be of interest.

To trigger a DMA sequence, click on the OneShot Exec button in the PrbsTx block of the interCardGui.  This should result in additional printout from test_dma:

Done waiting
Stream Sync
Context sync
data: 0 0x 2000000 - 0x       1
data: 1 0x    2000 - 0x       0
data: 2 0x       0 - 0x       0
data: 3 0x       0 - 0x       0
data: 4 0x       0 - 0x       0
data: 5 0x       0 - 0x       0
data: 6 0x       0 - 0x       0
data: 7 0x       0 - 0x       0
data: 8 0x       1 - 0x      ff
data: 9 0x       0 - 0x       0
...
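
If the test_dma printout needs to be checked by a script rather than by eye, the "data:" lines can be parsed with a few lines of Python.  This is only a sketch: the function name parse_data_line is made up for illustration, and since the meaning of the two hex columns is not documented here, they are simply returned as the first and second value.

```python
import re

# Matches lines such as "data: 8 0x       1 - 0x      ff".
# test_dma pads the hex digits with spaces after the "0x" prefix.
DATA_RE = re.compile(
    r"data:\s*(\d+)\s+0x\s*([0-9a-fA-F]+)\s+-\s+0x\s*([0-9a-fA-F]+)"
)

def parse_data_line(line):
    """Return (index, first, second) as ints, or None if the line is not a data line."""
    m = DATA_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16)

if __name__ == "__main__":
    sample = [
        "Context sync",
        "data: 0 0x 2000000 - 0x       1",
        "data: 8 0x       1 - 0x      ff",
    ]
    for text in sample:
        print(parse_data_line(text))
```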

Click on Read All in interCardGui again to update the register values.  The AxiGpuAsyncCore block shows some statistics that may be of interest.  Here it indicates that both the write and read DMAs completed successfully:

Problems encountered

When we started working with InterCardTest, we found that the DMAs worked on some machines but failed on others.  The systems that worked are the ones used by the TID development group (rdsrv419, rdsrv403, etc.), which run Ubuntu 22.04.  On the LCLS nodes drp-srcf-gpu001 and daq-tst-dev06, which run RHEL7 with kernel 3.10.0-1160, the test failed.

Further, when Ryan guided us to modify the test_dma.cu source code to write to the AxiPcieCore.AxiVersion.ScratchPad register, we found it would succeed on gpu001 and fail on dev06.

Larry, Jeremy and Ric tried a number of things to resolve the issue.  This list may be incomplete:

  • Focus on one version of CUDA - install 12.3 on all machines
    • Unsuccessful on gpu001, which automagically went to 12.4
  • Focus on one version of NVIDIA open drivers - install v545 on all machines
    • Unsuccessful on gpu001, which went to v550
  • Check whether the proprietary NVIDIA drivers lead to better results
  • Disable the IOMMU using iommu=off, amd_iommu=off and/or intel_iommu=off on GRUB_CMDLINE_LINUX
  • Ensure Rogue is at the latest version (v6.1.3)
  • Place the KCU and GPU in dev06 with just one PCIe bridge between them (adjacent slots)
  • Remove all but one KCU from dev06
  • Turn off PCIe Access Control Services (ACS) using setpci or a BIOS setting
    • Disabling the ACS Control parameter in the dev06 BIOS appeared to allow the GPU to access the KCU's registers.  However, doing the equivalent with setpci seemed to have no effect in gpu001 (which doesn't appear to have a similar BIOS parameter)

None of these seemed to have a conclusive effect.  Possibly some or all are necessary but insufficient.
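
When trying the IOMMU item from the list above, it helps to confirm that the options added to GRUB_CMDLINE_LINUX actually reached the running kernel.  The following Python sketch reads /proc/cmdline and reports the three options mentioned in the list.  The function name iommu_settings is hypothetical, and the script only reports what was passed on the command line; it does not prove that the IOMMU is in fact disabled.

```python
from pathlib import Path

# Kernel command-line keys named in the list above.
IOMMU_KEYS = ("iommu", "amd_iommu", "intel_iommu")

def iommu_settings(cmdline):
    """Return the IOMMU-related key=value options found in a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in IOMMU_KEYS:
            found[key] = value
    return found

if __name__ == "__main__":
    path = Path("/proc/cmdline")  # present on Linux only
    if path.exists():
        print(iommu_settings(path.read_text()))
    else:
        print("/proc/cmdline not available on this system")
```

An empty result means that none of the three options is set on the running kernel, for example because the GRUB configuration was not regenerated or the host was not rebooted.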

Resolution

It appeared that both gpu001 and dev06 were able to access the KCU's registers and that the DMA was being started.  This was surmised from the latency counters in interCardGui starting to count when the DMA was triggered by the test_dma code.  By modifying the code, both the write and read DMAs could be started.  Since the counters never stopped, neither DMA ever completed.  The read DMAs caused ReadAxiErrorCnt to count and ReadAxiErrorVal to become 3.  Sometimes this was also seen for the write DMAs.  Changing the DMA sizes (even down to 0) had no effect on the behavior, which suggested that something was wrong with the DMA address.  Note that interCardGui and dmesg show the physical PCIe addresses, whereas test_dma shows the virtual addresses.

There is a parameter in the AxiPcieCore.AxiVersion block called AXI_PCIE_CONFIG_C_ADDR_WIDTH_C.  Originally it had a value of 40.  Looking at the addresses displayed in AxiGpuAsyncCore and in the datagpu driver output, it was noted that the TID machines report PCIe addresses that are 40 bits wide, while gpu001 reports 42-bit addresses and dev06 reports 46-bit addresses.  Larry indicated that this implied PCIe addresses were being truncated, and set about updating all affected firmware to allow 64-bit PCIe addresses.  With the newer firmware, the DMA test works correctly on gpu001 and dev06, as well as on the TID machines.  The AXI_PCIE_CONFIG_C_ADDR_WIDTH_C parameter now shows a value of 64 in interCardGui.
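
The effect of the address width can be illustrated with a short Python sketch.  The addresses below are hypothetical values chosen only to have the bit widths mentioned above (40, 42 and 46 bits); they were not read from any of the hosts, and the helper names are made up for illustration.  The sketch assumes that a too-narrow address path simply drops the upper bits.

```python
def addr_bits(addr):
    """Number of bits needed to represent addr."""
    return addr.bit_length()

def truncate(addr, width):
    """Keep only the low `width` bits of addr (assumed truncation behavior)."""
    return addr & ((1 << width) - 1)

def survives(addr, width):
    """True if addr fits in `width` bits, i.e. truncation leaves it unchanged."""
    return truncate(addr, width) == addr

# Hypothetical PCIe addresses, NOT real values from the machines.
EXAMPLES = {
    "40-bit (like the TID hosts)": 0xFF_FFE0_0000,
    "42-bit (like gpu001)": 0x3FF_FFE0_0000,
    "46-bit (like dev06)": 0x3FFF_FFE0_0000,
}

if __name__ == "__main__":
    for label, addr in EXAMPLES.items():
        for width in (40, 64):
            print(
                f"{label}: {addr_bits(addr)} bits, width {width}: "
                f"{hex(truncate(addr, width))} intact={survives(addr, width)}"
            )
```

With a width of 40, only the 40-bit example comes through unchanged; the 42-bit and 46-bit examples lose their upper bits and would point the DMA at a different address.  With a width of 64, all three pass through intact, which is consistent with the test succeeding on all machines after the firmware update.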
