The axi-pcie-devel package is supplied with a test application to verify KCU/GPU interaction. It comprises the `interCardGui` application and the `test_dma` program. These are used with the `InterCardTest` firmware, which should be loaded resident on the KCU. `interCardGui` brings up a `devGui`-like GUI that provides access to the registers provided by the `InterCardTest` firmware on the KCU.
Running the test
After building and installing the `datagpu` driver using `comp_and_load_drivers.sh` (in the aes-stream-drivers package), go to the `gpu` directory in an axi-pcie-devel check-out to launch `interCardGui`, e.g.:
```shell
cd ~/git/axi-pcie-devel/software/gpu/
../scripts/interCardGui
```
Launching it from another directory was problematic at one point, but this may have been fixed. The GUI connects to the `/dev/datagpu_0` device by default, but another may be selected using the `--dev` option. The following GUI should come up:
Depending on the state of the system, it may be desirable to reset the firmware state by opening the `AxiPcieCore.AxiVersion` blocks and clicking on `UserRst` to clear out any previous state:
From another prompt, start the `test_dma` program:
```shell
cd ~/git/axi-pcie-devel/software/gpu/
sudo ./bin/test_dma
```
...
At this point it is useful to click Read All at the bottom of the `interCardGui` to verify that various registers look reasonable. The `dmesg` program also shows some output from the `datagpu` driver that might be of interest.
...
When we started working with the `InterCardTest`, we found that the DMAs worked fine on some machines but would not occur on others. The systems that worked are ones used by the TID development group (`rdsrv419`, `rdsrv403`, etc.), which run Ubuntu 22.04. On the LCLS nodes `drp-srcf-gpu001` and `daq-tst-dev06`, the test would fail. These hosts run RHEL7 with kernel 3.10.0-1160.
Further, when Ryan guided us to modify the `test_dma.cu` source code to write to the `AxiPcieCore.AxiVersion.ScratchPad` register, we found it would succeed on `gpu001` and fail on `dev06`.
Larry, Jeremy and Ric tried a number of things to resolve the issue. This list may be incomplete:
- Focus on one version of CUDA: install 12.3 on all machines
  - Unsuccessful on gpu001, which automagically went to 12.4
- Focus on one version of the NVIDIA open drivers: install v545 on all machines
  - Unsuccessful on gpu001, which went to v550
- Check whether the proprietary NVIDIA drivers lead to better results
- Disable the IOMMU using `iommu=off`, `amd_iommu=off` and/or `intel_iommu=off` on `GRUB_CMDLINE_LINUX`
- Ensure Rogue is at the latest version (v6.1.3)
- Place the KCU and GPU in `dev06` with just one PCIe bridge between them (adjacent slots)
- Remove all but one KCU from `dev06`
- Turn off Access Control Services (ACS) using `setpci` or a BIOS setting
  - Disabling the ACS Control parameter in the `dev06` BIOS appeared to allow the GPU in `dev06` to access the KCU's registers. However, doing the equivalent with `setpci` seemed to have no effect on `gpu001` (which doesn't appear to have a similar BIOS parameter)
None of these seemed to have a conclusive effect. Possibly some or all are necessary but insufficient.
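For reference, the IOMMU and ACS changes above take roughly the following form. This is a sketch, not the exact commands used: the bridge address `00:01.0` is an illustrative placeholder, and which IOMMU option applies depends on the CPU vendor.

```shell
# /etc/default/grub -- disable the IOMMU; regenerate the grub config
# (e.g. grub2-mkconfig -o /boot/grub2/grub.cfg) and reboot afterwards
GRUB_CMDLINE_LINUX="... iommu=off amd_iommu=off intel_iommu=off"

# Disable ACS on a PCIe bridge with setpci (00:01.0 is a placeholder;
# find the bridges between the KCU and GPU with lspci -t).
# The ACS control register is at offset 6 of the ACS extended capability.
sudo setpci -s 00:01.0 ECAP_ACS+0x6.w=0x0000
```

Note that `setpci` changes do not survive a reboot, whereas the BIOS setting does, which may partly explain the differing results on `dev06` and `gpu001`.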
Resolution
It appeared that `gpu001` and `dev06` were able to access the KCU's registers and that the DMA was being started. This was surmised from the latency counters in the `interCardGui` starting to count when the DMA was triggered by the `test_dma` code. By modifying the code, both the write and read DMAs could be started. Judging by the counters never stopping, neither ever completed. The read DMAs caused `ReadAxiErrorCnt` to count, and a value of 3 was seen in `ReadAxiErrorVal`. Sometimes this was also seen for the write DMAs. Changing the DMA sizes (even down to 0) had no effect on the behavior, so perhaps there's something wrong with the DMA address. `interCardGui` and `dmesg` show the physical PCIe addresses, whereas `test_dma` shows the virtual addresses.
There is a parameter in the `AxiPcieCore.AxiVersion` block shown by `interCardGui` called `AXI_PCIE_CONFIG_C_ADDR_WIDTH_C`. Originally it had a value of 40. Looking at the addresses displayed in `AxiGpuAsyncCore` and in the `datagpu` driver output in `dmesg`, it was noted that the TID machines report PCI addresses that are 40 bits wide, while on `gpu001` they are 42 bits wide and on `dev06` they are 46 bits wide. Larry indicated that this implies PCIe addresses were being truncated, and set about updating all affected firmware to allow up to 64-bit PCIe addresses. After updating to the newer firmware, the DMA test works correctly on `gpu001` and `dev06`, as well as on the TID machines. The `AXI_PCIE_CONFIG_C_ADDR_WIDTH_C` parameter now shows a value of 64 in `interCardGui`.
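The truncation effect can be sketched with shell arithmetic. The 46-bit address below is a made-up illustrative value, not one taken from a real run; the point is that with a 40-bit firmware address width the high bits are simply dropped, so the DMA engine targets the wrong physical location.

```shell
#!/usr/bin/env bash
# Illustrative 46-bit physical PCIe address (bit 45 set; value is made up)
addr=$(( (1 << 45) | 0x12345678 ))
# A 40-bit AXI_PCIE_CONFIG_C_ADDR_WIDTH_C keeps only the low 40 bits
mask=$(( (1 << 40) - 1 ))
printf 'full address:       0x%012x\n' "$addr"            # 0x200012345678
printf 'truncated (40-bit): 0x%012x\n' $(( addr & mask )) # 0x000012345678
```

Any address whose bits above 39 are non-zero is silently remapped this way, which is consistent with the DMAs starting but never completing.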