This was done by Jeremy Lorelli on May 30, 2024 on drp-srcf-gpu001.  His instructions are below (they are in addition to running "./comp_and_load_drivers.sh" in aes-stream-drivers/data_gpu/driver).  He found we don't need gcc 12; the system gcc 4.8.5 is good enough.

The system changes:
• Disabled the nvidia driver in DKMS with: sudo dkms uninstall nvidia/545.23.08
• Disabled and stopped nvidia_smi_exporter.service with: sudo systemctl disable nvidia_smi_exporter.service
• Stopped FAHClient with good ol' sudo killall FAHClient (we still need to figure out what service is launching this)
• The modprobe.d changes were unneeded so I removed them
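The system changes above could be collected into a small script. This is only a sketch: the helper names (run, apply_system_changes, who_started) are made up for illustration, the explicit "systemctl stop" is an assumption added because "disable" on its own does not stop a running unit, and who_started is one possible way to chase down what launches FAHClient (systemctl status accepts a PID and reports the owning unit).

```shell
#!/bin/sh
# Sketch only. The run/apply_system_changes/who_started names are hypothetical.

# Print each command unless APPLY=1 is set, so the list can be reviewed first.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# The system changes from the notes, in the order they were made.
apply_system_changes() {
    run sudo dkms uninstall nvidia/545.23.08
    run sudo systemctl disable nvidia_smi_exporter.service
    # Assumption: explicit stop, since "disable" alone leaves a running unit running.
    run sudo systemctl stop nvidia_smi_exporter.service
    run sudo killall FAHClient
}

# Show the systemd unit (if any) that owns a running process, looked up by name.
who_started() {
    pid=$(pgrep -o "$1") || { echo "no process named $1"; return 1; }
    systemctl status "$pid"
}
```

Calling apply_system_changes with no environment set only prints the commands; APPLY=1 runs them. who_started FAHClient would need to be run before the killall step, while the process is still alive.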
PR is open with the changes to aes-stream-drivers: https://github.com/slaclab/aes-stream-drivers/pull/140
#140 More sanity checks for comp_and_load_drivers.sh
Description
Added more sanity checks to comp_and_load_drivers.sh
Details
• Ensure the compiler version we're building the NVIDIA drivers and datagpu with matches what the kernel was built with.
• Don't enforce gcc-12, only enforce what the kernel was built with
• Only load the ecc module if it exists. On older kernels (e.g. RHEL7's 3.10.X), ecc doesn't exist and isn't needed by the nvidia driver.
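For illustration, the two kinds of check described above might look roughly like this. This is a sketch and not the code from the PR: the function names are hypothetical, and the version parsing assumes the older "gcc version X.Y.Z" wording that RHEL7 kernels put in /proc/version (newer kernels word it differently and would need another pattern).

```shell
#!/bin/sh
# Sketch only: not the code from the PR. Function names are hypothetical.

# Extract the gcc version from a /proc/version-style string.
# Assumes the older "gcc version X.Y.Z" wording (as on RHEL7 kernels).
kernel_gcc_version() {
    printf '%s\n' "$1" | sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

# Succeed only when both version strings are non-empty and identical.
compiler_matches_kernel() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Load a module only if modinfo can find it for the running kernel.
load_if_present() {
    if modinfo "$1" >/dev/null 2>&1; then
        sudo modprobe "$1"
    else
        echo "module $1 not found, skipping"
    fi
}
```

A caller could then do something like: compiler_matches_kernel "$(kernel_gcc_version "$(cat /proc/version)")" "$(gcc -dumpversion)" || exit 1, followed by load_if_present ecc. Note that on newer gcc releases -dumpversion may print only the major version, in which case -dumpfullversion would be the one to compare against.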

Use the branch cuda-12-rhel7-fixes on axi-pcie-devel and it will compile:
cd axi-pcie-devel
cd software/gpu
make
To run test_dma: (assuming same working directory as above)
sudo ./bin/test_dma
And the debugging screens can be launched with: (same working dir, again)
../setup_env_slac.sh
../scripts/interCardGui &
Of course, this assumes the cards have already been programmed with the appropriate firmware.
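The build and test_dma steps above can be strung together so that a failure at any step stops the rest. A rough sketch, assuming an existing clone of axi-pcie-devel (with any submodules already set up) and using a made-up function name, build_and_test:

```shell
#!/bin/sh
# Sketch only. build_and_test is a hypothetical helper, not part of the repo.

# Runs in a subshell (note the parentheses) so the caller's working
# directory is untouched. First argument: path to the axi-pcie-devel clone.
build_and_test() (
    cd "${1:-axi-pcie-devel}" || exit 1
    git checkout cuda-12-rhel7-fixes || exit 1
    cd software/gpu || exit 1
    make || exit 1
    sudo ./bin/test_dma
)
```

Usage would be build_and_test /path/to/axi-pcie-devel; as with the manual steps, this only makes sense once the cards have been programmed with the appropriate firmware.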
PR with the compile fixes here: https://github.com/slaclab/axi-pcie-devel/pull/10

For testing, get "intercard" firmware from here: https://github.com/slaclab/axi-pcie-devel/releases/tag/v4.0.0
