Task Overview

Goals:

Approach:

Algorithms:

Infrastructure:

Driver

Meeting with Larry, Ryan, Matt, Ric, cpo on Feb. 9, 2024

To address concerns about LCLS-II HE large area detectors

run a different driver (same github repo as datadev): requires "Quadro" capability.  the RTX A5000 (in drp-srcf-cmpNNN) seems to have this capability.
an additional dma engine is needed, slightly different from the existing one
"metadata" word write triggers gpu
  - two physical addresses: destination for data and mailbox address
    which gpu would wait on
the TID driver uses "GPUDirect" to move data directly into gpu memory over pcie
does a cpu need to be involved in the path?  unclear.  maybe a little.
multiple gpus? might need firmware development; currently 1-to-1.  Matt points out that if a gpu can do 30GB/s of SZ compression, that roughly matches what one KCU1500/C1100 could do, so we may have to stay 1-to-1 anyway.
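
Below is a minimal CUDA sketch (not the TID driver itself) of the mailbox handshake described above: a kernel busy-waits on the "metadata"/mailbox word and only then processes the freshly DMA'd buffer. The kernel name, buffer sizes, and the host-side cudaMemcpyAsync calls that stand in for the FPGA's GPUDirect writes are all assumptions for illustration; in the real system the DMA engine would write to the two physical addresses handed to it by the kernel driver.

```cuda
// Sketch of the mailbox handshake, with made-up names (waitAndSum, gpu_buf,
// gpu_mailbox).  In the real system the FPGA's DMA engine would write the data
// and then the metadata word directly into GPU memory via GPUDirect; here the
// host's cudaMemcpyAsync calls stand in for those two writes.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e)); return 1; } } while (0)

__global__ void waitAndSum(volatile uint32_t *mailbox, const uint32_t *buf,
                           uint64_t *sum, int nwords)
{
    // Single-block kernel: thread 0 polls the mailbox word that arrives last,
    // then every thread processes the freshly landed data.
    if (threadIdx.x == 0) {
        while (*mailbox == 0) { }   // spin on the "metadata" word
        __threadfence();            // order data reads after the flag read
    }
    __syncthreads();

    uint64_t local = 0;
    for (int i = threadIdx.x; i < nwords; i += blockDim.x)
        local += buf[i];
    atomicAdd((unsigned long long *)sum, (unsigned long long)local);
}

int main()
{
    const int nwords = 1 << 20;
    uint32_t *gpu_buf, *gpu_mailbox, *host_buf, *host_flag;
    uint64_t *gpu_sum, sum = 0;

    CHECK(cudaMalloc(&gpu_buf, nwords * sizeof(uint32_t)));
    CHECK(cudaMalloc(&gpu_mailbox, sizeof(uint32_t)));
    CHECK(cudaMalloc(&gpu_sum, sizeof(uint64_t)));
    CHECK(cudaMemset(gpu_mailbox, 0, sizeof(uint32_t)));
    CHECK(cudaMemset(gpu_sum, 0, sizeof(uint64_t)));

    // In the real setup the kernel driver would pin gpu_buf/gpu_mailbox and
    // hand their bus addresses to the FPGA's DMA engine.
    cudaStream_t poll, dma;
    CHECK(cudaStreamCreate(&poll));
    CHECK(cudaStreamCreate(&dma));
    waitAndSum<<<1, 256, 0, poll>>>(gpu_mailbox, gpu_buf, gpu_sum, nwords);

    // Simulated DMA: data first, then the mailbox/metadata word (the same
    // ordering the firmware would have to guarantee).
    CHECK(cudaMallocHost(&host_buf, nwords * sizeof(uint32_t)));
    CHECK(cudaMallocHost(&host_flag, sizeof(uint32_t)));
    for (int i = 0; i < nwords; i++) host_buf[i] = 1;
    *host_flag = 1;
    CHECK(cudaMemcpyAsync(gpu_buf, host_buf, nwords * sizeof(uint32_t),
                          cudaMemcpyHostToDevice, dma));
    CHECK(cudaMemcpyAsync(gpu_mailbox, host_flag, sizeof(uint32_t),
                          cudaMemcpyHostToDevice, dma));

    CHECK(cudaStreamSynchronize(poll));
    CHECK(cudaMemcpy(&sum, gpu_sum, sizeof(sum), cudaMemcpyDeviceToHost));
    printf("sum of DMA'd words = %llu\n", (unsigned long long)sum);
    return 0;
}
```

The sketch dedicates a whole block to busy-waiting purely to illustrate the handshake; a production version would presumably overlap polling with useful work or rotate through multiple buffers.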

have to disable something in the Linux kernel related to i2o or p2p
in order to allow the direct pci communication

include the path to the nvidia drivers in the makefile

https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/MUG16_GPU_tutorial_V5.pdf
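
As a companion to the GPUDirect RDMA document above, here is a small userspace sketch, assuming the CUDA driver API, of the buffer preparation that document recommends before a buffer's address is handed to a kernel-mode driver for pinning (the ioctl hand-off is only described in a comment; the pinning itself would happen inside the driver, per the linked doc, via nvidia_p2p_get_pages).

```cuda
// Userspace side only: allocate the DMA target and mark it SYNC_MEMOPS, as the
// GPUDirect RDMA document recommends for buffers that a third-party device
// (here, the KCU1500/C1100) will write.
// Build with: nvcc -o gdr_prep gdr_prep.cu -lcuda
#include <cstdio>
#include <cuda.h>

int main()
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dptr;
    const size_t nbytes = 8 << 20;   // arbitrary 8 MiB DMA target

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&dptr, nbytes);

    // Make memory operations on this allocation synchronous with respect to
    // CUDA work, so DMA writes from the FPGA are not reordered around kernels.
    unsigned int flag = 1;
    cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dptr);

    // The virtual address dptr would now be passed (e.g. via an ioctl) to the
    // kernel-mode driver, which pins it with nvidia_p2p_get_pages and programs
    // the DMA engine with the resulting bus addresses.
    printf("device buffer ready at 0x%llx\n", (unsigned long long)dptr);

    cuMemFree(dptr);
    cuCtxDestroy(ctx);
    return 0;
}
```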

ryan/larry will resurrect the driver
cpo try to install kcu in our gpu nodes
cpo gets account number

research nvlink: direct to gpu, avoids pcie, still a closed protocol?

Jana writes that Ryan/Larry can charge 19330-44523 for a few months.

Nvidia Meeting

Diagram of system from conversation with Quincey Koziol and Rebanta Mitra on March 20th, 2024

NVLink

Update about putting SLAC's FPGAs on the NVLink bus, from someone at NVIDIA who's close to the NVLink work:

- It's not possible today
- Or better said, it would be very hard today. :-)
- It could be possible to connect them to the Arm C2C link, which speaks the standard Arm CHI protocol.
- NVLink is a multiplanar network. You would need to connect all of the FPGAs to all 18 planes of the network because the GPU does an address based spray across the planes.

In that direction, here's info about NVLink-C2C (which I believe is what he was referring to): https://www.nvidia.com/en-us/data-center/nvlink-c2c/ and I think this quote from that page is relevant:

"Supports Arm’s AMBA CHI (Coherent Hub Interface) or Compute Express Link (CXL) industry standard protocols for interoperability between devices.”

GPU Direct Storage

a.k.a. GDS.  Supported by Weka: https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#gds-config-file-changes
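
For reference, a minimal sketch of the GDS ("cuFile") read path, assuming libcufile is installed and the file sits on a GDS-enabled mount (the path and sizes below are placeholders; GDS requires the file to be opened with O_DIRECT):

```cuda
// Read a chunk of a file straight into GPU memory with the cuFile API.
// Build with: nvcc -o gds_read gds_read.cu -lcufile
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main()
{
    const char *path = "/path/to/gds/mounted/file";  // placeholder path
    const size_t nbytes = 64 << 20;                  // 64 MiB

    if (cuFileDriverOpen().err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFileDriverOpen failed\n");
        return 1;
    }

    // GDS needs O_DIRECT so the read bypasses the page cache.
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    if (cuFileHandleRegister(&fh, &descr).err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFileHandleRegister failed\n");
        return 1;
    }

    void *devPtr = nullptr;
    cudaMalloc(&devPtr, nbytes);
    cuFileBufRegister(devPtr, nbytes, 0);   // optional: avoids bounce buffers

    // Storage -> GPU memory, without staging through host DRAM.
    ssize_t n = cuFileRead(fh, devPtr, nbytes, /*file_offset=*/0, /*devPtr_offset=*/0);
    printf("cuFileRead returned %zd bytes\n", n);

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(fh);
    close(fd);
    cudaFree(devPtr);
    cuFileDriverClose();
    return 0;
}
```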

datadev Driver Meeting With TID

On May 1, 2024

Slides from Ryan and Jeremy: https://docs.google.com/presentation/d/1yJ-WIs73lon9LzBYxIhKKyNvoYWAq8OUNDiN_TOD2Pw/edit?usp=sharing

Useful docs and links