Task Overview

Goals:

  • 35 kHz, 16 Mpx MFX area detectors producing ~1 TB/s
  • 50 GB/s per node (?) into the GPU (SZ/LC compression and calibration, or calibration+ML); avoid the CPU where possible; 5 GB/s output
    • "pipeline" as much as possible; hopefully enough buffering to absorb the software-trigger latency (Ric's TEB)

Approach:

  • use NVIDIA/CUDA for now
    • in the future consider Kokkos/HIP, etc.
  • Python will be less useful
    • still useful for managing the slow transitions (right now C++ handles the 35 kHz trigger data)
    • not as useful if we can keep the L1Accept path entirely on the GPU
  • NVLink would be nice, but the protocol likely isn't open to us
  • hopefully the GPU can launch its own kernels
  • the CPU would be used for:
    • handling the transitions (configure, beginrun, etc.)
    • monitoring and trigger-info transmit/receive

Algorithms:

  • SZ3/LC compression
  • ML algorithms (peak finding, triggering on the TEB?)
  • detector calibration (Gabriel?)
  • validation of reduction algorithms (Stefano)

Infrastructure:

  • worry about PCIe performance (PCIe 7)
  • use drp-srcf-gpu[001-004]; CUDA 12
  • datadev driver (TID)
  • GPU-based file writing with GPUDirect Storage (Gabriel?)
    • generating correct xtc headers
  • drp-gpu executable (Ric?)
    • need to solve the trigger problem
    • need to solve the monitoring problem
  • move to Spack
  • test that GPU-compressed data can be decompressed on the CPU by psana (especially SZ3)

Driver

Meeting with Larry, Ryan, Matt, Ric, and cpo on Feb. 9, 2024, to address concerns about LCLS-II HE large-area detectors.

Run a different driver (same GitHub repo as datadev); it requires "Quadro" capability, which the RTX A5000 (in drp-srcf-cmpNNN) seems to have.
The additional DMA path needs a slightly different DMA engine.
A "metadata" word write triggers the GPU:
  - two physical addresses: a destination for the data, and a mailbox address
    that the GPU waits on
The TID driver uses GPUDirect to move data directly into GPU memory over PCIe.
Does a CPU need to be involved in the path?  Unclear; maybe a little.
Multiple GPUs?  Might need firmware development; currently 1-to-1.  Matt points out that if a GPU supports ~30 GB/s of SZ compression, that roughly matches what one KCU1500/C1100 can do, so we may have to stay 1-to-1.

Have to disable something in the Linux kernel related to I2O or PCIe P2P to allow the direct PCIe communication.

Include the path to the NVIDIA drivers in the makefile.

https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/MUG16_GPU_tutorial_V5.pdf

Ryan/Larry will resurrect the driver.
cpo will try to install a KCU in our GPU nodes.
cpo will get an account number.

Research NVLink: direct to GPU, avoids PCIe, but still a closed protocol?

Jana writes that Ryan/Larry can charge 19330-44523 for a few months.

Nvidia Meeting

Diagram of system from conversation with Quincey Koziol and Rebanta Mitra on March 20th, 2024

Update about putting SLAC's FPGAs on the NVLink bus, from someone at NVIDIA who's close to the NVLink work:

- It's not possible today
- Or better said, it would be very hard today. :-)
- It could be possible to connect them to the Arm C2C link, which speaks the standard Arm CHI protocol.
- NVLink is a multiplanar network. You would need to connect all of the FPGAs to all 18 planes of the network because the GPU does an address based spray across the planes.

In that direction, here's info about NVLink-C2C (which is what I believe he was referring to): https://www.nvidia.com/en-us/data-center/nvlink-c2c/ and I think this quote from that page is relevant:

"Supports Arm's AMBA CHI (Coherent Hub Interface) or Compute Express Link (CXL) industry standard protocols for interoperability between devices."

GPU Direct Storage

a.k.a. GDS.  Supported by Weka: https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#gds-config-file-changes

datadev Driver Meeting With TID

On May 1, 2024

Slides from Ryan and Jeremy: https://docs.google.com/presentation/d/1yJ-WIs73lon9LzBYxIhKKyNvoYWAq8OUNDiN_TOD2Pw/edit?usp=sharing

Useful docs and links

Hardware Dataflow

See slides here:

A talk from Ryan Herbst:

Software DataFlow

  • t=0:
    • switch to CPU-mode dataflow
    • load a detector-specific shareable library
    • Ric currently has a GPUDetector object
      • the GPUDetector object doesn't have enough info to call detector-specific code
      • it should be changed to load a Detector object which then returns a GPUDetector
        • calibration fetching would be done in the Detector object (or a hook into it)
      • GPUDetector is a base class for all GPU-based detectors.  It has the same functionality as PGPDetector (e.g. transition handling, collector, GPU/CPU mode switch).  It also calls the CPU side of the datagpu driver (dmaReadBulkIndex, although there is a PGPReader layer in the middle).
  • phase1 (ZMQ) of transitions remains the same.  This lives in GPUDetectorApp (the analog of PGPDetectorApp).
    • Code has been added to PGPDetectorApp to support scans.  It should either be refactored into reusable code or copied (bad?) into GPUDetectorApp.  Remains to be done.
    • to be done: phase1 of configure starts up some CPU threads.  Each CPU thread instantiates and launches N GPU streams, each of which waits on a single-event buffer (a copy of Jeremy's stuff).  Each GPU stream is the equivalent of one worker in the CPU version.
    • each GPU worker has both GPU (CUDA) and CPU code in it
    • calibration work (getting constants, transferring them to the GPU) is done in phase1 of beginrun.  The m_det object returns the m_gpu object, and may have a method store_calibration_constants()?
  • a lot of the code in PGPDetector for scan and other support doesn't exist in the GPU version:
    • ideally we would refactor to avoid copy/paste
  • phase2 is done with the mode switch set to the CPU.  Exactly the same as the existing CPU code (no communication with the GPU, in general).
  • on the enable transition, flip the mode switch to GPU; L1Accepts and SlowUpdates then flow through the GPU.  SlowUpdates don't do anything, but we need to keep them for Mona.
  • on the disable transition, flip the mode switch back to CPU.
  • L1Accepts:
    • missing: jumping-L1Accept checking and common-readout-group error checking (need to understand the error word in the header from Mudit)
    • kernels (kept separate to make them easier to manage?):
      • calibration
      • data reduction
      • trigger computation
    • Gabriel's 5 us kernel-launch overhead is worrisome
    • CUDA graphs may help (although this may no longer work?)
      • can't have the driver calls in the graph
    • trying to send batches to the Collector: a pebble "starting index" and "number of events"
    • the pebble is on the CPU
      • the GPU holds the timing header + event data in its own memory
      • the CPU pebble only has small 32-byte datagrams in it
        • necessary for the TEB and possibly for the FileWriter (which would combine the GPU data with the pebble Dgram header)
 oeb1    oeb2    (one-event buffers)
   v       v
 w1str   w2str   (workers)
    \       /
     v     v
  cpu collector  (if the last dgram is a Disable, flip back to CPU)
       |
       v
      teb
