Task Overview
Goals:
- 35 kHz, 16 Mpx MFX area detectors: ~1 TB/s aggregate (see the rate check below)
- 50 GB/s per GPU in (SZ/LC compression and calibration, or calibration+ML); avoid the CPU where possible; 5 GB/s out
- "pipeline" as much as possible, with (hopefully) enough buffering to absorb the software-trigger latency (Ric's TEB)
Approach:
- use NVIDIA/CUDA for now
- consider Kokkos/HIP etc. in the future
- Python will be less useful
  - still useful for managing the slow transitions (right now C++ handles the 35 kHz trigger data)
  - not as useful if we can keep the L1Accept path entirely on the GPU
- NVLink would be nice, but the protocol likely isn't open to us
- hopefully the GPU can launch its own kernels (see the dynamic-parallelism sketch after this list)
- the CPU would be used for:
  - handling the transitions (configure, beginrun, etc.)
  - monitoring and trigger-info transmit/receive
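A minimal sketch of the GPU-launches-its-own-kernels idea using CUDA dynamic parallelism (build with nvcc -rdc=true); kernel names and the doubling operation are placeholders:

    #include <cuda_runtime.h>

    __global__ void childKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;              // placeholder per-pixel work
    }

    __global__ void parentKernel(float *data, int n)
    {
        // A single device thread launches the child grid: no CPU round trip.
        if (blockIdx.x == 0 && threadIdx.x == 0)
            childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }

    int main()
    {
        const int n = 1 << 20;
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        parentKernel<<<1, 1>>>(d, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }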
Algorithms:
- SZ3/LC compression
- ML algorithms (peak finding, triggering on the TEB?)
- detector calibration (Gabriel?) (see the sketch after this list)
- validation of reduction algorithms (Stefano)
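For the calibration item above, a minimal sketch assuming the common pedestal-subtraction/gain-correction form; the flat-array layout and the formula itself are assumptions, not the actual detector scheme:

    #include <cstdint>

    // ASSUMED correction: out = (raw - pedestal) * gain, one thread per pixel.
    __global__ void calibrate(const uint16_t *raw, const float *pedestal,
                              const float *gain, float *out, int npix)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < npix)
            out[i] = (static_cast<float>(raw[i]) - pedestal[i]) * gain[i];
    }

Launched as, e.g., calibrate<<<(npix + 255) / 256, 256>>>(...); a kernel like this is memory-bound, which is one reason fusing calibration with compression in a single GPU pipeline is attractive.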
Infrastructure:
- worry about PCIe performance (pcie7)
- use drp-srcf-gpu[001-004] (CUDA 12)
- datadev driver (TID)
- GPU-based file writing with GPUDirect (Gabriel?) (see the cuFile sketch after this list)
  - generating correct xtc headers
- drp-gpu executable (Ric?)
  - need to solve the trigger problem
  - need to solve the monitoring problem
- move to Spack
- test that GPU-compressed data can be decompressed on the CPU by psana (especially SZ3)
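A minimal sketch of GPU-direct file writing via NVIDIA's cuFile (GPUDirect Storage) API; the file name and buffer size are placeholders, error handling is trimmed, and whether GDS is the mechanism chosen here is an assumption (link with -lcufile):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>
    #include <cuda_runtime.h>
    #include <cufile.h>

    int main()
    {
        const size_t nbytes = 1 << 20;           // placeholder payload size
        void *dbuf = nullptr;
        cudaMalloc(&dbuf, nbytes);
        cudaMemset(dbuf, 0, nbytes);             // stand-in for compressed data

        int fd = open("frame.xtc2", O_CREAT | O_WRONLY | O_DIRECT, 0644);

        cuFileDriverOpen();
        CUfileDescr_t descr;
        std::memset(&descr, 0, sizeof(descr));
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);

        // Write straight from device memory to disk: no CPU bounce buffer.
        cuFileWrite(handle, dbuf, nbytes, /*file_offset=*/0, /*dev_offset=*/0);

        cuFileHandleDeregister(handle);
        cuFileDriverClose();
        close(fd);
        cudaFree(dbuf);
        return 0;
    }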
Driver
Meeting with Larry, Ryan, Matt, Ric, and cpo on Feb. 9, 2024, to address concerns about LCLS-II-HE large area detectors.
run a different driver (same GitHub repo as datadev): requires "Quadro" capability, which the RTX A5000 (in drp-srcf-cmpNNN) seems to have
the additional DMA path needs a slightly different DMA engine
a write of a "metadata" word triggers the GPU
- two physical addresses: a destination for the data, and a mailbox address on which the GPU would wait (see the sketch below)
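A rough sketch of that mailbox idea: the FPGA DMA-writes the frame into GPU memory, then writes a nonzero metadata word that a spinning kernel waits on. Names, the single-block structure, and the nonzero-flag convention are illustrative, not the TID driver's actual protocol:

    #include <cstdint>

    // 'mailbox' and 'frame' are GPU-memory addresses the FPGA writes via
    // GPUDirect RDMA; a nonzero metadata word signals frame arrival.
    __global__ void waitForFrame(volatile uint32_t *mailbox,
                                 const uint16_t *frame, float *out, int npix)
    {
        if (threadIdx.x == 0) {
            while (*mailbox == 0) { }            // spin on the mailbox word
            *mailbox = 0;                        // re-arm for the next frame
        }
        __syncthreads();
        __threadfence_system();                  // flag read before payload reads
        for (int i = threadIdx.x; i < npix; i += blockDim.x)
            out[i] = static_cast<float>(frame[i]);   // placeholder processing
    }

Launched as waitForFrame<<<1, 256>>>(...) before the FPGA is armed; something still has to relaunch or persist this kernel, which is part of the "does a CPU need to be involved" question below.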
the TID driver uses "GPUDirect" RDMA to move data directly to GPU memory over PCIe
does a CPU need to be involved in the path? Unclear; maybe a little.
multiple GPUs? Might need firmware development; currently 1-to-1. Matt points out that if a GPU supports ~30 GB/s of SZ compression, that roughly matches what one KCU1500/C1100 can deliver, so we may have to stay 1-to-1 anyway.
have to disable something in the Linux kernel (i2o? p2p?) to allow the direct PCIe peer-to-peer communication
include the path to the NVIDIA drivers in the Makefile
https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/MUG16_GPU_tutorial_V5.pdf
Ryan/Larry will resurrect the driver
cpo: try to install a KCU in our GPU nodes
cpo: get the account number
research NVLink: direct-to-GPU, avoids PCIe, but still a closed protocol?
Jana writes that Ryan/Larry can charge 19330-44523 for a few months.
Nvidia Meeting
Diagram of the system from a conversation with Quincey Koziol and Rebanta Mitra on March 20, 2024.