Task Overview
Goals:
- MFX area detectors: 35 kHz, 16 Mpx, 1 TB/s
- 50 GB/s per GPU (SZ/LC compression and calibration, or calibration + ML); avoid the CPU where possible
- "pipeline" as much as possible, hopefully with enough buffering to handle the software trigger latency (Ric's TEB)
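A rough sanity check of these numbers, as a sketch only. The 2 bytes per pixel, the decimal reading of "16 Mpx", and the 10 ms trigger latency are assumptions for illustration and do not come from the notes above.

```python
# Back-of-envelope sizing for the goals above.
# ASSUMPTIONS (not from the notes): 2 bytes per pixel, 16 Mpx = 16e6 pixels,
# and a hypothetical 10 ms software trigger latency.

FRAME_RATE_HZ = 35_000
PIXELS_PER_FRAME = 16_000_000
BYTES_PER_PIXEL = 2                      # assumption
GPU_RATE_BYTES_PER_S = 50_000_000_000    # 50 GB/s per GPU goal


def detector_rate_bytes_per_s(frame_rate_hz=FRAME_RATE_HZ,
                              pixels=PIXELS_PER_FRAME,
                              bytes_per_pixel=BYTES_PER_PIXEL):
    """Raw detector data rate in bytes per second."""
    return frame_rate_hz * pixels * bytes_per_pixel


def gpus_needed(total_bytes_per_s, per_gpu_bytes_per_s=GPU_RATE_BYTES_PER_S):
    """Number of GPUs needed, rounded up (integer ceiling division)."""
    return -(-total_bytes_per_s // per_gpu_bytes_per_s)


def buffer_bytes(latency_ms, per_gpu_bytes_per_s=GPU_RATE_BYTES_PER_S):
    """Bytes one GPU must hold while waiting latency_ms for a trigger decision."""
    return per_gpu_bytes_per_s * latency_ms // 1000


if __name__ == "__main__":
    rate = detector_rate_bytes_per_s()
    print(f"raw rate: {rate / 1e12:.2f} TB/s")
    print(f"GPUs at 50 GB/s: {gpus_needed(rate)}")
    print(f"buffer per GPU for 10 ms latency: {buffer_bytes(10) / 1e6:.0f} MB")
```

With these assumptions the raw rate comes out slightly above the quoted 1 TB/s, so the GPU count depends on which figure is used.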
Approach:
- use NVIDIA/CUDA for now
- in the future consider Kokkos/HIP etc.
- Python will be less useful
Algorithms:
- SZ3/LC compression
- ML algorithms (peak-finding, triggering on TEB?)
- detector calibration
- validation of reduction (Stefano)
Infrastructure:
- use drp-srcf-gpu[001-004] with CUDA 12
- datadev driver (TID)
- GPU-based file-writing with GPUDirect
- generating correct xtc headers
- drp-gpu executable
- move to spack
Driver
Meeting with Larry, Ryan, Matt, Ric, cpo on Feb. 9, 2024
To address concerns about LCLS-II HE large area detectors
run a different driver (same GitHub repo as datadev), which requires the "Quadro" capability. The RTX A5000 (in drp-srcf-cmpNNN) seems to have it.
needs an additional, slightly different DMA engine
a "metadata" word write triggers the GPU
- two physical addresses: a destination for the data and a mailbox address, which the GPU would wait on
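The data-plus-mailbox handshake described above can be illustrated with a small host-side simulation. This is a sketch only: plain Python objects stand in for the two physical addresses and a thread stands in for the DMA engine. All names here (`Buffers`, `dma_engine`, `gpu_wait`, the frame size and the metadata value) are hypothetical and are not part of the datadev driver.

```python
import threading
import time

FRAME_WORDS = 8  # hypothetical frame size, for illustration only


class Buffers:
    """Stand-ins for the two physical addresses."""

    def __init__(self):
        self.data = [0] * FRAME_WORDS  # destination for the data
        self.mailbox = 0               # mailbox word; 0 means "no frame yet"


def dma_engine(buf, frame, metadata):
    """Simulated DMA engine: write the payload first, the metadata word last."""
    buf.data[:] = frame
    buf.mailbox = metadata  # this write is what "triggers" the waiter


def gpu_wait(buf, timeout_s=1.0):
    """Simulated GPU side: poll the mailbox until it becomes non-zero."""
    deadline = time.monotonic() + timeout_s
    while buf.mailbox == 0:
        if time.monotonic() > deadline:
            raise TimeoutError("no metadata word arrived")
        time.sleep(0)  # yield to the writer thread
    return buf.mailbox, list(buf.data)


def run_once():
    """Run one transfer and return (metadata, data) as seen by the waiter."""
    buf = Buffers()
    frame = list(range(1, FRAME_WORDS + 1))
    writer = threading.Thread(target=dma_engine, args=(buf, frame, 0xABCD))
    writer.start()
    meta, data = gpu_wait(buf)
    writer.join()
    return meta, data


if __name__ == "__main__":
    meta, data = run_once()
    print(hex(meta), data)
```

The point of the ordering is that the waiter only looks at the data after the mailbox changes, so the metadata word has to be the last thing written.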
TID driver uses "GPUDirect" to move data directly to GPU memory over PCIe
does a CPU need to be involved in the path? Unclear, maybe a little.
multiple GPUs? Might need firmware development; currently it is 1-to-1. Matt points out that if a GPU supports 30 GB/s of SZ compression, that roughly matches what one KCU1500/C1100 can deliver, so we may have to stay 1-to-1.
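Matt's point can be made concrete with a short calculation, as a sketch only. It takes the 30 GB/s figure as both the card rate and the GPU compression rate, uses the 1 TB/s goal as the total, and assumes a strict 1-to-1 pairing where each pair runs at the slower of its two sides. The function name is hypothetical.

```python
# Sketch: how many card/GPU pairs a 1-to-1 layout would need.
# ASSUMPTION: each pair is limited by the slower of the card and the GPU.

TOTAL_BYTES_PER_S = 1_000_000_000_000  # 1 TB/s goal
CARD_BYTES_PER_S = 30_000_000_000      # roughly one KCU1500/C1100, per Matt


def pairs_needed(total, card_rate, gpu_rate):
    """Pairs needed when each pair runs at min(card_rate, gpu_rate)."""
    pair_rate = min(card_rate, gpu_rate)
    return -(-total // pair_rate)  # integer ceiling division


if __name__ == "__main__":
    # GPU compressing at 30 GB/s: card and GPU are matched.
    print(pairs_needed(TOTAL_BYTES_PER_S, CARD_BYTES_PER_S, 30_000_000_000))
    # GPU reaching the 50 GB/s goal: the card is still the bottleneck.
    print(pairs_needed(TOTAL_BYTES_PER_S, CARD_BYTES_PER_S, 50_000_000_000))
```

Under these assumptions a faster GPU does not reduce the pair count unless one GPU can be fed by more than one card, which is where the firmware development would come in.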
have to disable something in the Linux kernel so that i2o or p2p allows the direct PCI transfers
include the path to the NVIDIA drivers in the makefile
https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/MUG16_GPU_tutorial_V5.pdf
Ryan/Larry will resurrect the driver
cpo will try to install a KCU in our GPU nodes
cpo gets an account number
research NVLink (direct to GPU, avoids PCIe; still a closed protocol?)
Jana writes that Ryan/Larry can charge 19330-44523 for a few months.
Nvidia Meeting
Diagram of system from conversation with Quincey Koziol and Rebanta Mitra on March 20th, 2024