To Do List
Met with Stefano and Ric (Gabriel was absent) to try to set up a roadmap of future projects.
Priorities:
- get first-pass TDet firmware from Mudit
- drp-executable infrastructure?
- makefile support (Gabriel?). Ric feels this will be fairly straightforward
- switch to spack (Gabriel?). our make system may depend on conda. Gabriel thinks switch to gpu/spack may be straightforward.
- gpu-direct to weka (Gabriel?)
- dgram creation/handling not optimal for gpu? use cpu? (ric)
- reverse BEB-packet iteration not optimal for gpu? (low priority)
- slow transitions (e.g. configure) go to cpu, fast transitions (l1accept, slowupdate) go to gpu (ric)
- have to make the teb/meb work with "trigger primitives" (ric)
- has to be on the CPU to communicate with IB via libfabric
- do need the meb buffer-number stuff to work "soon-ish"
- full data needs to go from GPU to CPU for meb purposes on a fraction of the L1s
Algorithms? (Stefano)
- managing calibration constants on the GPU
- cpu has access to these (for the portion of the detector that the drp node sees) in psana via drp-python (as a python dictionary in det.calibconst)
- need to "copy the python dictionary" with cudaMemCopy to the GPU: pedestal/gain/bad-pixels. have to figure out how to do this elegantly. constants can change at every beginrun transition.
- the dictionary is detector-specific.
- could be a std::map? Ric says that std::map may not be implemented on GPU.
- LC, cuSZ (work on cpu decompression in psana)
- ROI (trivial)
- others like radial average with masks (standalone and pyfai)
- cube
- peakfinders (GPU peakfinder8 is in pyFAI)
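A minimal sketch of one way the "copy the python dictionary" item above could look, assuming the constants arrive on the CPU as flat per-pixel arrays pulled out of det.calibconst; the struct and function names are illustrative, not existing psdaq code:

```cpp
// Hypothetical sketch: stage pedestal/gain/status arrays (fetched from the
// det.calibconst python dictionary on the CPU) into device memory at beginrun.
// A plain struct of device pointers stands in for std::map, which is not
// usable in device code.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

struct CalibConstsGpu {          // assume zero-initialized at startup
    float*    pedestal = nullptr;   // one value per pixel (per gain range in reality)
    float*    gain     = nullptr;
    uint16_t* status   = nullptr;   // bad-pixel mask
    size_t    nPixels  = 0;
};

// Called on each beginrun transition with the freshly fetched host arrays.
void uploadCalibConsts(CalibConstsGpu& d, const float* hPed, const float* hGain,
                       const uint16_t* hStatus, size_t nPixels)
{
    if (d.nPixels != nPixels) {     // (re)allocate only if the detector size changed
        cudaFree(d.pedestal);  cudaFree(d.gain);  cudaFree(d.status);
        cudaMalloc(&d.pedestal, nPixels * sizeof(float));
        cudaMalloc(&d.gain,     nPixels * sizeof(float));
        cudaMalloc(&d.status,   nPixels * sizeof(uint16_t));
        d.nPixels = nPixels;
    }
    cudaMemcpy(d.pedestal, hPed,    nPixels * sizeof(float),    cudaMemcpyHostToDevice);
    cudaMemcpy(d.gain,     hGain,   nPixels * sizeof(float),    cudaMemcpyHostToDevice);
    cudaMemcpy(d.status,   hStatus, nPixels * sizeof(uint16_t), cudaMemcpyHostToDevice);
}
```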
LC performance
Update May 29: Upon further analysis (correcting for an error in the data-saving format), for LC-GPU on an A5000 the compression speed for segments of 352x384 pixels is:
segments | GB/s
---|---
1 | 6
2 | 12
4 | 21
6 | 31
8 | 38
16 | 60
8x16 | 102
100x16 | 111
May 14 and 20, 2024
- Stefano showed 68GB/s LC compression speed with 7.5x compression (raw uint16 vs. compressed float32) for 1 segment. 88GB/s for 16 segments.
- Stefano will get the raw kernel launches to Gabriel so Gabriel can incorporate in his scripts
- Stefano will post his results to confluence and we will share with Martin
- Stefano will also continue to work on cuSZ
- Stefano has verified that different data compresses at the same 68GB/s for 1 segment
- cupy in spack supports up to cuda 12.1 but not 12.3, Robert is working on it
- Gabriel results with ped/gain/common-mode using 2-byte "fake" raw data:
- 1 segment (352x384) - 11-12 us -> 22.755 GB/s
- 6 segments (6x352x384) - 40 us -> 40.109 GB/s
- 12 segments (12x352x384) - 72 us -> 44.598 GB/s
- maybe we don't need common-mode in the drp? do it offline? Gabriel points out that this is different than what is currently being done. cpo has a gut feeling it's still OK.
- have asked Silke for her opinion about leaving out common mode in drp
(From Valerio) In my opinion it could work, and pedestal/gain is not going to improve with time (while common mode can!) so I think what you suggest, Chris, makes perfect sense. However, Silke has a more practical perspective, and I am curious what she thinks
(from Silke) I agree with Valerio. For many detectors, the gain from the common mode is not as crucial and for a given type of physics/detector setup it may not even be necessary. While there have been times where gain/pedestal did improve 'later', those are the cases we will try to monitor and avoid. "common mode only when in fixed gain": is that a LCLS2 behavior? I thought for LCLS1 we did calculate common-mode even when switching/mixed gain, but ignore the lower-gain pixels.
- another advantage of no-common-mode in drp: calibrations become more easily "reversible" (can be redone with better constants later)
- Gabriel will investigate
- multiple streams
- cpo thinks multiple-events maps best to multiple-streams
- multiple-segments should be done in one stream?
- the writing of data with GPU-direct-storage might be in a different stream?
- combining LC with the calibration (should this be one kernel? or kernels-launching-kernels (dynamic parallelism)? or (less likely) split up across streams?) (see the sketch below)
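A minimal sketch of the per-event processing discussed above: a pedestal/gain calibration kernel (no common mode) followed by a compression kernel, launched on one stream per in-flight event so independent events can overlap. Kernel bodies and names are illustrative; compress is just a placeholder for the real LC/cuSZ kernels:

```cpp
// Hypothetical sketch: calibrate raw uint16 data with pedestal/gain, then hand the
// result to a (placeholder) compression kernel, keeping both in one per-event stream.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void calibrate(const uint16_t* raw, const float* ped, const float* gain,
                          float* out, int nPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPixels)
        out[i] = (static_cast<float>(raw[i]) - ped[i]) * gain[i];  // no common mode
}

__global__ void compress(const float* calibrated, uint8_t* compressed, int nPixels)
{
    // stand-in for the real LC/cuSZ kernels
}

void processEvent(cudaStream_t stream,
                  const uint16_t* d_raw, const float* d_ped, const float* d_gain,
                  float* d_calib, uint8_t* d_comp, int nPixels)
{
    int threads = 256;
    int blocks  = (nPixels + threads - 1) / threads;
    // Two kernels (rather than one fused kernel) keep block/thread/shared-memory
    // choices independent; they stay ordered because they share the stream.
    calibrate<<<blocks, threads, 0, stream>>>(d_raw, d_ped, d_gain, d_calib, nPixels);
    compress <<<blocks, threads, 0, stream>>>(d_calib, d_comp, nPixels);
}
```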
June 3, 2024
- Gabriel looked at launching calibrate/compress in one kernel or two. Two kernels feels better: more modular (#blocks, #threads, use of shared-memory). Continuing to look at streaming.
- cpo should ask TID if they have "TDet" firmware that can receive timing packets. Maybe we don't need this and can just reuse existing TDet firmware?
- have asked TID if the gpu/cpu switchyard exists. Ric would like to see if we can run the daq entirely on the cpu with the datagpu driver.
- need a timing fiber
- lower priority: ideally need mellanox IB fixed
- could also ideally test a second node by adding a kcu1500
Aug 5, 2024
- Stefano is achieving 140GB/s with his standalone angular integration (includes masking) for a 16Mpixel image (see the sketch at the end of this entry).
- Stefano will work on pyFAI
- Stefano and cpo talked with Jiannan and Robert:
- suggested using graphs for launching LC kernels (Stefano will look at this)
- were worried about "outliers" getting clipped in LC. But talked with Martin and he said there is no clipping.
- priorities for Robert/Jiannan: make sure the cupy bug is fixed so Stefano can benchmark cuSZ, streams with cuSZ, support cpu-decompression
- Ric is working with Mudit
- having difficulty enabling pcie bifurcation on dev06
- we think we need to have two gpus to take advantage of bifurcation
- notional idea of kcu/gpu structure: have multiple kcu's (~6 existing kcu's to get 50GB/s into GPU, only if pcie bus can do it). pcie bandwidths are here: https://en.wikipedia.org/wiki/PCI_Express
- supermicro update utility seems to report non-bifurcated when we think it is?
- suggested Mudit write a pgpread-style program for gpu: Mudit's stuff doesn't work with gpu yet
- ric working on drp-gpu code (targeting cpu side first)
- tried running Mudit's firmware on gpu001 (non-bifurcated). saw events coming into cpu with pgpread. turning on bifurcation worked on gpu001. dev06 bifurcation-issue seems to be specific to that machine (Matt claims it used to work though)
- Jeremy has example of cuda graphs if that ends up being useful
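A minimal sketch of masked angular (azimuthal) integration on the GPU, for reference against the 140GB/s number above. This is a simple atomicAdd binning formulation and not necessarily Stefano's kernel (later notes describe a cusparse sparse-matrix-vector formulation); names are illustrative:

```cpp
// Hypothetical sketch: each pixel carries a precomputed radial/azimuthal bin index
// (from the detector geometry); masked pixels are skipped; the 1D profile is the
// per-bin sum divided by the per-bin count.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void angularIntegrate(const float* image, const uint8_t* mask,
                                 const int* binIndex, float* binSum, int* binCount,
                                 int nPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPixels && mask[i] == 0) {          // mask[i] != 0 means "bad pixel"
        int b = binIndex[i];                    // bin precomputed from geometry
        atomicAdd(&binSum[b], image[i]);
        atomicAdd(&binCount[b], 1);
    }
}
// Afterwards: profile[b] = binSum[b] / binCount[b] (tiny kernel or on the CPU).
```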
Aug 12, 2024
- Stefano working on cuSZ (Robert has provided cupy fix), pyFAI, and pi-slices azimuthal integration with standalone kernel
- Ric getting to the point where he needs Mudit's stuff to work.
- Makefile system is kludgy
- Can get through phase1 of configure
- Any interaction with firmware messes up DMAs to the CPU. Register map is assumed to be the same as the interlard test? Only the mode-select address is known for certain.
Sept 9, 2024
Gabriel will put his ethernet weka performance numbers on confluence. Sees only 2GB/s single-writer where Ric/cpo recall seeing 5GB/s previously.
Ric's code to flip transitions to GPU only for dma lane 0. Questions:
- Why so many bits?
- Axi stream demux (from Ben): lower 0x0000 configures a route-mask and a destination, see https://github.com/slaclab/axi-pcie-core/blob/main/python/axipcie/_AxiGpuAsyncCore.py#L163-L178; upper is 0xffff0000. One stream in, two streams out (cpu and gpu). Don't overthink it: just two values (otherwise we get into the swamp of the general demux written by Ben that is being reused here)
- How do we support multiple GPUs?
- How do we support multiple pgp cards per GPU?
- Would like power-on default to be to vector the data to the CPU
- Mudit/Ryan say that software initialization has to be done already (e.g. setting up clocks) so not a big deal to do this in software
- FYI for TID: Aiming to have the same software work for multiple detectors: e.g. TDet, epixM, epixUHR. "GpuDetector" is analog of "PgpDetector".
- Aiming to have the same firmware for GPU/CPU e.g. for TDet.
- Mudit confirms that we will be able to use same firmware for both CPU/GPU deployments
- Should be true for other detectors as well, like epixUHR
- Question for Mudit: most TDet registers are in the 0x00a00000 space, but a few (e.g. the switch between cpu/gpu directions) are in the 0x00d00000 space. What's the difference between the two spaces and should the "switch" be moved to 0x00a? (along with everything else, so nothing would be in 0x00d)
- 0xa00000 space has the axi-lite registers for TDetsemi module, following Matt's original code
- 0xd00000 space has the axi-lite registers for the dmux/asyncgpu module
Oct. 28, 2024
- Stefano has pyFAI results to compare with his standalone method
- Gabriel will perhaps start working on calibration in the GPU
- Valerio is working on other projects. regarding move to spack:
- first get psana2 working in s3df with spack
- then get cpu-based daq working with spack
- then incorporate ric's gpu work into spack
- we will split spack into the daq/mini-psana (old system) and big-psana (s3df)
- Ric has committed his code to git
GPU Software Dataflow
see GPU Data Reduction#SoftwareDataFlow
Nov. 4, 2024
- Gabriel is working on getting gpu003 running. Needs advice from Ric.
- Stefano is only getting 12GB/s from his own angular integration for 512x512 pixels (need 50GB/s), which is about 4 or 5 segments (each 384x336)
- need to understand that
- still working on testing pyFAI
Nov. 11, 2024
- Stefano's angular integration: large frame 4kx4k 140GB/s; 1kx1k 46GB/s; 350x380 5GB/s. Two ideas:
- if we batch a few events, can we "separate" the outputs? (necessary for downstream parallelization of event analysis)
- would this give us better performance. if yes, why?
- test if multi-streaming helps
- waiting for pyFAI results
- Gabriel working on gpu003.
- node is running
- daq is running (transitions into CPU, L1Accepts into GPU)
- latest aes-stream-drivers (some changes required in daq)
- goal: fetch calibration constants and calibrate the detector in gpu (calling out to Python)
- went to cuda 12.6 on gpu003 but gpu004 is on 12.3 and gpu001 is on 12.4. Need 12.4 because 12.3 doesn't support gcc 13. daq-tst-dev06 is also on 12.3 (Jeremy and Mudit have been using this)
- see psdaq/psdaq/drp/epixHrEmu.py and psdaq/drp/EpixHRemu.cc
- cpo has an order out for the 4U supermicro server plus H100NVL (supermicro says that the 4U server has the PLX pcie switches).
- Ric worked on epixUHR a lot.
- worked with Jeremy to fix the deadlock:
- needed an async-stream-based cuda-memcpy (cudaMemcpyAsync) from GPU to CPU to avoid deadlocks, not the standard cuda-memcpy; this is a general rule with streams (see the sketch at the end of this entry)
- Ric created a pgpread-gpu (not pretty yet)
- flips the switch to the GPU automatically, (but option for CPU as well)
- Mudit has switch-flipping stuff in a special kcuSim (Ric thinks it would be better to put it in pgpread-gpu instead)
- thinking about Matt's point about the multi-FPGA-board support
- already have a software event-builder to build multiple lanes; could that idea be reused for multi-FPGA boards?
- Ric says maybe the idea can be reused, but how do we redesign the memPool to support multiple PGP cards without breaking the original cpu daq? psdaq/drpGpu/ is a parallel directory (like epicsArch/) where Ric is trying ideas.
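A minimal sketch of the "use the async copy" rule from the deadlock fix above: a plain cudaMemcpy goes through the default stream and blocks the host, which can deadlock against work in flight on other streams, while cudaMemcpyAsync (with pinned host memory) stays ordered inside the producing stream. Names are illustrative:

```cpp
// Hypothetical sketch: drain a device buffer to the host using a stream-ordered
// copy rather than a blocking cudaMemcpy.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

void drainToHost(cudaStream_t stream, const uint8_t* d_buf, size_t bytes)
{
    static uint8_t* h_pinned = nullptr;
    static size_t   capacity = 0;
    if (bytes > capacity) {                       // pinned memory is required for
        cudaFreeHost(h_pinned);                   // truly asynchronous copies
        cudaMallocHost(&h_pinned, bytes);
        capacity = bytes;
    }
    // Ordered after whatever kernels were launched on 'stream'; does not
    // synchronize the whole device the way a plain cudaMemcpy can.
    cudaMemcpyAsync(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);                // wait only for this stream
}
```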
Nov. 18, 2024
- 12GB/s from pyFAI? https://www.silx.org/doc/pyFAI/latest/usage/tutorial/Parallelization/MultiGPU.html (see conclusion)
- Stefano getting worrisome numbers for standalone angular integration (as low as 3GB/s but up to 50GB/s if we batch)
- will run the profiler
- pyFAI is not yet running on GPU: will work on that
- Gabriel moved the GPU nodes to the FEE alcove xpm10.
- 1 free connection to xpm10 in the BOS
- all streams are blocked on dma single-event-buffers, so need to create a separate stream to transfer calibration constants on beginrun
- maybe make this "maintenance" stream permanent?
- couldn't launch kernels on anything except the default stream because the cuda context hadn't been bound to the CPU threads; async GPU operations failed with no error (see the context-binding sketch at the end of this entry)
- set up basic skeleton for transferring the calibration constants to a gpuMemPool (transitions, constants (maybe on the beginrun transition?)) and launches a dummy calibration on L1Accepts
- did both the full serial number and a subset (2 segments) to fetch calibration
- calibration constant dictionary hierarchy: pedestal and gain arrays (one per gain range); also need status/geometry. Could try to do something general? map a python dictionary to a std::map?
- Fundamental idea in CPU-DRP is a MemPool with a file-descriptor
- changing it affects many things
- put many file-descriptors in there? not easy.
- GPUMemPool and CPUMemPool
- both inherit from base MemPool which inherits pebble/transitions
- derived classes handle the file-descriptor and dma buffers (see the class-hierarchy sketch at the end of this entry)
- the cuda context broke the symmetry between the cpu and gpu sides
- separated mempool into a PGP side and a pebble side.
- got cpu-drp to run with the new structure
- tried on gpu-drp, but issues arose. made progress over weekend.
- missing a lot of code for multiple file descriptors
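On the "context hadn't been bound to CPU threads" problem: a minimal sketch of what binding looks like, assuming the driver-API context created for the datagpu work has to be made current in every worker thread before that thread can create or use non-default streams. Everything here is illustrative, not the actual drp code:

```cpp
// Hypothetical sketch: each CPU worker thread binds the shared CUDA context
// before using streams; without this, stream work can fail silently.
#include <cuda.h>
#include <thread>

static void worker(CUcontext ctx)
{
    cuCtxSetCurrent(ctx);                         // bind the context to this thread

    CUstream stream;
    cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING);
    // ... launch kernels / cuMemcpyAsync on 'stream' here ...
    cuStreamSynchronize(stream);
    cuStreamDestroy(stream);
}

int main()
{
    cuInit(0);
    CUdevice  dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);    // one context, shared by threads

    std::thread t1(worker, ctx), t2(worker, ctx);
    t1.join(); t2.join();

    cuCtxDestroy(ctx);
    return 0;
}
```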
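A minimal sketch of the MemPool split described above: a common base owning the pebble/transition buffers, with CPU and GPU specializations owning the PGP file descriptor(s) and DMA buffers. Class and member names are illustrative, not the actual psdaq classes:

```cpp
// Hypothetical sketch of the proposed CPU/GPU MemPool hierarchy.
#include <vector>
#include <cstdint>
#include <cstddef>

class MemPoolBase {
public:
    virtual ~MemPoolBase() = default;
    // Common to both sides: pebble (per-event) and transition buffers.
    std::vector<uint8_t> pebble;
    std::vector<uint8_t> transitionBuffers;
    // PGP/DMA handling differs between the CPU and GPU implementations.
    virtual void allocateDmaBuffers(size_t count, size_t size) = 0;
};

class CpuMemPool : public MemPoolBase {
public:
    int pgpFd = -1;                      // single PGP card file descriptor
    void allocateDmaBuffers(size_t count, size_t size) override {
        // map the driver's DMA buffers into host memory
    }
};

class GpuMemPool : public MemPoolBase {
public:
    std::vector<int> pgpFds;             // possibly several PGP cards per GPU
    void allocateDmaBuffers(size_t count, size_t size) override {
        // allocate device memory and register it with the datagpu driver
    }
};
```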
Nov. 25, 2024
- Gabriel battled with a bug where the logging broke when mpi4py initializes
- worked around it by setting an environment variable that stopped mpi4py initialization
- changed the buffer-size given to the datagpu driver to be able to hold a full detector segment (since we don't yet have a tdetsim.service file)
- found that we could have dynamic kernel parameters defined by virtual functions (see the launch-parameter sketch at the end of this entry)
- grid, blocksize (potentially 3D arrays, but currently using a grid size of 144 and a block size of 192 (the size of 1 asic))
- running with 4 pixels per thread sequentially (from a previous early study); this factor of 4 brings us up to the size of a full segment (4 asics)
- worry about the 5us kernel-launch overhead
- cuda-graphs reduce this overhead.
- python dictionary is mapped to a gpu-structure
- waiting for Ric's latest
- on the to-do list: cufile (goal is 5GB/s per GPU, 2 GPUs, 10GB/s/node), cuda-graphs
- seeing messages about getting cached memory for the DMA. ask TID about this.
- Stefano running pyFAI on GPU
- much worse performance (24GB/s for 4kx4k) than standalone (126GB/s). Gabriel mentions that pyFAI may do additional tasks (e.g. solid-angle correction)
- pixel "interpolation/smoothing" is not the reason
- for 1 segment 384x384 get 6GB/s standalone and 1GB/s with pyFAI
- Sparse-matrix-matrix with 10 events was 3GB/s
- Sparse-matrix-vector with 16 events was 50GB/s
- profiler work for standalone pyFAI is in-progress
- for standalone version using cusparse called by cupy
- cuSZ is also in-progress
- possible future projects: optimize standalone angular integration, peakfinder8
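A minimal sketch of the "kernel parameters defined by virtual functions" idea plus the 4-pixels-per-thread loop. The 144/192 numbers are from the notes above; the class and function names are illustrative:

```cpp
// Hypothetical sketch: a detector-specific class reports its grid/block choice,
// and the calibration kernel processes PIXELS_PER_THREAD pixels per thread.
#include <cuda_runtime.h>
#include <cstdint>

constexpr int PIXELS_PER_THREAD = 4;

struct LaunchConfig { dim3 grid; dim3 block; };

class GpuDetector {
public:
    virtual ~GpuDetector() = default;
    virtual LaunchConfig calibLaunch() const = 0;   // chosen per detector type
};

class EpixHRemuGpu : public GpuDetector {
public:
    LaunchConfig calibLaunch() const override {
        return { dim3(144), dim3(192) };            // 1 asic; x4 pixels/thread = 1 segment
    }
};

__global__ void calibrate(const uint16_t* raw, const float* ped, const float* gain,
                          float* out, int nPixels)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * PIXELS_PER_THREAD;
    for (int k = 0; k < PIXELS_PER_THREAD; ++k) {   // 4 pixels per thread, sequentially
        int i = base + k;
        if (i < nPixels)
            out[i] = (static_cast<float>(raw[i]) - ped[i]) * gain[i];
    }
}

void launchCalib(const GpuDetector& det, cudaStream_t s, const uint16_t* raw,
                 const float* ped, const float* gain, float* out, int nPixels)
{
    LaunchConfig c = det.calibLaunch();
    calibrate<<<c.grid, c.block, 0, s>>>(raw, ped, gain, out, nPixels);
}
```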
Dec. 2, 2024
- Gabriel seeing 1-2GB/s on Ampere with libCuFile, with some variability (writing a single file). Will try gpu004 (less variable usage of the filesystem, and it has IB). See the cufile sketch below.
- Having trouble installing libpressio with spack. working with Robert. Will continue to work on angular integration profiling if blocked on spack.
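For reference, a minimal sketch of a single cufile (GPUDirect Storage) write of a device buffer, roughly what the gdsio tests exercise. The cuFile* calls are the libcufile API, but this wrapper is simplified (no error handling) and the file must be opened with O_DIRECT:

```cpp
// Hypothetical sketch: write a device buffer straight to a file with libcufile.
#define _GNU_SOURCE
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

void writeDeviceBuffer(const char* path, const void* d_buf, size_t bytes)
{
    cuFileDriverOpen();

    int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);

    CUfileDescr_t descr;
    std::memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type      = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    cuFileBufRegister(d_buf, bytes, 0);               // optional, helps if the buffer is reused
    cuFileWrite(handle, d_buf, bytes, /*file_offset=*/0, /*devPtr_offset=*/0);
    cuFileBufDeregister(d_buf);

    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
}
```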
Dec. 9, 2024
- Stefano working with Robert on libpressio installation issues
- Gabriel putting cufile results on confluence.
- Tried gpu004, getting 1-2GB/s there as well. Can write with 4GB/s from GPU to pinned memory.
- Was straightforward to use cufile. The only way to configure it is with json (e.g. block size). Using the Nvidia-provided gdsio application.
- Looking at cuda-graphs as well. The first event to "record" the graph takes a long time (100's of microseconds). Don't think there is a way to record the graph across daq restarts (see the graph-capture sketch at the end of these notes).
- Will give cufile results to the weka people when they are ready on confluence.
- Ric got the daq to work with transitions through the GPU. Still a bunch of stuff needed to support multi-fpga, and some important decisions need to be made:
- Matt is making changes to XpmDetector that may impact this (only in the short term while we use the emulator, since XpmDetector is used ...).
- Current approach is to event-build the various FPGA streams (and eventually go to one file), but we need to decide how to handle configure data, for example. How do we break the wait if we need to reconfigure? Maybe use a spin-loop?
- cpo votes for a multi-person design brainstorming session to discuss the issues that Ric has found (it has the additional advantage that it educates multiple people).
- Ric suggests: maybe run N drp's on a box (one per fpga) so we don't event-build multiple fpga's? Disadvantage: per-event data becomes smaller, so algorithms become less efficient unless we (temporarily) create batches of multiple events (we can do this for angular integration, but not for SZ since the output is not "separable" into individual events).
- How do we handle differences between the XpmDetector (used for emulation) and the real epixUHR? Maybe have 1 drp process launch the 5 (or N) processes?
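A minimal sketch of the cuda-graphs idea (Nov. 25 and above): capture the per-event kernel-launch sequence once, pay the expensive instantiate step a single time, then replay the executable graph for every L1Accept. The kernel and names are illustrative; capture requires a non-default stream, and the 3-argument cudaGraphInstantiate shown is the CUDA 12 form:

```cpp
// Hypothetical sketch: amortize the ~5us per-kernel launch overhead by capturing
// the per-event launches into a graph and replaying it with one cudaGraphLaunch.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void calibrate(const uint16_t* raw, const float* ped, const float* gain,
                          float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (static_cast<float>(raw[i]) - ped[i]) * gain[i];
}

cudaGraphExec_t buildEventGraph(cudaStream_t stream,            // non-default stream
                                const uint16_t* d_raw, const float* d_ped,
                                const float* d_gain, float* d_calib, int nPixels)
{
    int threads = 256, blocks = (nPixels + threads - 1) / threads;

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    calibrate<<<blocks, threads, 0, stream>>>(d_raw, d_ped, d_gain, d_calib, nPixels);
    // ... a compression kernel would be captured here the same way ...
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);        // slow "record" step, done once
    cudaGraphDestroy(graph);
    return exec;
}

// Per event: a single launch replaces the individual kernel launches.
void runEvent(cudaGraphExec_t exec, cudaStream_t stream)
{
    cudaGraphLaunch(exec, stream);
}
```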