To Do List
Met with Stefano and Ric (Gabriel was absent) to try to set up a roadmap of future projects.
Priorities:
- get first-pass TDet firmware from Mudit
- drp-executable infrastructure?
- makefile support (Gabriel?). Ric feels this will be fairly straightforward
- switch to spack (Gabriel?). Our make system may depend on conda. Gabriel thinks the switch to gpu/spack may be straightforward.
- gpu-direct to weka (Gabriel?)
- dgram creation/handling not optimal for gpu? use cpu? (ric)
- reverse BEB-packet iteration not optimal for gpu? (low priority)
- slow transitions (e.g. configure) go to cpu, fast transitions (l1accept, slowupdate) go to gpu (ric) (see the routing sketch after this list)
- have to make the teb/meb work with "trigger primitives" (ric)
- has to be on the CPU to communicate with IB via libfabric
- do need the meb buffer-number stuff to work "soon-ish"
- full data needs to go from GPU to CPU for meb purposes on a fraction of the L1s
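A minimal sketch of the transition-routing idea above, under stated assumptions (all type and member names are hypothetical, not the actual psdaq interfaces): slow transitions are handled on the cpu, fast transitions on the gpu, and a prescaled fraction of L1Accepts also sends its full data back to the cpu for the meb.

```cpp
#include <cstdint>

// Hypothetical transition types; the real drp uses the psdaq transition enumeration.
enum class Transition { Configure, BeginRun, Enable, SlowUpdate, L1Accept };

struct TransitionRouter {
    uint64_t l1Count     = 0;
    uint64_t mebPrescale = 100;   // assumed: every Nth L1Accept's full data goes GPU->CPU for the meb

    // Fast transitions (L1Accept, SlowUpdate) are handled on the gpu;
    // slow transitions (Configure, BeginRun, ...) stay on the cpu.
    bool handleOnGpu(Transition t) const {
        return t == Transition::L1Accept || t == Transition::SlowUpdate;
    }

    // Full event data goes GPU->CPU for meb purposes on a fraction of L1Accepts.
    bool copyFullEventToCpu(Transition t) {
        return t == Transition::L1Accept && (++l1Count % mebPrescale == 0);
    }
};
```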
Algorithms? (Stefano)
- managing calibration constants on the GPU
- cpu has access to these (for the portion of the detector that the drp node sees) in psana via drp-python (as a python dictionary in det.calibconst)
- need to "copy the python dictionary" with cudaMemCopy to the GPU: pedestal/gain/bad-pixels. have to figure out how to do this elegantly. constants can change at every beginrun transition.
- the dictionary is detector-specific.
- could be a std::map? Ric says that std::map may not be implemented on GPU.
- LC, cuSZ (work on cpu decompression in psana)
- ROI (trivial)
- others like radial average with masks (standalone and pyfai)
- cube
- peakfinders (GPU peakfinder8 is in pyFAI)
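A minimal sketch of the constants-upload idea above, assuming the det.calibconst python dictionary has already been flattened into contiguous per-pixel host arrays (the names and flat layout are assumptions, not the actual drp code); flat arrays also sidestep the std::map-on-GPU question.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical flat layout for one detector's constants; refreshed at every
// beginrun transition. The python dict (det.calibconst) would be flattened to
// these host arrays before the copy.
struct CalibConstants {
    float*   d_pedestal = nullptr;   // device pointers, one value per pixel
    float*   d_gain     = nullptr;
    uint8_t* d_badPixel = nullptr;
    size_t   nPixels    = 0;
};

// Allocate (once, or when the detector shape changes) and copy at beginrun.
void uploadCalibConstants(CalibConstants& c,
                          const float* h_pedestal,
                          const float* h_gain,
                          const uint8_t* h_badPixel,
                          size_t nPixels)
{
    if (c.nPixels != nPixels) {
        cudaFree(c.d_pedestal); cudaFree(c.d_gain); cudaFree(c.d_badPixel);
        cudaMalloc(&c.d_pedestal, nPixels * sizeof(float));
        cudaMalloc(&c.d_gain,     nPixels * sizeof(float));
        cudaMalloc(&c.d_badPixel, nPixels * sizeof(uint8_t));
        c.nPixels = nPixels;
    }
    cudaMemcpy(c.d_pedestal, h_pedestal, nPixels * sizeof(float),   cudaMemcpyHostToDevice);
    cudaMemcpy(c.d_gain,     h_gain,     nPixels * sizeof(float),   cudaMemcpyHostToDevice);
    cudaMemcpy(c.d_badPixel, h_badPixel, nPixels * sizeof(uint8_t), cudaMemcpyHostToDevice);
}
```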
LC performance
Update May 29: Upon further analysis (correcting an error in the data-saving format), LC-GPU compression speeds on an A5000 for segments of 352x384 pixels are:
| segments | GB/s |
|---|---|
| 1 | 6 |
| 2 | 12 |
| 4 | 21 |
| 6 | 31 |
| 8 | 38 |
| 16 | 60 |
| 8x16 | 102 |
| 100x16 | 111 |
May 14 and 20, 2024
- Stefano showed 68GB/s LC compression speed with 7.5x compression (raw uint16 vs. compressed float32) for 1 segment. 88GB/s for 16 segments.
- Stefano will get the raw kernel launches to Gabriel so Gabriel can incorporate them into his scripts
- Stefano will post his results to Confluence and we will share them with Martin
- Stefano will also continue to work on cuSZ
- Stefano has verified that different data compresses at the same 68GB/s for 1 segment
- cupy in spack supports up to CUDA 12.1 but not 12.3; Robert is working on it
- Gabriel's results with ped/gain/common-mode using 2-byte "fake" raw data:
- 1 segment (352x384) - 11-12 us -> 22.755 GB/s
- 6 segments (6x352x384) - 40 us -> 40.109 GB/s
- 12 segments (12x352x384) - 72 us -> 44.598 GB/s
- maybe we don't need common-mode in the drp? Do it offline? Gabriel points out that this is different from what is currently being done. cpo has a gut feeling it's still OK.
- have asked Silke for her opinion about leaving out common mode in drp
(From Valerio) In my opinion it could work, and pedestal/gain is not going to improve with time (while common mode can!) so I think what you suggest, Chris, makes perfect sense. However, Silke has a more practical perspective, and I am curious what she thinks
(from Silke) I agree with Valerio. For many detectors, the gain from the common mode is not as crucial and for a given type of physics/detector setup it may not even be necessary. While there have been times where gain/pedestal did improve 'later', those are the cases we will try to monitor and avoid. "common mode only when in fixed gain": is that a LCLS2 behavior? I thought for LCLS1 we did calculate common-mode even when switching/mixed gain, but ignore the lower-gain pixels.
- another advantage of no-common-mode in drp: calibrations become more easily "reversible" (can be redone with better constants later)
- Gabriel will investigate
- multiple streams (see the sketch after this list)
- cpo thinks multiple-events maps best to multiple-streams
- multiple-segments should be done in one stream?
- the writing of data with GPU-direct-storage might be in a different stream?
- combining LC with the calibration (should this be one kernel? or kernels-launching-kernels (dynamic parallelism)? or (less likely) split up across streams?)
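A minimal sketch of the stream layout discussed above (kernel and buffer names are hypothetical): one stream per in-flight event, with calibrate and compress launched back-to-back in that event's stream so they stay ordered without extra synchronization, and any GPU-direct-storage write left to a separate stream.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical kernels: pedestal/gain/bad-pixel calibration, and an LC-style compressor.
__global__ void calibrate(const uint16_t* raw, const float* ped, const float* gain,
                          const uint8_t* bad, float* calib, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) calib[i] = bad[i] ? 0.f : (raw[i] - ped[i]) * gain[i];
}

__global__ void compress(const float* calib, uint8_t* out, size_t n)
{
    // placeholder for the real LC kernel
}

// One stream per in-flight event: multiple events proceed concurrently, while all
// work for a given event stays ordered within that event's stream.
void processEvent(int slot, cudaStream_t* streams,
                  const uint16_t* d_raw, const float* d_ped, const float* d_gain,
                  const uint8_t* d_bad, float* d_calib, uint8_t* d_out, size_t nPixels)
{
    cudaStream_t s = streams[slot];
    int threads = 256;
    int blocks  = static_cast<int>((nPixels + threads - 1) / threads);
    calibrate<<<blocks, threads, 0, s>>>(d_raw, d_ped, d_gain, d_bad, d_calib, nPixels);
    compress <<<blocks, threads, 0, s>>>(d_calib, d_out, nPixels);
    // A GPU-direct-storage write of d_out could be queued in a different stream,
    // gated by a cudaEvent recorded on s.
}
```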
June 3, 2024
- Gabriel looked at launching calibrate/compress in one kernel or two. Two kernels feels better: more modular (#blocks, #threads, use of shared-memory). Continuing to look at streaming.
- cpo should ask TID if they have "TDet" firmware that can receive timing packets. Maybe we don't need this and can just reuse existing TDet firmware?
- have asked TID if the gpu/cpu switchyard exists. Ric would like to see if we can run the daq entirely on the cpu with the datagpu driver.
- need a timing fiber
- lower priority: ideally need mellanox IB fixed
- could also ideally test a second node by adding a kcu1500
Aug 5, 2024
- Stefano is achieving 140GB/s with his standalone angular integration (includes masking) for a 16Mpixel image.
- Stefano will work on pyFAI
- Stefano and cpo talked with Jiannan and Robert:
- suggested using graphs for launching LC kernels (Stefano will look at this)
- were worried about "outliers" getting clipped in LC. But talked with Martin and he said there is no clipping.
- priorities for Robert/Jiannan: make sure the cupy bug is fixed so Stefano can benchmark cuSZ, streams with cuSZ, support cpu decompression
- Ric is working with Mudit
- having difficulty enabling pcie bifurcation on dev06
- we think we need to have two gpus to take advantage of bifurcation
- notional idea of kcu/gpu structure: have multiple kcu's (~6 existing kcu's to get 50GB/s into GPU, only if pcie bus can do it). pcie bandwidths are here: https://en.wikipedia.org/wiki/PCI_Express
- supermicro update utility seems to report non-bifurcated even when we think bifurcation is enabled?
- suggested Mudit write a pgpread-style program for the gpu: Mudit's stuff doesn't work with the gpu yet
- Ric working on drp-gpu code (targeting the cpu side first)
- tried running Mudit's firmware on gpu001 (non-bifurcated). Saw events coming into the cpu with pgpread. Turning on bifurcation worked on gpu001. The dev06 bifurcation issue seems to be specific to that machine (Matt claims it used to work, though).
- Jeremy has an example of CUDA graphs if that ends up being useful (a minimal capture sketch follows this list)
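A minimal sketch of the stream-capture route to CUDA graphs mentioned above (the LC kernels are placeholders; this is not Jeremy's example): capture the per-event launch sequence once, instantiate it, then replay it with a single launch to amortize launch overhead.

```cpp
#include <cuda_runtime.h>

__global__ void lcStage1(float* data, size_t n) { /* placeholder */ }
__global__ void lcStage2(float* data, size_t n) { /* placeholder */ }

// 'stream' must be a non-default stream for capture.
void buildAndRunGraph(cudaStream_t stream, float* d_data, size_t n)
{
    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;

    // Record the launch sequence into a graph instead of executing it immediately.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    lcStage1<<<256, 256, 0, stream>>>(d_data, n);
    lcStage2<<<256, 256, 0, stream>>>(d_data, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, 0);   // CUDA 12 signature

    // Replay: one launch per event instead of one per kernel.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
```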
Aug 12, 2024
- Stefano working on cuSZ (Robert has provided the cupy fix), pyFAI, and pi-slice azimuthal integration with a standalone kernel (a minimal binning sketch follows below)
- Ric getting to the point where he needs Mudit's stuff to work.
- Makefile system is kludgy
- Can get through phase1 of configure
- Any interaction with firmware messes up DMAs to the CPU. Register map assumed to be the same as the interlard test? Only the mode-select address is known for certain.
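Not pyFAI itself, just a minimal sketch of the standalone masked azimuthal/radial binning idea (names are hypothetical; the per-pixel bin indices are assumed to be precomputed on the host from the detector geometry).

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// One thread per pixel: skip masked pixels, accumulate intensity into the
// precomputed radial/azimuthal bin for that pixel.
__global__ void binIntegrate(const float* image, const uint8_t* mask,
                             const int* pixelBin, float* binSum, unsigned* binCount,
                             size_t nPixels)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPixels || mask[i]) return;
    int b = pixelBin[i];
    atomicAdd(&binSum[b], image[i]);
    atomicAdd(&binCount[b], 1u);
}

// Afterwards (host or a small kernel): average per bin = binSum[b] / binCount[b].
```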