To Do List
Met with Stefano and Ric (Gabriel was absent) to try to set up a roadmap of future projects.
Priorities:
- get first-pass TDet firmware from Mudit
- drp-executable infrastructure?
- makefile support (Gabriel?). Ric feels this will be fairly straightforward
- switch to spack (Gabriel?). Our make system may depend on conda. Gabriel thinks the switch to gpu/spack may be straightforward.
- gpu-direct to weka (Gabriel?)
- dgram creation/handling not optimal for gpu? use cpu? (ric)
- reverse BEB-packet iteration not optimal for gpu? (low priority)
- slow transitions (e.g. configure) go to cpu, fast transitions (l1accept, slowupdate) go to gpu (ric) (see the routing sketch after this list)
- have to make the teb/meb work with "trigger primitives" (ric)
- has to be on the CPU to communicate with IB via libfabric
- do need the meb buffer-number stuff to work "soon-ish"
- full data needs to go from GPU to CPU for meb purposes on a fraction of the L1s
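A minimal sketch of the transition-routing idea above, under stated assumptions (all type and member names are hypothetical, not the actual psdaq interfaces): slow transitions are handled on the cpu, fast transitions on the gpu, and a prescaled fraction of L1Accepts also sends its full data back to the cpu for the meb.

```cpp
#include <cstdint>

// Hypothetical transition types; the real drp uses the psdaq transition enumeration.
enum class Transition { Configure, BeginRun, Enable, SlowUpdate, L1Accept };

struct TransitionRouter {
    uint64_t l1Count     = 0;
    uint64_t mebPrescale = 100;   // assumed: every Nth L1Accept's full data goes GPU->CPU for the meb

    // Fast transitions (L1Accept, SlowUpdate) are handled on the gpu;
    // slow transitions (Configure, BeginRun, ...) stay on the cpu.
    bool handleOnGpu(Transition t) const {
        return t == Transition::L1Accept || t == Transition::SlowUpdate;
    }

    // Full event data goes GPU->CPU for meb purposes on a fraction of L1Accepts.
    bool copyFullEventToCpu(Transition t) {
        return t == Transition::L1Accept && (++l1Count % mebPrescale == 0);
    }
};
```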
Algorithms? (Stefano)
- managing calibration constants on the GPU
- cpu has access to these (for the portion of the detector that the drp node sees) in psana via drp-python (as a python dictionary in det.calibconst)
- need to "copy the python dictionary" with cudaMemCopy to the GPU: pedestal/gain/bad-pixels. have to figure out how to do this elegantly. constants can change at every beginrun transition.
- the dictionary is detector-specific.
- could be a std::map? Ric says that std::map may not be implemented on GPU.
- LC, cuSZ (work on cpu decompression in psana)
- ROI (trivial)
- others like radial average with masks (standalone and pyfai)
- cube
- peakfinders (GPU peakfinder8 is in pyFAI)
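A minimal sketch of the constants-upload idea above, assuming the det.calibconst python dictionary has already been flattened into contiguous per-pixel host arrays (the names and flat layout are assumptions, not the actual drp code); flat arrays also sidestep the std::map-on-GPU question.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical flat layout for one detector's constants; refreshed at every
// beginrun transition. The python dict (det.calibconst) would be flattened to
// these host arrays before the copy.
struct CalibConstants {
    float*   d_pedestal = nullptr;   // device pointers, one value per pixel
    float*   d_gain     = nullptr;
    uint8_t* d_badPixel = nullptr;
    size_t   nPixels    = 0;
};

// Allocate (once, or when the detector shape changes) and copy at beginrun.
void uploadCalibConstants(CalibConstants& c,
                          const float* h_pedestal,
                          const float* h_gain,
                          const uint8_t* h_badPixel,
                          size_t nPixels)
{
    if (c.nPixels != nPixels) {
        cudaFree(c.d_pedestal); cudaFree(c.d_gain); cudaFree(c.d_badPixel);
        cudaMalloc(&c.d_pedestal, nPixels * sizeof(float));
        cudaMalloc(&c.d_gain,     nPixels * sizeof(float));
        cudaMalloc(&c.d_badPixel, nPixels * sizeof(uint8_t));
        c.nPixels = nPixels;
    }
    cudaMemcpy(c.d_pedestal, h_pedestal, nPixels * sizeof(float),   cudaMemcpyHostToDevice);
    cudaMemcpy(c.d_gain,     h_gain,     nPixels * sizeof(float),   cudaMemcpyHostToDevice);
    cudaMemcpy(c.d_badPixel, h_badPixel, nPixels * sizeof(uint8_t), cudaMemcpyHostToDevice);
}
```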
LC performance
Update May 29: Upon further analysis (correcting an error in the data-saving format), LC-GPU compression speeds on an A5000 for segments of 352x384 pixels are:
| segments | GB/s |
|---|---|
| 1 | 6 |
| 2 | 12 |
| 4 | 21 |
| 6 | 31 |
| 8 | 38 |
| 16 | 60 |
| 8x16 | 102 |
| 100x16 | 111 |
May 14 and 20, 2024
- Stefano showed 68GB/s LC compression speed with 7.5x compression (raw uint16 vs. compressed float32) for 1 segment. 88GB/s for 16 segments.
- Stefano will get the raw kernel launches to Gabriel so Gabriel can incorporate them into his scripts
- Stefano will post his results to Confluence and we will share them with Martin
- Stefano will also continue to work on cuSZ
- Stefano has verified that different data compresses at the same 68GB/s for 1 segment
- cupy in spack supports up to CUDA 12.1 but not 12.3; Robert is working on it
- Gabriel's results with ped/gain/common-mode using 2-byte "fake" raw data:
- 1 segment (352x384) - 11-12 us -> 22.755 GB/s
- 6 segments (6x352x384) - 40 us -> 40.109 GB/s
- 12 segments (12x352x384) - 72 us -> 44.598 GB/s
- maybe we don't need common-mode in the drp? Do it offline? Gabriel points out that this is different from what is currently being done. cpo has a gut feeling it's still OK.
- have asked Silke for her opinion about leaving out common mode in drp
(From Valerio) In my opinion it could work, and pedestal/gain is not going to improve with time (while common mode can!) so I think what you suggest, Chris, makes perfect sense. However, Silke has a more practical perspective, and I am curious what she thinks
(from Silke) I agree with Valerio. For many detectors, the gain from the common mode is not as crucial and for a given type of physics/detector setup it may not even be necessary. While there have been times where gain/pedestal did improve 'later', those are the cases we will try to monitor and avoid. "common mode only when in fixed gain": is that a LCLS2 behavior? I thought for LCLS1 we did calculate common-mode even when switching/mixed gain, but ignore the lower-gain pixels.
- another advantage of no-common-mode in drp: calibrations become more easily "reversible" (can be redone with better constants later)
- Gabriel will investigate
- multiple streams (see the sketch after this list)
- cpo thinks multiple-events maps best to multiple-streams
- multiple-segments should be done in one stream?
- the writing of data with GPU-direct-storage might be in a different stream?
- combining LC with the calibration (should this be one kernel? or kernels-launching-kernels (dynamic parallelism)? or (less likely) split up across streams?)
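A minimal sketch of the stream layout discussed above (kernel and buffer names are hypothetical): one stream per in-flight event, with calibrate and compress launched back-to-back in that event's stream so they stay ordered without extra synchronization, and any GPU-direct-storage write left to a separate stream.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical kernels: pedestal/gain/bad-pixel calibration, and an LC-style compressor.
__global__ void calibrate(const uint16_t* raw, const float* ped, const float* gain,
                          const uint8_t* bad, float* calib, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) calib[i] = bad[i] ? 0.f : (raw[i] - ped[i]) * gain[i];
}

__global__ void compress(const float* calib, uint8_t* out, size_t n)
{
    // placeholder for the real LC kernel
}

// One stream per in-flight event: multiple events proceed concurrently, while all
// work for a given event stays ordered within that event's stream.
void processEvent(int slot, cudaStream_t* streams,
                  const uint16_t* d_raw, const float* d_ped, const float* d_gain,
                  const uint8_t* d_bad, float* d_calib, uint8_t* d_out, size_t nPixels)
{
    cudaStream_t s = streams[slot];
    int threads = 256;
    int blocks  = static_cast<int>((nPixels + threads - 1) / threads);
    calibrate<<<blocks, threads, 0, s>>>(d_raw, d_ped, d_gain, d_bad, d_calib, nPixels);
    compress <<<blocks, threads, 0, s>>>(d_calib, d_out, nPixels);
    // A GPU-direct-storage write of d_out could be queued in a different stream,
    // gated by a cudaEvent recorded on s.
}
```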
June 3, 2024
- Gabriel looked at launching calibrate/compress in one kernel or two. Two kernels feels better: more modular (#blocks, #threads, use of shared-memory). Continuing to look at streaming.
- cpo should ask TID if they have "TDet" firmware that can receive timing packets. Maybe we don't need this and can just reuse existing TDet firmware?
- have asked TID if the gpu/cpu switchyard exists. Ric would like to see if we can run the daq entirely on the cpu with the datagpu driver.
- need a timing fiber
- lower priority: ideally need mellanox IB fixed
- could also ideally test a second node by adding a kcu1500
Aug 5, 2024
- Stefano is achieving 140GB/s with his standalone angular integration (includes masking) for a 16Mpixel image.
- Stefano will work on pyFAI
- Stefano and cpo talked with Jiannan and Robert:
- suggested using graphs for launching LC kernels (Stefano will look at this)
- were worried about "outliers" getting clipped in LC. But talked with Martin and he said there is no clipping.
- priorities for Robert/Jiannan: make sure the cupy bug is fixed so Stefano can benchmark cuSZ, streams with cuSZ, support cpu decompression
- Ric is working with Mudit
- having difficulty enabling pcie bifurcation on dev06
- we think we need to have two gpus to take advantage of bifurcation
- notional idea of kcu/gpu structure: have multiple kcu's (~6 existing kcu's to get 50GB/s into GPU, only if pcie bus can do it). pcie bandwidths are here: https://en.wikipedia.org/wiki/PCI_Express
- supermicro update utility seems to report non-bifurcated even when we think bifurcation is enabled?
- suggested Mudit write a pgpread-style program for the gpu: Mudit's stuff doesn't work with the gpu yet
- Ric working on drp-gpu code (targeting the cpu side first)
- tried running Mudit's firmware on gpu001 (non-bifurcated). Saw events coming into the cpu with pgpread. Turning on bifurcation worked on gpu001. The dev06 bifurcation issue seems to be specific to that machine (Matt claims it used to work, though).
- Jeremy has an example of CUDA graphs if that ends up being useful (a minimal capture sketch follows this list)
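A minimal sketch of the stream-capture route to CUDA graphs mentioned above (the LC kernels are placeholders; this is not Jeremy's example): capture the per-event launch sequence once, instantiate it, then replay it with a single launch to amortize launch overhead.

```cpp
#include <cuda_runtime.h>

__global__ void lcStage1(float* data, size_t n) { /* placeholder */ }
__global__ void lcStage2(float* data, size_t n) { /* placeholder */ }

// 'stream' must be a non-default stream for capture.
void buildAndRunGraph(cudaStream_t stream, float* d_data, size_t n)
{
    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;

    // Record the launch sequence into a graph instead of executing it immediately.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    lcStage1<<<256, 256, 0, stream>>>(d_data, n);
    lcStage2<<<256, 256, 0, stream>>>(d_data, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, 0);   // CUDA 12 signature

    // Replay: one launch per event instead of one per kernel.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
```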
Aug 12, 2024
- Stefano working on cuSZ (Robert has provided the cupy fix), pyFAI, and pi-slice azimuthal integration with a standalone kernel (a minimal binning sketch follows below)
- Ric getting to the point where he needs Mudit's stuff to work.
- Makefile system is kludgy
- Can get through phase1 of configure
- Any interaction with firmware messes up DMAs to the CPU. Register map assumed to be the same as the interlard test? Only the mode-select address is known for certain.
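Not pyFAI itself, just a minimal sketch of the standalone masked azimuthal/radial binning idea (names are hypothetical; the per-pixel bin indices are assumed to be precomputed on the host from the detector geometry).

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// One thread per pixel: skip masked pixels, accumulate intensity into the
// precomputed radial/azimuthal bin for that pixel.
__global__ void binIntegrate(const float* image, const uint8_t* mask,
                             const int* pixelBin, float* binSum, unsigned* binCount,
                             size_t nPixels)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPixels || mask[i]) return;
    int b = pixelBin[i];
    atomicAdd(&binSum[b], image[i]);
    atomicAdd(&binCount[b], 1u);
}

// Afterwards (host or a small kernel): average per bin = binSum[b] / binCount[b].
```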