LC performance
Update May 29: Upon further analysis (correcting for an error in the data-saving format), LC-GPU compression speeds on an A5000 for segments of 352x384 pixels are:
| segments | GB/s |
|---|---|
| 1 | 6 |
| 2 | 12 |
| 4 | 21 |
| 6 | 31 |
| 8 | 38 |
| 16 | 60 |
| 8x16 | 102 |
| 100x16 | 111 |
May 14 and 20, 2024
- Stefano showed 68 GB/s LC compression speed with a 7.5x compression ratio (raw uint16 vs. compressed float32) for 1 segment, and 88 GB/s for 16 segments.
- Stefano will get the raw kernel launches to Gabriel so Gabriel can incorporate them into his scripts
- Stefano will post his results to Confluence and we will share them with Martin
- Stefano will also continue to work on cuSZ
- Stefano has verified that different data sets compress at the same 68 GB/s for 1 segment
- cupy in spack supports CUDA up to 12.1 but not 12.3; Robert is working on it
- Gabriel's results with ped/gain/common-mode using 2-byte "fake" raw data (see the calibration sketch after this list):
- 1 segment (352x384) - 11-12 us -> 22.755 GB/s
- 6 segments (6x352x384) - 40 us -> 40.109 GB/s
- 12 segments (12x352x384) - 72 us -> 44.598 GB/s
- maybe we don't need common-mode in the drp? do it offline (see the common-mode sketch after this list)? Gabriel points out that this is different from what is currently being done. cpo has a gut feeling it's still OK.
- have asked Silke for her opinion about leaving out common mode in drp
(From Valerio) In my opinion it could work, and pedestal/gain is not going to improve with time (while common mode can!), so I think what you suggest, Chris, makes perfect sense. However, Silke has a more practical perspective, and I am curious what she thinks.
(from Silke) I agree with Valerio. For many detectors, the gain from the common mode is not as crucial, and for a given type of physics/detector setup it may not even be necessary. While there have been times where gain/pedestal did improve 'later', those are the cases we will try to monitor and avoid. "common mode only when in fixed gain": is that an LCLS2 behavior? I thought for LCLS1 we did calculate common mode even when switching/mixed gain, but ignore the lower-gain pixels.
- another advantage of no-common-mode in drp: calibrations become more easily "reversible" (can be redone with better constants later)
- Gabriel will investigate
- multiple streams
- cpo thinks multiple events map best onto multiple streams
- multiple-segments should be done in one stream?
- the writing of data with GPU-direct-storage might be in a different stream?
- combining LC with the calibration (should this be one kernel? or kernels-launching-kernels (dynamic parallelism)? or (less likely) split up across streams?)
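Since the ped/gain pass is a simple per-pixel operation, here is a minimal CUDA sketch of what such a calibration kernel could look like; the names, the per-pixel float32 constants layout, and the launch configuration are assumptions, not Gabriel's actual code:

```cuda
// Minimal pedestal/gain calibration sketch: uint16 ADU in, float32 out.
// Common mode is deliberately omitted, per the discussion above.
#include <cstdint>

__global__ void calibrate(const uint16_t* __restrict__ raw,
                          const float*    __restrict__ ped,
                          const float*    __restrict__ gain,
                          float*          __restrict__ calib,
                          int nPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPixels)
        calib[i] = (raw[i] - ped[i]) * gain[i];
}

// Launch for one 352x384 segment:
//   int n = 352 * 384;
//   calibrate<<<(n + 255) / 256, 256>>>(d_raw, d_ped, d_gain, d_calib, n);
```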
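If common mode does stay in the drp (or gets redone offline with better constants), a row-wise pass might look like the sketch below. It uses the mean of pixels below a signal threshold as a stand-in for the median used by some LCLS1 algorithms, and the skipping of low-gain pixels Silke mentions is left out for brevity; everything here is a hypothetical illustration of the access pattern:

```cuda
// Hypothetical row-wise common-mode sketch: one block per row, subtract
// the mean of "dark" pixels (|value| below maxSignal). blockDim.x must
// be a power of two for the tree reduction.
__global__ void commonModeRow(float* calib, int rowLen, float maxSignal)
{
    extern __shared__ float sdata[];  // blockDim.x sums + blockDim.x counts
    float* ssum = sdata;
    float* scnt = sdata + blockDim.x;

    float* row = calib + blockIdx.x * rowLen;
    float sum = 0.f, cnt = 0.f;
    for (int i = threadIdx.x; i < rowLen; i += blockDim.x) {
        float v = row[i];
        if (fabsf(v) < maxSignal) { sum += v; cnt += 1.f; }
    }
    ssum[threadIdx.x] = sum;
    scnt[threadIdx.x] = cnt;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // reduce within block
        if (threadIdx.x < s) {
            ssum[threadIdx.x] += ssum[threadIdx.x + s];
            scnt[threadIdx.x] += scnt[threadIdx.x + s];
        }
        __syncthreads();
    }

    float cm = (scnt[0] > 0.f) ? ssum[0] / scnt[0] : 0.f;
    for (int i = threadIdx.x; i < rowLen; i += blockDim.x)
        row[i] -= cm;
}

// Launch, one block per 384-pixel row of each 352x384 segment:
//   commonModeRow<<<nRows, 128, 2 * 128 * sizeof(float)>>>(d_calib, 384, 100.f);
```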
June 3, 2024
- Gabriel looked at launching calibrate/compress as one kernel or two. Two kernels feel better: more modular (#blocks, #threads, use of shared memory). Continuing to look at streaming (see the streams sketch after this list).
- cpo should ask TID if they have "TDet" firmware that can receive timing packets. Maybe we don't need this and can just reuse existing TDet firmware?
- have asked TID if the gpu/cpu switchyard exists. Ric would like to see if we can run the daq entirely on the cpu with the datagpu driver.
- need a timing fiber
- lower priority: ideally need the Mellanox IB fixed
- could also ideally test a second node by adding a KCU1500
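A sketch of the event-per-stream idea combined with the two-kernel split: each event's H2D copy, calibration, and compression are ordered within one stream, so different events can overlap. compressSegments is only a placeholder for the LC kernel, per-event device buffers are assumed (one set per stream), and all names are hypothetical:

```cuda
#include <cstdint>
#include <cuda_runtime.h>

__global__ void calibrate(const uint16_t*, const float*, const float*,
                          float*, int);                        // sketched earlier
__global__ void compressSegments(const float*, uint8_t*, int); // LC placeholder

const int kStreams = 8;          // events in flight
cudaStream_t streams[kStreams];  // created once with cudaStreamCreate

void processEvent(int evt, const uint16_t* h_raw, size_t rawBytes,
                  uint16_t* d_raw, const float* d_ped, const float* d_gain,
                  float* d_calib, uint8_t* d_comp, int nPixels)
{
    cudaStream_t s = streams[evt % kStreams];

    // Copy, calibrate, compress: ordered within this stream, but free to
    // overlap with other events in other streams. d_raw/d_calib/d_comp
    // must be per-stream buffers to avoid races.
    cudaMemcpyAsync(d_raw, h_raw, rawBytes, cudaMemcpyHostToDevice, s);
    calibrate<<<(nPixels + 255) / 256, 256, 0, s>>>(d_raw, d_ped, d_gain,
                                                    d_calib, nPixels);
    compressSegments<<<16, 256, 0, s>>>(d_calib, d_comp, nPixels);

    // The compressed result stays on the GPU; only a fraction of L1s
    // would be copied back for the meb, with the rest written via GDS.
}
```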
July 5, 2024
Met with Stefano and Ric (Gabriel was absent) to set up a roadmap of future projects.
Priorities:
- get gpu working in daq-tst-dev06
- get first-pass TDet firmware from Mudit
drp-executable infrastructure?
- makefile support (gabriel?)
- switch to spack (gabriel?)
- gpu-direct to weka (gabriel?)
- dgram creation/handling not optimal for gpu? use cpu? (gabriel)
- reverse BEB-packet iteration not optimal for gpu?
- tid fpga->gpu data transfer
- slow transitions (e.g. configure) go to cpu
- fast transitions (l1accept, slowupdate) go to gpu
- (later) have to make the teb/meb work with "trigger primitives"
  - has to be on the CPU to communicate with IB via libfabric
  - do need the meb buffer-number stuff to work "soon-ish"
  - full data needs to go from GPU to CPU for meb purposes on a fraction of the L1s
algorithms? (stefano)
- LC, cuSZ (work on cpu decompression in psana)
- ROI
- others like radial average with masks (look at pyFAI), cube? probably not that many? (see the radial-average sketch below)
- peakfinders (peakfinder8 might already run on a GPU)
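For the radial average, one natural GPU mapping is histogram-style: precompute a per-pixel radial bin index once from the detector geometry (as pyFAI does on the CPU), then have each unmasked pixel atomically add into its bin. A hypothetical sketch:

```cuda
// Masked radial average sketch: binIdx maps each pixel to a radial bin
// (-1 = masked), computed once from geometry. sum/count are zeroed
// before each event. Hypothetical names throughout.
__global__ void radialSum(const float*  __restrict__ calib,
                          const int*    __restrict__ binIdx,
                          float*        __restrict__ sum,    // length nBins
                          unsigned int* __restrict__ count,  // length nBins
                          int nPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPixels) {
        int b = binIdx[i];
        if (b >= 0) {  // skip masked pixels
            atomicAdd(&sum[b], calib[i]);
            atomicAdd(&count[b], 1u);
        }
    }
}

// Afterwards: avg[b] = sum[b] / max(count[b], 1u), on GPU or CPU.
```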