Update May 29: After further analysis (correcting an error in the data-saving format), LC-GPU compression speeds on an A5000 for segments of 352x384 pixels are:
| segments | GB/s |
|---|---|
| 1 | 6 |
| 2 | 12 |
| 4 | 21 |
| 6 | 31 |
| 8 | 38 |
| 16 | 60 |
| 8x16 | 102 |
| 100x16 | 111 |
May 14 and 20, 2024
- Stefano showed 68 GB/s LC compression speed with 7.5x compression (raw uint16 vs. compressed float32) for 1 segment, and 88 GB/s for 16 segments.
- Stefano will get the raw kernel launches to Gabriel so Gabriel can incorporate them into his scripts
- Stefano will post his results to Confluence and we will share them with Martin
- Stefano will also continue to work on cuSZ
- Stefano has verified that different data compresses at the same 68GB/s for 1 segment
- cupy in Spack supports up to CUDA 12.1 but not 12.3; Robert is working on it
- Gabriel's results with ped/gain/common-mode corrections using 2-byte "fake" raw data (a calibration-kernel sketch follows these timings):
- 1 segment (352x384) - 11-12 us -> 22.755 GB/s
- 6 segments (6x352x384) - 40 us -> 40.109 GB/s
- 12 segments (12x352x384) - 72 us -> 44.598 GB/s
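A pedestal/gain correction of this kind is a per-pixel operation that maps naturally onto one GPU thread per pixel, which is why the throughput scales with segment count in the timings above. The following is only a minimal CUDA sketch, not Gabriel's actual kernel; the function name, the float32 output, and the per-pixel constant layout are assumptions:

```cuda
// Minimal sketch of a per-pixel pedestal/gain correction (hypothetical names and layout).
// One thread handles one pixel; segments are 352x384 pixels.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void calibrate(const uint16_t* __restrict__ raw,   // raw ADU values
                          const float* __restrict__ pedestal, // per-pixel pedestal
                          const float* __restrict__ gain,     // per-pixel gain
                          float* __restrict__ out,            // calibrated output (float32)
                          int npixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < npixels) {
        // subtract the pedestal, then scale by the per-pixel gain
        out[i] = (static_cast<float>(raw[i]) - pedestal[i]) * gain[i];
    }
}

// Example launch for 6 segments of 352x384 pixels:
//   int npixels = 6 * 352 * 384;
//   int threads = 256;
//   int blocks  = (npixels + threads - 1) / threads;
//   calibrate<<<blocks, threads>>>(d_raw, d_ped, d_gain, d_out, npixels);
```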
- maybe we don't need common-mode correction in the drp? do it offline instead? Gabriel points out that this is different from what is currently being done; cpo has a gut feeling it's still OK (see the common-mode sketch after this list)
- have asked Silke for her opinion about leaving out common mode in drp
(from Valerio) In my opinion it could work, and pedestal/gain is not going to improve with time (while common mode can!), so I think what you suggest, Chris, makes perfect sense. However, Silke has a more practical perspective, and I am curious what she thinks.
(from Silke) I agree with Valerio. For many detectors, the gain from the common mode is not as crucial and for a given type of physics/detector setup it may not even be necessary. While there have been times where gain/pedestal did improve 'later', those are the cases we will try to monitor and avoid. "common mode only when in fixed gain": is that a LCLS2 behavior? I thought for LCLS1 we did calculate common-mode even when switching/mixed gain, but ignore the lower-gain pixels.
- another advantage of no common-mode in the drp: calibrations become more easily "reversible" (they can be redone later with better constants)
- Gabriel will investigate
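For reference, one common flavor of common-mode correction subtracts a per-row offset estimated from pixels that carry no signal. The sketch below is a generic CUDA illustration of that idea; the row granularity, the signal threshold, and the names are assumptions, not the detector's actual algorithm:

```cuda
// Minimal sketch of a per-row common-mode correction (hypothetical threshold and granularity).
// One block handles one row of calibrated pixels; the mean of pixels below `threshold`
// is taken as the common-mode estimate and subtracted from the whole row.
__global__ void common_mode_rows(float* data, int ncols, float threshold)
{
    // assumes blockDim.x == 256 (a power of two)
    __shared__ float sums[256];
    __shared__ int   counts[256];
    float* row = data + (size_t)blockIdx.x * ncols;   // one block per row

    // each thread accumulates the low-signal pixels it owns in this row
    float sum = 0.f;
    int   cnt = 0;
    for (int c = threadIdx.x; c < ncols; c += blockDim.x) {
        if (row[c] < threshold) { sum += row[c]; ++cnt; }
    }
    sums[threadIdx.x]   = sum;
    counts[threadIdx.x] = cnt;
    __syncthreads();

    // tree reduction over the block to get the row's sum and count
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            sums[threadIdx.x]   += sums[threadIdx.x + s];
            counts[threadIdx.x] += counts[threadIdx.x + s];
        }
        __syncthreads();
    }

    // subtract the estimated common mode from every pixel in the row
    float cm = (counts[0] > 0) ? sums[0] / counts[0] : 0.f;
    for (int c = threadIdx.x; c < ncols; c += blockDim.x)
        row[c] -= cm;
}

// Example launch: one block per 384-pixel row, 256 threads per block:
//   common_mode_rows<<<nrows, 256>>>(d_calibrated, 384, /*threshold=*/ 50.f);
```

Doing this offline keeps the drp path to the simple per-pixel pedestal/gain kernel above, which is part of the "reversible" argument.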
- multiple streams
- cpo thinks multiple events map best to multiple streams (see the sketch after this list)
- multiple-segments should be done in one stream?
- the writing of data with GPU-direct-storage might be in a different stream?
- combining LC with the calibration (should this be one kernel? or kernels-launching-kernels (dynamic parallelism)? or (less likely) split up across streams?)
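A minimal CUDA sketch of the event-per-stream idea: calibrate and compress are launched as separate kernels back-to-back in the same stream, so they stay ordered within an event while different events can overlap on the GPU. The kernel names, arguments, and stream count are placeholders, not the real LC or calibration code:

```cuda
// Minimal sketch: one CUDA stream per in-flight event, with calibrate then compress
// queued in that stream (placeholder kernels; host-to-device transfers omitted).
#include <cuda_runtime.h>
#include <cstdint>

__global__ void calibrate(const uint16_t* raw, float* calib, int n) { /* ped/gain per pixel */ }
__global__ void compress(const float* calib, uint8_t* out, int n)   { /* LC compression */ }

int main()
{
    const int kStreams = 4;                 // number of events in flight (assumed)
    const int npixels  = 16 * 352 * 384;    // 16 segments per event
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&streams[i]);

    // device buffers, one set per in-flight event (error checks omitted)
    uint16_t* d_raw[kStreams]; float* d_calib[kStreams]; uint8_t* d_comp[kStreams];
    for (int i = 0; i < kStreams; ++i) {
        cudaMalloc(&d_raw[i],   npixels * sizeof(uint16_t));
        cudaMalloc(&d_calib[i], npixels * sizeof(float));
        cudaMalloc(&d_comp[i],  npixels * sizeof(uint8_t));
    }

    const int threads = 256;
    const int blocks  = (npixels + threads - 1) / threads;
    for (int evt = 0; evt < 100; ++evt) {
        int s = evt % kStreams;             // round-robin events over streams
        // within one stream the two kernels run in order (calibrate before compress),
        // but kernels from different streams/events can overlap on the GPU
        calibrate<<<blocks, threads, 0, streams[s]>>>(d_raw[s], d_calib[s], npixels);
        compress <<<blocks, threads, 0, streams[s]>>>(d_calib[s], d_comp[s], npixels);
        // a GPU-direct-storage write of d_comp[s] could be queued here, in this
        // stream or a separate one, as discussed above
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < kStreams; ++i) cudaStreamDestroy(streams[i]);
    return 0;
}
```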
June 3, 2024
- Gabriel looked at launching calibrate/compress as one kernel or two. Two kernels feel better: more modular (each can choose its own #blocks, #threads, and use of shared memory). Continuing to look at streaming.
- cpo should ask TID if they have "TDet" firmware that can receive timing packets. Maybe we don't need this and can just reuse existing TDet firmware?
- have asked TID if the gpu/cpu switchyard exists. Ric would like to see if we can run the daq entirely on the cpu with the datagpu driver.
- need a timing fiber
- lower priority: ideally need mellanox IB fixed
- could also ideally test a second node by adding a kcu1500