Page History
...
Update May 29: Upon further analysis (after correcting an error in the data-saving format), the LC-GPU compression speed on an A5000 for segments of 352x384 pixels is given below. These numbers are without streams; the numbers with streams are better. A sketch of how such throughputs can be timed follows the table.
| segments | GB/s |
|---|---|
| 1 | 6 |
| 2 | 12 |
| 4 | 21 |
| 6 | 31 |
| 8 | 38 |
| 16 | 60 |
| 8x16 | 102 |
| 100x16 | 111 |
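A minimal sketch of how throughput numbers like these can be measured, assuming a placeholder compress_segments kernel in place of the real LC launch: time the launch with CUDA events and divide the raw uint16 input size (segments x 352 x 384 x 2 bytes) by the elapsed time.

```c
// Hedged sketch: time a (placeholder) per-segment compression kernel and
// report raw-input throughput in GB/s. compress_segments is NOT the real
// LC kernel; it only stands in for "one launch covering nseg segments".
#include <cstdio>
#include <cuda_runtime.h>

__global__ void compress_segments(const unsigned short* raw, float* out, int nseg)
{
    // placeholder body: one thread per pixel, trivial "compression"
    long long npix = (long long)nseg * 352 * 384;
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < npix) out[i] = (float)raw[i];
}

int main()
{
    const int nseg = 16;                          // e.g. one of the table rows
    const size_t npix = (size_t)nseg * 352 * 384;
    const size_t nbytes = npix * sizeof(unsigned short);

    unsigned short* d_raw;  float* d_out;
    cudaMalloc(&d_raw, nbytes);
    cudaMalloc(&d_out, npix * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    compress_segments<<<(unsigned)((npix + 255) / 256), 256>>>(d_raw, d_out, nseg);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    // throughput counted on the raw uint16 input, as in the table above
    printf("%d segments: %.1f GB/s\n", nseg, nbytes / (ms * 1e-3) / 1e9);

    cudaFree(d_raw); cudaFree(d_out);
    return 0;
}
```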
May 14 and 20, 2024
- Stefano showed an LC compression speed of 68GB/s for 1 segment (later note: this feels too high, perhaps it was 6-8GB/s?) with 7.5x compression (raw uint16 vs. compressed float32), and 88GB/s for 16 segments.
- Stefano will get the raw kernel launches to Gabriel so Gabriel can incorporate them into his scripts
- Stefano will post his results to confluence and we will share with Martin
- Stefano will also continue to work on cuSZ
- Stefano has verified that different data compresses at the same 68GB/s for 1 segment
- cupy in spack supports up to CUDA 12.1 but not 12.3; Robert is working on it
- Gabriel's results with pedestal/gain/common-mode corrections using 2-byte "fake" raw data (see the calibration-kernel sketch after this list):
- 1 segment (352x384) - 11-12 us -> 22.755 GB/s
- 6 segments (6x352x384) - 40 us -> 40.109 GB/s
- 12 segments (12x352x384) - 72 us -> 44.598 GB/s
- maybe we don't need common-mode in the drp? do it offline? Gabriel points out that this is different from what is currently being done. cpo has a gut feeling it's still OK.
- have asked Silke for her opinion about leaving out common mode in drp
(From Valerio) In my opinion it could work, and pedestal/gain is not going to improve with time (while common mode can!) so I think what you suggest, Chris, makes perfect sense. However, Silke has a more practical perspective, and I am curious what she thinks
(from Silke) I agree with Valerio. For many detectors, the gain from the common mode is not as crucial, and for a given type of physics/detector setup it may not even be necessary. While there have been times where gain/pedestal did improve 'later', those are the cases we will try to monitor and avoid. "Common mode only when in fixed gain": is that an LCLS2 behavior? I thought for LCLS1 we calculated common mode even when switching/mixed gain, but ignored the lower-gain pixels.
- another advantage of no-common-mode in drp: calibrations become more easily "reversible" (can be redone with better constants later)
- Gabriel will investigate
- multiple streams (see the stream sketch after this list)
- cpo thinks multiple-events maps best to multiple-streams
- multiple-segments should be done in one stream?
- the writing of data with GPU-direct-storage might be in a different stream?
- combining LC with the calibration (should this be one kernel? or kernels-launching-kernels (dynamic parallelism)? or (less likely) split up across streams?)
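A minimal sketch of the kind of pedestal/gain correction being timed above, assuming uint16 raw input and float32 output and deliberately omitting common mode, in line with the no-common-mode-in-drp idea; the kernel and names are illustrative, not Gabriel's actual code.

```c
// Hedged sketch of a pedestal/gain correction on uint16 raw data producing
// float32 calibrated output. Common mode is deliberately omitted, matching
// the "no common-mode in the drp" idea discussed above. Names are made up.
#include <cuda_runtime.h>

#define SEG_ROWS 352
#define SEG_COLS 384

__global__ void calibrate(const unsigned short* __restrict__ raw,
                          const float* __restrict__ pedestal,
                          const float* __restrict__ gain,
                          float* __restrict__ out,
                          int nseg)
{
    long long npix = (long long)nseg * SEG_ROWS * SEG_COLS;
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < npix) {
        // calibrated = (raw - pedestal) * gain, per pixel
        out[i] = ((float)raw[i] - pedestal[i]) * gain[i];
    }
}

// Example launch for nseg segments (raw, pedestal, gain, out are device pointers):
//   long long npix = (long long)nseg * SEG_ROWS * SEG_COLS;
//   calibrate<<<(unsigned)((npix + 255) / 256), 256>>>(raw, pedestal, gain, out, nseg);
```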
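One possible shape of the "multiple events map to multiple streams" idea, with all segments of one event kept in a single stream; the kernels are trivial stand-ins and the round-robin mapping is an assumption, not the agreed design.

```c
// Hedged sketch: one CUDA stream per in-flight event. Calibration and a
// placeholder compression kernel for one event are launched into that
// event's stream so that different events can overlap on the GPU.
// The kernels here are trivial stand-ins, not the real LC/calibration code.
#include <cuda_runtime.h>

#define NSTREAMS 4   // number of events processed concurrently (assumption)

__global__ void calibrate_event(const unsigned short* raw, float* out, long long npix)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < npix) out[i] = (float)raw[i];          // stand-in for ped/gain correction
}

__global__ void compress_event(const float* calib, float* out, long long npix)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < npix) out[i] = calib[i];               // stand-in for LC compression
}

// d_raw[e], d_calib[e], d_comp[e] are device buffers belonging to event e.
void process_events(unsigned short** d_raw, float** d_calib, float** d_comp,
                    long long npix, int nevents)
{
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    int blocks = (int)((npix + 255) / 256);
    for (int e = 0; e < nevents; ++e) {
        cudaStream_t st = streams[e % NSTREAMS];   // round-robin events onto streams
        // all segments of one event stay in the same stream, per the notes above
        calibrate_event<<<blocks, 256, 0, st>>>(d_raw[e], d_calib[e], npix);
        compress_event<<<blocks, 256, 0, st>>>(d_calib[e], d_comp[e], npix);
    }

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```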
...
Dec. 2, 2024
- Gabriel is seeing 1-2GB/s on Ampere with libCuFile, with some variability (writing a single file). Will try gpu004 (less variable filesystem usage, and it has IB).
- Having trouble installing libpressio with spack; working with Robert. Will continue to work on angular integration profiling if blocked on spack.
Dec. 9, 2024
- Stefano working with Robert on libpressio installation issues
- Gabriel is putting the cufile results on confluence. Tried gpu004, getting 1-2GB/s there as well. Can write at 4GB/s from the GPU to pinned memory. cufile was straightforward to use; the only way to configure it is with json (e.g. block size). Also using the Nvidia-provided gdsio application. Looking at cuda-graphs as well: the first event that "records" the graph takes a long time (100's of microseconds), and we don't think there is a way to keep the recorded graph across daq restarts. Will give the cufile results to the weka people when they are ready on confluence. (A hedged cufile write sketch follows this list.)
- Ric got the daq to work with transitions through the GPU. Still a bunch of stuff is needed to support multi-fpga, and some important decisions need to be made. Matt is making changes to XpmDetector that may impact this (only in the short term while we use the emulator, since XpmDetector is used for emulation). The current approach is to event-build the various fpga streams (and eventually go to one file), but we need to decide how to handle configure data, for example. How do we break the wait if we need to reconfigure? Maybe use a spin-loop? cpo votes for a multi-person design brainstorming session to discuss the issues that Ric has found (with the additional advantage that it educates multiple people). Ric suggests: maybe run N drp's on a box (one per fpga) so we don't event-build multiple fpga's? Disadvantage: per-event data becomes smaller, so algorithms become less efficient unless we (temporarily) create batches of multiple events (we can do this for angular integration, but not for SZ, since the output is not "separable" into individual events). How do we handle differences between the XpmDetector (used for emulation) and the real epixUHR? Maybe have 1 drp process launch the 5 (or N) processes?
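A hedged sketch of what a GPUDirect Storage write with libcufile can look like; the path, chunk size, and omission of error checking are assumptions, and this is not Gabriel's benchmark code (he used cufile directly and the Nvidia gdsio tool).

```c
// Hedged sketch of a GPUDirect Storage write with libcufile: register the
// file and the device buffer, then write straight from GPU memory to disk.
// Path, size, and the lack of error handling are illustrative assumptions.
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>
#include <cuda_runtime.h>
#include <cufile.h>

int main()
{
    const size_t nbytes = 64 << 20;                 // 64 MB chunk (assumption)

    void* d_buf = nullptr;
    cudaMalloc(&d_buf, nbytes);

    cuFileDriverOpen();                             // bring up the GDS driver

    int fd = open("/path/to/output.bin", O_CREAT | O_WRONLY | O_DIRECT, 0644);

    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);              // register the file with cufile
    cuFileBufRegister(d_buf, nbytes, 0);            // register the GPU buffer

    // write nbytes from GPU memory (device offset 0) to file offset 0
    ssize_t written = cuFileWrite(fh, d_buf, nbytes, 0, 0);
    printf("wrote %zd bytes\n", written);

    cuFileBufDeregister(d_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    cudaFree(d_buf);
    return 0;
}
```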
Dec. 16, 2024
- Stefano has successfully installed the latest cuSZ, but the results are puzzling and not reproducible. This is all with Robert's libpressio (a python interface to different algorithms). SZ3 on the CPU gives 0.14GB/s. cuSZ gives 0.45GB/s if the data starts on the CPU and 0.58GB/s if it is already on the GPU. This seems significantly slower than LC (tested without libpressio previously). Using a large 4Mpx image. LC on the CPU with pressio is 0.6GB/s.
- Jiannan/Robert say time can be saved by reusing the Huffman-encoding tree for every event (partially computed on the CPU?). This assumes the images are roughly the same (some risk, e.g. with ice formation in crystallography).
- Can do cuSZ measurements with/without pressio, but easier to do with it.
- Stefano should put all his results (and instructions for reproducing) on the Benchmarks confluence page. pyFAI, custom angular integration, LC, cuSZ (and SZ3 on CPU). With/without cuda streams.
- Difficulties installing cuSZ with spack: major problem was rhel7. gcc/git were very old. Robert figured out how to call more modern pre-compiled gcc inside spack.
- Valerio had to patch many packages to get spack working on rhel7
- conda is also having problems with rhel7
- Gabriel did some fancy stuff: compiled his own glibc with a more modern compiler
- "the end is coming"
- spack works naturally on s3df (rhel8), so psana is fine. just the daq is a problem.
- going forward: try to use spack on rocky9 only (unless rhel7 works trivially using Valerio's existing package-patching work)
- feels like we should update gpu003 to rocky9 (it has a kcu, but no IB) and leave gpu004 on rhel7 so Stefano can complete his measurements
- make new H100 node rocky9
- Valerio does fancy stuff with libnl for rdma-core (he rebuilt it with conda because we used more modern compilers with a different ABI). Maybe we don't need to do this with rocky9/spack? Hopefully we can just reuse the system libnl/rdma-core (spack supports reuse of system libraries better than conda).
- Waiting to get in touch with weka about Gabriel's cuFile result
Jan. 6, 2025
- Stefano is looking into the cusz performance issues. With LC-GPU he gets 60GB/s with 4 streams and 6GB/s with 1 stream (1 segment in both cases). Two questions:
- why does performance scale better than the number of streams?
- why is 1-stream 1-segment cusz so much worse (0.6GB/s) than LC-GPU (6GB/s)?
- some possible reasons that were suggested: compiler options in spack/conda? timing calculation incorrect for LC? error in the splitting up of the data into single-segments?
- could look at the performance in the profiler, although this will underestimate the eventual performance because of profiler overhead.
- next priorities for Stefano: see if we can improve angular integration performance to 50GB/s without batching events (batching is possible because the outputs are "separable" into individual events, but it adds complexity). Note that SZ compression with batches of events is NOT "separable". Another project is peak-finding performance with peakfinder8 in pyFAI. (A hedged angular-integration sketch follows this list.)
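A minimal sketch of one way a custom angular (azimuthal) integration can be done on the GPU, assuming a precomputed per-pixel radial-bin index; this is an illustration of the idea only, not Stefano's or pyFAI's implementation.

```c
// Hedged sketch of 1D angular (azimuthal) integration: pixels are assigned a
// precomputed radial-bin index; intensities and counts are accumulated per
// bin with atomics, and the profile is the per-bin mean. Not the pyFAI or
// in-house implementation, just an illustration of the approach.
#include <cuda_runtime.h>

__global__ void integrate(const float* __restrict__ image,     // calibrated pixels
                          const int*   __restrict__ bin_index, // precomputed per pixel (-1 = masked)
                          float* __restrict__ bin_sum,
                          unsigned int* __restrict__ bin_count,
                          long long npix)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < npix) {
        int b = bin_index[i];
        if (b >= 0) {
            atomicAdd(&bin_sum[b], image[i]);
            atomicAdd(&bin_count[b], 1u);
        }
    }
}

__global__ void normalize(const float* __restrict__ bin_sum,
                          const unsigned int* __restrict__ bin_count,
                          float* __restrict__ profile, int nbins)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b < nbins)
        profile[b] = bin_count[b] ? bin_sum[b] / bin_count[b] : 0.f;
}

// Usage (device pointers; bin_sum/bin_count zeroed per event):
//   integrate<<<(unsigned)((npix + 255) / 256), 256>>>(d_image, d_bins, d_sum, d_cnt, npix);
//   normalize<<<(nbins + 255) / 256, 256>>>(d_sum, d_cnt, d_profile, nbins);
```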
Jan. 13, 2025
- Valerio is going to move psana2 on s3df to spack in the next few weeks
- Ric has the "graphs" approach to kernel launching working. He is tracking down a tricky segfault after 300 events. (A hedged CUDA-graph capture sketch is at the end of these notes.)
- Stefano working on streams. Having trouble reproducing previous compilation: LC is broken with spack (unhappy with flags). Getting advice from Gabriel and Valerio. Looks like old versions of compiler are being picked up (gcc4). Valerio and Gabriel provided guidance for how to fix that.
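A minimal sketch of stream-capture-based CUDA graphs, the mechanism referred to above: the per-event kernel sequence is recorded once (the slow first "record"), instantiated, and then replayed cheaply per event. The kernels, sizes, and event count are placeholders, not the daq code.

```c
// Hedged sketch of CUDA graph capture and replay: the per-event kernel
// sequence is recorded once (the expensive "first event"), instantiated,
// and then relaunched cheaply for each subsequent event.
#include <cuda_runtime.h>

__global__ void step_a(float* d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.f; }
__global__ void step_b(float* d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.f; }

int main()
{
    const int n = 352 * 384;                        // one segment (placeholder)
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the per-event kernel sequence once (this is the slow part).
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step_a<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    step_b<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);          // CUDA 12 signature

    // Replay the whole sequence per event with a single cheap launch.
    for (int event = 0; event < 1000; ++event)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```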