
  • Stefano has successfully installed the latest cuSZ, but the results are puzzling and not reproducible.  This is all with Robert's libpressio (Python interface to the different algorithms).  SZ3 on CPU reaches 0.14GB/s.  cuSZ reaches 0.45GB/s if the data starts on the CPU and 0.58GB/s if it is already on the GPU, which seems significantly slower than LC (tested previously without libpressio).  All with a large 4Mpx image.  LC on CPU through libpressio is 0.6GB/s.
  • Jiannan/Robert say time can be saved by reusing the Huffman-encoding tree for every event (partially computed on the CPU?).  This assumes the images are roughly the same, which carries some risk (e.g. ice formation in crystallography).
  • Can do cuSZ measurements with/without pressio, but easier to do with it.
  • Stefano should put all his results (and instructions for reproducing them) on the Benchmarks confluence page: pyFAI, custom angular integration, LC, cuSZ (and SZ3 on CPU), with and without cuda streams.
  • Difficulties installing cuSZ with spack: the major problem was rhel7, whose gcc and git are very old.  Robert figured out how to call a more modern pre-compiled gcc inside spack.
    • Valerio had to patch many packages to get spack working on rhel7
    • conda is also having problems on rhel7
    • Gabriel did some fancy work: he compiled his own glibc with a more modern compiler
    • "the end is coming"
    • spack works naturally on s3df (rhel8), so psana is fine.  just the daq is a problem.
    • going forward: try to use spack on rocky9 only (unless rhel7 works trivially using Valerio's existing package-patching work)
    • feels like we should update gpu003 to rocky9 (it has a kcu, but no IB) and leave gpu004 as rhel7 so Stefano can complete his measurements
    • make new H100 node rocky9
    • Valerio does fancy work with libnl for rdma-core (he rebuilt this with conda because we used more modern compilers with a different ABI).  Maybe we don't need to do this with rocky9/spack?  Hopefully we can just reuse the system libnl and rdma-core (spack supports reuse of system libraries better than conda).
  • Waiting to get in touch with weka about Gabriel's cuFile result
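The GB/s figures above follow from a simple measurement pattern: time one compression pass over a known payload and divide bytes by seconds. A minimal sketch of that arithmetic, using stdlib zlib as a stand-in compressor (the actual measurements use libpressio with SZ3/cuSZ/LC); the 2048x2048 uint16 shape for a "4Mpx image" is an assumption:

```python
import time
import zlib

import numpy as np

def compress_throughput(data: np.ndarray, level: int = 1) -> tuple[float, float]:
    """Time one compression pass; return (throughput in GB/s, compression ratio)."""
    raw = data.tobytes()
    t0 = time.perf_counter()
    comp = zlib.compress(raw, level)
    dt = time.perf_counter() - t0
    return len(raw) / dt / 1e9, len(raw) / len(comp)

# a 4Mpx image as in the notes; 2048x2048 uint16 is an assumed shape/dtype
img = np.random.default_rng(0).integers(0, 1000, size=(2048, 2048), dtype=np.uint16)
gbps, ratio = compress_throughput(img)
print(f"{gbps:.2f} GB/s, ratio {ratio:.1f}x")
```

Reporting the compression ratio alongside throughput matters here, since a faster compressor with a worse ratio may still lose on total data-reduction goals.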
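The Huffman-reuse idea above can be sketched in plain Python: build the code table once from a reference event and reuse it for later events. This is an illustrative sketch, not cuSZ's actual implementation; note it only stays valid while later events contain no symbols unseen in the reference event, which is exactly the ice-formation risk mentioned in the notes:

```python
import heapq
from collections import Counter

def build_codes(data):
    """Build a prefix-free Huffman code table {symbol: bitstring} from `data`."""
    freq = Counter(data)
    if len(freq) == 1:                      # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # merge the two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(data, codes):
    return "".join(codes[s] for s in data)

def decode(bits, codes):
    inv = {c: s for s, c in codes.items()}  # prefix-free, so greedy decode is safe
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return out

# build the table once on a reference event, then reuse it on the next event
event0 = [3, 3, 3, 3, 1, 1, 2, 0]           # hypothetical quantized values
event1 = [3, 3, 1, 2, 2, 0]                 # same symbol set, shifted frequencies
codes = build_codes(event0)                 # computed once (e.g. on the CPU)
assert decode(encode(event1, codes), codes) == event1
```

A reused table is still a valid (decodable) code for later events; the cost of reuse is only lost optimality if the symbol frequencies drift.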

Jan. 6, 2025

  • Stefano is looking into the cuSZ performance issues.  With LC-GPU he gets 60GB/s with 4 streams, and 6GB/s with 1 stream and 1 segment.  Two questions:
    • why does performance scale better than the number of streams?
    • why is 1-stream 1-segment cuSZ so much worse (0.6GB/s) than LC-GPU (6GB/s)?
    • some possible reasons suggested: compiler options differing between spack and conda?  an incorrect timing calculation for LC?  an error in splitting the data up into single segments?
    • could look at the performance in the profiler, although this will underestimate the eventual performance because of profiler overhead.
  • Next priorities for Stefano: see if we can improve angular-integration performance to 50GB/s without batching events (which we can do because the outputs are "separable" into events, but it adds complexity).  Note that SZ compression with batches of events is NOT "separable".  Another project is peak-finding performance with peakfinder8 in pyFAI.
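One of the suggested failure modes above is an error in splitting the data into single segments. A minimal round-trip check for contiguous segmentation can rule that out in isolation; this sketch uses stdlib zlib as a stand-in for the per-stream compressor, and the 4-segment count (one per stream) is an assumption:

```python
import zlib

import numpy as np

def split_round_trip(data: np.ndarray, n_segments: int) -> bool:
    """Split into contiguous near-equal segments, compress each independently
    (as one would per stream), then verify the reassembled result."""
    segments = np.array_split(data.ravel(), n_segments)
    compressed = [zlib.compress(s.tobytes()) for s in segments]
    restored = np.concatenate([
        np.frombuffer(zlib.decompress(c), dtype=data.dtype) for c in compressed
    ])
    return bool(np.array_equal(restored, data.ravel()))

# deterministic 4Mpx test pattern (uint16 wraps mod 65536, which is fine here)
img = np.arange(2048 * 2048, dtype=np.uint16).reshape(2048, 2048)
assert split_round_trip(img, 4)     # 4 segments, e.g. one per stream
```

If the round trip passes but the measured single-segment throughput is still anomalous, the segmentation hypothesis can be set aside in favor of the timing or compiler-option hypotheses.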