Executive Summary of cuNumeric Studies

Transcribed by cpo listening to Seshu on Jan. 31, 2024

psana on GPU

Advantage of cunumeric vs. cupy: with cupy, psana needed invasive changes, and the resulting code would no longer work on both CPU and GPU. The cunumeric changes were small, and the same code works on both CPU and GPU. A detail: we don't use the Legion/cunumeric data distribution (e.g. GASNet), since we use MPI to distribute data.
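A minimal sketch of why the cunumeric changes can stay small: because cunumeric exposes a NumPy-compatible API, the backend can be chosen at import time and the array code left unchanged. The helper `subtract_pedestal` and the `BACKEND` variable are hypothetical names used for illustration only; this is not the actual psana change.

```python
# Prefer cunumeric when it is installed; otherwise fall back to NumPy.
try:
    import cunumeric as np  # NumPy-compatible API, can run on GPU
    BACKEND = "cunumeric"
except ImportError:
    import numpy as np
    BACKEND = "numpy"


def subtract_pedestal(raw, pedestal):
    """Hypothetical helper: the same array code runs on either backend."""
    raw = np.array(raw, dtype=np.float32)
    pedestal = np.array(pedestal, dtype=np.float32)
    return raw - pedestal


corrected = subtract_pedestal([[10, 12], [11, 13]], [[1, 2], [1, 3]])
print(BACKEND, corrected)
```

With cupy, by contrast, arrays must be moved explicitly between host and device, which is one reason that port is more invasive.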

Seshu didn't see a significant performance increase with either cupy or cunumeric, because det.calib doesn't execute enough instructions per memory transfer. cunumeric is still implementing more functionality. We also need to rewrite some psana loops to use more vectorized operations. We will eventually combine the GPU-based det.calib processing with the GPU-based SZ compression, which will solve the instructions-per-memory-transfer issue.
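The instructions-per-memory-transfer point can be made concrete with a rough estimate. The sketch below assumes a simplified calibration of the form (raw - pedestal) * gain, 16-bit raw input, 32-bit float output, and pedestal and gain arrays already resident on the GPU. These are illustrative assumptions, not measured properties of det.calib, and the array shape is arbitrary.

```python
import numpy as np


def flops_per_byte(n_pixels, raw_bytes=2, out_bytes=4, flops_per_pixel=2):
    """Estimate arithmetic operations per byte moved between host and GPU.

    Assumes one subtract and one multiply per pixel, raw data copied in
    and calibrated data copied out (all values are assumptions).
    """
    flops = flops_per_pixel * n_pixels
    bytes_moved = (raw_bytes + out_bytes) * n_pixels
    return flops / bytes_moved


# Simplified stand-in for a calibration step (not the real det.calib).
rng = np.random.default_rng(0)
shape = (4, 352, 384)  # illustrative detector shape
raw = rng.integers(0, 2**14, size=shape, dtype=np.uint16)
pedestal = np.full(shape, 100.0, dtype=np.float32)
gain = np.full(shape, 0.5, dtype=np.float32)
calib = (raw - pedestal) * gain

intensity = flops_per_byte(raw.size)
print(f"about {intensity:.2f} operations per byte transferred")
```

Under these assumptions the ratio is well below one operation per byte, so transfer time dominates. Chaining further GPU work, such as compression, onto the same data raises the ratio without adding transfers.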

Large Scale hdf5 Analysis on GPU

For hdf5 analysis, cupy can only run on one GPU on one node, whereas cunumeric can scale to multiple GPUs and nodes. cpo says: it is not yet clear how big lcls2 h5 files will be, or whether we will need heavy-duty algorithms running on GPU.

Detector Corrections

We compare default and common mode (cmpars=(7,0,100)) detector corrections for the epix10ka using numpy, cupy, and cunumeric, with data from run 796 of the uedcom103 experiment. The following are results from running with 7 ranks (1 rank/GPU) and 11 ranks (2 ranks/GPU) on S3DF ampere nodes.

...

Additionally, this loop is problematic for cunumeric performance: https://github.com/slac-lcls/lcls2/blob/cunumeric/psana/psana/detector/UtilsAreaDetector.py#L98
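As a generic illustration of why Python-level loops tend to hurt array-library performance on GPU, the sketch below compares a per-segment loop with a single whole-array expression. It is not the code at the linked line. The function names and the correction formula are hypothetical, and NumPy is used so the example runs anywhere.

```python
import numpy as np


def correct_with_loop(raw, pedestal, gain):
    """Loop over segments: many small array operations, one set per segment."""
    out = np.empty(raw.shape, dtype=np.float32)
    for i in range(raw.shape[0]):
        out[i] = (raw[i] - pedestal[i]) * gain[i]
    return out


def correct_vectorized(raw, pedestal, gain):
    """Single whole-array expression: a few large operations in total."""
    return ((raw - pedestal) * gain).astype(np.float32)


rng = np.random.default_rng(1)
shape = (16, 64, 64)  # illustrative shape only
raw = rng.integers(0, 1000, size=shape).astype(np.float32)
pedestal = rng.random(shape, dtype=np.float32)
gain = rng.random(shape, dtype=np.float32)

looped = correct_with_loop(raw, pedestal, gain)
vectorized = correct_vectorized(raw, pedestal, gain)
print(np.allclose(looped, vectorized))
```

Both versions give the same result, but the vectorized form issues far fewer operations, each over much more data, which generally suits GPU-backed array libraries better.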

Default Corrections 7 ranks

Default Corrections 11 ranks

Common Mode Corrections cmpars=(7,0,100) 7 ranks - NO MASKING

Common Mode Corrections cmpars=(7,0,100) 11 ranks - NO MASKING