Executive Summary of cuNumeric Studies
Transcribed by cpo listening to Seshu on Jan. 31, 2024
psana on GPU
Advantage of cuNumeric over CuPy: with CuPy, psana needed invasive changes, and the resulting code would not run on both CPU and GPU. The cuNumeric changes were small and work on both CPU and GPU. A detail: we don't use the Legion/cuNumeric data distribution (e.g. GASNet), since we use MPI to distribute data.
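The CPU/GPU portability point can be illustrated with a minimal sketch. It assumes cuNumeric is importable as `cunumeric` and falls back to NumPy otherwise; `apply_pedestal_and_gain` and the array shapes are hypothetical stand-ins, not psana's det.calib.

```python
# Use cuNumeric when it is available, otherwise fall back to NumPy.
try:
    import cunumeric as np  # assumption: cuNumeric is installed
except ImportError:
    import numpy as np


def apply_pedestal_and_gain(raw, pedestal, gain):
    """Hypothetical per-pixel correction (not psana's det.calib)."""
    return (raw - pedestal) * gain


# Illustrative data: 4 segments of 8x8 pixels.
raw = np.full((4, 8, 8), 10.0, dtype=np.float32)
pedestal = np.full((4, 8, 8), 2.0, dtype=np.float32)
gain = np.full((4, 8, 8), 0.5, dtype=np.float32)

calibrated = apply_pedestal_and_gain(raw, pedestal, gain)
```

The function body is identical for both back ends; only the import changes. Per the notes above, the CuPy port did not offer this single code path.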
Seshu didn't see a significant performance increase (with either CuPy or cuNumeric) because det.calib doesn't execute enough instructions per memory transfer. cuNumeric is still adding functionality, and some psana loops also need to be rewritten in vectorized form. We will eventually combine the GPU-based det.calib processing with the GPU-based SZ compression, which will solve the instructions-per-memory-transfer issue.
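The instructions-per-memory-transfer argument can be made concrete with a back-of-the-envelope estimate. The operation counts, byte sizes, and compression ratio below are illustrative assumptions, not measurements of det.calib or SZ.

```python
def ops_per_byte(ops_per_pixel, bytes_in_per_pixel, bytes_out_per_pixel):
    """Arithmetic operations performed per byte moved to or from the GPU."""
    return ops_per_pixel / (bytes_in_per_pixel + bytes_out_per_pixel)


# Assumed: subtract pedestal and multiply by gain (2 ops); uint16 in, float32 out.
calib_only = ops_per_byte(ops_per_pixel=2, bytes_in_per_pixel=2, bytes_out_per_pixel=4)

# Assumed: a fused calibration + compression step does 50 ops per pixel and
# copies back about 0.4 bytes per pixel (hypothetical 10x compression).
fused = ops_per_byte(ops_per_pixel=50, bytes_in_per_pixel=2, bytes_out_per_pixel=0.4)

print(f"calibration only: {calib_only:.2f} ops/byte, fused: {fused:.2f} ops/byte")
```

Under these assumptions the fused pipeline does far more work per byte transferred, which is the effect the notes expect from adding det.calib to the SZ compression step.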
Large Scale hdf5 Analysis on GPU
For hdf5 analysis, CuPy can only run on one GPU on one node, whereas cuNumeric can scale. cpo says: it is not yet clear how big LCLS2 h5 files will be, or whether we will need heavy-duty algorithms running on GPU.
Detector Corrections
We compare default and common-mode (cmpars=(7,0,100)) detector corrections for the epix10ka using NumPy, CuPy, and cuNumeric, with data from run 796 of the uedcom103 experiment. The following results are from running with 7 ranks (1 rank/GPU) and 11 ranks (2 ranks/GPU) on S3DF ampere nodes.
...
Additionally, this loop is problematic for cuNumeric performance: https://github.com/slac-lcls/lcls2/blob/cunumeric/psana/psana/detector/UtilsAreaDetector.py#L98
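As a generic illustration of why Python-level loops can hurt (this is not the linked psana code), the sketch below contrasts a per-segment loop with a single whole-array expression. The function names and array shapes are hypothetical; the example uses NumPy so it runs anywhere.

```python
import numpy as np  # assumption: replace with cuNumeric where it is available


def subtract_mean_loop(segments):
    """Per-segment loop: every iteration issues several small array operations."""
    out = np.empty_like(segments)
    for i in range(segments.shape[0]):
        out[i] = segments[i] - np.mean(segments[i])
    return out


def subtract_mean_vectorized(segments):
    """Same result with one reduction and one broadcast subtraction."""
    mean = np.mean(segments, axis=(1, 2), keepdims=True)
    return segments - mean


# Illustrative data: 4 segments of 6x8 pixels.
rng = np.random.default_rng(0)
segments = rng.random((4, 6, 8))
same = np.allclose(subtract_mean_loop(segments), subtract_mean_vectorized(segments))
```

The loop version issues many small operations, each of which carries some fixed overhead; the vectorized version hands the whole array to the library in two calls.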