Executive Summary of cuNumeric Studies

Transcribed by cpo listening to Seshu on Jan. 31, 2024

...

Seshu didn't see a significant performance increase with either cupy or cunumeric because det.calib doesn't execute enough instructions per memory transfer. cunumeric is still implementing more functionality, and some psana loops also need to be rewritten to use more vectorized operations. We will eventually add the GPU-based det.calib processing to the GPU-based SZ compression, which will solve the instructions-per-memory-transfer issue.
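To make the instructions-per-memory-transfer point concrete, here is a rough back-of-the-envelope sketch. It assumes the calibration reduces to out = (raw - pedestal) * gain with 16-bit raw data and 32-bit floats for everything else; the function name, the `constants_resident` flag and the detector shape are illustrative and not taken from psana.

```python
import numpy as np

def calib_flops_per_byte(shape, constants_resident=False):
    """Estimate arithmetic operations per byte moved for (raw - pedestal) * gain.

    Assumptions (not from the study): raw is uint16, pedestal/gain/output are float32.
    If constants_resident is True, pedestal and gain are assumed to already live on
    the GPU, so only the raw input and the calibrated output are transferred.
    """
    n_pixels = int(np.prod(shape))
    flops = 2 * n_pixels  # one subtract and one multiply per pixel
    raw_bytes = np.dtype(np.uint16).itemsize
    f32_bytes = np.dtype(np.float32).itemsize
    bytes_per_pixel = raw_bytes + f32_bytes  # raw in, calibrated out
    if not constants_resident:
        bytes_per_pixel += 2 * f32_bytes  # pedestal and gain also transferred
    return flops / (n_pixels * bytes_per_pixel)

# Hypothetical detector shape (segments, rows, cols), for illustration only.
shape = (4, 352, 384)
all_moved = calib_flops_per_byte(shape)
constants_on_gpu = calib_flops_per_byte(shape, constants_resident=True)
print(f"{all_moved:.3f} flops/byte with constants transferred")
print(f"{constants_on_gpu:.3f} flops/byte with constants resident on the GPU")
```

Under these assumptions the ratio stays well below one operation per byte in both cases, which is consistent with the transfer cost dominating unless more processing (such as the compression step) is done on the same data while it is on the GPU.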

Large Scale hdf5 Analysis on GPU

For hdf5 analysis, cupy can only run on 1 GPU on one node, whereas cunumeric can scale across multiple GPUs and nodes. cpo says: it is not yet clear how big lcls2 h5 files will be or whether we will need heavy-duty algorithms running on GPU.
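As a minimal sketch of why the scaling comes almost for free on the code side: cuNumeric is meant to be used as a drop-in replacement for the numpy import. The example below falls back to plain numpy when cuNumeric is not installed, and uses a small in-memory array as a stand-in for a dataset read from an h5 file; `mean_image` and the array contents are hypothetical.

```python
try:
    import cunumeric as np  # assumption: cuNumeric is installed in this environment
except ImportError:
    import numpy as np  # fallback so the sketch runs anywhere

def mean_image(frames):
    """Average a stack of frames with shape (n_events, rows, cols) over events."""
    return frames.mean(axis=0)

# Stand-in for a dataset that would normally be read from an h5 file.
frames = np.ones((8, 4, 4), dtype=np.float32) * 3
avg = mean_image(frames)
print(avg.shape)
```

The analysis code itself does not change between backends. With cuNumeric, the number of GPUs and nodes is chosen when the script is launched through the Legate launcher rather than in the script; the exact launcher options should be checked against the Legate documentation.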

Detector Corrections

We compare running default and common mode (cmpars=(7,0,100)) detector corrections for the epix10ka with numpy, cupy, and cunumeric using data from the uedcom103 experiment run 796. The following are results from running with 7 ranks (1 rank/gpu) and 11 ranks (2 ranks/gpu) on S3DF ampere nodes.
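A comparison of this kind can be sketched with a small harness that takes the array module as a parameter. This is not the benchmark used for the results here: `calibrate` is a simplified stand-in for the default correction, and the shape, constants and event count are made up.

```python
import time
import numpy as np

def calibrate(raw, pedestal, gain):
    """Simplified default correction: pedestal subtraction, then gain scaling."""
    return (raw - pedestal) * gain

def time_calibrate(xp, shape=(4, 352, 384), n_events=20):
    """Return (average seconds per event, last result) for array module `xp`."""
    rng = np.random.default_rng(0)
    # Synthetic inputs, created with numpy and handed to the backend under test.
    raw = xp.asarray(rng.integers(0, 16384, size=shape).astype(np.float32))
    pedestal = xp.asarray(np.full(shape, 100.0, dtype=np.float32))
    gain = xp.asarray(np.full(shape, 0.5, dtype=np.float32))
    start = time.perf_counter()
    for _ in range(n_events):
        out = calibrate(raw, pedestal, gain)
    elapsed = time.perf_counter() - start
    return elapsed / n_events, out

seconds_per_event, out = time_calibrate(np)
print(f"numpy: {seconds_per_event * 1e3:.3f} ms/event")
```

Passing cupy or cunumeric as `xp` would exercise the same code path, but GPU backends run asynchronously, so a fair timing would also need to wait for the device to finish before stopping the clock.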

...

Additionally, this loop is problematic for cunumeric performance: https://github.com/slac-lcls/lcls2/blob/cunumeric/psana/psana/detector/UtilsAreaDetector.py#L98
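Without reproducing the linked code, the general pattern can be illustrated with a hypothetical per-segment operation: a Python-level loop issues one small array operation per segment, while the vectorized form issues a single operation over the whole array, which is the shape of work that GPU array libraries handle best. Both function names and the per-segment mean subtraction are made up for this sketch.

```python
import numpy as np

def subtract_segment_mean_loop(data):
    """Loop form: one small operation per segment (hypothetical example)."""
    out = np.empty_like(data)
    for i in range(data.shape[0]):
        out[i] = data[i] - data[i].mean()
    return out

def subtract_segment_mean_vectorized(data):
    """Vectorized form: a single broadcasted operation over all segments."""
    return data - data.mean(axis=(1, 2), keepdims=True)

rng = np.random.default_rng(0)
data = rng.normal(size=(4, 352, 384))  # illustrative (segments, rows, cols)
same = np.allclose(subtract_segment_mean_loop(data),
                   subtract_segment_mean_vectorized(data))
print(same)
```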


Default Corrections 7 ranks

Default Corrections 11 ranks

Common Mode Corrections cmpars=(7,0,100) 7 ranks - NO MASKING

Common Mode Corrections cmpars=(7,0,100) 11 ranks - NO MASKING
