Executive Summary of cuNumeric Studies

Transcribed by cpo listening to Seshu on Jan. 31, 2024

psana on GPU

Advantage of cunumeric over cupy: porting psana to cupy required invasive changes, and the resulting code would no longer run on both CPU and GPU. The cunumeric changes were small, and the same code works on both CPU and GPU. A detail: we don't use the Legion/cunumeric data distribution (e.g. GASNet), since we use MPI to distribute data.
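Because cunumeric exposes the NumPy API, the backend can be chosen once at import time while the rest of the code stays unchanged. A minimal sketch of that pattern follows; the `calibrate` helper and the array shapes are hypothetical stand-ins, not psana's actual det.calib code.

```python
# Pick the array backend once; everything below is backend-agnostic.
try:
    import cunumeric as np  # GPU-capable drop-in replacement, if installed
except ImportError:
    import numpy as np      # plain CPU fallback


def calibrate(raw, pedestal, gain):
    """Hypothetical stand-in for a detector correction: (raw - pedestal) * gain."""
    return (np.asarray(raw, dtype=np.float32) - pedestal) * gain


# Illustrative data only: a tiny 4x8 "image".
raw = np.full((4, 8), 110, dtype=np.uint16)
pedestal = np.full((4, 8), 100.0, dtype=np.float32)
gain = np.full((4, 8), 0.5, dtype=np.float32)
calibrated = calibrate(raw, pedestal, gain)
```

With cupy, by contrast, host and device arrays are distinct types, so explicit transfers tend to appear throughout the code.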

Seshu didn't see a significant performance increase with either cupy or cunumeric, because det.calib doesn't execute enough instructions per memory transfer. cunumeric is still implementing more of the NumPy API, and some psana loops also need to be rewritten as vectorized operations. We will eventually chain the GPU-based det.calib processing with the GPU-based SZ compression, which should solve the instructions-per-memory-transfer issue.
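The instructions-per-memory-transfer argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a correction of one subtract and one multiply per pixel, 16-bit raw input, and 32-bit float output; the operation counts and data types are illustrative assumptions, not measurements from psana.

```python
def ops_per_byte(ops_per_pixel, bytes_in_per_pixel, bytes_out_per_pixel):
    """Arithmetic operations performed per byte moved between host and GPU."""
    return ops_per_pixel / (bytes_in_per_pixel + bytes_out_per_pixel)


# Assumed: subtract + multiply per pixel, uint16 in, float32 out.
calib_only = ops_per_byte(ops_per_pixel=2, bytes_in_per_pixel=2, bytes_out_per_pixel=4)

# Assumed: further GPU work on the same resident data adds operations without
# adding transfers. The figure of 50 extra operations per pixel is hypothetical.
calib_plus_more = ops_per_byte(ops_per_pixel=2 + 50, bytes_in_per_pixel=2, bytes_out_per_pixel=4)

print(f"calibration only:       {calib_only:.2f} ops/byte")
print(f"calibration + more GPU: {calib_plus_more:.2f} ops/byte")
```

At a fraction of an operation per byte, the transfer cost dominates, which is consistent with seeing no speedup from the GPU for calibration alone.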

Large Scale hdf5 Analysis on GPU

For hdf5 analysis, cupy can only run on one GPU on one node, whereas cunumeric can scale beyond that. cpo says: it is not yet clear how big the lcls2 h5 files will be, or whether we will need heavy-duty algorithms running on GPU.

Detector Corrections

We compare running default and common mode (cmpars=(7,0,100)) detector corrections for the epix10ka with numpy, cupy, and cunumeric using data from the uedcom103 experiment run 796. The following are results from running with 7 ranks (1 rank/gpu) and 11 ranks (2 ranks/gpu) on S3DF ampere nodes.

For common mode corrections, neither cupy nor cunumeric implements masked arrays, so masking was turned off.
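For reference, one way to emulate masking without masked arrays is to replace masked pixels with NaN and use a NaN-aware reduction. The sketch below uses plain numpy and a simplified per-row median subtraction. It does not reproduce the psana common mode algorithm selected by cmpars, and whether `nanmedian` runs on the GPU in a given cupy or cunumeric release has not been checked here.

```python
import numpy as np


def subtract_row_median(data, good_pixel_mask):
    """Subtract each row's median, computed over unmasked (good) pixels only."""
    # Masked-out pixels become NaN so that nanmedian ignores them.
    visible = np.where(good_pixel_mask, data, np.nan)
    row_median = np.nanmedian(visible, axis=1, keepdims=True)
    return data - row_median


data = np.array([[1.0, 2.0, 3.0, 100.0],
                 [4.0, 4.0, 4.0, 4.0]])
good = np.array([[True, True, True, False],   # hot pixel excluded from the median
                 [True, True, True, True]])
corrected = subtract_row_median(data, good)
```

A row with no good pixels yields NaN for that row, so a real implementation would need to handle that case explicitly.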

The cupy results used a circular buffer of 2 streams alternating between images. This showed better performance than 1 stream, while using 4 streams resulted in worse performance.
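The two-stream circular buffer can be sketched as follows. This is an illustration of the round-robin pattern only, not the code used for the measurements: the `correct_images` helper and the pedestal/gain correction are hypothetical, and a CPU fallback is included so the sketch runs on machines without a usable GPU.

```python
import contextlib
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device")
    xp, new_stream, to_host = cp, cp.cuda.Stream, cp.asnumpy
except Exception:
    # CPU fallback: "streams" become no-op context managers.
    xp, new_stream, to_host = np, contextlib.nullcontext, np.asarray

N_STREAMS = 2  # the notes report 2 streams beating both 1 and 4
streams = [new_stream() for _ in range(N_STREAMS)]


def correct_images(images, pedestal, gain):
    """Apply (raw - pedestal) * gain, alternating images between streams."""
    pedestal = xp.asarray(pedestal)
    gain = xp.asarray(gain)
    results = []
    for i, raw in enumerate(images):
        with streams[i % N_STREAMS]:  # circular buffer of streams
            device_raw = xp.asarray(raw, dtype=xp.float32)
            results.append((device_raw - pedestal) * gain)
    for stream in streams:
        if hasattr(stream, "synchronize"):
            stream.synchronize()  # wait for queued GPU work to finish
    return [to_host(r) for r in results]


images = [np.full((2, 3), 10 * k, dtype=np.uint16) for k in range(1, 5)]
out = correct_images(images,
                     np.full((2, 3), 5.0, dtype=np.float32),
                     np.full((2, 3), 2.0, dtype=np.float32))
```

Host-to-device copies from ordinary (pageable) host memory may not overlap with computation, so how much benefit the streams give depends on how the input buffers are allocated.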

cunumeric is currently missing np.flipud, np.fliplr, and np.select and falls back to CPU numpy for these operations, resulting in a copy from GPU to CPU for each operation.
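Until these functions are available, one possible workaround is to express them with slicing and `where`. The sketch below is written against plain numpy and checked against the numpy originals; the helper names are hypothetical, and it is an assumption (not verified here) that reversed slicing and `where` stay on the GPU under cunumeric.

```python
import numpy as np


def flipud_by_slicing(a):
    """Reverse the first axis, equivalent to np.flipud."""
    return a[::-1, ...]


def fliplr_by_slicing(a):
    """Reverse the second axis, equivalent to np.fliplr."""
    return a[:, ::-1, ...]


def select_by_where(condlist, choicelist, default=0):
    """Emulate np.select with nested where calls."""
    result = default
    # Apply in reverse order so the first matching condition wins, as in np.select.
    for cond, choice in zip(reversed(condlist), reversed(choicelist)):
        result = np.where(cond, choice, result)
    return result


a = np.arange(12).reshape(3, 4)
conds = [a < 3, a < 6, a > 9]
choices = [a, a * 10, -a]
selected = select_by_where(conds, choices, default=0)
```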

Additionally, this loop is problematic for cunumeric performance: https://github.com/slac-lcls/lcls2/blob/cunumeric/psana/psana/detector/UtilsAreaDetector.py#L98
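Python-level loops over array slices tend to turn into many small operations rather than one large one, which is a poor fit for a GPU. The linked loop is not reproduced here; the sketch below uses a hypothetical per-segment median subtraction to show the general loop-to-vectorized rewrite.

```python
import numpy as np


def subtract_segment_median_loop(segments):
    """Loop version: one small median and one small subtraction per segment."""
    out = np.empty_like(segments)
    for i in range(segments.shape[0]):
        out[i] = segments[i] - np.median(segments[i])
    return out


def subtract_segment_median_vectorized(segments):
    """Vectorized version: a single reduction and a single broadcast subtraction."""
    medians = np.median(segments, axis=(1, 2), keepdims=True)
    return segments - medians


# Illustrative shape only: 4 segments of 6x8 pixels.
rng = np.random.default_rng(0)
segments = rng.normal(size=(4, 6, 8))
looped = subtract_segment_median_loop(segments)
vectorized = subtract_segment_median_vectorized(segments)
```

Both versions give the same result; the vectorized one hands the array library a single large operation to schedule.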

Default Corrections 7 ranks

Default Corrections 11 ranks

Common Mode Corrections cmpars=(7,0,100) 7 ranks - NO MASKING

Common Mode Corrections cmpars=(7,0,100) 11 ranks - NO MASKING
