Detector Corrections
We compare running default and common mode (cmpars=(7,0,100)) detector corrections for the epix10ka with numpy, cupy, and cunumeric, using data from run 796 of the uedcom103 experiment. The following are results from running with 7 ranks (1 rank/gpu) and 11 ranks (2 ranks/gpu) on S3DF ampere nodes.
For common mode corrections, masking was turned off because neither cupy nor cunumeric implements masked arrays.
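For reference, a masked median can in principle be emulated without np.ma by replacing bad pixels with NaN and using a NaN-aware reduction. The following is a minimal numpy-only sketch of that idea, not the approach used for the results here. The function name and the per-row median are illustrative assumptions, and whether np.nanmedian is available and performs well in cupy or cunumeric was not checked.

```python
import numpy as np


def masked_median_rows(data, good):
    """Per-row median over pixels where `good` is True, without np.ma.

    Hypothetical helper for illustration only. Bad pixels are replaced
    by NaN so that a NaN-aware median skips them.
    """
    filled = np.where(good, data, np.nan)
    return np.nanmedian(filled, axis=1)


# Toy data: the third pixel of the first row is flagged as bad.
data = np.array([[1.0, 2.0, 100.0],
                 [4.0, 5.0, 6.0]])
good = np.array([[True, True, False],
                 [True, True, True]])

# Reference result using numpy masked arrays (mask=True means "ignore").
reference = np.ma.median(np.ma.masked_array(data, mask=~good), axis=1)
emulated = masked_median_rows(data, good)
```

A row with no good pixels would yield NaN (with a runtime warning), so a real implementation would need to handle that case explicitly.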
The cupy results used a circular buffer of 2 streams, alternating between them from one image to the next. This performed better than a single stream, while using 4 streams performed worse.
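The stream rotation can be sketched roughly as follows. This is a hedged illustration, not the code used for the measurements: the pedestal/gain arithmetic is a stand-in for the real correction, and the function names are hypothetical. The sketch falls back to numpy with no-op contexts when cupy or a GPU is unavailable, so it can be run anywhere.

```python
import contextlib

import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
    xp = cp
    on_gpu = True
except Exception:
    xp = np
    on_gpu = False


def make_streams(n_streams):
    """Create the circular buffer of streams (no-op contexts on CPU)."""
    if on_gpu:
        return [cp.cuda.Stream() for _ in range(n_streams)]
    return [contextlib.nullcontext() for _ in range(n_streams)]


def correct_images(images, pedestal, gain, n_streams=2):
    """Apply a stand-in correction, rotating through the streams per image."""
    streams = make_streams(n_streams)
    d_pedestal = xp.asarray(pedestal)
    results = []
    for i, img in enumerate(images):
        # Image i is queued on stream i % n_streams.
        with streams[i % n_streams]:
            d_img = xp.asarray(img)
            results.append((d_img - d_pedestal) * gain)
    # Wait for all queued work before handing results back.
    for s in streams:
        if hasattr(s, "synchronize"):
            s.synchronize()
    return results


def to_host(a):
    """Return a numpy array regardless of which backend produced `a`."""
    return cp.asnumpy(a) if on_gpu else np.asarray(a)


images = [np.full((2, 2), 1.0), np.full((2, 2), 2.0), np.full((2, 2), 3.0)]
out = correct_images(images, pedestal=np.ones((2, 2)), gain=2.0)
```

Note that host-to-device copies only overlap with compute when the host buffers are in pinned memory, which this sketch does not set up.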
cunumeric is currently missing np.flipud, np.fliplr, and np.select and falls back to CPU numpy for these operations, resulting in copies from GPU to CPU for each operation.
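One possible workaround is to express these three operations with reversed slicing and np.where. The sketch below is written against plain numpy to show that the results match. The helper names are hypothetical, and whether the slicing and np.where forms actually stay on the GPU under cunumeric is an assumption that would need to be verified.

```python
import numpy as np


def flipud_slice(a):
    """Equivalent of np.flipud: reverse the first axis."""
    return a[::-1, ...]


def fliplr_slice(a):
    """Equivalent of np.fliplr: reverse the second axis (needs ndim >= 2)."""
    return a[:, ::-1, ...]


def select_where(condlist, choicelist, default=0):
    """Equivalent of np.select built from nested np.where calls.

    Conditions are applied in reverse order so that, as in np.select,
    the first matching condition takes precedence.
    """
    out = np.asarray(default)
    for cond, choice in zip(reversed(condlist), reversed(choicelist)):
        out = np.where(cond, choice, out)
    return out


x = np.arange(6).reshape(2, 3)
conds = [x < 2, x < 4]
choices = [x * 10, x * 100]
selected = select_where(conds, choices, default=-1)
```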
Additionally, this loop is problematic for cunumeric performance: https://github.com/slac-lcls/lcls2/blob/cunumeric/psana/psana/detector/UtilsAreaDetector.py#L98