Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
title2024-01-04 email from Chris
collapsetrue
2024-01-04 1:22pm
Li, Yee Ting;
Kroeger, Wilko
Hi Mikhail, (cc: Yee, Wilko since they may have an interest in this)

When you have a chance, could you look at the scaling behavior of this script.
It tries to "prefetch" the data from disk (for 100 events per core) with det.raw,
and then times the behavior of det.calib:

~cpo/problems/cxix1000021_scaling/cxix1000021_scaling.py

To do this get exclusive access to a batch node with this command:

srun --partition milano --account lcls:prjdat21 -n 1 --time=01:00:00 --exclusive --pty /bin/bash

The in another window ssh to the node that you have been allocated
(the reason this is necessary is obscure: it has to do with an incompatible
version of pmix used by srun that our mpi is unhappy with).
In that new window run these two commands:

mpirun -n 1 python cxix1000021_scaling.py
mpirun -n 80 python cxix1000021_scaling.py

For me the first one shows about .11s per event while the second shows about .25s per event,
which is the poor scaling behavior we would like to understand.
You can see logfiles in ~cpo/problems/cxix1000021_scaling in interactive_1core.log and interactive_80core.log.

I think the fundamental question that I hope you could answer is:
do all numpy operations in det.calib slow down on 80 cores,
or is there some other cause for the poor scaling?
From discussions we've had with Yee we suspect the poor scaling may
be due to behavior of the hardware, but we don't know for sure.

Happy to discuss this with you on zoom at any time, of course.
Thanks for looking at this!

chris

p.s. following a suggestion by someone (Yee? Wilko?)
I watched the cpu clock speeds in /proc/cpuinfo while the 80 core job was running.
To my eye they seemed to vary between ~2.4MHz and ~3.0Mhz which didn't feel like quite enough

Test environment and commands

Code Block
collapsetrue
> s3dflogin
> psana
srun --partition milano --account lcls:prjdat21 -n 1 --time=01:00:00 --exclusive --pty /bin/bash
Ex:
ana-4.0.58-py3 [dubrovin@sdfiana001:~/LCLS/con-py3]$ srun --partition milano --account lcls:prjdat21 -n 1 --time=01:00:00 --exclusive --pty /bin/bash
srun: job 37602834 queued and waiting for resources
srun: job 37602834 has been allocated resources
In .bashrc on host sdfiana001
In .bash_login
ana-4.0.58-py3 [dubrovin@sdfmilan031:~/LCLS/con-py3]$

grab example from Chris:
~cpo/problems/cxix1000021_scaling/cxix1000021_scaling.py
and modify it as explained below.

In other window
> s3dflogin
ssh -Y sdfmilan031
cd LCLS/con-py3
source /sdf/group/lcls/ds/ana/sw/conda1/manage/bin/psconda.sh
source /sdf/group/lcls/ds/ana/sw/conda1/manage/bin/setup_testrel

mpirun -n 1 python test-cxix1000021_scaling.py
mpirun -n 80 python test-cxix1000021_scaling.py

...

Code Block
titlecalib_jungfrau with inserted timing point
collapsetrue
def calib_jungfrau(det, evt, cmpars=(7,3,200,10), **kwa):
    """
    Returns calibrated jungfrau data

    - gets constants
    - gets raw data
    - evaluates (code - pedestal - offset)
    - applys common mode correction if turned on
    - apply gain factor

    Parameters

    - det (psana.Detector) - Detector object
    - evt (psana.Event)    - Event object
    - cmpars (tuple) - common mode parameters
        - cmpars[0] - algorithm # 7-for jungfrau
        - cmpars[1] - control bit-word 1-in rows, 2-in columns
        - cmpars[2] - maximal applied correction
    - **kwa - used here and passed to det.mask_v2 or det.mask_comb
      - nda_raw - if not None, substitutes evt.raw()
      - mbits - DEPRECATED parameter of the det.mask_comb(...)
      - mask - user defined mask passed as optional parameter
    """

    #print('XXX: ====================== det.name', det.name)

    t00 = time()

    src = det.source # - src (psana.Source)   - Source object

    nda_raw = kwa.get('nda_raw', None)
    arr = det.raw(evt) if nda_raw is None else nda_raw # shape:(<npanels>, 512, 1024) dtype:uint16
    if arr is None: return None

    t01 = time()

    #arr peds = np.array(det.rawpedestals(evt), dtype=np.float32)
    peds = det.pedestals(evt) #  # - 4d pedestals shape:(3, 1, 512, 1024) dtype:float32
    if peds is None: return None

    t02 = time()

    gain = det.gain(evt)      # - 4d gains
    offs = det.offset(evt)    # - 4d offset

    t03 = time()

    detname = string_from_source(det.source)
    cmp = det.common_mode(evt) if cmpars is None else cmpars

    t04 = time()

    if gain is None: gain = np.ones_like(peds)  # - 4d gains
    if offs is None: offs = np.zeros_like(peds) # - 4d gains

    # cache
    gfac = store.gfac.get(detname, None) # det.name
    if gfac is None:
       gfac = divide_protected(np.ones_like(peds), gain)
       store.gfac[detname] = gfac
       store.arr1 = np.ones_like(arr, dtype=np.int8)

    t05 = time()

    #print_ndarr(cmp,  'XXX: common mode parameters ')# Define bool arrays of ranges
    #print_ndarr(arr,  'XXX: calib_jungfrau arr ')# faster than bit operations
    #print_ndarr(peds, 'XXX: calib_jungfrau peds')
gr0 = arr <  BW1      #print_ndarr(gain, 'XXX: calib_jungfrau gain')
    #print_ndarr(offs, 'XXX: calib_jungfrau offs') # 490 us
    #print_ndarr(gfac, 'XXX: calib_jungfrau gfac')

    # make bool arrays of gain ranges shaped as data
    #abit15 = arr & BW1 # ~0.5ms
    #abit16 = arr & BW2gr1 =(arr >= BW1) & (arr<BW2) # 714 us
    gr2 = arr >= BW3              # 400 us

    #gr0t06 = np.logical_not(arr & BW3) #1.6mstime()

    # Subtract pedestals
    #gr1arrf = np.logical_and(abit15array(arr & MSK, dtype=np.logical_not(abit16))float32)

    #gr2t07 = np.logical_and(abit15, abit16time()

    #pro_den = np.select((den!=0,), (den,), default=1)
arrf[gr0] -= peds[0,gr0]
    # Define bool arrays of rangesarrf[gr1] -= peds[1,gr1] #- arrf[gr1]
    # faster than bit operationsarrf[gr2] -= peds[2,gr2] #- arrf[gr2]

    gr0t08 = arrtime()

 <  BW1 factor = np.select((gr0, gr1, gr2), (gfac[0,:], gfac[1,:],      gfac[2,:]), default=1) # 490 us2msec

    gr1t09 =(arr >= BW1) & (arr<BW2) # 714 us time()

    offset = np.select((gr0, gr1, gr2), (offs[0,:], offs[1,:], offs[2,:]), default=0)

    gr2t10 = arr >= BW3time()

    arrf -= offset # Apply        # 400 usoffset correction

    t06t11 = time()

    #gbitsif =store.mask raw>>14is # 00/01/11 - gain bits for mode 0,1,2
    #gr0, gr1, gr2 = gbits==0, gbits==1, gbits==3None:
       store.mask = det.mask_total(evt, **kwa)
    mask = store.mask

    #print_ndarr(gr0, 'XXX: calib_jungfrau gr0')
t12 = time()

    #print_ndarr(gr1, 'XXX: calib_jungfrau gr1')
if cmp is not None:
     #print_ndarr(gr2 mode, 'XXX: calib_jungfrau gr2')

cormax = int(cmp[1]), cmp[2]
    # Subtract pedestals
npixmin = cmp[3]  arrf = np.array(arr & MSK, dtype=np.float32)

    t07 = time()

    arrf[gr0] -= peds[0,gr0]if len(cmp)>3 else 10
      if mode>0:
        #arr1 = store.arr1
    arrf[gr1] -= peds[1,gr1] #- arrf[gr1]
    arrf[gr2] -= peds[2,gr2] #- arrf[gr2]

 #grhg = np.select((gr0,  gr1), (arr1, arr1), default=0)
      t08 = time()

 logger.debug(info_ndarr(gr0, 'gain group0'))
      #a = nplogger.selectdebug(info_ndarr(gr0mask, gr1, gr2), (gain[0,:], gain[1,:], gain[2,:]), default=1) # 2msec
'mask'))
        t0_sec_cm = time()
        factorgmask = np.select(bitwise_and(gr0, gr1, gr2mask), (gfac[0,:], gfac[1,:], gfac[2,:]), default=1) # 2msec

 if mask is not None else gr0
    t09 = time()

    offset#sh = np.select((gr0nsegs, gr1512, gr2), (offs[0,:], offs[1,:], offs[2,:]), default=0)

    t10 = time()
    #print_ndarr(factor, 'XXX: calib_jungfrau factor')
    #print_ndarr(offset, 'XXX: calib_jungfrau offset')

    arrf -= offset # Apply offset correction

    t11 = time()

    if store.mask is None:
       store.mask = det.mask_total(evt, **kwa1024)
        hrows = 256 #512/2
        for s in range(arrf.shape[0]):
          if mode & 4: # in banks: (512/2,1024/16) = (256,64) pixels # 100 ms
            common_mode_2d_hsplit_nbanks(arrf[s,:hrows,:], mask=gmask[s,:hrows,:], nbanks=16, cormax=cormax, npix_min=npixmin)
    mask = store.mask

    t12 = time(  common_mode_2d_hsplit_nbanks(arrf[s,hrows:,:], mask=gmask[s,hrows:,:], nbanks=16, cormax=cormax, npix_min=npixmin)

    if cmp is not None:
  if mode & 1: mode,# cormaxin = int(cmp[1]), cmp[2]
      npixmin = cmp[3] if len(cmp)>3 else 10rows per bank: 1024/16 = 64 pixels # 275 ms
      if mode>0:
     common_mode_rows_hsplit_nbanks(arrf[s,], mask=gmask[s,],  #arr1 = store.arr1nbanks=16, cormax=cormax, npix_min=npixmin)

        #grhg = np.select((gr0,  gr1), (arr1, arr1), default=0)
        logger.debug(info_ndarr(gr0, 'gain group0'))
        logger.debug(info_ndarr(mask, 'mask'))
 if mode & 2: # in cols per bank: 512/2 = 256 pixels  # 290 ms
            t0common_sec_cm = time()
        gmask = np.bitwise_and(gr0, mask) if mask is not None else gr0
        #sh = (nsegs, 512, 1024)mode_cols(arrf[s,:hrows,:], mask=gmask[s,:hrows,:], cormax=cormax, npix_min=npixmin)
            common_mode_cols(arrf[s,hrows:,:], mask=gmask[s,hrows:,:], cormax=cormax, npix_min=npixmin)

        hrows = 256 #512/2
        for s in range(arrf.shape[0]):logger.debug('TIME: common-mode correction time = %.6f sec' % (time()-t0_sec_cm))

    t13 = time()

    resp = arrf *  factor if modemask &is 4:None #else inarrf banks: (512/2,1024/16) = (256,64) pixels* factor * mask # 100gain mscorrection

    t14 = time()
    times  common_mode_2d_hsplit_nbanks(arrf[s,:hrows,:], mask=gmask[s,:hrows,:], nbanks=16, cormax=cormax, npix_min=npixmin)
            common_mode_2d_hsplit_nbanks(arrf[s,hrows:,:], mask=gmask[s,hrows:,:], nbanks=16, cormax=cormax, npix_min=npixmin= np.array((t00, t01, t02, t03, t04, t05, t06, t07, t08, t09, t10, t11, t12, t13, t14), dtype=np.float64)

    return      if mode & 1: # in rows per bank: 1024/16 = 64 pixels # 275 ms
            common_mode_rows_hsplit_nbanks(arrf[s,], mask=gmask[s,], nbanks=16, cormax=cormax, npix_min=npixmin)

          if mode & 2: # in cols per bank: 512/2 = 256 pixels  # 290 ms
            common_mode_cols(arrf[s,:hrows,:], mask=gmask[s,:hrows,:], cormax=cormax, npix_min=npixmin)
            common_mode_cols(arrf[s,hrows:,:], mask=gmask[s,hrows:,:], cormax=cormax, npix_min=npixmin)

        logger.debug('TIME: common-mode correction time = %.6f sec' % (time()-t0_sec_cm))

    t13 = time()

    resp = arrf * factor if mask is None else arrf * factor * mask # gain correction

    t14 = time()
    times = np.array((t00, t01, t02, t03, t04, t05, t06, t07, t08, t09, t10, t11, t12, t13, t14), dtype=np.float64)

    return resp, times

Major histograms

Results

resp, times

Major histograms

Image AddedImage AddedImage Added

Image AddedImage AddedImage Added

Image AddedImage AddedImage Added

Image AddedImage AddedImage Added

Results

t pointtime point mathpoint descriptiontime for rank 0/1rank 0/80rank 30/80rank 60/80

00-0 previous evttime per evt total, between det.calib115.4±3.9 ms335 ±16 ms307±14 mst pointtime point mathpoint descriptiontime for rank 0/1rank 0/80rank 30/80rank 60/8000-0 prev evttime per evt total, between det.calib115.4±3.9 ms335 ±16 ms307±14 ms398 ±32 ms

11-0det.raw0.8±0.2 ms4.0 ±0.6 ms3.2±0.4 ms3.5 ±0.8 ms

22-1det.pedestals15±3 μs36 ±10 μs31±6 μs39 ±17 μs

3...det.gain,offset15±2 μs27 ±4 μs26±4 μs27 ±6 μs

4
cmpars25±1 μs50 ±7 μs58±26 μs71 ±33 μs

5
gfac2±0 μs6 ±1 μs7±1 μs7 ±2 μs

6
gr0,1,21.3±0.2 ms10.5 ±1.1 ms7.0±0.9 ms9.7 ±1.6 ms

7
make arrf1.76±0.05 ms9.2 ±0.9 ms6.3±0.7 ms9.0 ±1.5 ms

8
subtract peds93.7±3.1 .76±0.05 ms9.2 ±0.9 ms6.3±0.7 ms9.0 ±1.5 ms8subtract peds93.7±3.1 ms191 ±11 ms181±15 ms259 ±26 ms9eval gain factor for gain ranges4.9±0.6 ms20.3 ±1.5 ms14.6±1.2 ms17.3 ±2.0 ms10eval offset for gain ranges6.2±0.4 ms18.5 ±1.3 ms18.4±1.4 ms19.2 ±2.1 ms11subtract offset1.0±0.2 ms6.0 ±0.7 ms5.3±0.6 ms6.2 ±1.2 ms12get mask3±2 μs6 ±2 μs6±2 μs7 ±2 μs13common mode turned off7±1 μs15 ±2 μs17±2 μs20 ±3 μs14apply gain factor and mask4.0±0.7 ms14.9 ±2.0 ms13.9±1.6 ms19.2 ±3.5 ms9914-0evt processing time, inside det.calib109.8±4.2 ms276 ±15 ms247±13 ms345 ±29 ms

...

ms191 ±11 ms181±15 ms259 ±26 ms

9
eval gain factor for gain ranges4.9±0.6 ms20.3 ±1.5 ms14.6±1.2 ms17.3 ±2.0 ms

10
eval offset for gain ranges6.2±0.4 ms18.5 ±1.3 ms18.4±1.4 ms19.2 ±2.1 ms

11
subtract offset1.0±0.2 ms6.0 ±0.7 ms5.3±0.6 ms6.2 ±1.2 ms

12
get mask3±2 μs6 ±2 μs6±2 μs7 ±2 μs

13
common mode turned off7±1 μs15 ±2 μs17±2 μs20 ±3 μs

14
apply gain factor and mask4.0±0.7 ms14.9 ±2.0 ms13.9±1.6 ms19.2 ±3.5 ms

9914-0evt processing time, inside det.calib109.8±4.2 ms276 ±15 ms247±13 ms345 ±29 ms

Summary

  • single core processing is faster than per/core time in 80–core case, factor 2.5-3 FOR ALL OPERATIONS
  • in 80-core case: time per core is consistent between cores
  • all constants are cashed and access to constants is fast at sub-milisecond level
  • common mode correction is turned off, as well as mask?
  • most time consuming operation is indexed pedestal subtraction
Code Block
titleindexed by gain ranges pedestal subtraction
    t07 = time()

    arrf[gr0] -= peds[0,gr0]
    arrf[gr1] -= peds[1,gr1]
    arrf[gr2] -= peds[2,gr2]

    t08 = time() 

References