
Issue

2024-01-04 1:22pm, email from Chris (cc: Li, Yee Ting; Kroeger, Wilko)

Hi Mikhail, (cc: Yee, Wilko since they may have an interest in this)

When you have a chance, could you look at the scaling behavior of this script?
It tries to "prefetch" the data from disk (for 100 events per core) with det.raw,
and then times the behavior of det.calib:

~cpo/problems/cxix1000021_scaling/cxix1000021_scaling.py

To do this get exclusive access to a batch node with this command:

srun --partition milano --account lcls:prjdat21 -n 1 --time=01:00:00 --exclusive --pty /bin/bash

Then in another window ssh to the node that you have been allocated
(the reason this is necessary is obscure: it has to do with an incompatible
version of pmix used by srun that our mpi is unhappy with).
In that new window run these two commands:

mpirun -n 1 python cxix1000021_scaling.py
mpirun -n 80 python cxix1000021_scaling.py

For me the first one shows about 0.11 s per event while the second shows about 0.25 s per event,
which is the poor scaling behavior we would like to understand.
You can see logfiles in ~cpo/problems/cxix1000021_scaling in interactive_1core.log and interactive_80core.log.

I think the fundamental question that I hope you could answer is:
do all numpy operations in det.calib slow down on 80 cores,
or is there some other cause for the poor scaling?
From discussions we've had with Yee we suspect the poor scaling may
be due to behavior of the hardware, but we don't know for sure.

Happy to discuss this with you on zoom at any time, of course.
Thanks for looking at this!

chris

p.s. following a suggestion by someone (Yee? Wilko?)
I watched the cpu clock speeds in /proc/cpuinfo while the 80 core job was running.
To my eye they seemed to vary between ~2.4 GHz and ~3.0 GHz, which didn't feel like quite enough
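The clock-speed check from the p.s. can be scripted instead of eyeballed; a minimal sketch that summarizes the "cpu MHz" fields of /proc/cpuinfo-style text (the sample string below is illustrative, not measured data — on a live node one would read the real file):

```python
# Sketch: summarize per-core "cpu MHz" values from /proc/cpuinfo-style text.
# The sample string below is illustrative only, not measured data.

def mhz_summary(cpuinfo_text):
    """Return (min, max) of the 'cpu MHz' fields found in the text."""
    mhz = [float(line.split(':')[1])
           for line in cpuinfo_text.splitlines()
           if line.startswith('cpu MHz')]
    return min(mhz), max(mhz)

sample = """\
cpu MHz\t\t: 2400.000
cpu MHz\t\t: 2950.123
cpu MHz\t\t: 3000.000
"""

lo, hi = mhz_summary(sample)
print(lo, hi)  # 2400.0 3000.0
```

On a batch node the same function can be fed `open('/proc/cpuinfo').read()` in a polling loop while the 80-core job runs.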

List of commands for this test

> s3dflogin
> psana
srun --partition milano --account lcls:prjdat21 -n 1 --time=01:00:00 --exclusive --pty /bin/bash
Ex:
ana-4.0.58-py3 [dubrovin@sdfiana001:~/LCLS/con-py3]$ srun --partition milano --account lcls:prjdat21 -n 1 --time=01:00:00 --exclusive --pty /bin/bash
srun: job 37602834 queued and waiting for resources
srun: job 37602834 has been allocated resources
In .bashrc on host sdfiana001
In .bash_login
ana-4.0.58-py3 [dubrovin@sdfmilan031:~/LCLS/con-py3]$

grab example from Chris:
~cpo/problems/cxix1000021_scaling/cxix1000021_scaling.py
and modify it as explained below.

In other window
> s3dflogin
ssh -Y sdfmilan031
cd LCLS/con-py3
source /sdf/group/lcls/ds/ana/sw/conda1/manage/bin/psconda.sh
source /sdf/group/lcls/ds/ana/sw/conda1/manage/bin/setup_testrel

mpirun -n 1 python test-cxix1000021_scaling.py
mpirun -n 80 python test-cxix1000021_scaling.py

Code for time monitoring

calib_jungfrau with inserted timing points
def calib_jungfrau(det, evt, cmpars=(7,3,200,10), **kwa):
    """
    Returns calibrated jungfrau data

    - gets constants
    - gets raw data
    - evaluates (code - pedestal - offset)
    - applies common mode correction if turned on
    - applies gain factor

    Parameters

    - det (psana.Detector) - Detector object
    - evt (psana.Event)    - Event object
    - cmpars (tuple) - common mode parameters
        - cmpars[0] - algorithm number, 7 for jungfrau
        - cmpars[1] - control bit-word: 1 - in rows, 2 - in columns, 4 - in banks
        - cmpars[2] - maximal applied correction
        - cmpars[3] - minimal number of pixels to evaluate correction (default 10)
    - **kwa - used here and passed to det.mask_v2 or det.mask_comb
      - nda_raw - if not None, substitutes evt.raw()
      - mbits - DEPRECATED parameter of the det.mask_comb(...)
      - mask - user defined mask passed as optional parameter
    """

    t00 = time()

    src = det.source # - src (psana.Source)   - Source object

    nda_raw = kwa.get('nda_raw', None)
    arr = det.raw(evt) if nda_raw is None else nda_raw # shape:(<npanels>, 512, 1024) dtype:uint16
    if arr is None: return None

    t01 = time()

    peds = det.pedestals(evt) # - 4d pedestals shape:(3, 1, 512, 1024) dtype:float32
    if peds is None: return None

    t02 = time()

    gain = det.gain(evt)      # - 4d gains
    offs = det.offset(evt)    # - 4d offset

    t03 = time()

    detname = string_from_source(det.source)
    cmp = det.common_mode(evt) if cmpars is None else cmpars

    t04 = time()

    if gain is None: gain = np.ones_like(peds)  # - 4d gains
    if offs is None: offs = np.zeros_like(peds) # - 4d offsets

    # cache
    gfac = store.gfac.get(detname, None) # det.name
    if gfac is None:
       gfac = divide_protected(np.ones_like(peds), gain)
       store.gfac[detname] = gfac
       store.arr1 = np.ones_like(arr, dtype=np.int8)

    t05 = time()

    # Define bool arrays of ranges
    # faster than bit operations
    gr0 = arr <  BW1              # 490 us
    gr1 =(arr >= BW1) & (arr<BW2) # 714 us
    gr2 = arr >= BW3              # 400 us

    t06 = time()

    # Subtract pedestals
    arrf = np.array(arr & MSK, dtype=np.float32)

    t07 = time()

    arrf[gr0] -= peds[0,gr0]
    arrf[gr1] -= peds[1,gr1] #- arrf[gr1]
    arrf[gr2] -= peds[2,gr2] #- arrf[gr2]

    t08 = time()

    factor = np.select((gr0, gr1, gr2), (gfac[0,:], gfac[1,:], gfac[2,:]), default=1) # 2msec

    t09 = time()

    offset = np.select((gr0, gr1, gr2), (offs[0,:], offs[1,:], offs[2,:]), default=0)

    t10 = time()

    arrf -= offset # Apply offset correction

    t11 = time()

    if store.mask is None:
       store.mask = det.mask_total(evt, **kwa)
    mask = store.mask

    t12 = time()

    if cmp is not None:
      mode, cormax = int(cmp[1]), cmp[2]
      npixmin = cmp[3] if len(cmp)>3 else 10
      if mode>0:
        #arr1 = store.arr1
        #grhg = np.select((gr0,  gr1), (arr1, arr1), default=0)
        logger.debug(info_ndarr(gr0, 'gain group0'))
        logger.debug(info_ndarr(mask, 'mask'))
        t0_sec_cm = time()
        gmask = np.bitwise_and(gr0, mask) if mask is not None else gr0
        #sh = (nsegs, 512, 1024)
        hrows = 256 #512/2
        for s in range(arrf.shape[0]):
          if mode & 4: # in banks: (512/2,1024/16) = (256,64) pixels # 100 ms
            common_mode_2d_hsplit_nbanks(arrf[s,:hrows,:], mask=gmask[s,:hrows,:], nbanks=16, cormax=cormax, npix_min=npixmin)
            common_mode_2d_hsplit_nbanks(arrf[s,hrows:,:], mask=gmask[s,hrows:,:], nbanks=16, cormax=cormax, npix_min=npixmin)

          if mode & 1: # in rows per bank: 1024/16 = 64 pixels # 275 ms
            common_mode_rows_hsplit_nbanks(arrf[s,], mask=gmask[s,], nbanks=16, cormax=cormax, npix_min=npixmin)

          if mode & 2: # in cols per bank: 512/2 = 256 pixels  # 290 ms
            common_mode_cols(arrf[s,:hrows,:], mask=gmask[s,:hrows,:], cormax=cormax, npix_min=npixmin)
            common_mode_cols(arrf[s,hrows:,:], mask=gmask[s,hrows:,:], cormax=cormax, npix_min=npixmin)

        logger.debug('TIME: common-mode correction time = %.6f sec' % (time()-t0_sec_cm))

    t13 = time()

    resp = arrf * factor if mask is None else arrf * factor * mask # gain correction

    t14 = time()
    times = np.array((t00, t01, t02, t03, t04, t05, t06, t07, t08, t09, t10, t11, t12, t13, t14), dtype=np.float64)

    return resp, times
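The `times` array returned above can be reduced offline to the per-stage increments reported in the results table; a minimal numpy sketch (the timestamps below are illustrative, not measured data):

```python
import numpy as np

# Sketch: reduce an (nevents, 15) array of the t00..t14 timestamps returned
# by the instrumented calib_jungfrau into per-stage time increments.
def stage_increments(times_per_event):
    """times_per_event: shape (nevents, 15), seconds.
    Returns an (nevents, 14) array of tN - t(N-1) increments."""
    t = np.asarray(times_per_event, dtype=np.float64)
    return np.diff(t, axis=1)

# Illustrative timestamps for 2 events (seconds), not measured data:
t = np.array([[0.0, 0.001, 0.002, 0.004],
              [0.0, 0.002, 0.003, 0.007]])
dt = stage_increments(t)
print(dt.mean(axis=0))  # mean increment per stage
```

Histogramming each column of `dt` over events gives the mean ± uncertainty entries shown in the table below.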

Timing histograms

Single core processing

80-core processing for ranks 0, 30, and 60

Results

The table shows mean time increments and their statistical uncertainties from the associated histograms.

| t point | time increment | point description | rank 0/1 | rank 0/80 | rank 30/80 | rank 60/80 |
|---|---|---|---|---|---|---|
| 1 | t1 - t0 | det.raw | 0.8 ±0.2 ms | 4.0 ±0.6 ms | 3.2 ±0.4 ms | 3.5 ±0.8 ms |
| 2 | t2 - t1 | det.pedestals | 15 ±3 μs | 36 ±10 μs | 31 ±6 μs | 39 ±17 μs |
| 3 | t3 - t2 | det.gain, det.offset | 15 ±2 μs | 27 ±4 μs | 26 ±4 μs | 27 ±6 μs |
| 4 | t4 - t3 | cmpars | 25 ±1 μs | 50 ±7 μs | 58 ±26 μs | 71 ±33 μs |
| 5 | t5 - t4 | gfac | 2 ±0 μs | 6 ±1 μs | 7 ±1 μs | 7 ±2 μs |
| 6 | t6 - t5 | gr0,1,2 | 1.3 ±0.2 ms | 10.5 ±1.1 ms | 7.0 ±0.9 ms | 9.7 ±1.6 ms |
| 7 | t7 - t6 | make arrf | 1.76 ±0.05 ms | 9.2 ±0.9 ms | 6.3 ±0.7 ms | 9.0 ±1.5 ms |
| 8 | t8 - t7 | subtract peds | 93.7 ±3.1 ms | 191 ±11 ms | 181 ±15 ms | 259 ±26 ms |
| 9 | t9 - t8 | eval gain factor for gain ranges | 4.9 ±0.6 ms | 20.3 ±1.5 ms | 14.6 ±1.2 ms | 17.3 ±2.0 ms |
| 10 | t10 - t9 | eval offset for gain ranges | 6.2 ±0.4 ms | 18.5 ±1.3 ms | 18.4 ±1.4 ms | 19.2 ±2.1 ms |
| 11 | t11 - t10 | subtract offset | 1.0 ±0.2 ms | 6.0 ±0.7 ms | 5.3 ±0.6 ms | 6.2 ±1.2 ms |
| 12 | t12 - t11 | get mask | 3 ±2 μs | 6 ±2 μs | 6 ±2 μs | 7 ±2 μs |
| 13 | t13 - t12 | common mode (turned off) | 7 ±1 μs | 15 ±2 μs | 17 ±2 μs | 20 ±3 μs |
| 14 | t14 - t13 | apply gain factor and mask | 4.0 ±0.7 ms | 14.9 ±2.0 ms | 13.9 ±1.6 ms | 19.2 ±3.5 ms |
| 99 | t14 - t0 | per-event time inside det.calib | 109.8 ±4.2 ms | 276 ±15 ms | 247 ±13 ms | 345 ±29 ms |
| 0 | t0 - t0 (previous evt) | time between consecutive det.calib calls | 115.4 ±3.9 ms | 335 ±16 ms | 307 ±14 ms | 398 ±32 ms |

Summary

  • single-core processing is faster than the per-core time in the 80-core case by a factor of 2.5-3 FOR ALL OPERATIONS
  • in the 80-core case the time per core is consistent between cores
  • all constants are cached, and access to them is fast, at the sub-millisecond level
  • common-mode correction is turned off, as is the mask(?)
  • the most time-consuming operation is the pedestal subtraction indexed by gain ranges:
    t07 = time()

    arrf[gr0] -= peds[0,gr0]
    arrf[gr1] -= peds[1,gr1]
    arrf[gr2] -= peds[2,gr2]

    t08 = time() 
  • the bad single-core-to-multicore scaling has nothing to do with a particular algorithm; it is a common problem for any algorithm
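As a sketch of why the indexed subtraction dominates, it can be reproduced in isolation with toy shapes; the np.select variant below is shown only as an equivalent formulation for comparison, not as the production code:

```python
import numpy as np

# Toy-sized reproduction of the gain-range-indexed pedestal subtraction.
rng = np.random.default_rng(0)
arr = rng.integers(0, 100, size=(4, 8), dtype=np.uint16)          # raw-like data
peds = rng.normal(10, 1, size=(3, 4, 8)).astype(np.float32)       # 3 gain-range pedestal planes

# The three gain-range masks partition the array (thresholds are illustrative).
gr0 = arr < 30
gr1 = (arr >= 30) & (arr < 60)
gr2 = arr >= 60

# Variant 1: boolean fancy indexing, as in calib_jungfrau.
a1 = arr.astype(np.float32)
a1[gr0] -= peds[0, gr0]
a1[gr1] -= peds[1, gr1]
a1[gr2] -= peds[2, gr2]

# Variant 2: vectorized np.select over full arrays (no fancy indexing).
a2 = arr.astype(np.float32) - np.select((gr0, gr1, gr2), (peds[0], peds[1], peds[2]))

assert np.allclose(a1, a2)
```

Both variants stream the full data and pedestal planes through memory, which is consistent with the operation being memory-bandwidth-bound rather than algorithm-specific.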

2024-01-16 Test for simulated events

Time-consuming code
def time_consuming_algorithm():
    sh2d = (1024,1024)
    sh3d = (3,) + sh2d
    a = random_standard(shape=sh2d, mu=10, sigma=2, dtype=np.float64)
    b = random_standard(shape=sh3d, mu=20, sigma=3, dtype=np.float64)
    gr1 = a>=11
    gr2 = (a>9) & (a<11)
    gr3 = a<=9
    a[gr1] -= b[0, gr1]
    a[gr2] -= b[1, gr2]
    a[gr3] -= b[2, gr3]
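The same simulated test can be run without the psana environment; a self-contained sketch in which the helper random_standard is replaced by numpy's Generator.normal (assumed equivalent: normal samples with the given mu and sigma):

```python
import numpy as np

# Self-contained, runnable variant of the simulated test above; the psana
# helper random_standard is replaced by numpy's Generator.normal, assumed
# equivalent (normal samples with the given mu and sigma).
def time_consuming_algorithm(rng):
    sh2d = (1024, 1024)
    sh3d = (3,) + sh2d
    a = rng.normal(10, 2, size=sh2d)   # data array, float64
    b = rng.normal(20, 3, size=sh3d)   # 3 "gain range" planes
    gr1 = a >= 11
    gr2 = (a > 9) & (a < 11)
    gr3 = a <= 9
    a[gr1] -= b[0, gr1]                # indexed subtraction per range
    a[gr2] -= b[1, gr2]
    a[gr3] -= b[2, gr3]
    return a

a = time_consuming_algorithm(np.random.default_rng(0))
print(a.shape, a.dtype)  # (1024, 1024) float64
```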

2024-01-16 Test Summary

  • running on sdfmilan031 shows only a minor (~5%) difference in mean time between 1-core and 80-core processing

2024-01-17 Test for simulated events with minor modifications

Test variation

  • array shape is changed from (1024, 1024) to (8*512, 1024), i.e. 4M pixels
  • array generation is excluded from the time measurement; only the indexed subtraction contributes
  • random arrays are not pre-created; each core generates its own per call, since pre-creating them for all cores would require too much memory
  • run on sdfmilan047: ana-4.0.58-py3 [dubrovin@sdfmilan047:~/LCLS/con-py3]$ mpirun -n 80 python test-mpi-numpy-hist.py

Time-consuming code
def time_consuming_algorithm(a, b):
    gr1 = a>=11
    gr2 = (a>9) & (a<11)
    gr3 = a<=9
    t0_sec = time()
    a[gr1] -= b[0, gr1]
    a[gr2] -= b[1, gr2]
    a[gr3] -= b[2, gr3]
    return time() - t0_sec
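Run standalone (without MPI), the per-call times can be summarized the same way the table entries are formed; a self-contained sketch (the shapes and repeat count are illustrative, not the actual test parameters):

```python
import numpy as np
from time import time

# The timed function from above, reproduced so this sketch is self-contained.
def time_consuming_algorithm(a, b):
    gr1 = a >= 11
    gr2 = (a > 9) & (a < 11)
    gr3 = a <= 9
    t0_sec = time()
    a[gr1] -= b[0, gr1]
    a[gr2] -= b[1, gr2]
    a[gr3] -= b[2, gr3]
    return time() - t0_sec

# Repeat the measurement and report mean +/- statistical uncertainty,
# the same quantities shown in the result tables.
rng = np.random.default_rng(1)
sh2d = (2 * 512, 1024)
dts = []
for _ in range(5):
    a = rng.normal(10, 2, size=sh2d)
    b = rng.normal(20, 3, size=(3,) + sh2d)
    dts.append(time_consuming_algorithm(a, b))
dts = 1e3 * np.array(dts)  # seconds -> ms
print('mean: %.2f +/- %.2f ms' % (dts.mean(), dts.std() / np.sqrt(dts.size)))
```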

Results

Single core processing

80-core processing for ranks 0, 30, and 60

2024-01-17 Test Summary

  • if random array generation is excluded from the time measurement, the mean-time difference between 1-core and 80-core processing is larger
  • the main contribution to the time difference comes from the reserved host:
    • sdfmilan047 looks extremely "time-noisy" in multi-core processing
    • sdfmilan031 looked fairly "calm" in the previous test

2024-01-19 Test for a few hosts

  • The same simulation algorithm as before is used to compare processing times on different cores.
  • To reduce the obvious fluctuations of the time measurement between cores, the sample of cores is extended from three (0, 30, 60) to seven (0, 10, 20, 30, 40, 50, 60) of 80.
  • The mean processing time and its uncertainty, in milliseconds, for a single core (core 0/1) and a few cores of 80 are shown in the table.
  • Ratio is the 80-core average divided by the core 0/1 time; Largest ratio uses the slowest of the sampled cores.
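The last three columns follow directly from the per-core means; a minimal sketch using the sdfmilan027 row of the table:

```python
import numpy as np

# Per-core mean times in ms: core 0/1 first, then ranks 0..60 of 80
# (values taken from the sdfmilan027 row of the table below).
t_single = 192.0
t_multi = np.array([241, 430, 336, 287, 245, 412, 342], dtype=float)

avg80 = t_multi.mean()               # "80-core average" column
ratio = avg80 / t_single             # "Ratio" column
largest = t_multi.max() / t_single   # "Largest ratio" column
print('%.1f %.2f %.2f' % (avg80, ratio, largest))  # 327.6 1.71 2.24
```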
All times in ms.

| host name | core 0/1 | 0/80 | 10/80 | 20/80 | 30/80 | 40/80 | 50/80 | 60/80 | 80-core average | Ratio | Largest ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sdfmilan027 | 192 ±6 | 241 ±19 | 430 ±32 | 336 ±24 | 287 ±16 | 245 ±17 | 412 ±30 | 342 ±24 | 327.6 | 1.71 | 2.24 |
| sdfmilan049 | 210 ±1 | 280 ±19 | 369 ±28 | 245 ±45 | 282 ±17 | 290 ±24 | 367 ±28 | 242 ±37 | 296.4 | 1.41 | 1.75 |
| sdfmilan056 | 191 ±3 | 328 ±21 | 269 ±40 | 258 ±32 | 347 ±28 | 324 ±22 | 254 ±34 | 346 ±30 | 303.7 | 1.59 | 1.82 |
| sdfmilan057 | 190 ±4 | 262 ±20 | 292 ±21 | 439 ±26 | 226 ±12 | 256 ±19 | 297 ±20 | 467 ±36 | 319.9 | 1.68 | 2.46 |
| sdfmilan065 | 203 ±7 | 281 ±15 | 345 ±23 | 245 ±22 | 248 ±20 | 276 ±15 | 353 ±24 | 250 ±22 | 285.4 | 1.41 | 1.74 |
| sdfmilan105 | 190 ±2 | 261 ±36 | 277 ±20 | 282 ±15 | 276 ±14 | 263 ±48 | 273 ±19 | 284 ±26 | 273.7 | 1.44 | 1.49 |
| sdfmilan130 | 190 ±3 | 394 ±26 | 270 ±23 | 323 ±20 | 230 ±10 | 379 ±27 | 276 ±24 | 322 ±20 | 313.4 | 1.65 | 2.07 |
| sdfmilan202 | 210 ±0 | 346 ±32 | 351 ±28 | 281 ±16 | 286 ±16 | 355 ±34 | 345 ±28 | 297 ±16 | 323.0 | 1.54 | 1.69 |
| sdfmilan203 | 191 ±4 | 281 ±26 | 355 ±31 | 288 ±26 | 332 ±44 | 271 ±32 | 347 ±26 | 282 ±21 | 308.0 | 1.61 | 1.86 |
| sdfmilan225 | 204 ±7 | 409 ±29 | 296 ±23 | 230 ±10 | 230 ±10 | 398 ±26 | 289 ±24 | 232 ±9 | 297.7 | 1.46 | 2.00 |

For hosts sdfmilan213 and sdfmilan047 the cpu_num on which each rank ran was also recorded (eight sampled ranks, 0 to 70 of 80; times in ms).

sdfmilan213 (80-core average 427.5, Ratio 1.46, Largest ratio 2.09):

| rank | time (ms) | cpu_num |
|---|---|---|
| core 0/1 | 293 ±6 | 0 |
| 0/80 | 339 ±18 | 13 |
| 10/80 | 355 ±22 | 45 |
| 20/80 | 605 ±104 | 74 |
| 30/80 | 388 ±27 | 103 |
| 40/80 | 337 ±18 | 5 |
| 50/80 | 361 ±21 | 44 |
| 60/80 | 613 ±95 | 72 |
| 70/80 | 422 ±27 | 106 |

sdfmilan047 (80-core average 590.8, Ratio 2.81, Largest ratio 3.54):

| rank | time (ms) | cpu_num |
|---|---|---|
| core 0/1 | 210 ±1 | 0 |
| 0/80 | 639 ±34 | 8 |
| 10/80 | 483 ±219 | 32 |
| 20/80 | 474 ±284 | 65 |
| 30/80 | 744 ±75 | 104 |
| 40/80 | 637 ±43 | 40 |
| 50/80 | 588 ±266 | 36 |
| 60/80 | 443 ±202 | 67 |
| 70/80 | 722 ±75 | |

2024-01-19 Time difference vs array size

  • shape is decreased from (8*512, 1024) to (2*512, 1024), i.e. 1M pixels

All times in ms.

| host name | core 0/1 | 0/80 | 10/80 | 20/80 | 30/80 | 40/80 | 50/80 | 60/80 | 70/80 | 80-core average | Ratio | Largest ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sdfmilan039 | 47.6 ±0.7 | 56.0 ±2.5 | 59.2 ±4.7 | 59.9 ±6.3 | 52.8 ±3.4 | 56.3 ±2.4 | 58.6 ±5.0 | 51.2 ±9.9 | 51.3 ±3.3 | 55.7 | 1.17 | 1.26 |
| sdfmilan225 | 47.6 ±0.5 | 56.6 ±3.2 | 55.4 ±4.2 | 52.8 ±2.4 | 52.3 ±2.3 | 55.8 ±3.8 | 55.3 ±3.5 | 52.5 ±2.5 | 52.0 ±2.4 | 54.1 | 1.14 | 1.19 |


  • shape is increased from (8*512, 1024) to (32*512, 1024), i.e. 16M pixels

All times in ms.

| host name | core 0/1 | 0/80 | 10/80 | 20/80 | 30/80 | 40/80 | 50/80 | 60/80 | 70/80 | 80-core average | Ratio | Largest ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sdfmilan225 | 766 ±20 | 1251 ±91 | 1099 ±108 | 958 ±174 | 955 ±199 | 1206 ±83 | 1086 ±106 | 949 ±181 | 952 ±231 | 1057 | 1.38 | 1.63 |
| sdfmilan031 | 760 ±1 | 1141 ±81 | 1144 ±94 | 910 ±111 | 947 ±118 | 1221 ±74 | 1157 ±93 | 947 ±108 | 937 ±104 | 1051 | 1.38 | 1.61 |

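The array-size dependence is consistent with the memory-bandwidth hypothesis; as a rough check, a short sketch computing the approximate working set streamed per call for the three tested shapes (assuming one float64 data array, three float64 planes of b, and three bool masks — an assumption about what the subtraction actually touches):

```python
# Approximate working set per call of the simulated subtraction:
# one float64 data array a, a 3-plane float64 array b, and three bool masks.
def working_set_mb(shape):
    npix = shape[0] * shape[1]
    data = 8 * npix          # a, float64
    planes = 3 * 8 * npix    # b, float64, 3 gain-range planes
    masks = 3 * 1 * npix     # gr1, gr2, gr3, bool
    return (data + planes + masks) / 1e6

for shape in ((2 * 512, 1024), (8 * 512, 1024), (32 * 512, 1024)):
    print(shape, '%.0f MB' % working_set_mb(shape))
# 1M pixels -> ~37 MB, 4M -> ~147 MB, 16M -> ~587 MB per core;
# with 80 cores active, the larger shapes far exceed shared cache,
# so all cores compete for DRAM bandwidth.
```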


