Issue
2024-01-04 email from Chris
2024-01-04 1:22pm Li, Yee Ting; Kroeger, Wilko Hi Mikhail, (cc: Yee, Wilko since they may have an interest in this) When you have a chance, could you look at the scaling behavior of this script. It tries to "prefetch" the data from disk (for 100 events per core) with det.raw, and then times the behavior of det.calib: ~cpo/problems/cxix1000021_scaling/cxix1000021_scaling.py To do this get exclusive access to a batch node with this command: srun --partition milano --account lcls:prjdat21 -n 1 --time=01:00:00 --exclusive --pty /bin/bash The in another window ssh to the node that you have been allocated (the reason this is necessary is obscure: it has to do with an incompatible version of pmix used by srun that our mpi is unhappy with). In that new window run these two commands: mpirun -n 1 python cxix1000021_scaling.py mpirun -n 80 python cxix1000021_scaling.py For me the first one shows about .11s per event while the second shows about .25s per event, which is the poor scaling behavior we would like to understand. You can see logfiles in ~cpo/problems/cxix1000021_scaling in interactive_1core.log and interactive_80core.log. I think the fundamental question that I hope you could answer is: do all numpy operations in det.calib slow down on 80 cores, or is there some other cause for the poor scaling? From discussions we've had with Yee we suspect the poor scaling may be due to behavior of the hardware, but we don't know for sure. Happy to discuss this with you on zoom at any time, of course. Thanks for looking at this! chris p.s. following a suggestion by someone (Yee? Wilko?) I watched the cpu clock speeds in /proc/cpuinfo while the 80 core job was running. To my eye they seemed to vary between ~2.4MHz and ~3.0Mhz which didn't feel like quite enough
Test environment and commands
> s3dflogin > psana srun --partition milano --account lcls:prjdat21 -n 1 --time=01:00:00 --exclusive --pty /bin/bash Ex: ana-4.0.58-py3 [dubrovin@sdfiana001:~/LCLS/con-py3]$ srun --partition milano --account lcls:prjdat21 -n 1 --time=01:00:00 --exclusive --pty /bin/bash srun: job 37602834 queued and waiting for resources srun: job 37602834 has been allocated resources In .bashrc on host sdfiana001 In .bash_login ana-4.0.58-py3 [dubrovin@sdfmilan031:~/LCLS/con-py3]$ grab example from Chris: ~cpo/problems/cxix1000021_scaling/cxix1000021_scaling.py and modify it as explained below. In other window > s3dflogin ssh -Y sdfmilan031 cd LCLS/con-py3 source /sdf/group/lcls/ds/ana/sw/conda1/manage/bin/psconda.sh source /sdf/group/lcls/ds/ana/sw/conda1/manage/bin/setup_testrel mpirun -n 1 python test-cxix1000021_scaling.py mpirun -n 80 python test-cxix1000021_scaling.py
Code for time monitoring
Major histograms
Results
t point | time point math | point description | time for rank 0/1 | rank 0/80 | rank 30/80 | rank 60/80 | ||
---|---|---|---|---|---|---|---|---|
0 | 0-0 prev evt | time per evt total, between det.calib | 115.4±3.9 ms | 335 ±16 ms | 307±14 ms | 398 ±32 ms | ||
1 | 1-0 | det.raw | 0.8±0.2 ms | 4.0 ±0.6 ms | 3.2±0.4 ms | 3.5 ±0.8 ms | ||
2 | 2-1 | det.pedestals | 15±3 μs | 36 ±10 μs | 31±6 μs | 39 ±17 μs | ||
3 | ... | det.gain,offset | 15±2 μs | 27 ±4 μs | 26±4 μs | 27 ±6 μs | ||
4 | cmpars | 25±1 μs | 50 ±7 μs | 58±26 μs | 71 ±33 μs | |||
5 | gfac | 2±0 μs | 6 ±1 μs | 7±1 μs | 7 ±2 μs | |||
6 | gr0,1,2 | 1.3±0.2 ms | 10.5 ±1.1 ms | 7.0±0.9 ms | 9.7 ±1.6 ms | |||
7 | make arrf | 1.76±0.05 ms | 9.2 ±0.9 ms | 6.3±0.7 ms | 9.0 ±1.5 ms | |||
8 | subtract peds | 93.7±3.1 ms | 191 ±11 ms | 181±15 ms | 259 ±26 ms | |||
9 | eval gain factor for gain ranges | 4.9±0.6 ms | 20.3 ±1.5 ms | 14.6±1.2 ms | 17.3 ±2.0 ms | |||
10 | eval offset for gain ranges | 6.2±0.4 ms | 18.5 ±1.3 ms | 18.4±1.4 ms | 19.2 ±2.1 ms | |||
11 | subtract offset | 1.0±0.2 ms | 6.0 ±0.7 ms | 5.3±0.6 ms | 6.2 ±1.2 ms | |||
12 | get mask | 3±2 μs | 6 ±2 μs | 6±2 μs | 7 ±2 μs | |||
13 | common mode turned off | 7±1 μs | 15 ±2 μs | 17±2 μs | 20 ±3 μs | |||
14 | apply gain factor and mask | 4.0±0.7 ms | 14.9 ±2.0 ms | 13.9±1.6 ms | 19.2 ±3.5 ms | |||
99 | 14-0 | evt processing time, inside det.calib | 109.8±4.2 ms | 276 ±15 ms | 247±13 ms | 345 ±29 ms |
Conclusion
References
Overview
Content Tools